| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459 |
- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.searching">
- <title>Searching an Index</title>
- <sect2 id="zend.search.lucene.searching.query_building">
- <title>Building Queries</title>
- <para>
- There are two ways to search the index. The first method uses
- query parser to construct a query from a string. The second is
- to programmatically create your own queries through the <classname>Zend_Search_Lucene</classname> API.
- </para>
- <para>
- Before choosing to use the provided query parser, please consider
- the following:
- <orderedlist>
- <listitem>
- <para>
- If you are programmatically creating a query string and then parsing
- it with the query parser then you should consider building
- your queries directly with the query API. Generally speaking, the query
- parser is designed for human-entered text, not for program-generated text.
- </para>
- </listitem>
- <listitem>
- <para>
- Untokenized fields are best added directly to queries and not through
- the query parser. If a field's values are generated programmatically
- by the application, then the query clauses for this field should also
- be constructed programmatically.
- An analyzer, which the query parser uses, is designed to convert
- human-entered text to terms. Program-generated values, like dates,
- keywords, etc., should be added with the query API.
- </para>
- </listitem>
- <listitem>
- <para>
- In a query form, fields that are general text should use the query parser.
- All others, such as date ranges, keywords, etc., are better added directly
- through the query API. A field with a limited set of values that can be
- specified with a pull-down menu should not be added to a query string
- that is subsequently parsed but instead should be added as a TermQuery clause.
- </para>
- </listitem>
- <listitem>
- <para>
- Boolean queries allow the programmer to logically combine two or more queries into new one.
- Thus it's the best way to add additional criteria to a search defined by
- a query string.
- </para>
- </listitem>
- </orderedlist>
- </para>
- <para>
- Both ways use the same API method to search through the index:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open('/data/my_index');
- $index->find($query);
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene::find()</classname> method determines the input type automatically and
- uses the query parser to construct an appropriate <classname>Zend_Search_Lucene_Search_Query</classname> object
- from an input of type string.
- </para>
- <para>
- It is important to note that the query parser uses the standard analyzer to tokenize separate parts of query string.
- Thus all transformations which are applied to indexed text are also applied to query strings.
- </para>
- <para>
- The standard analyzer may transform the query string to lower case for case-insensitivity, remove stop-words, and stem among other transformations.
- </para>
- <para>
- The API method doesn't transform or filter input terms in any way. It's therefore more suitable for
- computer generated or untokenized fields.
- </para>
- <sect3 id="zend.search.lucene.searching.query_building.parsing">
- <title>Query Parsing</title>
- <para>
- <classname>Zend_Search_Lucene_Search_QueryParser::parse()</classname> method may be used to parse query strings
- into query objects.
- </para>
- <para>
- This query object may be used in query construction API methods to combine user entered queries with
- programmatically generated queries.
- </para>
- <para>
- Actually, in some cases it's the only way to search for values within untokenized fields:
- <programlisting language="php"><![CDATA[
- $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
- $pathTerm = new Zend_Search_Lucene_Index_Term(
- '/data/doc_dir/' . $filename, 'path'
- );
- $pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);
- $query = new Zend_Search_Lucene_Search_Query_Boolean();
- $query->addSubquery($userQuery, true /* required */);
- $query->addSubquery($pathQuery, true /* required */);
- $hits = $index->find($query);
- ]]></programlisting>
- </para>
- <para>
- <classname>Zend_Search_Lucene_Search_QueryParser::parse()</classname> method also takes an optional encoding parameter,
- which can specify query string encoding:
- <programlisting language="php"><![CDATA[
- $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr,
- 'iso-8859-5');
- ]]></programlisting>
- </para>
- <para>
- If the encoding parameter is omitted, then current locale is used.
- </para>
- <para>
- It's also possible to specify the default query string encoding with
- <classname>Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding()</classname> method:
- <programlisting language="php"><![CDATA[
- Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-5');
- ...
- $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
- ]]></programlisting>
- </para>
- <para>
- <classname>Zend_Search_Lucene_Search_QueryParser::getDefaultEncoding()</classname> returns the current default query
- string encoding (the empty string means "current locale").
- </para>
- </sect3>
- </sect2>
- <sect2 id="zend.search.lucene.searching.results">
- <title>Search Results</title>
- <para>
- The search result is an array of <classname>Zend_Search_Lucene_Search_QueryHit</classname> objects. Each of these has
- two properties: <code>$hit->document</code> is a document number within
- the index and <code>$hit->score</code> is a score of the hit in
- a search result. The results are ordered by score (descending from highest score).
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Search_QueryHit</classname> object also exposes each field of the <classname>Zend_Search_Lucene_Document</classname> found in the search
- as a property of the hit. In the following example, a hit is returned with two fields from the corresponding document: title and author.
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open('/data/my_index');
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- echo $hit->score;
- echo $hit->title;
- echo $hit->author;
- }
- ]]></programlisting>
- <para>
- Stored fields are always returned in UTF-8 encoding.
- </para>
- <para>
- Optionally, the original <classname>Zend_Search_Lucene_Document</classname> object can be returned from the
- <classname>Zend_Search_Lucene_Search_QueryHit</classname>.
- You can retrieve stored parts of the document by using the <code>getDocument()</code>
- method of the index object and then get them by
- <code>getFieldValue()</code> method:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open('/data/my_index');
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- // return Zend_Search_Lucene_Document object for this hit
- echo $document = $hit->getDocument();
- // return a Zend_Search_Lucene_Field object
- // from the Zend_Search_Lucene_Document
- echo $document->getField('title');
- // return the string value of the Zend_Search_Lucene_Field object
- echo $document->getFieldValue('title');
- // same as getFieldValue()
- echo $document->title;
- }
- ]]></programlisting>
- <para>
- The fields available from the <classname>Zend_Search_Lucene_Document</classname> object are determined at
- the time of indexing. The document fields are either indexed, or
- index and stored, in the document by the indexing application
- (e.g. LuceneIndexCreation.jar).
- </para>
- <para>
- Note that the document identity ('path' in our example) is also stored
- in the index and must be retrieved from it.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.results-limiting">
- <title>Limiting the Result Set</title>
- <para>
- The most computationally expensive part of searching is score calculation. It may take several seconds for large result sets (tens of thousands of hits).
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> gives the possibility to limit result set size with <code>getResultSetLimit()</code> and
- <code>setResultSetLimit()</code> methods:
- <programlisting language="php"><![CDATA[
- $currentResultSetLimit = Zend_Search_Lucene::getResultSetLimit();
- Zend_Search_Lucene::setResultSetLimit($newLimit);
- ]]></programlisting>
- The default value of 0 means 'no limit'.
- </para>
- <para>
- It doesn't give the 'best N' results, but only the 'first N'<footnote><para>Returned hits are still ordered by score or by the the specified order, if given.</para></footnote>.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.results-scoring">
- <title>Results Scoring</title>
- <para>
- <classname>Zend_Search_Lucene</classname> uses the same scoring algorithms as Java Lucene.
- All hits in the search result are ordered by score by default. Hits with greater score come first, and
- documents having higher scores should match the query more precisely than documents having lower scores.
- </para>
- <para>
- Roughly speaking, search hits that contain the searched term or phrase more frequently
- will have a higher score.
- </para>
- <para>
- A hit's score can be retrieved by accessing the <code>score</code> property of the hit:
- </para>
- <programlisting language="php"><![CDATA[
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- echo $hit->id;
- echo $hit->score;
- }
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene_Search_Similarity</classname> class is used to calculate the score for each hit.
- See <link linkend="zend.search.lucene.extending.scoring">Extensibility. Scoring Algorithms</link> section for details.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.sorting">
- <title>Search Result Sorting</title>
- <para>
- By default, the search results are ordered by score. The programmer can change this behavior by setting a sort field (or a list of fields), sort type
- and sort order parameters.
- </para>
- <para>
- <code>$index->find()</code> call may take several optional parameters:
- <programlisting language="php"><![CDATA[
- $index->find($query [, $sortField [, $sortType [, $sortOrder]]]
- [, $sortField2 [, $sortType [, $sortOrder]]]
- ...);
- ]]></programlisting>
- </para>
- <para>
- A name of stored field by which to sort result should be passed as the <code>$sortField</code> parameter.
- </para>
- <para>
- <code>$sortType</code> may be omitted or take the following enumerated values:
- <code>SORT_REGULAR</code> (compare items normally- default value),
- <code>SORT_NUMERIC</code> (compare items numerically),
- <code>SORT_STRING</code> (compare items as strings).
- </para>
- <para>
- <code>$sortOrder</code> may be omitted or take the following enumerated values:
- <code>SORT_ASC</code> (sort in ascending order- default value),
- <code>SORT_DESC</code> (sort in descending order).
- </para>
- <para>
- Examples:
- <programlisting language="php"><![CDATA[
- $index->find($query, 'quantity', SORT_NUMERIC, SORT_DESC);
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- $index->find($query, 'fname', SORT_STRING, 'lname', SORT_STRING);
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- $index->find($query, 'name', SORT_STRING, 'quantity', SORT_NUMERIC, SORT_DESC);
- ]]></programlisting>
- </para>
- <para>
- Please use caution when using a non-default search order;
- the query needs to retrieve documents completely from an index, which may dramatically reduce search performance.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.highlighting">
- <title>Search Results Highlighting</title>
- <para>
- Zend_Search_Lucene provides two options for search results highlighting.
- </para>
- <para>
- The first one is utilizing <classname>Zend_Search_Lucene_Document_Html</classname> class
- (see <link linkend="zend.search.lucene.index-creation.html-documents">HTML documents section</link> for details)
- using the following methods:
- <programlisting language="php"><![CDATA[
- /**
- * Highlight text with specified color
- *
- * @param string|array $words
- * @param string $colour
- * @return string
- */
- public function highlight($words, $colour = '#66ffff');
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- /**
- * Highlight text using specified View helper or callback function.
- *
- * @param string|array $words Words to highlight. Words could be organized using the array or string.
- * @param callback $callback Callback method, used to transform (highlighting) text.
- * @param array $params Array of additionall callback parameters passed through into it
- * (first non-optional parameter is an HTML fragment for highlighting)
- * @return string
- * @throws Zend_Search_Lucene_Exception
- */
- public function highlightExtended($words, $callback, $params = array())
- ]]></programlisting>
- </para>
- <para>
- To customize highlighting behavior use <code>highlightExtended()</code> method with specified callback, which takes
- one or more parameters<footnote><para>The first is an HTML fragment for highlighting and others are callback behavior
- dependent. Returned value is a highlighted HTML fragment.</para></footnote>, or extend
- <classname>Zend_Search_Lucene_Document_Html</classname> class and redefine <code>applyColour($stringToHighlight, $colour)</code>
- method used as a default highlighting callback.
- <footnote>
- <para>
- In both cases returned HTML is automatically transformed into valid XHTML.
- </para>
- </footnote>
- </para>
- <para>
- <link linkend="zend.view.helpers">View helpers</link> also can be used as callbacks in context of view script:
- <programlisting language="php"><![CDATA[
- $doc->highlightExtended('word1 word2 word3...', array($this, 'myViewHelper'));
- ]]></programlisting>
- </para>
- <para>
- The result of highlighting operation is retrieved by <code>Zend_Search_Lucene_Document_Html->getHTML()</code> method.
- </para>
- <note>
- <para>
- Highlighting is performed in terms of current analyzer. So all forms of the word(s) recognized by analyzer
- are highlighted.
- </para>
- <para>
- E.g. if current analyzer is case insensitive and we request to highlight 'text' word, then 'text', 'Text', 'TEXT'
- and other case combinations will be highlighted.
- </para>
- <para>
- In the same way, if current analyzer supports stemming and we request to highlight 'indexed', then 'index',
- 'indexing', 'indices' and other word forms will be highlighted.
- </para>
- <para>
- On the other hand, if word is skipped by corrent analyzer (e.g. if short words filter is applied to the analyzer),
- then nothing will be highlighted.
- </para>
- </note>
- <para>
- The second option is to use
- <code>Zend_Search_Lucene_Search_Query->highlightMatches(string $inputHTML[, $defaultEncoding = 'UTF-8'[, Zend_Search_Lucene_Search_Highlighter_Interface $highlighter]])</code>
- method:
- <programlisting language="php"><![CDATA[
- $query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
- $highlightedHTML = $query->highlightMatches($sourceHTML);
- ]]></programlisting>
- </para>
- <para>
- Optional second parameter is a default HTML document encoding. It's used if encoding is not specified using
- Content-type HTTP-EQUIV meta tag.
- </para>
- <para>
- Optional third parameter is a highlighter object which has to implement
- <code>Zend_Search_Lucene_Search_Highlighter_Interface</code> interface:
- <programlisting language="php"><![CDATA[
- interface Zend_Search_Lucene_Search_Highlighter_Interface
- {
- /**
- * Set document for highlighting.
- *
- * @param Zend_Search_Lucene_Document_Html $document
- */
- public function setDocument(Zend_Search_Lucene_Document_Html $document);
- /**
- * Get document for highlighting.
- *
- * @return Zend_Search_Lucene_Document_Html $document
- */
- public function getDocument();
- /**
- * Highlight specified words (method is invoked once per subquery)
- *
- * @param string|array $words Words to highlight. They could be organized using the array or string.
- */
- public function highlight($words);
- }
- ]]></programlisting>
- Where <classname>Zend_Search_Lucene_Document_Html</classname> object is an object constructed from the source HTML
- provided to the <classname>Zend_Search_Lucene_Search_Query->highlightMatches()</classname> method.
- </para>
- <para>
- If <code>$highlighter</code> parameter is omitted, then <classname>Zend_Search_Lucene_Search_Highlighter_Default</classname>
- object is instantiated and used.
- </para>
- <para>
- Highlighter <code>highlight()</code> method is invoked once per subquery, so it has an ability to differentiate
- highlighting for them.
- </para>
- <para>
- Actually, default highlighter does this walking through predefined color table. So you can implement
- your own highlighter or just extend the default and redefine color table.
- </para>
- <para>
- <code>Zend_Search_Lucene_Search_Query->htmlFragmentHighlightMatches()</code> has similar behavior. The only differenece
- is that it takes as an input and returns HTML fragment without <>HTML>, <HEAD>, <BODY> tags.
- Nevertheless, fragment is automatically transformed to valid XHTML.
- </para>
- </sect2>
- </sect1>
|