| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563 |
- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.searching">
- <title>Searching an Index</title>
- <sect2 id="zend.search.lucene.searching.query_building">
- <title>Building Queries</title>
- <para>
- There are two ways to search the index. The first method uses
- query parser to construct a query from a string. The second is
- to programmatically create your own queries through the
- <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>.
- </para>
- <para>
- Before choosing to use the provided query parser, please consider
- the following:
- <orderedlist>
- <listitem>
- <para>
- If you are programmatically creating a query string and then parsing
- it with the query parser then you should consider building your queries
- directly with the query <acronym>API</acronym>. Generally speaking, the
- query parser is designed for human-entered text, not for program-generated
- text.
- </para>
- </listitem>
- <listitem>
- <para>
- Untokenized fields are best added directly to queries and not through
- the query parser. If a field's values are generated programmatically
- by the application, then the query clauses for this field should also
- be constructed programmatically.
- An analyzer, which the query parser uses, is designed to convert
- human-entered text to terms. Program-generated values, like dates,
- keywords, etc., should be added with the query <acronym>API</acronym>.
- </para>
- </listitem>
- <listitem>
- <para>
- In a query form, fields that are general text should use the query parser.
- All others, such as date ranges, keywords, etc., are better added directly
- through the query <acronym>API</acronym>. A field with a limited set of
- values that can be specified with a pull-down menu should not be added to a
- query string that is subsequently parsed but instead should be added as a
- TermQuery clause.
- </para>
- </listitem>
- <listitem>
- <para>
- Boolean queries allow the programmer to logically combine two or more
- queries into new one. Thus it's the best way to add additional criteria to a
- search defined by a query string.
- </para>
- </listitem>
- </orderedlist>
- </para>
- <para>
- Both ways use the same <acronym>API</acronym> method to search through the index:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open('/data/my_index');
- $index->find($query);
- ]]></programlisting>
- <para>
- You can also search multiple indexes simultaneously using MultiSearcher, which operates
- using the same <acronym>API</acronym> as searching on a single index:
- </para>
- <programlisting language="php"><![CDATA[
- $multi = new Zend_Search_Lucene_MultiSearcher();
- $multi->addIndex(Zend_Search_Lucene::open('/data/my_index_one');
- $multi->addIndex(Zend_Search_Lucene::open('/data/my_index_two');
- $multi->find($query);
- ]]></programlisting>
- <para>
- The <methodname>Zend_Search_Lucene::find()</methodname> method determines the input type
- automatically and uses the query parser to construct an appropriate
- <classname>Zend_Search_Lucene_Search_Query</classname> object from an input of type
- string.
- </para>
- <para>
- It is important to note that the query parser uses the standard analyzer to tokenize
- separate parts of query string. Thus all transformations which are applied to indexed
- text are also applied to query strings.
- </para>
- <para>
- The standard analyzer may transform the query string to lower case for
- case-insensitivity, remove stop-words, and stem among other transformations.
- </para>
- <para>
- The <acronym>API</acronym> method doesn't transform or filter input terms in any way.
- It's therefore more suitable for computer generated or untokenized fields.
- </para>
- <sect3 id="zend.search.lucene.searching.query_building.parsing">
- <title>Query Parsing</title>
- <para>
- <methodname>Zend_Search_Lucene_Search_QueryParser::parse()</methodname> method may
- be used to parse query strings into query objects.
- </para>
- <para>
- This query object may be used in query construction <acronym>API</acronym> methods
- to combine user entered queries with programmatically generated queries.
- </para>
- <para>
- Actually, in some cases it's the only way to search for values within untokenized
- fields:
- </para>
- <programlisting language="php"><![CDATA[
- $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
- $pathTerm = new Zend_Search_Lucene_Index_Term(
- '/data/doc_dir/' . $filename, 'path'
- );
- $pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);
- $query = new Zend_Search_Lucene_Search_Query_Boolean();
- $query->addSubquery($userQuery, true /* required */);
- $query->addSubquery($pathQuery, true /* required */);
- $hits = $index->find($query);
- ]]></programlisting>
- <para>
- <methodname>Zend_Search_Lucene_Search_QueryParser::parse()</methodname> method also
- takes an optional encoding parameter, which can specify query string encoding:
- </para>
- <programlisting language="php"><![CDATA[
- $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr,
- 'iso-8859-5');
- ]]></programlisting>
- <para>
- If the encoding parameter is omitted, then current locale is used.
- </para>
- <para>
- It's also possible to specify the default query string encoding with
- <methodname>Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding()</methodname>
- method:
- </para>
- <programlisting language="php"><![CDATA[
- Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-5');
- ...
- $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
- ]]></programlisting>
- <para>
- <methodname>Zend_Search_Lucene_Search_QueryParser::getDefaultEncoding()</methodname>
- returns the current default query string encoding (the empty string means "current
- locale").
- </para>
- </sect3>
- </sect2>
- <sect2 id="zend.search.lucene.searching.results">
- <title>Search Results</title>
- <para>
- The search result is an array of
- <classname>Zend_Search_Lucene_Search_QueryHit</classname> objects. Each of these has two
- properties: <code>$hit->id</code> is a document number within the index and
- <code>$hit->score</code> is a score of the hit in a search result. The results are
- ordered by score (descending from highest score).
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Search_QueryHit</classname> object also exposes each
- field of the <classname>Zend_Search_Lucene_Document</classname> found in the search as a
- property of the hit. In the following example, a hit is returned with two fields from
- the corresponding document: title and author.
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open('/data/my_index');
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- echo $hit->score;
- echo $hit->title;
- echo $hit->author;
- }
- ]]></programlisting>
- <para>
- Stored fields are always returned in UTF-8 encoding.
- </para>
- <para>
- Optionally, the original <classname>Zend_Search_Lucene_Document</classname> object can
- be returned from the <classname>Zend_Search_Lucene_Search_QueryHit</classname>.
- You can retrieve stored parts of the document by using the
- <methodname>getDocument()</methodname> method of the index object and then get them by
- <methodname>getFieldValue()</methodname> method:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open('/data/my_index');
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- // return Zend_Search_Lucene_Document object for this hit
- echo $document = $hit->getDocument();
- // return a Zend_Search_Lucene_Field object
- // from the Zend_Search_Lucene_Document
- echo $document->getField('title');
- // return the string value of the Zend_Search_Lucene_Field object
- echo $document->getFieldValue('title');
- // same as getFieldValue()
- echo $document->title;
- }
- ]]></programlisting>
- <para>
- The fields available from the <classname>Zend_Search_Lucene_Document</classname> object
- are determined at the time of indexing. The document fields are either indexed, or
- index and stored, in the document by the indexing application
- (e.g. LuceneIndexCreation.jar).
- </para>
- <para>
- Note that the document identity ('path' in our example) is also stored
- in the index and must be retrieved from it.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.results-limiting">
- <title>Limiting the Result Set</title>
- <para>
- The most computationally expensive part of searching is score calculation. It may take
- several seconds for large result sets (tens of thousands of hits).
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> gives the possibility to limit result set size
- with <methodname>getResultSetLimit()</methodname> and
- <methodname>setResultSetLimit()</methodname> methods:
- </para>
- <programlisting language="php"><![CDATA[
- $currentResultSetLimit = Zend_Search_Lucene::getResultSetLimit();
- Zend_Search_Lucene::setResultSetLimit($newLimit);
- ]]></programlisting>
- <para>
- The default value of 0 means 'no limit'.
- </para>
- <para>
- It doesn't give the 'best N' results, but only the 'first N'
- <footnote>
- <para>
- Returned hits are still ordered by score or by the specified order, if given.
- </para>
- </footnote>.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.results-scoring">
- <title>Results Scoring</title>
- <para>
- <classname>Zend_Search_Lucene</classname> uses the same scoring algorithms as Java
- Lucene. All hits in the search result are ordered by score by default. Hits with greater
- score come first, and documents having higher scores should match the query more
- precisely than documents having lower scores.
- </para>
- <para>
- Roughly speaking, search hits that contain the searched term or phrase more frequently
- will have a higher score.
- </para>
- <para>
- A hit's score can be retrieved by accessing the <code>score</code> property of the hit:
- </para>
- <programlisting language="php"><![CDATA[
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- echo $hit->id;
- echo $hit->score;
- }
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene_Search_Similarity</classname> class is used to
- calculate the score for each hit. See <link
- linkend="zend.search.lucene.extending.scoring">Extensibility. Scoring
- Algorithms</link> section for details.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.sorting">
- <title>Search Result Sorting</title>
- <para>
- By default, the search results are ordered by score. The programmer can change this
- behavior by setting a sort field (or a list of fields), sort type and sort order
- parameters.
- </para>
- <para>
- <code>$index->find()</code> call may take several optional parameters:
- </para>
- <programlisting language="php"><![CDATA[
- $index->find($query [, $sortField [, $sortType [, $sortOrder]]]
- [, $sortField2 [, $sortType [, $sortOrder]]]
- ...);
- ]]></programlisting>
- <para>
- A name of stored field by which to sort result should be passed as the
- <varname>$sortField</varname> parameter.
- </para>
- <para>
- <varname>$sortType</varname> may be omitted or take the following enumerated values:
- <constant>SORT_REGULAR</constant> (compare items normally- default value),
- <constant>SORT_NUMERIC</constant> (compare items numerically),
- <constant>SORT_STRING</constant> (compare items as strings).
- </para>
- <para>
- <varname>$sortOrder</varname> may be omitted or take the following enumerated values:
- <constant>SORT_ASC</constant> (sort in ascending order- default value),
- <constant>SORT_DESC</constant> (sort in descending order).
- </para>
- <para>
- Examples:
- </para>
- <programlisting language="php"><![CDATA[
- $index->find($query, 'quantity', SORT_NUMERIC, SORT_DESC);
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- $index->find($query, 'fname', SORT_STRING, 'lname', SORT_STRING);
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- $index->find($query, 'name', SORT_STRING, 'quantity', SORT_NUMERIC, SORT_DESC);
- ]]></programlisting>
- <para>
- Please use caution when using a non-default search order; the query needs to retrieve
- documents completely from an index, which may dramatically reduce search performance.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.searching.highlighting">
- <title>Search Results Highlighting</title>
- <para>
- <classname>Zend_Search_Lucene</classname> provides two options for search results
- highlighting.
- </para>
- <para>
- The first one is utilizing <classname>Zend_Search_Lucene_Document_Html</classname> class
- (see <link linkend="zend.search.lucene.index-creation.html-documents">HTML documents
- section</link> for details) using the following methods:
- </para>
- <programlisting language="php"><![CDATA[
- /**
- * Highlight text with specified color
- *
- * @param string|array $words
- * @param string $colour
- * @return string
- */
- public function highlight($words, $colour = '#66ffff');
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- /**
- * Highlight text using specified View helper or callback function.
- *
- * @param string|array $words Words to highlight. Words could be organized
- using the array or string.
- * @param callback $callback Callback method, used to transform
- (highlighting) text.
- * @param array $params Array of additionall callback parameters passed
- through into it (first non-optional parameter
- is an HTML fragment for highlighting)
- * @return string
- * @throws Zend_Search_Lucene_Exception
- */
- public function highlightExtended($words, $callback, $params = array())
- ]]></programlisting>
- <para>
- To customize highlighting behavior use <methodname>highlightExtended()</methodname>
- method with specified callback, which takes one or more parameters
- <footnote>
- <para>
- The first is an <acronym>HTML</acronym> fragment for highlighting and others are
- callback behavior dependent. Returned value is a highlighted
- <acronym>HTML</acronym> fragment.
- </para>
- </footnote>
- , or extend <classname>Zend_Search_Lucene_Document_Html</classname> class and redefine
- <methodname>applyColour($stringToHighlight, $colour)</methodname> method used as a
- default highlighting callback.
- <footnote>
- <para>
- In both cases returned <acronym>HTML</acronym> is automatically transformed into
- valid <acronym>XHTML</acronym>.
- </para>
- </footnote>
- </para>
- <para>
- <link linkend="zend.view.helpers">View helpers</link> also can be used as callbacks in
- context of view script:
- </para>
- <programlisting language="php"><![CDATA[
- $doc->highlightExtended('word1 word2 word3...', array($this, 'myViewHelper'));
- ]]></programlisting>
- <para>
- The result of highlighting operation is retrieved by
- <code>Zend_Search_Lucene_Document_Html->getHTML()</code> method.
- </para>
- <note>
- <para>
- Highlighting is performed in terms of current analyzer. So all forms of the word(s)
- recognized by analyzer are highlighted.
- </para>
- <para>
- E.g. if current analyzer is case insensitive and we request to highlight 'text'
- word, then 'text', 'Text', 'TEXT' and other case combinations will be highlighted.
- </para>
- <para>
- In the same way, if current analyzer supports stemming and we request to highlight
- 'indexed', then 'index', 'indexing', 'indices' and other word forms will be
- highlighted.
- </para>
- <para>
- On the other hand, if word is skipped by current analyzer (e.g. if short words
- filter is applied to the analyzer), then nothing will be highlighted.
- </para>
- </note>
- <para>
- The second option is to use
- <code>Zend_Search_Lucene_Search_Query->highlightMatches(string $inputHTML[,
- $defaultEncoding = 'UTF-8'[,
- Zend_Search_Lucene_Search_Highlighter_Interface $highlighter]])</code> method:
- </para>
- <programlisting language="php"><![CDATA[
- $query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
- $highlightedHTML = $query->highlightMatches($sourceHTML);
- ]]></programlisting>
- <para>
- Optional second parameter is a default <acronym>HTML</acronym> document encoding. It's
- used if encoding is not specified using Content-type HTTP-EQUIV meta tag.
- </para>
- <para>
- Optional third parameter is a highlighter object which has to implement
- <classname>Zend_Search_Lucene_Search_Highlighter_Interface</classname> interface:
- </para>
- <programlisting language="php"><![CDATA[
- interface Zend_Search_Lucene_Search_Highlighter_Interface
- {
- /**
- * Set document for highlighting.
- *
- * @param Zend_Search_Lucene_Document_Html $document
- */
- public function setDocument(Zend_Search_Lucene_Document_Html $document);
- /**
- * Get document for highlighting.
- *
- * @return Zend_Search_Lucene_Document_Html $document
- */
- public function getDocument();
- /**
- * Highlight specified words (method is invoked once per subquery)
- *
- * @param string|array $words Words to highlight. They could be
- organized using the array or string.
- */
- public function highlight($words);
- }
- ]]></programlisting>
- <para>
- Where <classname>Zend_Search_Lucene_Document_Html</classname> object is an object
- constructed from the source <acronym>HTML</acronym> provided to the
- <classname>Zend_Search_Lucene_Search_Query->highlightMatches()</classname> method.
- </para>
- <para>
- If <varname>$highlighter</varname> parameter is omitted, then
- <classname>Zend_Search_Lucene_Search_Highlighter_Default</classname> object is
- instantiated and used.
- </para>
- <para>
- Highlighter <methodname>highlight()</methodname> method is invoked once per subquery, so
- it has an ability to differentiate highlighting for them.
- </para>
- <para>
- Actually, default highlighter does this walking through predefined color table. So you
- can implement your own highlighter or just extend the default and redefine color table.
- </para>
- <para>
- <code>Zend_Search_Lucene_Search_Query->htmlFragmentHighlightMatches()</code> has similar
- behavior. The only difference is that it takes as an input and returns
- <acronym>HTML</acronym> fragment without <>HTML>, <HEAD>, <BODY> tags.
- Nevertheless, fragment is automatically transformed to valid <acronym>XHTML</acronym>.
- </para>
- </sect2>
- </sect1>
|