| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576 |
- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.query-language">
- <title>Query Language</title>
- <para>
- Java Lucene and <classname>Zend_Search_Lucene</classname> provide quite powerful query
- languages.
- </para>
- <para>
- These languages are mostly the same with some minor differences, which are mentioned below.
- </para>
- <para>
- Full Java Lucene query language syntax documentation can be found
- <ulink url="http://lucene.apache.org/java/2_3_0/queryparsersyntax.html">here</ulink>.
- </para>
- <sect2 id="zend.search.lucene.query-language.terms">
- <title>Terms</title>
- <para>
- A query is broken up into terms and operators. There are three types of terms: Single
- Terms, Phrases, and Subqueries.
- </para>
- <para>
- A Single Term is a single word such as "test" or "hello".
- </para>
- <para>
- A Phrase is a group of words surrounded by double quotes such as "hello dolly".
- </para>
- <para>
- A Subquery is a query surrounded by parentheses such as "(hello dolly)".
- </para>
- <para>
- Multiple terms can be combined together with boolean operators to form complex queries
- (see below).
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.fields">
- <title>Fields</title>
- <para>
- Lucene supports fields of data. When performing a search you can either specify a field,
- or use the default field. The field names depend on indexed data and default field is
- defined by current settings.
- </para>
- <para>
- The first and most significant difference from Java Lucene is that terms are searched
- through <emphasis>all fields</emphasis> by default.
- </para>
- <para>
- There are two static methods in the <classname>Zend_Search_Lucene</classname> class
- which allow the developer to configure these settings:
- </para>
- <programlisting language="php"><![CDATA[
- $defaultSearchField = Zend_Search_Lucene::getDefaultSearchField();
- ...
- Zend_Search_Lucene::setDefaultSearchField('contents');
- ]]></programlisting>
- <para>
- The <constant>NULL</constant> value indicated that the search is performed across all
- fields. It's the default setting.
- </para>
- <para>
- You can search specific fields by typing the field name followed by a colon ":" followed
- by the term you are looking for.
- </para>
- <para>
- As an example, let's assume a Lucene index contains two fields- title and text- with
- text as the default field. If you want to find the document entitled "The Right Way"
- which contains the text "don't go this way", you can enter:
- </para>
- <programlisting language="querystring"><![CDATA[
- title:"The Right Way" AND text:go
- ]]></programlisting>
- <para>
- or
- </para>
- <programlisting language="querystring"><![CDATA[
- title:"Do it right" AND go
- ]]></programlisting>
- <para>
- Because "text" is the default field, the field indicator is not required.
- </para>
- <para>
- Note: The field is only valid for the term, phrase or subquery that it directly
- precedes, so the query
- </para>
- <programlisting language="querystring"><![CDATA[
- title:Do it right
- ]]></programlisting>
- <para>
- Will only find "Do" in the title field. It will find "it" and "right" in the default
- field (if the default field is set) or in all indexed fields (if the default field is
- set to <constant>NULL</constant>).
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.wildcard">
- <title>Wildcards</title>
- <para>
- Lucene supports single and multiple character wildcard searches within single terms (but
- not within phrase queries).
- </para>
- <para>
- To perform a single character wildcard search use the "?" symbol.
- </para>
- <para>
- To perform a multiple character wildcard search use the "*" symbol.
- </para>
- <para>
- The single character wildcard search looks for string that match the term with the "?"
- replaced by any single character. For example, to search for "text" or "test" you can
- use the search:
- </para>
- <programlisting language="querystring"><![CDATA[
- te?t
- ]]></programlisting>
- <para>
- Multiple character wildcard searches look for 0 or more characters when matching strings
- against terms. For example, to search for test, tests or tester, you can use the search:
- </para>
- <programlisting language="querystring"><![CDATA[
- test*
- ]]></programlisting>
- <para>
- You can use "?", "*" or both at any place of the term:
- </para>
- <programlisting language="querystring"><![CDATA[
- *wr?t*
- ]]></programlisting>
- <para>
- It searches for "write", "wrote", "written", "rewrite", "rewrote" and so on.
- </para>
- <para>
- Starting from ZF 1.7.7 wildcard patterns need some non-wildcard prefix. Default prefix
- length is 3 (like in Java Lucene). So "*", "te?t", "*wr?t*" terms will cause an
- exception
- <footnote>
- <para>
- Please note, that it's not a
- <classname>Zend_Search_Lucene_Search_QueryParserException</classname>, but a
- <classname>Zend_Search_Lucene_Exception</classname>. It's thrown during query
- rewrite (execution) operation.
- </para>
- </footnote>.
- </para>
- <para>
- It can be altered using
- <methodname>Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength()</methodname>
- and
- <methodname>Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength()</methodname>
- methods.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.modifiers">
- <title>Term Modifiers</title>
- <para>
- Lucene supports modifying query terms to provide a wide range of searching options.
- </para>
- <para>
- "~" modifier can be used to specify proximity search for phrases or fuzzy search for
- individual terms.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.range">
- <title>Range Searches</title>
- <para>
- Range queries allow the developer or user to match documents whose field(s) values are
- between the lower and upper bound specified by the range query. Range Queries can be
- inclusive or exclusive of the upper and lower bounds. Sorting is performed
- lexicographically.
- </para>
- <programlisting language="querystring"><![CDATA[
- mod_date:[20020101 TO 20030101]
- ]]></programlisting>
- <para>
- This will find documents whose mod_date fields have values between 20020101 and
- 20030101, inclusive. Note that Range Queries are not reserved for date fields. You could
- also use range queries with non-date fields:
- </para>
- <programlisting language="querystring"><![CDATA[
- title:{Aida TO Carmen}
- ]]></programlisting>
- <para>
- This will find all documents whose titles would be sorted between Aida and Carmen, but
- not including Aida and Carmen.
- </para>
- <para>
- Inclusive range queries are denoted by square brackets. Exclusive range queries are
- denoted by curly brackets.
- </para>
- <para>
- If field is not specified then <classname>Zend_Search_Lucene</classname> searches for
- specified interval through all fields by default.
- </para>
- <programlisting language="querystring"><![CDATA[
- {Aida TO Carmen}
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.fuzzy">
- <title>Fuzzy Searches</title>
- <para>
- <classname>Zend_Search_Lucene</classname> as well as Java Lucene supports fuzzy searches
- based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use
- the tilde, "~", symbol at the end of a Single word Term. For example to search for a
- term similar in spelling to "roam" use the fuzzy search:
- </para>
- <programlisting language="querystring"><![CDATA[
- roam~
- ]]></programlisting>
- <para>
- This search will find terms like foam and roams. Additional (optional) parameter can
- specify the required similarity. The value is between 0 and 1, with a value closer to 1
- only terms with a higher similarity will be matched. For example:
- </para>
- <programlisting language="querystring"><![CDATA[
- roam~0.8
- ]]></programlisting>
- <para>
- The default that is used if the parameter is not given is 0.5.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.matched-terms-limitations">
- <title>Matched terms limitation</title>
- <para>
- Wildcard, range and fuzzy search queries may match too many terms. It may cause
- incredible search performance downgrade.
- </para>
- <para>
- So <classname>Zend_Search_Lucene</classname> sets a limit of matching terms per query
- (subquery). This limit can be retrieved and set using
- <methodname>Zend_Search_Lucene::getTermsPerQueryLimit()</methodname> and
- <methodname>Zend_Search_Lucene::setTermsPerQueryLimit($limit)</methodname> methods.
- </para>
- <para>
- Default matched terms per query limit is 1024.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.proximity-search">
- <title>Proximity Searches</title>
- <para>
- Lucene supports finding words from a phrase that are within a specified word distance in
- a string. To do a proximity search use the tilde, "~", symbol at the end of the phrase.
- For example to search for a "Zend" and "Framework" within 10 words of each other in a
- document use the search:
- </para>
- <programlisting language="querystring"><![CDATA[
- "Zend Framework"~10
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.boosting">
- <title>Boosting a Term</title>
- <para>
- Java Lucene and <classname>Zend_Search_Lucene</classname> provide the relevance level of
- matching documents based on the terms found. To boost the relevance of a term use the
- caret, "^", symbol with a boost factor (a number) at the end of the term you are
- searching. The higher the boost factor, the more relevant the term will be.
- </para>
- <para>
- Boosting allows you to control the relevance of a document by boosting individual terms.
- For example, if you are searching for
- </para>
- <programlisting language="querystring"><![CDATA[
- PHP framework
- ]]></programlisting>
- <para>
- and you want the term "PHP" to be more relevant boost it using the ^ symbol along with
- the boost factor next to the term. You would type:
- </para>
- <programlisting language="querystring"><![CDATA[
- PHP^4 framework
- ]]></programlisting>
- <para>
- This will make documents with the term <acronym>PHP</acronym> appear more relevant. You
- can also boost phrase terms and subqueries as in the example:
- </para>
- <programlisting language="querystring"><![CDATA[
- "PHP framework"^4 "Zend Framework"
- ]]></programlisting>
- <para>
- By default, the boost factor is 1. Although the boost factor must be positive,
- it may be less than 1 (e.g. 0.2).
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.boolean">
- <title>Boolean Operators</title>
- <para>
- Boolean operators allow terms to be combined through logic operators.
- Lucene supports AND, "+", OR, NOT and "-" as Boolean operators.
- Java Lucene requires boolean operators to be ALL CAPS.
- <classname>Zend_Search_Lucene</classname> does not.
- </para>
- <para>
- AND, OR, and NOT operators and "+", "-" defines two different styles to construct
- boolean queries. Unlike Java Lucene, <classname>Zend_Search_Lucene</classname> doesn't
- allow these two styles to be mixed.
- </para>
- <para>
- If the AND/OR/NOT style is used, then an AND or OR operator must be present between all
- query terms. Each term may also be preceded by NOT operator. The AND operator has higher
- precedence than the OR operator. This differs from Java Lucene behavior.
- </para>
- <sect3 id="zend.search.lucene.query-language.boolean.and">
- <title>AND</title>
- <para>
- The AND operator means that all terms in the "AND group" must match some part of the
- searched field(s).
- </para>
- <para>
- To search for documents that contain "PHP framework" and "Zend Framework" use the
- query:
- </para>
- <programlisting language="querystring"><![CDATA[
- "PHP framework" AND "Zend Framework"
- ]]></programlisting>
- </sect3>
- <sect3 id="zend.search.lucene.query-language.boolean.or">
- <title>OR</title>
- <para>
- The OR operator divides the query into several optional terms.
- </para>
- <para>
- To search for documents that contain "PHP framework" or "Zend Framework" use the
- query:
- </para>
- <programlisting language="querystring"><![CDATA[
- "PHP framework" OR "Zend Framework"
- ]]></programlisting>
- </sect3>
- <sect3 id="zend.search.lucene.query-language.boolean.not">
- <title>NOT</title>
- <para>
- The NOT operator excludes documents that contain the term after NOT. But an "AND
- group" which contains only terms with the NOT operator gives an empty result set
- instead of a full set of indexed documents.
- </para>
- <para>
- To search for documents that contain "PHP framework" but not "Zend Framework" use
- the query:
- </para>
- <programlisting language="querystring"><![CDATA[
- "PHP framework" AND NOT "Zend Framework"
- ]]></programlisting>
- </sect3>
- <sect3 id="zend.search.lucene.query-language.boolean.other-form">
- <title>&&, ||, and ! operators</title>
- <para>
- &&, ||, and ! may be used instead of AND, OR, and NOT notation.
- </para>
- </sect3>
- <sect3 id="zend.search.lucene.query-language.boolean.plus">
- <title>+</title>
- <para>
- The "+" or required operator stipulates that the term after the "+" symbol must
- match the document.
- </para>
- <para>
- To search for documents that must contain "Zend" and may contain "Framework" use the
- query:
- </para>
- <programlisting language="querystring"><![CDATA[
- +Zend Framework
- ]]></programlisting>
- </sect3>
- <sect3 id="zend.search.lucene.query-language.boolean.minus">
- <title>-</title>
- <para>
- The "-" or prohibit operator excludes documents that match the term after the "-"
- symbol.
- </para>
- <para>
- To search for documents that contain "PHP framework" but not "Zend Framework" use
- the query:
- </para>
- <programlisting language="querystring"><![CDATA[
- "PHP framework" -"Zend Framework"
- ]]></programlisting>
- </sect3>
- <sect3 id="zend.search.lucene.query-language.boolean.no-operator">
- <title>No Operator</title>
- <para>
- If no operator is used, then the search behavior is defined by the "default boolean
- operator".
- </para>
- <para>
- This is set to 'OR' by default.
- </para>
- <para>
- That implies each term is optional by default. It may or may not be present within
- document, but documents with this term will receive a higher score.
- </para>
- <para>
- To search for documents that requires "PHP framework" and may contain "Zend
- Framework" use the query:
- </para>
- <programlisting language="querystring"><![CDATA[
- +"PHP framework" "Zend Framework"
- ]]></programlisting>
- <para>
- The default boolean operator may be set or retrieved with the
- <classname>Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($operator)</classname>
- and
- <classname>Zend_Search_Lucene_Search_QueryParser::getDefaultOperator()</classname>
- methods, respectively.
- </para>
- <para>
- These methods operate with the
- <classname>Zend_Search_Lucene_Search_QueryParser::B_AND</classname> and
- <classname>Zend_Search_Lucene_Search_QueryParser::B_OR</classname> constants.
- </para>
- </sect3>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.grouping">
- <title>Grouping</title>
- <para>
- Java Lucene and <classname>Zend_Search_Lucene</classname> support using parentheses to
- group clauses to form sub queries. This can be useful if you want to control the
- precedence of boolean logic operators for a query or mix different boolean query styles:
- </para>
- <programlisting language="querystring"><![CDATA[
- +(framework OR library) +php
- ]]></programlisting>
- <para>
- <classname>Zend_Search_Lucene</classname> supports subqueries nested to any level.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.field-grouping">
- <title>Field Grouping</title>
- <para>
- Lucene also supports using parentheses to group multiple clauses to a single field.
- </para>
- <para>
- To search for a title that contains both the word "return" and the phrase "pink panther"
- use the query:
- </para>
- <programlisting language="querystring"><![CDATA[
- title:(+return +"pink panther")
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.query-language.escaping">
- <title>Escaping Special Characters</title>
- <para>
- Lucene supports escaping special characters that are used in query syntax. The current
- list of special characters is:
- </para>
- <para>
- + - && || ! ( ) { } [ ] ^ " ~ * ? : \
- </para>
- <para>
- + and - inside single terms are automatically treated as common characters.
- </para>
- <para>
- For other instances of these characters use the \ before each special character you'd
- like to escape. For example to search for (1+1):2 use the query:
- </para>
- <programlisting language="querystring"><![CDATA[
- \(1\+1\)\:2
- ]]></programlisting>
- </sect2>
- </sect1>
|