|
|
@@ -0,0 +1,395 @@
|
|
|
+<?xml version="1.0" encoding="utf-8"?>
|
|
|
+<!-- EN-Revision: 18536 -->
|
|
|
+<!-- Reviewed: no -->
|
|
|
+<sect1 id="zend.search.lucene.query-language">
|
|
|
+ <title>Langage de requêtes</title>
|
|
|
+ <para>
|
|
|
+ Java Lucene et <classname>Zend_Search_Lucene</classname> fournissent des langages de requêtes plutôt puissants.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Ces langages sont pratiquement pareils, exceptées les quelques différences ci-dessous.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ La syntaxe complète du langage de requêtes Java Lucene peut être trouvée
|
|
|
+ <ulink url="http://lucene.apache.org/java/2_3_0/queryparsersyntax.html">ici</ulink>.
|
|
|
+ </para>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.terms">
|
|
|
+ <title>Termes</title>
|
|
|
+ <para>
|
|
|
+ Une requête est décomposée en termes et opérateurs. Il y a 3 types de termes : le termes simples, les
|
|
|
+ phrases et les sous-requêtes.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Un terme simple est un simple mot, tel que "test" ou "hello".
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Une phrase est un groupe de mots inclus dans des double guillemets, tel que "hello dolly".
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Une sous-requête est une requête incluse dans des parenthèses, tel que "(hello dolly)".
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ De multiples termes peuvent être combinés ensemble avec des opérateurs booléens pour former
|
|
|
+ des requêtes complexes (voyez ci-dessous).
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.fields">
|
|
|
+ <title>Champs</title>
|
|
|
+ <para>
|
|
|
+ Lucene supporte les champs de données. Lorsque vous effectuez une recherche, vous pouvez soit
|
|
|
+ spécifier un champ, soit utiliser le champ par défaut. Le nom du champ dépend des données indexées
|
|
|
+ et le champ par défaut est défini par les paramètres courants.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ La première différence et la plus significative avec Java Lucene est que par défaut les termes
|
|
|
+ sont cherchés dans <emphasis>tous les champs</emphasis>.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Il y a deux méthodes statiques dans la classe <classname>Zend_Search_Lucene</classname> qui
|
|
|
+ permettent au développeur de configurer ces paramètres :
|
|
|
+ </para>
|
|
|
+ <programlisting language="php"><![CDATA[
|
|
|
+$defaultSearchField = Zend_Search_Lucene::getDefaultSearchField();
|
|
|
+...
|
|
|
+Zend_Search_Lucene::setDefaultSearchField('contents');
|
|
|
+]]></programlisting>
|
|
|
+ <para>
|
|
|
+ La valeur <constant>NULL</constant> indique que la recherche est effectuée dans tous les champs. C'est
|
|
|
+ le paramétrage par défaut
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Vous pouvez chercher dans des champs spécifiques en tapant le nom du champ suivi de ":", suivi du terme
|
|
|
+ que vous cherchez.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Par exemple, prenons un index Lucene contenant deux champs -title et text- avec text comme champ par défaut.
|
|
|
+ Si vous voulez trouver le document ayant pour titre "The Right Way" qui contient le text "don't go this way",
|
|
|
+ vous pouvez entrer :
|
|
|
+ </para>
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+title:"The Right Way" AND text:go
|
|
|
+]]></programlisting>
|
|
|
+ <para>
|
|
|
+ or
|
|
|
+ </para>
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+title:"Do it right" AND go
|
|
|
+]]></programlisting>
|
|
|
+ <para>
|
|
|
+ "text" étant le champ par défaut, l'indicateur de champ n'est pas requis.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Note: Le champ n'est valable que pour le terme, la phrase ou la sous-requête qu'il précède directement,
|
|
|
+ ainsi la requête
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+title:Do it right
|
|
|
+]]></programlisting>
|
|
|
+ ne trouvera que "Do" dans le champ 'title'. Elle trouvera "it" et "right" dans le champ par défaut (si le
|
|
|
+ champ par défaut est défini) ou dans tous les champs indexés (si le champ par défaut est défini à <constant>NULL</constant>).
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.wildcard">
|
|
|
+ <title>Jokers (Wildcards)</title>
|
|
|
+ <para>
|
|
|
+ Lucene supporte les recherches avec joker sur un ou plusieurs caractères au sein des termes simples (mais pas
|
|
|
+ dans les phrases).
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Pour effectuez une recherche avec joker sur un seul caractère, utilisez le symbole "?".
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Pour effectuez une recherche avec joker sur plusieurs caractères, utilisez le symbole "*".
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ La recherche avec un joker sur un seul caractère va faire correspondre le terme avec le "?" remplacé par n'importe quel autre caractère unique.
|
|
|
+ Par exemple, pour trouver "text" ou "test" vous pouvez utiliser la recherche :
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+te?t
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Multiple character wildcard searches look for 0 or more characters when matching strings against terms. For example, to search for test,
|
|
|
+ tests or tester, you can use the search:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+test*
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ You can use "?", "*" or both at any place of the term:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+*wr?t*
|
|
|
+]]></programlisting>
|
|
|
+ It searches for "write", "wrote", "written", "rewrite", "rewrote" and so on.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Starting from ZF 1.7.7 wildcard patterns need some non-wildcard prefix. Default prefix length is 3 (like in Java Lucene).
|
|
|
+ So "*", "te?t", "*wr?t*" terms will cause an exception<footnote>
|
|
|
+ <para>Please note, that it's not a <code>Zend_Search_Lucene_Search_QueryParserException</code>, but a
|
|
|
+ <code>Zend_Search_Lucene_Exception</code>. It's thrown during query rewrite (execution) operation.</para></footnote>.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ It can be altered using <code>Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength()</code> and
|
|
|
+ <code>Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength()</code> methods.
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.modifiers">
|
|
|
+ <title>Term Modifiers</title>
|
|
|
+ <para>
|
|
|
+ Lucene supports modifying query terms to provide a wide range of searching options.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ "~" modifier can be used to specify proximity search for phrases or fuzzy search for individual terms.
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.range">
|
|
|
+ <title>Range Searches</title>
|
|
|
+ <para>
|
|
|
+ Range queries allow the developer or user to match documents whose field(s) values are between the lower and upper bound specified by the range query.
|
|
|
+ Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is performed lexicographically.
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+mod_date:[20020101 TO 20030101]
|
|
|
+]]></programlisting>
|
|
|
+ This will find documents whose mod_date fields have values between 20020101 and 20030101, inclusive. Note that Range Queries are not
|
|
|
+ reserved for date fields. You could also use range queries with non-date fields:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+title:{Aida TO Carmen}
|
|
|
+]]></programlisting>
|
|
|
+ This will find all documents whose titles would be sorted between Aida and Carmen, but not including Aida and Carmen.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ If field is not specified then <classname>Zend_Search_Lucene</classname> searches for specified interval through all fields by default.
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+{Aida TO Carmen}
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.fuzzy">
|
|
|
+ <title>Fuzzy Searches</title>
|
|
|
+ <para>
|
|
|
+ <classname>Zend_Search_Lucene</classname> as well as Java Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm.
|
|
|
+ To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar
|
|
|
+ in spelling to "roam" use the fuzzy search:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+roam~
|
|
|
+]]></programlisting>
|
|
|
+ This search will find terms like foam and roams.
|
|
|
+ Additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms
|
|
|
+ with a higher similarity will be matched. For example:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+roam~0.8
|
|
|
+]]></programlisting>
|
|
|
+ The default that is used if the parameter is not given is 0.5.
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.matched-terms-limitations">
|
|
|
+ <title>Matched terms limitation</title>
|
|
|
+ <para>
|
|
|
+ Wildcard, range and fuzzy search queries may match too many terms. It may cause incredible search performance downgrade.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ So Zend_Search_Lucene sets a limit of matching terms per query (subquery). This limit can be retrieved and set using
|
|
|
+ <code>Zend_Search_Lucene::getTermsPerQueryLimit()</code>/<code>Zend_Search_Lucene::setTermsPerQueryLimit($limit)</code>
|
|
|
+ methods.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Default matched terms per query limit is 1024.
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.proximity-search">
|
|
|
+ <title>Proximity Searches</title>
|
|
|
+ <para>
|
|
|
+ Lucene supports finding words from a phrase that are within a specified word distance in a string. To do a proximity search
|
|
|
+ use the tilde, "~", symbol at the end of the phrase. For example to search for a "Zend" and
|
|
|
+ "Framework" within 10 words of each other in a document use the search:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+"Zend Framework"~10
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.boosting">
|
|
|
+ <title>Boosting a Term</title>
|
|
|
+ <para>
|
|
|
+ Java Lucene and <classname>Zend_Search_Lucene</classname> provide the relevance level of matching documents based
|
|
|
+ on the terms found. To boost the relevance of a term use the caret, "^", symbol with a boost factor (a number)
|
|
|
+ at the end of the term you are searching. The higher the boost factor, the more relevant
|
|
|
+ the term will be.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ Boosting allows you to control the relevance of a document by boosting individual terms. For example,
|
|
|
+ if you are searching for
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+PHP framework
|
|
|
+]]></programlisting>
|
|
|
+ and you want the term "PHP" to be more relevant boost it using the ^ symbol along with the
|
|
|
+ boost factor next to the term. You would type:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+PHP^4 framework
|
|
|
+]]></programlisting>
|
|
|
+ This will make documents with the term PHP appear more relevant. You can also boost phrase
|
|
|
+ terms and subqueries as in the example:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+"PHP framework"^4 "Zend Framework"
|
|
|
+]]></programlisting>
|
|
|
+ By default, the boost factor is 1. Although the boost factor must be positive,
|
|
|
+ it may be less than 1 (e.g. 0.2).
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.boolean">
|
|
|
+ <title>Boolean Operators</title>
|
|
|
+ <para>
|
|
|
+ Boolean operators allow terms to be combined through logic operators.
|
|
|
+ Lucene supports AND, "+", OR, NOT and "-" as Boolean operators.
|
|
|
+ Java Lucene requires boolean operators to be ALL CAPS. <classname>Zend_Search_Lucene</classname> does not.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ AND, OR, and NOT operators and "+", "-" defines two different styles to construct boolean queries.
|
|
|
+ Unlike Java Lucene, <classname>Zend_Search_Lucene</classname> doesn't allow these two styles to be mixed.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ If the AND/OR/NOT style is used, then an AND or OR operator must be present between all query terms.
|
|
|
+ Each term may also be preceded by NOT operator. The AND operator has higher precedence than the OR operator.
|
|
|
+ This differs from Java Lucene behavior.
|
|
|
+ </para>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.and">
|
|
|
+ <title>AND</title>
|
|
|
+ <para>
|
|
|
+ The AND operator means that all terms in the "AND group" must match some part of the searched field(s).
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for documents that contain "PHP framework" and "Zend Framework" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+"PHP framework" AND "Zend Framework"
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.or">
|
|
|
+ <title>OR</title>
|
|
|
+ <para>
|
|
|
+ The OR operator divides the query into several optional terms.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for documents that contain "PHP framework" or "Zend Framework" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+"PHP framework" OR "Zend Framework"
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.not">
|
|
|
+ <title>NOT</title>
|
|
|
+ <para>
|
|
|
+ The NOT operator excludes documents that contain the term after NOT. But an "AND group" which contains
|
|
|
+ only terms with the NOT operator gives an empty result set instead of a full set of indexed documents.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for documents that contain "PHP framework" but not "Zend Framework" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+"PHP framework" AND NOT "Zend Framework"
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.other-form">
|
|
|
+ <title>&&, ||, and ! operators</title>
|
|
|
+ <para>
|
|
|
+ &&, ||, and ! may be used instead of AND, OR, and NOT notation.
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.plus">
|
|
|
+ <title>+</title>
|
|
|
+ <para>
|
|
|
+ The "+" or required operator stipulates that the term after the "+" symbol must match the document.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for documents that must contain "Zend" and may contain "Framework" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
++Zend Framework
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.minus">
|
|
|
+ <title>-</title>
|
|
|
+ <para>
|
|
|
+ The "-" or prohibit operator excludes documents that match the term after the "-" symbol.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for documents that contain "PHP framework" but not "Zend Framework" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+"PHP framework" -"Zend Framework"
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ <sect3 id="zend.search.lucene.query-language.boolean.no-operator">
|
|
|
+ <title>No Operator</title>
|
|
|
+ <para>
|
|
|
+ If no operator is used, then the search behavior is defined by the "default boolean operator".
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ This is set to <code>OR</code> by default.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ That implies each term is optional by default. It may or may not be present within document, but documents with this term
|
|
|
+ will receive a higher score.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for documents that requires "PHP framework" and may contain "Zend Framework" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
++"PHP framework" "Zend Framework"
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ The default boolean operator may be set or retrieved with the
|
|
|
+ <classname>Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($operator)</classname> and
|
|
|
+ <classname>Zend_Search_Lucene_Search_QueryParser::getDefaultOperator()</classname> methods, respectively.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ These methods operate with the
|
|
|
+ <classname>Zend_Search_Lucene_Search_QueryParser::B_AND</classname> and
|
|
|
+ <classname>Zend_Search_Lucene_Search_QueryParser::B_OR</classname> constants.
|
|
|
+ </para>
|
|
|
+ </sect3>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.grouping">
|
|
|
+ <title>Grouping</title>
|
|
|
+ <para>
|
|
|
+ Java Lucene and <classname>Zend_Search_Lucene</classname> support using parentheses to group clauses to form sub queries. This can be
|
|
|
+ useful if you want to control the precedence of boolean logic operators for a query or mix different boolean query styles:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
++(framework OR library) +php
|
|
|
+]]></programlisting>
|
|
|
+ <classname>Zend_Search_Lucene</classname> supports subqueries nested to any level.
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.field-grouping">
|
|
|
+ <title>Field Grouping</title>
|
|
|
+ <para>
|
|
|
+ Lucene also supports using parentheses to group multiple clauses to a single field.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ To search for a title that contains both the word "return" and the phrase "pink panther" use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+title:(+return +"pink panther")
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+ <sect2 id="zend.search.lucene.query-language.escaping">
|
|
|
+ <title>Escaping Special Characters</title>
|
|
|
+ <para>
|
|
|
+ Lucene supports escaping special characters that are used in query syntax. The current list of special
|
|
|
+ characters is:
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ + - && || ! ( ) { } [ ] ^ " ~ * ? : \
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ + and - inside single terms are automatically treated as common characters.
|
|
|
+ </para>
|
|
|
+ <para>
|
|
|
+ For other instances of these characters use the \ before each special character you'd like to escape. For example to search for (1+1):2 use the query:
|
|
|
+ <programlisting language="querystring"><![CDATA[
|
|
|
+\(1\+1\)\:2
|
|
|
+]]></programlisting>
|
|
|
+ </para>
|
|
|
+ </sect2>
|
|
|
+</sect1>
|