<?xml version="1.0" encoding="utf-8"?>
<!-- EN-Revision: 18536 -->
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.query-language">
<title>Query Language</title>
<para>
Java Lucene and <classname>Zend_Search_Lucene</classname> both provide fairly powerful query languages.
</para>
<para>
These languages are almost identical, apart from the few differences noted below.
</para>
<para>
The full syntax of the Java Lucene query language is documented
<ulink url="http://lucene.apache.org/java/2_3_0/queryparsersyntax.html">here</ulink>.
</para>
<sect2 id="zend.search.lucene.query-language.terms">
<title>Terms</title>
<para>
A query is broken up into terms and operators. There are three types of terms: single terms,
phrases, and subqueries.
</para>
<para>
A single term is a single word, such as "test" or "hello".
</para>
<para>
A phrase is a group of words enclosed in double quotes, such as "hello dolly".
</para>
<para>
A subquery is a query enclosed in parentheses, such as "(hello dolly)".
</para>
<para>
Multiple terms can be combined with boolean operators to form
complex queries (see below).
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.fields">
<title>Fields</title>
<para>
Lucene supports fields of data. When performing a search, you can either
specify a field or use the default field. Field names depend on the indexed data,
and the default field is defined by the current settings.
</para>
<para>
The first and most significant difference from Java Lucene is that, by default, terms
are searched for in <emphasis>all fields</emphasis>.
</para>
<para>
Two static methods in the <classname>Zend_Search_Lucene</classname> class
allow the developer to configure this setting:
</para>
<programlisting language="php"><![CDATA[
$defaultSearchField = Zend_Search_Lucene::getDefaultSearchField();
...
Zend_Search_Lucene::setDefaultSearchField('contents');
]]></programlisting>
<para>
A <constant>NULL</constant> value means that the search is performed across all fields. This is
the default setting.
</para>
<para>
You can search specific fields by typing the field name followed by a colon (":"), followed by the term
you are looking for.
</para>
<para>
For example, assume a Lucene index contains two fields, title and text, with text as the default field.
To find the document entitled "The Right Way" that contains the text "don't go this way",
you can enter:
</para>
<programlisting language="querystring"><![CDATA[
title:"The Right Way" AND text:go
]]></programlisting>
<para>
or
</para>
<programlisting language="querystring"><![CDATA[
title:"The Right Way" AND go
]]></programlisting>
<para>
Because "text" is the default field, the field indicator is not required.
</para>
<para>
Note: The field is only valid for the term, phrase or subquery that it directly precedes,
so the query
<programlisting language="querystring"><![CDATA[
title:Do it right
]]></programlisting>
will only find "Do" in the title field. It will find "it" and "right" in the default field (if a
default field is set) or in all indexed fields (if the default field is set to <constant>NULL</constant>).
</para>
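<para>
The scoping rule in the note above can be illustrated with a minimal sketch. The
<code>assignFields()</code> helper is hypothetical and is not part of <classname>Zend_Search_Lucene</classname>;
it only splits on whitespace and ignores phrases, subqueries and operators.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: pairs each whitespace-separated term with the field
 * it would be searched in. Phrases, subqueries and operators are ignored.
 */
function assignFields($query, $defaultField = null)
{
    $result = array();
    foreach (preg_split('/\s+/', trim($query)) as $token) {
        $parts = explode(':', $token, 2);
        if (count($parts) === 2) {
            // An explicit "field:term" prefix applies to this term only.
            $result[] = array('field' => $parts[0], 'term' => $parts[1]);
        } else {
            // NULL stands for "all indexed fields".
            $result[] = array('field' => $defaultField, 'term' => $token);
        }
    }
    return $result;
}

$scoped = assignFields('title:Do it right', 'text');
// $scoped[0] is title/Do; $scoped[1] and $scoped[2] fall back to "text".
]]></programlisting>
<para>
Passing <constant>NULL</constant> as the second argument models the default setting, where unprefixed terms
are searched in all fields.
</para>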
</sect2>
<sect2 id="zend.search.lucene.query-language.wildcard">
<title>Wildcards</title>
<para>
Lucene supports single and multiple character wildcard searches within single terms (but not
within phrases).
</para>
<para>
To perform a single character wildcard search, use the "?" symbol.
</para>
<para>
To perform a multiple character wildcard search, use the "*" symbol.
</para>
<para>
A single character wildcard search matches terms in which the "?" is replaced by any single character.
For example, to search for "text" or "test" you can use the search:
<programlisting language="querystring"><![CDATA[
te?t
]]></programlisting>
</para>
<para>
A multiple character wildcard search matches zero or more characters. For example, to search for test,
tests or tester, you can use the search:
<programlisting language="querystring"><![CDATA[
test*
]]></programlisting>
</para>
<para>
You can use "?", "*" or both anywhere in the term:
<programlisting language="querystring"><![CDATA[
*wr?t*
]]></programlisting>
This searches for "write", "wrote", "written", "rewrite", "rewrote" and so on.
</para>
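<para>
As a rough illustration of these matching rules, the sketch below translates a wildcard pattern into a
regular expression and filters a list of terms. The <code>wildcardToRegex()</code> helper is hypothetical;
it models the semantics described above and is not how the library evaluates wildcard queries.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: turns a wildcard pattern into an anchored regular
 * expression ("?" becomes ".", "*" becomes ".*").
 */
function wildcardToRegex($pattern)
{
    $quoted = preg_quote($pattern, '/');
    // preg_quote() escapes the wildcards as "\?" and "\*"; map them back.
    $regex = str_replace(array('\?', '\*'), array('.', '.*'), $quoted);
    return '/^' . $regex . '$/u';
}

$terms   = array('write', 'wrote', 'written', 'rewrite', 'rewrote', 'test');
$matched = preg_grep(wildcardToRegex('*wr?t*'), $terms);
// $matched holds every term except "test".
]]></programlisting>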
<para>
Starting from ZF 1.7.7, wildcard patterns require a non-wildcard prefix. The default minimum prefix length is 3 (as in Java Lucene).
So the terms "*", "te?t" and "*wr?t*" will cause an exception<footnote>
<para>Please note that it is not a <code>Zend_Search_Lucene_Search_QueryParserException</code>, but a
<code>Zend_Search_Lucene_Exception</code>. It is thrown during the query rewrite (execution) operation.</para></footnote>.
</para>
<para>
The minimum prefix length can be retrieved and changed using the <code>Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength()</code> and
<code>Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength()</code> methods.
</para>
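<para>
To see which patterns satisfy the prefix requirement, the sketch below measures the literal prefix of a
pattern. The <code>wildcardPrefixLength()</code> helper is hypothetical, and it counts bytes, which is an
assumption that only holds for single-byte prefixes.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: length (in bytes) of the literal prefix that
 * precedes the first "?" or "*" in a wildcard pattern.
 */
function wildcardPrefixLength($pattern)
{
    return strcspn($pattern, '?*');
}

$minPrefixLength = 3; // the default stated above
foreach (array('*', 'te?t', '*wr?t*', 'test*') as $pattern) {
    $ok = wildcardPrefixLength($pattern) >= $minPrefixLength;
    echo $pattern, ' => ', ($ok ? 'accepted' : 'rejected'), "\n";
}
]]></programlisting>
<para>
Of the four patterns, only "test*" has a prefix of at least three characters.
</para>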
</sect2>
<sect2 id="zend.search.lucene.query-language.modifiers">
<title>Term Modifiers</title>
<para>
Lucene supports modifying query terms to provide a wide range of search options.
</para>
<para>
The "~" modifier can be used to specify a proximity search for phrases or a fuzzy search for individual terms.
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.range">
<title>Range Searches</title>
<para>
Range queries allow the developer or user to match documents whose field values lie between the lower and upper bounds specified by the range query.
Range queries can be inclusive or exclusive of the upper and lower bounds. Values are compared lexicographically.
<programlisting language="querystring"><![CDATA[
mod_date:[20020101 TO 20030101]
]]></programlisting>
This will find documents whose mod_date fields have values between 20020101 and 20030101, inclusive. Note that range queries are not
reserved for date fields. You can also use range queries with non-date fields:
<programlisting language="querystring"><![CDATA[
title:{Aida TO Carmen}
]]></programlisting>
This will find all documents whose titles sort between Aida and Carmen, excluding Aida and Carmen themselves.
</para>
<para>
Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
</para>
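<para>
The effect of lexicographic comparison and of inclusive versus exclusive bounds can be sketched as follows.
The <code>inRange()</code> helper is hypothetical and uses byte-wise <code>strcmp()</code> as a stand-in for
the comparison performed on indexed terms.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: checks whether $value lies between $lower and $upper
 * using byte-wise string comparison.
 */
function inRange($value, $lower, $upper, $inclusive)
{
    $low  = strcmp($value, $lower);
    $high = strcmp($value, $upper);
    return $inclusive ? ($low >= 0 && $high <= 0) : ($low > 0 && $high < 0);
}

var_dump(inRange('20020615', '20020101', '20030101', true)); // inside [ ... ]
var_dump(inRange('Aida', 'Aida', 'Carmen', false));          // a bound of { ... }
var_dump(inRange('100', '20', '30', true));                  // compared as strings
]]></programlisting>
<para>
The last call returns false because "100" sorts before "20" as a string. This is why numeric and date values
are usually stored zero-padded to a fixed width, as in the mod_date example.
</para>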
<para>
If a field is not specified, <classname>Zend_Search_Lucene</classname> searches for the specified interval in all fields by default.
<programlisting language="querystring"><![CDATA[
{Aida TO Carmen}
]]></programlisting>
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.fuzzy">
<title>Fuzzy Searches</title>
<para>
<classname>Zend_Search_Lucene</classname>, like Java Lucene, supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm.
To perform a fuzzy search, use the tilde symbol, "~", at the end of a single-word term. For example, to search for a term similar
in spelling to "roam", use the fuzzy search:
<programlisting language="querystring"><![CDATA[
roam~
]]></programlisting>
This search will find terms like foam and roams.
An additional (optional) parameter specifies the required similarity. The value is between 0 and 1; with a value closer to 1, only terms
with a higher similarity will be matched. For example:
<programlisting language="querystring"><![CDATA[
roam~0.8
]]></programlisting>
If the parameter is not given, the default of 0.5 is used.
</para>
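<para>
The sketch below shows how an edit distance can be turned into a similarity between 0 and 1. It relies on
PHP's built-in <code>levenshtein()</code> function; the <code>similarity()</code> helper and its
normalisation (distance divided by the length of the shorter string) are assumptions made for illustration,
and the library may compute similarity differently.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical similarity measure: 1 minus the edit distance divided by
 * the length of the shorter string, clamped at 0.
 */
function similarity($a, $b)
{
    $shorter = min(strlen($a), strlen($b));
    if ($shorter === 0) {
        return ($a === $b) ? 1.0 : 0.0;
    }
    return max(0.0, 1.0 - levenshtein($a, $b) / $shorter);
}

foreach (array('foam', 'roams', 'room', 'zend') as $candidate) {
    printf(
        "%-6s distance=%d similarity=%.2f\n",
        $candidate,
        levenshtein('roam', $candidate),
        similarity('roam', $candidate)
    );
}
]]></programlisting>
<para>
Under this assumed measure, "foam" and "roams" are each one edit away from "roam" and score 0.75, so they
would pass a threshold of 0.5 but not one of 0.8.
</para>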
</sect2>
<sect2 id="zend.search.lucene.query-language.matched-terms-limitations">
<title>Matched Terms Limitation</title>
<para>
Wildcard, range and fuzzy search queries may match too many terms, which can severely degrade search performance.
</para>
<para>
For this reason, <classname>Zend_Search_Lucene</classname> limits the number of matching terms per query (subquery). This limit can be retrieved and set using the
<code>Zend_Search_Lucene::getTermsPerQueryLimit()</code>/<code>Zend_Search_Lucene::setTermsPerQueryLimit($limit)</code>
methods.
</para>
<para>
The default limit on matched terms per query is 1024.
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.proximity-search">
<title>Proximity Searches</title>
<para>
Lucene supports finding words from a phrase that are within a specified word distance of each other. To perform a proximity search,
use the tilde symbol, "~", at the end of the phrase. For example, to search for "Zend" and
"Framework" within 10 words of each other in a document, use the search:
<programlisting language="querystring"><![CDATA[
"Zend Framework"~10
]]></programlisting>
</para>
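<para>
The idea of a word distance can be approximated with the sketch below. The <code>withinDistance()</code>
helper is hypothetical: it splits the text on non-word characters and compares word positions, which is a
simplification of how positions are actually recorded and matched in an index.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: true if $a and $b both occur in $text no more than
 * $maxDistance word positions apart. Tokenisation is a plain word split.
 */
function withinDistance($text, $a, $b, $maxDistance)
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $posA  = array_keys($words, strtolower($a), true);
    $posB  = array_keys($words, strtolower($b), true);
    foreach ($posA as $i) {
        foreach ($posB as $j) {
            if (abs($i - $j) <= $maxDistance) {
                return true;
            }
        }
    }
    return false;
}

$text = 'Zend is the company behind a popular PHP framework';
var_dump(withinDistance($text, 'Zend', 'Framework', 10));
var_dump(withinDistance($text, 'Zend', 'Framework', 3));
]]></programlisting>
<para>
In the sample sentence the two words are eight positions apart, so the first call succeeds and the second
does not.
</para>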
</sect2>
<sect2 id="zend.search.lucene.query-language.boosting">
<title>Boosting a Term</title>
<para>
Java Lucene and <classname>Zend_Search_Lucene</classname> rank matching documents by relevance, based
on the terms found. To boost the relevance of a term, use the caret symbol, "^", with a boost factor (a number)
at the end of the term you are searching for. The higher the boost factor, the more relevant
the term will be.
</para>
<para>
Boosting allows you to control the relevance of a document by boosting individual terms. For example,
if you are searching for
<programlisting language="querystring"><![CDATA[
PHP framework
]]></programlisting>
and you want the term "PHP" to be more relevant, boost it by adding the ^ symbol along with the
boost factor next to the term. You would type:
<programlisting language="querystring"><![CDATA[
PHP^4 framework
]]></programlisting>
This will make documents containing the term PHP appear more relevant. You can also boost phrase
terms and subqueries, as in the example:
<programlisting language="querystring"><![CDATA[
"PHP framework"^4 "Zend Framework"
]]></programlisting>
By default, the boost factor is 1. Although the boost factor must be positive,
it may be less than 1 (e.g. 0.2).
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.boolean">
<title>Boolean Operators</title>
<para>
Boolean operators allow terms to be combined through logic operators.
Lucene supports AND, "+", OR, NOT and "-" as boolean operators.
Java Lucene requires boolean operators to be ALL CAPS. <classname>Zend_Search_Lucene</classname> does not.
</para>
<para>
The AND, OR, and NOT operators and the "+" and "-" operators define two different styles of constructing boolean queries.
Unlike Java Lucene, <classname>Zend_Search_Lucene</classname> doesn't allow these two styles to be mixed.
</para>
<para>
If the AND/OR/NOT style is used, then an AND or OR operator must be present between all query terms.
Each term may also be preceded by the NOT operator. The AND operator has higher precedence than the OR operator.
This differs from Java Lucene behavior.
</para>
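<para>
The precedence rule can be pictured with a small sketch that groups a flat query into OR-separated sets of
AND-ed terms. The <code>groupByPrecedence()</code> helper is hypothetical and deliberately naive: it handles
only single-word terms and ignores phrases, NOT and parentheses.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: splits a flat AND/OR query into OR-separated groups
 * of AND-ed terms. Phrases, NOT and parentheses are not handled.
 */
function groupByPrecedence($query)
{
    $groups = array();
    // OR binds more loosely, so split on it first (case-insensitively).
    foreach (preg_split('/\s+OR\s+/i', trim($query)) as $orPart) {
        $groups[] = preg_split('/\s+AND\s+/i', $orPart);
    }
    return $groups;
}

$groups = groupByPrecedence('php OR zend AND framework');
// Two groups: (php) OR (zend AND framework).
]]></programlisting>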
<sect3 id="zend.search.lucene.query-language.boolean.and">
<title>AND</title>
<para>
The AND operator means that all terms in the "AND group" must match some part of the searched field(s).
</para>
<para>
To search for documents that contain "PHP framework" and "Zend Framework" use the query:
<programlisting language="querystring"><![CDATA[
"PHP framework" AND "Zend Framework"
]]></programlisting>
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.or">
<title>OR</title>
<para>
The OR operator divides the query into several optional terms.
</para>
<para>
To search for documents that contain "PHP framework" or "Zend Framework" use the query:
<programlisting language="querystring"><![CDATA[
"PHP framework" OR "Zend Framework"
]]></programlisting>
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.not">
<title>NOT</title>
<para>
The NOT operator excludes documents that contain the term after NOT. However, an "AND group" that contains
only terms with the NOT operator yields an empty result set rather than the full set of indexed documents.
</para>
<para>
To search for documents that contain "PHP framework" but not "Zend Framework" use the query:
<programlisting language="querystring"><![CDATA[
"PHP framework" AND NOT "Zend Framework"
]]></programlisting>
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.other-form">
<title>&amp;&amp;, ||, and ! Operators</title>
<para>
&amp;&amp;, ||, and ! may be used instead of the AND, OR, and NOT notation.
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.plus">
<title>+</title>
<para>
The "+" (required) operator stipulates that the term after the "+" symbol must be present in every matching document.
</para>
<para>
To search for documents that must contain "Zend" and may contain "Framework" use the query:
<programlisting language="querystring"><![CDATA[
+Zend Framework
]]></programlisting>
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.minus">
<title>-</title>
<para>
The "-" (prohibit) operator excludes documents that match the term after the "-" symbol.
</para>
<para>
To search for documents that contain "PHP framework" but not "Zend Framework" use the query:
<programlisting language="querystring"><![CDATA[
"PHP framework" -"Zend Framework"
]]></programlisting>
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.no-operator">
<title>No Operator</title>
<para>
If no operator is used, then the search behavior is defined by the "default boolean operator".
</para>
<para>
This is set to <code>OR</code> by default.
</para>
<para>
This implies that each term is optional by default. It may or may not be present within a document, but documents that contain it
will receive a higher score.
</para>
<para>
To search for documents that require "PHP framework" and may contain "Zend Framework" use the query:
<programlisting language="querystring"><![CDATA[
+"PHP framework" "Zend Framework"
]]></programlisting>
</para>
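<para>
The interplay of required, prohibited and optional terms can be sketched as follows. The
<code>matchDocument()</code> helper is hypothetical; it treats a document as a plain list of terms and uses
the number of matched optional terms as a crude stand-in for the real relevance score.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical matcher for the "+"/"-" style: returns false when the
 * document is rejected, otherwise the number of optional terms it contains.
 */
function matchDocument(array $docTerms, array $required, array $prohibited, array $optional)
{
    if (array_diff($required, $docTerms)) {
        return false; // a required ("+") term is missing
    }
    if (array_intersect($prohibited, $docTerms)) {
        return false; // a prohibited ("-") term is present
    }
    if (!$required && !array_intersect($optional, $docTerms)) {
        return false; // nothing matched at all
    }
    return count(array_intersect($optional, $docTerms));
}

// Query: +zend framework
$a = matchDocument(array('zend', 'framework'), array('zend'), array(), array('framework'));
$b = matchDocument(array('zend', 'studio'), array('zend'), array(), array('framework'));
$c = matchDocument(array('php', 'framework'), array('zend'), array(), array('framework'));
]]></programlisting>
<para>
Here the first two documents match, with the first ranked higher because it also contains the optional
term, while the third is rejected because the required term is missing.
</para>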
<para>
The default boolean operator may be set or retrieved with the
<classname>Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($operator)</classname> and
<classname>Zend_Search_Lucene_Search_QueryParser::getDefaultOperator()</classname> methods, respectively.
</para>
<para>
These methods operate with the
<classname>Zend_Search_Lucene_Search_QueryParser::B_AND</classname> and
<classname>Zend_Search_Lucene_Search_QueryParser::B_OR</classname> constants.
</para>
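<para>
A short configuration fragment using the methods and constants named above might look like this. It assumes
that Zend Framework 1 is available on the include path; the query string is only an example.
</para>
<programlisting language="php"><![CDATA[
// Assumes Zend Framework 1 is on the include path.
require_once 'Zend/Search/Lucene/Search/QueryParser.php';

// Make every term required unless the query says otherwise.
Zend_Search_Lucene_Search_QueryParser::setDefaultOperator(
    Zend_Search_Lucene_Search_QueryParser::B_AND
);

// With B_AND as the default, both terms of this query are required.
$query = Zend_Search_Lucene_Search_QueryParser::parse('php framework');

// Restore the default behaviour (each term optional).
Zend_Search_Lucene_Search_QueryParser::setDefaultOperator(
    Zend_Search_Lucene_Search_QueryParser::B_OR
);
]]></programlisting>
<para>
Because the setting is static, it affects every query parsed afterwards, so restoring the previous value is
a sensible precaution in shared code.
</para>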
</sect3>
</sect2>
<sect2 id="zend.search.lucene.query-language.grouping">
<title>Grouping</title>
<para>
Java Lucene and <classname>Zend_Search_Lucene</classname> support using parentheses to group clauses into subqueries. This can be
useful if you want to control the precedence of boolean logic operators in a query or to mix different boolean query styles:
<programlisting language="querystring"><![CDATA[
+(framework OR library) +php
]]></programlisting>
<classname>Zend_Search_Lucene</classname> supports subqueries nested to any level.
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.field-grouping">
<title>Field Grouping</title>
<para>
Lucene also supports using parentheses to group multiple clauses to a single field.
</para>
<para>
To search for a title that contains both the word "return" and the phrase "pink panther" use the query:
<programlisting language="querystring"><![CDATA[
title:(+return +"pink panther")
]]></programlisting>
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.escaping">
<title>Escaping Special Characters</title>
<para>
Lucene supports escaping special characters that are used in query syntax. The current list of special
characters is:
</para>
<para>
+ - &amp;&amp; || ! ( ) { } [ ] ^ " ~ * ? : \
</para>
<para>
+ and - inside single terms are automatically treated as ordinary characters.
</para>
<para>
For other instances of these characters, put a \ before each special character you want to escape. For example, to search for (1+1):2 use the query:
<programlisting language="querystring"><![CDATA[
\(1\+1\)\:2
]]></programlisting>
</para>
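<para>
When query text comes from user input, the escaping can be automated. The sketch below is one possible
approach; the <code>escapeQueryTerm()</code> helper is hypothetical and simply puts a backslash before every
character from the list above, escaping &amp; and | individually.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: prefixes every query-syntax character with "\".
 * "&" and "|" are escaped one by one, which also covers "&&" and "||".
 */
function escapeQueryTerm($text)
{
    return preg_replace('/[+\-&|!(){}\[\]^"~*?:\\\\]/', '\\\\$0', $text);
}

echo escapeQueryTerm('(1+1):2'), "\n";
]]></programlisting>
<para>
For the input (1+1):2 this yields the same escaped query as the example above.
</para>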
</sect2>
</sect1>