<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.query-language">
<title>Query Language</title>
<para>
Java Lucene and <classname>Zend_Search_Lucene</classname> both provide powerful query
languages.
</para>
<para>
The two languages are mostly the same; the minor differences are noted below.
</para>
<para>
The full Java Lucene query language syntax is documented
<ulink url="http://lucene.apache.org/java/2_3_0/queryparsersyntax.html">here</ulink>.
</para>
<sect2 id="zend.search.lucene.query-language.terms">
<title>Terms</title>
<para>
A query is broken up into terms and operators. There are three types of terms: Single
Terms, Phrases, and Subqueries.
</para>
<para>
A Single Term is a single word such as "test" or "hello".
</para>
<para>
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
</para>
<para>
A Subquery is a query surrounded by parentheses such as "(hello dolly)".
</para>
<para>
Multiple terms can be combined with boolean operators to form complex queries
(see below).
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.fields">
<title>Fields</title>
<para>
Lucene supports fields of data. When performing a search you can either specify a field
or use the default field. The field names depend on the indexed data, and the default
field is defined by the current settings.
</para>
<para>
The first and most significant difference from Java Lucene is that terms are searched
through <emphasis>all fields</emphasis> by default.
</para>
<para>
There are two static methods in the <classname>Zend_Search_Lucene</classname> class
which allow the developer to configure this setting:
</para>
<programlisting language="php"><![CDATA[
// Read the current default search field (NULL means "all fields")
$defaultSearchField = Zend_Search_Lucene::getDefaultSearchField();
// ...
// Search the 'contents' field when a query does not name a field
Zend_Search_Lucene::setDefaultSearchField('contents');
]]></programlisting>
<para>
The <constant>NULL</constant> value indicates that the search is performed across all
fields. It is the default setting.
</para>
<para>
You can search specific fields by typing the field name followed by a colon ":" followed
by the term you are looking for.
</para>
<para>
As an example, let's assume a Lucene index contains two fields, title and text, with
text as the default field. If you want to find the document entitled "The Right Way"
which contains the text "don't go this way", you can enter:
</para>
<programlisting language="querystring"><![CDATA[
title:"The Right Way" AND text:go
]]></programlisting>
<para>
or
</para>
<programlisting language="querystring"><![CDATA[
title:"The Right Way" AND go
]]></programlisting>
<para>
Because "text" is the default field, the field indicator is not required.
</para>
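<para>
The field example above can be tried end to end with a short script. The following is
only a sketch: it assumes Zend Framework 1 is available on the include path, the
temporary index location is hypothetical, and the field names simply follow the example.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Build a throwaway index in a temporary directory (hypothetical location).
$indexPath = sys_get_temp_dir() . '/zsl-fields-example-' . uniqid();
$index     = Zend_Search_Lucene::create($indexPath);

// One document with the two fields used in the example.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'The Right Way'));
$doc->addField(Zend_Search_Lucene_Field::Text('text', "don't go this way"));
$index->addDocument($doc);
$index->commit();

// Make "text" the default field, as the example assumes.
Zend_Search_Lucene::setDefaultSearchField('text');

// "go" has no field prefix, so it is looked up in the default field.
$hits = $index->find('title:"The Right Way" AND go');
foreach ($hits as $hit) {
    echo $hit->title, ' (score: ', $hit->score, ")\n";
}

// Restore the library default: search through all fields.
Zend_Search_Lucene::setDefaultSearchField(null);
]]></programlisting>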
<para>
Note: The field is only valid for the term, phrase or subquery that it directly
precedes, so the query
</para>
<programlisting language="querystring"><![CDATA[
title:Do it right
]]></programlisting>
<para>
will only find "Do" in the title field. It will find "it" and "right" in the default
field (if the default field is set) or in all indexed fields (if the default field is
set to <constant>NULL</constant>).
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.wildcard">
<title>Wildcards</title>
<para>
Lucene supports single and multiple character wildcard searches within single terms (but
not within phrase queries).
</para>
<para>
To perform a single character wildcard search use the "?" symbol.
</para>
<para>
To perform a multiple character wildcard search use the "*" symbol.
</para>
<para>
The single character wildcard search looks for strings that match the term with the "?"
replaced by any single character. For example, to search for "text" or "test" you can
use the search:
</para>
<programlisting language="querystring"><![CDATA[
te?t
]]></programlisting>
<para>
Multiple character wildcard searches look for 0 or more characters when matching strings
against terms. For example, to search for test, tests or tester, you can use the search:
</para>
<programlisting language="querystring"><![CDATA[
test*
]]></programlisting>
<para>
You can use "?", "*" or both at any position within the term:
</para>
<programlisting language="querystring"><![CDATA[
*wr?t*
]]></programlisting>
<para>
This searches for "write", "wrote", "written", "rewrite", "rewrote" and so on.
</para>
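<para>
The matching rules described above can be modelled by translating a wildcard pattern
into a regular expression. This is an illustrative sketch in plain
<acronym>PHP</acronym>, not the way the library evaluates wildcard queries internally;
<methodname>wildcardToRegex()</methodname> is a hypothetical helper and it assumes
single-byte (ASCII) terms.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: converts a wildcard pattern into an anchored regex.
 * "?" becomes "exactly one character", "*" becomes "zero or more characters".
 */
function wildcardToRegex($pattern)
{
    $regex = '';
    foreach (str_split($pattern) as $char) {
        if ($char === '?') {
            $regex .= '.';
        } elseif ($char === '*') {
            $regex .= '.*';
        } else {
            // Every other character must match literally.
            $regex .= preg_quote($char, '/');
        }
    }

    return '/^' . $regex . '$/';
}

// Check a few candidate terms against the pattern from the example above.
foreach (array('write', 'rewrote', 'wart') as $term) {
    $matched = preg_match(wildcardToRegex('*wr?t*'), $term) === 1;
    echo $term, ': ', ($matched ? 'match' : 'no match'), "\n";
}
]]></programlisting>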
<para>
Starting with ZF 1.7.7, wildcard patterns require a non-wildcard prefix. The default
minimum prefix length is 3 (as in Java Lucene), so the terms "*", "te?t" and "*wr?t*"
will cause an exception
<footnote>
<para>
Note that this is not a
<classname>Zend_Search_Lucene_Search_QueryParserException</classname>, but a
<classname>Zend_Search_Lucene_Exception</classname>. It is thrown during the query
rewrite (execution) operation.
</para>
</footnote>.
</para>
<para>
The minimum prefix length can be retrieved and changed with the
<methodname>Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength()</methodname>
and
<methodname>Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength()</methodname>
methods.
</para>
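<para>
One possible way to use these methods is to relax the prefix requirement for a single
search and restore it afterwards. The sketch below assumes Zend Framework 1 is on the
include path; <methodname>findWithShortPrefix()</methodname> is a hypothetical helper,
and returning an empty result on failure is only one of several reasonable choices.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Search/Query/Wildcard.php';

/**
 * Hypothetical helper: runs a query with a temporarily reduced minimum
 * wildcard prefix length and then restores the previous setting.
 */
function findWithShortPrefix(Zend_Search_Lucene_Interface $index, $queryString, $minPrefixLength = 0)
{
    $previous = Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength();
    Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength($minPrefixLength);

    try {
        $hits = $index->find($queryString);
    } catch (Zend_Search_Lucene_Exception $e) {
        // Raised while the query is executed, e.g. if the prefix is still too short.
        $hits = array();
    }

    // Restore the stricter setting for all other queries.
    Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength($previous);

    return $hits;
}
]]></programlisting>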
</sect2>
<sect2 id="zend.search.lucene.query-language.modifiers">
<title>Term Modifiers</title>
<para>
Lucene supports modifying query terms to provide a wide range of searching options.
</para>
<para>
The "~" modifier can be used to specify a proximity search for phrases or a fuzzy search
for individual terms.
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.range">
<title>Range Searches</title>
<para>
Range queries allow the developer or user to match documents whose field values lie
between the lower and upper bounds specified by the range query. Range queries can be
inclusive or exclusive of the upper and lower bounds. Values are compared
lexicographically.
</para>
<programlisting language="querystring"><![CDATA[
mod_date:[20020101 TO 20030101]
]]></programlisting>
<para>
This will find documents whose mod_date fields have values between 20020101 and
20030101, inclusive. Note that range queries are not reserved for date fields. You could
also use range queries with non-date fields:
</para>
<programlisting language="querystring"><![CDATA[
title:{Aida TO Carmen}
]]></programlisting>
<para>
This will find all documents whose titles would be sorted between Aida and Carmen, but
not including Aida and Carmen.
</para>
<para>
Inclusive range queries are denoted by square brackets. Exclusive range queries are
denoted by curly brackets.
</para>
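<para>
Because the comparison is lexicographic, values are compared as strings rather than as
numbers or dates. The following plain <acronym>PHP</acronym> sketch illustrates the
consequence; <methodname>inLexicographicRange()</methodname> is a hypothetical helper
that assumes byte-wise string comparison and is not part of the library.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: mimics a lexicographic range check on term text.
 */
function inLexicographicRange($value, $lower, $upper, $inclusive)
{
    $fromLower = strcmp($value, $lower);
    $toUpper   = strcmp($value, $upper);

    return $inclusive
        ? ($fromLower >= 0 && $toUpper <= 0)
        : ($fromLower > 0 && $toUpper < 0);
}

// Fixed-width dates behave as expected.
var_dump(inLexicographicRange('20020615', '20020101', '20030101', true)); // bool(true)

// Exclusive bounds leave out the boundary values themselves.
var_dump(inLexicographicRange('Aida', 'Aida', 'Carmen', false));          // bool(false)

// Unpadded numbers compare as strings: "15" sorts before "9".
var_dump(inLexicographicRange('15', '9', '100', true));                   // bool(false)

// Zero-padding to a fixed width restores the numeric order.
var_dump(inLexicographicRange('015', '009', '100', true));                // bool(true)
]]></programlisting>
<para>
For this reason it is usually safer to index numeric and date values in a fixed-width,
zero-padded form if they are going to be used in range queries.
</para>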
<para>
If a field is not specified, <classname>Zend_Search_Lucene</classname> searches for
the specified interval through all fields by default.
</para>
<programlisting language="querystring"><![CDATA[
{Aida TO Carmen}
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.query-language.fuzzy">
<title>Fuzzy Searches</title>
<para>
<classname>Zend_Search_Lucene</classname>, like Java Lucene, supports fuzzy searches
based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search use
the tilde, "~", symbol at the end of a single-word term. For example, to search for a
term similar in spelling to "roam" use the fuzzy search:
</para>
<programlisting language="querystring"><![CDATA[
roam~
]]></programlisting>
<para>
This search will find terms like foam and roams. An additional (optional) parameter
specifies the required similarity. The value is between 0 and 1; the closer it is to 1,
the more similar a term must be in order to match. For example:
</para>
<programlisting language="querystring"><![CDATA[
roam~0.8
]]></programlisting>
<para>
If the parameter is not given, the default value of 0.5 is used.
</para>
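<para>
To get a feeling for how the similarity parameter relates to the edit distance, consider
the following plain <acronym>PHP</acronym> sketch. It assumes that similarity is
computed as one minus the edit distance divided by the length of the shorter term. This
formula is an assumption made for illustration; the calculation used by the library may
differ, for example in how a required common prefix is handled.
<methodname>fuzzySimilarity()</methodname> is a hypothetical helper.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: similarity in the range 0..1 derived from the
 * Levenshtein distance (assumed formula, see the text above).
 */
function fuzzySimilarity($queryTerm, $indexedTerm)
{
    $shorter = min(strlen($queryTerm), strlen($indexedTerm));
    if ($shorter === 0) {
        return 0.0;
    }

    // levenshtein() is PHP's built-in edit distance function.
    $distance = levenshtein($queryTerm, $indexedTerm);

    return max(0.0, 1.0 - $distance / $shorter);
}

// Print the assumed similarity of a few candidate terms to "roam".
foreach (array('roam', 'foam', 'roams', 'rome') as $candidate) {
    printf("%-6s %.2f\n", $candidate, fuzzySimilarity('roam', $candidate));
}
]]></programlisting>
<para>
Under this assumption "foam" and "roams" are each one edit away from "roam" and score
0.75, so they would pass the default threshold of 0.5 but not a stricter threshold of
0.8.
</para>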
</sect2>
<sect2 id="zend.search.lucene.query-language.matched-terms-limitations">
<title>Matched terms limitation</title>
<para>
Wildcard, range and fuzzy search queries may match too many terms, which can severely
degrade search performance.
</para>
<para>
For this reason <classname>Zend_Search_Lucene</classname> sets a limit on the number of
matching terms per query (subquery). This limit can be retrieved and set using the
<methodname>Zend_Search_Lucene::getTermsPerQueryLimit()</methodname> and
<methodname>Zend_Search_Lucene::setTermsPerQueryLimit($limit)</methodname> methods.
</para>
<para>
The default limit is 1024 matched terms per query.
</para>
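<para>
A minimal sketch of adjusting the limit, assuming Zend Framework 1 is on the include
path. The value 256 is an arbitrary illustration, not a recommendation.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Remember the current limit so that it can be restored later.
$previousLimit = Zend_Search_Lucene::getTermsPerQueryLimit();

// Use a tighter limit while running potentially expensive
// wildcard, range or fuzzy queries.
Zend_Search_Lucene::setTermsPerQueryLimit(256);

// ... run searches here ...

// Restore the previous limit.
Zend_Search_Lucene::setTermsPerQueryLimit($previousLimit);
]]></programlisting>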
</sect2>
<sect2 id="zend.search.lucene.query-language.proximity-search">
<title>Proximity Searches</title>
<para>
Lucene supports finding words from a phrase that are within a specified word distance of
each other. To do a proximity search use the tilde, "~", symbol at the end of the phrase.
For example, to search for "Zend" and "Framework" within 10 words of each other in a
document use the search:
</para>
<programlisting language="querystring"><![CDATA[
"Zend Framework"~10
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.query-language.boosting">
<title>Boosting a Term</title>
<para>
Java Lucene and <classname>Zend_Search_Lucene</classname> provide the relevance level of
matching documents based on the terms found. To boost the relevance of a term use the
caret, "^", symbol with a boost factor (a number) at the end of the term you are
searching. The higher the boost factor, the more relevant the term will be.
</para>
<para>
Boosting allows you to control the relevance of a document by boosting individual terms.
For example, if you are searching for
</para>
<programlisting language="querystring"><![CDATA[
PHP framework
]]></programlisting>
<para>
and you want the term "PHP" to be more relevant, boost it using the ^ symbol along with
the boost factor next to the term. You would type:
</para>
<programlisting language="querystring"><![CDATA[
PHP^4 framework
]]></programlisting>
<para>
This will make documents with the term <acronym>PHP</acronym> appear more relevant. You
can also boost phrase terms and subqueries as in the example:
</para>
<programlisting language="querystring"><![CDATA[
"PHP framework"^4 "Zend Framework"
]]></programlisting>
<para>
By default, the boost factor is 1. Although the boost factor must be positive,
it may be less than 1 (e.g. 0.2).
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.boolean">
<title>Boolean Operators</title>
<para>
Boolean operators allow terms to be combined through logic operators.
Lucene supports AND, "+", OR, NOT and "-" as Boolean operators.
Java Lucene requires boolean operators to be ALL CAPS.
<classname>Zend_Search_Lucene</classname> does not.
</para>
<para>
The AND, OR, and NOT operators and the "+" and "-" signs define two different styles of
constructing boolean queries. Unlike Java Lucene,
<classname>Zend_Search_Lucene</classname> doesn't allow these two styles to be mixed.
</para>
<para>
If the AND/OR/NOT style is used, then an AND or OR operator must be present between all
query terms. Each term may also be preceded by the NOT operator. The AND operator has
higher precedence than the OR operator, so "php OR zend AND framework" is interpreted as
"php OR (zend AND framework)". This differs from Java Lucene behavior.
</para>
<sect3 id="zend.search.lucene.query-language.boolean.and">
<title>AND</title>
<para>
The AND operator means that all terms in the "AND group" must match some part of the
searched field(s).
</para>
<para>
To search for documents that contain "PHP framework" and "Zend Framework" use the
query:
</para>
<programlisting language="querystring"><![CDATA[
"PHP framework" AND "Zend Framework"
]]></programlisting>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.or">
<title>OR</title>
<para>
The OR operator divides the query into several optional terms.
</para>
<para>
To search for documents that contain "PHP framework" or "Zend Framework" use the
query:
</para>
<programlisting language="querystring"><![CDATA[
"PHP framework" OR "Zend Framework"
]]></programlisting>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.not">
<title>NOT</title>
<para>
The NOT operator excludes documents that contain the term after NOT. However, an "AND
group" that contains only terms with the NOT operator yields an empty result set
rather than the full set of indexed documents.
</para>
<para>
To search for documents that contain "PHP framework" but not "Zend Framework" use
the query:
</para>
<programlisting language="querystring"><![CDATA[
"PHP framework" AND NOT "Zend Framework"
]]></programlisting>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.other-form">
<title>&amp;&amp;, ||, and ! operators</title>
<para>
&amp;&amp;, ||, and ! may be used instead of the AND, OR, and NOT notation.
</para>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.plus">
<title>+</title>
<para>
The "+" or required operator stipulates that the term after the "+" symbol must be
present in a matching document.
</para>
<para>
To search for documents that must contain "Zend" and may contain "Framework" use the
query:
</para>
<programlisting language="querystring"><![CDATA[
+Zend Framework
]]></programlisting>
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.minus">
<title>-</title>
<para>
The "-" or prohibit operator excludes documents that match the term after the "-"
symbol.
</para>
<para>
To search for documents that contain "PHP framework" but not "Zend Framework" use
the query:
</para>
<programlisting language="querystring"><![CDATA[
"PHP framework" -"Zend Framework"
]]></programlisting>
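<para>
The required, optional and prohibited signs also have a counterpart when a query is
built programmatically instead of being parsed from a string. The following sketch
assumes Zend Framework 1 is on the include path; the terms are arbitrary examples and
are written in lower case because no analyzer is applied to terms created this way.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Index/Term.php';
require_once 'Zend/Search/Lucene/Search/Query/MultiTerm.php';

$query = new Zend_Search_Lucene_Search_Query_MultiTerm();

// true  = required   (like "+zend")
$query->addTerm(new Zend_Search_Lucene_Index_Term('zend'), true);
// null  = optional   (like "framework")
$query->addTerm(new Zend_Search_Lucene_Index_Term('framework'), null);
// false = prohibited (like "-java")
$query->addTerm(new Zend_Search_Lucene_Index_Term('java'), false);

// The query object can then be passed to find() on an opened index:
// $hits = $index->find($query);
]]></programlisting>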
</sect3>
<sect3 id="zend.search.lucene.query-language.boolean.no-operator">
<title>No Operator</title>
<para>
If no operator is used, then the search behavior is defined by the "default boolean
operator".
</para>
<para>
This is set to 'OR' by default.
</para>
<para>
That implies each term is optional by default. It may or may not be present within a
document, but documents with this term will receive a higher score.
</para>
<para>
To search for documents that must contain "PHP framework" and may contain "Zend
Framework" use the query:
</para>
<programlisting language="querystring"><![CDATA[
+"PHP framework" "Zend Framework"
]]></programlisting>
<para>
The default boolean operator may be set or retrieved with the
<methodname>Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($operator)</methodname>
and
<methodname>Zend_Search_Lucene_Search_QueryParser::getDefaultOperator()</methodname>
methods, respectively.
</para>
<para>
These methods operate with the
<constant>Zend_Search_Lucene_Search_QueryParser::B_AND</constant> and
<constant>Zend_Search_Lucene_Search_QueryParser::B_OR</constant> constants.
</para>
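<para>
As an illustration, the following sketch switches the default operator to AND for one
query and then restores the previous value. It assumes Zend Framework 1 is on the
include path; the query string is an arbitrary example.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Search/QueryParser.php';

// Remember the current default operator (OR unless it was changed).
$previousOperator = Zend_Search_Lucene_Search_QueryParser::getDefaultOperator();

// Treat terms without an explicit operator as required.
Zend_Search_Lucene_Search_QueryParser::setDefaultOperator(
    Zend_Search_Lucene_Search_QueryParser::B_AND
);

// With B_AND as the default, both terms must be present in a matching document.
$query = Zend_Search_Lucene_Search_QueryParser::parse('PHP framework');

// Restore the previous default for subsequent queries.
Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($previousOperator);
]]></programlisting>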
</sect3>
</sect2>
<sect2 id="zend.search.lucene.query-language.grouping">
<title>Grouping</title>
<para>
Java Lucene and <classname>Zend_Search_Lucene</classname> support using parentheses to
group clauses into subqueries. This can be useful if you want to control the
precedence of boolean logic operators for a query or mix different boolean query styles:
</para>
<programlisting language="querystring"><![CDATA[
+(framework OR library) +php
]]></programlisting>
<para>
<classname>Zend_Search_Lucene</classname> supports subqueries nested to any level.
</para>
</sect2>
<sect2 id="zend.search.lucene.query-language.field-grouping">
<title>Field Grouping</title>
<para>
Lucene also supports using parentheses to group multiple clauses to a single field.
</para>
<para>
To search for a title that contains both the word "return" and the phrase "pink panther"
use the query:
</para>
<programlisting language="querystring"><![CDATA[
title:(+return +"pink panther")
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.query-language.escaping">
<title>Escaping Special Characters</title>
<para>
Lucene supports escaping special characters that are used in query syntax. The current
list of special characters is:
</para>
<para>
+ - &amp;&amp; || ! ( ) { } [ ] ^ " ~ * ? : \
</para>
<para>
+ and - inside single terms are automatically treated as ordinary characters.
</para>
<para>
For other instances of these characters, put a backslash (\) before each special
character you'd like to escape. For example, to search for (1+1):2 use the query:
</para>
<programlisting language="querystring"><![CDATA[
\(1\+1\)\:2
]]></programlisting>
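<para>
When query strings are assembled from user input, the escaping can be automated. The
following plain <acronym>PHP</acronym> sketch prefixes every character from the list
above with a backslash; <methodname>escapeQueryTerm()</methodname> is a hypothetical
helper, and escaping each "&amp;" and "|" individually is an assumption made to keep the
example simple.
</para>
<programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: puts a backslash in front of every character that
 * has a special meaning in the query syntax.
 */
function escapeQueryTerm($term)
{
    // Character class: + - ! ( ) { } [ ] ^ " ~ * ? : \ & |
    return preg_replace('/([+\-!(){}\[\]^"~*?:\\\\&|])/', '\\\\$1', $term);
}

// Prints the escaped form shown in the example above.
echo escapeQueryTerm('(1+1):2'), "\n";
]]></programlisting>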
</sect2>
</sect1>