<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.best-practice">
    <title>Best Practices</title>

    <sect2 id="zend.search.lucene.best-practice.field-names">
        <title>Field names</title>

        <para>
            <classname>Zend_Search_Lucene</classname> places no restrictions on field names.
        </para>

        <para>
            Nevertheless, it's a good idea to avoid the names '<emphasis>id</emphasis>' and
            '<emphasis>score</emphasis>', because they are ambiguous with the
            <classname>QueryHit</classname> properties of the same names.
        </para>

        <para>
            The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <property>id</property>
            and <property>score</property> properties always refer to the internal Lucene document
            id and the hit <link linkend="zend.search.lucene.searching.results-scoring">score</link>.
            If the indexed document has stored fields with the same names, you have to use the
            <methodname>getDocument()</methodname> method to access them:
        </para>
        <programlisting language="php"><![CDATA[
$hits = $index->find($query);

foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;

    // Get 'contents' document field
    $contents = $hit->contents;

    // Get internal Lucene document id
    $id = $hit->id;

    // Get query hit score
    $score = $hit->score;

    // Get 'id' document field
    $docId = $hit->getDocument()->id;

    // Get 'score' document field
    $docScore = $hit->getDocument()->score;

    // Another way to get 'title' document field
    $title = $hit->getDocument()->title;
}
]]></programlisting>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.indexing-performance">
        <title>Indexing performance</title>

        <para>
            Indexing performance is a compromise between the resources used, indexing time and
            index quality.
        </para>

        <para>
            Index quality is completely determined by the number of index segments.
        </para>

        <para>
            Each index segment is an entirely independent portion of data, so indexes containing
            more segments need more memory and time for searching.
        </para>

        <para>
            Index optimization is the process of merging several segments into a new one. A fully
            optimized index contains only one segment.
        </para>

        <para>
            Full index optimization may be performed with the <methodname>optimize()</methodname>
            method:
        </para>
        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();
]]></programlisting>
        <para>
            Index optimization works with data streams and doesn't take a lot of memory, but it
            does require processor resources and time.
        </para>

        <para>
            Lucene index segments are not updatable by their nature (an update operation requires
            the segment file to be completely rewritten). So adding new documents to an index
            always generates a new segment. This, in turn, decreases index quality.
        </para>

        <para>
            An index auto-optimization process is performed after each segment generation and
            consists of merging partial segments.
        </para>

        <para>
            There are three options to control the behavior of auto-optimization (see the <link
            linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
            section):

            <itemizedlist>
                <listitem>
                    <para>
                        <emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
                        buffered in memory before a new segment is generated and written to the hard
                        drive.
                    </para>
                </listitem>

                <listitem>
                    <para>
                        <emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
                        by the auto-optimization process into a new segment.
                    </para>
                </listitem>

                <listitem>
                    <para>
                        <emphasis>MergeFactor</emphasis> determines how often auto-optimization is
                        performed.
                    </para>
                </listitem>
            </itemizedlist>

            <note>
                <para>
                    All these options are <classname>Zend_Search_Lucene</classname> object
                    properties, not index properties. They affect only the behavior of the current
                    <classname>Zend_Search_Lucene</classname> object and may vary between
                    scripts.
                </para>
            </note>
        </para>
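        <para>
            The three options described above can be set through the corresponding setter
            methods of the index object. The following is only a minimal sketch: the index path
            and the values are illustrative assumptions, not recommendations.
        </para>

        <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location, used for illustration only
$indexPath = '/tmp/example-index';
$index = Zend_Search_Lucene::create($indexPath);

// Buffer more documents in memory before a new segment is written to disk
$index->setMaxBufferedDocs(100);

// Limit the size (in documents) of segments produced by auto-optimization
$index->setMaxMergeDocs(50000);

// Run auto-optimization less often
$index->setMergeFactor(20);

// These values belong to this object only; another script opening the same
// index works with its own settings
printf(
    "MaxBufferedDocs=%d, MaxMergeDocs=%d, MergeFactor=%d\n",
    $index->getMaxBufferedDocs(),
    $index->getMaxMergeDocs(),
    $index->getMergeFactor()
);
]]></programlisting>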
        <para>
            <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one
            document per script execution. On the other hand, it's very important for batch
            indexing. Greater values increase indexing performance, but also require more memory.
        </para>

        <para>
            There is simply no way to calculate the best value for the
            <emphasis>MaxBufferedDocs</emphasis> parameter, because it depends on the average
            document size, the analyzer in use and the allowed memory.
        </para>

        <para>
            A good way to find the right value is to perform several tests with the largest
            document you expect to be added to the index
            <footnote>
                <para>
                    <methodname>memory_get_usage()</methodname> and
                    <methodname>memory_get_peak_usage()</methodname> may be used to monitor memory
                    usage.
                </para>
            </footnote>.
            It's a best practice not to use more than half of the allowed memory.
        </para>
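        <para>
            One way such a test might look is sketched below. The index path, the candidate
            <emphasis>MaxBufferedDocs</emphasis> value and the generated sample text are
            assumptions made for illustration; substitute the largest document you realistically
            expect.
        </para>

        <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location and candidate value under test
$index = Zend_Search_Lucene::create('/tmp/example-index');
$index->setMaxBufferedDocs(50);

// Stand-in for the largest document expected in the index
$largestContents = str_repeat('lorem ipsum dolor sit amet ', 20000);

// Fill the buffer once so that peak memory reflects a full batch
for ($i = 0; $i < 50; $i++) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('contents', $largestContents, 'utf-8')
    );
    $index->addDocument($doc);
}
$index->commit();

// Compare the peak against the configured limit
printf(
    "Peak memory: %.1f MB (memory_limit: %s)\n",
    memory_get_peak_usage() / (1024 * 1024),
    ini_get('memory_limit')
);
]]></programlisting>

        <para>
            Repeating the run with different values shows where peak usage approaches half of the
            allowed memory.
        </para>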
        <para>
            <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
            therefore also limits auto-optimization time by guaranteeing that the
            <methodname>addDocument()</methodname> method is not executed more than a certain number
            of times. This is very important for interactive applications.
        </para>

        <para>
            Lowering the <emphasis>MaxMergeDocs</emphasis> parameter may also improve batch
            indexing performance. Index auto-optimization is an iterative process and is performed
            from the bottom up. Small segments are merged into larger segments, which are in turn
            merged into even larger segments and so on. Full index optimization is achieved when
            only one large segment file remains.
        </para>

        <para>
            Small segments generally decrease index quality. Many small segments may also trigger
            the "Too many open files" error, which is determined by OS limitations
            <footnote>
                <para>
                    <classname>Zend_Search_Lucene</classname> keeps each segment file open to
                    improve search performance.
                </para>
            </footnote>.
        </para>

        <para>
            In general, background index optimization should be performed for interactive indexing
            mode, and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
        </para>
        <para>
            <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
            increase the quality of unoptimized indexes. Larger values increase indexing
            performance, but also increase the number of merged segments. This again may trigger the
            "Too many open files" error.
        </para>

        <para>
            <emphasis>MergeFactor</emphasis> groups index segments by their size:

            <orderedlist>
                <listitem>
                    <para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
                </listitem>

                <listitem>
                    <para>
                        Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
                        <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
                    </para>
                </listitem>

                <listitem>
                    <para>
                        Greater than
                        <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
                        not greater than
                        <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
                    </para>
                </listitem>

                <listitem><para>...</para></listitem>
            </orderedlist>
        </para>
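        <para>
            The grouping rule in the list above can be expressed as a small helper. This is only
            an illustration of the size boundaries; <code>segmentGroup()</code> is a hypothetical
            function, not part of <classname>Zend_Search_Lucene</classname>, and it assumes a
            merge factor greater than 1.
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Returns the 1-based size group of a segment containing $docCount documents.
 * Hypothetical helper for illustration; assumes $mergeFactor > 1.
 */
function segmentGroup($docCount, $maxBufferedDocs, $mergeFactor)
{
    $group      = 1;
    $upperBound = $maxBufferedDocs;

    // Each following group is MergeFactor times larger than the previous one
    while ($docCount > $upperBound) {
        $upperBound *= $mergeFactor;
        $group++;
    }

    return $group;
}

// With MaxBufferedDocs = 10 and MergeFactor = 10
var_dump(segmentGroup(10, 10, 10));  // group 1
var_dump(segmentGroup(100, 10, 10)); // group 2
var_dump(segmentGroup(101, 10, 10)); // group 3
]]></programlisting>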
        <para>
            During each <methodname>addDocument()</methodname> call,
            <classname>Zend_Search_Lucene</classname> checks whether merging any segments would
            move the newly created segment into the next group. If so, the merge is performed.
        </para>

        <para>
            So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
            (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
            <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
            documents.
        </para>

        <para>
            This gives a good approximation for the number of segments in the index:
        </para>

        <para>
            <emphasis>NumberOfSegments</emphasis> &lt;= <emphasis>MaxBufferedDocs</emphasis> +
            <emphasis>MergeFactor</emphasis>*log
            <subscript><emphasis>MergeFactor</emphasis></subscript>
            (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
        </para>
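        <para>
            To get a feel for this estimate, it can be evaluated for a few candidate settings.
            The function below is a hypothetical helper written for this illustration; it simply
            computes the right-hand side of the inequality above.
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Upper estimate for the number of segments, following the approximation above.
 * Hypothetical helper for illustration; assumes $mergeFactor > 1.
 */
function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
{
    // The logarithmic term only applies once the index has outgrown the first group
    $levels = max(0.0, log($numberOfDocuments / $maxBufferedDocs, $mergeFactor));

    return $maxBufferedDocs + $mergeFactor * $levels;
}

// 100,000 documents with MaxBufferedDocs = 10 and MergeFactor = 10:
// 10 + 10 * log10(10,000), which is about 50 segments
printf("%.0f\n", estimateMaxSegments(100000, 10, 10));
]]></programlisting>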
        <para>
            <emphasis>MaxBufferedDocs</emphasis> is determined by the allowed memory. With that
            value fixed, you can choose an appropriate merge factor to get a reasonable number of
            segments.
        </para>

        <para>
            Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
            indexing performance than tuning <emphasis>MaxMergeDocs</emphasis>, but it's also more
            coarse-grained. So use the estimate above to tune <emphasis>MergeFactor</emphasis>,
            then adjust <emphasis>MaxMergeDocs</emphasis> to get the best batch indexing
            performance.
        </para>
    </sect2>
    <sect2 id="zend.search.lucene.best-practice.shutting-down">
        <title>Index during Shut Down</title>

        <para>
            The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
            if any documents were added to the index but not written to a new segment.
        </para>

        <para>
            It may also trigger an auto-optimization process.
        </para>

        <para>
            The index object is automatically closed when it, and all returned QueryHit objects, go
            out of scope.
        </para>

        <para>
            If the index object is stored in a global variable, then it is closed only at the end
            of script execution
            <footnote>
                <para>
                    This may also occur if the index or QueryHit instances are referred to in some
                    cyclical data structures, because <acronym>PHP</acronym> garbage collects
                    objects with cyclic references only at the end of script execution.
                </para>
            </footnote>.
        </para>

        <para>
            <acronym>PHP</acronym> exception processing is also shut down at this moment.
        </para>

        <para>
            This doesn't prevent the normal index shutdown process, but it may prevent accurate
            error diagnostics if an error occurs during shutdown.
        </para>

        <para>
            There are two ways to avoid this problem.
        </para>

        <para>
            The first is to force the index object out of scope:
        </para>
        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

...

unset($index);
]]></programlisting>
        <para>
            The second is to perform a commit operation before the end of script execution:
        </para>

        <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

$index->commit();
]]></programlisting>

        <para>
            This possibility is also described in the "<link
            linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
            property</link>" section.
        </para>
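        <para>
            Because an explicit commit runs while exception processing is still available, any
            failure can be caught and reported. A minimal sketch, assuming
            <varname>$indexPath</varname> is already defined and that writing to the error log is
            an acceptable way to report the problem:
        </para>

        <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

try {
    $index = Zend_Search_Lucene::open($indexPath);

    // ... add or delete documents here ...

    // Write pending changes now, while exceptions can still be handled
    $index->commit();
} catch (Zend_Search_Lucene_Exception $e) {
    // Reporting through error_log() is an assumption for this example
    error_log('Index update failed: ' . $e->getMessage());
}
]]></programlisting>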
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.unique-id">
        <title>Retrieving documents by unique id</title>

        <para>
            It's common practice to store a unique document id in the index. Examples include
            a URL, a path, or a database id.
        </para>

        <para>
            <classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
            method for retrieving documents containing specified terms.
        </para>

        <para>
            This is more efficient than using the <methodname>find()</methodname> method:
        </para>
        <programlisting language="php"><![CDATA[
// Retrieving documents with find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}
...

// Retrieving documents with find() method using the query API
$term  = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}
...

// Retrieving documents with termDocs() method
$term   = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc      = $index->getDocument($id);
    $title    = $doc->title;
    $contents = $doc->contents;
    ...
}
]]></programlisting>
    </sect2>

    <sect2 id="zend.search.lucene.best-practice.memory-usage">
        <title>Memory Usage</title>

        <para>
            <classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
        </para>

        <para>
            It uses memory to cache some information and optimize searching and indexing
            performance.
        </para>

        <para>
            The memory required differs between modes.
        </para>

        <para>
            The terms dictionary index is loaded during the search. It actually contains every
            128<superscript>th</superscript>
            <footnote>
                <para>
                    The Lucene file format allows you to configure this number, but
                    <classname>Zend_Search_Lucene</classname> doesn't expose this in its
                    <acronym>API</acronym>. Nevertheless you still have the ability to configure
                    this value if the index is prepared with another Lucene implementation.
                </para>
            </footnote>
            term of the full dictionary.
        </para>

        <para>
            Thus memory usage increases if you have a high number of unique terms. This may
            happen if you use untokenized phrases as field values or index a large volume of
            non-text information.
        </para>

        <para>
            An unoptimized index consists of several segments, which also increases memory usage.
            Segments are independent, so each segment contains its own terms dictionary and terms
            dictionary index. If an index consists of <emphasis>N</emphasis> segments, it may
            increase memory usage by <emphasis>N</emphasis> times in the worst case. Perform index
            optimization to merge all segments into one to avoid such memory consumption.
        </para>

        <para>
            Indexing uses the same memory as searching plus memory for buffering documents. The
            amount of memory used may be managed with the <emphasis>MaxBufferedDocs</emphasis>
            parameter.
        </para>

        <para>
            Index optimization (full or partial) uses stream-style data processing and doesn't
            require a lot of memory.
        </para>
    </sect2>
    <sect2 id="zend.search.lucene.best-practice.encoding">
        <title>Encoding</title>

        <para>
            <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all
            strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
        </para>

        <para>
            You needn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
            data, but you should be careful if this is not the case.
        </para>

        <para>
            An incorrect encoding may cause error notices during encoding conversion, or loss of
            data.
        </para>

        <para>
            <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
            for indexed documents and parsed queries.
        </para>

        <para>
            Encoding may be explicitly specified as an optional parameter of field creation methods:
        </para>

        <programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title',
                                              $title,
                                              'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  $contents,
                                                  'utf-8'));
]]></programlisting>

        <para>
            This is the best way to avoid ambiguity in the encoding used.
        </para>
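        <para>
            When the declared encoding is UTF-8, it can be worth verifying the input before
            indexing it, since invalid byte sequences are one source of the conversion notices
            mentioned above. The sketch below uses a hypothetical helper and assumes the
            <acronym>PHP</acronym> mbstring extension is available; it is not part of
            <classname>Zend_Search_Lucene</classname>.
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Returns true if $value is a well-formed UTF-8 string.
 * Hypothetical helper for illustration; requires the mbstring extension.
 */
function isValidUtf8($value)
{
    return mb_check_encoding($value, 'UTF-8');
}

$utf8Title   = "caf\xC3\xA9"; // 'e' with acute accent encoded as UTF-8
$latin1Title = "caf\xE9";     // the same character encoded as ISO-8859-1

var_dump(isValidUtf8($utf8Title));   // bool(true)
var_dump(isValidUtf8($latin1Title)); // bool(false)
]]></programlisting>

        <para>
            A string that fails this check should be passed with its real encoding, or converted
            first, rather than labeled as 'utf-8'.
        </para>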
        <para>
            If the optional encoding parameter is omitted, then the current locale is used. The
            current locale may contain character encoding data in addition to the language
            specification:
        </para>

        <programlisting language="php"><![CDATA[
setlocale(LC_ALL, 'fr_FR');
...

setlocale(LC_ALL, 'de_DE.iso-8859-1');
...

setlocale(LC_ALL, 'ru_RU.UTF-8');
...
]]></programlisting>

        <para>
            The same approach is used to set query string encoding.
        </para>

        <para>
            If encoding is not specified, then the current locale is used to determine the encoding.
        </para>

        <para>
            Encoding may be passed as an optional parameter if the query is parsed explicitly
            before the search:
        </para>

        <programlisting language="php"><![CDATA[
$query =
    Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
...
]]></programlisting>

        <para>
            The default encoding may also be specified with the
            <methodname>setDefaultEncoding()</methodname> method:
        </para>

        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
...
]]></programlisting>

        <para>
            The empty string implies 'current locale'.
        </para>

        <para>
            If the correct encoding is specified, the text can be processed correctly by the
            analyzer. The actual behavior depends on which analyzer is used. See the <link
            linkend="zend.search.lucene.charset">Character Set</link> documentation section for
            details.
        </para>
    </sect2>
    <sect2 id="zend.search.lucene.best-practice.maintenance">
        <title>Index maintenance</title>

        <para>
            It should be clear that <classname>Zend_Search_Lucene</classname>, like any other
            Lucene implementation, is not a "database".
        </para>

        <para>
            Indexes should not be used for data storage. They do not provide partial backup/restore
            functionality, journaling, logging, transactions and many other features associated with
            database management systems.
        </para>

        <para>
            Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
            consistent state at all times.
        </para>

        <para>
            Index backup and restoration should be performed by copying the contents of the index
            folder.
        </para>
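        <para>
            A backup of this kind might be scripted as shown below. The function is a
            hypothetical helper, not part of <classname>Zend_Search_Lucene</classname>. It
            assumes that the index folder contains only plain files and that no process is
            writing to the index while the copy runs.
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Copies every file of an index folder into a backup folder.
 * Hypothetical helper for illustration; returns the number of files copied.
 */
function backupIndexDirectory($indexPath, $backupPath)
{
    if (!is_dir($indexPath)) {
        throw new InvalidArgumentException("Not a directory: $indexPath");
    }
    if (!is_dir($backupPath) && !mkdir($backupPath, 0777, true)) {
        throw new RuntimeException("Cannot create directory: $backupPath");
    }

    $copied = 0;
    foreach (scandir($indexPath) as $entry) {
        $source = $indexPath . DIRECTORY_SEPARATOR . $entry;
        if (!is_file($source)) {
            continue; // skips '.', '..' and any subdirectories
        }
        if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $entry)) {
            throw new RuntimeException("Cannot copy file: $source");
        }
        $copied++;
    }

    return $copied;
}
]]></programlisting>

        <para>
            Restoration is the same operation with the two paths swapped.
        </para>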
        <para>
            If index corruption occurs for any reason, the corrupted index should be restored or
            completely rebuilt.
        </para>

        <para>
            So it's a good idea to back up large indexes and store changelogs, so that manual
            restoration and roll-forward operations can be performed if necessary. This practice
            dramatically reduces index restoration time.
        </para>
    </sect2>
</sect1>
  475. </para>
  476. </sect2>
  477. </sect1>