<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.index-creation">
    <title>Building Indexes</title>
    <sect2 id="zend.search.lucene.index-creation.creating">
        <title>Creating a New Index</title>
        <para>
            Index creation and updating capabilities are implemented within the <classname>Zend_Search_Lucene</classname> component, as well as the Java Lucene project.
            You can use either of these options to create indexes that <classname>Zend_Search_Lucene</classname> can search.
        </para>
        <para>
            The PHP code listing below provides an example of how to index a file
            using the <classname>Zend_Search_Lucene</classname> indexing API:
        </para>
        <programlisting language="php"><![CDATA[
// Create index
$index = Zend_Search_Lucene::create('/data/my-index');
$doc = new Zend_Search_Lucene_Document();
// Store document URL to identify it in the search results
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
// Index document contents
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));
// Add document to the index
$index->addDocument($doc);
]]></programlisting>
        <para>
            Newly added documents are immediately searchable in the index.
        </para>
    </sect2>
    <sect2 id="zend.search.lucene.index-creation.updating">
        <title>Updating Index</title>
        <para>
            The same procedure is used to update an existing index. The only difference
            is that the <code>open()</code> method is called instead of the <code>create()</code> method:
        </para>
        <programlisting language="php"><![CDATA[
// Open existing index
$index = Zend_Search_Lucene::open('/data/my-index');
$doc = new Zend_Search_Lucene_Document();
// Store document URL to identify it in the search results
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
// Index document contents
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));
// Add document to the index
$index->addDocument($doc);
]]></programlisting>
    </sect2>
    <sect2 id="zend.search.lucene.index-creation.document-updating">
        <title>Updating Documents</title>
        <para>
            The Lucene index file format doesn't support document updating.
            To update a document, remove it from the index and add it again.
        </para>
        <para>
            The <classname>Zend_Search_Lucene::delete()</classname> method operates on an internal index document id,
            which can be retrieved from a query hit through its <code>id</code> property:
        </para>
        <programlisting language="php"><![CDATA[
$removePath = ...;
$hits = $index->find('path:' . $removePath);
foreach ($hits as $hit) {
    $index->delete($hit->id);
}
]]></programlisting>
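        <para>
            Combining the removal step above with <code>addDocument()</code> gives a complete update.
            The following is a minimal sketch rather than part of the component:
            <code>replaceDocuments()</code> is a hypothetical helper, and it assumes that the query
            passed in matches only the stale copies of the document being replaced.
        </para>
        <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: remove every document matched by $query,
 * then add the updated version of the document.
 *
 * @param  object $index  an opened Zend_Search_Lucene index
 * @param  mixed  $query  query that selects the stale documents
 * @param  object $newDoc the updated Zend_Search_Lucene_Document
 * @return int    number of documents removed
 */
function replaceDocuments($index, $query, $newDoc)
{
    $removed = 0;
    // Delete each stale copy by its internal document id
    foreach ($index->find($query) as $hit) {
        $index->delete($hit->id);
        $removed++;
    }
    // Re-add the document with its new content
    $index->addDocument($newDoc);
    return $removed;
}
]]></programlisting>
        <para>
            With the variables from the listing above, a call might look like
            <code>replaceDocuments($index, 'path:' . $removePath, $doc)</code>.
        </para>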
    </sect2>
    <sect2 id="zend.search.lucene.index-creation.counting">
        <title>Retrieving Index Size</title>
        <para>
            There are two methods to retrieve the size of an index in <classname>Zend_Search_Lucene</classname>.
        </para>
        <para>
            <classname>Zend_Search_Lucene::maxDoc()</classname> returns one greater than the largest possible document number.
            This is the overall number of documents in the index, including deleted documents,
            so it has a synonym: <classname>Zend_Search_Lucene::count()</classname>.
        </para>
        <para>
            <classname>Zend_Search_Lucene::numDocs()</classname> returns the total number of non-deleted documents.
        </para>
        <programlisting language="php"><![CDATA[
$indexSize = $index->count();
$documents = $index->numDocs();
]]></programlisting>
        <para>
            The <classname>Zend_Search_Lucene::isDeleted($id)</classname> method may be used to check whether a document is deleted.
        </para>
        <programlisting language="php"><![CDATA[
for ($count = 0; $count < $index->maxDoc(); $count++) {
    if ($index->isDeleted($count)) {
        echo "Document #$count is deleted.\n";
    }
}
]]></programlisting>
        <para>
            Index optimization removes deleted documents and compacts document ids into a smaller range.
            A document's internal id may therefore change during index optimization.
        </para>
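        <para>
            Because <code>count()</code> includes deleted documents and <code>numDocs()</code> does not,
            their difference is the number of deleted documents still held in the index.
            The sketch below illustrates this; both function names are hypothetical helpers,
            not part of <classname>Zend_Search_Lucene</classname>.
        </para>
        <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: number of deleted documents still stored in the index.
 */
function countDeletedDocuments($index)
{
    return $index->count() - $index->numDocs();
}

/**
 * Hypothetical helper: internal ids of all deleted documents.
 */
function listDeletedDocumentIds($index)
{
    $deleted = array();
    for ($id = 0; $id < $index->maxDoc(); $id++) {
        if ($index->isDeleted($id)) {
            $deleted[] = $id;
        }
    }
    return $deleted;
}
]]></programlisting>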
    </sect2>
    <sect2 id="zend.search.lucene.index-creation.optimization">
        <title>Index Optimization</title>
        <para>
            A Lucene index consists of many segments. Each segment is a completely independent set of data.
        </para>
        <para>
            By design, Lucene index segment files cannot be updated in place: changing a segment requires
            a full reorganization of that segment. See the Lucene index file formats for details
            (<ulink url="http://lucene.apache.org/java/docs/fileformats.html">http://lucene.apache.org/java/docs/fileformats.html</ulink>)
            <footnote>
                <para>The currently supported Lucene index file format is version 2.3 (starting from ZF 1.6).</para>
            </footnote>.
            New documents are therefore added to the index by creating a new segment.
        </para>
        <para>
            An increasing number of segments reduces the quality of the index, but index optimization restores it.
            Optimization essentially merges several segments into a new one. This process does not update existing segments either:
            it generates one new, larger segment and updates the segment list (the 'segments' file).
        </para>
        <para>
            Full index optimization can be triggered by calling the <classname>Zend_Search_Lucene::optimize()</classname> method. It merges all
            index segments into one new segment:
        </para>
        <programlisting language="php"><![CDATA[
// Open existing index
$index = Zend_Search_Lucene::open('/data/my-index');
// Optimize index
$index->optimize();
]]></programlisting>
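        <para>
            Since full optimization rewrites every segment, an application may prefer to run it only when
            it is likely to pay off. The sketch below shows one possible policy, assuming that the share of
            deleted documents is an acceptable trigger; <code>optimizeIfFragmented()</code> and its
            default threshold of 20% are hypothetical and not part of the component.
        </para>
        <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: run a full optimization only when the share of
 * deleted documents exceeds $maxDeletedRatio (a value between 0 and 1).
 *
 * @return bool true if optimize() was called
 */
function optimizeIfFragmented($index, $maxDeletedRatio = 0.2)
{
    $total = $index->count();        // includes deleted documents
    if ($total == 0) {
        return false;                // empty index, nothing to merge
    }
    $deletedRatio = ($total - $index->numDocs()) / $total;
    if ($deletedRatio <= $maxDeletedRatio) {
        return false;
    }
    $index->optimize();
    return true;
}
]]></programlisting>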
        <para>
            Automatic index optimization is performed to keep indexes in a consistent state.
        </para>
        <para>
            Automatic optimization is an iterative process managed by several index options. It merges very small segments
            into larger ones, then merges these larger segments into even larger segments, and so on.
        </para>
        <sect3 id="zend.search.lucene.index-creation.optimization.maxbuffereddocs">
            <title>MaxBufferedDocs auto-optimization option</title>
            <para>
                <emphasis>MaxBufferedDocs</emphasis> is the minimum number of documents required before
                the buffered in-memory documents are written into a new segment.
            </para>
            <para>
                <emphasis>MaxBufferedDocs</emphasis> can be retrieved with <code>$index->getMaxBufferedDocs()</code> and set with
                <code>$index->setMaxBufferedDocs($maxBufferedDocs)</code>.
            </para>
            <para>
                The default value is 10.
            </para>
        </sect3>
        <sect3 id="zend.search.lucene.index-creation.optimization.maxmergedocs">
            <title>MaxMergeDocs auto-optimization option</title>
            <para>
                <emphasis>MaxMergeDocs</emphasis> is the largest number of documents ever merged by <code>addDocument()</code>.
                Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length
                of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier
                searches.
            </para>
            <para>
                <emphasis>MaxMergeDocs</emphasis> can be retrieved with <code>$index->getMaxMergeDocs()</code> and set with
                <code>$index->setMaxMergeDocs($maxMergeDocs)</code>.
            </para>
            <para>
                The default value is <code>PHP_INT_MAX</code>.
            </para>
        </sect3>
        <sect3 id="zend.search.lucene.index-creation.optimization.mergefactor">
            <title>MergeFactor auto-optimization option</title>
            <para>
                <emphasis>MergeFactor</emphasis> determines how often segment indices are merged by <code>addDocument()</code>.
                With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster,
                but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches
                on unoptimized indices are slower, indexing is faster. Thus larger values (&gt; 10) are best for batch
                index creation, and smaller values (&lt; 10) for indices that are interactively maintained.
            </para>
            <para>
                <emphasis>MergeFactor</emphasis> is a good estimate of the average number of segments merged by one auto-optimization pass.
                Values that are too large leave a large number of segments waiting to be merged into a new one, which may cause the
                "failed to open stream: Too many open files" error message. This limit is system dependent.
            </para>
            <para>
                <emphasis>MergeFactor</emphasis> can be retrieved with <code>$index->getMergeFactor()</code> and set with
                <code>$index->setMergeFactor($mergeFactor)</code>.
            </para>
            <para>
                The default value is 10.
            </para>
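            <para>
                The guidance above for <emphasis>MaxMergeDocs</emphasis> and <emphasis>MergeFactor</emphasis>
                can be captured in a small configuration helper. This is only a sketch:
                <code>applyIndexingProfile()</code>, the profile names and the concrete numbers are
                illustrative assumptions chosen to fall on the recommended side of each threshold,
                and should be tuned for the actual workload.
            </para>
            <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: tune auto-optimization for a given workload.
 *
 * @param object $index   an opened Zend_Search_Lucene index
 * @param string $profile 'batch' or 'interactive'
 */
function applyIndexingProfile($index, $profile)
{
    switch ($profile) {
        case 'batch':
            // Larger values: faster indexing, more RAM, more open files
            $index->setMaxMergeDocs(PHP_INT_MAX);
            $index->setMergeFactor(20);
            break;
        case 'interactive':
            // Smaller values: shorter pauses while documents are added
            $index->setMaxMergeDocs(5000);
            $index->setMergeFactor(5);
            break;
        default:
            throw new InvalidArgumentException("Unknown indexing profile: $profile");
    }
}
]]></programlisting>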
            <para>
                Lucene Java and Luke (Lucene Index Toolbox - <ulink url="http://www.getopt.org/luke/">http://www.getopt.org/luke/</ulink>)
                can also be used to optimize an index. The latest Luke release (v0.8) is based on Lucene v2.3 and is compatible with
                the current implementation of the <classname>Zend_Search_Lucene</classname> component (ZF 1.6). Earlier versions of <classname>Zend_Search_Lucene</classname>
                require other versions of the Java Lucene tools to be compatible:
                <itemizedlist>
                    <listitem>
                        <para>ZF 1.5 - Java Lucene 2.1 (Luke tool v0.7.1 - <ulink url="http://www.getopt.org/luke/luke-0.7.1/"/>)</para>
                    </listitem>
                    <listitem>
                        <para>ZF 1.0 - Java Lucene 1.4 - 2.1 (Luke tool v0.6 - <ulink url="http://www.getopt.org/luke/luke-0.6/"/>)</para>
                    </listitem>
                </itemizedlist>
            </para>
        </sect3>
    </sect2>
    <sect2 id="zend.search.lucene.index-creation.permissions">
        <title>Permissions</title>
        <para>
            By default, index files are available for reading and writing by everyone.
        </para>
        <para>
            It's possible to override this with the <classname>Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions()</classname> method:
        </para>
        <programlisting language="php"><![CDATA[
// Get current default file permissions
$currentPermissions =
    Zend_Search_Lucene_Storage_Directory_Filesystem::getDefaultFilePermissions();
// Grant read/write permissions only to the current user and group
Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions(0660);
]]></programlisting>
    </sect2>
    <sect2 id="zend.search.lucene.index-creation.limitations">
        <title>Limitations</title>
        <sect3 id="zend.search.lucene.index-creation.limitations.index-size">
            <title>Index size</title>
            <para>
                Index size is limited to 2GB on 32-bit platforms.
            </para>
            <para>
                Use 64-bit platforms for larger indices.
            </para>
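            <para>
                A deployment script can check for this limit before building a large index.
                The sketch below assumes that the integer width of the running PHP build
                (<code>PHP_INT_SIZE</code>) is what separates 32-bit from 64-bit platforms here;
                <code>isIndexSizeLimited()</code> is a hypothetical helper.
            </para>
            <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: true when PHP integers are narrower than 64 bits,
 * which this sketch treats as "index size limited to 2GB".
 */
function isIndexSizeLimited()
{
    return PHP_INT_SIZE < 8;
}

// Possible use in a setup script:
// if (isIndexSizeLimited()) {
//     trigger_error('Index size is limited to 2GB on this platform', E_USER_WARNING);
// }
]]></programlisting>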
        </sect3>
        <sect3 id="zend.search.lucene.index-creation.limitations.filesystems">
            <title>Supported Filesystems</title>
            <para>
                <classname>Zend_Search_Lucene</classname> uses <code>flock()</code> to provide concurrent searching, index updating and optimization.
            </para>
            <para>
                According to the PHP <ulink url="http://www.php.net/manual/en/function.flock.php">documentation</ulink>,
                "<code>flock()</code> will not work on NFS and many other networked file systems".
            </para>
            <para>
                Do not use networked file systems with <classname>Zend_Search_Lucene</classname>.
            </para>
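            <para>
                A quick probe can reveal a directory where <code>flock()</code> fails outright.
                The sketch below is a hypothetical helper that uses only standard PHP functions.
                A failed probe is a clear warning sign, but a successful one does not prove that locks
                are honored across hosts, so it is no substitute for the rule above.
            </para>
            <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: try to take and release an exclusive lock
 * on a temporary file inside $directory.
 *
 * @return bool true if the lock could be acquired
 */
function canLockInDirectory($directory)
{
    if (!is_dir($directory) || !is_writable($directory)) {
        return false;
    }
    $probe = tempnam($directory, 'lck');
    if ($probe === false) {
        return false;
    }
    $locked = false;
    $handle = fopen($probe, 'r+');
    if ($handle !== false) {
        // Non-blocking, so the probe never hangs on a broken lock manager
        $locked = flock($handle, LOCK_EX | LOCK_NB);
        if ($locked) {
            flock($handle, LOCK_UN);
        }
        fclose($handle);
    }
    unlink($probe);
    return $locked;
}
]]></programlisting>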
        </sect3>
    </sect2>
</sect1>
<!--
vim:se ts=4 sw=4 et:
-->