
  1. <?xml version="1.0" encoding="UTF-8"?>
  2. <!-- Reviewed: no -->
  3. <sect1 id="zend.search.lucene.index-creation">
  4. <title>Building Indexes</title>
  5. <sect2 id="zend.search.lucene.index-creation.creating">
  6. <title>Creating a New Index</title>
  7. <para>
  8. Index creation and updating capabilities are implemented within the
  9. <classname>Zend_Search_Lucene</classname> component, as well as the Java Lucene project.
  10. You can use either of these options to create indexes that
  11. <classname>Zend_Search_Lucene</classname> can search.
  12. </para>
  13. <para>
  14. The <acronym>PHP</acronym> code listing below provides an example of how to index a file
  15. using the <classname>Zend_Search_Lucene</classname> indexing <acronym>API</acronym>:
  16. </para>
  17. <programlisting language="php"><![CDATA[
  18. // Create index
  19. $index = Zend_Search_Lucene::create('/data/my-index');
  20. $doc = new Zend_Search_Lucene_Document();
  21. // Store document URL to identify it in the search results
  22. $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
  23. // Index document contents
  24. $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));
  25. // Add document to the index
  26. $index->addDocument($doc);
  27. ]]></programlisting>
  28. <para>
  29. Newly added documents are immediately searchable in the index.
  30. </para>
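<para>
As a minimal sketch, the same calls can be repeated in a loop to index several documents
and then search them right away. The <code>$pages</code> array and its contents are
hypothetical sample data, and the listing assumes Zend Framework is on the include path.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical sample data: document URL => document text
$pages = array(
    'http://example.com/a.html' => 'First page text',
    'http://example.com/b.html' => 'Second page text',
);

$index = Zend_Search_Lucene::create('/data/my-index');

foreach ($pages as $docUrl => $docContent) {
    $doc = new Zend_Search_Lucene_Document();
    // Stored and indexed, so it can be shown in search results
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
    // Indexed but not stored
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));
    $index->addDocument($doc);
}

// The documents added above can be queried without reopening the index
foreach ($index->find('contents:second') as $hit) {
    echo $hit->url, "\n";
}
]]></programlisting>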
  31. </sect2>
  32. <sect2 id="zend.search.lucene.index-creation.updating">
  33. <title>Updating Index</title>
  34. <para>
  35. The same procedure is used to update an existing index. The only difference
  36. is that the open() method is called instead of the create() method:
  37. </para>
  38. <programlisting language="php"><![CDATA[
  39. // Open existing index
  40. $index = Zend_Search_Lucene::open('/data/my-index');
  41. $doc = new Zend_Search_Lucene_Document();
  42. // Store document URL to identify it in search result.
  43. $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
  44. // Index document content
  45. $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
  46. $docContent));
  47. // Add document to the index.
  48. $index->addDocument($doc);
  49. ]]></programlisting>
  50. </sect2>
  51. <sect2 id="zend.search.lucene.index-creation.document-updating">
  52. <title>Updating Documents</title>
  53. <para>
  54. The Lucene index file format doesn't support document updating.
  55. Documents should be removed and re-added to the index to effectively update them.
  56. </para>
  57. <para>
  58. The <methodname>Zend_Search_Lucene::delete()</methodname> method operates on an internal
  59. index document id, which can be retrieved from a query hit through its 'id' property:
  60. </para>
  61. <programlisting language="php"><![CDATA[
  62. $removePath = ...;
  63. $hits = $index->find('path:' . $removePath);
  64. foreach ($hits as $hit) {
  65. $index->delete($hit->id);
  66. }
  67. ]]></programlisting>
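<para>
The delete-then-re-add cycle described above might look like the following sketch. It
assumes that each document was indexed with an untokenized <code>path</code> keyword
field; the path and content values are hypothetical. A term query is used here so that
the path is matched exactly rather than being parsed as a query string.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/data/my-index');

// Hypothetical values for illustration
$path       = '/docs/report.html';
$newContent = 'Updated report text';

// Match the 'path' keyword field exactly
$term  = new Zend_Search_Lucene_Index_Term($path, 'path');
$query = new Zend_Search_Lucene_Search_Query_Term($term);

// Step 1: remove every existing version of the document
foreach ($index->find($query) as $hit) {
    $index->delete($hit->id);
}

// Step 2: add the new version
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('path', $path));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $newContent));
$index->addDocument($doc);

// Write pending changes to the index
$index->commit();
]]></programlisting>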
  68. </sect2>
  69. <sect2 id="zend.search.lucene.index-creation.counting">
  70. <title>Retrieving Index Size</title>
  71. <para>
  72. There are two methods to retrieve the size of an index in
  73. <classname>Zend_Search_Lucene</classname>.
  74. </para>
  75. <para>
  76. <methodname>Zend_Search_Lucene::maxDoc()</methodname> returns one greater than the
  77. largest possible document number. In practice this is the total number of documents in
  78. the index, including deleted documents, so it has a synonym:
  79. <methodname>Zend_Search_Lucene::count()</methodname>.
  80. </para>
  81. <para>
  82. <methodname>Zend_Search_Lucene::numDocs()</methodname> returns the total number of
  83. non-deleted documents.
  84. </para>
  85. <programlisting language="php"><![CDATA[
  86. $indexSize = $index->count();
  87. $documents = $index->numDocs();
  88. ]]></programlisting>
  89. <para>
  90. The <methodname>Zend_Search_Lucene::isDeleted($id)</methodname> method may be used to check
  91. if a document is deleted.
  92. </para>
  93. <programlisting language="php"><![CDATA[
  94. for ($count = 0; $count < $index->maxDoc(); $count++) {
  95. if ($index->isDeleted($count)) {
  96. echo "Document #$count is deleted.\n";
  97. }
  98. }
  99. ]]></programlisting>
  100. <para>
  101. Index optimization removes deleted documents and compacts document IDs into a smaller
  102. range. A document's internal id may therefore change during index optimization.
  103. </para>
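<para>
The two counters can be combined to see how many deleted documents an index still holds,
and <methodname>isDeleted()</methodname> can be used to skip them while walking the index.
The following is a sketch only; it assumes every document has a stored <code>url</code>
field, as in the earlier listings.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/data/my-index');

// count() includes deleted documents, numDocs() does not
$deleted = $index->count() - $index->numDocs();
echo "Deleted documents still in the index: $deleted\n";

// Walk all document ids and skip the deleted ones
for ($id = 0; $id < $index->maxDoc(); $id++) {
    if ($index->isDeleted($id)) {
        continue;
    }
    $doc = $index->getDocument($id);
    echo $id, ': ', $doc->getFieldValue('url'), "\n";
}
]]></programlisting>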
  104. </sect2>
  105. <sect2 id="zend.search.lucene.index-creation.optimization">
  106. <title>Index optimization</title>
  107. <para>
  108. A Lucene index consists of many segments. Each segment is a completely independent set
  109. of data.
  110. </para>
  111. <para>
  112. Lucene index segment files can't be updated by design. A segment update needs full
  113. segment reorganization. See Lucene index file formats for details (<ulink
  114. url="http://lucene.apache.org/java/2_3_0/fileformats.html">http://lucene.apache.org/java/2_3_0/fileformats.html</ulink>)
  115. <footnote>
  116. <para>
  117. The currently supported Lucene index file format is version 2.3 (starting from
  118. Zend Framework 1.6).
  119. </para>
  120. </footnote>.
  121. New documents are added to the index by creating a new segment.
  122. </para>
  123. <para>
  124. An increasing number of segments reduces the quality of the index, but index optimization
  125. restores it. Optimization essentially merges several segments into a new one. This
  126. process doesn't update existing segments either. It generates one new large segment and
  127. updates the segment list (the 'segments' file).
  128. </para>
  129. <para>
  130. Full index optimization can be triggered by calling the
  131. <methodname>Zend_Search_Lucene::optimize()</methodname> method. It merges all index
  132. segments into one new segment:
  133. </para>
  134. <programlisting language="php"><![CDATA[
  135. // Open existing index
  136. $index = Zend_Search_Lucene::open('/data/my-index');
  137. // Optimize index.
  138. $index->optimize();
  139. ]]></programlisting>
  140. <para>
  141. Automatic index optimization is performed to keep indexes in a consistent state.
  142. </para>
  143. <para>
  144. Automatic optimization is an iterative process managed by several index options. It
  145. merges very small segments into larger ones, then merges these larger segments into even
  146. larger segments and so on.
  147. </para>
  148. <sect3 id="zend.search.lucene.index-creation.optimization.maxbuffereddocs">
  149. <title>MaxBufferedDocs auto-optimization option</title>
  150. <para>
  151. <emphasis>MaxBufferedDocs</emphasis> is the minimum number of documents that must be
  152. buffered in memory before they are written into a new segment.
  153. </para>
  154. <para>
  155. <emphasis>MaxBufferedDocs</emphasis> can be retrieved or set by
  156. <code>$index->getMaxBufferedDocs()</code> or
  157. <code>$index->setMaxBufferedDocs($maxBufferedDocs)</code> calls.
  158. </para>
  159. <para>
  160. The default value is 10.
  161. </para>
  162. </sect3>
  163. <sect3 id="zend.search.lucene.index-creation.optimization.maxmergedocs">
  164. <title>MaxMergeDocs auto-optimization option</title>
  165. <para>
  166. <emphasis>MaxMergeDocs</emphasis> is the largest number of documents ever merged by
  167. addDocument(). Small values (e.g., less than 10,000) are best for interactive
  168. indexing, as this limits the length of pauses while indexing to a few seconds.
  169. Larger values are best for batched indexing and speedier searches.
  170. </para>
  171. <para>
  172. <emphasis>MaxMergeDocs</emphasis> can be retrieved or set by
  173. <code>$index->getMaxMergeDocs()</code> or
  174. <code>$index->setMaxMergeDocs($maxMergeDocs)</code> calls.
  175. </para>
  176. <para>
  177. The default value is PHP_INT_MAX.
  178. </para>
  179. </sect3>
  180. <sect3 id="zend.search.lucene.index-creation.optimization.mergefactor">
  181. <title>MergeFactor auto-optimization option</title>
  182. <para>
  183. <emphasis>MergeFactor</emphasis> determines how often segment indices are merged by
  184. addDocument(). With smaller values, less <acronym>RAM</acronym> is used while
  185. indexing, and searches on unoptimized indices are faster, but indexing speed is
  186. slower. With larger values, more <acronym>RAM</acronym> is used during indexing, and
  187. while searches on unoptimized indices are slower, indexing is faster. Thus larger
  188. values (&gt; 10) are best for batch index creation, and smaller values (&lt; 10) for
  189. indices that are interactively maintained.
  190. </para>
  191. <para>
  192. <emphasis>MergeFactor</emphasis> is a good estimate of the average number of segments
  193. merged by one auto-optimization pass. Values that are too large produce a large number of
  194. segments before they are merged into a new one, which may cause the "failed to
  195. open stream: Too many open files" error message. This limitation is system
  196. dependent.
  197. </para>
  198. <para>
  199. <emphasis>MergeFactor</emphasis> can be retrieved or set by
  200. <code>$index->getMergeFactor()</code> or
  201. <code>$index->setMergeFactor($mergeFactor)</code> calls.
  202. </para>
  203. <para>
  204. The default value is 10.
  205. </para>
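<para>
As a sketch of how these options fit together for batch index creation, the listing below
raises <emphasis>MaxBufferedDocs</emphasis> and <emphasis>MergeFactor</emphasis> before
adding documents and runs a full optimization at the end. The values 100 and 20 are
illustrative assumptions, not recommendations; suitable values depend on available memory
and on the open file limit of the system.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::create('/data/my-index');

// Illustrative batch settings (assumed values)
$index->setMaxBufferedDocs(100); // buffer more documents before writing a segment
$index->setMergeFactor(20);      // merge segments less often while indexing

echo 'MaxBufferedDocs: ', $index->getMaxBufferedDocs(), "\n";
echo 'MaxMergeDocs:    ', $index->getMaxMergeDocs(), "\n";
echo 'MergeFactor:     ', $index->getMergeFactor(), "\n";

// ... add documents with $index->addDocument($doc) here ...

// Merge all segments into one once the batch is complete
$index->optimize();
]]></programlisting>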
  206. <para>
  207. Lucene Java and Luke (Lucene Index Toolbox - <ulink
  208. url="http://www.getopt.org/luke/">http://www.getopt.org/luke/</ulink>) can also
  209. be used to optimize an index. The latest Luke release (v0.8) is based on Lucene v2.3 and
  210. is compatible with the current implementation of the <classname>Zend_Search_Lucene</classname>
  211. component (Zend Framework 1.6). Earlier versions of the
  212. <classname>Zend_Search_Lucene</classname> implementation need other versions of the
  213. Java Lucene tools to be compatible:
  214. <itemizedlist>
  215. <listitem>
  216. <para>
  217. Zend Framework 1.5 - Java Lucene 2.1 (Luke tool v0.7.1 - <ulink
  218. url="http://www.getopt.org/luke/luke-0.7.1/"/>)
  219. </para>
  220. </listitem>
  221. <listitem>
  222. <para>
  223. Zend Framework 1.0 - Java Lucene 1.4 - 2.1 (Luke tool v0.6 - <ulink
  224. url="http://www.getopt.org/luke/luke-0.6/"/>)
  225. </para>
  226. </listitem>
  227. </itemizedlist>
  228. </para>
  229. </sect3>
  230. </sect2>
  231. <sect2 id="zend.search.lucene.index-creation.permissions">
  232. <title>Permissions</title>
  233. <para>
  234. By default, index files are available for reading and writing by everyone.
  235. </para>
  236. <para>
  237. It's possible to override this with the
  238. <methodname>Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions()</methodname>
  239. method:
  240. </para>
  241. <programlisting language="php"><![CDATA[
  242. // Get current default file permissions
  243. $currentPermissions =
  244. Zend_Search_Lucene_Storage_Directory_Filesystem::getDefaultFilePermissions();
  245. // Grant read/write permissions only to the current user and group
  246. Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions(0660);
  247. ]]></programlisting>
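<para>
A possible usage pattern, shown here as a sketch, is to tighten the default before an
index is created and to restore the previous value afterwards. It assumes that the default
is applied to index files created after the call; the index path and the mode 0600 are
hypothetical.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Remember the current default so it can be restored later
$previous =
    Zend_Search_Lucene_Storage_Directory_Filesystem::getDefaultFilePermissions();

// Restrict new index files to the current user only (assumed mode)
Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions(0600);

$index = Zend_Search_Lucene::create('/data/private-index');

// Restore the earlier default for any other indexes
Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions($previous);
]]></programlisting>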
  248. </sect2>
  249. <sect2 id="zend.search.lucene.index-creation.limitations">
  250. <title>Limitations</title>
  251. <sect3 id="zend.search.lucene.index-creation.limitations.index-size">
  252. <title>Index size</title>
  253. <para>
  254. Index size is limited to 2GB on 32-bit platforms.
  255. </para>
  256. <para>
  257. Use 64-bit platforms for larger indices.
  258. </para>
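<para>
A script can warn about this limit at run time. The check below is a heuristic sketch: it
assumes that a <acronym>PHP</acronym> build with integers smaller than 8 bytes is subject
to the 2GB limit.
</para>
<programlisting language="php"><![CDATA[
// PHP_INT_SIZE is 4 on builds with 32-bit integers and 8 on 64-bit builds
if (PHP_INT_SIZE < 8) {
    trigger_error(
        'This PHP build uses 32-bit integers; keep the index below 2GB.',
        E_USER_NOTICE
    );
}
]]></programlisting>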
  259. </sect3>
  260. <sect3 id="zend.search.lucene.index-creation.limitations.filesystems">
  261. <title>Supported Filesystems</title>
  262. <para>
  263. <classname>Zend_Search_Lucene</classname> uses <methodname>flock()</methodname> to
  264. provide concurrent searching, index updating and optimization.
  265. </para>
  266. <para>
  267. According to the <acronym>PHP</acronym> <ulink
  268. url="http://www.php.net/manual/en/function.flock.php">documentation</ulink>,
  269. "<methodname>flock()</methodname> will not work on NFS and many other networked file
  270. systems".
  271. </para>
  272. <para>
  273. Do not use networked file systems with <classname>Zend_Search_Lucene</classname>.
  274. </para>
  275. </sect3>
  276. </sect2>
  277. </sect1>
  278. <!--
  279. vim:se ts=4 sw=4 et:
  280. -->