<?xml version="1.0" encoding="UTF-8"?>
<!-- EN-Revision: 19766 -->
<!-- Reviewed: no -->
<sect1 id="learning.lucene.indexing">
<title>Indexing</title>
<para>
Indexing is performed by adding a new document to an existing or new index:
</para>
<programlisting language="php"><![CDATA[
$index->addDocument($doc);
]]></programlisting>
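<para>
As a minimal sketch of where the $index and $doc variables might come from, the following
listing opens an index if one already exists and creates it otherwise. The index path, the
field name, and the field value are placeholders chosen for illustration, and the listing
assumes that Zend Framework 1 is available on the include path.
</para>
<example id="learning.lucene.indexing.open-or-create">
<title>Opening or creating an index (illustrative sketch)</title>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location; adjust it to your application.
$indexPath = '/tmp/example-index';

try {
    // open() throws an exception if no index exists at the given path.
    $index = Zend_Search_Lucene::open($indexPath);
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($indexPath);
}

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('title', 'Example title'));

$index->addDocument($doc);

// Write buffered documents to the index.
$index->commit();
]]></programlisting>
</example>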
<para>
There are two ways to create a document object. The first is to construct it manually.
</para>
<example id="learning.lucene.indexing.doc-creation">
<title>Manual Document Construction</title>
<programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
// text(): tokenized, indexed, and stored with the document
$doc->addField(Zend_Search_Lucene_Field::text('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::text('title', $docTitle));
// unStored(): tokenized and indexed, but not stored
$doc->addField(Zend_Search_Lucene_Field::unStored('contents', $docBody));
// binary(): stored, but neither tokenized nor indexed
$doc->addField(Zend_Search_Lucene_Field::binary('avatar', $avatarData));
]]></programlisting>
</example>
<para>
The second method is to load it from <acronym>HTML</acronym> or Microsoft Office 2007 files:
</para>
<example id="learning.lucene.indexing.doc-loading">
<title>Document Loading</title>
<programlisting language="php"><![CDATA[
// loadHTML() takes the markup as a string
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
// The Office 2007 loaders take a file path
$doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($path);
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
]]></programlisting>
</example>
<para>
A document loaded from one of the supported formats can still be extended manually
with new user-defined fields.
</para>
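<para>
For instance, a loaded <acronym>HTML</acronym> document might be tagged with
application-specific metadata before it is indexed. In the sketch below, the
<acronym>HTML</acronym> string and the field names and values (category, updated) are
hypothetical and serve only as an illustration.
</para>
<example id="learning.lucene.indexing.doc-extending">
<title>Extending a loaded document (illustrative sketch)</title>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Placeholder markup standing in for real page content.
$htmlString = '<html><head><title>Example</title></head>'
            . '<body><p>Example body text.</p></body></html>';

$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);

// keyword(): stored and indexed, but not tokenized (suits exact-match values).
$doc->addField(Zend_Search_Lucene_Field::keyword('category', 'news'));

// unIndexed(): stored with the document, but not searchable.
$doc->addField(Zend_Search_Lucene_Field::unIndexed('updated', '2010-01-01'));
]]></programlisting>
</example>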
<sect2 id="learning.lucene.indexing.policy">
<title>Indexing Policy</title>
<para>
You should define an indexing policy as part of your application's architectural design.
</para>
<para>
You may need an on-demand indexing configuration (similar to an <acronym>OLTP</acronym>
system). In such systems, you usually add one document per user request, so the
<emphasis>MaxBufferedDocs</emphasis> option has no effect. On the other
hand, <emphasis>MaxMergeDocs</emphasis> is really helpful, because it lets you limit
the maximum script execution time. <emphasis>MergeFactor</emphasis> should be set to a value
that balances the average indexing time (which is also affected by the average
auto-optimization time) against search performance (the index optimization level depends on
the number of segments).
</para>
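<para>
An on-demand configuration along these lines could be sketched as follows. The index path
and both numeric values are illustrative assumptions rather than recommendations, and the
listing assumes that an index already exists at the given path.
</para>
<example id="learning.lucene.indexing.policy.on-demand">
<title>On-demand indexing configuration (illustrative sketch)</title>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical location of an existing index.
$indexPath = '/tmp/example-index';
$index = Zend_Search_Lucene::open($indexPath);

// Cap the number of documents merged by a single addDocument() call,
// which bounds the pause a user request can incur.
$index->setMaxMergeDocs(1000);

// Controls how often segments are merged; tune this against search speed.
$index->setMergeFactor(10);

// One document per request.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('title', 'Example title'));
$index->addDocument($doc);
$index->commit();
]]></programlisting>
</example>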
<para>
If you will primarily be performing batch index updates, set the
<emphasis>MaxBufferedDocs</emphasis> option to the largest value that the
available memory supports. <emphasis>MaxMergeDocs</emphasis> and
<emphasis>MergeFactor</emphasis> should be set to values that reduce auto-optimization
involvement as much as possible<footnote><para>An additional limit is the maximum number of
file handles that the operating system allows to be open
concurrently.</para></footnote>. Full index optimization should be applied after
indexing.
</para>
<example id="learning.lucene.indexing.optimization">
<title>Index optimization</title>
<programlisting language="php"><![CDATA[
$index->optimize();
]]></programlisting>
</example>
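<para>
Putting the batch settings together, a batch run might look like the sketch below. The index
path, the numeric values, and the $documents array are placeholders; suitable values depend
on the available memory and on the operating system's open-file limit.
</para>
<example id="learning.lucene.indexing.policy.batch">
<title>Batch indexing configuration (illustrative sketch)</title>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location and input data.
$indexPath = '/tmp/example-batch-index';
$documents = array(
    array('title' => 'First',  'body' => 'First document body'),
    array('title' => 'Second', 'body' => 'Second document body'),
);

$index = Zend_Search_Lucene::create($indexPath);

// Buffer more documents in memory before a new segment is written.
$index->setMaxBufferedDocs(500);

// A larger merge factor makes automatic segment merges less frequent.
$index->setMergeFactor(50);

foreach ($documents as $data) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::text('title', $data['title']));
    $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $data['body']));
    $index->addDocument($doc);
}

// Merge all segments once, after the whole batch has been added.
$index->optimize();
]]></programlisting>
</example>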
<para>
In some configurations, it is more effective to serialize index updates by organizing
update requests into a queue and processing several update requests in a single script
execution. This reduces the overhead of opening the index and makes it possible to take
advantage of document buffering.
</para>
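<para>
The queue-based approach could be sketched as follows. The fetchPendingUpdates() function is
a hypothetical stand-in for whatever queue the application uses (for example, a database
table), and the index path is a placeholder for an existing index.
</para>
<example id="learning.lucene.indexing.policy.queue">
<title>Processing queued updates in one run (illustrative sketch)</title>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical queue reader: a real application would fetch pending
// update requests from its own storage instead of a fixed array.
function fetchPendingUpdates()
{
    return array(
        array('url' => 'http://example.com/a', 'title' => 'A', 'body' => 'First body'),
        array('url' => 'http://example.com/b', 'title' => 'B', 'body' => 'Second body'),
    );
}

// Hypothetical location of an existing index.
$indexPath = '/tmp/example-index';

// The index is opened once for the whole batch of requests.
$index = Zend_Search_Lucene::open($indexPath);

foreach (fetchPendingUpdates() as $update) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::keyword('url', $update['url']));
    $doc->addField(Zend_Search_Lucene_Field::text('title', $update['title']));
    $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $update['body']));
    $index->addDocument($doc);
}

// Write all buffered documents to the index in one step.
$index->commit();
]]></programlisting>
</example>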
</sect2>
</sect1>