<?xml version="1.0" encoding="UTF-8"?>
<!-- EN-Revision: 19777 -->
<!-- Reviewed: no -->
<sect1 id="learning.lucene.indexing">
    <title>Indexing</title>
    <para>
        Indexing is performed by adding a new document to an existing or a new index:
    </para>
    <programlisting language="php"><![CDATA[
$index->addDocument($doc);
]]></programlisting>
    <para>
        There are two ways to create document objects. The first is to build them manually.
    </para>
    <example id="learning.lucene.indexing.doc-creation">
        <title>Manual document construction</title>
        <programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
// Text fields are tokenized, indexed and stored in the index
$doc->addField(Zend_Search_Lucene_Field::text('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::text('title', $docTitle));
// UnStored fields are tokenized and indexed, but not stored
$doc->addField(Zend_Search_Lucene_Field::unStored('contents', $docBody));
// Binary fields are stored, but neither tokenized nor indexed
$doc->addField(Zend_Search_Lucene_Field::binary('avatar', $avatarData));
]]></programlisting>
    </example>
    <para>
        The second way is to load the document from <acronym>HTML</acronym> or from a
        Microsoft Office 2007 file:
    </para>
    <example id="learning.lucene.indexing.doc-loading">
        <title>Document loading</title>
        <programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($path);
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
]]></programlisting>
    </example>
    <para>
        A document loaded from one of the supported formats can still be extended manually
        with new user-defined fields.
    </para>
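    <para>
        As a minimal sketch of this, the following snippet loads an
        <acronym>HTML</acronym> document and adds one custom field before indexing it. The
        field name <emphasis>category</emphasis> and its value are hypothetical and serve
        only as illustration. The snippet assumes that Zend Framework 1 is available
        through autoloading and that <varname>$index</varname> and
        <varname>$htmlString</varname> are already defined.
    </para>
    <programlisting language="php"><![CDATA[
// Load the document from an HTML string (as in the example above)
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);

// Extend it with a user-defined field; 'category' is a hypothetical name.
// Keyword fields are stored and indexed, but not tokenized.
$doc->addField(Zend_Search_Lucene_Field::keyword('category', 'manual'));

$index->addDocument($doc);
]]></programlisting>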
    <sect2 id="learning.lucene.indexing.policy">
        <title>Indexing Policy</title>
        <para>
            You should define an indexing policy as part of your application's architectural
            design.
        </para>
        <para>
            You may need an on-demand indexing configuration (something like an
            <acronym>OLTP</acronym> system). In such systems, you usually add one document per
            user request, so the <emphasis>MaxBufferedDocs</emphasis> option will not affect
            the system. <emphasis>MaxMergeDocs</emphasis>, on the other hand, is really helpful
            because it lets you limit the maximum script execution time.
            <emphasis>MergeFactor</emphasis> should be set to a value that balances average
            indexing time (which is also affected by average auto-optimization time) against
            search performance (the index optimization level depends on the number of
            segments).
        </para>
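        <para>
            An on-demand configuration along these lines might be sketched as follows. The
            function name <methodname>configureForOnDemandIndexing()</methodname> and the
            numeric values are assumptions chosen for illustration, not recommendations; the
            right values depend on your script time limit and index size.
        </para>
        <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: tunes an index for one-document-per-request updates.
 *
 * @param  object $index An opened index (for example a Zend_Search_Lucene instance)
 * @return object The same index, for chaining
 */
function configureForOnDemandIndexing($index)
{
    // Segments above this size are left out of auto-optimization,
    // which bounds the time a single request can spend merging.
    $index->setMaxMergeDocs(10000); // illustrative value

    // Trade-off between indexing time and the number of segments searched.
    $index->setMergeFactor(10);     // illustrative value

    return $index;
}
]]></programlisting>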
        <para>
            If you will primarily be performing batch index updates, your configuration should
            set the <emphasis>MaxBufferedDocs</emphasis> option to the maximum value supported
            by the available amount of memory. <emphasis>MaxMergeDocs</emphasis> and
            <emphasis>MergeFactor</emphasis> have to be set to values that reduce
            auto-optimization involvement as much as possible <footnote><para>An additional
            limit is the maximum number of file handles the operating system supports for
            concurrently open files</para></footnote>. Full index optimization should be
            applied after indexing.
        </para>
        <example id="learning.lucene.indexing.optimization">
            <title>Index optimization</title>
            <programlisting language="php"><![CDATA[
$index->optimize();
]]></programlisting>
        </example>
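        <para>
            Put together, a batch update following the policy described above could look
            roughly like this. The helper name <methodname>indexBatch()</methodname> and the
            numeric values are assumptions for illustration; in practice the buffer size is
            limited by available memory and the merge factor by the number of files the
            operating system allows to be open at once.
        </para>
        <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: adds a batch of documents and then fully optimizes.
 *
 * @param  object $index           An opened index (for example a Zend_Search_Lucene instance)
 * @param  array  $docs            Documents to add
 * @param  int    $maxBufferedDocs Documents buffered in memory before a segment is written
 * @return int    Number of documents added
 */
function indexBatch($index, array $docs, $maxBufferedDocs = 500)
{
    // Buffer as many documents as memory allows (500 is only an example)
    $index->setMaxBufferedDocs($maxBufferedDocs);

    // A larger merge factor means fewer automatic merges during the batch,
    // at the cost of more segment files being open at the same time.
    $index->setMergeFactor(50); // illustrative value

    foreach ($docs as $doc) {
        $index->addDocument($doc);
    }

    // Flush buffered documents, then merge all segments into one
    $index->commit();
    $index->optimize();

    return count($docs);
}
]]></programlisting>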
        <para>
            In some configurations, it is more effective to serialize index updates by
            organizing update requests into a queue and processing several of them in a single
            script execution. This reduces the overhead of opening the index and makes it
            possible to take advantage of index document buffering.
        </para>
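        <para>
            One way to sketch such a queue processor is shown below. How the queue is stored
            (database table, message queue, files) is left open here; the function name
            <methodname>processUpdateQueue()</methodname> and the plain array used as the
            queue are hypothetical simplifications.
        </para>
        <programlisting language="php"><![CDATA[
/**
 * Hypothetical helper: processes up to $maxPerRun queued documents in one run.
 *
 * @param  object $index       An opened index (for example a Zend_Search_Lucene instance)
 * @param  array  $pendingDocs Queued documents, oldest first
 * @param  int    $maxPerRun   Upper bound on documents handled per script execution
 * @return array  Documents still waiting in the queue
 */
function processUpdateQueue($index, array $pendingDocs, $maxPerRun = 50)
{
    // Take only as many requests as one script execution should handle
    $batch = array_slice($pendingDocs, 0, $maxPerRun);

    foreach ($batch as $doc) {
        $index->addDocument($doc);
    }

    // The index is opened once by the caller and committed once per run
    if (count($batch) > 0) {
        $index->commit();
    }

    return array_slice($pendingDocs, count($batch));
}
]]></programlisting>
        <para>
            The caller opens the index once, passes in the pending documents, and stores the
            returned remainder for the next run.
        </para>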
    </sect2>
</sect1>