<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="learning.lucene.index-structure">
    <title>Lucene Index Structure</title>
    <para>
        To get the most out of <classname>Zend_Search_Lucene</classname>, and the best
        performance, you need to understand its internal index structure.
    </para>
    <para>
        An <emphasis>index</emphasis> is stored as a set of files within a single directory.
    </para>
    <para>
        An <emphasis>index</emphasis> consists of any number of independent
        <emphasis>segments</emphasis> which store information about a subset of indexed documents.
        Each <emphasis>segment</emphasis> has its own <emphasis>terms dictionary</emphasis>, terms
        dictionary index, and document storage (stored field values) <footnote><para>Starting with
        Lucene 2.3, document storage files can be shared between segments; however,
        <classname>Zend_Search_Lucene</classname> doesn't use this
        capability</para></footnote>. All segment data is stored in
        <filename>_xxxxx.cfs</filename> files, where <varname>xxxxx</varname> is a segment name.
    </para>
    <para>
        Once an index segment file is created, it can't be updated. New documents are added to new
        segments. Deleted documents are only marked as deleted in an optional
        <filename>&lt;segmentname&gt;.del</filename> file.
    </para>
    <para>
        Document updating is performed as separate delete and add operations, even when it is
        requested through a single <methodname>update()</methodname> API call<footnote><para>This
        call is provided only by Java Lucene now, but it's planned to extend the
        <classname>Zend_Search_Lucene</classname> API with similar
        functionality</para></footnote>. This simplifies adding new documents, and allows
        updating concurrently with search operations.
    </para>
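    <para>
        The append-only behavior described above can be illustrated with a small Python model.
        This is only a sketch under simplifying assumptions: <classname>Segment</classname>,
        <classname>ToyIndex</classname>, and their methods are hypothetical names invented for
        this example and are not part of the <classname>Zend_Search_Lucene</classname> API.
        It stores one document per segment and keeps deletion marks in memory instead of in a
        <filename>.del</filename> file.
    </para>
    <programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Toy immutable segment: its documents never change after creation."""
    name: str
    docs: tuple  # tuple of (doc_id, text) pairs


@dataclass
class ToyIndex:
    """Hypothetical in-memory model of an append-only index."""
    segments: list = field(default_factory=list)
    deleted: dict = field(default_factory=dict)  # segment name -> set of deleted doc ids

    def add(self, doc_id, text):
        # New documents always go into a brand-new segment.
        name = f"_{len(self.segments)}"
        self.segments.append(Segment(name, ((doc_id, text),)))

    def delete(self, doc_id):
        # Deleting only records a mark (the role of the .del file);
        # the segment data itself is left untouched.
        for seg in self.segments:
            if any(d == doc_id for d, _ in seg.docs):
                self.deleted.setdefault(seg.name, set()).add(doc_id)

    def update(self, doc_id, text):
        # An update is a delete followed by an add.
        self.delete(doc_id)
        self.add(doc_id, text)

    def live_docs(self):
        # Documents that are not marked as deleted, in segment order.
        return [
            (d, t)
            for seg in self.segments
            for d, t in seg.docs
            if d not in self.deleted.get(seg.name, set())
        ]


index = ToyIndex()
index.add(1, "old text")
index.add(2, "other text")
index.update(1, "new text")
print(index.live_docs())
```
]]></programlisting>
    <para>
        After the update, the first segment still holds the old version of document 1; only the
        deletion mark hides it from readers. The new version lives in a third segment.
    </para>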
    <para>
        On the other hand, splitting an index into many segments (one document per segment being
        the extreme case) increases search time:
    </para>
    <itemizedlist>
        <listitem>
            <para>
                retrieving a term from a dictionary is performed for each segment;
            </para>
        </listitem>
        <listitem>
            <para>
                the terms dictionary index is pre-loaded for each segment (this process takes the
                most search time for simple queries, and it also requires additional memory).
            </para>
        </listitem>
    </itemizedlist>
    <para>
        If the terms dictionary reaches a saturation point, then searching through one segment is
        <emphasis>N</emphasis> times faster than searching through <emphasis>N</emphasis> segments
        in most cases.
    </para>
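    <para>
        The per-segment cost can be made concrete with a toy Python example. It is a sketch, not
        a benchmark: each segment is modeled as a plain dictionary that maps a term to document
        ids, and the function names <methodname>build_segment()</methodname> and
        <methodname>search()</methodname> are invented for illustration. It only counts
        dictionary lookups and ignores the cost of pre-loading the dictionary index.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Sample documents: doc id -> text (made up for this example).
DOCS = {
    1: "lucene index",
    2: "index segment",
    3: "segment merge",
    4: "lucene merge",
}


def build_segment(doc_ids):
    """Build a toy terms dictionary (term -> list of doc ids) for one segment."""
    dictionary = {}
    for doc_id in doc_ids:
        for term in DOCS[doc_id].split():
            dictionary.setdefault(term, []).append(doc_id)
    return dictionary


def search(segments, term):
    """Look the term up in every segment's own dictionary and merge the hits."""
    hits, lookups = [], 0
    for dictionary in segments:
        lookups += 1  # one dictionary lookup per segment
        hits.extend(dictionary.get(term, []))
    return sorted(hits), lookups


many_segments = [build_segment([doc_id]) for doc_id in DOCS]  # one document per segment
single_segment = [build_segment(list(DOCS))]                  # one merged segment

print(search(many_segments, "lucene"))
print(search(single_segment, "lucene"))
```
]]></programlisting>
    <para>
        Both calls find the same documents, but the first one needs one lookup per segment while
        the second needs a single lookup.
    </para>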
    <para>
        <emphasis>Index optimization</emphasis> merges two or more segments into a single new one.
        The new segment is added to the index's segment list, and the old segments are removed
        from it.
    </para>
    <para>
        Segment list updates are performed as an atomic operation. This makes it possible to add
        new documents, optimize the index, and search through the index concurrently.
    </para>
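    <para>
        One common way to get an atomic list update is the write-then-rename pattern, sketched
        below in Python. This illustrates the general idea only and is not necessarily the
        mechanism <classname>Zend_Search_Lucene</classname> uses internally; the file name
        <filename>segments.json</filename> and both function names are assumptions made for this
        example.
    </para>
    <programlisting language="python"><![CDATA[
```python
import json
import os
import tempfile


def publish_segment_list(index_dir, segment_names):
    """Replace the segment list in one step so readers never see a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=index_dir, prefix="segments.", suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(segment_names, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # make sure the data is on disk before the rename
    # os.replace swaps the file atomically when both paths are on the same filesystem.
    os.replace(tmp_path, os.path.join(index_dir, "segments.json"))


def read_segment_list(index_dir):
    """Return the currently published list of segment names."""
    with open(os.path.join(index_dir, "segments.json")) as f:
        return json.load(f)


index_dir = tempfile.mkdtemp()
publish_segment_list(index_dir, ["_0", "_1", "_2"])  # three small segments
publish_segment_list(index_dir, ["_3"])              # after merging them into one
print(read_segment_list(index_dir))
```
]]></programlisting>
    <para>
        A reader that opens the list at any moment sees either the old list or the new one,
        never a mixture of the two.
    </para>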
    <para>
        Index auto-optimization is performed after each new segment generation. It merges sets of
        the smallest segments into larger segments, and larger segments into even larger segments,
        if we have enough segments to merge.
    </para>
    <para>
        Index auto-optimization is controlled by three options:
    </para>
    <itemizedlist>
        <listitem>
            <para>
                <varname>MaxBufferedDocs</varname> (the minimal number of documents required before
                the buffered in-memory documents are written into a new segment);
            </para>
        </listitem>
        <listitem>
            <para>
                <varname>MaxMergeDocs</varname> (the largest number of documents ever merged by
                an optimization operation); and
            </para>
        </listitem>
        <listitem>
            <para>
                <varname>MergeFactor</varname> (which determines how often segment indices are
                merged by auto-optimization operations).
            </para>
        </listitem>
    </itemizedlist>
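    <para>
        How the three options interact can be explored with a simplified simulation. The Python
        sketch below is a toy model of a level-based merge policy, not the library's actual
        algorithm: segments are represented only by their document counts, and the functions
        <methodname>add_documents()</methodname> and <methodname>auto_optimize()</methodname>
        are hypothetical. It assumes that a merge level is reached once the trailing small
        segments together hold <varname>MaxBufferedDocs</varname> times a power of
        <varname>MergeFactor</varname> documents.
    </para>
    <programlisting language="python"><![CDATA[
```python
def auto_optimize(segments, max_buffered_docs, merge_factor, max_merge_docs):
    """Merge trailing runs of small segments, one size level at a time (toy policy)."""
    target = max_buffered_docs * merge_factor
    # Levels above max_merge_docs are never attempted.
    while target <= max_merge_docs and target <= sum(segments):
        start, total = len(segments), 0
        # Collect the newest segments that are smaller than the current level.
        while start > 0 and segments[start - 1] < target:
            start -= 1
            total += segments[start]
        if total >= target:
            segments[start:] = [total]  # replace the run with one merged segment
        target *= merge_factor


def add_documents(n_docs, max_buffered_docs=10, merge_factor=10,
                  max_merge_docs=float("inf")):
    """Return the segment sizes (oldest first) after adding n_docs documents."""
    if max_buffered_docs < 1 or merge_factor < 2:
        raise ValueError("max_buffered_docs must be >= 1 and merge_factor >= 2")
    segments, buffered = [], 0
    for _ in range(n_docs):
        buffered += 1
        if buffered >= max_buffered_docs:
            segments.append(buffered)  # flush the buffer into a new segment
            buffered = 0
            auto_optimize(segments, max_buffered_docs, merge_factor, max_merge_docs)
    if buffered:
        segments.append(buffered)      # flush the remainder at the end
        auto_optimize(segments, max_buffered_docs, merge_factor, max_merge_docs)
    return segments


print(add_documents(1234))
print(add_documents(1000, max_merge_docs=500))
```
]]></programlisting>
    <para>
        In this model a larger merge factor means fewer merges but more segments on disk at any
        time, while a merge document limit stops segments from growing past a given level.
    </para>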
    <para>
        If we add one document per script execution, then <varname>MaxBufferedDocs</varname> is
        actually not used (only one new segment with only one document is created at the end of
        script execution, at which time the auto-optimization process starts).
    </para>
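    <para>
        The one-document-per-execution case can be sketched in the same spirit. The following
        Python model is an assumption-laden simplification: <methodname>run_script()</methodname>
        is a made-up function, segments are represented by their document counts, and merging is
        reduced to "combine the newest segments whenever a <varname>MergeFactor</varname> number
        of them have the same size". The buffer size plays no role because every run flushes
        exactly one document.
    </para>
    <programlisting language="python"><![CDATA[
```python
def run_script(segments, merge_factor=10):
    """Model one script execution: add one document, then auto-optimize on exit."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    segments.append(1)  # the buffer holds a single document when the script ends
    # Merge while the newest merge_factor segments all have the same size.
    while (len(segments) >= merge_factor
           and len(set(segments[-merge_factor:])) == 1):
        merged = sum(segments[-merge_factor:])
        segments[-merge_factor:] = [merged]
    return segments


segments = []
for _ in range(25):
    run_script(segments)
print(segments)
```
]]></programlisting>
    <para>
        After 25 runs the model holds two ten-document segments and five single-document ones,
        so the segment count stays small even though every execution creates a new segment.
    </para>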
</sect1>