<?xml version="1.0" encoding="UTF-8"?>
<!-- EN-Revision: 19766 -->
<!-- Reviewed: no -->
<sect1 id="learning.lucene.intro">
<title>Zend_Search_Lucene Introduction</title>
<para>
The <classname>Zend_Search_Lucene</classname> component is intended to provide a
ready-to-use full-text search solution. It doesn't require any <acronym>PHP</acronym>
extensions<footnote><para>Some <acronym>UTF-8</acronym> processing functionality does
require the <emphasis>mbstring</emphasis> extension to be
enabled.</para></footnote> or additional software to be installed, and can be used
immediately after Zend Framework installation.
</para>
<para>
<classname>Zend_Search_Lucene</classname> is a pure <acronym>PHP</acronym> port of the
popular open source full-text search engine known as Apache Lucene. See <ulink
url="http://lucene.apache.org">http://lucene.apache.org/</ulink> for details.
</para>
<para>
Information must be indexed before it can be searched.
<classname>Zend_Search_Lucene</classname> and Java Lucene both use the concept of a
document, which is the "atomic indexing item".
</para>
<para>
Each document is a set of fields: &lt;name, value&gt; pairs where name and value are
<acronym>UTF-8</acronym> strings<footnote><para>Binary strings may also be used
as field values.</para></footnote>. Any subset of the document fields may be marked
as "indexed" to include their data in the text indexing process.
</para>
<para>
Field values may or may not be tokenized while indexing. If a field is not tokenized, then
the whole field value is indexed as a single term; otherwise, the current analyzer is used
for tokenization.
</para>
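<para>
The interplay of the "indexed" and "tokenized" attributes can be sketched with a small
model. This is a hedged illustration in Python, not the
<classname>Zend_Search_Lucene</classname> <acronym>API</acronym>: the
<code>Field</code> class, the <code>analyze()</code> stand-in, and the sample field names
are all assumptions made for the example.
</para>
<programlisting language="python"><![CDATA[
```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical field model: a name/value pair plus indexing flags."""
    name: str
    value: str
    indexed: bool = True     # include the value in the text index
    tokenized: bool = True   # split the value into terms via the analyzer


def analyze(text):
    """Stand-in analyzer: lowercased runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def terms_for(field):
    """Return the terms a field contributes to the index."""
    if not field.indexed:
        return []                 # skipped by the indexing process
    if not field.tokenized:
        return [field.value]      # the whole value becomes a single term
    return analyze(field.value)   # the analyzer splits the value into terms


doc = [
    Field("url", "/docs/Intro.html", tokenized=False),
    Field("title", "Zend Search Intro"),
    Field("thumbnail", "intro.png", indexed=False),
]
index_terms = {f.name: terms_for(f) for f in doc}
```
]]></programlisting>
<para>
In this sketch the untokenized <code>url</code> field yields exactly one term, the
<code>title</code> field yields one term per word, and the unindexed
<code>thumbnail</code> field yields none.
</para>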
<para>
Several analyzers are provided within the <classname>Zend_Search_Lucene</classname> package.
The default analyzer works with <acronym>ASCII</acronym> text (since the
<acronym>UTF-8</acronym> analyzer needs the <emphasis>mbstring</emphasis> extension to be
enabled). It is case-insensitive and skips numbers. Use another analyzer or create your
own if you need to change this behavior.
</para>
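<para>
To make "case-insensitive and skips numbers" concrete, the following toy Python functions
imitate that behavior and contrast it with an analyzer that keeps digits. Both functions
are assumptions written for this illustration; they are not the analyzers shipped with
<classname>Zend_Search_Lucene</classname> and may differ from them in detail.
</para>
<programlisting language="python"><![CDATA[
```python
import re


def ascii_text_analyzer(text):
    """Toy default-like analyzer: ASCII letters only, lowercased."""
    return re.findall(r"[a-z]+", text.lower())


def ascii_textnum_analyzer(text):
    """Toy alternative analyzer: digits are kept as terms too."""
    return re.findall(r"[a-z0-9]+", text.lower())


sample = "PHP 5 ships with Zend Framework 1.12"
letters_only = ascii_text_analyzer(sample)
with_numbers = ascii_textnum_analyzer(sample)
```
]]></programlisting>
<para>
With the letters-only function, a query for a version number can never match, because no
numeric term reaches the index. That is the kind of behavior a different analyzer changes.
</para>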
<note>
<title>Using analyzers during indexing and searching</title>
<para>
Search queries are also tokenized using the "current analyzer", so the
same analyzer must be set as the default for both indexing and searching.
This guarantees that the source text and the search text are transformed into terms in the
same way.
</para>
</note>
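<para>
A minimal sketch of why the analyzers must match, assuming a toy inverted index written in
Python. The functions <code>build_index()</code> and <code>search()</code> and both
analyzers are hypothetical and exist only for this example.
</para>
<programlisting language="python"><![CDATA[
```python
import re


def case_insensitive(text):
    """Toy analyzer that lowercases every term."""
    return re.findall(r"[a-z]+", text.lower())


def case_sensitive(text):
    """Toy analyzer that keeps the original letter case."""
    return re.findall(r"[A-Za-z]+", text)


def build_index(docs, analyzer):
    """Map each term to the set of document positions containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in analyzer(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    """Return the documents that contain every term of the analyzed query."""
    postings = [index.get(term, set()) for term in analyzer(query)]
    return set.intersection(*postings) if postings else set()


docs = ["Zend Framework search", "Apache Lucene"]
index = build_index(docs, case_insensitive)

same = search(index, "LUCENE", case_insensitive)      # finds document 1
mismatch = search(index, "LUCENE", case_sensitive)    # finds nothing
```
]]></programlisting>
<para>
The index holds the term <code>lucene</code>, so a query analyzed into
<code>LUCENE</code> finds nothing even though the document is present.
</para>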
<para>
Field values are optionally stored within an index. This allows the original field data to
be retrieved from the index while searching. Stored fields are the only reliable way to
associate search results with the original data, because internal document IDs may change
after index optimization or auto-optimization.
</para>
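<para>
The following Python sketch shows why a stored identifier is safer than an internal
document ID. It assumes a deliberately simplified model in which an ID is a list position
and optimization compacts the list; the real index internals differ, and the
<code>ToyIndex</code> class is hypothetical.
</para>
<programlisting language="python"><![CDATA[
```python
class ToyIndex:
    """Toy index: a document's internal ID is its position in a list."""

    def __init__(self):
        self.docs = []          # stored fields per document; None = deleted

    def add(self, stored_fields):
        self.docs.append(dict(stored_fields))
        return len(self.docs) - 1

    def delete(self, doc_id):
        self.docs[doc_id] = None

    def optimize(self):
        """Drop deleted entries; surviving documents are renumbered."""
        self.docs = [d for d in self.docs if d is not None]

    def stored(self, doc_id):
        return self.docs[doc_id]


index = ToyIndex()
index.add({"url": "/a.html"})
id_b = index.add({"url": "/b.html"})   # internal ID 1 before optimization
index.delete(0)
index.optimize()

# The old internal ID of /b.html is no longer valid after optimize(),
# but the stored "url" field still identifies the original document.
new_id_b = 0
url_after = index.stored(new_id_b)["url"]
```
]]></programlisting>
<para>
An application that had remembered <code>id_b</code> would now point past the end of the
index, while the stored <code>url</code> value still leads back to the source document.
</para>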
<para>
Keep in mind that a Lucene index is not a database. It doesn't
provide index backup mechanisms other than backing up the file system directory. It doesn't
provide transactional mechanisms, although concurrent index updates as well as concurrent
updates and reads are supported. Its data retrieval speed does not compare with that of a
database.
</para>
<para>
It is therefore a good idea:
</para>
<itemizedlist>
<listitem>
<para>
<emphasis>Not</emphasis> to use a Lucene index as storage, since doing so may dramatically
decrease search hit retrieval performance. Store only unique document identifiers
(document paths, <acronym>URL</acronym>s, unique database IDs) and associated data within
an index, e.g. title, annotation, category, language info, avatar. (Note: a field
may be indexed but not stored, or stored but not indexed.)
</para>
</listitem>
<listitem>
<para>
To write functionality that can rebuild the index completely if it becomes corrupted for
any reason.
</para>
</listitem>
</itemizedlist>
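<para>
Both recommendations can be sketched together, assuming the primary data lives elsewhere
(here a hypothetical <code>ARTICLES</code> dictionary standing in for a database table).
The toy Python index below stores only an identifier and a title per document, and it can
be rebuilt from scratch at any time. None of these names belong to
<classname>Zend_Search_Lucene</classname>.
</para>
<programlisting language="python"><![CDATA[
```python
import re

# Hypothetical source of truth, e.g. rows from a database table.
ARTICLES = {
    101: {"title": "Intro to Lucene", "body": "Indexing and searching text"},
    102: {"title": "Zend Framework", "body": "Components for PHP apps"},
}


def analyze(text):
    """Stand-in analyzer: lowercased runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def rebuild_index(articles):
    """Recreate the whole index from the primary data store."""
    postings = {}   # term -> set of article IDs
    stored = {}     # article ID -> small display data kept for showing hits
    for article_id, row in articles.items():
        # Title and body are both indexed, but the body is never stored.
        for term in analyze(row["title"] + " " + row["body"]):
            postings.setdefault(term, set()).add(article_id)
        stored[article_id] = {"title": row["title"]}
    return postings, stored


def find(postings, stored, word):
    """Return (id, title) pairs; full bodies stay in the primary store."""
    ids = postings.get(word.lower(), ())
    return sorted((i, stored[i]["title"]) for i in ids)


postings, stored = rebuild_index(ARTICLES)
hits = find(postings, stored, "lucene")
```
]]></programlisting>
<para>
A hit carries just enough data to render a result list. The full record is fetched from
the primary store by its identifier, and calling <code>rebuild_index()</code> again
recovers from a damaged index.
</para>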
<para>
Individual documents in the index may have completely different sets of fields. The same
field in different documents doesn't need to have the same attributes. For example, a field
may be indexed for one document and not indexed for another. The same applies to
storing, tokenizing, or treating a field value as a binary string.
</para>
</sect1>