<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>

    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>

        <para>
            <classname>Zend_Search_Lucene</classname> works with the UTF-8 charset internally. Index
            files store Unicode data in Java's "modified UTF-8 encoding". The
            <classname>Zend_Search_Lucene</classname> core fully supports this encoding, with
            one exception.
            <footnote>
                <para>
                    <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
                    (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
                    "supplementary characters" (characters whose code points are
                    greater than 0xFFFF).
                </para>
                <para>
                    Java 2 represents these characters as a pair of char (16-bit)
                    values, the first from the high-surrogates range (0xD800-0xDBFF),
                    the second from the low-surrogates range (0xDC00-0xDFFF). Each
                    surrogate is then encoded like an ordinary three-byte UTF-8 character,
                    giving six bytes in total. The standard UTF-8 representation uses
                    four bytes for supplementary characters.
                </para>
            </footnote>
        </para>
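        <para>
            The difference described in the footnote can be illustrated with a short sketch. It
            splits a supplementary code point into a surrogate pair and encodes each surrogate
            as three bytes. The functions <code>encodeThreeBytes()</code> and
            <code>encodeSupplementaryModifiedUtf8()</code> are hypothetical helpers written
            for this illustration; they are not part of
            <classname>Zend_Search_Lucene</classname>.
        </para>
        <programlisting language="php"><![CDATA[
// Hypothetical helper, not part of Zend_Search_Lucene.
// Encodes one 16-bit value (0x0800-0xFFFF) as three UTF-8 style bytes.
function encodeThreeBytes($unit)
{
    return chr(0xE0 | ($unit >> 12))
         . chr(0x80 | (($unit >> 6) & 0x3F))
         . chr(0x80 | ($unit & 0x3F));
}

// Hypothetical helper: modified UTF-8 form of a code point above 0xFFFF.
function encodeSupplementaryModifiedUtf8($codePoint)
{
    $offset = $codePoint - 0x10000;
    $high   = 0xD800 | ($offset >> 10);    // high surrogate
    $low    = 0xDC00 | ($offset & 0x3FF);  // low surrogate

    return encodeThreeBytes($high) . encodeThreeBytes($low);
}

// U+1D11E (MUSICAL SYMBOL G CLEF) is F0 9D 84 9E in standard UTF-8.
$modified = encodeSupplementaryModifiedUtf8(0x1D11E);
echo strlen($modified), " bytes: ", bin2hex($modified), "\n";
// prints "6 bytes: eda0b4edb49e"
]]></programlisting>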
        <para>
            The actual input data encoding may be specified through the
            <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. Data will be
            automatically converted into UTF-8 encoding.
        </para>
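        <para>
            The conversion itself can be pictured with <code>iconv()</code>. The following is a
            minimal sketch of an equivalent conversion, not the component's internal code, and
            the sample string is made up for the illustration. In the
            <acronym>API</acronym>, the source encoding is passed, for example, as the optional
            encoding argument of the <classname>Zend_Search_Lucene_Field</classname> factory
            methods.
        </para>
        <programlisting language="php"><![CDATA[
// "Müller" as ISO-8859-1 bytes: the u-umlaut is the single byte 0xFC.
$latin1 = "M\xFCller";

// Convert to UTF-8, the encoding used internally by the index.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), "\n"; // prints "4dfc6c6c6572"
echo bin2hex($utf8), "\n";   // prints "4dc3bc6c6c6572"
]]></programlisting>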
    </sect2>

    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>

        <para>
            However, the default text analyzer (which is also used within the query parser) uses
            ctype_alpha() for tokenizing text and queries.
        </para>
        <para>
            ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to
            'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently
            performed during query parsing.
            <footnote>
                <para>
                    Conversion to 'ASCII//TRANSLIT' may depend on the current locale and OS.
                </para>
            </footnote>
        </para>
        <note>
            <title/>
            <para>
                The default analyzer doesn't treat numbers as parts of terms. Use the
                corresponding 'Num' analyzer if you don't want words to be broken by numbers.
            </para>
        </note>
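        <para>
            The behaviour described in this section can be approximated in a few lines. This is
            only a sketch under the assumptions stated in the text (transliteration to ASCII,
            then splitting on everything that ctype_alpha() rejects);
            <code>sketchDefaultTokenize()</code> is a hypothetical function and not the
            analyzer's real implementation.
        </para>
        <programlisting language="php"><![CDATA[
// Hypothetical helper that mimics the behaviour described above.
function sketchDefaultTokenize($text, $encoding = 'UTF-8')
{
    // Transliterate to ASCII first; the result depends on locale and OS.
    $ascii = @iconv($encoding, 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        $ascii = $text; // fall back to the raw input if conversion fails
    }

    $tokens  = array();
    $current = '';
    $length  = strlen($ascii);
    for ($i = 0; $i < $length; $i++) {
        if (ctype_alpha($ascii[$i])) {
            $current .= $ascii[$i];   // letters extend the current token
        } elseif ($current !== '') {
            $tokens[] = $current;     // any other byte ends the token
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

// Digits are not letters, so "utf8" and "php5" lose their numbers:
// the tokens are "utf", "support", "in" and "php".
print_r(sketchDefaultTokenize('utf8 support in php5'));
]]></programlisting>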
    </sect2>

    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>

        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
            analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>
        <para>
            Any of these analyzers can be enabled with code like this:
        </para>
        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>
        <warning>
            <title/>
            <para>
                UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions
                of the analyzers assumed that all non-ASCII characters are letters. The new
                analyzer implementation behaves more accurately.
            </para>
            <para>
                This may require you to rebuild the index so that data and search queries are
                tokenized in the same way; otherwise the search engine may return wrong result
                sets.
            </para>
        </warning>
        <para>
            All of these analyzers need the PCRE (Perl-compatible regular expressions) library
            to be compiled with UTF-8 support turned on. PCRE UTF-8 support is turned on for the
            PCRE library sources bundled with the <acronym>PHP</acronym> source code
            distribution, but if a shared library is used instead of the one bundled with the
            <acronym>PHP</acronym> sources, then the UTF-8 support state may depend on your
            operating system.
        </para>
        <para>
            Use the following code to check whether PCRE UTF-8 support is enabled:
        </para>
        <programlisting language="php"><![CDATA[
// \pL (any Unicode letter) with the /u modifier only matches
// if PCRE was compiled with UTF-8 support
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>
        <para>
            The case-insensitive versions of the UTF-8 compatible analyzers also need the <ulink
            url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
            be enabled.
        </para>
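        <para>
            Whether the case-insensitive variant is usable can be checked at run time. The
            sketch below picks an analyzer class name depending on the presence of mbstring;
            <code>chooseUtf8AnalyzerClass()</code> is a hypothetical helper, not part of
            <classname>Zend_Search_Lucene</classname>.
        </para>
        <programlisting language="php"><![CDATA[
// Hypothetical helper: returns the name of a UTF-8 analyzer class
// that can be used with the extensions available at run time.
function chooseUtf8AnalyzerClass()
{
    if (extension_loaded('mbstring')) {
        return 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive';
    }
    // Without mbstring, fall back to the case-sensitive analyzer.
    return 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8';
}

$analyzerClass = chooseUtf8AnalyzerClass();
echo $analyzerClass, "\n";

// With Zend Framework loaded, the analyzer would then be enabled with:
// Zend_Search_Lucene_Analysis_Analyzer::setDefault(new $analyzerClass());
]]></programlisting>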
        <para>
            If you don't want the mbstring extension to be turned on, but need case-insensitive
            search, you may use the following approach: normalize the source data before
            indexing and the query string before searching by converting them to lowercase:
        </para>
        <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
...
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
...
$doc = new Zend_Search_Lucene_Document();
// Contents field, lowercased before indexing (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));
// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));
// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>
        <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
...
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
...
$hits = $index->find(strtolower($query));
]]></programlisting>
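        <para>
            Since indexed data and queries must be normalized in exactly the same way, it may
            help to route both through one function. The following is a sketch of that idea;
            <code>normalizeForSearch()</code> is a hypothetical helper name.
        </para>
        <programlisting language="php"><![CDATA[
// Hypothetical helper shared by the indexing and the searching code, so
// that documents and queries always get the same normalization.
function normalizeForSearch($text)
{
    return strtolower($text);
}

// Index time: pass normalizeForSearch($contents) to the field factory.
// Query time: pass normalizeForSearch($query) to $index->find().
echo normalizeForSearch('Zend Framework'), "\n"; // prints "zend framework"
]]></programlisting>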
    </sect2>
</sect1>