<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>
    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>
        <para>
            <classname>Zend_Search_Lucene</classname> works with the UTF-8 charset internally. Index files store
            Unicode data in Java's "modified UTF-8 encoding". The <classname>Zend_Search_Lucene</classname> core
            fully supports this encoding, with one exception.
            <footnote>
                <para>
                    <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
                    (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
                    "supplementary characters" (characters whose code points are
                    greater than 0xFFFF).
                </para>
                <para>
                    Java 2 represents these characters as a pair of char (16-bit)
                    values, the first from the high-surrogates range (0xD800-0xDBFF),
                    the second from the low-surrogates range (0xDC00-0xDFFF). Each
                    surrogate is then encoded as an ordinary three-byte UTF-8 sequence, six bytes in total.
                    The standard UTF-8 representation uses four bytes for supplementary
                    characters.
                </para>
            </footnote>
        </para>
        <para>
            The actual encoding of input data may be specified through the <classname>Zend_Search_Lucene</classname> API.
            The data is then converted to UTF-8 automatically.
        </para>
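        <para>
            As a minimal sketch of how an input encoding can be declared, the following code passes an encoding
            name when a field is created and when a query string is parsed. It assumes that Zend Framework 1 is on the
            include path; the sample strings and the choice of ISO-8859-1 are purely illustrative.
            <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical input: a title stored as ISO-8859-1 bytes ("Strasse" with sharp s)
$title = "Stra\xDFe";

$doc = new Zend_Search_Lucene_Document();

// The optional third argument names the encoding of the field value
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'ISO-8859-1'));

// Query strings can declare their encoding globally ...
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('ISO-8859-1');

// ... or for a single call
$query = Zend_Search_Lucene_Search_QueryParser::parse("stra\xDFe", 'ISO-8859-1');
]]></programlisting>
        </para>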
    </sect2>
    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>
        <para>
            Although the core works with UTF-8, the default text analyzer (which is also used by the query parser) uses
            ctype_alpha() to tokenize text and queries.
        </para>
        <para>
            ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to 'ASCII//TRANSLIT' encoding before
            indexing. The same processing is transparently performed during query parsing.
            <footnote>
                <para>
                    Conversion to 'ASCII//TRANSLIT' may depend on the current locale and OS.
                </para>
            </footnote>
        </para>
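        <para>
            The effect of this conversion can be illustrated with plain PHP. The following sketch is a simplified
            illustration and not the analyzer's actual implementation: it transliterates a hypothetical UTF-8 string
            with iconv() and then collects runs of ctype_alpha() characters as tokens. The exact transliteration
            result may differ between platforms, so no output is shown.
            <programlisting language="php"><![CDATA[
// Hypothetical UTF-8 input ("Cafe Munchen" with accented e and u-umlaut)
$text = "Caf\xC3\xA9 M\xC3\xBCnchen";

// Transliterate to ASCII; the result depends on the iconv implementation and locale
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
if ($ascii === false) {
    // iconv() could not convert the input; fall back to an empty string
    $ascii = '';
}

// Collect runs of alphabetic ASCII characters as tokens
$tokens  = array();
$current = '';
$length  = strlen($ascii);
for ($i = 0; $i < $length; $i++) {
    if (ctype_alpha($ascii[$i])) {
        $current .= $ascii[$i];
    } elseif ($current !== '') {
        $tokens[] = $current;
        $current  = '';
    }
}
if ($current !== '') {
    $tokens[] = $current;
}

print_r($tokens);
]]></programlisting>
        </para>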
        <note>
            <title/>
            <para>
                The default analyzer does not treat numbers as parts of terms. Use the corresponding 'Num' analyzer if you
                don't want words to be broken by numbers.
            </para>
        </note>
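        <para>
            For instance, assuming the single-byte 'Num' counterpart of the default analyzer is the
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive</classname> class shipped with
            Zend Framework 1, it could be selected roughly like this:
            <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Treat digits as parts of terms, so "mp3" is kept as a single token
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
]]></programlisting>
            The same analyzer should be set both when indexing and when searching, so that text and queries are
            tokenized in the same way.
        </para>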
    </sect2>
    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>
        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible analyzers:
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>
        <para>
            Any of these analyzers can be enabled with code like this:
            <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>
        </para>
        <warning>
            <title/>
            <para>
                UTF-8 compatible analyzers were improved in ZF 1.5. Earlier versions of the analyzers assumed
                that all non-ASCII characters are letters. The new analyzer implementation behaves more accurately.
            </para>
            <para>
                You may need to rebuild the index so that data and search queries are tokenized in the same way;
                otherwise the search engine may return wrong result sets.
            </para>
        </warning>
        <para>
            All of these analyzers need the PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8
            support turned on. PCRE UTF-8 support is turned on for the PCRE library sources bundled with the PHP source
            code distribution, but if a shared library is used instead of the bundled sources, the state of UTF-8 support
            may depend on your operating system.
        </para>
        <para>
            Use the following code to check whether PCRE UTF-8 support is enabled:
            <programlisting language="php"><![CDATA[
// \pL matches any Unicode letter; the /u modifier requires PCRE UTF-8 support
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>
        </para>
        <para>
            The case-insensitive versions of the UTF-8 compatible analyzers also need the
            <ulink url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to be enabled.
        </para>
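        <para>
            One possible way to account for this requirement is to check for the extension at run time. The following
            is only a sketch: it picks the case-insensitive UTF-8 analyzer when mbstring is loaded and otherwise
            falls back to the case-sensitive one.
            <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

if (extension_loaded('mbstring')) {
    // mbstring is available, so the case-insensitive analyzer can be used
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive());
} else {
    // Fallback: case-sensitive UTF-8 analyzer, which does not need mbstring
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
}
]]></programlisting>
            An index built with one analyzer should also be searched with the same analyzer.
        </para>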
        <para>
            If you don't want to turn on the mbstring extension but need case-insensitive search, you may use the
            following approach: normalize the source data before indexing and the query string before searching by
            converting them to lowercase:
            <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

// ...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

// ...

$doc = new Zend_Search_Lucene_Document();

// Contents field for search through (indexed, unstored), lowercased
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>
            <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

// ...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

// ...

// Lowercase the query in the same way as the indexed data
$hits = $index->find(strtolower($query));
]]></programlisting>
        </para>
    </sect2>
</sect1>