<?xml version="1.0" encoding="UTF-8"?>
<!-- EN-Revision: 20854 -->
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
<title>Character Set</title>
<sect2 id="zend.search.lucene.charset.description">
<title>UTF-8 and single-byte character set support</title>
<para>
<classname>Zend_Search_Lucene</classname> works with the UTF-8 character set
internally. Index files store Unicode data in Java's "modified UTF-8" encoding.
The <classname>Zend_Search_Lucene</classname> core fully supports this encoding,
with one exception.
<footnote>
<para>
<classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
(BMP) characters (from 0x0000 to 0xFFFF) and doesn't support supplementary
characters (characters above 0xFFFF).
</para>
<para>
Java 2 represents these characters as a pair of char (16-bit) values, the first
from the high-surrogates range (0xD800-0xDBFF), the second from the
low-surrogates range (0xDC00-0xDFFF). They are then encoded as usual UTF-8
characters in six bytes. The standard UTF-8 representation uses four bytes for
supplementary characters.
</para>
</footnote>
</para>
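<para>
The byte counts mentioned in the footnote can be checked with a small, self-contained
sketch. The helper names <code>sketchUtf8Encode()</code> and
<code>sketchModifiedUtf8Encode()</code> are hypothetical and are not part of
<classname>Zend_Search_Lucene</classname>; the code only illustrates how a supplementary
code point ends up as four bytes in standard UTF-8 and as six bytes when it is written
as a surrogate pair.
<programlisting language="php"><![CDATA[
// Hypothetical helper: encode one code point as standard UTF-8.
function sketchUtf8Encode($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: encode a supplementary code point as a surrogate
// pair, three bytes per surrogate, as described in the footnote.
function sketchModifiedUtf8Encode($cp)
{
    if ($cp < 0x10000) {
        // Modified UTF-8 also treats U+0000 specially; ignored in this sketch.
        return sketchUtf8Encode($cp);
    }
    $v    = $cp - 0x10000;
    $high = 0xD800 | ($v >> 10);    // high surrogate, 0xD800-0xDBFF
    $low  = 0xDC00 | ($v & 0x3FF);  // low surrogate, 0xDC00-0xDFFF
    return sketchUtf8Encode($high) . sketchUtf8Encode($low);
}

$clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, a character outside the BMP

echo strlen(sketchUtf8Encode($clef)), "\n";          // 4
echo strlen(sketchModifiedUtf8Encode($clef)), "\n";  // 6
echo bin2hex(sketchModifiedUtf8Encode($clef)), "\n"; // eda0b4edb49e
]]></programlisting>
</para>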
<para>
The encoding of the input data can be specified through the
<acronym>API</acronym> of <classname>Zend_Search_Lucene</classname>. The data
will be converted automatically into UTF-8 encoding.
</para>
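<para>
The effect of such a conversion can be illustrated with plain
<code>iconv()</code>. This is only a sketch of the idea, assuming ISO-8859-1 input;
it is not the component's internal code, and the variable names are illustrative.
<programlisting language="php"><![CDATA[
// "café" in ISO-8859-1: the "é" is the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert to UTF-8, where "é" becomes the two bytes 0xC3 0xA9.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo strlen($latin1), "\n"; // 4
echo strlen($utf8), "\n";   // 5
echo bin2hex($utf8), "\n";  // 636166c3a9
]]></programlisting>
</para>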
</sect2>
<sect2 id="zend.search.lucene.charset.default_analyzer">
<title>Default text analyzer</title>
<para>
However, the default text analyzer (which is also used within the query parser) uses
ctype_alpha() for tokenizing text and queries.
</para>
<para>
ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to
'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently
performed during query parsing.
<footnote>
<para>
Conversion to 'ASCII//TRANSLIT' may depend on the current locale and OS.
</para>
</footnote>
</para>
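<para>
The two steps described above (transliteration to ASCII, then splitting on
ctype_alpha()) can be approximated as follows. The function
<code>sketchAsciiTokenize()</code> is a hypothetical helper written for this
illustration, not the analyzer's actual implementation, and the transliteration result
for non-ASCII input may differ between systems.
<programlisting language="php"><![CDATA[
// Hypothetical helper: transliterate to ASCII, then collect maximal runs
// of ctype_alpha() characters as tokens.
function sketchAsciiTokenize($text)
{
    // The output of //TRANSLIT depends on the locale and iconv implementation.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        $ascii = $text; // conversion unavailable: fall back to the raw bytes
    }

    $tokens  = array();
    $current = '';
    $length  = strlen($ascii);
    for ($i = 0; $i < $length; $i++) {
        if (ctype_alpha($ascii[$i])) {
            $current .= $ascii[$i];
        } elseif ($current !== '') {
            // A non-letter byte closes the current token.
            $tokens[] = $current;
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }
    return $tokens;
}

print_r(sketchAsciiTokenize('Zend Search, Lucene!'));
]]></programlisting>
For plain ASCII input such as the string above, the tokens are simply the runs of
letters; punctuation and whitespace act as separators.
</para>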
<note>
<title/>
<para>
The default analyzer doesn't treat numbers as parts of terms. Use the corresponding 'Num'
analyzer if you don't want words to be broken by numbers.
</para>
</note>
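<para>
The difference can be pictured with two regular expressions. They are stand-ins chosen
for this illustration, assuming ASCII input, and are not the patterns used by the
analyzers themselves.
<programlisting language="php"><![CDATA[
$text = 'utf8 support in php5';

// Letters only: digits act as separators and are dropped.
preg_match_all('/[a-zA-Z]+/', $text, $lettersOnly);

// Letters and digits: numbers stay attached to the surrounding word.
preg_match_all('/[a-zA-Z0-9]+/', $text, $withDigits);

echo implode(', ', $lettersOnly[0]), "\n"; // utf, support, in, php
echo implode(', ', $withDigits[0]), "\n";  // utf8, support, in, php5
]]></programlisting>
</para>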
</sect2>
<sect2 id="zend.search.lucene.charset.utf_analyzer">
<title>UTF-8 compatible text analyzers</title>
<para>
<classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
<classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
<classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
<classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
</para>
<para>
Any of these analyzers can be enabled with code like this:
<programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>
</para>
<warning>
<title/>
<para>
UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions of
the analyzers assumed that all non-ASCII characters are letters. The new analyzer
implementation behaves more accurately.
</para>
<para>
This may require you to rebuild the index so that data and search queries are tokenized
in the same way; otherwise the search engine may return wrong result sets.
</para>
</warning>
<para>
All of these analyzers need the PCRE (Perl-compatible regular expressions) library to be
compiled with UTF-8 support turned on. PCRE UTF-8 support is turned on for the PCRE
library sources bundled with the <acronym>PHP</acronym> source code distribution, but if
a shared library is used instead of the one bundled with the <acronym>PHP</acronym>
sources, the UTF-8 support state may depend on your operating system.
</para>
<para>
Use the following code to check whether PCRE UTF-8 support is enabled:
<programlisting language="php"><![CDATA[
// The "u" modifier and the \pL (any letter) property need PCRE UTF-8 support;
// errors are suppressed so an unsupported pattern simply fails the test.
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>
</para>
<para>
Case-insensitive versions of the UTF-8 compatible analyzers also need the <ulink
url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
be enabled.
</para>
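<para>
Whether mbstring is available can be tested with <code>extension_loaded()</code>. The
sketch below uses a hypothetical helper, <code>sketchPickUtf8Analyzer()</code>, to choose
between two of the analyzer class names listed earlier in this section; it is one
possible way to handle the dependency, not a recommended or built-in mechanism.
<programlisting language="php"><![CDATA[
// Hypothetical helper: return the name of a UTF-8 analyzer class that
// matches the availability of the mbstring extension.
function sketchPickUtf8Analyzer($hasMbstring)
{
    return $hasMbstring
        ? 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive'
        : 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8';
}

$analyzerClass = sketchPickUtf8Analyzer(extension_loaded('mbstring'));
echo "Selected analyzer: $analyzerClass\n";

// With Zend Framework loaded, the selected class could then be passed to
// Zend_Search_Lucene_Analysis_Analyzer::setDefault(new $analyzerClass());
]]></programlisting>
</para>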
<para>
If you don't want the mbstring extension to be turned on but need case-insensitive
search, you may use the following approach: normalize the source data before indexing
and the query string before searching by converting them to lowercase:
<programlisting language="php"><![CDATA[
// Indexing
// strtolower() follows the LC_CTYPE locale, so set it before normalizing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
// ...
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
// ...
$doc = new Zend_Search_Lucene_Document();
// Contents field for searching (indexed, unstored), lowercased
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));
// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));
// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>
<programlisting language="php"><![CDATA[
// Searching
// Use the same locale and analyzer as at indexing time
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
// ...
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
// ...
// Lowercase the query so it matches the lowercased indexed terms
$hits = $index->find(strtolower($query));
]]></programlisting>
</para>
</sect2>
</sect1>