- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.charset">
- <title>Character Set</title>
- <sect2 id="zend.search.lucene.charset.description">
- <title>UTF-8 and single-byte character set support</title>
- <para>
- <classname>Zend_Search_Lucene</classname> uses the UTF-8 character set internally. Index
- files store Unicode data in Java's "modified UTF-8" encoding. The
- <classname>Zend_Search_Lucene</classname> core fully supports this encoding, with
- one exception.
- <footnote>
- <para>
- <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
- (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
- "supplementary characters" (characters whose code points are
- greater than 0xFFFF).
- </para>
- <para>
- Java 2 represents these characters as a pair of char (16-bit)
- values, the first from the high-surrogates range (0xD800-0xDBFF) and
- the second from the low-surrogates range (0xDC00-0xDFFF). Each
- surrogate is then encoded as an ordinary three-byte UTF-8 sequence, six
- bytes in total. The standard UTF-8 representation uses four bytes for
- supplementary characters.
- </para>
- </footnote>
- </para>
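- <para>
- The difference in length described in the footnote can be reproduced with a short
- sketch. It is not part of <classname>Zend_Search_Lucene</classname>; the helper names
- encodeThreeBytes(), toModifiedUtf8() and toStandardUtf8() are hypothetical, and the
- code point U+1D11E is only an example of a supplementary character.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: encode one 16-bit value (0x0800-0xFFFF) as three UTF-8 bytes.
- function encodeThreeBytes($unit)
- {
-     return chr(0xE0 | ($unit >> 12))
-          . chr(0x80 | (($unit >> 6) & 0x3F))
-          . chr(0x80 | ($unit & 0x3F));
- }
- // Hypothetical helper: Java-style encoding of a supplementary code point
- // as a surrogate pair, with each surrogate encoded in three bytes.
- function toModifiedUtf8($codePoint)
- {
-     $offset = $codePoint - 0x10000;
-     $high   = 0xD800 | ($offset >> 10);   // high surrogate
-     $low    = 0xDC00 | ($offset & 0x3FF); // low surrogate
-     return encodeThreeBytes($high) . encodeThreeBytes($low);
- }
- // Hypothetical helper: standard four-byte UTF-8 encoding of a supplementary code point.
- function toStandardUtf8($codePoint)
- {
-     return chr(0xF0 | ($codePoint >> 18))
-          . chr(0x80 | (($codePoint >> 12) & 0x3F))
-          . chr(0x80 | (($codePoint >> 6) & 0x3F))
-          . chr(0x80 | ($codePoint & 0x3F));
- }
- $codePoint = 0x1D11E; // example supplementary character
- echo bin2hex(toStandardUtf8($codePoint)), "\n"; // f09d849e (4 bytes)
- echo bin2hex(toModifiedUtf8($codePoint)), "\n"; // eda0b4edb49e (6 bytes)
- ]]></programlisting>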
- <para>
- The actual encoding of the input data may be specified through the
- <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. The data is then
- automatically converted to UTF-8.
- </para>
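- <para>
- What such a conversion involves can be pictured with plain iconv(). The sketch below
- doesn't call <classname>Zend_Search_Lucene</classname> at all; the sample string and the
- assumption that the input is ISO-8859-1 are chosen only for illustration.
- </para>
- <programlisting language="php"><![CDATA[
- // Assumption: the input is ISO-8859-1, where "\xFC" is the single byte for "ü".
- $latin1 = "M\xFCnchen";
- // Convert the input to UTF-8, where "ü" becomes the two bytes 0xC3 0xBC.
- $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
- echo strlen($latin1), "\n"; // 7
- echo strlen($utf8), "\n";   // 8
- ]]></programlisting>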
- </sect2>
- <sect2 id="zend.search.lucene.charset.default_analyzer">
- <title>Default text analyzer</title>
- <para>
- However, the default text analyzer (which is also used by the query parser) uses
- ctype_alpha() to tokenize text and queries.
- </para>
- <para>
- ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to
- 'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently
- performed during query parsing.
- <footnote>
- <para>
- The result of the conversion to 'ASCII//TRANSLIT' may depend on the current locale and
- operating system.
- </para>
- </footnote>
- </para>
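- <para>
- The limitation can be observed directly in plain <acronym>PHP</acronym>. The following
- sketch is independent of <classname>Zend_Search_Lucene</classname>; the sample strings
- are illustrative, and the "C" locale is set explicitly only to make the ctype_alpha()
- results predictable. No transliteration output is shown because it varies by platform.
- </para>
- <programlisting language="php"><![CDATA[
- // Use the "C" locale so that ctype_alpha() only accepts ASCII letters here.
- setlocale(LC_CTYPE, 'C');
- $ascii = 'cafe';
- $utf8  = "caf\xC3\xA9"; // "café" encoded in UTF-8
- var_dump(ctype_alpha($ascii)); // bool(true)
- var_dump(ctype_alpha($utf8));  // bool(false): the two bytes of "é" are not ASCII letters
- // Transliterate to ASCII. The replacement chosen for "é" depends on the
- // locale and the iconv implementation, and some platforms reject //TRANSLIT.
- $translit = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
- if ($translit === false) {
-     echo "ASCII//TRANSLIT conversion failed on this platform.\n";
- } else {
-     echo $translit, "\n";
- }
- ]]></programlisting>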
- <note>
- <title/>
- <para>
- The default analyzer doesn't treat numbers as parts of terms. Use the corresponding 'Num'
- analyzer if you don't want words to be broken up by numbers.
- </para>
- </note>
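- <para>
- The effect described in the note can be approximated with plain regular expressions.
- This is only a sketch of the behavior, not the analyzers' actual implementation; the
- function names tokenizeLetters() and tokenizeLettersAndDigits() are hypothetical.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: split text into runs of ASCII letters only.
- function tokenizeLetters($text)
- {
-     preg_match_all('/[a-zA-Z]+/', $text, $matches);
-     return $matches[0];
- }
- // Hypothetical helper: split text into runs of ASCII letters and digits.
- function tokenizeLettersAndDigits($text)
- {
-     preg_match_all('/[a-zA-Z0-9]+/', $text, $matches);
-     return $matches[0];
- }
- $text = 'release2beta notes';
- print_r(tokenizeLetters($text));          // release, beta, notes
- print_r(tokenizeLettersAndDigits($text)); // release2beta, notes
- ]]></programlisting>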
- </sect2>
- <sect2 id="zend.search.lucene.charset.utf_analyzer">
- <title>UTF-8 compatible text analyzers</title>
- <para>
- <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
- analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
- <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
- <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
- <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
- </para>
- <para>
- Any of these analyzers can be enabled with code like this:
- </para>
- <programlisting language="php"><![CDATA[
- Zend_Search_Lucene_Analysis_Analyzer::setDefault(
-     new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
- ]]></programlisting>
- <warning>
- <title/>
- <para>
- The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions of
- the analyzers assumed that all non-ASCII characters are letters. The new implementation
- behaves more accurately.
- </para>
- <para>
- You may therefore need to rebuild the index so that data and search queries are tokenized
- in the same way; otherwise the search engine may return wrong result sets.
- </para>
- </warning>
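- <para>
- The kind of difference the warning refers to can be sketched with two regular
- expressions. Neither pattern is taken from the <classname>Zend_Search_Lucene</classname>
- sources: the first merely mimics the assumption that every non-ASCII byte belongs to a
- letter, the second uses the Unicode letter class \p{L} and requires PCRE UTF-8 support.
- The sample text is illustrative.
- </para>
- <programlisting language="php"><![CDATA[
- // "café»bar": the guillemet (U+00BB) is punctuation, not a letter.
- $text = "caf\xC3\xA9\xC2\xBBbar";
- // Assumed "old style" rule: any non-ASCII byte counts as part of a word.
- preg_match_all('/[a-zA-Z\x80-\xFF]+/', $text, $old);
- // Assumed "new style" rule: only Unicode letters form a word.
- preg_match_all('/\p{L}+/u', $text, $new);
- echo count($old[0]), "\n"; // 1 token: the whole string
- echo count($new[0]), "\n"; // 2 tokens: "café" and "bar"
- ]]></programlisting>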
- <para>
- All of these analyzers require the PCRE (Perl-compatible regular expressions) library to
- be compiled with UTF-8 support turned on. UTF-8 support is turned on in the PCRE
- library sources bundled with the <acronym>PHP</acronym> source code distribution, but if
- a shared library is used instead of the one bundled with the <acronym>PHP</acronym>
- sources, then the state of UTF-8 support may depend on your operating system.
- </para>
- <para>
- Use the following code to check whether PCRE UTF-8 support is enabled:
- </para>
- <programlisting language="php"><![CDATA[
- if (@preg_match('/\pL/u', 'a') == 1) {
-     echo "PCRE unicode support is turned on.\n";
- } else {
-     echo "PCRE unicode support is turned off.\n";
- }
- ]]></programlisting>
- <para>
- The case-insensitive versions of the UTF-8 compatible analyzers also require the <ulink
- url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
- be enabled.
- </para>
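- <para>
- In the same spirit as the PCRE check, a short test can report whether the mbstring
- extension is available. This is a minimal sketch using extension_loaded(); the sample
- word is illustrative.
- </para>
- <programlisting language="php"><![CDATA[
- if (extension_loaded('mbstring')) {
-     echo "mbstring extension is enabled.\n";
-     // mb_strtolower() is multibyte-aware: "ÄPFEL" in UTF-8 becomes "äpfel".
-     echo mb_strtolower("\xC3\x84PFEL", 'UTF-8'), "\n";
- } else {
-     echo "mbstring extension is not enabled.\n";
- }
- ]]></programlisting>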
- <para>
- If you don't want to turn on the mbstring extension but need case-insensitive search,
- you may use the following approach: normalize the source data before indexing and the
- query string before searching by converting both to lowercase:
- </para>
- <programlisting language="php"><![CDATA[
- // Indexing
- setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
- // ...
- Zend_Search_Lucene_Analysis_Analyzer::setDefault(
-     new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
- // ...
- $doc = new Zend_Search_Lucene_Document();
- // Contents field for search through (indexed, unstored)
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
-     strtolower($contents)));
- // Title field for search through (indexed, unstored)
- $doc->addField(Zend_Search_Lucene_Field::UnStored('title',
-     strtolower($title)));
- // Title field for retrieving (unindexed, stored)
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
- ]]></programlisting>
- <programlisting language="php"><![CDATA[
- // Searching
- setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
- // ...
- Zend_Search_Lucene_Analysis_Analyzer::setDefault(
-     new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
- // ...
- // Lowercase the query in the same way as the indexed data
- $hits = $index->find(strtolower($query));
- ]]></programlisting>
- </sect2>
- </sect1>