Character Set

Character Set UTF-8 and single-byte character set support Zend_Search_Lucene works with the UTF-8 charset internally. Index files store unicode data in Java's "modified UTF-8 encoding". Zend_Search_Lucene core completely supports this encoding with one exception. Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support "supplementary characters" (characters whose code points are greater than 0xFFFF) Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Then they are encoded as usual UTF-8 characters in six bytes. Standard UTF-8 representation uses four bytes for supplementary characters. Actual input data encoding may be specified through Zend_Search_Lucene API. Data will be automatically converted into UTF-8 encoding. Default text analyzer However, the default text analyzer (which is also used within query parser) uses ctype_alpha() for tokenizing text and queries. ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently performed during query parsing. Conversion to 'ASCII//TRANSLIT' may depend on current locale and OS. <para> Default analyzer doesn't treats numbers as parts of terms. Use corresponding 'Num' analyzer if you don't want words to be broken by numbers. </para> </note> </sect2> <sect2 id="zend.search.lucene.charset.utf_analyzer"> <title>UTF-8 compatible text analyzers Zend_Search_Lucene also contains a set of UTF-8 compatible analyzers: Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive. Any of this analyzers can be enabled with the code like this: <para> UTF-8 compatible analyzers were improved in ZF 1.5. Early versions of analyzers assumed all non-ascii characters are letters. New analyzers implementation has more accurate behavior. </para> <para> This may need you to re-build index to have data and search queries tokenized in the same way, otherwise search engine may return wrong result sets. </para> </warning> <para> All of these analyzers need PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support turned on. PCRE UTF-8 support is turned on for the PCRE library sources bundled with PHP source code distribution, but if shared library is used instead of bundled with PHP sources, then UTF-8 support state may depend on you operating system. </para> <para> Use the following code to check, if PCRE UTF-8 support is enabled: <programlisting language="php"><![CDATA[ if (@preg_match('/\pL/u', 'a') == 1) { echo "PCRE unicode support is turned on.\n"; } else { echo "PCRE unicode support is turned off.\n"; } ]]></programlisting> </para> <para> Case insensitive versions of UTF-8 compatible analyzers also need <ulink url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to be enabled. </para> <para> If you don't want mbstring extension to be turned on, but need case insensitive search, you may use the following approach: normalize source data before indexing and query string before searching by converting them to lowercase: <programlisting language="php"><![CDATA[ // Indexing setlocale(LC_CTYPE, 'de_DE.iso-8859-1'); ... Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()); ... $doc = new Zend_Search_Lucene_Document(); $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', strtolower($contents))); // Title field for search through (indexed, unstored) $doc->addField(Zend_Search_Lucene_Field::UnStored('title', strtolower($title))); // Title field for retrieving (unindexed, stored) $doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title)); ]]></programlisting> <programlisting language="php"><![CDATA[ // Searching setlocale(LC_CTYPE, 'de_DE.iso-8859-1'); ... Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()); ... $hits = $index->find(strtolower($query)); ]]></programlisting> </para> </sect2> </sect1>