|
|
@@ -11,14 +11,17 @@
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Nevertheless it's a good idea not to use '<emphasis>id</emphasis>' and '<emphasis>score</emphasis>' names
|
|
|
- to avoid ambiguity in <code>QueryHit</code> properties names.
|
|
|
+ Nevertheless it's a good idea not to use '<emphasis>id</emphasis>' and
|
|
|
+ '<emphasis>score</emphasis>' names to avoid ambiguity in <code>QueryHit</code>
|
|
|
+ properties names.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <code>id</code> and <code>score</code> properties always refer
|
|
|
- to internal Lucene document id and hit <link linkend="zend.search.lucene.searching.results-scoring">score</link>.
|
|
|
- If the indexed document has the same stored fields, you have to use the <methodname>getDocument()</methodname> method to access them:
|
|
|
+ The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <code>id</code> and
|
|
|
+ <code>score</code> properties always refer to internal Lucene document id and hit <link
|
|
|
+ linkend="zend.search.lucene.searching.results-scoring">score</link>. If the indexed
|
|
|
+ document has the same stored fields, you have to use the
|
|
|
+ <methodname>getDocument()</methodname> method to access them:
|
|
|
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
$hits = $index->find($query);
|
|
|
@@ -53,7 +56,8 @@ foreach ($hits as $hit) {
|
|
|
<title>Indexing performance</title>
|
|
|
|
|
|
<para>
|
|
|
- Indexing performance is a compromise between used resources, indexing time and index quality.
|
|
|
+ Indexing performance is a compromise between used resources, indexing time and index
|
|
|
+ quality.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
@@ -61,17 +65,19 @@ foreach ($hits as $hit) {
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Each index segment is entirely independent portion of data. So indexes containing more segments need
|
|
|
- more memory and time for searching.
|
|
|
+ Each index segment is entirely independent portion of data. So indexes containing more
|
|
|
+ segments need more memory and time for searching.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Index optimization is a process of merging several segments into a new one. A fully optimized index contains
|
|
|
- only one segment.
|
|
|
+ Index optimization is a process of merging several segments into a new one. A fully
|
|
|
+ optimized index contains only one segment.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Full index optimization may be performed with the <methodname>optimize()</methodname> method:
|
|
|
+ Full index optimization may be performed with the <methodname>optimize()</methodname>
|
|
|
+ method:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
$index = Zend_Search_Lucene::open($indexPath);
|
|
|
|
|
|
@@ -80,124 +86,187 @@ $index->optimize();
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Index optimization works with data streams and doesn't take a lot of memory but does require processor resources and time.
|
|
|
+ Index optimization works with data streams and doesn't take a lot of memory but does
|
|
|
+ require processor resources and time.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Lucene index segments are not updatable by their nature (the update operation requires the segment file
|
|
|
- to be completely rewritten). So adding new document(s) to an index always generates a new segment.
|
|
|
- This, in turn, decreases index quality.
|
|
|
+ Lucene index segments are not updatable by their nature (the update operation requires
|
|
|
+ the segment file to be completely rewritten). So adding new document(s) to an index
|
|
|
+ always generates a new segment. This, in turn, decreases index quality.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- An index auto-optimization process is performed after each segment generation and consists of merging partial segments.
|
|
|
+ An index auto-optimization process is performed after each segment generation and
|
|
|
+ consists of merging partial segments.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- There are three options to control the behavior of auto-optimization
|
|
|
- (see <link linkend="zend.search.lucene.index-creation.optimization">Index optimization</link> section):
|
|
|
+ There are three options to control the behavior of auto-optimization (see <link
|
|
|
+ linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
|
|
|
+ section):
|
|
|
+
|
|
|
<itemizedlist>
|
|
|
<listitem>
|
|
|
- <para><emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be buffered in memory before a new segment is generated
|
|
|
- and written to the hard drive.</para>
|
|
|
+ <para>
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
|
|
|
+ buffered in memory before a new segment is generated and written to the hard
|
|
|
+ drive.
|
|
|
+ </para>
|
|
|
</listitem>
|
|
|
+
|
|
|
<listitem>
|
|
|
- <para><emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged by auto-optimization process
|
|
|
- into a new segment.</para>
|
|
|
+ <para>
|
|
|
+ <emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
|
|
|
+ by auto-optimization process into a new segment.
|
|
|
+ </para>
|
|
|
</listitem>
|
|
|
+
|
|
|
<listitem>
|
|
|
- <para><emphasis>MergeFactor</emphasis> determines how often auto-optimization is performed.</para>
|
|
|
+ <para>
|
|
|
+ <emphasis>MergeFactor</emphasis> determines how often auto-optimization is
|
|
|
+ performed.
|
|
|
+ </para>
|
|
|
</listitem>
|
|
|
</itemizedlist>
|
|
|
+
|
|
|
<note>
|
|
|
<para>
|
|
|
- All these options are <classname>Zend_Search_Lucene</classname> object properties- not index properties. They affect only current
|
|
|
- <classname>Zend_Search_Lucene</classname> object behavior and may vary for different scripts.
|
|
|
+ All these options are <classname>Zend_Search_Lucene</classname> object
|
|
|
+ properties- not index properties. They affect only current
|
|
|
+ <classname>Zend_Search_Lucene</classname> object behavior and may vary for
|
|
|
+ different scripts.
|
|
|
</para>
|
|
|
</note>
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one document per script execution. On the other hand, it's very important
|
|
|
- for batch indexing. Greater values increase indexing performance, but also require more memory.
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one
|
|
|
+ document per script execution. On the other hand, it's very important for batch
|
|
|
+ indexing. Greater values increase indexing performance, but also require more memory.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- There is simply no way to calculate the best value for the <emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document size, the
|
|
|
- analyzer in use and allowed memory.
|
|
|
+ There is simply no way to calculate the best value for the
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document
|
|
|
+ size, the analyzer in use and allowed memory.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- A good way to find the right value is to perform several tests with the largest document you expect to be added to the index
|
|
|
- <footnote><para><methodname>memory_get_usage()</methodname> and <methodname>memory_get_peak_usage()</methodname> may be used to control
|
|
|
- memory usage.</para></footnote>. It's a best practice not to use more than a half of the allowed memory.
|
|
|
+ A good way to find the right value is to perform several tests with the largest document
|
|
|
+ you expect to be added to the index
|
|
|
+
|
|
|
+ <footnote>
|
|
|
+ <para>
|
|
|
+ <methodname>memory_get_usage()</methodname> and
|
|
|
+ <methodname>memory_get_peak_usage()</methodname> may be used to control memory
|
|
|
+ usage.
|
|
|
+ </para>
|
|
|
+ </footnote>
|
|
|
+
|
|
|
+ . It's a best practice not to use more than a half of the allowed memory.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It therefore also limits auto-optimization time by guaranteeing
|
|
|
- that the <methodname>addDocument()</methodname> method is not executed more than a certain number of times. This is very important for interactive applications.
|
|
|
+ <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
|
|
|
+ therefore also limits auto-optimization time by guaranteeing that the
|
|
|
+ <methodname>addDocument()</methodname> method is not executed more than a certain number
|
|
|
+ of times. This is very important for interactive applications.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Lowering the <emphasis>MaxMergeDocs</emphasis> parameter also may improve batch indexing performance. Index auto-optimization is an iterative process
|
|
|
- and is performed from bottom up. Small segments are merged into larger segment, which are in turn merged into even larger segments and
|
|
|
- so on. Full index optimization is achieved when only one large segment file remains.
|
|
|
+ Lowering the <emphasis>MaxMergeDocs</emphasis> parameter also may improve batch indexing
|
|
|
+ performance. Index auto-optimization is an iterative process and is performed from
|
|
|
+ bottom up. Small segments are merged into larger segment, which are in turn merged into
|
|
|
+ even larger segments and so on. Full index optimization is achieved when only one large
|
|
|
+ segment file remains.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
Small segments generally decrease index quality. Many small segments may also trigger
|
|
|
- the "Too many open files" error determined by OS limitations <footnote><para><classname>Zend_Search_Lucene</classname> keeps each segment file opened
|
|
|
- to improve search performance.</para></footnote>.
|
|
|
+ the "Too many open files" error determined by OS limitations
|
|
|
+
|
|
|
+ <footnote>
|
|
|
+ <para>
|
|
|
+ <classname>Zend_Search_Lucene</classname> keeps each segment file opened to
|
|
|
+ improve search performance.
|
|
|
+ </para>
|
|
|
+ </footnote>.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- in general, background index optimization should be performed for interactive indexing mode and <emphasis>MaxMergeDocs</emphasis> shouldn't be
|
|
|
- too low for batch indexing.
|
|
|
+ in general, background index optimization should be performed for interactive indexing
|
|
|
+ mode and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values increase the quality of unoptimized indexes. Larger values increase
|
|
|
- indexing performance, but also increase the number of merged segments. This again may trigger the "Too many open files" error.
|
|
|
+ <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
|
|
|
+ increase the quality of unoptimized indexes. Larger values increase indexing
|
|
|
+ performance, but also increase the number of merged segments. This again may trigger the
|
|
|
+ "Too many open files" error.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
<emphasis>MergeFactor</emphasis> groups index segments by their size:
|
|
|
+
|
|
|
<orderedlist>
|
|
|
- <listitem><para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para></listitem>
|
|
|
- <listitem><para>Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
|
|
|
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.</para></listitem>
|
|
|
- <listitem><para>Greater than <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but not greater than
|
|
|
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.</para></listitem>
|
|
|
+ <listitem>
|
|
|
+ <para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
|
|
|
+ </listitem>
|
|
|
+
|
|
|
+ <listitem>
|
|
|
+ <para>
|
|
|
+ Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
|
|
|
+ </para>
|
|
|
+ </listitem>
|
|
|
+
|
|
|
+ <listitem>
|
|
|
+ <para>
|
|
|
+ Greater than
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
|
|
|
+ not greater than
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
|
|
|
+ </para>
|
|
|
+ </listitem>
|
|
|
+
|
|
|
<listitem><para>...</para></listitem>
|
|
|
</orderedlist>
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <classname>Zend_Search_Lucene</classname> checks during each <methodname>addDocument()</methodname> call to see if merging any segments may move the newly created segment
|
|
|
- into the next group. If yes, then merging is performed.
|
|
|
+ <classname>Zend_Search_Lucene</classname> checks during each
|
|
|
+ <methodname>addDocument()</methodname> call to see if merging any segments may move the
|
|
|
+ newly created segment into the next group. If yes, then merging is performed.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> + (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
|
|
|
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript> documents.
|
|
|
+ So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
|
|
|
+ (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
|
|
|
+ documents.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
This gives good approximation for the number of segments in the index:
|
|
|
</para>
|
|
|
<para>
|
|
|
- <emphasis>NumberOfSegments</emphasis> <= <emphasis>MaxBufferedDocs</emphasis> + <emphasis>MergeFactor</emphasis>*log
|
|
|
- <subscript><emphasis>MergeFactor</emphasis></subscript> (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
|
|
|
+ <emphasis>NumberOfSegments</emphasis> <= <emphasis>MaxBufferedDocs</emphasis> +
|
|
|
+ <emphasis>MergeFactor</emphasis>*log
|
|
|
+ <subscript><emphasis>MergeFactor</emphasis></subscript>
|
|
|
+ (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. This allows for the appropriate merge factor to get a reasonable
|
|
|
- number of segments.
|
|
|
+ <emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. This allows for
|
|
|
+ the appropriate merge factor to get a reasonable number of segments.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch indexing performance than <emphasis>MaxMergeDocs</emphasis>. But it's also more course-grained.
|
|
|
- So use the estimation above for tuning <emphasis>MergeFactor</emphasis>, then play with <emphasis>MaxMergeDocs</emphasis> to get best batch indexing performance.
|
|
|
+ Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
|
|
|
+ indexing performance than <emphasis>MaxMergeDocs</emphasis>. But it's also more
|
|
|
+ course-grained. So use the estimation above for tuning <emphasis>MergeFactor</emphasis>,
|
|
|
+ then play with <emphasis>MaxMergeDocs</emphasis> to get best batch indexing performance.
|
|
|
</para>
|
|
|
</sect2>
|
|
|
|
|
|
@@ -205,7 +274,8 @@ $index->optimize();
|
|
|
<title>Index during Shut Down</title>
|
|
|
|
|
|
<para>
|
|
|
- The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time if any documents were added to the index but not written to a new segment.
|
|
|
+ The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
|
|
|
+ if any documents were added to the index but not written to a new segment.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
@@ -213,13 +283,21 @@ $index->optimize();
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- The index object is automatically closed when it, and all returned QueryHit objects, go out of scope.
|
|
|
+ The index object is automatically closed when it, and all returned QueryHit objects, go
|
|
|
+ out of scope.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- If index object is stored in global variable than it's closed only at the end of script execution<footnote><para>This also
|
|
|
- may occur if the index or QueryHit instances are referred to in some cyclical data structures, because <acronym>PHP</acronym> garbage collects objects
|
|
|
- with cyclic references only at the end of script execution.</para></footnote>.
|
|
|
+ If index object is stored in global variable than it's closed only at the end of script
|
|
|
+ execution
|
|
|
+
|
|
|
+ <footnote>
|
|
|
+ <para>
|
|
|
+ This also may occur if the index or QueryHit instances are referred to in some
|
|
|
+ cyclical data structures, because <acronym>PHP</acronym> garbage collects
|
|
|
+ objects with cyclic references only at the end of script execution.
|
|
|
+ </para>
|
|
|
+ </footnote>.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
@@ -227,7 +305,8 @@ $index->optimize();
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- It doesn't prevent normal index shutdown process, but may prevent accurate error diagnostic if any error occurs during shutdown.
|
|
|
+ It doesn't prevent normal index shutdown process, but may prevent accurate error
|
|
|
+ diagnostic if any error occurs during shutdown.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
@@ -236,6 +315,7 @@ $index->optimize();
|
|
|
|
|
|
<para>
|
|
|
The first is to force going out of scope:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
$index = Zend_Search_Lucene::open($indexPath);
|
|
|
|
|
|
@@ -247,13 +327,16 @@ unset($index);
|
|
|
|
|
|
<para>
|
|
|
And the second is to perform a commit operation before the end of script execution:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
$index = Zend_Search_Lucene::open($indexPath);
|
|
|
|
|
|
$index->commit();
|
|
|
]]></programlisting>
|
|
|
- This possibility is also described in the "<link linkend="zend.search.lucene.advanced.static">Advanced. Using index as static property</link>"
|
|
|
- section.
|
|
|
+
|
|
|
+ This possibility is also described in the "<link
|
|
|
+ linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
|
|
|
+ property</link>" section.
|
|
|
</para>
|
|
|
</sect2>
|
|
|
|
|
|
@@ -261,15 +344,18 @@ $index->commit();
|
|
|
<title>Retrieving documents by unique id</title>
|
|
|
|
|
|
<para>
|
|
|
- It's a common practice to store some unique document id in the index. Examples include url, path, or database id.
|
|
|
+ It's a common practice to store some unique document id in the index. Examples include
|
|
|
+ url, path, or database id.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname> method for retrieving documents containing specified terms.
|
|
|
+ <classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
|
|
|
+ method for retrieving documents containing specified terms.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
This is more efficient than using the <methodname>find()</methodname> method:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
// Retrieving documents with find() method using a query string
|
|
|
$query = $idFieldName . ':' . $docId;
|
|
|
@@ -314,7 +400,8 @@ foreach ($docIds as $id) {
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- It uses memory to cache some information and optimize searching and indexing performance.
|
|
|
+ It uses memory to cache some information and optimize searching and indexing
|
|
|
+ performance.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
@@ -322,30 +409,44 @@ foreach ($docIds as $id) {
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- The terms dictionary index is loaded during the search. It's actually each 128<superscript>th</superscript>
|
|
|
- <footnote><para>The Lucene file format allows you to configure this number, but <classname>Zend_Search_Lucene</classname> doesn't expose this
|
|
|
- in its <acronym>API</acronym>. Nevertheless you still have the ability to configure this value if the index is prepared with another
|
|
|
- Lucene implementation.</para></footnote> term of the full dictionary.
|
|
|
+ The terms dictionary index is loaded during the search. It's actually each
|
|
|
+ 128<superscript>th</superscript>
|
|
|
+
|
|
|
+ <footnote>
|
|
|
+ <para>
|
|
|
+ The Lucene file format allows you to configure this number, but
|
|
|
+ <classname>Zend_Search_Lucene</classname> doesn't expose this in its
|
|
|
+ <acronym>API</acronym>. Nevertheless you still have the ability to configure
|
|
|
+ this value if the index is prepared with another Lucene implementation.
|
|
|
+ </para>
|
|
|
+ </footnote>
|
|
|
+
|
|
|
+ term of the full dictionary.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Thus memory usage is increased if you have a high number of unique terms. This may happen if you use untokenized phrases as a field values
|
|
|
- or index a large volume of non-text information.
|
|
|
+ Thus memory usage is increased if you have a high number of unique terms. This may
|
|
|
+ happen if you use untokenized phrases as a field values or index a large volume of
|
|
|
+ non-text information.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- An unoptimized index consists of several segments. It also increases memory usage. Segments are independent, so each segment contains
|
|
|
- its own terms dictionary and terms dictionary index. If an index consists of <emphasis>N</emphasis> segments it may increase memory
|
|
|
- usage by <emphasis>N</emphasis> times in worst case. Perform index optimization to merge all segments into one to avoid such memory consumption.
|
|
|
+ An unoptimized index consists of several segments. It also increases memory usage.
|
|
|
+ Segments are independent, so each segment contains its own terms dictionary and terms
|
|
|
+ dictionary index. If an index consists of <emphasis>N</emphasis> segments it may
|
|
|
+ increase memory usage by <emphasis>N</emphasis> times in worst case. Perform index
|
|
|
+ optimization to merge all segments into one to avoid such memory consumption.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Indexing uses the same memory as searching plus memory for buffering documents. The amount of memory used may be managed with
|
|
|
- <emphasis>MaxBufferedDocs</emphasis> parameter.
|
|
|
+ Indexing uses the same memory as searching plus memory for buffering documents. The
|
|
|
+ amount of memory used may be managed with <emphasis>MaxBufferedDocs</emphasis>
|
|
|
+ parameter.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Index optimization (full or partial) uses stream-style data processing and doesn't require a lot of memory.
|
|
|
+ Index optimization (full or partial) uses stream-style data processing and doesn't
|
|
|
+ require a lot of memory.
|
|
|
</para>
|
|
|
</sect2>
|
|
|
|
|
|
@@ -353,11 +454,13 @@ foreach ($docIds as $id) {
|
|
|
<title>Encoding</title>
|
|
|
|
|
|
<para>
|
|
|
- <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
|
|
|
+ <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all
|
|
|
+ strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- You shouldn't be concerned with encoding if you work with pure <acronym>ASCII</acronym> data, but you should be careful if this is not the case.
|
|
|
+ You shouldn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
|
|
|
+ data, but you should be careful if this is not the case.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
@@ -365,11 +468,13 @@ foreach ($docIds as $id) {
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities for indexed documents and parsed queries.
|
|
|
+ <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
|
|
|
+ for indexed documents and parsed queries.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
Encoding may be explicitly specified as an optional parameter of field creation methods:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
$doc = new Zend_Search_Lucene_Document();
|
|
|
$doc->addField(Zend_Search_Lucene_Field::Text('title',
|
|
|
@@ -379,12 +484,14 @@ $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
|
|
|
$contents,
|
|
|
'utf-8'));
|
|
|
]]></programlisting>
|
|
|
+
|
|
|
This is the best way to avoid ambiguity in the encoding used.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- If optional encoding parameter is omitted, then the current locale is used. The current locale may contain character encoding data
|
|
|
- in addition to the language specification:
|
|
|
+ If optional encoding parameter is omitted, then the current locale is used. The current
|
|
|
+ locale may contain character encoding data in addition to the language specification:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
setlocale(LC_ALL, 'fr_FR');
|
|
|
...
|
|
|
@@ -406,7 +513,9 @@ setlocale(LC_ALL, 'ru_RU.UTF-8');
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Encoding may be passed as an optional parameter, if the query is parsed explicitly before search:
|
|
|
+ Encoding may be passed as an optional parameter, if the query is parsed explicitly
|
|
|
+ before search:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
$query =
|
|
|
Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
|
|
|
@@ -416,18 +525,23 @@ $hits = $index->find($query);
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- The default encoding may also be specified with <methodname>setDefaultEncoding()</methodname> method:
|
|
|
+ The default encoding may also be specified with
|
|
|
+ <methodname>setDefaultEncoding()</methodname> method:
|
|
|
+
|
|
|
<programlisting language="php"><![CDATA[
|
|
|
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
|
|
|
$hits = $index->find($queryStr);
|
|
|
...
|
|
|
]]></programlisting>
|
|
|
+
|
|
|
The empty string implies 'current locale'.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- If the correct encoding is specified it can be correctly processed by analyzer. The actual behavior depends on which analyzer is used.
|
|
|
- See the <link linkend="zend.search.lucene.charset">Character Set</link> documentation section for details.
|
|
|
+ If the correct encoding is specified it can be correctly processed by analyzer. The
|
|
|
+ actual behavior depends on which analyzer is used. See the <link
|
|
|
+ linkend="zend.search.lucene.charset">Character Set</link> documentation section for
|
|
|
+ details.
|
|
|
</para>
|
|
|
</sect2>
|
|
|
|
|
|
@@ -435,30 +549,35 @@ $hits = $index->find($queryStr);
|
|
|
<title>Index maintenance</title>
|
|
|
|
|
|
<para>
|
|
|
- It should be clear that <classname>Zend_Search_Lucene</classname> as well as any other Lucene implementation does not comprise a "database".
|
|
|
+ It should be clear that <classname>Zend_Search_Lucene</classname> as well as any other
|
|
|
+ Lucene implementation does not comprise a "database".
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Indexes should not be used for data storage. They do not provide partial backup/restore functionality, journaling, logging, transactions
|
|
|
- and many other features associated with database management systems.
|
|
|
+ Indexes should not be used for data storage. They do not provide partial backup/restore
|
|
|
+ functionality, journaling, logging, transactions and many other features associated with
|
|
|
+ database management systems.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a consistent state at all times.
|
|
|
+ Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
|
|
|
+ consistent state at all times.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- Index backup and restoration should be performed by copying the contents of the index folder.
|
|
|
+ Index backup and restoration should be performed by copying the contents of the index
|
|
|
+ folder.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- If index corruption occurs for any reason, the corrupted index should be restored or completely rebuilt.
|
|
|
+ If index corruption occurs for any reason, the corrupted index should be restored or
|
|
|
+ completely rebuilt.
|
|
|
</para>
|
|
|
|
|
|
<para>
|
|
|
- So it's a good idea to backup large indexes and store changelogs to perform manual restoration and roll-forward operations
|
|
|
- if necessary. This practice dramatically reduces index restoration time.
|
|
|
+ So it's a good idea to backup large indexes and store changelogs to perform manual
|
|
|
+ restoration and roll-forward operations if necessary. This practice dramatically reduces
|
|
|
+ index restoration time.
|
|
|
</para>
|
|
|
-
|
|
|
</sect2>
|
|
|
</sect1>
|