| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464 |
- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.best-practice">
- <title>Best Practices</title>
- <sect2 id="zend.search.lucene.best-practice.field-names">
- <title>Field names</title>
- <para>
- There are no limitations for field names in <classname>Zend_Search_Lucene</classname>.
- </para>
- <para>
- Nevertheless it's a good idea not to use '<emphasis>id</emphasis>' and '<emphasis>score</emphasis>' names
- to avoid ambiguity in <code>QueryHit</code> properties names.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <code>id</code> and <code>score</code> properties always refer
- to internal Lucene document id and hit <link linkend="zend.search.lucene.searching.results-scoring">score</link>.
- If the indexed document has the same stored fields, you have to use the <code>getDocument()</code> method to access them:
- <programlisting language="php"><![CDATA[
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- // Get 'title' document field
- $title = $hit->title;
- // Get 'contents' document field
- $contents = $hit->contents;
- // Get internal Lucene document id
- $id = $hit->id;
- // Get query hit score
- $score = $hit->score;
- // Get 'id' document field
- $docId = $hit->getDocument()->id;
- // Get 'score' document field
- $docId = $hit->getDocument()->score;
- // Another way to get 'title' document field
- $title = $hit->getDocument()->title;
- }
- ]]></programlisting>
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.indexing-performance">
- <title>Indexing performance</title>
- <para>
- Indexing performance is a compromise between used resources, indexing time and index quality.
- </para>
- <para>
- Index quality is completely determined by number of index segments.
- </para>
- <para>
- Each index segment is entirely independent portion of data. So indexes containing more segments need
- more memory and time for searching.
- </para>
- <para>
- Index optimization is a process of merging several segments into a new one. A fully optimized index contains
- only one segment.
- </para>
- <para>
- Full index optimization may be performed with the <code>optimize()</code> method:
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open($indexPath);
- $index->optimize();
- ]]></programlisting>
- </para>
- <para>
- Index optimization works with data streams and doesn't take a lot of memory but does require processor resources and time.
- </para>
- <para>
- Lucene index segments are not updatable by their nature (the update operation requires the segment file
- to be completely rewritten). So adding new document(s) to an index always generates a new segment.
- This, in turn, decreases index quality.
- </para>
- <para>
- An index auto-optimization process is performed after each segment generation and consists of merging partial segments.
- </para>
- <para>
- There are three options to control the behavior of auto-optimization
- (see <link linkend="zend.search.lucene.index-creation.optimization">Index optimization</link> section):
- <itemizedlist>
- <listitem>
- <para><emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be buffered in memory before a new segment is generated
- and written to the hard drive.</para>
- </listitem>
- <listitem>
- <para><emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged by auto-optimization process
- into a new segment.</para>
- </listitem>
- <listitem>
- <para><emphasis>MergeFactor</emphasis> determines how often auto-optimization is performed.</para>
- </listitem>
- </itemizedlist>
- <note>
- <para>
- All these options are <classname>Zend_Search_Lucene</classname> object properties- not index properties. They affect only current
- <classname>Zend_Search_Lucene</classname> object behavior and may vary for different scripts.
- </para>
- </note>
- </para>
- <para>
- <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one document per script execution. On the other hand, it's very important
- for batch indexing. Greater values increase indexing performance, but also require more memory.
- </para>
- <para>
- There is simply no way to calculate the best value for the <emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document size, the
- analyzer in use and allowed memory.
- </para>
- <para>
- A good way to find the right value is to perform several tests with the largest document you expect to be added to the index
- <footnote><para><code>memory_get_usage()</code> and <code>memory_get_peak_usage()</code> may be used to control
- memory usage.</para></footnote>. It's a best practice not to use more than a half of the allowed memory.
- </para>
- <para>
- <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It therefore also limits auto-optimization time by guaranteeing
- that the <code>addDocument()</code> method is not executed more than a certain number of times. This is very important for interactive applications.
- </para>
- <para>
- Lowering the <emphasis>MaxMergeDocs</emphasis> parameter also may improve batch indexing performance. Index auto-optimization is an iterative process
- and is performed from bottom up. Small segments are merged into larger segment, which are in turn merged into even larger segments and
- so on. Full index optimization is achieved when only one large segment file remains.
- </para>
- <para>
- Small segments generally decrease index quality. Many small segments may also trigger
- the "Too many open files" error determined by OS limitations <footnote><para><classname>Zend_Search_Lucene</classname> keeps each segment file opened
- to improve search performance.</para></footnote>.
- </para>
- <para>
- in general, background index optimization should be performed for interactive indexing mode and <emphasis>MaxMergeDocs</emphasis> shouldn't be
- too low for batch indexing.
- </para>
- <para>
- <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values increase the quality of unoptimized indexes. Larger values increase
- indexing performance, but also increase the number of merged segments. This again may trigger the "Too many open files" error.
- </para>
- <para>
- <emphasis>MergeFactor</emphasis> groups index segments by their size:
- <orderedlist>
- <listitem><para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para></listitem>
- <listitem><para>Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.</para></listitem>
- <listitem><para>Greater than <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but not greater than
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.</para></listitem>
- <listitem><para>...</para></listitem>
- </orderedlist>
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> checks during each <code>addDocument()</code> call to see if merging any segments may move the newly created segment
- into the next group. If yes, then merging is performed.
- </para>
- <para>
- So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> + (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript> documents.
- </para>
- <para>
- This gives good approximation for the number of segments in the index:
- </para>
- <para>
- <emphasis>NumberOfSegments</emphasis> <= <emphasis>MaxBufferedDocs</emphasis> + <emphasis>MergeFactor</emphasis>*log
- <subscript><emphasis>MergeFactor</emphasis></subscript> (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
- </para>
- <para>
- <emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. This allows for the appropriate merge factor to get a reasonable
- number of segments.
- </para>
- <para>
- Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch indexing performance than <emphasis>MaxMergeDocs</emphasis>. But it's also more course-grained.
- So use the estimation above for tuning <emphasis>MergeFactor</emphasis>, then play with <emphasis>MaxMergeDocs</emphasis> to get best batch indexing performance.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.shutting-down">
- <title>Index during Shut Down</title>
- <para>
- The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time if any documents were added to the index but not written to a new segment.
- </para>
- <para>
- It also may trigger an auto-optimization process.
- </para>
- <para>
- The index object is automatically closed when it, and all returned QueryHit objects, go out of scope.
- </para>
- <para>
- If index object is stored in global variable than it's closed only at the end of script execution<footnote><para>This also
- may occur if the index or QueryHit instances are referred to in some cyclical data structures, because PHP garbage collects objects
- with cyclic references only at the end of script execution.</para></footnote>.
- </para>
- <para>
- PHP exception processing is also shut down at this moment.
- </para>
- <para>
- It doesn't prevent normal index shutdown process, but may prevent accurate error diagnostic if any error occurs during shutdown.
- </para>
- <para>
- There are two ways with which you may avoid this problem.
- </para>
- <para>
- The first is to force going out of scope:
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open($indexPath);
- ...
- unset($index);
- ]]></programlisting>
- </para>
- <para>
- And the second is to perform a commit operation before the end of script execution:
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open($indexPath);
- $index->commit();
- ]]></programlisting>
- This possibility is also described in the "<link linkend="zend.search.lucene.advanced.static">Advanced. Using index as static property</link>"
- section.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.unique-id">
- <title>Retrieving documents by unique id</title>
- <para>
- It's a common practice to store some unique document id in the index. Examples include url, path, or database id.
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> provides a <code>termDocs()</code> method for retrieving documents containing specified terms.
- </para>
- <para>
- This is more efficient than using the <code>find()</code> method:
- <programlisting language="php"><![CDATA[
- // Retrieving documents with find() method using a query string
- $query = $idFieldName . ':' . $docId;
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- $title = $hit->title;
- $contents = $hit->contents;
- ...
- }
- ...
- // Retrieving documents with find() method using the query API
- $term = new Zend_Search_Lucene_Index_Term($docId, idFieldName);
- $query = new Zend_Search_Lucene_Search_Query_Term($term);
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- $title = $hit->title;
- $contents = $hit->contents;
- ...
- }
- ...
- // Retrieving documents with termDocs() method
- $term = new Zend_Search_Lucene_Index_Term($docId, idFieldName);
- $docIds = $index->termDocs($term);
- foreach ($docIds as $id) {
- $doc = $index->getDocument($id);
- $title = $doc->title;
- $contents = $doc->contents;
- ...
- }
- ]]></programlisting>
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.memory-usage">
- <title>Memory Usage</title>
- <para>
- <classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
- </para>
- <para>
- It uses memory to cache some information and optimize searching and indexing performance.
- </para>
- <para>
- The memory required differs for different modes.
- </para>
- <para>
- The terms dictionary index is loaded during the search. It's actually each 128<superscript>th</superscript>
- <footnote><para>The Lucene file format allows you to configure this number, but <classname>Zend_Search_Lucene</classname> doesn't expose this
- in its API. Nevertheless you still have the ability to configure this value if the index is prepared with another
- Lucene implementation.</para></footnote> term of the full dictionary.
- </para>
- <para>
- Thus memory usage is increased if you have a high number of unique terms. This may happen if you use untokenized phrases as a field values
- or index a large volume of non-text information.
- </para>
- <para>
- An unoptimized index consists of several segments. It also increases memory usage. Segments are independent, so each segment contains
- its own terms dictionary and terms dictionary index. If an index consists of <emphasis>N</emphasis> segments it may increase memory
- usage by <emphasis>N</emphasis> times in worst case. Perform index optimization to merge all segments into one to avoid such memory consumption.
- </para>
- <para>
- Indexing uses the same memory as searching plus memory for buffering documents. The amount of memory used may be managed with
- <emphasis>MaxBufferedDocs</emphasis> parameter.
- </para>
- <para>
- Index optimization (full or partial) uses stream-style data processing and doesn't require a lot of memory.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.encoding">
- <title>Encoding</title>
- <para>
- <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
- </para>
- <para>
- You shouldn't be concerned with encoding if you work with pure ASCII data, but you should be careful if this is not the case.
- </para>
- <para>
- Wrong encoding may cause error notices at the encoding conversion time or loss of data.
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities for indexed documents and parsed queries.
- </para>
- <para>
- Encoding may be explicitly specified as an optional parameter of field creation methods:
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('title',
- $title,
- 'iso-8859-1'));
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
- $contents,
- 'utf-8'));
- ]]></programlisting>
- This is the best way to avoid ambiguity in the encoding used.
- </para>
- <para>
- If optional encoding parameter is omitted, then the current locale is used. The current locale may contain character encoding data
- in addition to the language specification:
- <programlisting language="php"><![CDATA[
- setlocale(LC_ALL, 'fr_FR');
- ...
- setlocale(LC_ALL, 'de_DE.iso-8859-1');
- ...
- setlocale(LC_ALL, 'ru_RU.UTF-8');
- ...
- ]]></programlisting>
- </para>
- <para>
- The same approach is used to set query string encoding.
- </para>
- <para>
- If encoding is not specified, then the current locale is used to determine the encoding.
- </para>
- <para>
- Encoding may be passed as an optional parameter, if the query is parsed explicitly before search:
- <programlisting language="php"><![CDATA[
- $query =
- Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
- $hits = $index->find($query);
- ...
- ]]></programlisting>
- </para>
- <para>
- The default encoding may also be specified with <code>setDefaultEncoding()</code> method:
- <programlisting language="php"><![CDATA[
- Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
- $hits = $index->find($queryStr);
- ...
- ]]></programlisting>
- The empty string implies 'current locale'.
- </para>
- <para>
- If the correct encoding is specified it can be correctly processed by analyzer. The actual behavior depends on which analyzer is used.
- See the <link linkend="zend.search.lucene.charset">Character Set</link> documentation section for details.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.maintenance">
- <title>Index maintenance</title>
- <para>
- It should be clear that <classname>Zend_Search_Lucene</classname> as well as any other Lucene implementation does not comprise a "database".
- </para>
- <para>
- Indexes should not be used for data storage. They do not provide partial backup/restore functionality, journaling, logging, transactions
- and many other features associated with database management systems.
- </para>
- <para>
- Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a consistent state at all times.
- </para>
- <para>
- Index backup and restoration should be performed by copying the contents of the index folder.
- </para>
- <para>
- If index corruption occurs for any reason, the corrupted index should be restored or completely rebuilt.
- </para>
- <para>
- So it's a good idea to backup large indexes and store changelogs to perform manual restoration and roll-forward operations
- if necessary. This practice dramatically reduces index restoration time.
- </para>
- </sect2>
- </sect1>
|