- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.best-practice">
- <title>Best Practices</title>
- <sect2 id="zend.search.lucene.best-practice.field-names">
- <title>Field names</title>
- <para>
- There are no limitations for field names in <classname>Zend_Search_Lucene</classname>.
- </para>
- <para>
- Nevertheless, it's a good idea not to use '<emphasis>id</emphasis>' and
- '<emphasis>score</emphasis>' as field names, to avoid ambiguity with
- <classname>QueryHit</classname> property names.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <property>id</property>
- and <property>score</property> properties always refer to the internal Lucene document id
- and the hit <link linkend="zend.search.lucene.searching.results-scoring">score</link>. If
- the indexed document has stored fields with the same names, you have to use the
- <methodname>getDocument()</methodname> method to access them:
- </para>
- <programlisting language="php"><![CDATA[
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- // Get 'title' document field
- $title = $hit->title;
- // Get 'contents' document field
- $contents = $hit->contents;
- // Get internal Lucene document id
- $id = $hit->id;
- // Get query hit score
- $score = $hit->score;
- // Get 'id' document field
- $docId = $hit->getDocument()->id;
- // Get 'score' document field
- $docScore = $hit->getDocument()->score;
- // Another way to get 'title' document field
- $title = $hit->getDocument()->title;
- }
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.indexing-performance">
- <title>Indexing performance</title>
- <para>
- Indexing performance is a compromise between the resources used, indexing time and index
- quality.
- </para>
- <para>
- Index quality is completely determined by the number of index segments.
- </para>
- <para>
- Each index segment is an entirely independent portion of data, so indexes containing more
- segments need more memory and time for searching.
- </para>
- <para>
- Index optimization is a process of merging several segments into a new one. A fully
- optimized index contains only one segment.
- </para>
- <para>
- Full index optimization may be performed with the <methodname>optimize()</methodname>
- method:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open($indexPath);
- $index->optimize();
- ]]></programlisting>
- <para>
- Index optimization works with data streams and doesn't take a lot of memory but does
- require processor resources and time.
- </para>
- <para>
- Lucene index segments are not updatable by their nature (the update operation requires
- the segment file to be completely rewritten). So adding new document(s) to an index
- always generates a new segment. This, in turn, decreases index quality.
- </para>
- <para>
- An index auto-optimization process is performed after each segment generation and
- consists of merging partial segments.
- </para>
- <para>
- There are three options to control the behavior of auto-optimization (see <link
- linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
- section):
- <itemizedlist>
- <listitem>
- <para>
- <emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
- buffered in memory before a new segment is generated and written to the hard
- drive.
- </para>
- </listitem>
- <listitem>
- <para>
- <emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
- by the auto-optimization process into a new segment.
- </para>
- </listitem>
- <listitem>
- <para>
- <emphasis>MergeFactor</emphasis> determines how often auto-optimization is
- performed.
- </para>
- </listitem>
- </itemizedlist>
- <note>
- <para>
- All these options are <classname>Zend_Search_Lucene</classname> object
- properties, not index properties. They affect only the behavior of the current
- <classname>Zend_Search_Lucene</classname> object and may vary between
- scripts.
- </para>
- </note>
- </para>
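- <para>
- As a minimal sketch of how these options might be applied in a batch indexing script:
- the helper name <methodname>configureBatchIndexing()</methodname> and the values in the
- usage comment are illustrative assumptions, not recommendations. Only the three setter
- methods belong to <classname>Zend_Search_Lucene</classname>.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: applies auto-optimization settings to an index object.
- // The settings live on the object, so each indexing script has to set them itself.
- function configureBatchIndexing($index, $maxBufferedDocs, $maxMergeDocs, $mergeFactor)
- {
-     $index->setMaxBufferedDocs($maxBufferedDocs);
-     $index->setMaxMergeDocs($maxMergeDocs);
-     $index->setMergeFactor($mergeFactor);
-     return $index;
- }
- // Usage (illustrative values only):
- // $index = Zend_Search_Lucene::open($indexPath);
- // configureBatchIndexing($index, 100, PHP_INT_MAX, 10);
- ]]></programlisting>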
- <para>
- <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one
- document per script execution. On the other hand, it's very important for batch
- indexing. Greater values increase indexing performance, but also require more memory.
- </para>
- <para>
- There is simply no way to calculate the best value for the
- <emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document
- size, the analyzer in use and allowed memory.
- </para>
- <para>
- A good way to find the right value is to perform several tests with the largest document
- you expect to be added to the index
- <footnote>
- <para>
- <methodname>memory_get_usage()</methodname> and
- <methodname>memory_get_peak_usage()</methodname> may be used to monitor memory
- usage.
- </para>
- </footnote>.
- It's a best practice not to use more than half of the allowed memory.
- </para>
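- <para>
- The "half of the allowed memory" rule can be checked in a test run with a sketch like the
- following. The helper names <methodname>iniSizeToBytes()</methodname> and
- <methodname>withinHalfOfMemoryLimit()</methodname> are hypothetical and not part of
- <classname>Zend_Search_Lucene</classname>; they rely only on standard
- <acronym>PHP</acronym> functions.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: converts a php.ini size such as "128M" to bytes.
- function iniSizeToBytes($value)
- {
-     $value  = trim($value);
-     $number = (int) $value;
-     switch (strtoupper(substr($value, -1))) {
-         case 'G':
-             $number *= 1024; // fall through
-         case 'M':
-             $number *= 1024; // fall through
-         case 'K':
-             $number *= 1024;
-     }
-     return $number;
- }
- // Hypothetical helper: true while peak usage stays within half of the limit.
- function withinHalfOfMemoryLimit($peakBytes, $limitBytes)
- {
-     // A memory_limit of -1 means "no limit".
-     return $limitBytes < 0 || $peakBytes <= $limitBytes / 2;
- }
- // Usage, after indexing a test batch with a candidate MaxBufferedDocs value:
- // $ok = withinHalfOfMemoryLimit(memory_get_peak_usage(),
- //                               iniSizeToBytes(ini_get('memory_limit')));
- ]]></programlisting>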
- <para>
- <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
- therefore also limits auto-optimization time by guaranteeing that the
- <methodname>addDocument()</methodname> method is not executed more than a certain number
- of times. This is very important for interactive applications.
- </para>
- <para>
- Lowering the <emphasis>MaxMergeDocs</emphasis> parameter may also improve batch indexing
- performance. Index auto-optimization is an iterative process and is performed from the
- bottom up. Small segments are merged into larger segments, which are in turn merged into
- even larger segments and so on. Full index optimization is achieved when only one large
- segment file remains.
- </para>
- <para>
- Small segments generally decrease index quality. Many small segments may also trigger
- the "Too many open files" error determined by OS limitations
- <footnote>
- <para>
- <classname>Zend_Search_Lucene</classname> keeps each segment file opened to
- improve search performance.
- </para>
- </footnote>.
- </para>
- <para>
- In general, background index optimization should be performed for interactive indexing
- mode and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
- </para>
- <para>
- <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
- increase the quality of unoptimized indexes. Larger values increase indexing
- performance, but also increase the number of merged segments. This again may trigger the
- "Too many open files" error.
- </para>
- <para>
- <emphasis>MergeFactor</emphasis> groups index segments by their size:
- <orderedlist>
- <listitem>
- <para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
- </listitem>
- <listitem>
- <para>
- Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
- </para>
- </listitem>
- <listitem>
- <para>
- Greater than
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
- not greater than
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
- </para>
- </listitem>
- <listitem><para>...</para></listitem>
- </orderedlist>
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> checks during each
- <methodname>addDocument()</methodname> call to see if merging any segments may move the
- newly created segment into the next group. If yes, then merging is performed.
- </para>
- <para>
- So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
- (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
- <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
- documents.
- </para>
- <para>
- This gives a good approximation for the number of segments in the index:
- </para>
- <para>
- <emphasis>NumberOfSegments</emphasis> &lt;= <emphasis>MaxBufferedDocs</emphasis> +
- <emphasis>MergeFactor</emphasis>*log
- <subscript><emphasis>MergeFactor</emphasis></subscript>
- (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
- </para>
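- <para>
- The estimate above can be evaluated with a few lines of <acronym>PHP</acronym>. This is
- only a sketch: the function name <methodname>estimateMaxSegments()</methodname> is
- hypothetical, and clamping the logarithmic term at zero for very small indexes is an
- added assumption.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: upper estimate of the segment count,
- // MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs).
- // $mergeFactor must be greater than 1 for the logarithm to be defined.
- function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
- {
-     if ($numberOfDocuments <= $maxBufferedDocs) {
-         // Assumption: treat the logarithmic term as zero for tiny indexes.
-         return (float) $maxBufferedDocs;
-     }
-     $groups = log($numberOfDocuments / $maxBufferedDocs, $mergeFactor);
-     return $maxBufferedDocs + $mergeFactor * $groups;
- }
- // Usage: 100000 documents with MaxBufferedDocs = 10 and MergeFactor = 10
- // gives an estimate of about 50 segments.
- // $estimate = estimateMaxSegments(100000, 10, 10);
- ]]></programlisting>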
- <para>
- <emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. Once it is fixed,
- you can choose an appropriate merge factor to get a reasonable number of segments.
- </para>
- <para>
- Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
- indexing performance than <emphasis>MaxMergeDocs</emphasis>, but it's also more
- coarse-grained. So use the estimation above for tuning <emphasis>MergeFactor</emphasis>,
- then adjust <emphasis>MaxMergeDocs</emphasis> to get the best batch indexing performance.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.shutting-down">
- <title>Index during Shut Down</title>
- <para>
- The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
- if any documents were added to the index but not written to a new segment.
- </para>
- <para>
- It may also trigger an auto-optimization process.
- </para>
- <para>
- The index object is automatically closed when it, and all returned QueryHit objects, go
- out of scope.
- </para>
- <para>
- If the index object is stored in a global variable, then it's closed only at the end of
- script execution
- <footnote>
- <para>
- This also may occur if the index or QueryHit instances are referred to in some
- cyclical data structures, because <acronym>PHP</acronym> garbage collects
- objects with cyclic references only at the end of script execution.
- </para>
- </footnote>.
- </para>
- <para>
- <acronym>PHP</acronym> exception processing is also shut down at this moment.
- </para>
- <para>
- This doesn't prevent the normal index shutdown process, but it may prevent accurate error
- diagnostics if any error occurs during shutdown.
- </para>
- <para>
- There are two ways to avoid this problem.
- </para>
- <para>
- The first is to force going out of scope:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open($indexPath);
- ...
- unset($index);
- ]]></programlisting>
- <para>
- The second is to perform a commit operation before the end of script execution:
- </para>
- <programlisting language="php"><![CDATA[
- $index = Zend_Search_Lucene::open($indexPath);
- $index->commit();
- ]]></programlisting>
- <para>
- This possibility is also described in the "<link
- linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
- property</link>" section.
- </para>
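- <para>
- Committing explicitly also makes it possible to handle errors while exception processing
- is still available. A minimal sketch, assuming a hypothetical helper named
- <methodname>commitIndexSafely()</methodname> (only <methodname>commit()</methodname> is
- part of <classname>Zend_Search_Lucene</classname>):
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: commits pending changes before script shutdown.
- // Returns null on success, or the error message if the commit failed.
- function commitIndexSafely($index)
- {
-     try {
-         $index->commit();
-         return null;
-     } catch (Exception $e) {
-         return $e->getMessage();
-     }
- }
- // Usage:
- // $index = Zend_Search_Lucene::open($indexPath);
- // ... add documents ...
- // $error = commitIndexSafely($index);
- // if ($error !== null) {
- //     error_log('Index commit failed: ' . $error);
- // }
- // unset($index);
- ]]></programlisting>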
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.unique-id">
- <title>Retrieving documents by unique id</title>
- <para>
- It's a common practice to store some unique document id in the index. Examples include
- a URL, a path, or a database id.
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
- method for retrieving documents containing specified terms.
- </para>
- <para>
- This is more efficient than using the <methodname>find()</methodname> method:
- </para>
- <programlisting language="php"><![CDATA[
- // Retrieving documents with find() method using a query string
- $query = $idFieldName . ':' . $docId;
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- $title = $hit->title;
- $contents = $hit->contents;
- ...
- }
- ...
- // Retrieving documents with find() method using the query API
- $term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
- $query = new Zend_Search_Lucene_Search_Query_Term($term);
- $hits = $index->find($query);
- foreach ($hits as $hit) {
- $title = $hit->title;
- $contents = $hit->contents;
- ...
- }
- ...
- // Retrieving documents with termDocs() method
- $term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
- $docIds = $index->termDocs($term);
- foreach ($docIds as $id) {
- $doc = $index->getDocument($id);
- $title = $doc->title;
- $contents = $doc->contents;
- ...
- }
- ]]></programlisting>
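- <para>
- A typical use of <methodname>termDocs()</methodname> is replacing a document that was
- indexed earlier under the same unique id. The following is a sketch only: the helper
- name <methodname>replaceDocumentByTerm()</methodname> is hypothetical, and it assumes
- the id field was indexed untokenized (for example as a keyword field) so that the term
- matches the stored value exactly.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: deletes every document matching $term, then adds $newDoc.
- // $term is a Zend_Search_Lucene_Index_Term built for the unique id field.
- // Returns the number of documents that were deleted.
- function replaceDocumentByTerm($index, $term, $newDoc)
- {
-     $deleted = 0;
-     foreach ($index->termDocs($term) as $id) {
-         $index->delete($id);
-         $deleted++;
-     }
-     $index->addDocument($newDoc);
-     return $deleted;
- }
- // Usage:
- // $term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
- // replaceDocumentByTerm($index, $term, $newDoc);
- ]]></programlisting>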
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.memory-usage">
- <title>Memory Usage</title>
- <para>
- <classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
- </para>
- <para>
- It uses memory to cache some information and optimize searching and indexing
- performance.
- </para>
- <para>
- The memory required differs for different modes.
- </para>
- <para>
- The terms dictionary index is loaded during the search. It actually contains every
- 128<superscript>th</superscript>
- <footnote>
- <para>
- The Lucene file format allows you to configure this number, but
- <classname>Zend_Search_Lucene</classname> doesn't expose this in its
- <acronym>API</acronym>. Nevertheless you still have the ability to configure
- this value if the index is prepared with another Lucene implementation.
- </para>
- </footnote>
- term of the full dictionary.
- </para>
- <para>
- Thus memory usage increases if you have a high number of unique terms. This may
- happen if you use untokenized phrases as field values or index a large volume of
- non-text information.
- </para>
- <para>
- An unoptimized index consists of several segments, which also increases memory usage.
- Segments are independent, so each segment contains its own terms dictionary and terms
- dictionary index. If an index consists of <emphasis>N</emphasis> segments, it may
- increase memory usage by <emphasis>N</emphasis> times in the worst case. Perform index
- optimization to merge all segments into one to avoid such memory consumption.
- </para>
- <para>
- Indexing uses the same memory as searching plus memory for buffering documents. The
- amount of memory used may be managed with the <emphasis>MaxBufferedDocs</emphasis>
- parameter.
- </para>
- <para>
- Index optimization (full or partial) uses stream-style data processing and doesn't
- require a lot of memory.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.encoding">
- <title>Encoding</title>
- <para>
- <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all
- strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
- </para>
- <para>
- You shouldn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
- data, but you should be careful if this is not the case.
- </para>
- <para>
- A wrong encoding may cause error notices at encoding conversion time, or loss of data.
- </para>
- <para>
- <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
- for indexed documents and parsed queries.
- </para>
- <para>
- Encoding may be explicitly specified as an optional parameter of field creation methods:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('title',
- $title,
- 'iso-8859-1'));
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
- $contents,
- 'utf-8'));
- ]]></programlisting>
- <para>
- This is the best way to avoid ambiguity in the encoding used.
- </para>
- <para>
- If the optional encoding parameter is omitted, then the current locale is used. The
- current locale may contain character encoding data in addition to the language specification:
- </para>
- <programlisting language="php"><![CDATA[
- setlocale(LC_ALL, 'fr_FR');
- ...
- setlocale(LC_ALL, 'de_DE.iso-8859-1');
- ...
- setlocale(LC_ALL, 'ru_RU.UTF-8');
- ...
- ]]></programlisting>
- <para>
- The same approach is used to set query string encoding.
- </para>
- <para>
- If encoding is not specified, then the current locale is used to determine the encoding.
- </para>
- <para>
- Encoding may be passed as an optional parameter, if the query is parsed explicitly
- before search:
- </para>
- <programlisting language="php"><![CDATA[
- $query =
- Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
- $hits = $index->find($query);
- ...
- ]]></programlisting>
- <para>
- The default encoding may also be specified with the
- <methodname>setDefaultEncoding()</methodname> method:
- </para>
- <programlisting language="php"><![CDATA[
- Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
- $hits = $index->find($queryStr);
- ...
- ]]></programlisting>
- <para>
- The empty string implies 'current locale'.
- </para>
- <para>
- If the correct encoding is specified, the input can be correctly processed by the
- analyzer. The actual behavior depends on which analyzer is used. See the <link
- linkend="zend.search.lucene.charset">Character Set</link> documentation section for
- details.
- </para>
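- <para>
- One way to remove any dependence on the locale is to convert input to UTF-8 yourself and
- always pass the encoding explicitly. The sketch below assumes the
- <acronym>PHP</acronym> iconv extension is available; the helper name
- <methodname>toUtf8()</methodname> is hypothetical.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: converts text from a known source encoding to UTF-8.
- // Throws instead of silently losing data when the input cannot be converted.
- function toUtf8($text, $sourceEncoding)
- {
-     // The @ operator hides the notice iconv() raises on illegal input;
-     // the failure is reported through the exception below instead.
-     $converted = @iconv($sourceEncoding, 'UTF-8', $text);
-     if ($converted === false) {
-         throw new InvalidArgumentException(
-             "Cannot convert text from $sourceEncoding to UTF-8"
-         );
-     }
-     return $converted;
- }
- // Usage:
- // $doc->addField(Zend_Search_Lucene_Field::Text('title',
- //                                               toUtf8($title, 'ISO-8859-1'),
- //                                               'UTF-8'));
- ]]></programlisting>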
- </sect2>
- <sect2 id="zend.search.lucene.best-practice.maintenance">
- <title>Index maintenance</title>
- <para>
- It should be clear that <classname>Zend_Search_Lucene</classname>, like any other
- Lucene implementation, is not a "database".
- </para>
- <para>
- Indexes should not be used for data storage. They do not provide partial backup/restore
- functionality, journaling, logging, transactions and many other features associated with
- database management systems.
- </para>
- <para>
- Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
- consistent state at all times.
- </para>
- <para>
- Index backup and restoration should be performed by copying the contents of the index
- folder.
- </para>
- <para>
- If index corruption occurs for any reason, the corrupted index should be restored or
- completely rebuilt.
- </para>
- <para>
- So it's a good idea to back up large indexes and store changelogs to perform manual
- restoration and roll-forward operations if necessary. This practice dramatically reduces
- index restoration time.
- </para>
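- <para>
- Copying the index folder can be scripted with standard <acronym>PHP</acronym> file
- functions. The following is a minimal sketch, not a complete backup tool: the helper
- name <methodname>backupIndexFolder()</methodname> is hypothetical, and it assumes that
- no other process writes to the index while the copy is running.
- </para>
- <programlisting language="php"><![CDATA[
- // Hypothetical helper: copies every file in the index folder to a backup folder.
- // Returns the number of files copied.
- function backupIndexFolder($indexPath, $backupPath)
- {
-     if (!is_dir($indexPath)) {
-         throw new RuntimeException("Index folder not found: $indexPath");
-     }
-     if (!is_dir($backupPath) && !mkdir($backupPath, 0777, true)) {
-         throw new RuntimeException("Cannot create backup folder: $backupPath");
-     }
-     $copied = 0;
-     foreach (scandir($indexPath) as $entry) {
-         $source = $indexPath . DIRECTORY_SEPARATOR . $entry;
-         if (!is_file($source)) {
-             continue; // skips '.', '..' and any subfolders
-         }
-         if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $entry)) {
-             throw new RuntimeException("Cannot copy index file: $entry");
-         }
-         $copied++;
-     }
-     return $copied;
- }
- // Usage:
- // backupIndexFolder($indexPath, $indexPath . '.backup');
- ]]></programlisting>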
- </sect2>
- </sect1>