| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292 |
- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.index-creation">
- <title>Building Indexes</title>
- <sect2 id="zend.search.lucene.index-creation.creating">
- <title>Creating a New Index</title>
- <para>
- Index creation and updating capabilities are implemented within the <classname>Zend_Search_Lucene</classname> component, as well as the Java Lucene project.
- You can use either of these options to create indexes that <classname>Zend_Search_Lucene</classname> can search.
- </para>
- <para>
- The PHP code listing below provides an example of how to index a file
- using <classname>Zend_Search_Lucene</classname> indexing API:
- </para>
- <programlisting language="php"><![CDATA[
- // Create index
- $index = Zend_Search_Lucene::create('/data/my-index');
- $doc = new Zend_Search_Lucene_Document();
- // Store document URL to identify it in the search results
- $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
- // Index document contents
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));
- // Add document to the index
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- Newly added documents are immediately searchable in the index.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.updating">
- <title>Updating Index</title>
- <para>
- The same procedure is used to update an existing index. The only difference
- is that the open() method is called instead of the create() method:
- </para>
- <programlisting language="php"><![CDATA[
- // Open existing index
- $index = Zend_Search_Lucene::open('/data/my-index');
- $doc = new Zend_Search_Lucene_Document();
- // Store document URL to identify it in search result.
- $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
- // Index document content
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
- $docContent));
- // Add document to the index.
- $index->addDocument($doc);
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.document-updating">
- <title>Updating Documents</title>
- <para>
- The Lucene index file format doesn't support document updating.
- Documents should be removed and re-added to the index to effectively update them.
- </para>
- <para>
- <classname>Zend_Search_Lucene::delete()</classname> method operates with an internal index document id. It can be retrieved
- from a query hit by 'id' property:
- </para>
- <programlisting language="php"><![CDATA[
- $removePath = ...;
- $hits = $index->find('path:' . $removePath);
- foreach ($hits as $hit) {
- $index->delete($hit->id);
- }
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.counting">
- <title>Retrieving Index Size</title>
- <para>
- There are two methods to retrieve the size of an index in <classname>Zend_Search_Lucene</classname>.
- </para>
- <para>
- <classname>Zend_Search_Lucene::maxDoc()</classname> returns one greater than the largest possible document number.
- It's actually the overall number of the documents in the index including deleted documents,
- so it has a synonym: <classname>Zend_Search_Lucene::count()</classname>.
- </para>
- <para>
- <classname>Zend_Search_Lucene::numDocs()</classname> returns the total number of non-deleted documents.
- </para>
- <programlisting language="php"><![CDATA[
- $indexSize = $index->count();
- $documents = $index->numDocs();
- ]]></programlisting>
- <para>
- <classname>Zend_Search_Lucene::isDeleted($id)</classname> method may be used to check if a document is deleted.
- </para>
- <programlisting language="php"><![CDATA[
- for ($count = 0; $count < $index->maxDoc(); $count++) {
- if ($index->isDeleted($count)) {
- echo "Document #$id is deleted.\n";
- }
- }
- ]]></programlisting>
- <para>
- Index optimization removes deleted documents and squeezes documents' IDs in to a smaller range.
- A document's internal id may therefore change during index optimization.
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.optimization">
- <title>Index optimization</title>
- <para>
- A Lucene index consists of many segments. Each segment is a completely independent set of data.
- </para>
- <para>
- Lucene index segment files can't be updated by design. A segment update needs full segment
- reorganization. See Lucene index file formats for details
- (<ulink url="http://lucene.apache.org/java/docs/fileformats.html">http://lucene.apache.org/java/docs/fileformats.html</ulink>)
- <footnote>
- <para>The currently supported Lucene index file format is version 2.3 (starting from ZF 1.6).</para>
- </footnote>.
- New documents are added to the index by creating new segment.
- </para>
- <para>
- Increasing number of segments reduces quality of the index, but index optimization restores it.
- Optimization essentially merges several segments into a new one. This process also doesn't update segments.
- It generates one new large segment and updates segment list ('segments' file).
- </para>
- <para>
- Full index optimization can be trigger by calling the <classname>Zend_Search_Lucene::optimize()</classname> method. It merges all
- index segments into one new segment:
- </para>
- <programlisting language="php"><![CDATA[
- // Open existing index
- $index = Zend_Search_Lucene::open('/data/my-index');
- // Optimize index.
- $index->optimize();
- ]]></programlisting>
- <para>
- Automatic index optimization is performed to keep indexes in a consistent state.
- </para>
- <para>
- Automatic optimization is an iterative process managed by several index options. It merges very small segments
- into larger ones, then merges these larger segments into even larger segments and so on.
- </para>
- <sect3 id="zend.search.lucene.index-creation.optimization.maxbuffereddocs">
- <title>MaxBufferedDocs auto-optimization option</title>
- <para>
- <emphasis>MaxBufferedDocs</emphasis> is a minimal number of documents required before
- the buffered in-memory documents are written into a new segment.
- </para>
- <para>
- <emphasis>MaxBufferedDocs</emphasis> can be retrieved or set by <code>$index->getMaxBufferedDocs()</code> or
- <code>$index->setMaxBufferedDocs($maxBufferedDocs)</code> calls.
- </para>
- <para>
- Default value is 10.
- </para>
- </sect3>
- <sect3 id="zend.search.lucene.index-creation.optimization.maxmergedocs">
- <title>MaxMergeDocs auto-optimization option</title>
- <para>
- <emphasis>MaxMergeDocs</emphasis> is a largest number of documents ever merged by addDocument().
- Small values (e.g., less than 10.000) are best for interactive indexing, as this limits the length
- of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier
- searches.
- </para>
- <para>
- <emphasis>MaxMergeDocs</emphasis> can be retrieved or set by <code>$index->getMaxMergeDocs()</code> or
- <code>$index->setMaxMergeDocs($maxMergeDocs)</code> calls.
- </para>
- <para>
- Default value is PHP_INT_MAX.
- </para>
- </sect3>
- <sect3 id="zend.search.lucene.index-creation.optimization.mergefactor">
- <title>MergeFactor auto-optimization option</title>
- <para>
- <emphasis>MergeFactor</emphasis> determines how often segment indices are merged by addDocument().
- With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster,
- but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches
- on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch
- index creation, and smaller values (< 10) for indices that are interactively maintained.
- </para>
- <para>
- <emphasis>MergeFactor</emphasis> is a good estimation for average number of segments merged by one auto-optimization pass.
- Too large values produce large number of segments while they are not merged into new one. It may be a cause of
- "failed to open stream: Too many open files" error message. This limitation is system dependent.
- </para>
- <para>
- <emphasis>MergeFactor</emphasis> can be retrieved or set by <code>$index->getMergeFactor()</code> or
- <code>$index->setMergeFactor($mergeFactor)</code> calls.
- </para>
- <para>
- Default value is 10.
- </para>
- <para>
- Lucene Java and Luke (Lucene Index Toolbox - <ulink url="http://www.getopt.org/luke/">http://www.getopt.org/luke/</ulink>)
- can also be used to optimize an index. Latest Luke release (v0.8) is based on Lucene v2.3 and compatible with
- current implementation of <classname>Zend_Search_Lucene</classname> component (ZF 1.6). Earlier versions of <classname>Zend_Search_Lucene</classname> implementations
- need another versions of Java Lucene tools to be compatible:
- <itemizedlist>
- <listitem>
- <para>ZF 1.5 - Java Lucene 2.1 (Luke tool v0.7.1 - <ulink url="http://www.getopt.org/luke/luke-0.7.1/"/>)</para>
- </listitem>
- <listitem>
- <para>ZF 1.0 - Java Lucene 1.4 - 2.1 (Luke tool v0.6 - <ulink url="http://www.getopt.org/luke/luke-0.6/"/>)</para>
- </listitem>
- </itemizedlist>
- </para>
- </sect3>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.permissions">
- <title>Permissions</title>
- <para>
- By default, index files are available for reading and writing by everyone.
- </para>
- <para>
- It's possible to override this with the <classname>Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions()</classname> method:
- </para>
- <programlisting language="php"><![CDATA[
- // Get current default file permissions
- $currentPermissions =
- Zend_Search_Lucene_Storage_Directory_Filesystem::getDefaultFilePermissions();
- // Give read-writing permissions only for current user and group
- Zend_Search_Lucene_Storage_Directory_Filesystem::setDefaultFilePermissions(0660);
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.limitations">
- <title>Limitations</title>
- <sect3 id="zend.search.lucene.index-creation.limitations.index-size">
- <title>Index size</title>
- <para>
- Index size is limited by 2GB for 32-bit platforms.
- </para>
- <para>
- Use 64-bit platforms for larger indices.
- </para>
- </sect3>
- <sect3 id="zend.search.lucene.index-creation.limitations.filesystems">
- <title>Supported Filesystems</title>
- <para>
- <classname>Zend_Search_Lucene</classname> uses <code>flock()</code> to provide concurrent searching, index updating and optimization.
- </para>
- <para>
- According to the PHP <ulink url="http://www.php.net/manual/en/function.flock.php">documentation</ulink>,
- "<code>flock()</code> will not work on NFS and many other networked file systems".
- </para>
- <para>
- Do not use networked file systems with <classname>Zend_Search_Lucene</classname>.
- </para>
- </sect3>
- </sect2>
- </sect1>
- <!--
- vim:se ts=4 sw=4 et:
- -->
|