| 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394 |
- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="learning.lucene.indexing">
- <title>Indexing</title>
- <para>
- Indexing is performed by adding a new document to an existing or new index:
- </para>
- <programlisting language="php"><![CDATA[
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- There are two ways to create document object. The first is to do it manually.
- </para>
- <example id="learning.lucene.indexing.doc-creation">
- <title>Manual Document Construction</title>
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
- $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
- $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $docBody));
- $doc->addField(Zend_Search_Lucene_Field::binary('avatar', $avatarData));
- ]]></programlisting>
- </example>
- <para>
- The second method is to load it from <acronym>HTML</acronym> or Microsoft Office 2007 files:
- </para>
- <example id="learning.lucene.indexing.doc-loading">
- <title>Document loading</title>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
- $doc = Zend_Search_Lucene_Document_Pptx::loadPptFile($path);
- $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
- ]]></programlisting>
- </example>
- <para>
- If a document is loaded from one of the supported formats, it still can be extended manually
- with new user defined fields.
- </para>
- <sect2 id="learning.lucene.indexing.policy">
- <title>Indexing Policy</title>
- <para>
- You should define indexing policy within your application architectural design.
- </para>
- <para>
- You may need an on-demand indexing configuration (something like <acronym>OLTP</acronym>
- system). In such systems, you usually add one document per user request. As such, the
- <emphasis>MaxBufferedDocs</emphasis> option will not affect the system. On the other
- hand, <emphasis>MaxMergeDocs</emphasis> is really helpful as it allows you to limit
- maximum script execution time. <emphasis>MergeFactor</emphasis> should be set to a value
- that keeps balance between the average indexing time (it's also affected by average
- auto-optimization time) and search performance (index optimization level is dependent on
- the number of segments).
- </para>
- <para>
- If you will be primarily performing batch index updates, your configuration should use a
- <emphasis>MaxBufferedDocs</emphasis> option set to the maximum value supported by the
- available amount of memory. <emphasis>MaxMergeDocs</emphasis> and
- <emphasis>MergeFactor</emphasis> have to be set to values reducing auto-optimization
- involvement as much as possible <footnote><para>An additional limit is the maximum file
- handlers supported by the operation system for concurrent open
- operations</para></footnote>. Full index optimization should be applied after
- indexing.
- </para>
- <example id="learning.lucene.indexing.optimization">
- <title>Index optimization</title>
- <programlisting language="php"><![CDATA[
- $index->optimize();
- ]]></programlisting>
- </example>
- <para>
- In some configurations, it's more effective to serialize index updates by organizing
- update requests into a queue and processing several update requests in a single script
- execution. This reduces index opening overhead, and allows utilizing index document
- buffering.
- </para>
- </sect2>
- </sect1>
|