boarspring
/
repo_zf1
mirror of https://github.com/bluescreen/zf1-php7.2.git


			
				
					
						
						
							1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798
							<?xml version="1.0" encoding="UTF-8"?>
<!-- EN-Revision: 19777 -->
<!-- Reviewed: no -->
<sect1 id="learning.lucene.indexing">
    <title>Indizierung</title>

    <para>
        Indizierung wird durch das Hinzufügen eines neuen Dokuments zu einem bestehenden oder neuen
        Index durchgeführt:
    </para>

    <programlisting language="php"><![CDATA[
$index->addDocument($doc);
]]></programlisting>

    <para>
        Es gibt zwei Wege Dokument Objekte zu erstellen. Der erste ist es manuell zu tun.
    </para>

    <example id="learning.lucene.indexing.doc-creation">
        <title>Manuelle Dokument Erstellung</title>

        <programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::unStored('contents', $docBody));
$doc->addField(Zend_Search_Lucene_Field::binary('avatar', $avatarData));
]]></programlisting>
    </example>

    <para>
        Die zweite Methode ist das Laden von <acronym>HTML</acronym> oder von Microsoft Office 2007
        Dateien:
    </para>

    <example id="learning.lucene.indexing.doc-loading">
        <title>Laden vom Dokument</title>

        <programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
$doc = Zend_Search_Lucene_Document_Pptx::loadPptFile($path);
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
]]></programlisting>
    </example>

    <para>
        Wenn ein Dokument von einem der unterstützten Formate geladen wird, kann es trotzdem noch
        manuell mit einem neuen benutzerdefinierten Feld erweitert werden.
    </para>

    <sect2 id="learning.lucene.indexing.policy">
        <title>Indizierungs Policy</title>

        <para>
            Man sollte eine Indizierungs Policy im Architektur Design der eigenen Anwendung
            definieren.
        </para>

        <para>
            You may need an on-demand indexing configuration (something like <acronym>OLTP</acronym>
            system). In such systems, you usually add one document per user request. As such, the
            <emphasis>MaxBufferedDocs</emphasis> option will not affect the system. On the other
            hand, <emphasis>MaxMergeDocs</emphasis> is really helpful as it allows you to limit
            maximum script execution time. <emphasis>MergeFactor</emphasis> should be set to a value
            that keeps balance between the average indexing time (it's also affected by average
            auto-optimization time) and search performance (index optimization level is dependent on
            the number of segments).
        </para>

        <para>
            If you will be primarily performing batch index updates, your configuration should use a
            <emphasis>MaxBufferedDocs</emphasis> option set to the maximum value supported by the
            available amount of memory. <emphasis>MaxMergeDocs</emphasis> and
            <emphasis>MergeFactor</emphasis> have to be set to values reducing auto-optimization
            involvement as much as possible <footnote><para>An additional limit is the maximum file
                    handlers supported by the operation system for concurrent open
                    operations</para></footnote>. Full index optimization should be applied after
            indexing.
        </para>

        <example id="learning.lucene.indexing.optimization">
            <title>Index optimization</title>

            <programlisting language="php"><![CDATA[
$index->optimize();
]]></programlisting>
        </example>

        <para>
            In some configurations, it's more effective to serialize index updates by organizing
            update requests into a queue and processing several update requests in a single script
            execution. This reduces index opening overhead, and allows utilizing index document
            buffering.
        </para>
    </sect2>
</sect1>