- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.overview">
- <title>Overview</title>
- <sect2 id="zend.search.lucene.introduction">
- <title>Introduction</title>
- <para>
- <classname>Zend_Search_Lucene</classname> is a general purpose text search engine
- written entirely in <acronym>PHP</acronym> 5. Since it stores its index on the
- filesystem and does not require a database server, it can add search capabilities to
- almost any <acronym>PHP</acronym>-driven website.
- <classname>Zend_Search_Lucene</classname> supports the following features:
- <itemizedlist>
- <listitem>
- <para>Ranked searching - best results returned first</para>
- </listitem>
- <listitem>
- <para>
- Many powerful query types: phrase queries, boolean queries, wildcard queries,
- proximity queries, range queries and many others.
- </para>
- </listitem>
- <listitem>
- <para>Search by specific field (e.g., title, author, contents)</para>
- </listitem>
- </itemizedlist>
- <classname>Zend_Search_Lucene</classname> was derived from the Apache Lucene project.
- Starting from Zend Framework 1.6, the supported Lucene index format versions are 1.4 -
- 2.3. For more information on Lucene, visit <ulink
- url="http://lucene.apache.org/java/docs/"/>.
- </para>
- <note>
- <title/>
- <para>
- Previous <classname>Zend_Search_Lucene</classname> implementations supported the
- Lucene 1.4 (1.9) - 2.1 index formats.
- </para>
- <para>
- Starting from Zend Framework 1.5, any index created using a pre-2.1 index format is
- automatically upgraded to the Lucene 2.1 format after the
- <classname>Zend_Search_Lucene</classname> update and will not be compatible with the
- <classname>Zend_Search_Lucene</classname> implementations included in Zend
- Framework 1.0.x.
- </para>
- </note>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.documents-and-fields">
- <title>Document and Field Objects</title>
- <para>
- <classname>Zend_Search_Lucene</classname> operates with documents as atomic objects for
- indexing. A document is divided into named fields, and fields have content that can be
- searched.
- </para>
- <para>
- A document is represented by the <classname>Zend_Search_Lucene_Document</classname>
- class, and objects of this class contain instances of
- <classname>Zend_Search_Lucene_Field</classname> that represent the fields of the
- document.
- </para>
- <para>
- It is important to note that any information can be added to the index.
- Application-specific information or metadata can be stored in the document
- fields, and later retrieved with the document during search.
- </para>
- <para>
- It is the responsibility of your application to control the indexer.
- This means that data can be indexed from any source
- that is accessible by your application. For example, this could be the
- filesystem, a database, an <acronym>HTML</acronym> form, etc.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Field</classname> class provides several static methods to
- create fields with different characteristics:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- // Field is not tokenized, but is indexed and stored within the index.
- // Stored fields can be retrieved from the index.
- $doc->addField(Zend_Search_Lucene_Field::Keyword('doctype',
- 'autogenerated'));
- // Field is neither tokenized nor indexed, but is stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
- time()));
- // Binary string field that is neither tokenized nor indexed,
- // but is stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::Binary('icon',
- $iconData));
- // Field is tokenized and indexed, and is stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::Text('annotation',
- 'Document annotation text'));
- // Field is tokenized and indexed, but is not stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
- 'My document content'));
- ]]></programlisting>
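- <para>
- The characteristics named in the comments above can be inspected on the field objects
- themselves. The following is a minimal sketch, assuming the Zend Framework 1 library
- directory is on the include path; the sample values (including the fake icon bytes) are
- placeholders chosen for illustration:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- // One field per factory method; all values are illustrative placeholders.
- $fields = array(
-     Zend_Search_Lucene_Field::Keyword('doctype', 'autogenerated'),
-     Zend_Search_Lucene_Field::UnIndexed('created', time()),
-     Zend_Search_Lucene_Field::Binary('icon', "\x89PNG"),
-     Zend_Search_Lucene_Field::Text('annotation', 'Document annotation text'),
-     Zend_Search_Lucene_Field::UnStored('contents', 'My document content'),
- );
-
- // Print the flags that each factory method sets on the field it returns.
- foreach ($fields as $field) {
-     printf(
-         "%-10s stored=%d indexed=%d tokenized=%d binary=%d\n",
-         $field->name,
-         $field->isStored,
-         $field->isIndexed,
-         $field->isTokenized,
-         $field->isBinary
-     );
- }
- ]]></programlisting>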
- <para>
- Each of these methods (excluding the
- <methodname>Zend_Search_Lucene_Field::Binary()</methodname> method) has an optional
- <varname>$encoding</varname> parameter for specifying input data encoding.
- </para>
- <para>
- Encoding may differ for different documents as well as for different fields within one
- document:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('title',
- $title,
- 'iso-8859-1'));
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
- $contents,
- 'utf-8'));
- ]]></programlisting>
- <para>
- If the encoding parameter is omitted, then the current locale is used at processing time.
- For example:
- </para>
- <programlisting language="php"><![CDATA[
- setlocale(LC_ALL, 'de_DE.iso-8859-1');
- ...
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));
- ]]></programlisting>
- <para>
- Fields are always stored and returned from the index in UTF-8 encoding. Any required
- conversion to UTF-8 happens automatically.
- </para>
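- <para>
- The conversion can be observed on a single field without creating an index. This is a
- small sketch, assuming the Zend Framework 1 library directory is on the include path; the
- sample string is made up for the example:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- // "Cafe" with an e-acute, encoded as ISO-8859-1 (the accented letter is the byte 0xE9).
- $latin1Title = "Caf\xE9";
-
- $field = Zend_Search_Lucene_Field::Text('title', $latin1Title, 'iso-8859-1');
-
- // The field keeps the input bytes; getUtf8Value() returns them converted to UTF-8,
- // where the same accented letter takes two bytes.
- echo bin2hex($field->value), PHP_EOL;
- echo bin2hex($field->getUtf8Value()), PHP_EOL;
- ]]></programlisting>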
- <para>
- Text analyzers (<link linkend="zend.search.lucene.extending.analysis">see below</link>)
- may also convert text to other encodings. The default analyzer, for instance, converts
- text to the 'ASCII//TRANSLIT' encoding. Be careful, however: the result of this
- transliteration may depend on the current locale.
- </para>
- <para>
- Field names are defined at your discretion when fields are added with the
- <methodname>addField()</methodname> method.
- </para>
- <para>
- Java Lucene uses the 'contents' field as a default field to search.
- <classname>Zend_Search_Lucene</classname> searches through all fields by default, but
- the behavior is configurable. See the <link
- linkend="zend.search.lucene.query-language.fields">"Default search field"</link>
- chapter for details.
- </para>
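- <para>
- As a brief illustration of that setting, the sketch below restricts unqualified query
- terms to the 'contents' field. It assumes the Zend Framework 1 library directory is on the
- include path and that an index already exists at the hypothetical path shown:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- // Hypothetical index location; the index is assumed to exist already.
- $index = Zend_Search_Lucene::open('/path/to/index');
-
- // Default behavior: the term is searched in all fields.
- $hits = $index->find('lucene');
-
- // Mimic Java Lucene: unqualified terms are searched in 'contents' only.
- Zend_Search_Lucene::setDefaultSearchField('contents');
- $hits = $index->find('lucene');
-
- // A field prefix in the query still selects a specific field.
- $hits = $index->find('title:lucene');
- ]]></programlisting>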
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.understanding-field-types">
- <title>Understanding Field Types</title>
- <itemizedlist>
- <listitem>
- <para>
- <code>Keyword</code> fields are stored and indexed, meaning that they can be
- searched as well as displayed in search results. They are not split up into
- separate words by tokenization. Enumerated database fields usually translate
- well to Keyword fields in <classname>Zend_Search_Lucene</classname>.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>UnIndexed</code> fields are not searchable, but they are returned with
- search hits. Database timestamps, primary keys, file system paths, and other
- external identifiers are good candidates for UnIndexed fields.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>Binary</code> fields are not tokenized or indexed, but are stored for
- retrieval with search hits. They can be used to store any data encoded as a
- binary string, such as an image icon.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>Text</code> fields are stored, indexed, and tokenized. Text fields are
- appropriate for storing information like subjects and titles that need to be
- searchable as well as returned with search results.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>UnStored</code> fields are tokenized and indexed, but not stored in the
- index. Large amounts of text are best indexed using this type of field. Storing
- data creates a larger index on disk, so if you need to search but not redisplay
- the data, use an UnStored field. UnStored fields are practical when using a
- <classname>Zend_Search_Lucene</classname> index in combination with a relational
- database. You can index large data fields with UnStored fields for searching,
- and retrieve them from your relational database by using a separate field as an
- identifier.
- </para>
- <table id="zend.search.lucene.index-creation.understanding-field-types.table">
- <title>Zend_Search_Lucene_Field Types</title>
- <tgroup cols="5">
- <thead>
- <row>
- <entry>Field Type</entry>
- <entry>Stored</entry>
- <entry>Indexed</entry>
- <entry>Tokenized</entry>
- <entry>Binary</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Keyword</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- <entry>No</entry>
- </row>
- <row>
- <entry>UnIndexed</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- <entry>No</entry>
- <entry>No</entry>
- </row>
- <row>
- <entry>Binary</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- <entry>No</entry>
- <entry>Yes</entry>
- </row>
- <row>
- <entry>Text</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- </row>
- <row>
- <entry>UnStored</entry>
- <entry>No</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- </row>
- </tbody>
- </tgroup>
- </table>
- </listitem>
- </itemizedlist>
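- <para>
- The combination of UnStored fields and a relational database described above might look
- like the following sketch. The <varname>$rows</varname> array stands in for a database
- table, and the index path and the 'rowId' field name are assumptions made for the
- example. The identifier field is deliberately not called 'id', to avoid confusion with
- the <code>id</code> property of search hits:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- // Hypothetical rows standing in for a relational database table.
- $rows = array(
-     array('id' => 1, 'title' => 'First article',
-           'body' => 'Long text about search engines'),
-     array('id' => 2, 'title' => 'Second article',
-           'body' => 'Long text about relational databases'),
- );
-
- // Hypothetical index location.
- $index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/zsl-demo-index');
-
- foreach ($rows as $row) {
-     $doc = new Zend_Search_Lucene_Document();
-     // Stored but not searchable: the key used to find the row again.
-     $doc->addField(Zend_Search_Lucene_Field::UnIndexed('rowId', (string) $row['id']));
-     // Stored and searchable: shown in the result list.
-     $doc->addField(Zend_Search_Lucene_Field::Text('title', $row['title']));
-     // Searchable but not stored: keeps the index small.
-     $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $row['body']));
-     $index->addDocument($doc);
- }
- $index->commit();
-
- foreach ($index->find('contents:databases') as $hit) {
-     // Only stored fields are available on the hit; the full body would be
-     // fetched from the database using $hit->rowId.
-     echo $hit->rowId, ': ', $hit->title, PHP_EOL;
- }
- ]]></programlisting>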
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.html-documents">
- <title>HTML documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers an <acronym>HTML</acronym> parsing
- feature. Documents can be created directly from an <acronym>HTML</acronym> file or
- string:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
- $index->addDocument($doc);
- ...
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene_Document_Html</classname> class uses the
- <methodname>DOMDocument::loadHTML()</methodname> and
- <methodname>DOMDocument::loadHTMLFile()</methodname> methods to parse the source
- <acronym>HTML</acronym>, so the <acronym>HTML</acronym> does not need to be well formed
- or to be <acronym>XHTML</acronym>. On the other hand, parsing is sensitive to the
- encoding specified by the "meta http-equiv" header tag.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Html</classname> class recognizes the document
- title, the body and the document header meta tags.
- </para>
- <para>
- The 'title' field is actually the /html/head/title value. It's stored within the index,
- tokenized and available for search.
- </para>
- <para>
- The 'body' field is the actual body content of the <acronym>HTML</acronym> file or
- string. It doesn't include scripts, comments or attributes.
- </para>
- <para>
- The <methodname>loadHTML()</methodname> and <methodname>loadHTMLFile()</methodname>
- methods of the <classname>Zend_Search_Lucene_Document_Html</classname> class also accept
- an optional second argument. If it's set to <constant>TRUE</constant>, then the body
- content is also stored within the index and can be retrieved from it. By default, the
- body is tokenized and indexed, but not stored.
- </para>
- <para>
- The optional third parameter of the <methodname>loadHTML()</methodname> and
- <methodname>loadHTMLFile()</methodname> methods specifies the encoding of the source
- <acronym>HTML</acronym> document. It's used only if the encoding is not specified by a
- Content-type HTTP-EQUIV meta tag.
- </para>
- <para>
- Other document header meta tags produce additional document fields. The field name is
- taken from the 'name' attribute, and the 'content' attribute supplies the field value.
- These fields are tokenized, indexed and stored, so documents may be searched by their
- meta tags (for example, by keywords).
- </para>
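- <para>
- To see which fields the parser produces, the fields of a parsed document can be listed.
- The sketch below assumes the Zend Framework 1 library directory is on the include path;
- the <acronym>HTML</acronym> string is invented for the example and intentionally leaves
- some tags unclosed:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- // Sample markup: declares its encoding, has one named meta tag,
- // and is deliberately not well formed.
- $html = '<html><head>'
-       . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
-       . '<title>Release notes</title>'
-       . '<meta name="keywords" content="search, lucene, php">'
-       . '</head><body><p>Unclosed paragraph <b>with bold text</body></html>';
-
- // TRUE as the second argument stores the body text in addition to indexing it.
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($html, true);
-
- // List every field with its stored flag and value.
- foreach ($doc->getFieldNames() as $name) {
-     $field = $doc->getField($name);
-     printf("%s (stored=%d): %s\n", $name, $field->isStored, $field->value);
- }
- ]]></programlisting>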
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::Text('annotation',
- 'Document annotation text'));
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- Document links are not included in the generated document, but may be retrieved with
- the <methodname>Zend_Search_Lucene_Document_Html::getLinks()</methodname> and
- <methodname>Zend_Search_Lucene_Document_Html::getHeaderLinks()</methodname> methods:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $linksArray = $doc->getLinks();
- $headerLinksArray = $doc->getHeaderLinks();
- ]]></programlisting>
- <para>
- Starting from Zend Framework 1.6, it's also possible to exclude links whose
- <code>rel</code> attribute is set to <code>'nofollow'</code>. Use
- <methodname>Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true)</methodname>
- to turn on this option.
- </para>
- <para>
- The <methodname>Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks()</methodname>
- method returns the current state of the "Exclude nofollow links" flag.
- </para>
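- <para>
- A short sketch of the flag in use follows. It assumes the Zend Framework 1 library
- directory is on the include path; the markup and <acronym>URL</acronym>s are made up for
- the example:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- // Sample markup with one ordinary link and one nofollow link.
- $html = '<html><head>'
-       . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
-       . '<title>Links</title></head><body>'
-       . '<a href="http://example.com/followed">followed</a> '
-       . '<a href="http://example.com/ignored" rel="nofollow">ignored</a>'
-       . '</body></html>';
-
- // The flag is static, so it applies to every document parsed afterwards.
- Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);
-
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($html);
- print_r($doc->getLinks());
-
- // Switch the option off again so that later parsing is unaffected.
- Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(false);
- ]]></programlisting>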
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.docx-documents">
- <title>Word 2007 documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers a Word 2007 parsing feature. Documents
- can be created directly from a Word 2007 file:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene_Document_Docx</classname> class uses the
- <code>ZipArchive</code> class and <code>simplexml</code> methods to parse the source
- document. If the <code>ZipArchive</code> class (from module php_zip) is not available,
- the <classname>Zend_Search_Lucene_Document_Docx</classname> class is not available
- either.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Docx</classname> class recognizes document
- metadata and document text. Depending on the document contents, the metadata may include
- filename, title, subject, creator, keywords, description, lastModifiedBy, revision,
- modified, and created.
- </para>
- <para>
- The 'filename' field is the actual Word 2007 file name.
- </para>
- <para>
- The 'title' field is the actual document title.
- </para>
- <para>
- The 'subject' field is the actual document subject.
- </para>
- <para>
- The 'creator' field is the actual document creator.
- </para>
- <para>
- The 'keywords' field contains the actual document keywords.
- </para>
- <para>
- The 'description' field is the actual document description.
- </para>
- <para>
- The 'lastModifiedBy' field is the name of the user who last modified the document.
- </para>
- <para>
- The 'revision' field is the actual document revision number.
- </para>
- <para>
- The 'modified' field is the actual document last modified date / time.
- </para>
- <para>
- The 'created' field is the actual document creation date / time.
- </para>
- <para>
- The 'body' field is the actual body content of the Word 2007 document. It includes only
- normal text; comments and revisions are not included.
- </para>
- <para>
- The <methodname>loadDocxFile()</methodname> method of the
- <classname>Zend_Search_Lucene_Document_Docx</classname> class also accepts an optional
- second argument. If it's set to <constant>TRUE</constant>, then the body content is also
- stored within the index and can be retrieved from it. By default, the body is tokenized
- and indexed, but not stored.
- </para>
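- <para>
- The sketch below combines the availability check with the optional second argument. It
- assumes the Zend Framework 1 library directory is on the include path; the file path is
- hypothetical:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- $filename = '/path/to/report.docx'; // hypothetical file
-
- if (!class_exists('ZipArchive')) {
-     // Without the zip extension the Docx loader cannot be used.
-     echo "ZipArchive is not available; skipping $filename\n";
- } else {
-     // TRUE as the second argument stores the body text as well as indexing it.
-     $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename, true);
-
-     // Which metadata fields are present depends on the document contents.
-     foreach ($doc->getFieldNames() as $name) {
-         echo $name, ': ', $doc->getFieldValue($name), "\n";
-     }
- }
- ]]></programlisting>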
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
- 'indexTime',
- time())
- );
- $doc->addField(Zend_Search_Lucene_Field::Text(
- 'annotation',
- 'Document annotation text')
- );
- $index->addDocument($doc);
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.pptx-documents">
- <title>Powerpoint 2007 documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers a Powerpoint 2007 parsing feature.
- Documents can be created directly from a Powerpoint 2007 file:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene_Document_Pptx</classname> class uses the
- <code>ZipArchive</code> class and <code>simplexml</code> methods to parse the source
- document. If the <code>ZipArchive</code> class (from module php_zip) is not available,
- the <classname>Zend_Search_Lucene_Document_Pptx</classname> class is not available
- either.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Pptx</classname> class recognizes document
- metadata and document text. Depending on the document contents, the metadata may include
- filename, title, subject, creator, keywords, description, lastModifiedBy, revision,
- modified, and created.
- </para>
- <para>
- The 'filename' field is the actual Powerpoint 2007 file name.
- </para>
- <para>
- The 'title' field is the actual document title.
- </para>
- <para>
- The 'subject' field is the actual document subject.
- </para>
- <para>
- The 'creator' field is the actual document creator.
- </para>
- <para>
- The 'keywords' field contains the actual document keywords.
- </para>
- <para>
- The 'description' field is the actual document description.
- </para>
- <para>
- The 'lastModifiedBy' field is the name of the user who last modified the document.
- </para>
- <para>
- The 'revision' field is the actual document revision number.
- </para>
- <para>
- The 'modified' field is the actual document last modified date / time.
- </para>
- <para>
- The 'created' field is the actual document creation date / time.
- </para>
- <para>
- The 'body' field is the actual content of all slides and slide notes in the Powerpoint
- 2007 document.
- </para>
- <para>
- The <methodname>loadPptxFile()</methodname> method of the
- <classname>Zend_Search_Lucene_Document_Pptx</classname> class also accepts an optional
- second argument. If it's set to <constant>TRUE</constant>, then the body content is also
- stored within the index and can be retrieved from it. By default, the body is tokenized
- and indexed, but not stored.
- </para>
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
- 'indexTime',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::Text(
- 'annotation',
- 'Document annotation text'));
- $index->addDocument($doc);
- ]]></programlisting>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.xlsx-documents">
- <title>Excel 2007 documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers an Excel 2007 parsing feature. Documents
- can be created directly from an Excel 2007 file:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
- $index->addDocument($doc);
- ]]></programlisting>
- <para>
- The <classname>Zend_Search_Lucene_Document_Xlsx</classname> class uses the
- <code>ZipArchive</code> class and <code>simplexml</code> methods to parse the source
- document. If the <code>ZipArchive</code> class (from module php_zip) is not available,
- the <classname>Zend_Search_Lucene_Document_Xlsx</classname> class is not available
- either.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Xlsx</classname> class recognizes document
- metadata and document text. Depending on the document contents, the metadata may include
- filename, title, subject, creator, keywords, description, lastModifiedBy, revision,
- modified, and created.
- </para>
- <para>
- The 'filename' field is the actual Excel 2007 file name.
- </para>
- <para>
- The 'title' field is the actual document title.
- </para>
- <para>
- The 'subject' field is the actual document subject.
- </para>
- <para>
- The 'creator' field is the actual document creator.
- </para>
- <para>
- The 'keywords' field contains the actual document keywords.
- </para>
- <para>
- The 'description' field is the actual document description.
- </para>
- <para>
- The 'lastModifiedBy' field is the name of the user who last modified the document.
- </para>
- <para>
- The 'revision' field is the actual document revision number.
- </para>
- <para>
- The 'modified' field is the actual document last modified date / time.
- </para>
- <para>
- The 'created' field is the actual document creation date / time.
- </para>
- <para>
- The 'body' field is the actual content of all cells in all worksheets of the Excel 2007
- document.
- </para>
- <para>
- The <methodname>loadXlsxFile()</methodname> method of the
- <classname>Zend_Search_Lucene_Document_Xlsx</classname> class also accepts an optional
- second argument. If it's set to <constant>TRUE</constant>, then the body content is also
- stored within the index and can be retrieved from it. By default, the body is tokenized
- and indexed, but not stored.
- </para>
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
- 'indexTime',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::Text(
- 'annotation',
- 'Document annotation text'));
- $index->addDocument($doc);
- ]]></programlisting>
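- <para>
- When a directory contains a mix of the formats covered in this chapter, the loader can
- be chosen by file extension. The helper function
- <methodname>loadDocumentByExtension()</methodname>, the 'path' field and both paths below
- are hypothetical names introduced for this sketch, which also assumes that the Zend
- Framework 1 library directory is on the include path and that the zip extension is
- loaded:
- </para>
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
-
- /**
-  * Hypothetical helper: pick a loader based on the file extension.
-  * Returns NULL for file types that are not handled here.
-  */
- function loadDocumentByExtension($filename)
- {
-     switch (strtolower(pathinfo($filename, PATHINFO_EXTENSION))) {
-         case 'htm':
-         case 'html':
-             return Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
-         case 'docx':
-             return Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
-         case 'pptx':
-             return Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
-         case 'xlsx':
-             return Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
-         default:
-             return null;
-     }
- }
-
- $index = Zend_Search_Lucene::create('/path/to/index'); // hypothetical location
-
- // glob() may return FALSE on error, so fall back to an empty list.
- $files = glob('/path/to/files/*'); // hypothetical directory
- foreach ($files ? $files : array() as $filename) {
-     $doc = loadDocumentByExtension($filename);
-     if ($doc !== null) {
-         // Remember where the document came from; stored but not searchable.
-         $doc->addField(Zend_Search_Lucene_Field::UnIndexed('path', $filename));
-         $index->addDocument($doc);
-     }
- }
- $index->commit();
- ]]></programlisting>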
- </sect2>
- </sect1>