- <?xml version="1.0" encoding="UTF-8"?>
- <!-- Reviewed: no -->
- <sect1 id="zend.search.lucene.overview">
- <title>Overview</title>
- <sect2 id="zend.search.lucene.introduction">
- <title>Introduction</title>
- <para><classname>Zend_Search_Lucene</classname> is a general-purpose text search engine written entirely in PHP 5.
- Since it stores its index on the filesystem and does not require a database
- server, it can add search capabilities to almost any PHP-driven website.
- <classname>Zend_Search_Lucene</classname> supports the following features:
- <itemizedlist>
- <listitem>
- <para>Ranked searching - best results returned first</para>
- </listitem>
- <listitem>
- <para>
- Many powerful query types: phrase queries, boolean queries, wildcard queries,
- proximity queries, range queries and many others.
- </para>
- </listitem>
- <listitem>
- <para>Search by specific field (e.g., title, author, contents)</para>
- </listitem>
- </itemizedlist>
- <classname>Zend_Search_Lucene</classname> was derived from the Apache Lucene project. Starting from ZF 1.6, the supported Lucene index format
- versions are 1.4 - 2.3. For more information on Lucene, visit <ulink url="http://lucene.apache.org/java/docs/"/>.
- </para>
- <note>
- <title/>
- <para>
- Previous <classname>Zend_Search_Lucene</classname> implementations supported the Lucene 1.4 (1.9) - 2.1 index formats.
- </para>
- <para>
- Starting from ZF 1.5, any index created using a pre-2.1 index format is automatically upgraded to the Lucene 2.1 format
- after the <classname>Zend_Search_Lucene</classname> update and will not be compatible with the <classname>Zend_Search_Lucene</classname> implementations included in ZF 1.0.x.
- </para>
- </note>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.documents-and-fields">
- <title>Document and Field Objects</title>
- <para>
- <classname>Zend_Search_Lucene</classname> operates with documents as atomic objects for indexing. A document is
- divided into named fields, and fields have content that can be searched.
- </para>
- <para>
- A document is represented by the <classname>Zend_Search_Lucene_Document</classname> class, and objects of this class contain
- instances of <classname>Zend_Search_Lucene_Field</classname> that represent the fields of the document.
- </para>
- <para>
- It is important to note that any information can be added to the index.
- Application-specific information or metadata can be stored in the document
- fields, and later retrieved with the document during search.
- </para>
- <para>
- It is the responsibility of your application to control the indexer.
- This means that data can be indexed from any source
- that is accessible by your application. For example, this could be the
- filesystem, a database, an HTML form, etc.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Field</classname> class provides several static methods to create fields with
- different characteristics:
- </para>
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- // Field is not tokenized, but is indexed and stored within the index.
- // Stored fields can be retrieved from the index.
- $doc->addField(Zend_Search_Lucene_Field::Keyword('doctype',
-                                                  'autogenerated'));
- // Field is neither tokenized nor indexed, but is stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
-                                                    time()));
- // Binary string valued field that is neither tokenized nor indexed,
- // but is stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::Binary('icon',
-                                                 $iconData));
- // Field is tokenized and indexed, and is stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::Text('annotation',
-                                               'Document annotation text'));
- // Field is tokenized and indexed, but is not stored in the index.
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
-                                                   'My document content'));
- ]]></programlisting>
- <para>
- Each of these methods (excluding the <classname>Zend_Search_Lucene_Field::Binary()</classname> method) has an optional
- <code>$encoding</code> parameter for specifying input data encoding.
- </para>
- <para>
- Encoding may differ for different documents as well as for different fields within one document:
- <programlisting language="php"><![CDATA[
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('title',
- $title,
- 'iso-8859-1'));
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
- $contents,
- 'utf-8'));
- ]]></programlisting>
- </para>
- <para>
- If the encoding parameter is omitted, then the current locale is used at processing time. For example:
- <programlisting language="php"><![CDATA[
- setlocale(LC_ALL, 'de_DE.iso-8859-1');
- ...
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));
- ]]></programlisting>
- </para>
- <para>
- Fields are always stored and returned from the index in UTF-8 encoding. Any required conversion to UTF-8 happens
- automatically.
- </para>
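- <para>
- As a minimal sketch of this conversion, assuming Zend Framework 1 is on the <code>include_path</code> and that the installed
- version provides <code>Zend_Search_Lucene_Document::getFieldUtf8Value()</code>, a value supplied in ISO-8859-1 can be read back
- as UTF-8. The sample string is purely illustrative:
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene/Document.php';
- require_once 'Zend/Search/Lucene/Field.php';
- // "Cafe" with an accented final letter, encoded as ISO-8859-1 (byte 0xE9).
- $latin1Title = "Caf\xE9";
- $doc = new Zend_Search_Lucene_Document();
- $doc->addField(Zend_Search_Lucene_Field::Text('title',
-                                               $latin1Title,
-                                               'iso-8859-1'));
- // Read the value back as UTF-8; the accented letter becomes
- // the two-byte sequence 0xC3 0xA9.
- $utf8Title = $doc->getFieldUtf8Value('title');
- ]]></programlisting>
- </para>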
- <para>
- Text analyzers (<link linkend="zend.search.lucene.extending.analysis">see below</link>) may also convert text
- to other encodings. The default analyzer, for instance, converts text to the 'ASCII//TRANSLIT' encoding.
- Be careful, however: this transliteration may depend on the current locale.
- </para>
- <para>
- Field names are chosen at your discretion when you add fields with the <code>addField()</code> method.
- </para>
- <para>
- Java Lucene uses the 'contents' field as a default field to search.
- <classname>Zend_Search_Lucene</classname> searches through all fields by default, but the behavior is configurable.
- See the <link linkend="zend.search.lucene.query-language.fields">"Default search field"</link> chapter for details.
- </para>
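- <para>
- The following sketch ties these pieces together: it builds a document, adds it to a new index and reads the stored fields
- back from a search hit. It assumes Zend Framework 1 is on the <code>include_path</code>; the temporary index location,
- the field names and the sample values are illustrative only:
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene.php';
- // Illustrative location: a fresh directory under the system temp dir.
- $indexPath = sys_get_temp_dir() . '/zsl-overview-' . uniqid();
- $index = Zend_Search_Lucene::create($indexPath);
- $doc = new Zend_Search_Lucene_Document();
- // Stored only: returned with hits, but not searchable.
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url',
-                                                    'http://example.com/article/42'));
- // Stored, indexed and tokenized.
- $doc->addField(Zend_Search_Lucene_Field::Text('title',
-                                               'Indexing with Lucene'));
- // Indexed and tokenized, but not stored.
- $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
-                                                   'Long article body about search engines'));
- $index->addDocument($doc);
- $index->commit();
- // Restrict the query to the 'contents' field.
- $hits = $index->find('contents:engines');
- foreach ($hits as $hit) {
-     // Stored fields are available as properties of the hit.
-     echo $hit->title, ' => ', $hit->url, PHP_EOL;
- }
- ]]></programlisting>
- Only the stored fields ('url' and 'title') can be read from a hit. The 'contents' field is searchable, but its text is not
- kept in the index.
- </para>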
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.understanding-field-types">
- <title>Understanding Field Types</title>
- <itemizedlist>
- <listitem>
- <para>
- <code>Keyword</code> fields are stored and indexed, meaning that they can be searched as well
- as displayed in search results. They are not split up into separate words by tokenization.
- Enumerated database fields usually translate well to Keyword fields in <classname>Zend_Search_Lucene</classname>.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>UnIndexed</code> fields are not searchable, but they are returned with search hits. Database
- timestamps, primary keys, file system paths, and other external identifiers are good
- candidates for UnIndexed fields.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>Binary</code> fields are not tokenized or indexed, but are stored for retrieval with search hits.
- They can be used to store any data encoded as a binary string, such as an image icon.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>Text</code> fields are stored, indexed, and tokenized. Text fields are appropriate for storing
- information like subjects and titles that need to be searchable as well as returned with
- search results.
- </para>
- </listitem>
- <listitem>
- <para>
- <code>UnStored</code> fields are tokenized and indexed, but not stored in the index. Large amounts of
- text are best indexed using this type of field. Storing data creates a larger index on
- disk, so if you need to search but not redisplay the data, use an UnStored field.
- UnStored fields are practical when using a <classname>Zend_Search_Lucene</classname> index in
- combination with a relational database. You can index large data fields with UnStored
- fields for searching, and retrieve them from your relational database by using a separate
- field as an identifier.
- </para>
- <table id="zend.search.lucene.index-creation.understanding-field-types.table">
- <title>Zend_Search_Lucene_Field Types</title>
- <tgroup cols="5">
- <thead>
- <row>
- <entry>Field Type</entry>
- <entry>Stored</entry>
- <entry>Indexed</entry>
- <entry>Tokenized</entry>
- <entry>Binary</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>Keyword</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- <entry>No</entry>
- </row>
- <row>
- <entry>UnIndexed</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- <entry>No</entry>
- <entry>No</entry>
- </row>
- <row>
- <entry>Binary</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- <entry>No</entry>
- <entry>Yes</entry>
- </row>
- <row>
- <entry>Text</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- </row>
- <row>
- <entry>UnStored</entry>
- <entry>No</entry>
- <entry>Yes</entry>
- <entry>Yes</entry>
- <entry>No</entry>
- </row>
- </tbody>
- </tgroup>
- </table>
- </listitem>
- </itemizedlist>
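- <para>
- The flags in the table above can be checked directly on the field objects. This is a small sketch, assuming Zend Framework 1
- is on the <code>include_path</code> and that the flags are exposed as the public properties <code>isStored</code>,
- <code>isIndexed</code>, <code>isTokenized</code> and <code>isBinary</code> of <classname>Zend_Search_Lucene_Field</classname>;
- the sample values are illustrative:
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene/Field.php';
- // One field of each type, keyed by the factory method that created it.
- $fields = array(
-     'Keyword'   => Zend_Search_Lucene_Field::Keyword('doctype', 'autogenerated'),
-     'UnIndexed' => Zend_Search_Lucene_Field::UnIndexed('created', time()),
-     'Binary'    => Zend_Search_Lucene_Field::Binary('icon', "\x89PNG"),
-     'Text'      => Zend_Search_Lucene_Field::Text('annotation', 'Document annotation text'),
-     'UnStored'  => Zend_Search_Lucene_Field::UnStored('contents', 'My document content'),
- );
- // Print the four flags of every field type, one line per type.
- foreach ($fields as $type => $field) {
-     printf("%-10s stored=%s indexed=%s tokenized=%s binary=%s\n",
-            $type,
-            var_export($field->isStored, true),
-            var_export($field->isIndexed, true),
-            var_export($field->isTokenized, true),
-            var_export($field->isBinary, true));
- }
- ]]></programlisting>
- </para>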
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.html-documents">
- <title>HTML documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers an HTML parsing feature. Documents can be created directly from an HTML file or string:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
- $index->addDocument($doc);
- ...
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Html</classname> class uses the <code>DOMDocument::loadHTML()</code> and
- <code>DOMDocument::loadHTMLFile()</code> methods to parse the source HTML, so the HTML does not need to be well formed or
- to be XHTML. On the other hand, parsing is sensitive to the encoding specified by the "meta http-equiv" header tag.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Html</classname> class recognizes the document title, body and header meta tags.
- </para>
- <para>
- The 'title' field is actually the /html/head/title value. It's stored within the index, tokenized and available for search.
- </para>
- <para>
- The 'body' field is the actual body content of the HTML file or string. It doesn't include scripts, comments or attributes.
- </para>
- <para>
- The <code>loadHTML()</code> and <code>loadHTMLFile()</code> methods of the <classname>Zend_Search_Lucene_Document_Html</classname> class
- also accept an optional second argument. If it's set to true, the body content is also stored within the index and can
- be retrieved from the index. By default, the body is tokenized and indexed, but not stored.
- </para>
- <para>
- The optional third parameter of the <code>loadHTML()</code> and <code>loadHTMLFile()</code> methods specifies the encoding of the source HTML
- document. It's used if the encoding is not specified using the Content-type HTTP-EQUIV meta tag.
- </para>
- <para>
- Other document header meta tags produce additional document fields. The field name is taken from the 'name' attribute, and
- the 'content' attribute supplies the field value. These fields are tokenized, indexed and stored, so documents may be searched by their meta tags
- (for example, by keywords).
- </para>
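- <para>
- As a rough illustration of the fields described above, the following sketch parses a small HTML string with the body stored.
- It assumes Zend Framework 1 is on the <code>include_path</code>; the markup is made up for the example and declares its
- encoding with a meta tag, so no third argument is passed:
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene/Document/Html.php';
- // Illustrative markup with a title, one named meta tag and a short body.
- $html = '<html><head>'
-       . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
-       . '<title>Sample page</title>'
-       . '<meta name="keywords" content="lucene, search">'
-       . '</head><body><p>Hello, world.</p></body></html>';
- // The second argument (true) asks for the body text to be stored as well.
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($html, true);
- $title    = $doc->getFieldValue('title');    // from /html/head/title
- $keywords = $doc->getFieldValue('keywords'); // from <meta name="keywords">
- $body     = $doc->getFieldValue('body');     // readable because it is stored
- ]]></programlisting>
- </para>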
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::Text('annotation',
- 'Document annotation text'));
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- <para>
- Document links are not included in the generated document, but may be retrieved with
- the <classname>Zend_Search_Lucene_Document_Html::getLinks()</classname> and <classname>Zend_Search_Lucene_Document_Html::getHeaderLinks()</classname>
- methods:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
- $linksArray = $doc->getLinks();
- $headerLinksArray = $doc->getHeaderLinks();
- ]]></programlisting>
- </para>
- <para>
- Starting from ZF 1.6, it's also possible to exclude links with the <code>rel</code> attribute set to <code>'nofollow'</code>.
- Use <classname>Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true)</classname> to turn on this option.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks()</classname> method returns the current state of
- the "Exclude nofollow links" flag.
- </para>
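- <para>
- A short sketch of the flag in use, assuming Zend Framework 1 is on the <code>include_path</code>. The markup and URLs are
- illustrative. Because the flag is static, the sketch switches it on before parsing and switches it off again afterwards:
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene/Document/Html.php';
- // Illustrative markup: one ordinary link and one nofollow link.
- $html = '<html><head>'
-       . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
-       . '<title>Links</title>'
-       . '</head><body>'
-       . '<a href="http://example.com/followed">followed</a>'
-       . '<a href="http://example.com/ignored" rel="nofollow">ignored</a>'
-       . '</body></html>';
- // Links are collected while the document is parsed, so set the flag first.
- Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);
- $doc   = Zend_Search_Lucene_Document_Html::loadHTML($html);
- $links = $doc->getLinks();
- // Turn the option off again so later parsing is unaffected.
- Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(false);
- ]]></programlisting>
- </para>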
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.docx-documents">
- <title>Word 2007 documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers a Word 2007 parsing feature. Documents can be created directly from a Word 2007 file:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Docx</classname> class uses the <code>ZipArchive</code> class and
- <code>simplexml</code> methods to parse the source document. If the <code>ZipArchive</code> class (from module php_zip)
- is not available, the <classname>Zend_Search_Lucene_Document_Docx</classname> class will not be available for use with Zend Framework either.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Docx</classname> class recognizes document meta data and document text. Depending on the document contents, the meta data consists of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified and created.
- </para>
- <para>
- The 'filename' field is the actual Word 2007 file name.
- </para>
- <para>
- The 'title' field is the actual document title.
- </para>
- <para>
- The 'subject' field is the actual document subject.
- </para>
- <para>
- The 'creator' field is the actual document creator.
- </para>
- <para>
- The 'keywords' field contains the actual document keywords.
- </para>
- <para>
- The 'description' field is the actual document description.
- </para>
- <para>
- The 'lastModifiedBy' field is the name of the user who last modified the document.
- </para>
- <para>
- The 'revision' field is the actual document revision number.
- </para>
- <para>
- The 'modified' field is the actual document last modified date / time.
- </para>
- <para>
- The 'created' field is the actual document creation date / time.
- </para>
- <para>
- The 'body' field is the actual body content of the Word 2007 document. It includes only normal text; comments and revisions are not included.
- </para>
- <para>
- The <code>loadDocxFile()</code> method of the <classname>Zend_Search_Lucene_Document_Docx</classname> class
- also accepts an optional second argument. If it's set to true, the body content is also stored within the index and can
- be retrieved from the index. By default, the body is tokenized and indexed, but not stored.
- </para>
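- <para>
- To see which of the fields listed above a particular file actually yields, the parsed document can be inspected before it is
- indexed. This is a sketch only, assuming Zend Framework 1 is on the <code>include_path</code>;
- <code>printDocxFields()</code> is a hypothetical helper name, and the path passed to it is up to the caller:
- <programlisting language="php"><![CDATA[
- /**
-  * Hypothetical helper: print every field parsed from a Word 2007 file.
-  */
- function printDocxFields($filename)
- {
-     // The Docx class depends on ZipArchive, so check for it up front.
-     if (!class_exists('ZipArchive')) {
-         throw new RuntimeException('The zip extension (ZipArchive) is required.');
-     }
-     require_once 'Zend/Search/Lucene/Document/Docx.php';
-     // Pass true so that the body text is stored and can be printed too.
-     $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename, true);
-     foreach ($doc->getFieldNames() as $fieldName) {
-         echo $fieldName, ': ', $doc->getFieldValue($fieldName), PHP_EOL;
-     }
- }
- ]]></programlisting>
- </para>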
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
- 'indexTime',
- time())
- );
- $doc->addField(Zend_Search_Lucene_Field::Text(
- 'annotation',
- 'Document annotation text')
- );
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.pptx-documents">
- <title>PowerPoint 2007 documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers a PowerPoint 2007 parsing feature. Documents can be created directly from a PowerPoint 2007 file:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Pptx</classname> class uses the <code>ZipArchive</code> class and
- <code>simplexml</code> methods to parse the source document. If the <code>ZipArchive</code> class (from module php_zip)
- is not available, the <classname>Zend_Search_Lucene_Document_Pptx</classname> class will not be available for use with Zend Framework either.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Pptx</classname> class recognizes document meta data and document text. Depending on the document contents, the meta data consists of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified and created.
- </para>
- <para>
- The 'filename' field is the actual PowerPoint 2007 file name.
- </para>
- <para>
- The 'title' field is the actual document title.
- </para>
- <para>
- The 'subject' field is the actual document subject.
- </para>
- <para>
- The 'creator' field is the actual document creator.
- </para>
- <para>
- The 'keywords' field contains the actual document keywords.
- </para>
- <para>
- The 'description' field is the actual document description.
- </para>
- <para>
- The 'lastModifiedBy' field is the name of the user who last modified the document.
- </para>
- <para>
- The 'revision' field is the actual document revision number.
- </para>
- <para>
- The 'modified' field is the actual document last modified date / time.
- </para>
- <para>
- The 'created' field is the actual document creation date / time.
- </para>
- <para>
- The 'body' field is the actual content of all slides and slide notes in the PowerPoint 2007 document.
- </para>
- <para>
- The <code>loadPptxFile()</code> method of the <classname>Zend_Search_Lucene_Document_Pptx</classname> class
- also accepts an optional second argument. If it's set to true, the body content is also stored within the index and can
- be retrieved from the index. By default, the body is tokenized and indexed, but not stored.
- </para>
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
- 'indexTime',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::Text(
- 'annotation',
- 'Document annotation text'));
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- </sect2>
- <sect2 id="zend.search.lucene.index-creation.xlsx-documents">
- <title>Excel 2007 documents</title>
- <para>
- <classname>Zend_Search_Lucene</classname> offers an Excel 2007 parsing feature. Documents can be created directly from an Excel 2007 file:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Xlsx</classname> class uses the <code>ZipArchive</code> class and
- <code>simplexml</code> methods to parse the source document. If the <code>ZipArchive</code> class (from module php_zip)
- is not available, the <classname>Zend_Search_Lucene_Document_Xlsx</classname> class will not be available for use with Zend Framework either.
- </para>
- <para>
- The <classname>Zend_Search_Lucene_Document_Xlsx</classname> class recognizes document meta data and document text. Depending on the document contents, the meta data consists of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified and created.
- </para>
- <para>
- The 'filename' field is the actual Excel 2007 file name.
- </para>
- <para>
- The 'title' field is the actual document title.
- </para>
- <para>
- The 'subject' field is the actual document subject.
- </para>
- <para>
- The 'creator' field is the actual document creator.
- </para>
- <para>
- The 'keywords' field contains the actual document keywords.
- </para>
- <para>
- The 'description' field is the actual document description.
- </para>
- <para>
- The 'lastModifiedBy' field is the name of the user who last modified the document.
- </para>
- <para>
- The 'revision' field is the actual document revision number.
- </para>
- <para>
- The 'modified' field is the actual document last modified date / time.
- </para>
- <para>
- The 'created' field is the actual document creation date / time.
- </para>
- <para>
- The 'body' field is the actual content of all cells in all worksheets of the Excel 2007 document.
- </para>
- <para>
- The <code>loadXlsxFile()</code> method of the <classname>Zend_Search_Lucene_Document_Xlsx</classname> class
- also accepts an optional second argument. If it's set to true, the body content is also stored within the index and can
- be retrieved from the index. By default, the body is tokenized and indexed, but not stored.
- </para>
- <para>
- Parsed documents may be augmented by the programmer with any other field:
- <programlisting language="php"><![CDATA[
- $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
- $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
- 'indexTime',
- time()));
- $doc->addField(Zend_Search_Lucene_Field::Text(
- 'annotation',
- 'Document annotation text'));
- $index->addDocument($doc);
- ]]></programlisting>
- </para>
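- <para>
- Since the HTML, Word, PowerPoint and Excel loaders share the same calling pattern, an application that indexes mixed content
- might pick the loader by file extension. The function below is a hypothetical helper, not part of Zend Framework. It is a
- sketch that assumes Zend Framework 1 is on the <code>include_path</code> and the zip extension is installed:
- <programlisting language="php"><![CDATA[
- require_once 'Zend/Search/Lucene/Document/Html.php';
- require_once 'Zend/Search/Lucene/Document/Docx.php';
- require_once 'Zend/Search/Lucene/Document/Pptx.php';
- require_once 'Zend/Search/Lucene/Document/Xlsx.php';
- /**
-  * Hypothetical helper: create a document with the loader that matches
-  * the file extension. $storeContent is passed through to the loader.
-  */
- function createDocumentFromFile($filename, $storeContent = false)
- {
-     $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
-     switch ($extension) {
-         case 'htm':
-         case 'html':
-             return Zend_Search_Lucene_Document_Html::loadHTMLFile($filename, $storeContent);
-         case 'docx':
-             return Zend_Search_Lucene_Document_Docx::loadDocxFile($filename, $storeContent);
-         case 'pptx':
-             return Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename, $storeContent);
-         case 'xlsx':
-             return Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename, $storeContent);
-         default:
-             throw new InvalidArgumentException("Unsupported file type: '$extension'");
-     }
- }
- ]]></programlisting>
- The returned document can then be extended with further fields and passed to <code>$index->addDocument()</code> as in
- the examples above.
- </para>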
- </sect2>
- </sect1>