.. _zendsearch.lucene.overview:

Overview
========

.. _zendsearch.lucene.introduction:

Introduction
------------

``Zend\Search\Lucene`` is a general purpose text search engine written entirely in *PHP* 5. Since it stores its
index on the filesystem and does not require a database server, it can add search capabilities to almost any
*PHP*-driven website. ``Zend\Search\Lucene`` supports the following features:



   - Ranked searching - best results returned first

   - Many powerful query types: phrase queries, boolean queries, wildcard queries, proximity queries, range queries
     and many others.

   - Search by specific field (e.g., title, author, contents)

``Zend\Search\Lucene`` was derived from the Apache Lucene project. Starting with Zend Framework 1.6, the supported
Lucene index format versions are 1.4 through 2.3. For more information on Lucene, visit
http://lucene.apache.org/java/docs/.

.. note::

   Earlier ``Zend\Search\Lucene`` implementations supported the Lucene 1.4 (1.9) - 2.1 index formats.

   Starting with Zend Framework 1.5, any index created in a pre-2.1 index format is automatically upgraded to the
   Lucene 2.1 format after ``Zend\Search\Lucene`` is updated. The upgraded index is not compatible with the
   ``Zend\Search\Lucene`` implementations included in Zend Framework 1.0.x.

.. _zendsearch.lucene.index-creation.documents-and-fields:

Document and Field Objects
--------------------------

``Zend\Search\Lucene`` operates with documents as atomic objects for indexing. A document is divided into named
fields, and fields have content that can be searched.

A document is represented by the ``Zend\Search\Lucene\Document`` class, and objects of this class contain
instances of ``Zend\Search\Lucene\Field`` that represent the fields of the document.

It is important to note that any information can be added to the index. Application-specific information or
metadata can be stored in the document fields, and later retrieved with the document during search.

It is the responsibility of your application to control the indexer. This means that data can be indexed from any
source that is accessible by your application. For example, this could be the filesystem, a database, an *HTML*
form, etc.

The ``Zend\Search\Lucene\Field`` class provides several static methods to create fields with different characteristics:

.. code-block:: php
   :linenos:

   $doc = new Zend\Search\Lucene\Document();

   // Field is not tokenized, but is indexed and stored within the index.
   // Stored fields can be retrieved from the index.
   $doc->addField(Zend\Search\Lucene\Field::Keyword('doctype',
                                                    'autogenerated'));

   // Field is not tokenized nor indexed, but is stored in the index.
   $doc->addField(Zend\Search\Lucene\Field::UnIndexed('created',
                                                      time()));

   // Binary String valued Field that is not tokenized nor indexed,
   // but is stored in the index.
   $doc->addField(Zend\Search\Lucene\Field::Binary('icon',
                                                   $iconData));

   // Field is tokenized and indexed, and is stored in the index.
   $doc->addField(Zend\Search\Lucene\Field::Text('annotation',
                                                 'Document annotation text'));

   // Field is tokenized and indexed, but is not stored in the index.
   $doc->addField(Zend\Search\Lucene\Field::UnStored('contents',
                                                     'My document content'));

Each of these methods (excluding the ``Zend\Search\Lucene\Field::Binary()`` method) has an optional ``$encoding``
parameter for specifying input data encoding.

Encoding may differ for different documents as well as for different fields within one document:

.. code-block:: php
   :linenos:

   $doc = new Zend\Search\Lucene\Document();
   $doc->addField(Zend\Search\Lucene\Field::Text('title',
                                                 $title,
                                                 'iso-8859-1'));
   $doc->addField(Zend\Search\Lucene\Field::UnStored('contents',
                                                     $contents,
                                                     'utf-8'));

If the encoding parameter is omitted, the current locale is used at processing time. For example:

.. code-block:: php
   :linenos:

   setlocale(LC_ALL, 'de_DE.iso-8859-1');
   // ...
   $doc->addField(Zend\Search\Lucene\Field::UnStored('contents', $contents));

Fields are always stored and returned from the index in UTF-8 encoding. Any required conversion to UTF-8 happens
automatically.
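
The conversion can be pictured with PHP's ``iconv()`` function. The following is only a sketch of the idea, not the
library's internal code; the sample string and variable names are made up for illustration:

.. code-block:: php

   // "Müller" encoded as ISO-8859-1: the u-umlaut is the single byte 0xFC.
   $latin1 = "M\xFCller";

   // Convert to UTF-8, the encoding in which field values are stored.
   $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

   // In UTF-8 the same character takes two bytes (0xC3 0xBC).
   printf("%d bytes -> %d bytes\n", strlen($latin1), strlen($utf8));

Passing the correct source encoding matters: the same bytes read under a different source encoding would produce
different UTF-8 text.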

Text analyzers (:ref:`see below <zendsearch.lucene.extending.analysis>`) may also convert text to other
encodings. In particular, the default analyzer converts text to 'ASCII//TRANSLIT' encoding. Be careful, however: the
result of this transliteration may depend on the current locale.
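
To see the locale dependence for yourself, you can run the same transliteration under different locales with plain
``iconv()``. This is a hedged sketch, not the analyzer's code: the sample text and the locale names are assumptions,
and the output differs between platforms and iconv implementations, so none is shown here.

.. code-block:: php

   $text = "Caf\xC3\xA9 na\xC3\xAFve"; // "Café naïve" in UTF-8

   $results = [];
   foreach (['C', 'en_US.UTF-8', 'de_DE.UTF-8'] as $locale) {
       // setlocale() returns false when the locale is not installed.
       if (setlocale(LC_ALL, $locale) === false) {
           continue;
       }
       // iconv() may raise a notice or return false for characters
       // that it cannot transliterate.
       $results[$locale] = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
   }

   var_dump($results);

If the results differ between locales on your system, queries and indexed text should be processed under the same
locale.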

Field names are defined at your discretion in the ``addField()`` method.

Java Lucene uses the 'contents' field as a default field to search. ``Zend\Search\Lucene`` searches through all
fields by default, but the behavior is configurable. See the :ref:`"Default search field"
<zendsearch.lucene.query-language.fields>` chapter for details.

.. _zendsearch.lucene.index-creation.understanding-field-types:

Understanding Field Types
-------------------------

- *Keyword* fields are stored and indexed, meaning that they can be searched as well as displayed in search
  results. They are not split up into separate words by tokenization. Enumerated database fields usually translate
  well to Keyword fields in ``Zend\Search\Lucene``.

- *UnIndexed* fields are not searchable, but they are returned with search hits. Database timestamps, primary keys,
  file system paths, and other external identifiers are good candidates for UnIndexed fields.

- *Binary* fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to
  store any data encoded as a binary string, such as an image icon.

- *Text* fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like
  subjects and titles that need to be searchable as well as returned with search results.

- *UnStored* fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed
  using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay
  the data, use an UnStored field. UnStored fields are practical when using a ``Zend\Search\Lucene`` index in
  combination with a relational database. You can index large data fields with UnStored fields for searching, and
  retrieve them from your relational database by using a separate field as an identifier.

  .. _zendsearch.lucene.index-creation.understanding-field-types.table:

  .. table:: ``Zend\Search\Lucene\Field`` Types

     +----------+------+-------+---------+------+
     |Field Type|Stored|Indexed|Tokenized|Binary|
     +==========+======+=======+=========+======+
     |Keyword   |Yes   |Yes    |No       |No    |
     +----------+------+-------+---------+------+
     |UnIndexed |Yes   |No     |No       |No    |
     +----------+------+-------+---------+------+
     |Binary    |Yes   |No     |No       |Yes   |
     +----------+------+-------+---------+------+
     |Text      |Yes   |Yes    |Yes      |No    |
     +----------+------+-------+---------+------+
     |UnStored  |No    |Yes    |Yes      |No    |
     +----------+------+-------+---------+------+
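
The table can also be read as a lookup: given the characteristics you need, which field types qualify? The sketch
below encodes the table as a plain PHP array. The ``matchingFieldTypes()`` helper is hypothetical and not part of
``Zend\Search\Lucene``:

.. code-block:: php

   // Field characteristics, copied from the table above.
   $fieldTypes = [
       'Keyword'   => ['stored' => true,  'indexed' => true,  'tokenized' => false, 'binary' => false],
       'UnIndexed' => ['stored' => true,  'indexed' => false, 'tokenized' => false, 'binary' => false],
       'Binary'    => ['stored' => true,  'indexed' => false, 'tokenized' => false, 'binary' => true],
       'Text'      => ['stored' => true,  'indexed' => true,  'tokenized' => true,  'binary' => false],
       'UnStored'  => ['stored' => false, 'indexed' => true,  'tokenized' => true,  'binary' => false],
   ];

   // Hypothetical helper: names of the field types that have all the
   // requested characteristics.
   function matchingFieldTypes(array $fieldTypes, array $wanted): array
   {
       $matches = [];
       foreach ($fieldTypes as $name => $traits) {
           foreach ($wanted as $trait => $value) {
               if ($traits[$trait] !== $value) {
                   continue 2; // this type does not qualify, try the next one
               }
           }
           $matches[] = $name;
       }
       return $matches;
   }

   // Searchable, but not kept in the index for redisplay.
   $searchOnly = matchingFieldTypes($fieldTypes, ['stored' => false, 'indexed' => true]);

   // Returned with hits, but not searchable.
   $displayOnly = matchingFieldTypes($fieldTypes, ['stored' => true, 'indexed' => false]);

Here ``$searchOnly`` contains only ``UnStored``, while ``$displayOnly`` contains ``UnIndexed`` and ``Binary``.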

.. _zendsearch.lucene.index-creation.html-documents:

HTML documents
--------------

``Zend\Search\Lucene`` offers an *HTML* parsing feature. Documents can be created directly from an *HTML* file or
string:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Html::loadHTMLFile($filename);
   $index->addDocument($doc);
   // ...
   $doc = Zend\Search\Lucene\Document\Html::loadHTML($htmlString);
   $index->addDocument($doc);

The ``Zend\Search\Lucene\Document\Html`` class uses the ``DOMDocument::loadHTML()`` and
``DOMDocument::loadHTMLFile()`` methods to parse the source *HTML*, so the *HTML* does not need to be well formed or
to be *XHTML*. It is, however, sensitive to the encoding specified by the "meta http-equiv" header tag.

The ``Zend\Search\Lucene\Document\Html`` class recognizes the document title, the body and the header meta tags.

The 'title' field is actually the /html/head/title value. It's stored within the index, tokenized and available for
search.

The 'body' field is the actual body content of the *HTML* file or string. It doesn't include scripts, comments or
attributes.
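
The following sketch shows how a title and body text of this kind can be pulled out of imperfect *HTML* with
``DOMDocument`` and ``DOMXPath``. It illustrates the approach only and is not the code used by
``Zend\Search\Lucene\Document\Html``; the sample markup and the XPath expressions are assumptions.

.. code-block:: php

   // Deliberately sloppy markup: the <b> element is never closed.
   $html = '<html><head><title>Sample page</title></head>'
         . '<body><p>Hello <b>world<!-- hidden note -->'
         . '<script>var x = 1;</script></p></body></html>';

   // Collect parser warnings internally instead of emitting them.
   $previous = libxml_use_internal_errors(true);
   $dom = new DOMDocument();
   $dom->loadHTML($html);
   libxml_clear_errors();
   libxml_use_internal_errors($previous);

   $xpath = new DOMXPath($dom);

   // The title is the string value of /html/head/title.
   $title = $xpath->evaluate('string(/html/head/title)');

   // Body text: text nodes only, so comments and attributes are left out;
   // text inside <script> and <style> elements is skipped explicitly.
   $parts = [];
   $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';
   foreach ($xpath->query($query) as $node) {
       $parts[] = $node->nodeValue;
   }
   $body = trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));

   var_dump($title, $body);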

The ``loadHTML()`` and ``loadHTMLFile()`` methods of the ``Zend\Search\Lucene\Document\Html`` class also accept an
optional second argument. If it is set to ``TRUE``, the body content is also stored within the index and can be
retrieved from it. By default, the body is tokenized and indexed, but not stored.

The optional third parameter of the ``loadHTML()`` and ``loadHTMLFile()`` methods specifies the encoding of the
source *HTML* document. It is used if the encoding is not specified by a Content-type HTTP-EQUIV meta tag.

Other document header meta tags produce additional document fields. The field name is taken from the 'name'
attribute, and the 'content' attribute supplies the field value. These fields are tokenized, indexed and stored, so
documents may be searched by their meta tags (for example, by keywords).
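
As a rough illustration of that mapping, the sketch below collects name/content pairs from the meta tags with
``DOMDocument``. It is not the library's implementation, and the sample markup is invented. Meta tags without a
'name' attribute, such as http-equiv headers, are skipped here:

.. code-block:: php

   $html = '<html><head><title>Sample page</title>'
         . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
         . '<meta name="keywords" content="search, lucene">'
         . '<meta name="description" content="Demo page">'
         . '</head><body></body></html>';

   $previous = libxml_use_internal_errors(true);
   $dom = new DOMDocument();
   $dom->loadHTML($html);
   libxml_clear_errors();
   libxml_use_internal_errors($previous);

   // Field name => field value, one entry per named meta tag.
   $fields = [];
   foreach ($dom->getElementsByTagName('meta') as $meta) {
       if ($meta->hasAttribute('name') && $meta->hasAttribute('content')) {
           $fields[$meta->getAttribute('name')] = $meta->getAttribute('content');
       }
   }

   print_r($fields);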

Parsed documents may be augmented by the programmer with any other field:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Html::loadHTML($htmlString);
   $doc->addField(Zend\Search\Lucene\Field::UnIndexed('created',
                                                      time()));
   $doc->addField(Zend\Search\Lucene\Field::UnIndexed('updated',
                                                      time()));
   $doc->addField(Zend\Search\Lucene\Field::Text('annotation',
                                                 'Document annotation text'));
   $index->addDocument($doc);

Document links are not included in the generated document, but may be retrieved with the
``Zend\Search\Lucene\Document\Html::getLinks()`` and ``Zend\Search\Lucene\Document\Html::getHeaderLinks()``
methods:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Html::loadHTML($htmlString);
   $linksArray = $doc->getLinks();
   $headerLinksArray = $doc->getHeaderLinks();

Starting with Zend Framework 1.6, it is also possible to exclude links whose *rel* attribute is set to
*'nofollow'*. Use ``Zend\Search\Lucene\Document\Html::setExcludeNoFollowLinks(true)`` to turn on this option.

The ``Zend\Search\Lucene\Document\Html::getExcludeNoFollowLinks()`` method returns the current state of the
"Exclude nofollow links" flag.
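
The effect of such a flag can be sketched with ``DOMDocument`` alone. The ``collectLinks()`` function below is a
hypothetical stand-in written for this example, not the library's ``getLinks()`` method, and it only looks at
``<a href>`` elements:

.. code-block:: php

   // Hypothetical helper: collect href values, optionally skipping
   // links marked with rel="nofollow".
   function collectLinks(string $html, bool $excludeNoFollow = false): array
   {
       $previous = libxml_use_internal_errors(true);
       $dom = new DOMDocument();
       $dom->loadHTML($html);
       libxml_clear_errors();
       libxml_use_internal_errors($previous);

       $links = [];
       foreach ($dom->getElementsByTagName('a') as $anchor) {
           if (!$anchor->hasAttribute('href')) {
               continue;
           }
           // rel may hold several space-separated tokens, e.g. "external nofollow".
           $rel = preg_split('/\s+/', strtolower($anchor->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
           if ($excludeNoFollow && in_array('nofollow', $rel, true)) {
               continue;
           }
           $links[] = $anchor->getAttribute('href');
       }
       return $links;
   }

   $html = '<p><a href="/a">A</a> <a href="/b" rel="nofollow">B</a> '
         . '<a href="/c" rel="external NoFollow">C</a></p>';

   $all      = collectLinks($html);
   $followed = collectLinks($html, true);

With the sample markup, ``$all`` holds all three links, and ``$followed`` holds only ``/a``.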

.. _zendsearch.lucene.index-creation.docx-documents:

Word 2007 documents
-------------------

``Zend\Search\Lucene`` offers a Word 2007 parsing feature. Documents can be created directly from a Word 2007 file:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Docx::loadDocxFile($filename);
   $index->addDocument($doc);

The ``Zend\Search\Lucene\Document\Docx`` class uses the *ZipArchive* class and *simplexml* methods to parse the
source document. If the *ZipArchive* class (from module php_zip) is not available, the
``Zend\Search\Lucene\Document\Docx`` class will also not be available for use with Zend Framework.

The ``Zend\Search\Lucene\Document\Docx`` class recognizes document meta data and document text. Depending on the
document contents, the meta data consists of filename, title, subject, creator, keywords, description,
lastModifiedBy, revision, modified, created.
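
Most of these values correspond to the core properties part of an Office 2007 package, which is a ZIP archive. The
sketch below reads that part with *ZipArchive* and *simplexml*. It is an illustration under assumptions, not the
parser used by ``Zend\Search\Lucene``: the ``readCoreProperties()`` helper is hypothetical, the part is assumed to
live at its usual location ``docProps/core.xml``, and the demo archive is a stand-in for a real document. The
'filename' field is not part of the core properties and is not covered.

.. code-block:: php

   // Hypothetical helper: read core properties from an Office 2007 file.
   function readCoreProperties(string $path): array
   {
       if (!class_exists('ZipArchive')) {
           throw new RuntimeException('The zip extension is required');
       }
       $zip = new ZipArchive();
       if ($zip->open($path) !== true) {
           throw new RuntimeException("Cannot open $path as a ZIP archive");
       }
       $xml = $zip->getFromName('docProps/core.xml');
       $zip->close();
       $core = ($xml === false) ? false : simplexml_load_string($xml);
       if ($core === false) {
           return [];
       }

       $cp      = $core->children('http://schemas.openxmlformats.org/package/2006/metadata/core-properties');
       $dc      = $core->children('http://purl.org/dc/elements/1.1/');
       $dcterms = $core->children('http://purl.org/dc/terms/');

       $candidates = [
           'title'          => $dc->title,
           'subject'        => $dc->subject,
           'creator'        => $dc->creator,
           'keywords'       => $cp->keywords,
           'description'    => $dc->description,
           'lastModifiedBy' => $cp->lastModifiedBy,
           'revision'       => $cp->revision,
           'modified'       => $dcterms->modified,
           'created'        => $dcterms->created,
       ];

       // Keep only the properties that are present in the document.
       $props = [];
       foreach ($candidates as $name => $element) {
           if (count($element) > 0) {
               $props[$name] = (string) $element;
           }
       }
       return $props;
   }

   // Demo: build a tiny archive holding only a core properties part.
   $coreXml = '<cp:coreProperties'
            . ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
            . ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
            . '<dc:title>Quarterly report</dc:title><dc:creator>Jane Doe</dc:creator>'
            . '<cp:revision>3</cp:revision></cp:coreProperties>';

   $file = tempnam(sys_get_temp_dir(), 'ooxml');
   $zip = new ZipArchive();
   $zip->open($file, ZipArchive::CREATE | ZipArchive::OVERWRITE);
   $zip->addFromString('docProps/core.xml', $coreXml);
   $zip->close();

   $props = readCoreProperties($file);
   unlink($file);

Only properties that exist in the file appear in ``$props``, which mirrors the "depending on document contents"
wording above.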

The 'filename' field is the actual Word 2007 file name.

The 'title' field is the actual document title.

The 'subject' field is the actual document subject.

The 'creator' field is the actual document creator.

The 'keywords' field contains the actual document keywords.

The 'description' field is the actual document description.

The 'lastModifiedBy' field is the username of the person who last modified the document.

The 'revision' field is the actual document revision number.

The 'modified' field is the actual document last modified date / time.

The 'created' field is the actual document creation date / time.

The 'body' field is the actual body content of the Word 2007 document. It includes only normal text; comments and
revisions are not included.
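
For orientation, the text of a Word 2007 document lives in ``w:t`` elements of the ``word/document.xml`` part. The
sketch below gathers them paragraph by paragraph. It is a simplified illustration, not the class's actual parser:
``extractDocxText()`` is a hypothetical name, the demo archive contains only the one part needed here, and the
sketch makes no attempt to reproduce the library's exact handling of revisions.

.. code-block:: php

   // Hypothetical helper: plain text of a .docx file, one line per paragraph.
   function extractDocxText(string $path): string
   {
       $w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

       $zip = new ZipArchive();
       if ($zip->open($path) !== true) {
           throw new RuntimeException("Cannot open $path as a ZIP archive");
       }
       $xml = $zip->getFromName('word/document.xml');
       $zip->close();
       $doc = ($xml === false) ? false : simplexml_load_string($xml);
       if ($doc === false) {
           return '';
       }

       $doc->registerXPathNamespace('w', $w);
       $paragraphs = [];
       foreach ($doc->xpath('//w:p') as $paragraph) {
           // Namespace prefixes must be registered again on each result node.
           $paragraph->registerXPathNamespace('w', $w);
           $runs = $paragraph->xpath('.//w:t');
           $paragraphs[] = implode('', array_map('strval', $runs));
       }
       return implode("\n", $paragraphs);
   }

   // Demo: a minimal archive with two paragraphs.
   $documentXml = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
                . '<w:body>'
                . '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
                . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
                . '</w:body></w:document>';

   $file = tempnam(sys_get_temp_dir(), 'docx');
   $zip = new ZipArchive();
   $zip->open($file, ZipArchive::CREATE | ZipArchive::OVERWRITE);
   $zip->addFromString('word/document.xml', $documentXml);
   $zip->close();

   $text = extractDocxText($file);
   unlink($file);

Comments are stored in a separate part of the package, so reading only ``word/document.xml`` leaves them out.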

The ``loadDocxFile()`` method of the ``Zend\Search\Lucene\Document\Docx`` class also accepts an optional second
argument. If it is set to ``TRUE``, the body content is also stored within the index and can be retrieved from it.
By default, the body is tokenized and indexed, but not stored.

Parsed documents may be augmented by the programmer with any other field:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Docx::loadDocxFile($filename);
   $doc->addField(Zend\Search\Lucene\Field::UnIndexed(
       'indexTime',
       time())
   );
   $doc->addField(Zend\Search\Lucene\Field::Text(
       'annotation',
       'Document annotation text')
   );
   $index->addDocument($doc);

.. _zendsearch.lucene.index-creation.pptx-documents:

PowerPoint 2007 documents
-------------------------

``Zend\Search\Lucene`` offers a PowerPoint 2007 parsing feature. Documents can be created directly from a
PowerPoint 2007 file:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Pptx::loadPptxFile($filename);
   $index->addDocument($doc);

The ``Zend\Search\Lucene\Document\Pptx`` class uses the *ZipArchive* class and *simplexml* methods to parse the
source document. If the *ZipArchive* class (from module php_zip) is not available, the
``Zend\Search\Lucene\Document\Pptx`` class will also not be available for use with Zend Framework.

The ``Zend\Search\Lucene\Document\Pptx`` class recognizes document meta data and document text. Depending on the
document contents, the meta data consists of filename, title, subject, creator, keywords, description,
lastModifiedBy, revision, modified, created.

The 'filename' field is the actual PowerPoint 2007 file name.

The 'title' field is the actual document title.

The 'subject' field is the actual document subject.

The 'creator' field is the actual document creator.

The 'keywords' field contains the actual document keywords.

The 'description' field is the actual document description.

The 'lastModifiedBy' field is the username of the person who last modified the document.

The 'revision' field is the actual document revision number.

The 'modified' field is the actual document last modified date / time.

The 'created' field is the actual document creation date / time.

The 'body' field is the actual content of all slides and slide notes in the PowerPoint 2007 document.
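
As a rough picture of where that text comes from, slide and notes text in a PowerPoint 2007 package is held in
``a:t`` elements of the slide and notes parts. The sketch below collects it per part. It is an assumption-laden
illustration rather than the class's implementation: ``extractPptxText()`` is a hypothetical name, and the part
locations (``ppt/slides/`` and ``ppt/notesSlides/``) are the conventional ones, which a real package may not follow.

.. code-block:: php

   // Hypothetical helper: text of each slide and notes part, keyed by part name.
   function extractPptxText(string $path): array
   {
       $zip = new ZipArchive();
       if ($zip->open($path) !== true) {
           throw new RuntimeException("Cannot open $path as a ZIP archive");
       }

       $texts = [];
       for ($i = 0; $i < $zip->numFiles; $i++) {
           $name = $zip->getNameIndex($i);
           if (!preg_match('#^ppt/(slides/slide|notesSlides/notesSlide)\d+\.xml$#', $name)) {
               continue;
           }
           $part = simplexml_load_string($zip->getFromIndex($i));
           if ($part === false) {
               continue;
           }
           $part->registerXPathNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main');
           $texts[$name] = implode(' ', array_map('strval', $part->xpath('//a:t')));
       }
       $zip->close();

       ksort($texts, SORT_NATURAL); // slide2 before slide10
       return $texts;
   }

   // Demo: one slide and one notes part with a single text run each.
   $makePart = function (string $root, string $text): string {
       return '<p:' . $root
            . ' xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"'
            . ' xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
            . '<p:cSld><p:spTree><p:sp><p:txBody><a:p><a:r><a:t>' . $text . '</a:t></a:r></a:p>'
            . '</p:txBody></p:sp></p:spTree></p:cSld></p:' . $root . '>';
   };

   $file = tempnam(sys_get_temp_dir(), 'pptx');
   $zip = new ZipArchive();
   $zip->open($file, ZipArchive::CREATE | ZipArchive::OVERWRITE);
   $zip->addFromString('ppt/slides/slide1.xml', $makePart('sld', 'Welcome'));
   $zip->addFromString('ppt/notesSlides/notesSlide1.xml', $makePart('notes', 'Speaker note'));
   $zip->close();

   $texts = extractPptxText($file);
   unlink($file);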

The ``loadPptxFile()`` method of the ``Zend\Search\Lucene\Document\Pptx`` class also accepts an optional second
argument. If it is set to ``TRUE``, the body content is also stored within the index and can be retrieved from it.
By default, the body is tokenized and indexed, but not stored.

Parsed documents may be augmented by the programmer with any other field:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Pptx::loadPptxFile($filename);
   $doc->addField(Zend\Search\Lucene\Field::UnIndexed(
       'indexTime',
       time()));
   $doc->addField(Zend\Search\Lucene\Field::Text(
       'annotation',
       'Document annotation text'));
   $index->addDocument($doc);

.. _zendsearch.lucene.index-creation.xlsx-documents:

Excel 2007 documents
--------------------

``Zend\Search\Lucene`` offers an Excel 2007 parsing feature. Documents can be created directly from an Excel 2007
file:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Xlsx::loadXlsxFile($filename);
   $index->addDocument($doc);

The ``Zend\Search\Lucene\Document\Xlsx`` class uses the *ZipArchive* class and *simplexml* methods to parse the
source document. If the *ZipArchive* class (from module php_zip) is not available, the
``Zend\Search\Lucene\Document\Xlsx`` class will also not be available for use with Zend Framework.

The ``Zend\Search\Lucene\Document\Xlsx`` class recognizes document meta data and document text. Depending on the
document contents, the meta data consists of filename, title, subject, creator, keywords, description,
lastModifiedBy, revision, modified, created.

The 'filename' field is the actual Excel 2007 file name.

The 'title' field is the actual document title.

The 'subject' field is the actual document subject.

The 'creator' field is the actual document creator.

The 'keywords' field contains the actual document keywords.

The 'description' field is the actual document description.

The 'lastModifiedBy' field is the username of the person who last modified the document.

The 'revision' field is the actual document revision number.

The 'modified' field is the actual document last modified date / time.

The 'created' field is the actual document creation date / time.

The 'body' field is the actual content of all cells in all worksheets of the Excel 2007 document.
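
Cell content in an Excel 2007 package is split across parts: worksheets hold the cells, and string cells usually
point into a shared string table. The sketch below resolves both. It is only an illustration of the file layout,
not the parser used by ``Zend\Search\Lucene``: ``extractXlsxValues()`` is a hypothetical name, the conventional part
names ``xl/sharedStrings.xml`` and ``xl/worksheets/sheetN.xml`` are assumed, and inline strings and formulas are
ignored.

.. code-block:: php

   // Hypothetical helper: values of all cells in all worksheets.
   function extractXlsxValues(string $path): array
   {
       $ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

       $zip = new ZipArchive();
       if ($zip->open($path) !== true) {
           throw new RuntimeException("Cannot open $path as a ZIP archive");
       }

       // Shared string table: string cells refer to entries by index.
       $shared = [];
       $xml = $zip->getFromName('xl/sharedStrings.xml');
       if ($xml !== false && ($sst = simplexml_load_string($xml)) !== false) {
           $sst->registerXPathNamespace('s', $ns);
           foreach ($sst->xpath('//s:si') as $item) {
               $item->registerXPathNamespace('s', $ns);
               $shared[] = implode('', array_map('strval', $item->xpath('.//s:t')));
           }
       }

       $values = [];
       for ($i = 0; $i < $zip->numFiles; $i++) {
           if (!preg_match('#^xl/worksheets/sheet\d+\.xml$#', $zip->getNameIndex($i))) {
               continue;
           }
           $sheet = simplexml_load_string($zip->getFromIndex($i));
           if ($sheet === false) {
               continue;
           }
           $sheet->registerXPathNamespace('s', $ns);
           foreach ($sheet->xpath('//s:c') as $cell) {
               $value = $cell->children($ns)->v;
               if (count($value) === 0) {
                   continue; // empty cell
               }
               $raw = (string) $value;
               // t="s" marks a shared string; the value is then an index.
               $values[] = ((string) $cell['t'] === 's') ? ($shared[(int) $raw] ?? '') : $raw;
           }
       }
       $zip->close();
       return $values;
   }

   // Demo: two string cells and one numeric cell.
   $ns = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';
   $stringsXml = '<sst xmlns="' . $ns . '"><si><t>Name</t></si><si><t>Total</t></si></sst>';
   $sheetXml = '<worksheet xmlns="' . $ns . '"><sheetData>'
             . '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>'
             . '<row r="2"><c r="A2"><v>42</v></c></row>'
             . '</sheetData></worksheet>';

   $file = tempnam(sys_get_temp_dir(), 'xlsx');
   $zip = new ZipArchive();
   $zip->open($file, ZipArchive::CREATE | ZipArchive::OVERWRITE);
   $zip->addFromString('xl/sharedStrings.xml', $stringsXml);
   $zip->addFromString('xl/worksheets/sheet1.xml', $sheetXml);
   $zip->close();

   $values = extractXlsxValues($file);
   unlink($file);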

The ``loadXlsxFile()`` method of the ``Zend\Search\Lucene\Document\Xlsx`` class also accepts an optional second
argument. If it is set to ``TRUE``, the body content is also stored within the index and can be retrieved from it.
By default, the body is tokenized and indexed, but not stored.

Parsed documents may be augmented by the programmer with any other field:

.. code-block:: php
   :linenos:

   $doc = Zend\Search\Lucene\Document\Xlsx::loadXlsxFile($filename);
   $doc->addField(Zend\Search\Lucene\Field::UnIndexed(
       'indexTime',
       time()));
   $doc->addField(Zend\Search\Lucene\Field::Text(
       'annotation',
       'Document annotation text'));
   $index->addDocument($doc);



