Zend Framework 2 Documentation (Manual)

Indexing — Zend Framework 2 2.4.2 documentation

Indexing

Indexing is performed by adding a new document to an existing or new index:

$index->addDocument($doc);

There are two ways to create a document object. The first is to build it manually.

Manual Document Construction

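$doc = new Zend\Search\Lucene\Document();
// Text fields are tokenized, indexed, and stored in the index.
$doc->addField(Zend\Search\Lucene\Field::text('url', $docUrl));
$doc->addField(Zend\Search\Lucene\Field::text('title', $docTitle));
// UnStored fields are tokenized and indexed, but the original text is not stored.
$doc->addField(Zend\Search\Lucene\Field::unStored('contents', $docBody));
// Binary fields are stored but not indexed.
$doc->addField(Zend\Search\Lucene\Field::binary('avatar', $avatarData));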

The second method is to load it from HTML or Microsoft Office 2007 files:

Document loading

// Each call is an alternative: use the loader that matches the source format.
$doc = Zend\Search\Lucene\Document\Html::loadHTML($htmlString);
$doc = Zend\Search\Lucene\Document\Docx::loadDocxFile($path);
$doc = Zend\Search\Lucene\Document\Pptx::loadPptxFile($path);
$doc = Zend\Search\Lucene\Document\Xlsx::loadXlsxFile($path);

A document loaded from one of the supported formats can still be extended manually with new user-defined fields.

Indexing Policy

Define your indexing policy as part of your application's architectural design.

You may need an on-demand indexing configuration (similar to an OLTP system). Such systems usually add one document per user request, so the MaxBufferedDocs option has no effect. MaxMergeDocs, on the other hand, is very helpful because it lets you limit the maximum script execution time. MergeFactor should be set to a value that balances average indexing time (which is also affected by average auto-optimization time) against search performance (the index optimization level depends on the number of segments).
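As an illustration of the on-demand case, the following is a minimal sketch rather than a recommended configuration. It assumes the index object exposes `setMaxMergeDocs()` and `setMergeFactor()` setters for the options named above. The helper name `tuneForOnDemandIndexing` and the default values are hypothetical and should be tuned against your own measurements.

```php
<?php
/**
 * Apply on-demand (one document per request) tuning to an index.
 *
 * Assumption: $index exposes setMaxMergeDocs() and setMergeFactor().
 * The default values below are illustrative only.
 */
function tuneForOnDemandIndexing($index, $maxMergeDocs = 1000, $mergeFactor = 5)
{
    // Segments larger than this are not merged while adding documents,
    // which bounds the merge work a single request can trigger.
    $index->setMaxMergeDocs($maxMergeDocs);

    // Smaller values merge segments more often: indexing gets slower,
    // but searches have fewer segments to visit.
    $index->setMergeFactor($mergeFactor);

    return $index;
}
```

The helper would be called once right after opening the index and before `$index->addDocument($doc)`.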

If you primarily perform batch index updates, set MaxBufferedDocs to the largest value the available memory supports. Set MaxMergeDocs and MergeFactor to values that reduce auto-optimization involvement as much as possible [1]. Apply full index optimization after indexing.

Index optimization

$index->optimize();
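The batch-oriented policy can be sketched as a single helper. This is only a sketch under assumptions: the index object is assumed to expose `setMaxBufferedDocs()` and `setMergeFactor()` setters for the options discussed above, alongside the `addDocument()` and `optimize()` calls already shown. The function name `indexBatch` and the default values are hypothetical.

```php
<?php
/**
 * Index a batch of documents, then fully optimize the index.
 *
 * Assumption: $index exposes setMaxBufferedDocs() and setMergeFactor().
 * The default values are illustrative; size them to the available memory
 * and to the file handle limit of the operating system.
 */
function indexBatch($index, array $docs, $maxBufferedDocs = 500, $mergeFactor = 50)
{
    // Keep more documents in memory before a new segment is written.
    $index->setMaxBufferedDocs($maxBufferedDocs);

    // A larger merge factor makes automatic merges rarer during the batch,
    // at the cost of more segment files being open at once.
    $index->setMergeFactor($mergeFactor);

    $count = 0;
    foreach ($docs as $doc) {
        $index->addDocument($doc);
        $count++;
    }

    // One full optimization after the whole batch has been added.
    $index->optimize();

    return $count;
}
```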

In some configurations, it is more effective to serialize index updates by organizing update requests into a queue and processing several of them in a single script execution. This reduces the overhead of opening the index and makes use of index document buffering.
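The queue-based approach might look like the sketch below. Everything other than `addDocument()` is an assumption made for illustration: the queue is modeled as an in-memory `SplQueue` of prepared documents (a real system would likely use persistent storage), `$openIndex` is a hypothetical callable that opens and returns the index, and `commit()` is assumed to be available on the index to flush buffered documents.

```php
<?php
/**
 * Process up to $limit queued update requests in one script execution.
 *
 * $queue     SplQueue of document objects waiting to be indexed.
 * $openIndex Hypothetical callable that opens and returns the index.
 */
function processUpdateQueue(SplQueue $queue, $openIndex, $limit = 100)
{
    if ($queue->isEmpty()) {
        return 0; // Nothing queued, so the index is never opened.
    }

    // The index is opened once for the whole run, not once per request.
    $index = call_user_func($openIndex);

    $processed = 0;
    while ($processed < $limit && !$queue->isEmpty()) {
        $index->addDocument($queue->dequeue());
        $processed++;
    }

    // Assumed method: write the buffered documents to the index.
    $index->commit();

    return $processed;
}
```

Requests beyond `$limit` stay in the queue for the next run, which keeps the execution time of a single script bounded.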

[1] An additional limit is the maximum number of file handles the operating system supports for concurrently open files.
