27.10. Best practice

27.10.1. Field names

Zend_Search_Lucene places no restrictions on field names.

Nevertheless, it's a good idea not to use 'id' and 'score' as field names, to avoid ambiguity with QueryHit property names.

The id and score properties of Zend_Search_Lucene_Search_QueryHit always refer to the internal Lucene document id and the hit score. If an indexed document has stored fields with the same names, you have to use the getDocument() method to access them:

<?php
$hits = $index->find($query);

foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;

    // Get 'contents' document field
    $contents = $hit->contents;


    // Get internal Lucene document id
    $id = $hit->id;

    // Get query hit score
    $score = $hit->score;


    // Get 'id' document field
    $docId = $hit->getDocument()->id;

    // Get 'score' document field
    $docScore = $hit->getDocument()->score;

    // Another way to get 'title' document field
    $title = $hit->getDocument()->title;
}
            

27.10.2. Indexing performance

Indexing performance is a compromise between the resources used, indexing time and index quality.

Index quality is completely determined by the number of index segments.

Each index segment is an entirely independent portion of data, so an index containing more segments needs more memory and more time for searching.

Index optimization is the process of merging several segments into a new one. A fully optimized index contains only one segment.

Full index optimization may be performed with the optimize() method:

<?php
$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();
            

Index optimization works with data streams and doesn't take a lot of memory, but it does take processor resources and time.

Lucene index segments are not updatable by their nature (an update operation requires the segment file to be completely rewritten). So adding new documents to the index always generates a new segment, which decreases index quality.

The index auto-optimization process is performed after each segment generation and consists of partially merging segments.

There are three options to control the behavior of auto-optimization (see the Index optimization section):

  • MaxBufferedDocs is the number of documents buffered in memory before a new segment is generated and written to the hard drive.

  • MaxMergeDocs is the maximum number of documents merged by the auto-optimization process into a new segment.

  • MergeFactor determines how often auto-optimization is performed.

Note

All these options are properties of the Zend_Search_Lucene object, not of the index. So they affect only the behavior of the current Zend_Search_Lucene object and may vary between scripts.
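As a configuration sketch, the three options can be set on an opened index object through its setter methods. The path and the numeric values below are placeholders chosen for illustration only, not recommendations; suitable values depend on your documents and memory limits.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Placeholder path (assumption): point this at an existing index folder.
$indexPath = '/path/to/index';

$index = Zend_Search_Lucene::open($indexPath);

// Illustrative values only. They apply to this object for the lifetime
// of this script and are not stored in the index.
$index->setMaxBufferedDocs(100);
$index->setMaxMergeDocs(100000);
$index->setMergeFactor(20);
```

Another script that opens the same index starts from the default values again unless it sets its own.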

MaxBufferedDocs doesn't matter if you index only one document per script execution. For batch indexing, on the contrary, it's very important. A greater value increases indexing performance, but also needs more memory.

There is no way to calculate the best value for the MaxBufferedDocs parameter, because it depends on document size, the analyzer used and the allowed memory.

A good way to find the right value is to perform several tests with the largest document you expect to be added to the index [12]. It's a good idea not to use more than half of the allowed memory.
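The half-of-allowed-memory rule can be checked at the end of such a test run roughly as follows. This is a minimal sketch: memoryLimitInBytes() and isWithinHalfOfLimit() are hypothetical helper names introduced here, not part of Zend_Search_Lucene; only ini_get() and memory_get_peak_usage() are standard PHP functions.

```php
<?php
/**
 * Convert a php.ini memory_limit value such as '128M' into bytes.
 * Hypothetical helper for illustration. Returns -1 when no limit is set.
 */
function memoryLimitInBytes($limit)
{
    $limit = trim((string) $limit);
    if ($limit === '-1') {
        return -1; // unlimited
    }

    $number = (int) $limit; // leading digits, e.g. 128 from '128M'
    switch (strtoupper(substr($limit, -1))) {
        case 'G':
            return $number * 1024 * 1024 * 1024;
        case 'M':
            return $number * 1024 * 1024;
        case 'K':
            return $number * 1024;
        default:
            return $number; // plain number of bytes
    }
}

/**
 * True if the measured peak stays within half of the limit
 * (or if there is no limit at all).
 */
function isWithinHalfOfLimit($peakBytes, $limitBytes)
{
    return $limitBytes < 0 || $peakBytes <= $limitBytes / 2;
}

// Run this after indexing the largest expected document in a test script.
$limitBytes = memoryLimitInBytes(ini_get('memory_limit'));
$peakBytes  = memory_get_peak_usage();

echo isWithinHalfOfLimit($peakBytes, $limitBytes)
    ? "Peak memory usage is within half of the limit\n"
    : "Peak memory usage exceeds half of the limit; lower MaxBufferedDocs\n";
```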

MaxMergeDocs limits segment size (in terms of documents), and so limits auto-optimization time. That guarantees that the addDocument() method never runs for longer than a certain time, which is important for interactive applications.

Decreasing the MaxMergeDocs parameter may also improve batch indexing performance. Index auto-optimization is an iterative process performed step by step: small segments are merged into larger ones, which at some point are merged into even larger ones, and so on. Full index optimization is much more effective.

On the other hand, smaller segments decrease index quality and may result in too many segments. This may cause the "Too many open files" error, which is determined by OS limitations [13].

So background index optimization should be performed for interactive indexing mode, and MaxMergeDocs shouldn't be too low for batch indexing.

MergeFactor affects auto-optimization frequency. Smaller values increase the quality of an unoptimized index. Larger values increase indexing performance, but also increase the number of segments, which again may cause the "Too many open files" error.

MergeFactor groups index segments by their size:

  1. Not greater than MaxBufferedDocs.

  2. Greater than MaxBufferedDocs, but not greater than MaxBufferedDocs*MergeFactor.

  3. Greater than MaxBufferedDocs*MergeFactor, but not greater than MaxBufferedDocs*MergeFactor*MergeFactor.

  4. ...

At each addDocument() call, Zend_Search_Lucene checks whether merging any group of segments may move the newly created segment into the next group. If so, the merge is performed.

So an index with N groups may contain MaxBufferedDocs + (N-1)*MergeFactor segments and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.

This gives a good approximation for the number of segments in the index (log_MergeFactor denotes the logarithm to base MergeFactor):

NumberOfSegments <= MaxBufferedDocs + MergeFactor*log_MergeFactor(NumberOfDocuments/MaxBufferedDocs)

MaxBufferedDocs is determined by the allowed memory. This makes it possible to choose an appropriate merge factor to get a reasonable number of segments.
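To compare candidate merge factors, the estimate can be evaluated directly. The sketch below is plain PHP and does not touch an index; estimateMaxSegments() is a hypothetical helper name, and clamping the logarithm at zero for very small indexes is an assumption added here so that the bound never drops below MaxBufferedDocs.

```php
<?php
/**
 * Upper bound on the number of segments, following the estimate
 * NumberOfSegments <= MaxBufferedDocs
 *     + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs).
 *
 * Hypothetical helper for illustration; $mergeFactor must be greater than 1.
 */
function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
{
    $ratio = $numberOfDocuments / $maxBufferedDocs;

    // Assumption: for indexes smaller than MaxBufferedDocs the logarithm
    // is negative, so it is clamped to zero.
    $levels = max(0.0, log($ratio, $mergeFactor));

    return $maxBufferedDocs + $mergeFactor * $levels;
}

// Compare a few merge factors for one million documents
// with MaxBufferedDocs fixed at 10.
foreach (array(2, 10, 50) as $mergeFactor) {
    printf(
        "MergeFactor %d: at most about %d segments\n",
        $mergeFactor,
        ceil(estimateMaxSegments(1000000, 10, $mergeFactor))
    );
}
```

The result is only an upper bound on an unoptimized index; a full optimize() call still reduces the index to a single segment.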

Tuning the MergeFactor parameter is more effective for batch indexing performance than tuning MaxMergeDocs, but it is also coarser. So use the estimate above to tune MergeFactor, then experiment with MaxMergeDocs to get the best batch indexing performance.

27.10.3. Index shutting down

The Zend_Search_Lucene object performs some work at shutdown time if any documents were added to the index.

This work is related to the buffering of added documents before a new segment is generated.

It may also trigger the auto-optimization process.

The index object is automatically shut down when it, and all returned QueryHit objects, go out of scope.

If the index object is stored in a global variable, then it's destroyed only at the end of script execution [14].

PHP exception processing is also shut down at this moment.

This doesn't prevent the normal index shutdown process, but it may prevent correct error diagnostics if an error occurs.

There are two ways to avoid this problem.

The first is to force the index object out of scope:

<?php
$index = Zend_Search_Lucene::open($indexPath);

...

unset($index);
            

The second is to perform a commit operation before the end of script execution:

<?php
$index = Zend_Search_Lucene::open($indexPath);

$index->commit();
            

This possibility is also described in the "Advanced. Using index as static property" documentation section.

27.10.4. Retrieving documents by unique id

It's common practice to store some unique document id in the index, for example a URL, a path or a database id.

Zend_Search_Lucene provides the termDocs() method for retrieving documents containing a specified term.

It's more efficient than the find() method:

<?php
// Retrieving documents with find() method using query string
$query = $idFieldName . ':' . $docId;
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}
...

// Retrieving documents with find() method using query API
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
    ...
}

...

// Retrieving documents with termDocs() method
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds  = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc = $index->getDocument($id);
    $title    = $doc->title;
    $contents = $doc->contents;
    ...
}
            

27.10.5. Memory usage

Zend_Search_Lucene is a memory-intensive module.

It uses memory to cache some information and to speed up searching and indexing.

The behavior differs between modes.

The terms dictionary index is loaded during a search. It actually contains every 128th [15] term of the full dictionary.

Thus memory usage increases if you have a high number of unique terms. This may happen if you use untokenized phrases as field values or index a large volume of non-text information.

An unoptimized index consists of several segments, which also increases memory usage. Segments are independent, so each segment contains its own terms dictionary and terms dictionary index. If an index consists of N segments, memory usage may increase N times in the worst case. Perform index optimization to merge all segments into one.

Indexing uses the same memory as searching, plus memory for buffering documents. The amount of memory used for buffering may be managed with the MaxBufferedDocs parameter.

Index optimization (full or partial) uses stream-like data processing and doesn't take a lot of memory.

27.10.6. Encoding

Zend_Search_Lucene works with UTF-8 strings internally, so all strings returned by Zend_Search_Lucene are UTF-8 encoded.

You needn't care about encoding if you work with pure ASCII data, but you should be careful in other cases.

A wrong encoding may cause error notices at encoding conversion time, or cause loss of data.
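The silent kind of damage can be reproduced with plain PHP. The sketch below uses iconv() only as a stand-in for an encoding conversion step (an assumption made for illustration; it does not call Zend_Search_Lucene) and shows what happens when UTF-8 input is wrongly declared as ISO-8859-1:

```php
<?php
// "café" as ISO-8859-1 bytes and as UTF-8 bytes.
$latin1 = "caf\xE9";
$utf8   = "caf\xC3\xA9";

// Correct declaration: ISO-8859-1 input is converted to proper UTF-8.
$converted = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Wrong declaration: the input is already UTF-8, but it is declared as
// ISO-8859-1. Each byte of the two-byte "é" is converted separately,
// so the text is garbled without any notice being raised.
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8);

var_dump($converted === $utf8); // bool(true)
var_dump($garbled === $utf8);   // bool(false)
var_dump(strlen($garbled));     // int(7): two bytes more than the original
```

A term garbled like this no longer matches the same word typed into a correctly encoded query.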

Zend_Search_Lucene offers a wide range of possibilities for specifying the actual encoding of indexed documents and parsed queries.

The encoding may be explicitly specified as an optional parameter of the field creation methods:

<?php
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));
            

That's the best way to avoid ambiguity in encoding specification.

If the optional encoding parameter is omitted, then the current locale is used. The current locale may contain a character set specification in addition to the language information:

<?php
setlocale(LC_ALL, 'fr_FR');
...

setlocale(LC_ALL, 'de_DE.iso-8859-1');
...

setlocale(LC_ALL, 'ru_RU.UTF-8');
...
            

The same approach is used to specify the query string encoding.

If the encoding is not specified in any special way, then the current locale is used.

The encoding may be passed as an optional parameter if the query is parsed explicitly before the search:

<?php
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
...
            

The default encoding may also be specified with the setDefaultEncoding() method:

<?php
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
...
            

An empty string means 'current locale'.

Once the correct encoding is specified, the text can be correctly processed by the analyzer. The actual behavior depends on the analyzer being used. See the Character set documentation section for details.

27.10.7. Index maintenance

It should be clear that Zend_Search_Lucene, like any other Lucene implementation, is not a "database".

It should not be used as a kind of data storage. It doesn't provide partial backup/restore functionality, journaling, logging, transactions or many other things provided by database management systems.

Nevertheless, Zend_Search_Lucene tries to keep the index in a consistent state at all times.

Index backup/restore should be performed off-line by copying the complete index folder.

If index corruption happens for any reason, the index should be completely restored or rebuilt.

So it's a good idea to back up large indexes and store a changelog somewhere, so that a manual restore plus roll-forward operation can be performed if necessary. This substantially reduces index restore time.
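A complete copy of the index folder might be scripted along the following lines. This is a sketch under stated assumptions: backupIndexFolder() is a hypothetical helper, not part of Zend_Search_Lucene; no process is writing to the index while the copy runs; and the index folder contains only plain files, so subfolders are skipped.

```php
<?php
/**
 * Copy every file of an index folder into a backup folder.
 * Hypothetical helper for illustration. Returns the number of files copied.
 */
function backupIndexFolder($indexPath, $backupPath)
{
    if (!is_dir($indexPath)) {
        throw new RuntimeException("Index folder not found: $indexPath");
    }
    if (!is_dir($backupPath) && !mkdir($backupPath, 0777, true)) {
        throw new RuntimeException("Cannot create backup folder: $backupPath");
    }

    $copied = 0;
    foreach (scandir($indexPath) as $name) {
        $source = $indexPath . DIRECTORY_SEPARATOR . $name;
        if (!is_file($source)) {
            continue; // skips '.', '..' and any subfolders
        }
        if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $name)) {
            throw new RuntimeException("Cannot copy file: $source");
        }
        $copied++;
    }

    return $copied;
}

// Example call with placeholder paths:
// backupIndexFolder('/path/to/index', '/path/to/backup/index-copy');
```

Restoring is the same copy in the opposite direction, again performed while nothing is using the index.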



[12] memory_get_usage() and memory_get_peak_usage() may be used to monitor memory usage.

[13] Zend_Search_Lucene keeps each segment file open to improve search performance.

[14] This may also occur if index or QueryHit objects are referenced from complex data structures. For example, PHP destroys objects with cyclic references only at the end of script execution.

[15] The Lucene file format allows this number to be changed, but Zend_Search_Lucene doesn't provide a way to do so through its API. Nevertheless, you can still change this value if the index is prepared with another Lucene implementation.