Zend\Search\Lucene works with the UTF-8 charset internally. Index files store Unicode data in Java’s “modified UTF-8” encoding, which the Zend\Search\Lucene core supports completely, with one exception. [1]
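Because of that exception (characters above 0xFFFF, see [1]), it can be useful to check input before indexing. The following is a minimal sketch, not part of Zend\Search\Lucene: the helper name containsSupplementaryCharacters() is hypothetical, and the code assumes PHP 7 or later and PCRE built with UTF-8 support.

```php
<?php
// Hypothetical helper (not part of Zend\Search\Lucene): reports whether a
// UTF-8 string contains characters outside the Basic Multilingual Plane.
function containsSupplementaryCharacters(string $utf8Text): bool
{
    // With the /u modifier, the class matches code points above 0xFFFF.
    // preg_match() returns false for invalid UTF-8, which also yields false here.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8Text) === 1;
}

var_dump(containsSupplementaryCharacters('abc'));        // bool(false)
var_dump(containsSupplementaryCharacters("\u{1D11E}"));  // bool(true), U+1D11E
```

How to handle such input (reject it, strip the characters, or replace them) is an application decision that this sketch leaves open.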
The actual encoding of input data may be specified through the Zend\Search\Lucene API; the data is then converted to UTF-8 automatically.
However, the default text analyzer (which is also used within the query parser) uses ctype_alpha() for tokenizing text and queries.
ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to ‘ASCII//TRANSLIT’ encoding before indexing. The same processing is transparently performed during query parsing. [2]
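The effect of this pre-processing can be reproduced with plain PHP functions. The sketch below only illustrates the idea and is not the analyzer's actual code; the helper name toAsciiTranslit() is hypothetical, and the iconv and ctype extensions are assumed to be enabled. No output is shown for the locale-dependent calls, because the result varies with the iconv implementation, the locale and the operating system.

```php
<?php
// Hypothetical helper: approximates the conversion described above.
// Returns the transliterated string, or false if iconv cannot convert the input.
function toAsciiTranslit(string $utf8Text)
{
    return iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Text);
}

$word = "na\u{EF}ve"; // "naïve" encoded as UTF-8

// ctype_alpha() inspects single bytes, so the two-byte "ï" is not
// reliably recognized as a letter.
var_dump(ctype_alpha($word));

// After transliteration only ASCII remains; the exact replacement for "ï"
// depends on the platform.
var_dump(toAsciiTranslit($word));

// Pure ASCII letters are always accepted.
var_dump(ctype_alpha('naive')); // bool(true)
```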
Note
The default analyzer doesn’t treat numbers as parts of terms. Use the corresponding ‘Num’ analyzer if you don’t want words to be broken by numbers.
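The difference between the two tokenization strategies can be sketched with two regular expressions. Both functions below are hypothetical stand-ins written for illustration (ASCII only); they are not the analyzers' implementation.

```php
<?php
// Hypothetical stand-in for a letters-only analyzer: digits end a term.
function lettersOnlyTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

// Hypothetical stand-in for a 'Num' analyzer: digits are part of a term.
function lettersAndDigitsTokens(string $text): array
{
    preg_match_all('/[a-zA-Z0-9]+/', $text, $matches);
    return $matches[0];
}

print_r(lettersOnlyTokens('mp3player v2'));      // mp, player, v
print_r(lettersAndDigitsTokens('mp3player v2')); // mp3player, v2
```

With the letters-only strategy a search for "mp3player" is effectively a search for the two terms "mp" and "player", which is why a ‘Num’ analyzer is the better choice for such data.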
Zend\Search\Lucene also contains a set of UTF-8 compatible analyzers: Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8, Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8Num, Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8\CaseInsensitive, Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8Num\CaseInsensitive.
Any of these analyzers can be enabled with code like this:

    Zend\Search\Lucene\Analysis\Analyzer::setDefault(
        new Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8());

Warning
The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions assumed that all non-ASCII characters are letters; the new implementation classifies characters more accurately.
This may require you to rebuild the index so that data and search queries are tokenized in the same way; otherwise the search engine may return wrong result sets.
All of these analyzers need the PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support turned on. UTF-8 support is turned on for the PCRE library sources bundled with the PHP source distribution, but if a shared library is used instead of the bundled sources, the state of UTF-8 support may depend on your operating system.
Use the following code to check whether PCRE UTF-8 support is enabled:

    // \pL matches any Unicode letter; the /u modifier fails without UTF-8 support
    if (@preg_match('/\pL/u', 'a') == 1) {
        echo "PCRE unicode support is turned on.\n";
    } else {
        echo "PCRE unicode support is turned off.\n";
    }

The case-insensitive versions of the UTF-8 compatible analyzers also need the mbstring extension to be enabled.
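Whether mbstring is available can be checked at runtime. The following is a minimal sketch; the helper name lowercaseUtf8() is hypothetical and only illustrates the kind of UTF-8 aware lowercasing that mbstring provides.

```php
<?php
// Hypothetical helper: lowercases UTF-8 text, failing loudly without mbstring.
function lowercaseUtf8(string $text): string
{
    if (!extension_loaded('mbstring')) {
        throw new RuntimeException('The mbstring extension is not loaded.');
    }

    // mb_strtolower() handles multi-byte letters such as "Ä" correctly.
    return mb_strtolower($text, 'UTF-8');
}

echo lowercaseUtf8("\u{C4}PFEL"), "\n"; // "ÄPFEL" becomes "äpfel"
```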
If you don’t want to enable the mbstring extension but need case-insensitive search, you may use the following approach: normalize the source data before indexing and the query string before searching by converting both to lowercase:

    // Indexing
    setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
    // ...
    Zend\Search\Lucene\Analysis\Analyzer::setDefault(
        new Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8());
    // ...
    $doc = new Zend\Search\Lucene\Document();
    // Contents field for search through (indexed, unstored), lowercased
    $doc->addField(Zend\Search\Lucene\Field::UnStored('contents',
                                                      strtolower($contents)));
    // Title field for search through (indexed, unstored), lowercased
    $doc->addField(Zend\Search\Lucene\Field::UnStored('title',
                                                      strtolower($title)));
    // Title field for retrieving (unindexed, stored), original case
    $doc->addField(Zend\Search\Lucene\Field::UnIndexed('_title', $title));

    // Searching
    setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
    // ...
    Zend\Search\Lucene\Analysis\Analyzer::setDefault(
        new Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8());
    // ...
    // Lowercase the query the same way the indexed text was lowercased
    $hits = $index->find(strtolower($query));

[1] Zend\Search\Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn’t support “supplementary characters” (characters whose code points are greater than 0xFFFF). Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate is then encoded as an ordinary three-byte UTF-8 character, giving six bytes in total. The standard UTF-8 representation uses four bytes for supplementary characters.
[2] Conversion to ‘ASCII//TRANSLIT’ may depend on the current locale and OS.