Zend Framework 2 Documentation (Manual)

Character Set — Zend Framework 2 2.4.2 documentation

Character Set

UTF-8 and single-byte character set support

Zend\Search\Lucene works with the UTF-8 charset internally. Index files store Unicode data in Java’s “modified UTF-8 encoding”. The Zend\Search\Lucene core fully supports this encoding, with one exception. [1]

The actual encoding of the input data may be specified through the Zend\Search\Lucene API. The data is then converted to UTF-8 automatically.
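
As a rough illustration of what such a conversion involves, the sketch below turns an ISO-8859-1 byte string into UTF-8 with PHP's iconv(). It is not the library's own code: the helper name toUtf8() is hypothetical, and the example assumes the iconv extension is available.

```php
<?php
// Hypothetical helper: convert input text from a declared source
// encoding to UTF-8, the encoding used internally by the index.
function toUtf8(string $text, string $sourceEncoding): string
{
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $text; // already in the target encoding
    }
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException(
            "Cannot convert from {$sourceEncoding} to UTF-8"
        );
    }
    return $converted;
}

// "Grüße" in ISO-8859-1: ü is byte 0xFC, ß is byte 0xDF.
$latin1 = "Gr\xFC\xDFe";
$utf8   = toUtf8($latin1, 'ISO-8859-1');

echo bin2hex($utf8), "\n"; // 4772c3bcc39f65
```

Each of the two non-ASCII characters grows from one byte to two, which is why the declared input encoding has to match the data actually supplied.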

Default text analyzer

However, the default text analyzer (which is also used within the query parser) uses ctype_alpha() to tokenize text and queries.

ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to ‘ASCII//TRANSLIT’ encoding before indexing. The same processing is transparently performed during query parsing. [2]
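
The following sketch shows the limitation in isolation. It pins the ctype locale to 'C' so the result does not depend on the environment, and it wraps the transliteration step in a hypothetical helper, toAsciiTranslit(), which is an assumption for illustration and not part of the library. No output is assumed for the transliteration itself, since it varies by locale and iconv implementation.

```php
<?php
// Pin the ctype locale so the checks below are reproducible.
setlocale(LC_CTYPE, 'C');

$ascii = 'cafe';
$utf8  = "caf\u{00E9}"; // "café": é is the two bytes 0xC3 0xA9 in UTF-8

var_dump(ctype_alpha($ascii)); // bool(true)
var_dump(ctype_alpha($utf8));  // bool(false): the bytes of é are not ASCII letters

// Hypothetical helper mirroring the conversion described above.
// The replacement characters depend on the locale and on the iconv
// implementation, so no particular result is assumed here.
function toAsciiTranslit(string $text): string
{
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    // Fall back to the input when the conversion is not supported.
    return $result === false ? $text : $result;
}

echo toAsciiTranslit($utf8), "\n";
```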

Note

The default analyzer doesn’t treat numbers as parts of terms. Use the corresponding ‘Num’ analyzer if you don’t want words to be broken by numbers.
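
The difference can be sketched with a small stand-alone tokenizer. The function splitTerms() below is hypothetical and only approximates the behaviour described in the note; it is not the analyzer's actual implementation.

```php
<?php
// Hypothetical sketch: split ASCII text into terms, optionally
// treating digits as part of a term.
function splitTerms(string $text, bool $withDigits): array
{
    $isTermChar = $withDigits ? 'ctype_alnum' : 'ctype_alpha';
    $terms = [];
    $current = '';
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        if ($isTermChar($text[$i])) {
            $current .= $text[$i];   // extend the current term
        } elseif ($current !== '') {
            $terms[] = $current;     // a non-term character ends the term
            $current = '';
        }
    }
    if ($current !== '') {
        $terms[] = $current;         // flush the last term
    }
    return $terms;
}

print_r(splitTerms('model x86a rev2', false)); // model, x, a, rev
print_r(splitTerms('model x86a rev2', true));  // model, x86a, rev2
```

With letters only, "x86a" falls apart into "x" and "a", so a query for "x86a" would no longer match the term the author had in mind.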

UTF-8 compatible text analyzers

Zend\Search\Lucene also contains a set of UTF-8 compatible analyzers: Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8, Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8Num, Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8\CaseInsensitive, Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8Num\CaseInsensitive.

Any of these analyzers can be enabled with code like this:

Zend\Search\Lucene\Analysis\Analyzer::setDefault(
    new Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8());

Warning

The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions assumed that all non-ASCII characters are letters; the newer implementation classifies characters more accurately.

Because of this change, you may need to rebuild the index so that data and search queries are tokenized in the same way; otherwise the search engine may return wrong result sets.

All of these analyzers require the PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support. UTF-8 support is enabled in the PCRE sources bundled with the PHP source distribution, but if a shared library is used instead of the bundled one, UTF-8 support may depend on your operating system.

Use the following code to check whether PCRE UTF-8 support is enabled:

if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
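
To show what this PCRE support makes possible, the sketch below splits a UTF-8 string into terms using Unicode character properties. The function name utf8Terms() and the patterns are illustrative assumptions, not the analyzers' actual implementation; the example requires PCRE with UTF-8 support.

```php
<?php
// Hypothetical sketch: extract terms from UTF-8 text using Unicode
// character properties (\p{L} = any letter, \p{N} = any number).
function utf8Terms(string $text, bool $withDigits = false): array
{
    $pattern = $withDigits ? '/[\p{L}\p{N}]+/u' : '/\p{L}+/u';
    if (preg_match_all($pattern, $text, $matches) === false) {
        throw new RuntimeException('PCRE could not process the input as UTF-8');
    }
    return $matches[0];
}

$text = "Gr\u{00FC}\u{00DF}e aus M\u{00FC}nchen 2024";
print_r(utf8Terms($text));       // Grüße, aus, München
print_r(utf8Terms($text, true)); // Grüße, aus, München, 2024
```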

The case-insensitive versions of the UTF-8 compatible analyzers also require the mbstring extension to be enabled.
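
A minimal sketch of why mbstring matters for lowercasing: mb_strtolower() understands multi-byte UTF-8 characters, while strtolower() works on single bytes. The helper lowercaseUtf8() is hypothetical and only illustrates the dependency.

```php
<?php
// Hypothetical helper: lowercase UTF-8 text when mbstring is available,
// otherwise fall back to strtolower(), which is only reliable for ASCII.
function lowercaseUtf8(string $text): string
{
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($text, 'UTF-8');
    }
    return strtolower($text);
}

echo lowercaseUtf8('Zend SEARCH'), "\n"; // zend search
```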

If you do not want to enable the mbstring extension but still need case-insensitive search, you can normalize the source data before indexing and the query string before searching by converting both to lowercase:

// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend\Search\Lucene\Analysis\Analyzer::setDefault(
    new Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8());

...

$doc = new Zend\Search\Lucene\Document();

$doc->addField(Zend\Search\Lucene\Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for search through (indexed, unstored)
$doc->addField(Zend\Search\Lucene\Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend\Search\Lucene\Field::UnIndexed('_title', $title));

// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend\Search\Lucene\Analysis\Analyzer::setDefault(
    new Zend\Search\Lucene\Analysis\Analyzer\Common\Utf8());

...

$hits = $index->find(strtolower($query));

[1] Zend\Search\Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn’t support “supplementary characters” (characters whose code points are greater than 0xFFFF).

Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate is then encoded like an ordinary character as a three-byte UTF-8 sequence, giving six bytes in total. The standard UTF-8 representation uses four bytes for supplementary characters.
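
The byte counts can be checked with a small worked example. The functions below are written for illustration only and are not part of the library; the code point U+1F600 is an arbitrary choice of supplementary character, and the special two-byte encoding that Java's modified UTF-8 uses for U+0000 is left out.

```php
<?php
// Encode one code point (up to 0xFFFF) as a 1- to 3-byte UTF-8 sequence.
function encodeBmp(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Standard UTF-8: a supplementary code point takes four bytes.
function encodeStandard(int $cp): string
{
    if ($cp <= 0xFFFF) {
        return encodeBmp($cp);
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Java-style encoding: split into a surrogate pair, then encode each
// surrogate as its own three-byte sequence (six bytes in total).
function encodeModified(int $cp): string
{
    if ($cp <= 0xFFFF) {
        return encodeBmp($cp);
    }
    $offset = $cp - 0x10000;
    $high = 0xD800 | ($offset >> 10);   // high surrogate, 0xD800-0xDBFF
    $low  = 0xDC00 | ($offset & 0x3FF); // low surrogate, 0xDC00-0xDFFF
    return encodeBmp($high) . encodeBmp($low);
}

$cp = 0x1F600; // a code point outside the BMP
echo bin2hex(encodeStandard($cp)), "\n"; // f09f9880
echo bin2hex(encodeModified($cp)), "\n"; // eda0bdedb880
```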

[2] Conversion to ‘ASCII//TRANSLIT’ may depend on the current locale and OS.
