Serbian language search does not allow use of the bald Latin alphabet
Open, Medium, Public

Description

When searching, most Internet users type in the bald Latin alphabet (without the letters č, ć, š, ž and đ). This is similar to German, where a search for "Muenchen" returns the results for "München". Serbian Wikipedia should support searching this way, but it doesn't. Example:

  1. Search for "marković": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=markovi%C4%87&searchToken=cibdktt9t7eu2hv4o3n1hgg84
    1. Observed: 207 search results.
  2. Search for "markovic": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=markovic&searchToken=gf3dawrz4tio3a91fujm144m
    1. Expected: all the 207 previous search results should appear.
    2. Observed: Only 47 results appear.

An overview of the issue is given at https://wiki.apache.org/solr/SerbianLanguageSupport

Event Timeline

Wikimedia sites do not use MediaWiki's default search backend (MediaWiki-Search), hence setting the project to CirrusSearch.

debt triaged this task as Medium priority. Jul 1 2016, 4:45 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt subscribed.

We'll take a look and hopefully it'll be fairly 'easy' to fix.

If we want both bald Latin and Cyrillic-to-Latin mapping, it looks to be straightforward. See T138857#3391852 for more details.

Adding linguistic clarification and confirming this issue is still reproducible.

As a native Serbian speaker, I can confirm that Serbian Wikipedia users overwhelmingly type search queries without diacritics (č, ć, š, ž, đ). Accent-sensitive search is unexpected and significantly reduces recall.

Expected diacritic folding for Serbian Latin search would be:

č → c
ć → c
š → s
ž → z
đ → dj (or possibly d; this requires an explicit linguistic decision)
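To make the table above concrete, here is a minimal sketch of the folding as a plain PHP function. The function name is hypothetical, and the đ → dj choice is an assumption, since the dj-versus-d decision is still open.

```php
<?php
// Hypothetical helper: folds Serbian Latin diacritics to "bald" Latin.
// Assumption: đ maps to dj; the task leaves dj vs. d undecided.
function foldSerbianLatin( string $text ): string {
    $map = [
        'č' => 'c', 'Č' => 'C',
        'ć' => 'c', 'Ć' => 'C',
        'š' => 's', 'Š' => 'S',
        'ž' => 'z', 'Ž' => 'Z',
        'đ' => 'dj', 'Đ' => 'Dj',
    ];
    // strtr() with an array replaces every occurrence of each key.
    return strtr( $text, $map );
}

echo foldSerbianLatin( 'Marković' ), "\n"; // Markovic
echo foldSerbianLatin( 'Đoković' ), "\n";  // Djokovic
```

With this folding applied to both the indexed text and the query, "marković" and "markovic" reduce to the same string.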

The behavior described in this task is still reproducible today, and partial recall suggests some normalization already exists, but character folding remains incomplete at the language analyzer level.

Clarification question for scope: should this folding apply only to Latin↔Latin search, or also interact with the existing Cyrillic↔Latin script conversion used on srwiki?
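One way to picture the combined case is a sketch in which Cyrillic is first transliterated to Latin and the result is then folded. This is illustrative only: it is not how srwiki's script conversion is implemented, both function names are hypothetical, and đ → dj is again an assumption.

```php
<?php
// Hypothetical helper: transliterates lowercase Serbian Cyrillic to Latin.
function serbianCyrillicToLatin( string $text ): string {
    return strtr( $text, [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd',
        'ђ' => 'đ', 'е' => 'e', 'ж' => 'ž', 'з' => 'z', 'и' => 'i',
        'ј' => 'j', 'к' => 'k', 'л' => 'l', 'љ' => 'lj', 'м' => 'm',
        'н' => 'n', 'њ' => 'nj', 'о' => 'o', 'п' => 'p', 'р' => 'r',
        'с' => 's', 'т' => 't', 'ћ' => 'ć', 'у' => 'u', 'ф' => 'f',
        'х' => 'h', 'ц' => 'c', 'ч' => 'č', 'џ' => 'dž', 'ш' => 'š',
    ] );
}

// Hypothetical helper: lowercase, convert script, then fold diacritics.
function baldSearchKey( string $text ): string {
    $latin = serbianCyrillicToLatin( mb_strtolower( $text, 'UTF-8' ) );
    // Assumption: đ folds to dj rather than d.
    return strtr( $latin, [
        'č' => 'c', 'ć' => 'c', 'š' => 's', 'ž' => 'z', 'đ' => 'dj',
    ] );
}

echo baldSearchKey( 'Марковић' ), "\n"; // markovic
echo baldSearchKey( 'Marković' ), "\n"; // markovic
echo baldSearchKey( 'markovic' ), "\n"; // markovic
```

If folding is applied after script conversion like this, Cyrillic, Latin and bald Latin spellings all reduce to one key, which is the broader of the two scopes asked about above.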

This likely affects hrwiki and bswiki as well and may be best addressed in CirrusSearch language analysis rather than via local configuration.

Is this the solution to the problem?

This code directly addresses the issue outlined in task T138858 regarding the "bald Latin" search functionality in Serbian Wikipedia. The core of the problem is that the existing stemmer specifically protects Serbian characters with diacritics (like č, ć, š, ž, đ) to ensure accurate linguistic stemming, which consequently breaks the fallback "fuzzy" search for users who type without diacritics.

This solution takes a hybrid approach. It preserves the protective rules in the primary text analyzer so the SCStemmer continues to function correctly. At the same time, it modifies the fallback plain analyzer by applying an aggressive folding filter that strips all diacritics, and it maps the letters "đ/Đ" to "dj/Dj" before any other processing occurs. A search query like "markovic" should then match the indexed "Marković", providing the expected search results without compromising the linguistic integrity of the system.
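As a rough picture of the end state, the plain analyzer settings this plan aims for might look like the sketch below. The filter and char_filter names come from the proposal itself; the function name is hypothetical, and the 'custom' type and 'standard' tokenizer are assumptions rather than what CirrusSearch actually generates.

```php
<?php
// Sketch of the target analysis settings for the plain analyzer only.
// Assumptions: 'custom' analyzer type and 'standard' tokenizer.
function sketchPlainAnalysis(): array {
    return [
        'char_filter' => [
            'serbian_dj_mapping' => [
                'type' => 'mapping',
                'mappings' => [ 'đ => dj', 'Đ => Dj' ],
            ],
        ],
        'filter' => [
            // No unicodeSetFilter, so nothing is exempt from folding.
            'serbian_plain_folding' => [ 'type' => 'icu_folding' ],
        ],
        'analyzer' => [
            'plain' => [
                'type' => 'custom',
                'tokenizer' => 'standard',
                'char_filter' => [ 'serbian_dj_mapping' ],
                'filter' => [ 'lowercase', 'serbian_plain_folding' ],
            ],
        ],
    ];
}

echo json_encode(
    sketchPlainAnalysis(),
    JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE
), "\n";
```

The printed JSON can be compared by eye with the analysis settings of a rebuilt index to check that the plain analyzer picked up both pieces.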

Below is the detailed implementation plan with the translated code and comments:

// Step 1: Defining components for the "plain" analyzer
// All changes are made in the includes/Maintenance/AnalysisConfigBuilder.php file within the CirrusSearch extension. 
// First, we need to define our new rules within the main block for the Serbian/Croatian language.
// Find the customize() function (or defaults(), depending on the version), and within switch ($language) find case 'serbian':.

case 'bosnian':
case 'croatian':
case 'serbian':
case 'serbo-croatian':
    // 1. Define mapping for the letter Đ
    // We will add this mapping to the plain analyzers.
    $config['char_filter']['serbian_dj_mapping'] = [
        'type' => 'mapping',
        'mappings' => [
            'đ => dj',
            'Đ => Dj',
            'ђ => dj', // Support for users searching Cyrillic đ with bald Latin
            'Ђ => Dj'
        ]
    ];

    // 2. Define "aggressive" folding without TJones's exceptions
    // We need this because global icu_folding has unicodeSetFilter set for Serbian
    $config['filter']['serbian_plain_folding'] = [
        'type' => 'icu_folding'
        // Intentionally NOT adding unicodeSetFilter here!
    ];

    // The existing code for the 'text' analyzer remains intact so as not to break the stemmer
    $config = $myAnalyzerBuilder->withFilters( [ 'lowercase', 'icu_folding', 'serbian_stemmer' ] )
                                ->build( $config );
    break;

// Step 2: Intercepting the system method enableICUFolding
// This is the core of the solution. The enableICUFolding() method in Cirrus automatically goes through plain and plain_search analyzers 
// and glues the system icu_folding to them (which for the Serbian language contains the exceptions [^ĐđŽžĆ抚Čč]). 
// Because of this method, the search for "markovic" currently fails!
// Find the function enableICUFolding( array $config, $language ) and modify the logic that sets the plain filter.

public function enableICUFolding( array $config, $language ) {
    $unicodeSetFilter = $this->getICUSetFilter( $language );
    $filter = [
        'type' => 'icu_folding',
    ];
    if ( $unicodeSetFilter ) {
        $filter['unicodeSetFilter'] = $unicodeSetFilter;
    }
    $config['filter']['icu_folding'] = $filter;

    // ... [Existing code for replacing asciifolding in text analyzers] ...

    // THIS IS THE BLOCK WE ARE CHANGING: Explicitly enable icu_folding on plain analyzers
    if ( isset( $config['analyzer']['plain'] ) ) {

        // --- START OF CUSTOM LOGIC FOR T138858 ---
        $foldedLanguages = [ 'sr', 'hr', 'bs', 'sh', 'serbian', 'croatian', 'bosnian', 'serbo-croatian' ];
        if ( in_array( $language, $foldedLanguages, true ) ) {
            // Apply the same changes to plain and, if it is defined, plain_search
            foreach ( [ 'plain', 'plain_search' ] as $name ) {
                if ( !isset( $config['analyzer'][$name] ) ) {
                    // Do not create a partial analyzer that lacks a type or tokenizer
                    continue;
                }

                // Replace the token filters with lowercase plus unrestricted folding
                $config['analyzer'][$name]['filter'] = [ 'lowercase', 'serbian_plain_folding' ];

                // Insert the Đ -> Dj mapping at the very beginning of char_filter
                // (char filters run before tokenization)
                $charFilters = $config['analyzer'][$name]['char_filter'] ?? [];
                array_unshift( $charFilters, 'serbian_dj_mapping' );
                $config['analyzer'][$name]['char_filter'] = $charFilters;
            }
        } else {
        // --- END OF CUSTOM LOGIC ---

            // Standard fallback logic for all other languages
            if ( !isset( $config['analyzer']['plain']['filter'] ) ) {
                $config['analyzer']['plain']['filter'] = [];
            }
            $config['analyzer']['plain']['filter'] =
                $this->switchFiltersToICUFoldingPreserve(
                    $config['analyzer']['plain']['filter'], true );
        }
    }

    return $config;
}

// Why should this plan be safe for production?
// Separation of domains: The original text analyzer stays intact, which preserves Trey Jones's work and avoids regressions in the linguistic stemmer.
// Bypassing unicodeSetFilter: We explicitly assign the new serbian_plain_folding definition to the plain analyzer, so it folds every diacritic ("bald" Latin) with no exceptions.
// Use of char_filter over token filter: A token filter that runs after folding would apply the "đ to dj" transformation too late, assuming icu_folding has by then reduced đ to d.
// A char_filter rewrites the raw string before tokenization, so "Novak Đoković" is indexed as "novak djokovic" regardless of token filter order.
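The ordering argument above can be checked with a small sketch. Both functions are hypothetical stand-ins, not Elasticsearch code, and the sketch assumes that unrestricted icu_folding reduces đ to d.

```php
<?php
// Stand-in for unrestricted folding (assumption: đ becomes d).
function foldAllDiacritics( string $s ): string {
    return strtr( $s, [
        'č' => 'c', 'ć' => 'c', 'š' => 's', 'ž' => 'z', 'đ' => 'd',
    ] );
}

// Stand-in for the serbian_dj_mapping char_filter (lowercase only).
function mapDj( string $s ): string {
    return strtr( $s, [ 'đ' => 'dj' ] );
}

$token = mb_strtolower( 'Đoković', 'UTF-8' );

// Mapping first, as a char_filter does: đ is still present.
$mapFirst = foldAllDiacritics( mapDj( $token ) );

// Folding first: đ is already gone, so the mapping has nothing to match.
$foldFirst = mapDj( foldAllDiacritics( $token ) );

echo $mapFirst, "\n";  // djokovic
echo $foldFirst, "\n"; // dokovic
```

Only the map-first order yields "djokovic", which is what a user typing bald Latin with "dj" would search for.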

// Deployment steps
// After this change is saved and applied (merged) to the server, the operator must manually refresh the ES configuration and reindex the data. 
// In the MediaWiki terminal (on the cluster server):

// 1. Sends the new analyzer definition to Elasticsearch (updates index settings and mappings)
// php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki srwiki
// php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki hrwiki
// php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki bswiki
// php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki shwiki

// 2. Forces Elasticsearch to re-read (re-index) all existing text through the new plain analyzer
// php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki srwiki --skipLinks --indexOnSkip