
WikibaseCirrusSearch not indexing Monolingual text and qualifiers
Closed, Duplicate · Public · BUG REPORT

Description

Environment:

  • MediaWiki: 1.44
  • Extensions: CirrusSearch, Wikibase, WikibaseCirrusSearch, FacetedSearch
  • Elasticsearch backend

Problem:

  • Properties with Monolingual text datatype are not indexed (e.g., P22 – Additional Title).
  • Qualifiers are partially missing; main value is indexed, but qualifiers (e.g., Year on P31) are absent.
  • Other datatypes (String, Item, URL) are indexed correctly.

Observed Behavior (API example):

json
{
  "hits": [
    {
      "title": "Q3966",
      "labels": { "en": ["Example Event Series"], "de": ["Beispielserie"] },
      "wbfs_P22": [],        // ❌ Monolingual text missing
      "wbfs_P25": ["https://example.com"], // ✅ OK
      "wbfs_P27": ["Q2257","Q2353"],     // ✅ OK
      "wbfs_P31": ["A*"],     // ⚠️ Qualifier missing
      "text": "Example Event Series\nBeispielserie"
    }
  ]
}

Faceted Search Configuration:

json
{
  "itemTypeProperty": "P5",
  "configPerItemType": {
    "Q521": {
      "facets": {
        "P21": {"type":"list","showAnyFilter":true},
        "P22": {"type":"list","showAnyFilter":true},
        "P27": {"type":"list","showAnyFilter":true,"showNoneFilter":true}
      }
    }
  }
}

Wikibase / Elasticsearch settings:

$wgWBRepoSettings['searchIndexProperties'] = [
  'P21','P22','P23','P24','P27','P30','P31','P34'
];
$wgCirrusSearchIndexUpdates = true;

Expected Behavior:

  • Monolingual text values should appear in wbfs_P22.
  • Qualifiers should be indexed and searchable alongside parent statements.

Additional Notes:

  • updateSearchIndex.php completes without errors.
  • Problem seems specific to Monolingual text and qualifier handling.

According to the official documentation, the Wikibase Cirrus search engine only supports these data types:

  • external identifier
  • string
  • item
  • property
  • lexeme
  • form
  • sense

https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch

Event Timeline

Restricted Application added a subscriber: Aklapper.
daniel added subscribers: Lucas_Werkmeister_WMDE, daniel.

@Lucas_Werkmeister_WMDE do you know who could help here? I don't recall how this works...

Search team, probably… I don’t know much about this beyond the fact that [edit: the monolingual text part is] a known and documented limitation (as noted in T422809#11808303).

I also feel like this should be two tasks, as monolingual text and qualifiers are totally different things to search for (and at least qualifiers would probably need a lot of new data structure to represent them). The first part might be the same as T193858: Make monolingual text datatype indexable in wikibase; the second part could be… T193407: Store wikibase statement qualifiers in cirrus search index, which is resolved?

We'll explore why it behaves like this.

Unfortunately I think this is a known limitation.

Technically speaking, by adding monolingualtext to searchIndexTypes and adding a search-index-data-formatter-callback to VT:monolingualtext we could index these statements (not tested), but I think this would be a bad idea in its current form: such statements might require a tokenizer and an actual full-text field to be useful with haswbstatement.
For instance, P1476 might make no sense to index as a keyword field, since such strings indexed as a whole are unlikely to be useful at search time.
This is the same reason that on Wikidata we explicitly exclude some string properties, such as P3921, which are inherently text that needs to be tokenized to be useful at search time.
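To illustrate the keyword-versus-tokenized point above, here is a minimal sketch in plain Python. It is a toy model and not how CirrusSearch or Elasticsearch is implemented; the title value and both helper functions are hypothetical.

```python
import re

def keyword_index(values):
    """Index each value as one untokenized term (keyword-style)."""
    return set(values)

def tokenized_index(values):
    """Index each value as lowercase word tokens (a very naive analyzer)."""
    tokens = set()
    for value in values:
        tokens.update(re.findall(r"\w+", value.lower()))
    return tokens

# Hypothetical title-like value, standing in for a P1476-style statement.
titles = ["A Brief History of Example Events"]

kw = keyword_index(titles)
tok = tokenized_index(titles)

# A keyword field only matches the complete, exact string.
keyword_hit_partial = "Example Events" in kw
keyword_hit_exact = "A Brief History of Example Events" in kw

# A tokenized field matches when every query word is present.
token_hit = all(t in tok for t in ["example", "events"])
```

In this toy model a user would have to type the entire title, character for character, to get a keyword match, which is why indexing such values as whole strings is unlikely to help at search time.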
For qualifiers, I don't see where we would limit indexing to only one qualifier (example query where a P6262 statement has both P1810 & P407 qualifiers).
The data type of the qualifier must have a search-index-data-formatter-callback, which as far as I can tell we have only for UnboundedQuantityValue, StringValue and EntityIdValue by default for items.
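A rough sketch of why a qualifier can silently drop out of the index, assuming qualifiers are only emitted when their value type has a formatter. Everything here is hypothetical: the registry, the type names, the qualifier properties P40 and P41, the assumption that the "Year" qualifier is a time value, and the `P=value[Q=value]` string shape (modeled on the haswbstatement syntax). It is not the actual WikibaseCirrusSearch code.

```python
# Hypothetical formatter registry keyed by value type; only these
# types can be turned into index strings in this toy model.
FORMATTERS = {
    "string": lambda v: v,
    "wikibase-entityid": lambda v: v,
    "quantity": lambda v: str(v),
}

def statement_keywords(prop, value, qualifiers):
    """Build keyword strings for one statement, skipping qualifiers
    whose value type has no registered formatter."""
    keywords = [f"{prop}={value}"]
    for q_prop, q_type, q_value in qualifiers:
        formatter = FORMATTERS.get(q_type)
        if formatter is None:
            continue  # no formatter for this type: qualifier is not indexed
        keywords.append(f"{prop}={value}[{q_prop}={formatter(q_value)}]")
    return keywords

result = statement_keywords(
    "P31",
    "A*",
    [
        ("P40", "time", "+2020-00-00T00:00:00Z"),  # hypothetical year qualifier
        ("P41", "string", "final"),                # hypothetical string qualifier
    ],
)
# result == ["P31=A*", "P31=A*[P41=final]"]
```

Under these assumptions the main value and the string qualifier are indexed, while the time-valued qualifier is dropped without any error, which would be consistent with updateSearchIndex.php finishing cleanly.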

Supporting monolingual text properly is a big ask and needs careful design. Such strings have to be tokenized to be useful, but (besides the extra index space required) a naive approach would blend the text of all statements into a single field, making it impossible to search for text within a specific statement.
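The blending problem can be sketched with a toy example in plain Python. The statement values and the `tokenize` helper are hypothetical, and the data structures are only an illustration, not a proposal for the actual index layout.

```python
import re

def tokenize(text):
    """Very naive analyzer: lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))

# Hypothetical monolingual text statements on one entity.
statements = {
    "P22": "Winter Gala",
    "P1476": "Summer Festival",
}

# Naive approach: every statement's tokens go into one shared field.
blended = set()
for text in statements.values():
    blended |= tokenize(text)

# Alternative: keep tokens separated per property.
per_property = {prop: tokenize(text) for prop, text in statements.items()}

# The blended field matches, but cannot say which statement matched.
blended_match = "festival" in blended

# Per-property fields can answer "does P22 contain 'festival'?".
p22_match = "festival" in per_property["P22"]
p1476_match = "festival" in per_property["P1476"]
```

Here the blended field reports a hit for "festival" even when the query is meant for P22 only, whereas the per-property structure can tell the two statements apart at the cost of more fields and more index space.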

I'm going to close in favor of T193858, but please feel free to reopen if you think I missed something.

Note that the WMF search team does not support FacetedSearch so please feel free to open a bug report directly there if you think there's an issue there (I don't think they use wikimedia phabricator for ticket tracking?).