Page MenuHomePhabricator

Use ElasticSearch for bulk Wikidata entity term lookup
Closed, DeclinedPublic

Description

This ticket serves as the main placeholder for the functionality for looking up entity terms from elastic search rather than SQL.
@Addshore wrote this shortly before going on vacation, and this ticket likely needs a bit more love.

Why

There are many use cases that require an entities terms or multiple entities terms (labels descriptions and aliases) to be known.
These are currently looked up from the wb_terms table, rather than loading the whole entity JSON.

Currently in SQL these lockups are batched by # of entities, with batch sizes of 9
https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/128707d2aaa9ba2fadd384af6d7033c43070e69a/lib/includes/Store/Sql/TermSqlIndex.php#L611-L613

Elasticsearch allows faster lookups for complex cases compared to SQL, when many languages are requested (for fallback) or hundreds of entities are requested (large lookups).

The ongoing modifying the terms storage in SQL will also likely make the current performance for these lookups decrease slightly.

Rate

Tracking was added to see how many of these bulk lookups occur
https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/128707d2aaa9ba2fadd384af6d7033c43070e69a/lib/includes/Store/Sql/TermSqlIndex.php#L616-L618
The data can be seen on https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?refresh=30s&orgId=1
This indicates anywhere between 150k and 350k bulk lookup requests per minute.

How

Elastic search uses the pageid as the document ID for storage, Wikibase can do a single sql lookup from entity ID -> page id to allow for efficient document fetching in elastic search rather than a search using an entity id as a search term.
According to Discovery this could also mean that a large increase in hit rate for these lookups to elastic search would be fine due to the document id / page id dictating where the record is (needs testing)

Examples of lookup by entity id and by page id can be seen in P8373

Dealing with stale data

Elastic search can be out of date due to maintenance or slow running jobs.
Our lookup needs to be up to date.
A fallback and check may be needed

  • Lookup in ES, and get the revid?
  • For entities that have newer revisions fallback to sql storage and perform the lookup there

Possible rollout

It would make the most sense to start with the larger more complex queries, such as lookups that require over 5 languages, or over 10 entities.

Event Timeline

A couple of comments from IRC so we don't forget them in the coming months:

3:16 PM <+gehel> addshore: I don't have any constructive thoughts. The 150K req/minute is somewhat scary in itself, but who knows

The "Possible rollout" section of the description is relevant here.
We don't have exact numbers, but using elastic search for the slowest / largest lookups would likely have the biggest gains for us, taking load off the sql servers, while also speeding up the lookups and not causing too much stress on ES.

The type of lookup that we will be doing may also cause the rate to be more or less scary.
From discussions a month or so ago in IRC someone said that as we can lookup by pageid / document id the rate of this type of request could perhaps be much higher than the number of searches hitting the cluster without causing any problems, but this need confirming somehow... (this was something to do with the pageid / doc id determining where the document would be stored so it can be directly retrieved rather than hitting the whole cluster)

3:17 PM <+gehel> addshore: we probably need to take this into account in term of capacity planning, and the new fiscal year is getting close!

This is what originally triggered the creation of this ticket so that is is visible outside of the wikidata team and discussion that have happened in the discovery IRC channel.

As a very rough comparison, i pulled sum(irate(elasticsearch_indices_search_query_total[5m])) from prometheus, which gives 5 min averages for total shard queries executed per second across the cluster as 5 minute averages. We vary between about 12k and 21k shard queries per second, or about 840k to 1.25M per minute.

Note though that the 350k peaks from wikidata are likely not going to translate into 350k shard queries per minute, but as much as 7M (350k queries * 21 shards) shard queries per minute depending on how many shards have to be touched. The number of shards touched will likely average to something like min(num_shards, num_ids) on a per request basis. Also note that while wikidata has 21 shards today, it will likely have to transition to 40 shards soon as the shards are already getting too big. This could plausibly be improved if we defined custom routing keys for wikidata that make it much more likely that documents requested together reside on the same shard, but CirrusSearch does not currently support custom routing keys.

Will certainly need some load testing to determine if this is even plausible.

pinging @Fjalapeno from your comments the other day I understand Wikidata is going to use cassandra for these use cases at the end? cc @Addshore

pinging @Fjalapeno from your comments the other day I understand Wikidata is going to use cassandra for these use cases at the end? cc @Addshore

If that is the case, then perhaps we should start investigating in that direction.
Are there any docs / current open phab tickets etc that have talked about this before?
Does this basically just refer to T100705 ?

@Addshore @Nuria we have potential work scheduled later in the year based on conversations with @WMDE-leszek and @RazShuty

My understanding is that this work provides some runway, but ultimately a different type of storage solution is desired. However, I think they will be able to better answer any questions

CBogen triaged this task as Medium priority.Aug 27 2020, 8:25 PM

I'd suggest adding replica shards (copies of primary shards) that help to both ensure redundancy to protect against failure, but they also vastly increase the capacity for read requests such as searching, like Adam's entity term lookup use case. You can change the number of replica shards at any time without affecting indexing or query operations. https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html

We could still want this in the future, but right now the other changes that we made in the past years are more than sufficient.

@Addshore That's what I figured. :-) This issue did feel old and sort of in a dustbin. Agree it should be closed.