
Consider resharding cebwiki_content
Closed, Resolved · Public · Estimated Story Points: 2

Description

A single bot running morelike queries on cebwiki caused the overall p95 morelike latencies to increase:

(screenshot: graph of the p95 morelike latency spike)

which then caused the CirrusSearchMoreLikeLatencyTooHigh alert to flap on and off.

It appears that morelike queries are particularly slow on this wiki (>1s). A likely reason is that the number of docs per shard is relatively high (~3.8mil/shard):

index                                  shard prirep state       docs    store ip            node
cebwiki_content_1728036753             2     p      STARTED  3873640   29.4gb 10.192.32.88  elastic2083-production-search-codfw
cebwiki_content_1728036753             2     r      STARTED  3873640   30.5gb 10.192.16.204 elastic2057-production-search-codfw
cebwiki_content_1728036753             2     r      STARTED  3873640   29.3gb 10.192.48.13  elastic2060-production-search-codfw
cebwiki_content_1728036753             1     r      STARTED  3872253   28.4gb 10.192.0.92   elastic2089-production-search-codfw
cebwiki_content_1728036753             1     p      STARTED  3872253   30.9gb 10.192.48.160 elastic2109-production-search-codfw
cebwiki_content_1728036753             1     r      STARTED  3872253   27.6gb 10.192.16.110 elastic2070-production-search-codfw
cebwiki_content_1728036753             3     r      STARTED  3871061   27.1gb 10.192.0.138  elastic2074-production-search-codfw
cebwiki_content_1728036753             3     p      STARTED  3871061     29gb 10.192.48.89  elastic2107-production-search-codfw
cebwiki_content_1728036753             3     r      STARTED  3871061   27.2gb 10.192.16.232 elastic2095-production-search-codfw
cebwiki_content_1728036753             0     r      STARTED  3871687   29.1gb 10.192.48.179 elastic2086-production-search-codfw
cebwiki_content_1728036753             0     p      STARTED  3871687   28.9gb 10.192.16.228 elastic2092-production-search-codfw
cebwiki_content_1728036753             0     r      STARTED  3871687   27.4gb 10.192.0.206  elastic2076-production-search-codfw

We should try re-sharding this wiki to bring that number down and then assess whether morelike response times on this wiki improve.

AC:

  • re-shard cebwiki_content to bring the number of docs per shard down (<2mil/shard), e.g. by setting the shard count to 8
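For reference, the ~3.8mil docs/shard figure can be recomputed from the `_cat/shards` output above, and the proposed 8-shard layout lands just under the 2mil target. A quick sketch (the heredoc simply replays the four primary-shard rows from the paste):

```shell
# Primary-shard doc counts copied from the _cat/shards output above.
cat <<'EOF' > /tmp/ceb_primaries.txt
2 p 3873640
1 p 3872253
3 p 3871061
0 p 3871687
EOF

# Sum docs across primaries and average per shard.
awk '$2 == "p" { total += $3; n++ }
     END { printf "%d primaries, %d docs, ~%.1fM docs/shard\n",
                  n, total, total / n / 1e6 }' /tmp/ceb_primaries.txt

# The same corpus spread over 8 primaries stays under the 2M/shard target.
awk 'BEGIN { printf "at 8 shards: ~%.2fM docs/shard\n", 15488641 / 8 / 1e6 }'
```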

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1104598 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] cirrussearch: increase shard count for cebwiki_content

https://gerrit.wikimedia.org/r/1104598

Change #1104598 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrussearch: increase shard count for cebwiki_content

https://gerrit.wikimedia.org/r/1104598

Mentioned in SAL (#wikimedia-operations) [2024-12-17T14:05:33Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1099727|rdf-streaming-updater: add wdqs udpater streams in event stream config (T374919)]], [[gerrit:1104598|cirrussearch: increase shard count for cebwiki_content (T379002)]]

Mentioned in SAL (#wikimedia-operations) [2024-12-17T14:14:55Z] <dcausse@deploy2002> dcausse: Backport for [[gerrit:1099727|rdf-streaming-updater: add wdqs udpater streams in event stream config (T374919)]], [[gerrit:1104598|cirrussearch: increase shard count for cebwiki_content (T379002)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-12-17T14:25:41Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1099727|rdf-streaming-updater: add wdqs udpater streams in event stream config (T374919)]], [[gerrit:1104598|cirrussearch: increase shard count for cebwiki_content (T379002)]] (duration: 20m 07s)

Mentioned in SAL (#wikimedia-operations) [2025-03-06T19:09:02Z] <ebernhardson> T379002 start reindex of cirrus cebwiki_content index in eqiad

Mentioned in SAL (#wikimedia-operations) [2025-03-06T19:11:27Z] <ebernhardson> T379002 start reindex of cirrus cebwiki_content index in codfw
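The SAL entries above don't record the exact invocation; CirrusSearch reindexes of this kind are typically driven by the UpdateSearchIndexConfig.php maintenance script, roughly as follows (a sketch, not the literal command used here; the real run may go through wrappers or automation):

```shell
# Hypothetical sketch of a CirrusSearch in-place reindex for one wiki and
# cluster; the actual wrapper and cluster naming may differ.
#
# --indexIdentifier now  builds a fresh timestamped index, which picks up
#                        the new shard count from mediawiki-config
# --reindexAndRemoveOk   copies documents across, swaps the alias, and
#                        removes the old index
mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php \
    --wiki=cebwiki --cluster=eqiad \
    --indexIdentifier now --reindexAndRemoveOk
```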

cloudelastic reindex failed to start:

+-------+------------+--------------------------------------------------------+
| error | rawmessage | wgCirrusSearchOptimizeIndexForExperimentalHighlighter  |
|       |            | is set to true but the 'experimental-highligh...       |
+-------+------------+--------------------------------------------------------+

This is likely a mixed-cluster problem, since we renamed experimental-highlighter to cirrus-highlighter for OpenSearch. Specifically, when checking which plugins exist in the cluster we go through each node and only report the plugins that are installed on all nodes; the alias that lets code work with either name is applied later in the process. We could update the code to handle this, but should we? It's not clear we should do much to encourage running in a mixed-cluster mode, but maybe it would be useful for this to still work there.
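The failure mode comes down to the set intersection that the check effectively performs. A small illustration with hypothetical two-node plugin lists (`comm -12` prints only the lines common to both sorted inputs):

```shell
# During the migration one node still reports the old plugin name while
# another reports the new one (hypothetical two-node cluster).
printf '%s\n' analysis-icu experimental-highlighter | sort > /tmp/node_es.txt
printf '%s\n' analysis-icu cirrus-highlighter      | sort > /tmp/node_os.txt

# The cluster-wide plugin list is the intersection across all nodes, so
# neither highlighter name survives -- and the config check fails.
comm -12 /tmp/node_es.txt /tmp/node_os.txt   # prints only: analysis-icu
```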

eqiad and codfw have finished reindexing. We can't consider this done until cloudelastic has been reindexed as well, though. That has to wait until cloudelastic has finished migrating to OpenSearch 1.3 (and will, I suppose, be at least a small test of the reindexing automation under OpenSearch).

Mentioned in SAL (#wikimedia-operations) [2025-03-24T17:51:01Z] <ebernhardson> T379002 Start reindex of cebwiki search indices in cloudelastic

The cloudelastic migration to OpenSearch has finished and the reindex is now running. This is also a good verification that our reindexing process still works as expected on OpenSearch.