Suspicious mismatch between psi and omega elastic cluster
Closed, Resolved · Public

Description

Building completion indices may fail with:

dcausse@mwmaint1002:~$ /usr/local/bin/mwscript extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php --wiki=grwikimedia --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize 3000 --cluster=codfw --optimize
Scanning available plugins...
        analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
        analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
        extra-analysis-serbian, extra-analysis-slovak, ltr
Picking analyzer...greek
Fetching Elasticsearch version...6.5.4...ok
2020-06-03 11:26:40 Deleting broken index grwikimedia_titlesuggest_1591172540
2020-06-03 11:26:40 Deleting broken index grwikimedia_titlesuggest_1591172793
Inferring index identifier...grwikimedia_titlesuggest_first
Index does not exist yet cannot recycle.
Inferring index identifier...grwikimedia_titlesuggest_first
Setting index identifier...grwikimedia_titlesuggest_1591172801
2020-06-03 11:26:41 Waiting for the index to go green...
        Green!2020-06-03 11:26:41
Unexpected Elasticsearch failure.
Elasticsearch failed in an unexpected way.  This is always a bug in CirrusSearch.
Error type: Elastica\Exception\ResponseException
Message: index_not_found_exception: no such index
Trace:
#0 /srv/mediawiki/php-1.35.0-wmf.34/vendor/ruflin/elastica/lib/Elastica/Request.php(194): Elastica\Transport\Http->exec(Object(Elastica\Request), Array)
#1 /srv/mediawiki/php-1.35.0-wmf.34/vendor/ruflin/elastica/lib/Elastica/Client.php(689): Elastica\Request->send()
#2 /srv/mediawiki/php-1.35.0-wmf.34/vendor/ruflin/elastica/lib/Elastica/Search.php(463): Elastica\Client->request('grwikimedia_con...', 'GET', Array, Array)
#3 /srv/mediawiki/php-1.35.0-wmf.34/vendor/ruflin/elastica/lib/Elastica/Scroll.php(131): Elastica\Search->search()
#4 /srv/mediawiki/php-1.35.0-wmf.34/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php(547): Elastica\Scroll->rewind()
#5 /srv/mediawiki/php-1.35.0-wmf.34/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php(307): CirrusSearch\Maintenance\UpdateSuggesterIndex->indexData()
#6 /srv/mediawiki/php-1.35.0-wmf.34/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php(226): CirrusSearch\Maintenance\UpdateSuggesterIndex->rebuild()
#7 /srv/mediawiki/php-1.35.0-wmf.34/maintenance/doMaintenance.php(105): CirrusSearch\Maintenance\UpdateSuggesterIndex->execute()
#8 /srv/mediawiki/php-1.35.0-wmf.34/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php(784): require_once('/srv/mediawiki/...')
#9 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#10 {main}

The problem is that the main indices are in omega while they should be in psi.
Creating these indices (/usr/local/bin/mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=grwikimedia --cluster=eqiad) fixes the problem with UpdateSuggesterIndex.

Note that this problem is triggering the alert "CRITICAL: Status of the systemd unit mediawiki_job_cirrus_build_completion_indices_codfw".

Event Timeline

Restricted Application added subscribers: Petar.petkovic, Base, Aklapper.

Looking at the log files of the completion indices rebuild, this does seem to be the sole instance. I've kept the grwikimedia_* indices in omega for further investigation.

dcausse moved this task from needs triage to Ops / SRE on the Discovery-Search board.

If this problem does not occur again, I'd be tempted to say that it was a human mistake.

Gehel claimed this task.
Gehel subscribed.

We haven't seen the problem again, so let's assume it was a human mistake.

This happened again with the new lldwiki_content index, which showed up on the 9443 and 9643 clusters.

We have a variety of new wikis created in the wrong place. @dcausse tracked down the likely cause to the addWiki.php script. Per the documentation, this script is run using a dummy wikiid that selects the correct SQL shard. In essence, when the CirrusSearch maintenance scripts are run to create the new indices, they run in the context of the dummy wiki and not in the context of the newly created wiki. Perhaps our internal cluster sharding should have been based on the indexBaseName instead of the wikiid, but changing it now would take some effort.
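The context mix-up described above can be sketched roughly as follows. This is only an illustration, not the CirrusSearch implementation: the group names are borrowed from the task title, while `dummywiki`, `newwiki`, the assignment table, and every function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed (invented) table mapping a wiki ID to its intended cluster group.
CLUSTER_GROUP_BY_WIKI = {
    "dummywiki": "omega",
    "newwiki": "psi",
}


@dataclass
class Context:
    """Stands in for the wiki a maintenance script is running as."""
    wiki_id: str


def group_for_context(ctx: Context) -> str:
    """Pick the cluster group from the wiki the process runs as."""
    return CLUSTER_GROUP_BY_WIKI[ctx.wiki_id]


def create_indices(index_base_name: str, ctx: Context) -> Tuple[str, str]:
    """Return the (group, index name) pair that would be created."""
    return group_for_context(ctx), f"{index_base_name}_content"


# Suspected buggy flow: indices for the new wiki are created while the
# process still runs in the context of the dummy wiki.
buggy = create_indices("newwiki", Context("dummywiki"))

# Intended flow: the same call made in the context of the new wiki.
fixed = create_indices("newwiki", Context("newwiki"))

print(buggy)
print(fixed)
```

In this toy model the index name is the same in both cases; only the group it lands in differs, which matches the symptom of correctly named indices sitting on the wrong cluster.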

This is plausibly the root cause of T240778 and other addWiki-related problems we've seen over the last 12 to 18 months.

The simplest and most direct route to solving this problem is to run the appropriate Cirrus scripts in the context of the new wiki. The cluster assignment isn't the only thing that's currently wrong: wikis are also being created with the wrong analysis chains because the scripts run in the language of the dummy wiki. This is less of an issue, as new wikis are often (but not always) in languages we don't have any customization for.

Change 627912 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/WikimediaMaintenance@master] Remove CirrusSearch initialization from addWiki.php

https://gerrit.wikimedia.org/r/627912

Change 627913 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] Simplify creating indices on all writable clusters

https://gerrit.wikimedia.org/r/627913

Once the above is merged, we will need to update the wikitech documentation for addWiki.php to specify the extra maintenance script to run.

Change 627912 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Remove CirrusSearch initialization from addWiki.php

https://gerrit.wikimedia.org/r/627912

Change 627913 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Simplify creating indices on all writable clusters

https://gerrit.wikimedia.org/r/627913

I've updated the "Search" and "Add a wiki" pages with the latest changes. @Urbanecm @Ladsgroup Would either of you like to review the updated docs? It's a fairly minimal change, but it would be good to clarify any open questions in the docs.

Thanks @EBernhardson. So, when I create a wiki, only mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=newwiki --cluster=all needs to be run, and the script will do the job automagically, is that right?

@Urbanecm Correct, it should only be that single script invocation after addWiki.php has completed.

Thanks, seems good. Added to the tracking script too.

Once we create the next wiki (T262812), should we ping you here, so you can double-check?

I was going to ask about the tracking script, but I see Martin already did it. Thanks! I think this is done.

Please do. I'll double-check that everything ended up in the right place.

I have created arbcom_ruwiki now, and ran mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=arbcom_ruwiki --cluster=all at the end.

Everything looks to be in order. Indices were created on the correct prod clusters, and no indices were created in the public cloudelastic cluster (since this is a private wiki).
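For future wiki creations, a placement check like the one above could be scripted along these lines. This is a minimal sketch that assumes the per-cluster index listings have already been collected by some other means. The function name, the wiki `examplewiki`, and the sample listings are all invented for illustration; only the cluster names come from this task.

```python
from typing import Dict, List, Set, Tuple


def check_placement(
    observed: Dict[str, Set[str]],
    expected_clusters: Set[str],
    wiki: str,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Compare where a wiki's indices are with where they should be.

    Returns (unexpected, missing):
      unexpected: cluster -> indices of this wiki found on a cluster
                  that is not in expected_clusters
      missing:    expected clusters holding no index for this wiki
    """
    prefix = wiki + "_"
    unexpected: Dict[str, List[str]] = {}
    for cluster, names in observed.items():
        found = sorted(n for n in names if n.startswith(prefix))
        if found and cluster not in expected_clusters:
            unexpected[cluster] = found
    missing = [
        cluster
        for cluster in sorted(expected_clusters)
        if not any(n.startswith(prefix) for n in observed.get(cluster, set()))
    ]
    return unexpected, missing


# Invented listing: the wiki's indices ended up on omega instead of psi.
observed = {
    "psi": {"otherwiki_content"},
    "omega": {"examplewiki_content", "examplewiki_general"},
    "cloudelastic": set(),
}

# A private wiki is expected on psi only, so cloudelastic is left out.
unexpected, missing = check_placement(observed, {"psi"}, "examplewiki")
print("unexpected:", unexpected)
print("missing:", missing)
```

Leaving a cluster out of `expected_clusters` is how this sketch expresses the private-wiki rule: any index of that wiki found there is reported as unexpected.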