Page MenuHomePhabricator

Completion index fails to build on labswiki: MessageCache.php: Process cache for 'en' should be set by now
Closed, ResolvedPublic3 Estimated Story Points

Description

Seen on mwmaint2002 logs (e.g. /var/log/mediawiki/mediawiki_job_cirrus_build_completion_indices_codfw/syslog)

Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki Fetching Elasticsearch version...6.5.4...ok
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki LogicException from line 413 of /srv/mediawiki/php-1.37.0-wmf.16/includes/cache/MessageCache.php: Process cache for 'en' should be set by now.
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #0 /srv/mediawiki/php-1.37.0-wmf.16/includes/cache/MessageCache.php(1115): MessageCache->load('en')
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #1 /srv/mediawiki/php-1.37.0-wmf.16/includes/cache/MessageCache.php(1044): MessageCache->getMsgFromNamespace('Cirrussearch-bo...', 'en')
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #2 /srv/mediawiki/php-1.37.0-wmf.16/includes/cache/MessageCache.php(1014): MessageCache->getMessageForLang(Object(LanguageEn), 'cirrussearch-bo...', true, Array)
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #3 /srv/mediawiki/php-1.37.0-wmf.16/includes/cache/MessageCache.php(956): MessageCache->getMessageFromFallbackChain(Object(LanguageEn), 'cirrussearch-bo...', true)
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #4 /srv/mediawiki/php-1.37.0-wmf.16/includes/language/Message.php(1398): MessageCache->get('cirrussearch-bo...', true, Object(LanguageEn))
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #5 /srv/mediawiki/php-1.37.0-wmf.16/includes/language/Message.php(1080): Message->fetchMessage()
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #6 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/includes/Util.php(319): Message->isDisabled()
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #7 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/objectcache/wancache/WANObjectCache.php(1713): CirrusSearch\Util::CirrusSearch\{closure}(false, 600, Array, NULL, Array)
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #8 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/objectcache/wancache/WANObjectCache.php(1541): WANObjectCache->fetchOrRegenerate('global:cirrusse...', 600, Object(Closure), Array, Array)
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #9 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/includes/Util.php(325): WANObjectCache->getWithSetCallback('global:cirrusse...', 600, Object(Closure))
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #10 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/includes/Util.php(282): CirrusSearch\Util::getOnWikiBoostTemplates(Object(CirrusSearch\SearchConfig))
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #11 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/includes/BuildDocument/Completion/QualityScore.php(63): CirrusSearch\Util::getDefaultBoostTemplates()
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #12 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/includes/BuildDocument/Completion/SuggestScoringMethodFactory.php(23): CirrusSearch\BuildDocument\Completion\QualityScore->__construct()
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #13 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/includes/BuildDocument/Completion/SuggestBuilder.php(118): CirrusSearch\BuildDocument\Completion\SuggestScoringMethodFactory::getScoringMethod('popqual')
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #14 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php(222): CirrusSearch\BuildDocument\Completion\SuggestBuilder::create(Object(CirrusSearch\Connection), 'popqual', 'labswiki')
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #15 /srv/mediawiki/php-1.37.0-wmf.16/maintenance/doMaintenance.php(108): CirrusSearch\Maintenance\UpdateSuggesterIndex->execute()
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #16 /srv/mediawiki/php-1.37.0-wmf.16/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php(785): require_once('/srv/mediawiki/...')
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #17 /srv/mediawiki/multiversion/MWScript.php(116): require_once('/srv/mediawiki/...')
Aug  3 04:56:21 mwmaint2002 mediawiki_job_cirrus_build_completion_indices_codfw[119149]: labswiki #18 {main}

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

To reproduce on mwmaint2002:
mwscript extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php --wiki=labtestwiki --indexChunkSize 3000 --cluster=codfw --optimize

EBernhardson triaged this task as Medium priority.

Verified currently ongoing, we get two logs a day every day. This means autocomplete on wikitech isn't being updated.

  • This is running on mwmaint* machines, do those even have proper access to labswiki?
  • First failure is July 22 ~04:00, same failure is repeated against eqiad and codfw clusters.
  • labswiki is group1
  • 1.37.0-wmf.15 deployed to group 1 july 21 ~13:00. Seems plausible something in that deployment was the trigger
  • No interesting wmf-config changes between the two runs
  • The related ticket has some end users reporting that their underlying error was related to caching. With labswiki having special configuration it seems plausible something similar is happening here. Indeed the code check that throws this error amounts to $cache->hasField( $lang, 'VERSION' ).

In a test with shell.php fetching the message a second time works:

ebernhardson@mwdebug1002:~/mw-phpdbg$ mwscript shell.php --wiki=labswiki
Psy Shell v0.10.5 (PHP 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf1 — cli) by Justin Hileman
>>> wfMessage( 'ok' )->text()
LogicException with message 'Process cache for 'en' should be set by now.'
>>> wfMessage( 'ok' )->text()
=> "OK"
  • Turns out the first execution tries to load from DB, but fails because no cache is available. On the second attempt within the same execution the db has been marked disabled and no attempt is made to load from the database. Instead it falls back to the CDB files.
  • MessageCache::load() is what is used to load from DB. This first trys to read from cache, fails because no cache is available, then try's to use the cache to aquire a lock. Again it can't aquire the lock because the cache is unavailable.

Verifying 11212 isn't here. verified same on mwmaint[12]002:

ebernhardson@mwdebug1002:~$ netstat -ntpl | grep 11213
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:11213           0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::11213                :::*                    LISTEN      -                   
ebernhardson@mwdebug1002:~$ netstat -ntpl | grep 11212
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)

It's not clear yet what exactly changed, but the thread to pull on revolves around MessageCache and the labswiki specific memcached not being available from mwmaint*

WaitConditionLoop has a bug, repro:

>>> (new Wikimedia\WaitConditionLoop(function () { 
       return Wikimedia\WaitConditionLoop::CONDITION_ABORTED; 
    }, 0))->invoke() === Wikimedia\WaitConditionLoop::CONDITION_REACHED
=> true

In wmf.14 this bug allowed the MessageCache to fail, but continue anyways. The MessageCache would disable itself and fallback to LocalisationCache which reads from CDB files. In wmf.15 the intermediary lock/unlock in BagOStuff was refactored. This refactor avoided the bug above by following a slightly different path to deciding if the lock was actually locked. With the lock correctly reporting failure in wmf.15 the maintenance scripts are no longer running.

That leaves the remaining question, how do we fix this? A couple options:

  • We could deploy the appropriate nutcracker configs to mwmaint* hosts. Are there reasons to not do this, perhaps isolation of wikitech?
  • We could deploy the maintenance scripts to labsweb* hosts and run them on the web hosts.
  • MessageCache seems to work acceptably after it marks the db as failed. I'm assuming this means it's operating in a degraded mode where it only reads from cdb and doesn't have on-wiki message overrides. Is that acceptable for maintenance scripts?

Overall this task seems to have expanded a good bit from the 3 assigned earlier today. Fixing this wont be as simple as reverting some patch, or deploying a short fix. Will have to ponder if there are simpler fixes that could be applied, or if we need to disable completion suggester (falling back to strict prefix) on wikitech until a resolution is found.

Change 713337 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/libs/WaitConditionLoop@master] Correctly report failure with no timeout

https://gerrit.wikimedia.org/r/713337

Attached patch addresses the above identified problem with WaitConditionLoop. Fixing this won't resolve this ticket though, it's just cleaning up found issues.

Change 713340 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Hack around i18n cache failure for wikitech

https://gerrit.wikimedia.org/r/713340

Change 713340 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Hack around i18n cache failure for wikitech

https://gerrit.wikimedia.org/r/713340

Change 713508 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.37.0-wmf.19] Hack around i18n cache failure for wikitech

https://gerrit.wikimedia.org/r/713508

Change 713508 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.37.0-wmf.19] Hack around i18n cache failure for wikitech

https://gerrit.wikimedia.org/r/713508

Mentioned in SAL (#wikimedia-operations) [2021-08-17T23:32:08Z] <ebernhardson@deploy1002> Synchronized php-1.37.0-wmf.19/extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: T288233: Work around cache failure for wikitech (duration: 01m 28s)

Wikitech is now running wmf.19. I've manually run the same index build that is triggered daily and verified wikitech completions are now up to date.

I'm not sure that we should call this ticket resolved though. The fix allows the index builds to complete, but all it really does is hide the error and return to how it was operating prior to wmf.15. The root problem to deal with is that the index rebuild currently runs on a machine that isn't able to run mediawiki against wikitech without errors.

I'm not sure that we should call this ticket resolved though. The fix allows the index builds to complete, but all it really does is hide the error and return to how it was operating prior to wmf.15. The root problem to deal with is that the index rebuild currently runs on a machine that isn't able to run mediawiki against wikitech without errors.

One resolution for that problem would be T237773: Move Wikitech onto the production MW cluster, which in theory should be possible relatively soon if we can get consensus from the SRE folks to allow T237889: Install php-ldap on all MW appservers. The other main blocker of T167973: Move database for wikitech (labswiki) to a main cluster section is expected to be resolved in September following the DC switch back to eqiad.

Change 713337 merged by jenkins-bot:

[mediawiki/libs/WaitConditionLoop@master] Correctly report failure with no timeout

https://gerrit.wikimedia.org/r/713337

One resolution for that problem would be T237773: Move Wikitech onto the production MW cluster, which in theory should be possible relatively soon if we can get consensus from the SRE folks to allow T237889: Install php-ldap on all MW appservers. The other main blocker of T167973: Move database for wikitech (labswiki) to a main cluster section is expected to be resolved in September following the DC switch back to eqiad.

Indeed if wikitech will become a typical wiki then this should be a moot point. Seems reasonable to close this, and cleanup the hack once that is complete.