Page MenuHomePhabricator

Loading or changing language preferences on Wikimedia Commons extremely slow (over 10s consistently)
Closed, DuplicatePublic

Description

  1. Logged-in user on https://commons.wikimedia.org with initial preference of British English (en-gb). Using Firefox on macOS.
  2. Open ULS selector from personal tools on the top right.
  3. Select anything (e.g. Deutsch).

Expected result:

The page reloads in German.

Actual result:

Nothing happens for a good 10-20 seconds, during which an AJAX request is busy. It is a POST request to /w/api.php. After this finishes, eventually, the page reloads.

Event Timeline

I can reproduce the issue over WikimediaDebug:

I've thusfar been unable to inspect the request in Firefox for some reason, but based on a breakpoint I think it must be one of these there request:

It's non-trivial to find out since the Logstash entries even with Verbose logging, do not record the API post payload, nor which API module is handling the request.

TODO: Consider letting the mwlog1001/api.log-style entries go to Logstash at least with severity=debug for when WikimediaDebug is used.
TODO: Consider having the API log with wfDebug or other debug message which API module(s) are handling the request.

And while we have mwlog1001/api.php, it is taking at least 20-30 minutes to get a result from fgrep over that file for a request ID from this time of day (toward the end of rollover).

Also, even after 20 minutes of fgrep'ing, there were no results found. It wasn't logged it seems?

Where the time is spent

There are a lot of rather odd and unexpected things in the Logstash entries so far. In no particular order:

  • There are 340 entries under gitinfo, informing us:
    • Loaded git data from cache for /srv/mediawiki/php-1.37.0-wmf.1.
    • Candidate cacheFile=/srv/mediawiki/php-1.37.0-wmf.1/cache/gitinfo/info.json for /srv/mediawiki/php-1.37.0-wmf.1
    • Is this opening the same file from disk 300 times?
    • Why is GitInfo being consulted?
  • There are 229 entries under ExternalStore, in the style:
    • ExternalStoreDB::fetchBlob cache miss on cluster25/12801096@commonswiki
    • ExternalStoreDB::fetchBlob cache miss on cluster27/65048413@commonswiki
    • ExternalStoreDB::fetchBlob cache miss on cluster25/12801096@commonswiki
    • etc.
    • What are these hundreds of ExternalStore fetches for?
  • There are 255 entries under GlobalTitleFail telling us that:
    • MessageCache::parse called with no title set.
    • That's a lot of Parser invocations.
  • There are 10,107 entries under memcached. A few highlights:
      • 1,688x says getWithSetCallback(commonswiki:pagetranslation:sourcepages): process cache hit.
      • 458x match fetchOrRegenerate(global:SqlBlobStore-blob:commonswiki:tt*: miss, new value computed (combined entries for different revison IDs)
      • 196x says getMulti(WANCache:global:revision-row-1.29:commonswiki:11357193:104218666|#|v) which is Template:Gadget-desc.
    • 140x says Rejected set() for {cachekey} due to snapshot lag (late regeneration).
  • There are 74 entries under UserOptionsManager,
    • 72x says Loading options from override cache from UserOptionsManager::loadUserOptions($user)
    • 2x says Loading options from database
  • Parser, ContentHandler, and interwiki channels:
    • ParserFactory: using default preprocessor
    • Registered handler for Scribunto: ScribuntoContentHandler
    • Parser::setOutputFlag: set user-signature flag on 'Special:Badtitle/dummy title for API calls set in api.php'; User signature detected
    • MediaWiki\Interwiki\ClassicInterwikiLookup::getInterwikiCacheEntry( mediawikiwiki )
    • MediaWiki\Interwiki\ClassicInterwikiLookup::getInterwikiCacheEntry( en )
    • MediaWiki\Interwiki\ClassicInterwikiLookup::getInterwikiCacheEntry( tools )

Rough analysis

The reason for causing so many Parser invocations (or rather, why at all) is presumably due to PreferencesFactory registry being based on HTMLForm labels, and HTMLForm in turn requires an associative array when creating checkboxes (for uniqueness?) where the array key is the label HTML. Thus this means every gadget checkbox has to parse the description in full in order to register the gadget preferences. This mess has been known for years, tracked at T58633: Preference retrieval should not require so much parsing.

These Parser invocations result in hundreds of unbatched SqlBlobStore memcached fetches (that's 500 times 1ms), which in turn result in 200 ExternalStoreDB queries over half a dozen ES shard connections.

The Rejected set() for {cachekey} due to snapshot lag messages explain why this super slow response can happen consistently for users because the majority of the request happens after the lag tolerance has ellapsed (5-6 seconds; in total: 10-20 seconds). Which means the work that happens there can't be safely cached under our current WANCache rules (there actually is a short interim cache to avoid stampedes, but for most users that means a cache miss since this isn't done many times a minute globally).

Open questions / Apparent defects

  • Why is GitInfo being consulted in this API request? (And why 340 times?)
  • Why is UserOptionsManager repeatedly loading the same options nearly a hundred times over?
  • Why is parsing these gadget labels so slow?
  • Why is the Translate extension making 1,688 roundtrips to Memcached for the same key? That's probably ~2 seconds by itself. (Nvm, these are process cached)
  • Can we resource fixing of T58633?
  • Why is the Translate extension making 1,688 roundtrips to Memcached for the same key? (that's probably ~2 seconds by itself).

Seems like this might be due to Parser -> fetchTranslatableTemplateAndTitle -> TranslatablePage::isTranslationPage which a quick git-blame attributes to a recent-ish refactor change 573450 for T47096.

  • Why is the Translate extension making 1,688 roundtrips to Memcached for the same key? (that's probably ~2 seconds by itself).

Seems like this might be due to Parser -> fetchTranslatableTemplateAndTitle -> TranslatablePage::isTranslationPage which a quick git-blame attributes to a recent-ish refactor change 573450 for T47096.

I'm confused since the earlier message says 1,688x says getWithSetCallback(commonswiki:pagetranslation:sourcepages): process cache hit. Is it a process cache hit or not?

[..] a quick git-blame attributes to a recent-ish refactor change 573450 for T47096.

I'm confused since the earlier message says 1,688x says getWithSetCallback(commonswiki:pagetranslation:sourcepages): process cache hit. Is it a process cache hit or not?

You're right. They're process-hits.