Page MenuHomePhabricator

Loading or saving preferences is taking too long on Commons
Closed, ResolvedPublic

Description

When trying to save options on Commons using the API at https://commons.wikimedia.org/w/api.php, it takes extremely long, namely between 15 and 45 seconds. When the period is above than the timeout period (30 sec), the request results in an error due to timeout.

Other requests that I observed get the response quickly, as well as the same request on other wiki sites that I checked.

An example code that I executed:

new mw.Api().post({
  action: 'options',
  assert: 'user',
  assertuser: 'Jack who built the house',
  optionname: 'userjs-convenientDiscussions-localSettings',
  optionvalue: '{"insertButtons":[["{{Done}}."],["{{Not done}}."]],"signaturePrefix":" ","haveInsertButtonsBeenAltered":true}',
  token: '<token here>',
});

The same applies to action: "globalpreferences".

Occasionally, options are not saved at all, and this error is returned by the API after a long waiting period:

{
  error: {
    code: "internal_api_error_DBQueryError"
    errorclass: "Wikimedia\Rdbms\DBQueryError"
    info: "[702876a1-db5f-4557-ad80-31597d47e445] Caught exception of type Wikimedia\Rdbms\DBQueryError"
    servedby: "mw1312"
  }
}

Update: the same issue appears when I try to save user preferences using the standard interface at https://commons.wikimedia.org/wiki/Special:Preferences.

Event Timeline

Jack_who_built_the_house renamed this task from Saving options is taking too long on Commons to Saving options using API is taking too long on Commons.Mar 28 2021, 3:41 PM
Jack_who_built_the_house updated the task description. (Show Details)

Hm, I think the same issue appears when I try to save user preferences using the standard interface at https://commons.wikimedia.org/wiki/Special:Preferences.

Jack_who_built_the_house renamed this task from Saving options using API is taking too long on Commons to Saving options is taking too long on Commons.Mar 28 2021, 3:44 PM
Jack_who_built_the_house updated the task description. (Show Details)

Does this task need a DBA to find the conflicting transaction / database query?

Does the exception id refers the same stack trace? If yes, this should be tagged as wikimedia production error

It looks like, since las Fri (26 March), a whole bunch of these (Lock wait timeout exceeded; try restarting transaction) started to happen.

There seem to be 2 paths triggering these - given how they share the same symptom, wiki (commonswiki), db host (10.64.48.124) & table (users), I suspect they're related (but they might not be):


Query 1 (via User::loadFromDatabase)

SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id  FROM `user` JOIN `actor` `user_actor` ON ((user_actor.actor_user = user_id))   WHERE user_id = <blanked>  LIMIT 1   FOR UPDATE

Stack trace 1

from /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1700)
#0 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1678): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#1 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1244): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1929): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(2029): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/DBConnRef.php(331): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#6 /srv/mediawiki/php-1.36.0-wmf.36/includes/user/User.php(1331): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#7 /srv/mediawiki/php-1.36.0-wmf.36/includes/user/User.php(457): User->loadFromDatabase(integer)
#8 /srv/mediawiki/php-1.36.0-wmf.36/includes/user/User.php(4480): User->loadFromId(integer)
#9 /srv/mediawiki/php-1.36.0-wmf.36/includes/api/ApiOptions.php(165): User->getInstanceForUpdate()
#10 /srv/mediawiki/php-1.36.0-wmf.36/includes/api/ApiOptions.php(40): ApiOptions->getUserForUpdates()
#11 /srv/mediawiki/php-1.36.0-wmf.36/includes/api/ApiMain.php(1647): ApiOptions->execute()
#12 /srv/mediawiki/php-1.36.0-wmf.36/includes/api/ApiMain.php(617): ApiMain->executeAction()
#13 /srv/mediawiki/php-1.36.0-wmf.36/includes/api/ApiMain.php(588): ApiMain->executeActionWithErrorHandling()
#14 /srv/mediawiki/php-1.36.0-wmf.36/api.php(90): ApiMain->execute()
#15 /srv/mediawiki/php-1.36.0-wmf.36/api.php(45): wfApiMain()
#16 /srv/mediawiki/w/api.php(3): require(string)
#17 {main}

Query 2 (via UserEditCountUpdate::doUpdate)

UPDATE  `user` SET user_editcount=user_editcount+1 WHERE user_id = <blanked> AND (user_editcount IS NOT NULL)

Stack trace

from /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1700)
#0 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1678): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#1 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(1244): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/Database.php(2479): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->update(string, array, string, string)
#4 /srv/mediawiki/php-1.36.0-wmf.36/includes/libs/rdbms/database/DBConnRef.php(375): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#5 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/UserEditCountUpdate.php(80): Wikimedia\Rdbms\DBConnRef->update(string, array, array, string)
#6 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/AutoCommitUpdate.php(44): UserEditCountUpdate->{closure}(Wikimedia\Rdbms\DBConnRef, string)
#7 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/UserEditCountUpdate.php(112): AutoCommitUpdate->doUpdate()
#8 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/DeferredUpdates.php(513): UserEditCountUpdate->doUpdate()
#9 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/DeferredUpdates.php(390): DeferredUpdates::attemptUpdate(UserEditCountUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#10 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/DeferredUpdates.php(221): DeferredUpdates::run(UserEditCountUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#11 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/DeferredUpdatesScope.php(267): DeferredUpdates::{closure}(UserEditCountUpdate, integer)
#12 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/DeferredUpdatesScope.php(196): DeferredUpdatesScope->processStageQueue(integer, integer, Closure)
#13 /srv/mediawiki/php-1.36.0-wmf.36/includes/deferred/DeferredUpdates.php(242): DeferredUpdatesScope->processUpdates(integer, Closure)
#14 /srv/mediawiki/php-1.36.0-wmf.36/includes/MediaWiki.php(1097): DeferredUpdates::doUpdates(string)
#15 /srv/mediawiki/php-1.36.0-wmf.36/includes/MediaWiki.php(834): MediaWiki->restInPeace()
#16 /srv/mediawiki/php-1.36.0-wmf.36/includes/MediaWiki.php(846): MediaWiki->{closure}()
#17 /srv/mediawiki/php-1.36.0-wmf.36/api.php(125): MediaWiki->doPostOutputShutdown()
#18 /srv/mediawiki/php-1.36.0-wmf.36/api.php(45): wfApiMain()
#19 /srv/mediawiki/w/api.php(3): require(string)
#20 {main}

See https://logstash.wikimedia.org/app/dashboards#/view/DBQuery?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:'2021-03-24T09:18:27.275Z',to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:query,negate:!f,type:custom,value:'%7B%22match%22:%7B%22type%22:%7B%22query%22:%22mediawiki%22%7D%7D%7D'),query:(match:(type:(query:mediawiki)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:channel,negate:!f,params:(query:DBQuery),type:phrase),query:(match:(channel:(query:DBQuery,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:wiki.keyword,negate:!f,params:(query:commonswiki),type:phrase),query:(match_phrase:(wiki.keyword:commonswiki)))),fullScreenMode:!f,options:(darkTheme:!f),query:(language:lucene,query:'*'),timeRestore:!t,title:DBQuery,viewMode:view) for more of these

Krinkle renamed this task from Saving options is taking too long on Commons to Loading or saving oreferenc is taking too long on Commons.Apr 18 2021, 7:56 PM
Krinkle renamed this task from Loading or saving oreferenc is taking too long on Commons to Loading or saving preferences is taking too long on Commons.
Krinkle added a subscriber: Krinkle.
From T280458, steps to reproduce:
  1. Logged-in user on https://commons.wikimedia.org with initial preference of British English (en-gb). Using Firefox on macOS.
  2. Open ULS selector from personal tools on the top right.
  3. Select anything (e.g. Deutsch).

Expected result:

The page reloads in German.

Actual result:

Nothing happens for a good 10-20 seconds, during which an AJAX request is busy. It is a POST request to /w/api.php. After this finishes, eventually, the page reloads.

I can reproduce the issue over WikimediaDebug:

I've thusfar been unable to inspect the request in Firefox for some reason, but based on a breakpoint I think it must be one of these there request:

It's non-trivial to find out since the Logstash entries even with Verbose logging, do not record the API post payload, nor which API module is handling the request.

TODO: Consider letting the mwlog1001/api.log-style entries go to Logstash at least with severity=debug for when WikimediaDebug is used.
TODO: Consider having the API log with wfDebug or other debug message which API module(s) are handling the request.

And while we have mwlog1001/api.php, it is taking at least 20-30 minutes to get a result from fgrep over that file for a request ID from this time of day (toward the end of rollover).

Also, even after 20 minutes of fgrep'ing, there were no results found. It wasn't logged it seems?

Where the time is spent

There are a lot of rather odd and unexpected things in the Logstash entries so far. In no particular order:

  • There are 340 entries under gitinfo, informing us:
    • Loaded git data from cache for /srv/mediawiki/php-1.37.0-wmf.1.
    • Candidate cacheFile=/srv/mediawiki/php-1.37.0-wmf.1/cache/gitinfo/info.json for /srv/mediawiki/php-1.37.0-wmf.1
    • Is this opening the same file from disk 300 times?
    • Why is GitInfo being consulted?
  • There are 229 entries under ExternalStore, in the style:
    • ExternalStoreDB::fetchBlob cache miss on cluster25/12801096@commonswiki
    • ExternalStoreDB::fetchBlob cache miss on cluster27/65048413@commonswiki
    • ExternalStoreDB::fetchBlob cache miss on cluster25/12801096@commonswiki
    • etc.
    • What are these hundreds of ExternalStore fetches for?
  • There are 255 entries under GlobalTitleFail telling us that:
    • MessageCache::parse called with no title set.
    • That's a lot of Parser invocations.
  • There are 10,107 entries under memcached. A few highlights:
      • 1,688x says getWithSetCallback(commonswiki:pagetranslation:sourcepages): process cache hit.
      • 458x match fetchOrRegenerate(global:SqlBlobStore-blob:commonswiki:tt*: miss, new value computed (combined entries for different revison IDs)
      • 196x says getMulti(WANCache:global:revision-row-1.29:commonswiki:11357193:104218666|#|v) which is Template:Gadget-desc.
    • 140x says Rejected set() for {cachekey} due to snapshot lag (late regeneration).
  • There are 74 entries under UserOptionsManager,
    • 72x says Loading options from override cache from UserOptionsManager::loadUserOptions($user)
    • 2x says Loading options from database
  • Parser, ContentHandler, and interwiki channels:
    • ParserFactory: using default preprocessor
    • Registered handler for Scribunto: ScribuntoContentHandler
    • Parser::setOutputFlag: set user-signature flag on 'Special:Badtitle/dummy title for API calls set in api.php'; User signature detected
    • MediaWiki\Interwiki\ClassicInterwikiLookup::getInterwikiCacheEntry( mediawikiwiki )
    • MediaWiki\Interwiki\ClassicInterwikiLookup::getInterwikiCacheEntry( en )
    • MediaWiki\Interwiki\ClassicInterwikiLookup::getInterwikiCacheEntry( tools )

Rough analysis

The reason for causing so many Parser invocations (or rather, why at all) is presumably due to PreferencesFactory registry being based on HTMLForm labels, and HTMLForm in turn requires an associative array when creating checkboxes (for uniqueness?) where the array key is the label HTML. Thus this means every gadget checkbox has to parse the description in full in order to register the gadget preferences. This mess has been known for years, tracked at T58633: Preference retrieval should not require so much parsing.

These Parser invocations result in hundreds of unbatched SqlBlobStore memcached fetches (that's 500 times 1ms), which in turn result in 200 ExternalStoreDB queries over half a dozen ES shard connections.

The Rejected set() for {cachekey} due to snapshot lag messages explain why this super slow response can happen consistently for users because the majority of the request happens after the lag tolerance has ellapsed (5-6 seconds; in total: 10-20 seconds). Which means the work that happens there can't be safely cached under our current WANCache rules (there actually is a short interim cache to avoid stampedes, but for most users that means a cache miss since this isn't done many times a minute globally).

Open questions / Apparent defects

  • Why is GitInfo being consulted in this API request? (And why 340 times?)
  • Why is UserOptionsManager repeatedly loading the same options nearly a hundred times over?
  • Why is parsing these gadget labels so slow?
  • Why is the Translate extension making 1,688 roundtrips to Memcached for the same key? That's probably ~2 seconds by itself. (Nvm, these are process cached)
  • Can we resource fixing of T58633?
  • Why is parsing these gadget labels so slow?

I traced the slow Lua code to a change in Template:Gadget-state.

Background:

This template is used to give a little bit of structure to the gadget descriptions on the preferences page, such as MediaWiki:Gadget-HotCat. One of the parameters to this template is default=1, which is used to generate a localised way of saying "This gadget is enabled by default". The setting of this template parameter, of course, had to kept in sync by hand with the mechanical definitions of gadgets at MediaWiki:Gadgets-definition.

What happened:

User Tacsipacsi developed a nice Lua module (source), to automatically do this. That way, the parameter doesn't have to be set. Instead, the Lua module reads the raw source of the "Gadgets-definition" page, tries to find the definition of a given gadget, and tells the template whether it contains the default instruction, and if so, tell the tempate to generate the relevant portion of the description for it.

It was enabled (edit) on 26 March 2021, which is two days before this task was filed.

Based on a rough reading of the Lua module code used for this, I feel like this does not seem like something inherently expensive or complicated. The page it reads is a relatively small wiki page (11k), and all it does is scan through it once to find a matching line, and then perform one more match in that line, and then return a boolean. Given the revision text should be cached in both Memcached by MW, and process-cached by Parser or Scribunto; it seems like even doing that ~ 130 times for each parser run of a gadget description message shouldn't take a whole lot of time.

But, given much of the profile and the warning logs do seem to relate to the gadget descriptions and this Lua code, I have reverted that change for now. It seems to have shaven off multiple whole seconds (from ~15s down to ~3s for loading preferences, or changing ULS settings; after a handful of attempts). But there are still huge outliers and it still seems prone to timeouts, so I don't think this is resolved yet. But at least one of the mysteries is answered.

See also: T51501, which would make this user workaround redundant.

Change 681218 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/Gadgets@master] Reduce message parse in GadgetHooks::getPreferences

https://gerrit.wikimedia.org/r/681218

Change 681218 merged by jenkins-bot:

[mediawiki/extensions/Gadgets@master] Reduce message parse in GadgetHooks::getPreferences

https://gerrit.wikimedia.org/r/681218

Change 697076 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/Gadgets@master] Reduce message parse in GadgetHooks::getPreferences (second time)

https://gerrit.wikimedia.org/r/697076

I can try backporting this on Monday.

Change 697076 merged by jenkins-bot:

[mediawiki/extensions/Gadgets@master] Reduce message parse in GadgetHooks::getPreferences (second time)

https://gerrit.wikimedia.org/r/697076

Change 697816 had a related patch set uploaded (by Ladsgroup; author: Umherirrender):

[mediawiki/extensions/Gadgets@wmf/1.37.0-wmf.7] Reduce message parse in GadgetHooks::getPreferences (second time)

https://gerrit.wikimedia.org/r/697816

Change 697816 merged by jenkins-bot:

[mediawiki/extensions/Gadgets@wmf/1.37.0-wmf.7] Reduce message parse in GadgetHooks::getPreferences (second time)

https://gerrit.wikimedia.org/r/697816

Change 697887 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] user: Accept options-messages for multiselect user options

https://gerrit.wikimedia.org/r/697887

Change 697818 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@wmf/1.37.0-wmf.7] user: Accept options-messages for multiselect user options

https://gerrit.wikimedia.org/r/697818

Change 697818 merged by jenkins-bot:

[mediawiki/core@wmf/1.37.0-wmf.7] user: Accept options-messages for multiselect user options

https://gerrit.wikimedia.org/r/697818

Mentioned in SAL (#wikimedia-operations) [2021-06-03T00:36:35Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.7/includes/user/UserOptionsManager.php: Backport: [[gerrit:697818|user: Accept options-messages for multiselect user options (T58633 T278650)]] (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2021-06-03T00:40:00Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.7/extensions/Gadgets: Backport: [[gerrit:697816|Reduce message parse in GadgetHooks::getPreferences (second time) (T58633 T278650)]], Try II (duration: 00m 57s)

Change 697887 merged by jenkins-bot:

[mediawiki/core@master] user: Accept options-messages for multiselect user options

https://gerrit.wikimedia.org/r/697887

Deployed now. Some notes to compare:

This is mwbdeug and saving prefs went from 11seconds to 7. Keep in mind in production, it'll be much faster

Median of options API

image.png (913×1 px, 123 KB)

95th percentile of options API:

image.png (913×1 px, 132 KB)

The 99 percentile had the best drop. The average of 5.56 seconds over time is now one second:

image.png (338×1 px, 86 KB)