Page MenuHomePhabricator

ChronologyProtector breaks if two requests write different sets of databases
Closed, ResolvedPublic

Description

If two requests within 60 seconds of each other write to different sets of databases (and the second request's databases are not a superset of the first request's), ChronologyProtector will wind up waiting until timeout for databases that weren't written during the second request to be "updated" to the second request's timestamp. This causes slow (>5s) responses to subsequent requests until ChronologyProtector's cache expires (after 60 seconds of no further writes), both read and write.

It looks like this is a long-standing bug that was made much more visible by the recent fix to actually use the cpPosTime cookie.

See T182322#3823259 for a more detailed analysis.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@Ladsgroup @aude can you help with investigating this? This seems rather urgent.

@aaron Do you have an idea what could be going on here?

Also, is this Wikidata-only, or does the problem affect other sites as well?

I managed to get a similar thing to happen on testwiki T182322#3822451

I see the default of Wikimedia\WaitConditionLoop is 5 seconds which corresponds well with the 5-6 seconds delay I experience with the API when editing on a desktop where there are calls to, e.g., https://www.wikidata.org/w/api.php?action=wbparsevalue

ChronologyProtector which uses WaitConditionLoop passes in its own timeout which is also 5.

const POS_WAIT_TIMEOUT = 5;

The ChronologyProtector is created in LBFactory and the timeout is taken from the request itself.

This was recently changed https://gerrit.wikimedia.org/r/#/c/393983/

daniel renamed this task from Editing Wikidata is slow to API is (intermittently) very slow.Dec 8 2017, 11:53 AM
daniel added a project: MediaWiki-Action-API.
daniel updated the task description. (Show Details)

So, I'm pretty sure reverting https://gerrit.wikimedia.org/r/#/c/393983/ would solve our problems here.

This is to do with the setting and handling of the cpPosTime cookie.

A POST request responds with a set cookie with a cpPosTime

set-cookie:cpPosTime=1512733965.1418; expires=Fri, 08-Dec-2017 11:53:45 GMT; Max-Age=60; path=/; secure; httponly

A subsequent GET request sends this cookie postTime to the server.

cookie:XXXXXXXXXX; cpPosTime=1512733965.1418; UseDC=master; UseCDNCache=false

and then anything that reads from a slave db connection (as far as I can tell) for example central auth getting a session, or indeed wikibase, waits for this time to pass before querying.

We don't really need this for any of the wikibase UI API requests AFAIK.
@Jonas @thiemowmde etc.

Pinging @Krinkle for he merged the patch in question

Thinking about this more this really really doesn't make sense for us (wikibase ui).

  1. We edit and touch the db altering an entity, removing a statement for example
  2. We then quickly click edit again for a different statement, triggering parse and or format value requests (which don't touch the DB directly).
  3. Those GET requests get blocked waiting for centralauth to get a user from the db, when the api module itself doesn't need any db interaction.

This doesn't actually affect just the UI but also bots, if a bot makes and edit, and then tries to make a get request it will be stuck waiting, perhaps for no reason.

Addshore renamed this task from API is (intermittently) very slow to API is (intermittently) very slow - blocked waiting for cpPosTime cookie.Dec 8 2017, 12:08 PM
Addshore updated the task description. (Show Details)
Addshore added a project: User-Addshore.
Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

If we want to revert it looks like we would have to revert both related changes (to be safe):

Or wait for Performance-Team =]

There doesn't seem to be any real "user related stuff" here, or anything actually related to Wikibase or the API. It looks like the bug will be hit on whatever the first thing is that opens a DB connection. In the linked traces that happens to be CentralAuth checking whether the user exists locally.

It seems likely that Addshore is right that the renewed use of the cpPosTime cookie is probably a factor here.


Based on Addshore's comment above that it could be reproduced relatively easily on testwiki, I added a hack on mwdebug1001 to ChronologyProtector::initPositions() to do some additional logging. and fired some write requests immediately followed by read requests at it.

The relevant bit of the hack, inserted here, looked like

wfDebugLog( 'AdHocDebug', "$this->waitForPosTime; " . self::minPosTime( $data ) . "; " . var_export( $data, 1 ) );
1512746796.6372; 1512746796.6398; array (
  'positions' => 
  array (
    'db1075' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'db1075-bin.002287',
      'pos' => '971204539',
      'gtids' => 
      array (
        0 => '0-171966669-4051016729',
        1 => '180359174-180359174-94123433',
        2 => '171966669-171966669-1497088137',
      ),
      'asOfTime' => 1512746796.6398,
    )),
    '10.64.16.187' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'es1014-bin.001930',
      'pos' => '91982167',
      'gtids' => 
      array (
        0 => '',
      ),
      'asOfTime' => 1512746796.641,
    )),
  ),
)
1512746734.7688; 1512746683.2953; array (
  'positions' => 
  array (
    'db1075' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'db1075-bin.002287',
      'pos' => '967383713',
      'gtids' => 
      array (
        0 => '0-171966669-4051016667',
        1 => '180359174-180359174-94123433',
        2 => '171966669-171966669-1497079015',
      ),
      'asOfTime' => 1512746734.7753,
    )),
    '10.64.16.187' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'es1014-bin.001930',
      'pos' => '86433351',
      'gtids' => 
      array (
        0 => '',
      ),
      'asOfTime' => 1512746683.2953,
    )),
    '10.64.0.6' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'es1011-bin.001932',
      'pos' => '392199459',
      'gtids' => 
      array (
        0 => '',
      ),
      'asOfTime' => 1512746734.785,
    )),
  ),
)

It looks like what's happening here is that the write balancing between ExternalStore cluster24 and cluster25 is confusing ChronologyProtector: if one write hits cluster24 and another hits cluster25 within the 60-second lifetime of ChronologyProtector's memcache entry, ChronologyProtector is going to be waiting for an "update" to whichever of the two was written earlier that is never going to come. This sort of thing can probably happen for any two requests that write different sets of databases.

Anomie renamed this task from API is (intermittently) very slow - blocked waiting for cpPosTime cookie to ChronologyProtector breaks if two requests write different sets of databases.Dec 8 2017, 5:30 PM
Anomie updated the task description. (Show Details)
Anomie updated the task description. (Show Details)

Somehow during the process of finding out what is happening, the bug description looks somewhat harmless now. Probably it should read "WIKIDATA UNUSABLE RIGHT NOW!!!"...

That would not be an appropriately descriptive bug title.

I've poked @aaron to take a look, but he doesn't appear to be online at the moment. I think it makes sense to revert the commits that @Addshore pointed out if that will relieve things for wikidata, but I can't say that I understand the process or the other impacts of doing so.

Is there someone who's able to do so while we're waiting on Aaron to take a look?

Change 396443 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396443

Change 396444 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396444

Change 396445 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.31.0-wmf.11] Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396445

Change 396446 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.31.0-wmf.11] Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396446

Change 396447 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396447

Change 396451 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.11] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396451

Change 396451 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.11] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396451

I'm not sure why the time check logic is so complicated, I guess it got prematurely generalized from the single-DB case.

It really only needs to look at the timestamp of the the cas() write that happened with the merged positions last-time. Ideally it would use a counter though.

Not sure if it is related to this, but for about 2 days already I was getting incredible lag and slowness in Commons, specially while performing a task with AWB. The feeling was that it was being throttled, as after some rest it would do between 2-5 edits normally, then that incredible slowness again. The manual changes on Commons were taking a loooong time to commit, and autocomplete would only start after an exasperating delay. Seems to be back to normal now, finally.

Change 396546 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] [WIP] Replace cpPosTime cookie/parameter with cpPosIndex

https://gerrit.wikimedia.org/r/396546

Change 396443 abandoned by Reedy:
Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396443

Change 396444 abandoned by Reedy:
Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396444

Change 396446 abandoned by Reedy:
Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396446

Change 396445 abandoned by Reedy:
Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396445

Not sure if it is related to this, but for about 2 days already I was getting incredible lag and slowness in Commons, specially while performing a task with AWB. The feeling was that it was being throttled, as after some rest it would do between 2-5 edits normally, then that incredible slowness again. The manual changes on Commons were taking a loooong time to commit, and autocomplete would only start after an exasperating delay. Seems to be back to normal now, finally.

Sounds like the correct bug

Addshore claimed this task.

Marking as resolved for now, as the UBN is resolved.
If we want further actions they should probably live in further tickets.

As a user, it's back to normal for me. Thanks !

Change 397894 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.12] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/397894

Change 397894 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.12] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/397894

@Addshore Thanks for adding User-notice – how would you simplify this for Tech News in a couple of sentences? Asking to make sure I don't misrepresent the effects.

@Addshore Thanks for adding User-notice – how would you simplify this for Tech News in a couple of sentences? Asking to make sure I don't misrepresent the effects.

Something like "Between the 6th and 12th of December users may have noticed POST requests to the API taking longer than usual" could do it.
Could also specifically mention that this was most noticeable with the wikidata UI and some gadgets.

Change 401610 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.15] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/401610

Change 401610 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.15] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/401610

Change 403238 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.16] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/403238

Change 403238 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.16] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/403238

Change 396546 merged by jenkins-bot:
[mediawiki/core@master] Replace cpPosTime cookie/parameter with cpPosIndex

https://gerrit.wikimedia.org/r/396546

Change 396447 abandoned by Aaron Schulz:
Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396447

Is this resolved beyond a temporary fix? If yes, how? If no, is there a ticket describing what a proper solution would look like?