ChronologyProtector breaks if two requests write different sets of databases
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Lea_Lacroix_WMDE
	Dec 7 2017, 1:26 PM

Description

If two requests within 60 seconds of each other write to different sets of databases (and the second request's databases are not a superset of the first request's), ChronologyProtector will wind up waiting until timeout for databases that weren't written during the second request to be "updated" to the second request's timestamp. This causes slow (>5s) responses to subsequent requests until ChronologyProtector's cache expires (after 60 seconds of no further writes), both read and write.

It looks like this is a long-standing bug that was made much more visible by the recent fix to actually use the cpPosTime cookie.

See T182322#3823259 for a more detailed analysis.

Details

Subject	Repo	Branch	Lines +/-
Disable cpPosTime injection to LBFactory in Setup.php for now	mediawiki/core	master	+2 -1
Replace cpPosTime cookie/parameter with cpPosIndex	mediawiki/core	master	+198 -139
Disable cpPosTime injection to LBFactory in Setup.php for now	mediawiki/core	wmf/1.31.0-wmf.16	+2 -1
Disable cpPosTime injection to LBFactory in Setup.php for now	mediawiki/core	wmf/1.31.0-wmf.15	+2 -1
Disable cpPosTime injection to LBFactory in Setup.php for now	mediawiki/core	wmf/1.31.0-wmf.12	+2 -1
Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"	mediawiki/core	wmf/1.31.0-wmf.11	+11 -94
Revert "Make ChronologyProtector actually use cpPosTime cookies"	mediawiki/core	wmf/1.31.0-wmf.11	+4 -11
Revert "Make ChronologyProtector actually use cpPosTime cookies"	mediawiki/core	master	+4 -11
Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"	mediawiki/core	master	+11 -94
Disable cpPosTime injection to LBFactory in Setup.php for now	mediawiki/core	wmf/1.31.0-wmf.11	+2 -1

Customize query in gerrit

Related Objects

Mentioned In: T190082: 5-second latency for certain API calls?

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@Ladsgroup @aude can you help with investigating this? This seems rather urgent.

@aaron Do you have an idea what could be going on here?

In T182322#3822491, @daniel wrote:

Also, is this Wikidata-only, or does the problem affect other sites as well?

I managed to get a similar thing to happen on testwiki T182322#3822451

In T182322#3822496, @Fnielsen wrote:

I see the default of Wikimedia\WaitConditionLoop is 5 seconds which corresponds well with the 5-6 seconds delay I experience with the API when editing on a desktop where there are calls to, e.g., https://www.wikidata.org/w/api.php?action=wbparsevalue

ChronologyProtector which uses WaitConditionLoop passes in its own timeout which is also 5.

const POS_WAIT_TIMEOUT = 5;

The ChronologyProtector is created in LBFactory and the timeout is taken from the request itself.

This was recently changed https://gerrit.wikimedia.org/r/#/c/393983/

daniel renamed this task from Editing Wikidata is slow to API is (intermittently) very slow.Dec 8 2017, 11:53 AM

daniel added a project: MediaWiki-Action-API.

daniel updated the task description. (Show Details)

So, I'm pretty sure reverting https://gerrit.wikimedia.org/r/#/c/393983/ would solve our problems here.

This is to do with the setting and handling of the cpPosTime cookie.

A POST request responds with a set cookie with a cpPosTime

set-cookie:cpPosTime=1512733965.1418; expires=Fri, 08-Dec-2017 11:53:45 GMT; Max-Age=60; path=/; secure; httponly

A subsequent GET request sends this cookie postTime to the server.

cookie:XXXXXXXXXX; cpPosTime=1512733965.1418; UseDC=master; UseCDNCache=false

and then anything that reads from a slave db connection (as far as I can tell) for example central auth getting a session, or indeed wikibase, waits for this time to pass before querying.

We don't really need this for any of the wikibase UI API requests AFAIK.
@Jonas @thiemowmde etc.

Pinging @Krinkle for he merged the patch in question

Thinking about this more this really really doesn't make sense for us (wikibase ui).

We edit and touch the db altering an entity, removing a statement for example
We then quickly click edit again for a different statement, triggering parse and or format value requests (which don't touch the DB directly).
Those GET requests get blocked waiting for centralauth to get a user from the db, when the api module itself doesn't need any db interaction.

This doesn't actually affect just the UI but also bots, if a bot makes and edit, and then tries to make a get request it will be stuck waiting, perhaps for no reason.

Lucas_Werkmeister_WMDE subscribed.Dec 8 2017, 12:06 PM

Addshore renamed this task from API is (intermittently) very slow to API is (intermittently) very slow - blocked waiting for cpPosTime cookie.Dec 8 2017, 12:08 PM

Addshore updated the task description. (Show Details)

Addshore added a project: User-Addshore.

Addshore moved this task from Unsorted 💣 to Watching 👀 on the User-Addshore board.

If we want to revert it looks like we would have to revert both related changes (to be safe):

Or wait for Performance-Team =]

Steinsplitter subscribed.Dec 8 2017, 12:45 PM

Anvilaquarius subscribed.Dec 8 2017, 1:25 PM

pinging @Anomie for user related stuff

Sjoerddebruin subscribed.Dec 8 2017, 3:33 PM

There doesn't seem to be any real "user related stuff" here, or anything actually related to Wikibase or the API. It looks like the bug will be hit on whatever the first thing is that opens a DB connection. In the linked traces that happens to be CentralAuth checking whether the user exists locally.

It seems likely that Addshore is right that the renewed use of the cpPosTime cookie is probably a factor here.

Based on Addshore's comment above that it could be reproduced relatively easily on testwiki, I added a hack on mwdebug1001 to ChronologyProtector::initPositions() to do some additional logging. and fired some write requests immediately followed by read requests at it.

The relevant bit of the hack, inserted here, looked like

wfDebugLog( 'AdHocDebug', "$this->waitForPosTime; " . self::minPosTime( $data ) . "; " . var_export( $data, 1 ) );

A log message not exhibiting the bug: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.12.08/mediawiki?id=AWA2vEiEhTduSC7pvHX8&_g=()

1512746796.6372; 1512746796.6398; array (
  'positions' => 
  array (
    'db1075' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'db1075-bin.002287',
      'pos' => '971204539',
      'gtids' => 
      array (
        0 => '0-171966669-4051016729',
        1 => '180359174-180359174-94123433',
        2 => '171966669-171966669-1497088137',
      ),
      'asOfTime' => 1512746796.6398,
    )),
    '10.64.16.187' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'es1014-bin.001930',
      'pos' => '91982167',
      'gtids' => 
      array (
        0 => '',
      ),
      'asOfTime' => 1512746796.641,
    )),
  ),
)

A log message exhibiting the bug: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.12.08/mediawiki?id=AWA2u2tTOBGWx1mQW9Xy&_g=()

1512746734.7688; 1512746683.2953; array (
  'positions' => 
  array (
    'db1075' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'db1075-bin.002287',
      'pos' => '967383713',
      'gtids' => 
      array (
        0 => '0-171966669-4051016667',
        1 => '180359174-180359174-94123433',
        2 => '171966669-171966669-1497079015',
      ),
      'asOfTime' => 1512746734.7753,
    )),
    '10.64.16.187' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'es1014-bin.001930',
      'pos' => '86433351',
      'gtids' => 
      array (
        0 => '',
      ),
      'asOfTime' => 1512746683.2953,
    )),
    '10.64.0.6' => 
    Wikimedia\Rdbms\MySQLMasterPos::__set_state(array(
      'file' => 'es1011-bin.001932',
      'pos' => '392199459',
      'gtids' => 
      array (
        0 => '',
      ),
      'asOfTime' => 1512746734.785,
    )),
  ),
)

It looks like what's happening here is that the write balancing between ExternalStore cluster24 and cluster25 is confusing ChronologyProtector: if one write hits cluster24 and another hits cluster25 within the 60-second lifetime of ChronologyProtector's memcache entry, ChronologyProtector is going to be waiting for an "update" to whichever of the two was written earlier that is never going to come. This sort of thing can probably happen for any two requests that write different sets of databases.

Trizek-WMF added a project: User-notice.Dec 8 2017, 4:12 PM

Anomie moved this task from Unsorted to Non-core-API stuff on the MediaWiki-Action-API board.Dec 8 2017, 4:53 PM

Kghbln unsubscribed.Dec 8 2017, 5:23 PM

Anomie renamed this task from API is (intermittently) very slow - blocked waiting for cpPosTime cookie to ChronologyProtector breaks if two requests write different sets of databases.Dec 8 2017, 5:30 PM

Anomie removed projects: MediaWiki-Action-API, Wikidata.

Anomie updated the task description. (Show Details)

Somehow during the process of finding out what is happening, the bug description looks somewhat harmless now. Probably it should read "WIKIDATA UNUSABLE RIGHT NOW!!!"...

That would not be an appropriately descriptive bug title.

I've poked @aaron to take a look, but he doesn't appear to be online at the moment. I think it makes sense to revert the commits that @Addshore pointed out if that will relieve things for wikidata, but I can't say that I understand the process or the other impacts of doing so.

Is there someone who's able to do so while we're waiting on Aaron to take a look?

Change 396443 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396443

Change 396444 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396444

Change 396445 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.31.0-wmf.11] Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396445

Change 396446 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.31.0-wmf.11] Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396446

Change 396447 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396447

Change 396451 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.11] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396451

MusikAnimal subscribed.Dec 8 2017, 8:04 PM

Change 396451 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.11] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396451

I'm not sure why the time check logic is so complicated, I guess it got prematurely generalized from the single-DB case.

It really only needs to look at the timestamp of the the cas() write that happened with the merged positions last-time. Ideally it would use a counter though.

• FiZie removed projects: Patch-For-Review, User-notice, User-Addshore, Performance-Team, Wikimedia-Performance-publish, Performance Issue.Dec 8 2017, 8:35 PM

• FiZie updated the task description. (Show Details)

• FiZie removed subscribers: MusikAnimal, gerritbot, • Imarlier and 28 others.

• FiZie closed this task as Declined.Dec 8 2017, 8:38 PM

Reedy reopened this task as Open.Dec 8 2017, 8:42 PM

Reedy added projects: Patch-For-Review, User-notice, User-Addshore, Performance-Team, Wikimedia-Performance-publish, Performance Issue.

Aklapper updated the task description. (Show Details)Dec 8 2017, 8:48 PM

Aklapper added a project: MediaWiki-General.Dec 8 2017, 8:50 PM

Aklapper added subscribers: MusikAnimal, GerritBot, • Imarlier and 28 others.

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)).Dec 8 2017, 9:00 PM

Not sure if it is related to this, but for about 2 days already I was getting incredible lag and slowness in Commons, specially while performing a task with AWB. The feeling was that it was being throttled, as after some rest it would do between 2-5 edits normally, then that incredible slowness again. The manual changes on Commons were taking a loooong time to commit, and autocomplete would only start after an exasperating delay. Seems to be back to normal now, finally.

Change 396546 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] [WIP] Replace cpPosTime cookie/parameter with cpPosIndex

https://gerrit.wikimedia.org/r/396546

Change 396443 abandoned by Reedy:
Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396443

Change 396444 abandoned by Reedy:
Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396444

Change 396446 abandoned by Reedy:
Revert "Make ChronologyProtector actually use cpPosTime cookies"

https://gerrit.wikimedia.org/r/396446

Change 396445 abandoned by Reedy:
Revert "objectcache: Make WANObjectCache interim caching not interfere with ChronologyProtector"

https://gerrit.wikimedia.org/r/396445

In T182322#3824120, @Darwinius wrote:

Not sure if it is related to this, but for about 2 days already I was getting incredible lag and slowness in Commons, specially while performing a task with AWB. The feeling was that it was being throttled, as after some rest it would do between 2-5 edits normally, then that incredible slowness again. The manual changes on Commons were taking a loooong time to commit, and autocomplete would only start after an exasperating delay. Seems to be back to normal now, finally.

Sounds like the correct bug

This was fixed with the deployment of https://gerrit.wikimedia.org/r/#/c/396451/

Marking as resolved for now, as the UBN is resolved.
If we want further actions they should probably live in further tickets.

Addshore added a project: Wikidata.Dec 9 2017, 3:30 AM

Liuxinyu970226 unsubscribed.Dec 9 2017, 10:38 AM

As a user, it's back to normal for me. Thanks !

Krinkle edited projects, added MediaWiki-libs-Rdbms; removed MediaWiki-General, Performance Issue, Wikimedia-Performance-publish, Patch-For-Review.Dec 11 2017, 6:31 PM

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.

Addshore reassigned this task from Addshore to aaron.Dec 12 2017, 9:36 AM

Addshore moved this task from Watching 👀 to Closing ✔️ on the User-Addshore board.

Change 397894 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.12] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/397894

gerritbot added a project: Patch-For-Review.Dec 12 2017, 6:32 PM

Change 397894 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.12] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/397894

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2017-12-12 (1.31.0-wmf.12)); removed MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)).Dec 12 2017, 7:00 PM

@Addshore Thanks for adding User-notice – how would you simplify this for Tech News in a couple of sentences? Asking to make sure I don't misrepresent the effects.

In T182322#3836915, @Johan wrote:

@Addshore Thanks for adding User-notice – how would you simplify this for Tech News in a couple of sentences? Asking to make sure I don't misrepresent the effects.

Something like "Between the 6th and 12th of December users may have noticed POST requests to the API taking longer than usual" could do it.
Could also specifically mention that this was most noticeable with the wikidata UI and some gadgets.

Change 401610 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.15] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/401610

Change 401610 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.15] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/401610

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)); removed MW-1.31-release-notes (WMF-deploy-2017-12-12 (1.31.0-wmf.12)).Jan 2 2018, 9:00 PM

Johan moved this task from To Triage to In current Tech/News draft on the User-notice board.Jan 3 2018, 4:48 PM

RandomDSdevel awarded a token.Jan 9 2018, 1:42 AM

Change 403238 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.16] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/403238

Change 403238 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.16] Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/403238

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)); removed MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)).Jan 9 2018, 11:00 PM

Johan moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board.Jan 11 2018, 4:12 AM

Change 396546 merged by jenkins-bot:
[mediawiki/core@master] Replace cpPosTime cookie/parameter with cpPosIndex

https://gerrit.wikimedia.org/r/396546

Change 396447 abandoned by Aaron Schulz:
Disable cpPosTime injection to LBFactory in Setup.php for now

https://gerrit.wikimedia.org/r/396447

Is this resolved beyond a temporary fix? If yes, how? If no, is there a ticket describing what a proper solution would look like?

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2018-01-16 (1.31.0-wmf.17)); removed MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)).Jan 13 2018, 8:01 PM

Yes, https://gerrit.wikimedia.org/r/396546 .

Dbrant mentioned this in T190082: 5-second latency for certain API calls?.May 10 2018, 11:37 PM

Maintenance_bot edited projects, added User-notice-archive; removed User-notice.Aug 13 2022, 1:02 PM

Aklapper edited subscribers, added: gerritbot; removed: GerritBot.May 16 2023, 10:18 AM

	F11680019: image.png
	Dec 8 2017, 10:58 AM

	F11679903: image.png
	Dec 8 2017, 10:41 AM

	F11669665: image.png
	Dec 7 2017, 8:54 PM

	F11669545: image.png
	Dec 7 2017, 8:46 PM

ChronologyProtector breaks if two requests write different sets of databasesClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

ChronologyProtector breaks if two requests write different sets of databases
Closed, ResolvedPublic
Actions