Page MenuHomePhabricator

Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag
Closed, ResolvedPublic

Description

For about 20 hours Wikidata dispatch lag has been being higher than 10 minutes. Further investigation is needed.

See https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?orgId=1&from=now-2d&to=now&refresh=1m

Event Timeline

Bugreporter triaged this task as Unbreak Now! priority.May 16 2020, 8:04 PM
Bugreporter updated the task description. (Show Details)

The timing lines up with this db lag on db1101

image.png (273×1 px, 43 KB)

image.png (264×925 px, 28 KB)

Dispatching is slowing down, as the db server is lagged and the dispatch process has a waitForReplication call in the code.
Specifically this is in SqlChangeDispatchCoordinator::releaseClient

This db server probably needs to be depooled.

Addshore renamed this task from Fix Wikidata dispatch to Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag.May 16 2020, 9:16 PM

Change 596824 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] SqlChangeDispatchCoordinator statsd time the wait for replication

https://gerrit.wikimedia.org/r/596824

Addshore claimed this task.
Addshore lowered the priority of this task from Unbreak Now! to High.

Marking as resolved as the impact on wikidata is now gone

10:32 PM <marostegui> This host is lagging just a few seconds but all the time
10:32 PM <marostegui> Due to:
10:32 PM <marostegui> yes
10:33 PM <jynus> Query | 95861
10:33 PM <marostegui> that
10:33 PM <marostegui> I am going to kill all that
10:33 PM <marostegui> the query killer is disabled due to a 10.4 bug

image.png (287×922 px, 32 KB)

image.png (290×925 px, 24 KB)

Some more details.
There were a few long running queries

| 669154401 | wikiuser        | 10.64.0.59:40112   | wikidatawiki       | Query     |   95780 | Sending data                                   | SELECT /* SpecialRecentChanges::doMainQuery  */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_

Those queries were running for hours and as the query killer wasn't enabled due to a mariadb 10.4 bug (T247728), they were causing 3-5 seconds of continued lag.

Change 596824 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Observability for SqlChangeDispatchCoordinator wait for replication

https://gerrit.wikimedia.org/r/596824