
high dispatch lag in Wikidata (27 May 2020)
Closed, ResolvedPublic

Event Timeline

Bugreporter triaged this task as Unbreak Now! priority. (May 27 2020, 11:45 AM)
Bugreporter lowered the priority of this task from Unbreak Now! to Medium. (May 27 2020, 7:48 PM)

This seems to be ok again now. @Addshore any idea what was going on?

Addshore claimed this task.

The lag:

image.png (269×923 px, 31 KB)

Replication lag on a database server at that time:

image.png (919×1 px, 209 KB)

The root cause is T238199: SpecialFewestRevisions::reallyDoQuery takes more than 9h to run. The special page was removed in T238199#6169152, which fixed this case of dispatch lag.

This was very similar to T252952: Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag. In that task, DB lag caused the dispatch process to wait for replication, which raised dispatch lag and in turn maxlag.
If a DB server becomes lagged again, we are likely to see the same pattern, although the cause of the lag will no longer be the special page (which was the cause in this case).
One of the follow-ups for the DBAs from those tickets is T253120: Create prometheus alert to detect lag spikes, which should help get these tackled faster.
Other than that, we could alter how much dispatch lag is taken into account in maxlag. However, in both of these cases the DB lag was over 5 anyway, so maxlag would still have been over 5 for the majority of the time period.
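
To illustrate that last point, here is a minimal sketch. It assumes maxlag is reported as the larger of the DB replication lag and the dispatch lag divided by a scaling factor. The function name, the factor values, and the lag numbers are hypothetical and are not taken from the Wikibase code or from the graphs above.

```python
def effective_maxlag(db_lag_s: float, dispatch_lag_s: float, dispatch_factor: float) -> float:
    """Return the lag value a client would compare against its maxlag threshold.

    Assumption: dispatch lag is scaled down by `dispatch_factor` and the
    larger of the two lag sources wins. This is an illustrative model only.
    """
    if dispatch_factor <= 0:
        raise ValueError("dispatch_factor must be positive")
    return max(db_lag_s, dispatch_lag_s / dispatch_factor)


# Hypothetical numbers: DB lag of 8 s, dispatch lag of 600 s.
THRESHOLD = 5.0

# With a factor of 60, the scaled dispatch lag (10) dominates.
with_current_factor = effective_maxlag(8.0, 600.0, 60.0)

# Doubling the factor halves the dispatch contribution (5), but the DB lag (8)
# now dominates, so the result is still above the threshold.
with_larger_factor = effective_maxlag(8.0, 600.0, 120.0)

print(with_current_factor > THRESHOLD, with_larger_factor > THRESHOLD)
```

Under these assumptions, reducing the weight of dispatch lag only helps while DB lag itself stays below the threshold, which was not the case in either incident.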

Addshore renamed this task from high dispatch lag in Wikidata to high dispatch lag in Wikidata (27 May 2020). (May 28 2020, 8:48 AM)