Description
Related Objects
Event Timeline
The lag:
Replication lag on a database server at that time:
The root cause is T238199: SpecialFewestRevisions::reallyDoQuery takes more than 9h to run. The special page was removed in T238199#6169152, which fixed this case of dispatch lag.
This was very similar to T252952: Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag. In that incident, DB lag caused the dispatch process to wait for replication, which raised dispatch lag and, in turn, maxlag.
If a DB server becomes lagged, we are likely to see the same pattern again; however, the cause of the lag will no longer be the special page, which was the cause in this case.
One of the follow-ups for the DBAs from those tickets is T253120: Create prometheus alert to detect lag spikes, which should help get these tackled faster.
Other than that, we could alter how much dispatch lag is taken into account in maxlag. However, in both of these cases the DB lag was over 5 anyway, so maxlag would still have been over 5 for the majority of the time period.
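To illustrate why changing the dispatch lag weighting alone would not have helped here, this is a minimal sketch. It assumes the value reported as maxlag is the larger of the DB replication lag and the dispatch lag divided by a configurable factor. The function name, the factor values, the use of seconds as the unit, and the numbers are all hypothetical and are not measurements from this incident.

```python
def effective_maxlag(db_lag: float, dispatch_lag: float, dispatch_factor: float) -> float:
    """Return the lag value that would be compared against a client's maxlag.

    Assumption: dispatch lag is scaled down by dispatch_factor and then
    combined with DB replication lag by taking the maximum of the two.
    A non-positive factor means dispatch lag is ignored entirely.
    """
    if dispatch_factor <= 0:
        return db_lag
    return max(db_lag, dispatch_lag / dispatch_factor)


# Hypothetical numbers: DB lag already above a threshold of 5.
db_lag = 8.0
dispatch_lag = 600.0

for factor in (60, 120, 600):
    # However much dispatch lag is discounted, the result never
    # drops below the DB lag itself.
    print(factor, effective_maxlag(db_lag, dispatch_lag, factor))
```

Under these assumptions, raising the factor only reduces the dispatch contribution. Once DB lag itself is over 5, the combined value stays over 5 regardless.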