Page MenuHomePhabricator

Reoccuring dispatch problems (May 2018)
Closed, ResolvedPublic

Description

Recently we often get dispatch lag alerts for a few minutes each (takes 5 minutes until they recover, usually).

Currently we start one dispatcher instance, which runs for 10 minutes, every 3 minutes. This means we always have 4 dispatchers running once a new dispatcher starts (for a minute), after a minute one dispatchers stops, thus we have 3 dispatchers running then until the next one starts (for two minutes).

I suggest changing this so that we always have 4 dispatchers running in parallel. Very closely related to T193349: Ongoing dispatch problems (April 2018).

Event Timeline

Change 432778 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Wikidata: Always have 4 change dispatchers running

https://gerrit.wikimedia.org/r/432778

Change 435648 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Wikidata: Always have 4 change dispatchers running

https://gerrit.wikimedia.org/r/435648

@hoo what is the current status of this? Can we move forward with having all 4 dispatchers running?

I'll schedule the dispatching change for some time either this evening or tomorrow.

Change 435648 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikidata: Always have 4 change dispatchers running

https://gerrit.wikimedia.org/r/435648

I just deployed the 4 dispatches at a time patch (forgot to tag the ticket).

Let's review over the next hours.

After about 20 minutes there is no real change in lag.
I'll leave the patch deployed for the rest of the day and see.

A bit more investigation:

Sleep time due to no clients

I added a new graph to grafana based on pre existing data https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch-script

image.png (299×1 px, 97 KB)

This shows us the full amount of time the script spends doing stuff.

Failure to select clients seems to be the thing slowing the dispatching down, in 1 hour of running 4 dispatches (4 hours of dispatching time) 1 of those 4 hours is spent sleeping because the script failed to find a client.

If I compare this to a 1 hour slot 5 hours ago while we were still running 3 dispatchers

image.png (265×1 px, 83 KB)

In 1 hours of real time running 3 dispatchers (3 hours of dispatching time) over 1 hour of that time was spent sleeping because the script failed to find a client.

I'm going to revert the switch from 3 to 4 dispatchers and

This partly comes down to the 10 second default sleep for this, perhaps this should be lowered.
Also the dispatch interval, of 25 seconds, could perhaps be lowered, the intention of the interval is to allow for change coalescing during dispatching.

Lag never really decreased to 0

The average lag will always have some high value if we never actually dispatch all changes we have to a client.
Looking at pswiki as an example over the last 3 hours:

06:34:29 Posted 7 changes to pswiki, up to ID 687490812, timestamp 20180606063322. Lag is 67 seconds. Next ID is 687491496.
06:58:38 Posted 3 changes to pswiki, up to ID 687501565, timestamp 20180606065436. Lag is 242 seconds. Next ID is 687503429.
07:22:13 Posted 3 changes to pswiki, up to ID 687512677, timestamp 20180606072047. Lag is 86 seconds. Next ID is 687513358.
07:32:45 Posted 1 changes to pswiki, up to ID 687517674, timestamp 20180606073158. Lag is 47 seconds. Next ID is 687518097.
07:42:08 Posted 3 changes to pswiki, up to ID 687521826, timestamp 20180606074102. Lag is 66 seconds. Next ID is 687522469.
07:51:58 Posted 6 changes to pswiki, up to ID 687527149, timestamp 20180606075117. Lag is 41 seconds. Next ID is 687527523.
08:04:48 Posted 2 changes to pswiki, up to ID 687533358, timestamp 20180606080318. Lag is 90 seconds. Next ID is 687534191.
08:18:53 Posted 3 changes to pswiki, up to ID 687540113, timestamp 20180606081539. Lag is 194 seconds. Next ID is 687541770.
08:34:11 Posted 4 changes to pswiki, up to ID 687548966, timestamp 20180606083240. Lag is 91 seconds. Next ID is 687549634.
08:48:56 Posted 3 changes to pswiki, up to ID 687554842, timestamp 20180606084343. Lag is 313 seconds. Next ID is 687557642.
09:00:55 Posted 3 changes to pswiki, up to ID 687564172, timestamp 20180606085953. Lag is 62 seconds. Next ID is 687564595.
09:15:51 Posted 6 changes to pswiki, up to ID 687572057, timestamp 20180606091341. Lag is 130 seconds. Next ID is 687573257.
09:28:12 Posted 7 changes to pswiki, up to ID 687578481, timestamp 20180606092435. Lag is 217 seconds. Next ID is 687580509.
09:37:04 Posted 6 changes to pswiki, up to ID 687584827, timestamp 20180606093646. Lag is 18 seconds. Next ID is 687584957.
09:45:25 Posted 3 changes to pswiki, up to ID 687588950, timestamp 20180606094344. Lag is 101 seconds. Next ID is 687589974.

When the dispatcher finally got around to dispatching to the client again at each of the below times the dispatch lag was:

06:58:38 - 24 mins + 67 seconds =~ 25 mins
07:22:13 - 24 mins + 242 seconds =~ 28 mins
07:32:45 - 10 mins + 86 seconds =~ 11 mins
07:42:08 - 10 mins + 47 seconds =~ 11 mins
07:51:58 - 9 mins + 66 seconds =~ 10 mins
08:04:48 - 13 mins + 41 seconds =~14 mins

So pswiki for the hour had the average lag of ~15 mins.
The baseline lag can be seen as roughly 1-2 mins.

On sites like enwiki the base lag that never gets dispatched past is more like ~250-600 seconds (even though 420 changes are dispatched during each pass.

Change 437785 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Make default dispatchIdleDelay configurable

https://gerrit.wikimedia.org/r/437785

So the monitored lag has now dropped to below the 5 minute alarm, let's see if it stays that way.

Change 437785 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make default dispatchIdleDelay configurable

https://gerrit.wikimedia.org/r/437785

Change 439528 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/puppet@production] Switch from 5 mins to 10 mins for wikidata dispatch check

https://gerrit.wikimedia.org/r/439528

Change 439528 merged by Dzahn:
[operations/puppet@production] Switch from 5 mins to 10 mins for wikidata dispatch check

https://gerrit.wikimedia.org/r/439528

Vvjjkkii renamed this task from Reoccuring dispatch problems (May 2018) to p1caaaaaaa.Jul 1 2018, 1:10 AM
Vvjjkkii removed Addshore as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: Addshore; removed: gerritbot, Aklapper.
CommunityTechBot renamed this task from p1caaaaaaa to Reoccuring dispatch problems (May 2018).Jul 2 2018, 4:14 PM
CommunityTechBot assigned this task to Addshore.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot edited subscribers, added: gerritbot, Aklapper; removed: Addshore.

It's no longer may, and we havn't had issues recently