
Job Queue growing and then running a lot of jobs at once on commonswiki
Closed, ResolvedPublic

Description

[Title was: db1040 high load and queries in statistics]

Since yesterday (April 10th, according to the Icinga alert history, including soft warnings) the MySQL processlist on db1040 (the s4 master) has shown spikes of queries in the "statistics" state, and at the same time there are very high spikes in load average (probably due to I/O waits).

Event Timeline

Volans created this task. Apr 11 2016, 9:00 AM
Restricted Application added a subscriber: Aklapper. Apr 11 2016, 9:00 AM
Volans triaged this task as High priority. Apr 11 2016, 9:01 AM
Volans added a subscriber: Joe. Apr 11 2016, 11:05 AM

The spikes in load and in queries in the "statistics" state occur because the server is I/O bound at that time, due to spikes in reads and writes (updates); during the peaks the InnoDB main thread is busy flushing buffer pool pages.

The RAID is healthy, and I've checked the InnoDB variables and status: all looks normal. The spikes can be seen on the slaves too.

Looking outside the MySQL cluster, there is a clear correlation with the Job Queue size (thanks @Joe): the queue size grows steadily and then drops suddenly, just before the spikes on the databases (see the attached graphs: https://grafana.wikimedia.org/dashboard/db/job-queue-health for the job queue size, https://tendril.wikimedia.org/host/view/db1040.eqiad.wmnet/3306 for the master, and the links there for the slaves).
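The correlation above was read off the Grafana and Tendril dashboards by eye. For what it's worth, a lagged-correlation check of this kind can be sketched in a few lines of Python; the series below are synthetic stand-ins for the real queue-size and QPS graphs, not actual production data:

```python
# Sketch: check whether sudden drops in job-queue size line up with QPS spikes.
# Both series here are synthetic; in practice they would come from Graphite.

def deltas(series):
    """Minute-over-minute change of a time series."""
    return [b - a for a, b in zip(series, series[1:])]

def lagged_corr(xs, ys, lag):
    """Pearson correlation of xs[t] against ys[t + lag]."""
    xs, ys = xs[:len(xs) - lag] if lag else xs, ys[lag:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

# Synthetic shapes: the queue grows for 8 minutes, then drains; QPS spikes as it drains.
queue = [10, 20, 30, 40, 50, 60, 70, 80, 40, 10, 10, 10]
qps   = [800] * 8 + [5000, 10000, 9000, 800]

# A queue *drop* (negative delta) should correlate with a QPS *rise*.
r = lagged_corr([-d for d in deltas(queue)], deltas(qps), lag=0)
print(round(r, 2))
```

A positive `r` here just confirms numerically what the dashboards already show visually: queue drains coincide with database load spikes.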

So far it looks like something is enqueuing a lot of jobs that ultimately modify a lot of pages on commonswiki.

Volans renamed this task from "db1040 high load and queries in statistics" to "Job Queue growing and then running a lot of jobs at once on commonswiki". Apr 11 2016, 3:51 PM
Volans updated the task description.
Restricted Application added a subscriber: Steinsplitter. Apr 11 2016, 3:51 PM

It happened again: we saw the Job Queue size grow steadily for a few hours and then shrink quickly, while at the same time the queries per second (QPS) on the s4 master went from ~800 QPS (its normal load) to ~10k QPS (about the most it can do). The server was completely I/O bound and had a load spike, as expected.

Some grepping of runJobs.log on fluorine didn't show me anything very useful: pretty much only Special:Badtitle/EnqueueJob jobs enqueuing jobs of type wikibase-addUsagesForPage for all wikis, including commonswiki.
The actual number of jobs enqueued for commonswiki looked pretty small compared to the total, so I'm not sure how to read this.

What looks to be happening, though, is that at some point (when a condition is met, or some lock expires?) the queue size starts shrinking, and for ~5-10 minutes jobs run queries against the s4 master without any throttling.
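The real job runner is MediaWiki PHP code and its throttling logic is not shown in this task; purely as an illustration of what "without any throttling" means, here is a minimal token-bucket rate limiter sketched in Python (all names and numbers are hypothetical):

```python
import time

class TokenBucket:
    """Simple token bucket: allows at most `rate` actions per second on
    average, with bursts of up to `capacity` actions."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# With a limiter like this, a backlog of 10k jobs drains at ~800/s instead of
# hitting the master all at once.
bucket = TokenBucket(rate=800, capacity=100)
ran = sum(1 for _ in range(10_000) if bucket.allow())
print(ran)  # only the initial burst (~capacity) runs immediately; the rest must wait
```

The point of the sketch: with a rate limit in front of the runner, the queue would drain over minutes at a bounded QPS rather than slamming the master for 5-10 minutes.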

As per IRC discussion, adding @aaron, @ori, @EBernhardson.

Volans removed Volans as the assignee of this task. Apr 11 2016, 4:16 PM
Volans added a project: Performance-Team.

Poked around but I'm not finding any red flags. I think it's a symptom rather than a cause, but slave-lag-limit looks to be returned for most attempts to run jobs against commonswiki. I can't find any reason why we suddenly get 10 solid minutes of jobs running and then a stall, though...
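The comment above notes that most job-run attempts were being refused with slave-lag-limit. A toy simulation (illustrative only, not MediaWiki code) shows how gating job runs purely on replication lag, with no rate limit on the drain, produces exactly this stall-then-burst shape:

```python
# Sketch: a runner that refuses all work while replica lag exceeds a threshold
# accumulates a backlog, then releases it all at once when the lag clears.

MAX_LAG = 5  # seconds of tolerated replica lag (hypothetical threshold)

def run_cycle(lag_by_minute, incoming_per_minute):
    """Return jobs run per minute, given a replica-lag time series."""
    backlog, ran = 0, []
    for lag in lag_by_minute:
        backlog += incoming_per_minute
        if lag > MAX_LAG:
            ran.append(0)        # runner bails out ("slave-lag-limit")
        else:
            ran.append(backlog)  # no throttle: the whole backlog drains at once
            backlog = 0
    return ran

# Lag stays high for 9 minutes, then recovers for one minute.
print(run_cycle([10] * 9 + [1], incoming_per_minute=5000))
# -> nine zeros, then a single 50000-job burst
```

The burst itself then drives lag and I/O back up, which matches the observed oscillation: long quiet stretches followed by a few minutes of unthrottled load on the master.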

EBernhardson added a comment (edited). Apr 11 2016, 5:08 PM

FWIW, when we have a spike on the DB, the number of refreshLinksDynamic jobs run per minute on commonswiki spikes from ~5k to ~50k:

16:23   4931
16:24   5323
16:25   5165
16:26   5137
16:27   8731
16:28   22405
16:29   21187
16:30   30576
16:31   42397
16:32   51236
16:33   52661
16:34   54531
16:35   58520
16:36   48463
16:37   22146
16:38   5219
16:39   4265
16:40   4652
16:41   3295
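Per-minute counts like the ones above can be produced by bucketing runJobs.log lines by minute. A sketch in Python, assuming a hypothetical log format (date, time, wiki, job type per line) since the exact format used on fluorine isn't shown in this task:

```python
from collections import Counter

# Hypothetical runJobs.log excerpt; the real log format may differ.
sample = """\
2016-04-11 16:31:02 commonswiki refreshLinksDynamic File:A.jpg good
2016-04-11 16:31:59 commonswiki refreshLinksDynamic File:B.jpg good
2016-04-11 16:32:10 commonswiki refreshLinksDynamic File:C.jpg good
2016-04-11 16:32:11 enwiki refreshLinks Page_D good
"""

def per_minute(log_text, wiki, job_type):
    """Count matching job lines per HH:MM bucket."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] == wiki and parts[3] == job_type:
            counts[parts[1][:5]] += 1  # "HH:MM" prefix of the timestamp
    return dict(counts)

print(per_minute(sample, "commonswiki", "refreshLinksDynamic"))
# -> {'16:31': 2, '16:32': 1}
```

The same bucketing, run over the real log, yields the 16:23-16:41 table above.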

FWIW, I don't see spikes on any other shard, only on s4 (commonswiki), from the databases' point of view.

Change 282736 had a related patch set uploaded (by Ori.livneh):
Make "refreshLinksDynamic" low-priority

https://gerrit.wikimedia.org/r/282736

Change 282736 merged by Ori.livneh:
Make "refreshLinksDynamic" low-priority

https://gerrit.wikimedia.org/r/282736

ori moved this task from Inbox to Doing on the Performance-Team board. Apr 11 2016, 6:31 PM

Change 282762 had a related patch set uploaded (by Aaron Schulz):
Increase triggerOpportunisticLinksUpdate() backoff TTL

https://gerrit.wikimedia.org/r/282762

Change 282817 had a related patch set uploaded (by Aaron Schulz):
Flag triggerOpportunisticLinksUpdate() behind $wgMiserMode

https://gerrit.wikimedia.org/r/282817

Change 282820 had a related patch set uploaded (by Ori.livneh):
Increase triggerOpportunisticLinksUpdate() backoff TTL

https://gerrit.wikimedia.org/r/282820

Change 282820 merged by Ori.livneh:
Increase triggerOpportunisticLinksUpdate() backoff TTL

https://gerrit.wikimedia.org/r/282820

Change 282762 merged by jenkins-bot:
Increase triggerOpportunisticLinksUpdate() backoff TTL

https://gerrit.wikimedia.org/r/282762

ori closed this task as Resolved. Apr 11 2016, 10:40 PM
ori assigned this task to aaron.
  • b543b9bf11afd4 should reduce the rate at which refreshLinksDynamic jobs are enqueued.
  • At 22:00 UTC, @aaron deleted all refreshLinksDynamic jobs already in the queue.
  • If the load generated by refreshLinksDynamic jobs continues to be high, we can merge Ieeaf047e7, which puts refreshLinksDynamic jobs behind $wgMiserMode (effectively disabling it for WMF prod).

According to Aaron, the consequence of discarding (or disabling) refreshLinksDynamic jobs is that "some occasional link table changes due to CURRENTTIME magic might not happen"; this was the norm until about a year ago, when 132f7bb89f3 was merged. So, not catastrophic.
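The merged mitigations are PHP changes to MediaWiki core; their combined effect can be sketched in Python as a miser-mode kill switch plus a per-page backoff key, so the opportunistic update fires at most once per TTL per page (all names and the TTL value here are illustrative, not the actual implementation):

```python
import time

MISER_MODE = False   # $wgMiserMode-style switch: when True, skip updates entirely
BACKOFF_TTL = 3600   # raising this TTL was the first mitigation (value illustrative)

_cache = {}          # stands in for memcached: key -> expiry timestamp

def maybe_enqueue_links_update(page, now=None):
    """Return True if a refreshLinksDynamic-style job should be enqueued for `page`."""
    if MISER_MODE:
        return False
    now = now if now is not None else time.time()
    key = f"opportunistic-refreshlinks:{page}"
    if _cache.get(key, 0) > now:
        return False             # still inside the backoff window
    _cache[key] = now + BACKOFF_TTL
    return True

print(maybe_enqueue_links_update("File:Example.jpg", now=0))     # first view: enqueue
print(maybe_enqueue_links_update("File:Example.jpg", now=100))   # suppressed by TTL
print(maybe_enqueue_links_update("File:Example.jpg", now=4000))  # TTL expired: enqueue
```

A longer TTL widens the suppression window (fewer jobs per page per hour), while the miser-mode flag short-circuits the whole mechanism, which is why the second patch effectively disables it for WMF production.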

Change 282817 merged by Ori.livneh:
Flag triggerOpportunisticLinksUpdate() behind $wgMiserMode

https://gerrit.wikimedia.org/r/282817

Change 282836 had a related patch set uploaded (by Ori.livneh):
Flag triggerOpportunisticLinksUpdate() behind $wgMiserMode

https://gerrit.wikimedia.org/r/282836

Change 282836 merged by Ori.livneh:
Flag triggerOpportunisticLinksUpdate() behind $wgMiserMode

https://gerrit.wikimedia.org/r/282836

Mentioned in SAL [2016-04-12T00:37:56Z] <ori> Deleted refreshLinksDynamic jobs for commonswiki and enwiki (T132318)