
Investigate backlog in RecordLintJob
Closed, Resolved (Public)

Description

A script is currently reprocessing pages for the linter (T161556). It creates a backlog of RecordLintJob jobs, so lint fixes might not be visible right away. The backlog can be seen at https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?panelId=5&fullscreen&orgId=1

The reason for the backlog is not yet completely clear. The concurrency we've set should be enough to process the load without any backlog, but we are hitting a bottleneck that has not been identified yet.
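
As a rough way to reason about whether a given concurrency "should be enough", the required number of concurrent workers can be estimated from the arrival rate and the mean wall-clock time per job (Little's law). The sketch below is illustrative only: the function names and all numbers are hypothetical and do not come from this task or from the cpjobqueue configuration.

```typescript
// Hypothetical model of a job queue's load; none of these figures come from the task.
interface QueueLoad {
  arrivalsPerSecond: number; // jobs enqueued per second
  secondsPerJob: number;     // mean wall-clock time per job, including any per-request overhead
}

// Minimum concurrency needed to keep up with the arrival rate
// (Little's law: jobs in flight = arrival rate * time per job).
function requiredConcurrency(load: QueueLoad): number {
  return Math.ceil(load.arrivalsPerSecond * load.secondsPerJob);
}

// Net backlog growth in jobs per second at a given concurrency; 0 when the queue keeps up.
function backlogGrowthPerSecond(load: QueueLoad, concurrency: number): number {
  const capacityPerSecond = concurrency / load.secondsPerJob;
  return Math.max(0, load.arrivalsPerSecond - capacityPerSecond);
}
```

If the measured time per job is higher than the pure execution time (for example because of overhead around each request), the effective capacity drops and a backlog can grow even though the configured concurrency looks sufficient on paper.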

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-11-21T10:10:50Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@e35aa05]: Set consumer_batch_size to 10 T181007

Mentioned in SAL (#wikimedia-operations) [2017-11-21T10:11:21Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@e35aa05]: Set consumer_batch_size to 10 T181007 (duration: 00m 31s)

Jdforrester-WMF renamed this task from Investigate backlog in RecorLintJob to Investigate backlog in RecordLintJob. Nov 21 2017, 4:35 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Change 392803 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Report pure job execution timing via statsd.

https://gerrit.wikimedia.org/r/392803

Change 392803 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Report pure job execution timing via statsd.

https://gerrit.wikimedia.org/r/392803

Change 393548 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Make http agent use keepAlive

https://gerrit.wikimedia.org/r/393548

Change 393548 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Make http agent use keepAlive

https://gerrit.wikimedia.org/r/393548
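
For context, the sketch below shows what enabling keep-alive on an HTTP agent generally looks like in Node.js. It is not the contents of change 393548: the helper name `makeKeepAliveAgent` and the `maxSockets` value are hypothetical, and the actual setting lives in the jobqueue-deploy configuration, which is not reproduced here.

```typescript
import * as http from "http";

// Sketch only: builds a Node.js HTTP agent that reuses connections.
// `maxSockets` is a hypothetical parameter chosen for illustration.
function makeKeepAliveAgent(maxSockets: number): http.Agent {
  return new http.Agent({
    keepAlive: true, // keep idle sockets open and reuse them instead of opening a new TCP connection per request
    maxSockets,      // upper bound on concurrent sockets per host
  });
}

// Usage (not executed here): pass the agent to each outgoing request so connections are pooled.
// const agent = makeKeepAliveAgent(50);
// http.request({ host: "example.org", path: "/", agent }, (res) => res.resume()).end();
```

Without keep-alive, each request pays the cost of setting up a new connection, which adds to the wall-clock time per job.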

Mentioned in SAL (#wikimedia-operations) [2017-11-27T12:08:13Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@47d27dc]: Enable keep-alive T181007

Mentioned in SAL (#wikimedia-operations) [2017-11-27T12:09:17Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@47d27dc]: Enable keep-alive T181007 (duration: 01m 03s)

Enabling keep-alive was a complete success. We ran a test with an artificial backlog