Job queue backlog time increases for the `wikibase-InjectRCRecords` and `refreshLinks` jobs
Closed, Declined (Public)

Description

Since 23 April 2022, we have been seeing irregular increases in the job queue backlog times for the `wikibase-InjectRCRecords` and `refreshLinks` jobs.

See the following Grafana boards for the respective backlog times:

wikibase-InjectRCRecords: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=wikibase-InjectRCRecords
refreshLinks: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=refreshLinks&var-dc=eqiad%20prometheus%2Fk8s&from=now-7d&to=now

This is also reflected in the age of changes in the wb_changes table:
https://grafana.wikimedia.org/d/hGFN2TH7z/edit-dispatching-via-jobs?orgId=1&from=1650567902220&to=1651135174948

Event Timeline

ItamarWMDE renamed this task from Job queue backlog time increses for the `wikibase-InjectRCRecords` and `refreshLinks` jobs to Job queue backlog time increases for the `wikibase-InjectRCRecords` and `refreshLinks` jobs. (Apr 28 2022, 8:45 AM)
ItamarWMDE updated the task description.

constraintsRunCheck is also backlogged by 12 hours at the moment (link with timestamps). I suspect this is a general job queue problem that’s just less visible for lower-frequency or lower-runtime jobs.

Task Review / Prio Notes:

  • This task will be closed in favor of a task to investigate and define a playbook of how to handle such spikes in data access and queued jobs [TASK TBC @Michael]

I created T327641: [TECH][WIKIDATA] Create an incident playbook/flow chart for what to do when Wikidata ChangeDispatching is lagging. It can probably be improved, but I felt done is better than perfect in this case.