
Investigate JobQueue outage from 2020-01-04 22:00 UTC
Closed, Declined (Public)

Description

As of about 20 minutes ago, it seems several job types are no longer being processed.

I observed this mainly through the Watchlist feature on Wikipedia no longer working correctly. Specifically, marking changes as seen no longer works: the "last seen marker" keeps flip-flopping between the first and second read item, never progressing or persisting. This is a side effect of the ActivityUpdateJob not executing. The marker has a single-state object cache that remembers at least one item, but that value is lost and overwritten quite quickly.
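To illustrate the symptom described above, here is a toy model in Python. It is only a sketch under assumptions: it is not MediaWiki code, every name in it (`ToyWatchlistMarker`, `mark_seen`, `run_jobs`, `simulate`) is hypothetical, and it assumes the durable marker is only advanced by a queued job while reads prefer a single-slot cache that can be evicted at any time.

```python
class ToyWatchlistMarker:
    """Toy model only; not MediaWiki code. All names are hypothetical."""

    def __init__(self, persisted_seen):
        self.persisted_seen = persisted_seen  # durable value, advanced only by the job
        self.cached_seen = None               # single-slot cache, easily lost
        self.pending_jobs = []                # updates waiting for a job runner

    def mark_seen(self, item):
        # The web request only writes the cache and enqueues a job.
        self.cached_seen = item
        self.pending_jobs.append(item)

    def evict_cache(self):
        # Stand-in for the cached value being lost or overwritten.
        self.cached_seen = None

    def run_jobs(self):
        # Stand-in for the job runner persisting queued updates.
        for item in self.pending_jobs:
            self.persisted_seen = max(self.persisted_seen, item)
        self.pending_jobs.clear()

    def last_seen(self):
        # Reads prefer the cache and fall back to the durable value.
        if self.cached_seen is not None:
            return self.cached_seen
        return self.persisted_seen


def simulate(job_runner_up):
    """Mark items 2 and 3 as seen, reading the marker before and after eviction."""
    marker = ToyWatchlistMarker(persisted_seen=1)
    observed = []
    for item in (2, 3):
        marker.mark_seen(item)
        observed.append(marker.last_seen())
        if job_runner_up:
            marker.run_jobs()
        marker.evict_cache()
        observed.append(marker.last_seen())
    return observed


print(simulate(job_runner_up=False))  # marker keeps falling back to the stale value
print(simulate(job_runner_up=True))   # marker progresses and persists
```

In this model, with jobs not running, the marker alternates between the freshly cached item and the stale persisted one, which is roughly the flip-flopping reported here. With jobs running, each update survives cache eviction.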

https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus

Screenshot 2020-01-04 at 22.21.41.png (1×2 px, 128 KB)

Event Timeline

It appears to have recovered now. From the Grafana dashboard, it's unclear to me what led to this issue. I don't see, for example, an influx of many or slow jobs that would explain a backlog in processing.

I do see that the 20-25 min outage affected all the low-traffic jobs, which supports the theory that it wasn't caused by something in that job group blocking the processing of other jobs in that group:

Screenshot 2020-01-04 at 22.53.26.png (824×1 px, 132 KB)

jijiki renamed this task from JobQueue stuck as of 2020-01-14 22:00 UTC to JobQueue stuck as of 2020-01-04 22:00 UTC. Jan 5 2020, 7:21 PM
Krinkle renamed this task from JobQueue stuck as of 2020-01-04 22:00 UTC to JobQueue was stuck for 25min at 2020-01-04 22:00 UTC. Jan 6 2020, 8:14 PM
Krinkle added a project: Wikimedia-Incident.
Krinkle renamed this task from JobQueue was stuck for 25min at 2020-01-04 22:00 UTC to Investigate JobQueue outage from 2020-01-04 22:00 UTC. Jan 6 2020, 8:19 PM
Eevans triaged this task as Medium priority. Jan 10 2020, 5:21 PM

Is there more to do in this task? Should this still be open? Will there be an incident report? If yes, who is planning to work on it?

It is pending incident documentation by Core Platform, in particular to have documented what caused it, what we learned from investigation/mitigation, and to file tasks for any follow-up/prevention. I'll move it to their Inbox as I now understand Clinic Duty Inbox is not used.

Krinkle moved this task from EventBus infra to Meta on the WMF-JobQueue board.
Krinkle moved this task from Meta to EventBus infra on the WMF-JobQueue board.

Moving onto the board for incident docs (Wikimedia-Incident is only for the incident itself and/or action items to improve stuff). Note that a document for this one has not been created yet.