
ocg alarm ocg_job_status_queue 'flapping'
Closed, Invalid · Public


I've seen OCG warnings for ocg_job_status_queue coming up lately; it looks like the queue is constantly close to its warning threshold (per Ganglia).

Event Timeline

fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added a subscriber: fgiunchedi.
Restricted Application added a subscriber: Aklapper. Apr 29 2015, 1:02 PM
BBlack assigned this task to cscott. Apr 29 2015, 2:53 PM
BBlack added subscribers: cscott, BBlack.

@cscott: can you advise on whether there's a real issue here and/or the thresholds need changing?

I'll take a look, thanks for bringing it to my attention.

fgiunchedi triaged this task as Medium priority. Jul 22 2015, 11:17 AM

I no longer see the flapping alerts, and the queues are much lower. Out of curiosity, what changed?

Restricted Application added a subscriber: Matanya. Jul 22 2015, 11:17 AM

Aaand it jumped again and the warnings are back. @cscott, are these actionable, or should we be alerting on something else instead, like failed jobs?

> the warnings are back. @cscott, are these actionable, or should we be alerting on something else instead, like failed jobs?

^ @cscott: Do you know?

One year later: Still happening? Or obsolete / declined?

The threshold is pretty arbitrary; it just prompts us to have a look and see if anything is obviously wrong. We can bump the threshold higher if the warning seems to be triggering too often.

The status queue length is mostly proportional to the number of *cached* jobs in the system, although it would also increase if the number of *active* jobs increased -- the latter would be worrisome, but the former is what's flapping, as far as I can tell.

Just to remind people that these boxes are still flapping with this alert.

Joe added a subscriber: Joe. Sep 29 2016, 8:53 AM

They're not flapping; they are currently processing an enormous number of requests, which started suddenly today around 8 AM UTC.

The source of these requests should probably be investigated, but it's not seriously "we're on fire" alarming. The alert has actually warned us of an ongoing anomaly, so I'm pretty happy with it.

When I looked at syslog on ocg1001 earlier, the job queue health varied from a little over 3000 to a little over 1000. It stayed over 3000 for about 8 minutes. If the alert level were set at e.g. 3800, we would have seen no alerts at all. To me this indicates that we still may want to revisit the threshold setting; the problem is that I have no idea at what level the service would be impacted. We should be able to handle some spikes; when is it too much? No idea.
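
To make the threshold comparison above concrete, here is a rough sketch that reports the time spans during which a queue-length series stays above a given alert level. The sample values are made up to resemble the shape described (a climb from a little over 1000 to a little over 3000, staying above 3000 for about 8 minutes); they are not taken from the ocg1001 syslog, and `breach_windows` is a hypothetical helper, not part of OCG or Icinga.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, int]  # (minute offset, queue length)


def breach_windows(samples: Iterable[Sample], threshold: int) -> List[Tuple[int, int]]:
    """Return (first_minute, last_minute) spans where queue length > threshold."""
    windows: List[Tuple[int, int]] = []
    start = None
    last = None
    for minute, length in samples:
        if length > threshold:
            if start is None:
                start = minute  # a new breach begins here
            last = minute
        elif start is not None:
            windows.append((start, last))  # breach ended at the previous sample
            start = None
    if start is not None:
        windows.append((start, last))  # series ended while still in breach
    return windows


# Illustrative data only (assumption): not real measurements.
samples = [
    (0, 1100), (5, 1900), (10, 2800), (12, 3050), (14, 3200),
    (16, 3300), (18, 3150), (20, 3020), (22, 2700), (30, 1500),
]

print(breach_windows(samples, 3000))  # one span of about 8 minutes
print(breach_windows(samples, 3800))  # no spans at all
```

With these made-up numbers, a level of 3000 yields a single 8-minute breach and a level of 3800 yields none. That only shows how sensitive the alert count is to the level chosen; it says nothing about where the service actually starts to suffer.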

BTW Icinga still doesn't report in the channel that it's no longer 'critical', after the queue has dropped back down below 3000.

elukey added a subscriber: elukey. Oct 3 2016, 5:27 PM
greg added a subscriber: greg. Oct 3 2016, 7:01 PM
cscott added a comment. Oct 3 2016, 7:05 PM

Today there was a huge load spike as someone tried to pull all of enwiktionary through OCG. See T147211. It got up to 40K pending jobs in the load queue before I started clearing tasks.

Volans added a subscriber: Volans. Nov 30 2016, 9:13 AM

The alarm has been on again for 3 days in Icinga, and looking at the trend over the last 6 months, it seems that the alarm might need some re-tuning, if the trend is legitimate and not an indication of some issue.

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on

Aklapper removed cscott as the assignee of this task. Jun 19 2020, 4:25 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips how to manage individual work in Phabricator (noisy notifications, lists of task, etc.), see for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Joe closed this task as Invalid. Jul 1 2020, 9:17 AM