
ocg alarm ocg_job_status_queue 'flapping'
Closed, Invalid (Public)

Description

I've seen the ocg warning for ocg_job_status_queue coming up lately; the queue looks like it is constantly close to its warning threshold (see ganglia).

screenshot_t7nwz4.png (242×983 px, 29 KB)

Event Timeline

fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi subscribed.
BBlack added subscribers: cscott, BBlack.

@cscott: can you advise on whether there's a real issue here and/or the thresholds need changing?

I'll take a look, thanks for bringing it to my attention.

I no longer see the flapping alerts, and the queues are much lower. Out of curiosity, what changed?

And it jumped again and the warnings are back. @cscott, are these actionable, or should we be alerting on something else, like failed jobs?

the warnings are back. @cscott, are these actionable, or should we be alerting on something else, like failed jobs?

^ @cscott: Do you know?

One year later: Still happening? Or obsolete / declined?

The threshold is pretty arbitrary; it just prompts us to have a look and see if anything is obviously wrong. We can bump the threshold higher if the warning seems to be triggering too often.

The status queue length is mostly proportional to the number of *cached* jobs in the system, although it would also increase if the number of *active* jobs increased -- the latter would be worrisome, but the former is what's flapping, as far as I can tell.
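To illustrate the distinction above, here is a minimal sketch of a check that treats growth in active jobs as the worrying case and growth in cached jobs as benign. Everything in it is an assumption for illustration: the function name, the idea that cached and active counts are available separately, and both threshold values are hypothetical, not OCG's actual check.

```python
# Hypothetical sketch: classify a queue reading by what is driving its length.
# The names and thresholds are illustrative assumptions, not the real OCG check.

def classify_queue(cached_jobs: int, active_jobs: int,
                   total_warn: int = 3000, active_warn: int = 500) -> str:
    """Return a short verdict for one reading of the status queue."""
    total = cached_jobs + active_jobs
    if active_jobs > active_warn:
        # Many jobs actually in flight: this is the case worth investigating.
        return "worrisome: active jobs high"
    if total > total_warn:
        # Queue is long, but only because many finished jobs are still cached.
        return "benign: mostly cached jobs"
    return "ok"


print(classify_queue(cached_jobs=3100, active_jobs=40))
print(classify_queue(cached_jobs=1200, active_jobs=900))
```

With a split like this, a warning would only need to be loud in the second case.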

Just to remind people that these boxes are still flapping with this alert.

They're not flapping; they have been processing an enormous number of requests since around 8 AM UTC today.

The source of these requests should probably be investigated, but this is not a "we're on fire" situation. The alert has actually warned us of an ongoing anomaly, so I'm pretty happy with it.

When I looked at syslog on ocg1001 earlier, the job queue health varied from a little over 3000 to a little over 1000. The period of time over 3000 was about 8 minutes. If the alert level were set at e.g. 3800, we would have seen no alerts at all. To me this indicates that we may still want to revisit the threshold setting; the problem is that I have no idea at what level the service would be impacted. We should be able to handle some spikes; when is it too much? No idea.
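The reasoning above can be replayed against recorded queue lengths to see how many alerts a given threshold would have produced. The sketch below is only an illustration: the sample values are made up to resemble the description (roughly 8 one-minute readings a little over 3000), and `count_alerts` and its `sustain` parameter are hypothetical names, not part of Icinga or OCG.

```python
# Hypothetical sketch: count how many alerts a threshold would have fired
# over a series of queue-length samples. The sample data is invented.

from typing import Iterable


def count_alerts(samples: Iterable[int], threshold: int, sustain: int = 1) -> int:
    """Count alerts, firing once per run of at least `sustain`
    consecutive samples strictly above `threshold`."""
    alerts = 0
    run = 0  # length of the current run of samples above the threshold
    for value in samples:
        if value > threshold:
            run += 1
            if run == sustain:
                alerts += 1
        else:
            run = 0
    return alerts


# Made-up one-minute samples: about 8 minutes a little over 3000.
SAMPLES = [1100, 1500, 2400, 3050, 3100, 3150, 3080,
           3120, 3060, 3090, 3040, 2600, 1800, 1200]

print(count_alerts(SAMPLES, threshold=3000))              # one alert
print(count_alerts(SAMPLES, threshold=3800))              # no alerts
print(count_alerts(SAMPLES, threshold=3000, sustain=10))  # spike too short
```

Requiring a sustained breach is an alternative to raising the threshold: a short spike stays silent either way, but a long one at the current level would still be reported.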

BTW Icinga still doesn't report in the channel that it's no longer 'critical', after the queue has dropped back down below 3000.

Today there was a huge load spike as someone tried to pull all of enwiktionary through OCG. See T147211. It got up to 40K pending jobs in the load queue before I started clearing tasks.

The alarm has been on again in Icinga for 3 days, and looking at the trend over the last 6 months, the alarm might need some re-tuning if the trend is legitimate and not an indication of some issue.

Screen Shot 2016-11-30 at 10.09.02.png (496×2 px, 186 KB)

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)