ocg alarm ocg_job_status_queue 'flapping'
Open, NormalPublic

Description

I've seen the ocg warning for ocg_job_status_queue coming up lately; it looks like the queue is constantly close to its warning threshold (per ganglia).

fgiunchedi updated the task description.
fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi added a subscriber: fgiunchedi.
Restricted Application added a subscriber: Aklapper. Apr 29 2015, 1:02 PM
BBlack assigned this task to cscott. Apr 29 2015, 2:53 PM
BBlack added subscribers: cscott, BBlack.

@cscott: can you advise on whether there's a real issue here and/or the thresholds need changing?

I'll take a look, thanks for bringing it to my attention.

fgiunchedi triaged this task as Normal priority. Jul 22 2015, 11:17 AM

I no longer see the flapping alerts, and the queues are much lower. Out of curiosity, what changed?

Restricted Application added a subscriber: Matanya. Jul 22 2015, 11:17 AM

And it jumped again and the warnings are back. @cscott, are these actionable, or should we be alerting on something else, like failed jobs?

^ @cscott: Do you know?

One year later: Still happening? Or obsolete / declined?

The threshold is pretty arbitrary; it just prompts us to have a look and see if anything is obviously wrong. We can bump the threshold higher if it seems that the warning is triggering too often.

The status queue length is mostly proportional to the number of *cached* jobs in the system, although it would also increase if the number of *active* jobs increased -- the latter would be worrisome, but the former is what's flapping, as far as I can tell.

Just to remind people that these boxes are still flapping with this alert.

Joe added a subscriber: Joe. Sep 29 2016, 8:53 AM

They're not flapping; they are currently processing an enormous number of requests, which suddenly started today around 8 AM UTC.

The source of these requests should probably be investigated, but it's not seriously "we're on fire" alarming. The alert has actually warned us of an ongoing anomaly, so I'm pretty happy with it.

When I looked at syslog on ocg1001 earlier, the job queue health varied from a little over 1000 to a little over 3000. It stayed over 3000 for about 8 minutes. If the alert level were set at e.g. 3800, we would have seen no alerts at all. To me this indicates that we may still want to revisit the threshold setting; the problem is that I have no idea at what level the service would be impacted. We should be able to handle some spikes, but when is it too much? No idea.
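
To make the threshold comparison concrete, here is a minimal sketch that counts how often a series of queue-length samples would newly trip an alert at a given level. The sample values are made up to resemble the range described (roughly 1000 to a little over 3000) and are not taken from syslog; the function name `alert_transitions` is hypothetical and is not how the actual Icinga check is implemented.

```python
def alert_transitions(samples, threshold):
    """Count how many times the series rises from at-or-below the
    threshold to above it, i.e. how many separate alerts would fire."""
    transitions = 0
    above = False
    for value in samples:
        now_above = value > threshold
        if now_above and not above:
            transitions += 1
        above = now_above
    return transitions


# Hypothetical queue-length samples, NOT real data from ocg1001.
samples = [1100, 1800, 2600, 3050, 3100, 3080, 3020,
           2900, 3010, 2400, 1500, 1200]

for level in (3000, 3800):
    print(level, alert_transitions(samples, level))
```

With samples hovering right around 3000, the lower level fires more than once while 3800 stays silent, which is the flapping-versus-no-alert difference described above. It says nothing about which level actually corresponds to service impact.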

BTW Icinga still doesn't report in the channel that it's no longer 'critical', after the queue has dropped back down below 3000.

elukey added a subscriber: elukey. Oct 3 2016, 5:27 PM
greg added a subscriber: greg. Oct 3 2016, 7:01 PM
cscott added a comment. Oct 3 2016, 7:05 PM

Today there was a huge load spike as someone tried to pull all of enwiktionary through OCG. See T147211. It got up to 40K pending jobs in the load queue before I started clearing tasks.

Volans added a subscriber: Volans. Nov 30 2016, 9:13 AM

The alarm has been on again in Icinga for 3 days now. Looking at the trend over the last 6 months, it seems the alarm might need some re-tuning, if the trend is legitimate and not an indication of some issue.

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.