The threshold is pretty arbitrary; it's just a prompt for us to have a look and see if anything is obviously wrong. We can bump it higher if the warning seems to be triggering too often.
The status queue length is mostly proportional to the number of *cached* jobs in the system, although it would also increase if the number of *active* jobs increased -- the latter would be worrisome, but the former is what's flapping, as far as I can tell.
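For context, a check like this is typically just an Icinga/Nagios-style plugin that compares the current queue length against warn/crit levels and exits with the matching status code. A minimal sketch, assuming the status queue is readable as a Redis list; the key name, host, and threshold values here are hypothetical, not OCG's actual configuration:

```python
#!/usr/bin/env python
"""Minimal sketch of an Icinga/Nagios-style queue-length check.

Assumes the job status queue is a Redis list whose length can be read
directly; the key name, host, and thresholds are hypothetical.
"""
import sys

import redis

WARN = 3000    # "maybe have a look" level
CRIT = 10000   # placeholder; the real impact level is unknown


def main():
    r = redis.StrictRedis(host="localhost", port=6379)
    length = r.llen("ocg_job_status")  # hypothetical key name

    if length >= CRIT:
        print("CRITICAL: job status queue length %d >= %d" % (length, CRIT))
        sys.exit(2)
    if length >= WARN:
        print("WARNING: job status queue length %d >= %d" % (length, WARN))
        sys.exit(1)
    print("OK: job status queue length %d" % length)
    sys.exit(0)


if __name__ == "__main__":
    main()
```

The exit codes (0/1/2 for OK/WARNING/CRITICAL) are the standard plugin convention Icinga keys off, which is why tuning comes down to just the two threshold constants.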
They're not flapping; the system suddenly started processing an enormous number of requests today around 8 AM UTC.
The source of those requests should probably be investigated, but it's not a serious "we're on fire" situation. The alert has actually warned us of an ongoing anomaly, so I'm pretty happy with it.
When I looked at syslog on ocg1001 earlier, the job queue health value varied from a little over 3000 down to a little over 1000, and it stayed above 3000 for only about 8 minutes. If the alert level had been set at e.g. 3800, we would have seen no alerts at all. To me this indicates that we may still want to revisit the threshold setting; the problem is that I have no idea at what level there would be an impact on the service. We should be able to handle some spikes; when is it too much? No idea.
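One way to make the check less sensitive to short spikes, instead of (or in addition to) raising the level, would be to require the queue to stay above the threshold for a sustained period before alerting. A rough sketch of that idea; the 10-minute window and the sample-in-a-loop model are assumptions, not anything we currently run:

```python
"""Sketch of a duration-gated alert: only escalate when the queue has
stayed above the threshold for a full window, so a short spike (like
the ~8-minute one above) would not fire. Window length is an assumption.
"""
import time
from collections import deque

THRESHOLD = 3000
WINDOW_SECS = 600  # spike must persist for 10 minutes to alert

samples = deque()  # (timestamp, queue_length) pairs


def should_alert(length, now=None):
    """Record one sample; return True once the whole window is hot."""
    now = time.time() if now is None else now
    samples.append((now, length))
    # Discard samples older than the window.
    while samples and samples[0][0] < now - WINDOW_SECS:
        samples.popleft()
    # Require (nearly) full window coverage before judging, so the
    # first few samples after startup can't trigger an alert.
    covered = samples[-1][0] - samples[0][0] >= WINDOW_SECS * 0.9
    return covered and all(l > THRESHOLD for _, l in samples)
```

With that in place, the 8-minute excursion above 3000 would have stayed silent, while a genuinely stuck queue would still page after 10 minutes.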
BTW, Icinga still doesn't report in the channel that the check is no longer 'critical' after the queue has dropped back below 3000.
The alarm has been firing again on Icinga for the past 3 days, and looking at the trend over the last 6 months, it seems the alarm might need some re-tuning, assuming the trend is legitimate and not an indication of some underlying issue.
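If the trend is legitimate, one possible way to re-tune would be to derive the warn level from recent history, e.g. a high percentile of the last few months of samples, so we alert on genuine outliers rather than the new normal. A sketch of that idea; the percentile choice is purely a judgment call, and where the historical samples come from is left open:

```python
"""One way to re-derive the warn level from the 6-month trend: pick a
high percentile of historical queue-length samples (nearest-rank
method). The 99.0 default is a judgment call, not established practice.
"""


def suggest_threshold(history, pct=99.0):
    """Return the pct-th percentile (nearest rank) of queue lengths."""
    values = sorted(history)
    if not values:
        raise ValueError("no history to derive a threshold from")
    k = max(0, int(round(pct / 100.0 * len(values))) - 1)
    return values[k]


# e.g. suggest_threshold(six_months_of_samples) -> candidate warn level
```

That would at least anchor the threshold in observed behaviour instead of a round number, though it still dodges the real open question of at what level the service actually degrades.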