Before OCG was turned on for everyone, we had a 30k Icinga warning limit for the job status queue. Since entries expire from the queue after 5 days and we expect around 10k jobs/day, the queue should settle at roughly 50k entries, so we raised the limit to a more reasonable 100k.
But we should re-examine this once OCG goes live by default in production and see whether this limit makes sense. We also have:
output dir: warn 40GB, critical 50GB
postmortem dir: warn 1GB, critical 2GB
render jobs queue: warn 100, critical 500
temp size: warn 1GB, critical 5GB
We should examine these as well. (If changes are needed, see https://gerrit.wikimedia.org/r/162623 )
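
For the job status queue limit, the sizing argument is simple enough to write down as a quick check. This is only a sketch: the function names are made up for illustration, and the 10k jobs/day figure is the estimate from this task, not a measured rate.

```python
def steady_state_entries(jobs_per_day: int, expiry_days: int) -> int:
    """Entries in the queue once new arrivals and expiries balance out."""
    return jobs_per_day * expiry_days


def headroom(limit: int, expected: int) -> float:
    """How many times the expected steady-state size the warning limit allows."""
    return limit / expected


# Figures from this task: ~10k jobs/day (an estimate), 5 day expiry, 100k limit.
expected = steady_state_entries(10_000, 5)
print(expected, headroom(100_000, expected))  # prints "50000 2.0"
```

If the real job rate in production turns out to be well above or below 10k/day, plugging it in here shows how far the 100k limit is from the expected queue size.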
Finally, is the 5 day expiry reasonable? Should we instead (or also) have a "# entries" limit, and expire entries as needed until the status queue drops below NNN entries?
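
To make the "instead/also" option concrete, here is a minimal sketch of combining the age-based expiry with an entry cap. It is not how OCG does expiry today: the function name is hypothetical, a plain dict of job id to creation timestamp stands in for the real status queue, and dropping the oldest entries first is an assumption about what "expire things as needed" would mean.

```python
from typing import Dict


def expire_entries(
    queue: Dict[str, float],  # hypothetical stand-in: job id -> creation time (seconds)
    now: float,
    max_age_s: float,
    max_entries: int,
) -> Dict[str, float]:
    """Return the entries that survive age-based expiry plus an entry cap."""
    # Step 1: age-based expiry, as with the current 5 day rule.
    fresh = {job: ts for job, ts in queue.items() if now - ts <= max_age_s}
    # Step 2: if still over the cap, keep only the newest max_entries
    # (assumption: oldest entries are the ones to drop).
    if len(fresh) > max_entries:
        newest = sorted(fresh.items(), key=lambda kv: kv[1], reverse=True)
        fresh = dict(newest[:max_entries])
    return fresh


DAY = 86_400
queue = {"a": 0 * DAY, "b": 4 * DAY, "c": 5 * DAY, "d": 5.5 * DAY}
survivors = expire_entries(queue, now=6 * DAY, max_age_s=5 * DAY, max_entries=2)
print(sorted(survivors))  # prints "['c', 'd']"
```

With both rules in place the cap only kicks in when the job rate is higher than expected, so the 5 day expiry would still be the normal path.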