Re-examine icinga warning thresholds and job expiry.
Open, Needs Triage, Public

Description

Before OCG was turned on for everyone, we had a 30k icinga warning limit for the job status queue. Since entries expire from the queue after 5 days and we expect around 10k jobs/day, we raised the limit to a more reasonable 100k.
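The arithmetic behind the 100k figure can be sketched as follows. This is only a back-of-the-envelope estimate: it assumes jobs arrive at a steady rate and that every entry stays in the queue for the full expiry window. The function and variable names are made up for illustration.

```python
def steady_state_entries(jobs_per_day: int, expiry_days: int) -> int:
    """Estimate queue size once arrivals and expiries balance out.

    Assumes a constant arrival rate and that each entry lives for the
    whole expiry window (both are simplifications).
    """
    return jobs_per_day * expiry_days


# Figures taken from the task description.
JOBS_PER_DAY = 10_000
EXPIRY_DAYS = 5
WARN_LIMIT = 100_000

expected = steady_state_entries(JOBS_PER_DAY, EXPIRY_DAYS)
headroom = WARN_LIMIT / expected  # how many times the expected load fits under the limit

print(f"expected entries: {expected}, headroom factor: {headroom:.1f}")
```

Under these assumptions the queue settles at about half the warning limit, so the old 30k limit would have warned during normal operation.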

But we should re-examine this once OCG goes live by default in production and see whether this limit makes sense. We also have:

output dir warn 40GB, critical 50GB
postmortem dir warn 1G, critical 2G
render jobs queue warn 100, critical 500
temp size warn 1G, critical 5G

We should examine these as well. (If changes are needed, see https://gerrit.wikimedia.org/r/162623 )
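To make the limits listed above easier to reason about, here is a minimal sketch of how a warn/critical pair maps onto a check result. It is not the actual check from the Gerrit change; the dictionary keys and the `classify` helper are hypothetical, and treating a value equal to the threshold as a breach is an assumption. The return values follow the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2

# (warn, critical) pairs copied from the task; sizes are in the units
# given there (GB / G), counts are plain numbers. Key names are made up.
THRESHOLDS = {
    "output_dir_gb": (40, 50),
    "postmortem_dir_gb": (1, 2),
    "render_jobs_queue": (100, 500),
    "temp_size_gb": (1, 5),
}


def classify(value: float, warn: float, critical: float) -> int:
    """Return a Nagios-style status code for a measured value.

    Assumption: reaching a threshold exactly already counts as a breach.
    """
    if value >= critical:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# Example: a 45 GB output directory sits between warn and critical.
status = classify(45, *THRESHOLDS["output_dir_gb"])
print(status)
```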

Finally -- is the 5-day expiry reasonable? Should we instead/also have a "# entries" limit, and expire things as needed until the status queue drops below NNN entries?
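One way the "instead/also" option could look is sketched below: drop entries older than the age limit first, then trim the oldest survivors until the count fits under a cap. This is purely illustrative and says nothing about how the status queue is actually stored; the `(job_id, created_at)` tuple layout, the `expire` function, and its parameters are all hypothetical, and the cap is treated as "at most `max_entries`".

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (job_id, created_at in seconds); hypothetical layout


def expire(entries: List[Entry], now: float,
           max_age_seconds: float, max_entries: int) -> List[Entry]:
    """Apply an age limit, then a count limit, and return the survivors."""
    # Step 1: age-based expiry (the existing 5-day rule).
    fresh = [e for e in entries if now - e[1] <= max_age_seconds]

    # Step 2: count-based expiry, dropping the oldest entries first.
    fresh.sort(key=lambda e: e[1])
    excess = len(fresh) - max_entries
    if excess > 0:
        fresh = fresh[excess:]
    return fresh


# Tiny example: "a" is too old, then "b" is trimmed to respect the cap.
queue = [("a", 0), ("b", 100), ("c", 200), ("d", 300)]
kept = expire(queue, now=300, max_age_seconds=250, max_entries=2)
print(kept)
```

With both limits in place, the age rule bounds how stale an entry can get while the count rule bounds queue size during a burst of jobs.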


Version: unspecified
Severity: normal

Details

Reference
bz71239
bzimport raised the priority of this task from to Needs Triage.
bzimport set Reference to bz71239.
bzimport added a subscriber: Unknown Object (MLST).
cscott created this task. Sep 24 2014, 4:26 PM
Restricted Application added subscribers: StudiesWorld, Aklapper. · Dec 10 2015, 11:00 PM

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.