
Add monitoring of upload rate on commons to icinga alerts
Closed, ResolvedPublic

Description

Between 16:17 and 16:57 on March 6, uploads were broken due to a config error. While this is a pretty good response time, it would be ideal if automatic alarm bells rang when uploads break, instead of having to wait for users to complain (it took about half an hour until user complaints were sufficiently escalated).

Could we perhaps have Icinga check http://commons.wikimedia.org/w/api.php?list=logevents&letype=upload&action=query&lelimit=1&lestart=<timestamp 5 minutes ago> every 5 minutes, and raise an alarm if there is a 5-minute period with 0 uploads (or some other threshold)? Commons seems to average a new upload about once every 5 seconds, so a 5-minute window should get rid of any false positives.
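A minimal sketch of the proposed check, written as a standalone script rather than as an actual Icinga plugin from this task. It assumes the standard MediaWiki `logevents` parameters and the conventional Nagios/Icinga exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The `ledir=newer` parameter is an assumption added here so that `lestart` selects events after the cutoff rather than before it; the function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

API_URL = "https://commons.wikimedia.org/w/api.php"
OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios/Icinga plugin exit codes


def build_url(now, window_minutes=5):
    """Build a logevents query for uploads logged since `window_minutes` ago."""
    cutoff = now - timedelta(minutes=window_minutes)
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "upload",
        "lelimit": "1",
        "lestart": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "ledir": "newer",  # assumption: enumerate forward from the cutoff
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def evaluate(response):
    """Map a decoded API response to (exit_code, message)."""
    try:
        events = response["query"]["logevents"]
    except (KeyError, TypeError):
        return UNKNOWN, "UNKNOWN: unexpected API response"
    if events:
        return OK, "OK: at least one upload in the window"
    return CRITICAL, "CRITICAL: no uploads in the window"


def main():
    """Query the API once and return a plugin-style exit code."""
    request = urllib.request.Request(
        build_url(datetime.now(timezone.utc)),
        headers={"User-Agent": "upload-rate-check-sketch/0.1"},  # hypothetical
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            code, message = evaluate(json.load(resp))
    except (OSError, ValueError) as exc:
        code, message = UNKNOWN, f"UNKNOWN: {exc}"
    print(message)
    return code
```

To use it as a check, a wrapper would call `sys.exit(main())` every 5 minutes. If uploads arrived independently at the quoted average of one per 5 seconds, a 5-minute window would expect about 60 of them, so an empty window during normal operation would be extremely unlikely; real traffic is burstier than that, which is why the threshold is left open above.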

Event Timeline

Bawolff raised the priority of this task from to Needs Triage.
Bawolff updated the task description.
Bawolff added projects: Multimedia, acl*sre-team.
Bawolff subscribed.

Is the number of uploads already in Graphite somewhere? That would make creating the alarm very easy.

In theory, all hooks and API requests are logged to graphite; FileUpload and UploadComplete seem good hooks to count started / successful uploads, and the upload API should also be tracked (that misses Special:Upload uploads, but that's probably a small fraction).

In practice, though, graphite.wikimedia.org just shows empty dirs for all API/hook/class related keys. It might just be me looking in the wrong place; I'm not too familiar with Graphite.

It could be that it is missing. What would be the exact path to the metrics?

By the way, the missing metrics should be related to T85641.

fgiunchedi triaged this task as Medium priority.
fgiunchedi renamed this task from Add monitoring of upload rate on commons to icingia alerts to Add monitoring of upload rate on commons to icinga alerts. (Jun 16 2015, 1:18 AM)
fgiunchedi set Security to None.

Graphite was fixed a while ago but hooks / API requests don't seem to be available anymore (IIRC there was some sort of cleanup for little-used things); I did not find anything that could be useful (but then I am not familiar with what gets logged).

An attempt at logging certain types of API usage, which could be used as a template for setting up similar logging of file uploads:
https://gerrit.wikimedia.org/r/#/c/204209
https://gerrit.wikimedia.org/r/#/c/205864
https://gerrit.wikimedia.org/r/#/c/205869

Yup, I think hooks are not being pushed as stats anymore. xhprof might give some "proxy metric", but the reviews you linked look like the right way to do things.

In T92322#1419551, @Tgr wrote:

It did not; it killed statsd without sampling, and with sampling the upload counts are too low to be reliable. Looks like we will need to create custom metrics for this.

Change 240358 had a related patch set uploaded (by Filippo Giunchedi):
swift: aggregate and report container object/byte stats

https://gerrit.wikimedia.org/r/240358

Change 240358 merged by Filippo Giunchedi:
swift: aggregate and report container object/byte stats

https://gerrit.wikimedia.org/r/240358

As a proxy metric from Swift rather than MediaWiki, we can now use swift.eqiad-prod.containers.mw-media.originals.objects (.bytes is also available); the stats keep track of various container "classes" (e.g. deleted, render, temp, thumb, transcoded).
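One way this proxy metric could be turned into an upload-rate signal is to look at how much the originals object count grows over a recent window. The sketch below is an illustration only, not the check that was eventually merged: it assumes the standard Graphite render endpoint (`/render` with `format=json`) is reachable on graphite.wikimedia.org, and the function names and the 30-minute window are hypothetical.

```python
import json
import urllib.parse
import urllib.request

GRAPHITE_URL = "https://graphite.wikimedia.org/render"  # assumed endpoint
METRIC = "swift.eqiad-prod.containers.mw-media.originals.objects"


def render_url(metric=METRIC, window="-30min"):
    """Build a Graphite render URL returning the series as JSON."""
    params = {"target": metric, "from": window, "format": "json"}
    return GRAPHITE_URL + "?" + urllib.parse.urlencode(params)


def object_growth(series):
    """Return last minus first non-null value of one series, or None.

    `series` is one entry of Graphite's JSON output, whose "datapoints"
    list holds [value, timestamp] pairs; gaps are reported as null.
    """
    values = [v for v, _ts in series["datapoints"] if v is not None]
    if len(values) < 2:
        return None  # not enough data to say anything
    return values[-1] - values[0]


def fetch_growth(window="-30min"):
    """Fetch the metric and return its growth over the window, or None."""
    with urllib.request.urlopen(render_url(window=window), timeout=10) as resp:
        data = json.load(resp)
    return object_growth(data[0]) if data else None
```

A growth of zero (or `None`) over the window would be the condition to alert on. Since this is an object count rather than an event counter, anything that removes originals from the container would lower the apparent rate, so a threshold would need some tuning.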

Change 251526 had a related patch set uploaded (by Filippo Giunchedi):
swift: monitor mediawiki originals upload rate

https://gerrit.wikimedia.org/r/251526

Change 251526 merged by Filippo Giunchedi:
swift: monitor mediawiki originals upload rate

https://gerrit.wikimedia.org/r/251526