
Add statsd metric to WANObjectCache
Closed, Resolved (Public)

Description

It would be good to have hit/miss/set rate stats. They could also be grouped by the class that made the callback (might require passing __METHOD__). If callback time were included, we could identify things like "most expensive values that have a poor hit rate" or "keys that are rarely used but often set".
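
For illustration only (this is not the eventual patch), the kind of instrumentation described above might look roughly like this, assuming a statsd-style client with increment()/timing() methods such as MediaWiki's IBufferingStatsdDataFactory; the wrapper function and metric names are placeholders:

```php
<?php
// Rough sketch only, not the merged implementation. Assumes a statsd-style
// client ($stats) with increment()/timing(), and a BagOStuff-like cache.
function getWithStats( $cache, $stats, string $key, string $keyClass, int $ttl, callable $callback ) {
	$value = $cache->get( $key );
	if ( $value !== false ) {
		// Cache hit, grouped by the key class (e.g. derived from __METHOD__)
		$stats->increment( "wanobjectcache.$keyClass.hit" );
		return $value;
	}
	// Cache miss: time the callback so expensive values with poor hit rates stand out
	$start = microtime( true );
	$value = $callback();
	$stats->timing( "wanobjectcache.$keyClass.miss.compute", 1000 * ( microtime( true ) - $start ) );
	$cache->set( $key, $value, $ttl );
	$stats->increment( "wanobjectcache.$keyClass.set" );
	return $value;
}
```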

Also, when $ttl is changed in the callback, we could bucket it by its fraction of the nominal value (only when lower, for sanity) into [20%, 40%, 60%, 80%, 90%] and track the ratio of values falling into each bucket.
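
A rough sketch of that bucketing (the helper name and bucket labels are made up for illustration):

```php
<?php
// Hypothetical helper, for illustration only: map the callback-adjusted TTL
// to one of the buckets above, as a fraction of the nominal TTL.
function ttlBucket( int $nominalTtl, int $adjustedTtl ): ?string {
	if ( $nominalTtl <= 0 || $adjustedTtl >= $nominalTtl ) {
		return null; // only track reductions below the nominal TTL
	}
	$fraction = $adjustedTtl / $nominalTtl;
	foreach ( [ 20, 40, 60, 80, 90 ] as $percent ) {
		if ( $fraction <= $percent / 100 ) {
			return "le_$percent"; // e.g. "le_40" for 20% < fraction <= 40%
		}
	}
	return 'le_100'; // between 90% and 100% of nominal
}
```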

Event Timeline

Change 385122 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] WIP: add statsd metric support to WANObjectCache

https://gerrit.wikimedia.org/r/385122

aaron triaged this task as Medium priority. Oct 23 2017, 8:08 PM
aaron moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Change 385122 merged by jenkins-bot:
[mediawiki/core@master] Add statsd metric support to WANObjectCache

https://gerrit.wikimedia.org/r/385122

fgiunchedi added a subscriber: fgiunchedi.

Reopening since I noticed the statsd reporting from wanobjectcache is creating a very large number of distinct metrics, e.g.

/var/lib/carbon/whisper/MediaWiki/wanobjectcache/1522a7ce9d46392176d6a7da1a308e0a/hit/good/upper.wsp
/var/lib/carbon/whisper/MediaWiki/wanobjectcache/6ba5128affdef1bdeba3cebec75b8735/miss/compute/mean.wsp
/var/lib/carbon/whisper/MediaWiki/wanobjectcache/1353b9d1c7daf53c46afdf8f9bd1250a/hit/good/count.wsp

It doesn't look like metric creation will reach a steady state; we shouldn't report per-hash metrics but should find a way to aggregate them instead.
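
One possible way to keep the metric count bounded, sketched here for illustration rather than as what was actually done, is to collapse hash-like key-class components (32 hex characters, as in the whisper paths above) into a single bucket before reporting:

```php
<?php
// Sketch only: collapse hash-like key-class components (32 hex chars, as in
// the whisper paths above) into one bucket so the metric count stays bounded.
function normalizeKeyClass( string $keyClass ): string {
	return preg_match( '/^[0-9a-f]{32}$/', $keyClass ) ? 'unknown' : $keyClass;
}
```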

There is some caller that is not making keys correctly, which causes this. I can't find it, though, after looking through all of core, extensions, and mediawiki-config.

Change 393749 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] graphite: blackhole spam from wanobjectcache

https://gerrit.wikimedia.org/r/393749
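
The puppet change itself isn't quoted here; roughly speaking, it needs to match per-hash metric names like those above. The pattern below is shown in PHP purely for illustration, and the actual graphite-side blackhole syntax may differ:

```php
<?php
// Illustration only: a pattern matching the per-hash metric names seen above.
// The actual blackhole rule lives in the graphite/puppet config, not in PHP.
$pattern = '/^MediaWiki\.wanobjectcache\.[0-9a-f]{32}\./';
var_dump( (bool)preg_match(
	$pattern,
	'MediaWiki.wanobjectcache.1522a7ce9d46392176d6a7da1a308e0a.hit.good'
) ); // bool(true)
```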

Change 393749 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: blackhole spam from wanobjectcache

https://gerrit.wikimedia.org/r/393749

Mentioned in SAL (#wikimedia-operations) [2017-11-28T12:25:55Z] <godog> cleanup wanobjectcache metrics with hashes - T178531

> There is some caller that is not making keys correctly, which causes this. I can't find it, though, after looking through all of core, extensions, and mediawiki-config.

Thanks for looking into it! In the meantime I've blackholed said metrics to avoid graphite disks filling up.

I guess we will need MW side logging now. Probably can just add it to wmf branch.

Change 394493 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.10] Add temporary logging for bad WAN cache statsd keys

https://gerrit.wikimedia.org/r/394493
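
The patch itself isn't quoted here either; a rough sketch of what such temporary logging could look like (the channel name, message, and fields are assumptions):

```php
<?php
// Sketch only, not the actual wmf-branch patch. Log enough context to find
// the caller that builds keys without a proper collection name.
use MediaWiki\Logger\LoggerFactory;

function logBadStatsKey( string $cacheKey, string $keyClass, string $caller ): void {
	if ( preg_match( '/^[0-9a-f]{32}$/', $keyClass ) ) {
		LoggerFactory::getInstance( 'WANObjectCache' )->warning(
			'Hash-like statsd key class for cache key {cachekey}',
			[ 'cachekey' => $cacheKey, 'keyclass' => $keyClass, 'caller' => $caller ]
		);
	}
}
```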

Change 394493 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.10] Add temporary logging for bad WAN cache statsd keys

https://gerrit.wikimedia.org/r/394493

I see no log entries showing up there (aside from the usual lag ones).

> There is some caller that is not making keys correctly, which causes this. I can't find it, though, after looking through all of core, extensions, and mediawiki-config.

> Thanks for looking into it! In the meantime I've blackholed said metrics to avoid graphite disks filling up.

Maybe you can try turning off that filter for a while and seeing if they return.

I peeked at the statsd stream and indeed I can't see the metrics with hashes anymore. What changed, @aaron?

Probably some MW fixes actually reaching production.

The logstash logging doesn't show anything either.