Plot User::pingLimiter() actions in Grafana
Closed, DeclinedPublic

Description

User::pingLimiter() lets us rate-limit some actions (default: 'edit'). It would be rather nice to have a Graphite dashboard listing each action.

Depends on MediaWiki bug 65477 T67477: User::pingLimiter should have per action profiling

Context:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20140503-Thumbnails

Details

Reference
bz65478

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:22 AM
bzimport set Reference to bz65478.
bzimport added a subscriber: Unknown Object (MLST).

MediaWiki now has support for varying pingLimiter profile points. We need to wait for both wmf branches to have it, then we can adjust the conf in operations/mediawiki-config.git.

User::pingLimiter() now profiles per action as well (was bug 65477 , https://gerrit.wikimedia.org/r/134067 ).

The statsd metric hierarchy:

MediaWiki.User.pingLimiter
MediaWiki.User.pingLimiter-edit
MediaWiki.User.pingLimiter-linkpurge
...

We can probably do graphs using MediaWiki.User.pingLimiter-*.count. That should be done in the gdash configuration (somewhere in operations/puppet.git).
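As a rough illustration of the wildcard idea, here is a minimal Python sketch that builds a Graphite render API URL for that target. The host name `graphite.example.org` and the time range are assumptions made up for the example; only the metric pattern comes from this task.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Graphite host; the real one is not named in this task.
GRAPHITE_BASE = "https://graphite.example.org"


def render_url(target, since="-7d", fmt="json"):
    """Build a Graphite render API URL for a single target expression."""
    query = urlencode({"target": target, "from": since, "format": fmt})
    return f"{GRAPHITE_BASE}/render?{query}"


# The wildcard matches every per-action series (edit, linkpurge, ...).
url = render_url("MediaWiki.User.pingLimiter-*.count")

# Round-trip check: the '*' is percent-encoded in the URL but decodes back.
decoded_target = parse_qs(urlsplit(url).query)["target"][0]
print(url)
print(decoded_target)
```

A dashboard definition would typically carry the same target expression rather than a raw URL; the URL form is only handy for checking by hand which series the wildcard matches.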

(In reply to Antoine "hashar" Musso (WMF) from comment #2)

We can probably do graphs using MediaWiki.User.pingLimiter-*.count. That should
be done in the gdash configuration (somewhere in operations/puppet.git).

Given https://wikitech.wikimedia.org/wiki/File:Wmfcluster-user_pinglimiter-20140101-20140507.png , I guess the best here would be to mimic the "most deviant" graph kind used in other gdash pages.

Change 166511 had a related patch set uploaded by Nemo bis:
Graph User::pingLimiter() actions in gdash

https://gerrit.wikimedia.org/r/166511

Hmm, I don't see any metrics for the second link after December 9th.

I'm not going to investigate this any time soon; I suggest that someone with access to the Graphite interface does.

faidon added a subscriber: faidon.

gdash has been retired since ~February 2016, having been replaced with Grafana.

hashar renamed this task from Graph User::pingLimiter() actions in gdash to Graph User::pingLimiter() actions in Graphana. Jul 20 2017, 8:20 PM
hashar reopened this task as Open.
hashar lowered the priority of this task from Medium to Low.
hashar added a subscriber: Krinkle.

I filed this task in the hope that someone could figure out the links to be added in gdash. Nowadays that can be done via Grafana, hence I reopened this task and rephrased the topic.

@Krinkle do you know whether we still send profiling of the MediaWiki function calls to statsd? The statsd metric MediaWiki.User.pingLimiter used to be populated by wfProfileIn() or something similar. Apparently the metrics are no longer emitted.

Krinkle updated the task description.
Krinkle updated the task description.

I filed this task in the hope that someone could figure out the links to be added in gdash. Nowadays that can be done via Grafana, hence I reopened this task and rephrased the topic.

@Krinkle do you know whether we still send profiling of the MediaWiki function calls to statsd? The statsd metric MediaWiki.User.pingLimiter used to be populated by wfProfileIn() or something similar. Apparently the metrics are no longer emitted.

Yeah, this dashboard was re-using/misusing the metrics for PHP stack trace profiling. A few years ago, we switched our profiler to use XHProf and report to XHGui instead of statsd. Xenon and XHGui have improved our understanding of PHP execution a lot. The old Graphite data was a flat tree, which made the nesting hard to understand.

The profiler, though, is primarily for measuring duration of execution, not frequency (except for frequency within a request, but not across the cluster), mainly because profiling is randomly enabled on only a sample of requests. So it probably wasn't very accurate anyway. It only worked because statsd automatically adds a counter to timing metrics.

If we want this metric back, it'll need to be added properly through something like wfIncrStats() - which goes explicitly to statsd.
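To make the difference concrete, here is a small Python sketch of the statsd line protocol for an explicit counter versus a timing sample. It is not MediaWiki's wfIncrStats() implementation; the helper names are hypothetical, and reusing the old `MediaWiki.User.pingLimiter-<action>` names for a new counter is an assumption.

```python
def statsd_counter(name, value=1, sample_rate=1.0):
    """Format a statsd counter line; '|@rate' is added only when sampling."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


def statsd_timing(name, millis):
    """Format a statsd timing line, as a profiler would have emitted."""
    return f"{name}:{millis}|ms"


def limiter_metric(action):
    """Hypothetical per-action metric name, mirroring the old hierarchy."""
    return f"MediaWiki.User.pingLimiter-{action}"


# Explicit counter: one increment per pingLimiter() call, on every request.
explicit = statsd_counter(limiter_metric("edit"))

# Old behaviour: a timing sample, sent only for profiled requests; statsd
# derives a count from timers as a side effect.
timed = statsd_timing(limiter_metric("edit"), 12)

print(explicit)
print(timed)
```

The explicit counter is sent for every call, so its count does not depend on which requests happened to be profiled.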

Alternatively, Logstash might make more sense. We already have a ratelimit channel set up that records when the limiter is triggered. It should be easy to improve that log entry from plain text to a PSR log with context data. Then you can have a Logstash dashboard to detect trends along multiple axes (by wiki, by user, by IP, by action etc.), which will make it easier to detect and understand trends/regressions. In addition, the surrounding context will make it easy to understand what is causing a regression, since it won't just record a sum of boolean signals, but rather the complete context. So in light of the original incident that prompted this task, you'd have seen an increase for commonswiki, for renderfile-nonstandard, and many entries would have the same internal IP.
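A minimal Python sketch of what structured entries buy you, assuming each log entry carries a context with wiki, user, IP and action. The field names, the message text and the sample data are invented for illustration and are not the actual MediaWiki log format.

```python
import json
from collections import Counter


def make_entry(wiki, user, ip, action):
    """Build one structured log entry for the ratelimit channel (hypothetical shape)."""
    return {
        "channel": "ratelimit",
        "message": "User tripped rate limit",
        "context": {"wiki": wiki, "user": user, "ip": ip, "action": action},
    }


def count_by(entries, axis):
    """Count entries along one context axis, e.g. 'wiki' or 'action'."""
    return Counter(entry["context"][axis] for entry in entries)


# Made-up sample data loosely shaped like the incident described above.
entries = [
    make_entry("commonswiki", "ExampleUser", "10.0.0.1", "renderfile-nonstandard"),
    make_entry("commonswiki", "ExampleUser", "10.0.0.1", "renderfile-nonstandard"),
    make_entry("enwiki", "OtherUser", "10.0.0.2", "edit"),
]

by_wiki = count_by(entries, "wiki")
by_action = count_by(entries, "action")
by_ip = count_by(entries, "ip")

print(json.dumps(entries[0]))
print(by_wiki.most_common(1), by_action.most_common(1), by_ip.most_common(1))
```

A plain counter in Graphite would only show the total of three; the context is what lets a dashboard split that total by wiki, action or IP.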

If the plan is to have alerting (not just dashboard) then Graphite still makes sense. Perhaps both.

Krinkle renamed this task from Graph User::pingLimiter() actions in Graphana to Graph User::pingLimiter() actions in Grafana. Jul 21 2017, 10:55 PM
Krinkle edited projects, added observability; removed WMF-General-or-Unknown.
Krinkle removed a subscriber: wikibugs-l-list.
Krinkle renamed this task from Graph User::pingLimiter() actions in Grafana to Plot User::pingLimiter() actions in Grafana. Jul 4 2019, 1:29 PM