Deprecation of mw.errors.* metrics
Closed, Resolved · Public · 5 Estimated Story Points


While working on the migration of eventlog1001 to eventlog1002 (T185667) I discovered the following puppet role applied to eventlog1001:


The functionality seems to be to get udp logs from mwlog1001, parse them and send some metrics to statsd/graphite via the following script:

From what I gathered uses metrics grabbed from logstash, so I am wondering if anybody still uses mw.errors.* anywhere. If not, I'd stop collecting these metrics as soon as possible.
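
For readers unfamiliar with this kind of pipeline, here is a minimal sketch of the general idea described above: read log lines, classify them, and emit one statsd counter increment per match over UDP. It is not the original script. The keyword rules, the function names and the default host/port are assumptions made up for illustration; only the `name:1|c` counter format is standard statsd.

```python
import socket
from typing import Iterable, Optional

# Hypothetical keyword -> metric suffix rules. The parsing logic of the
# real script is not shown in this task, so these are placeholders.
RULES = [
    ("Catchable fatal error", "catchable"),
    ("Fatal error", "fatal"),
    ("Exception", "exception"),
]


def classify(line: str) -> Optional[str]:
    """Return the metric suffix for a log line, or None if no rule matches."""
    for needle, suffix in RULES:
        if needle in line:
            return suffix
    return None


def statsd_counter(suffix: str, prefix: str = "mw.errors") -> bytes:
    """Build a statsd counter increment such as b'mw.errors.fatal:1|c'."""
    return f"{prefix}.{suffix}:1|c".encode("ascii")


def forward(lines: Iterable[str], host: str = "127.0.0.1", port: int = 8125) -> int:
    """Send one counter increment per matching line; return how many were sent."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            suffix = classify(line)
            if suffix is not None:
                sock.sendto(statsd_counter(suffix), (host, port))
                sent += 1
    return sent
```

Calling `forward(open("some.log"))` would then push one increment per matching line to a local statsd listener (the file name here is a placeholder).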

elukey@neodymium:~$ sudo cumin 'R:class = mediawiki::monitoring::errors' 'uname -a' --dry-run
1 hosts will be targeted:

elukey@neodymium:~$ sudo cumin 'R:class = role::logging::mediawiki::errors' 'uname -a' --dry-run
1 hosts will be targeted:

elukey@neodymium:~$ sudo cumin 'R:File = /usr/local/bin/mwerrors' 'uname -a' --dry-run
1 hosts will be targeted:

Event Timeline

elukey triaged this task as Medium priority. Mar 2 2018, 4:36 PM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Mar 2 2018, 4:36 PM

Change 415887 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::eventlogging::analytics: deprecate mw.errors.* metrics

elukey updated the task description. Mar 2 2018, 4:48 PM

Sounds good to me, we'd also need to audit dashboards in case we're using it somewhere and replace with logstash metrics.

I’ve got a dashboard search tool we could use to easily check that:

It seems that aside from mw.errors.exception, the mw.errors.* metrics were last written to in 2016.

[19:20 UTC] krinkle at graphite1001.eqiad.wmnet in /var/lib/carbon/whisper/mw/errors

 - fatal: Jan 31  2016 rate.wsp ..
 - catchable: May 20  2016 rate.wsp ..
 - query: Sep 22  2016 rate.wsp ..

 - exception: Mar  2 19:20 rate.wsp ..
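
The same check can be scripted. Below is a rough sketch, assuming the layout seen above (one subdirectory per metric, each holding a `rate.wsp` file); the function names and the 30-day threshold are arbitrary choices for illustration, not part of any Graphite tooling.

```python
import os
import time


def last_written(root: str, filename: str = "rate.wsp") -> dict:
    """Map each immediate subdirectory of root to the mtime of its `filename`."""
    result = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry, filename)
        if os.path.isfile(path):
            result[entry] = os.path.getmtime(path)
    return result


def stale(mtimes: dict, now: float, max_age_days: float = 30) -> list:
    """Return the names whose file was not modified within max_age_days."""
    cutoff = now - max_age_days * 86400
    return sorted(name for name, mtime in mtimes.items() if mtime < cutoff)


if __name__ == "__main__":
    mtimes = last_written("/var/lib/carbon/whisper/mw/errors")
    for name, mtime in mtimes.items():
        print(name, time.strftime("%Y-%m-%d %H:%M", time.gmtime(mtime)))
    print("stale:", stale(mtimes, now=time.time()))
```

File modification time is only a proxy for "last written datapoint", but it matches what the directory listing above shows.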

Found 0 matches for mw.errors in any of our current dashboards.

Just in case a dashboard queried it in a more complex way, or with templates/wildcards etc., I also ran a search just for exception:

Searching for: /[("'{,]exception[^"')}]+/g
... checking db/aqs-cassandra-storage (AQS :: Cassandra :: Storage)
[ '"exceptions' ]
... checking db/production-logging (Production Logging)
[ '{exception,fatal', '{exception,fatal', '{exception,fatal' ]
... checking db/restbase-cassandra-storage (RESTBase :: Cassandra :: Storage)
[ '"exceptions' ]
... checking db/restbase-staging-cassandra-storage (RESTBase staging :: Cassandra :: Storage)
[ '"exceptions' ]

The only related one here is db/production-logging which uses the logstash metrics instead.
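
For anyone wanting to repeat this kind of audit, here is a small sketch of the matching step only. The pattern is the one from the search above, translated to Python; the `search_dashboards` helper and the sample dashboard bodies are hypothetical and are not the actual search tool or real dashboard definitions.

```python
import re

# Python translation of the JavaScript pattern /[("'{,]exception[^"')}]+/g
PATTERN = re.compile(r"""[("'{,]exception[^"')}]+""")


def search_dashboards(dashboards: dict) -> dict:
    """Map dashboard name -> list of matches, omitting dashboards with none."""
    hits = {}
    for name, body in dashboards.items():
        found = PATTERN.findall(body)
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    # Made-up dashboard JSON snippets, for illustration only.
    sample = {
        "db/example-logging": '{"target": "hypothetical.{exception,fatal}.rate"}',
        "db/example-storage": '{"title": "exceptions"}',
        "db/example-other": '{"title": "latency"}',
    }
    for name, found in search_dashboards(sample).items():
        print("... checking", name)
        print(found)
```

Because the pattern requires a leading `(`, `"`, `'`, `{` or `,` and stops at the first closing quote, parenthesis or brace, it catches both plain names and brace-expanded lists such as `{exception,fatal`.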

Krinkle moved this task from Limbo to Perf issue on the Performance-Team (Radar) board.
elukey added a comment. Mar 3 2018, 6:33 PM

Thanks @Krinkle! @fgiunchedi I think we are ready to go, what do you think?

elukey moved this task from Backlog to In Progress on the User-Elukey board. Mar 5 2018, 2:19 PM

Sounds good! Thanks to you and @Krinkle for the audit.

Change 415887 merged by Elukey:
[operations/puppet@production] role::eventlogging::analytics: deprecate mw.errors.* metrics

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:34:35Z] <elukey> graphite metrics mw.error.* deprecated in T188749

elukey set the point value for this task to 5. Mar 5 2018, 2:34 PM
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
Krinkle removed a subscriber: Krinkle. Mar 6 2018, 2:25 AM
elukey moved this task from In Progress to Done on the User-Elukey board. Mar 6 2018, 8:40 AM
Nuria closed this task as Resolved. Mar 26 2018, 9:29 PM