
Send metrics of db errors of mediawiki to prometheus
Closed, Resolved, Public

Description

There is a good Prometheus exporter from logstash, and it has already been used to send metrics in several other cases. We can simply use it to build a Grafana dashboard of db errors (and possibly alerts?). The metrics should be split by shard and by error type (deadlock, lock wait timeout, readonly, etc.) but not by wiki. Later we could maybe send metrics of slow queries to Grafana too.

Event Timeline

Change 745619 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] Start of DbLogger to centralize logic of logging

https://gerrit.wikimedia.org/r/745619

Yes, once you have logs in Elasticsearch you can turn search queries into Prometheus metrics; from there you get dashboards and alerts too (either based on Grafana, or as Prometheus alerting rules in operations/alerts.git). HTH!
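For anyone following along, here is a rough sketch of the idea in Python: one Elasticsearch search body that buckets recent rdbms log events by error category and shard, and a helper that flattens the resulting buckets into Prometheus text-format lines. It is only an illustration, not the actual es_exporter configuration. `db_log_category` and `channel: rdbms` come from the patches on this task; the shard field name `db_shard`, the `@timestamp` field, and the metric name `mediawiki_db_errors` are assumptions made up for the example.

```python
# Hypothetical field name for the shard; the real log field may differ.
SHARD_FIELD = "db_shard"


def build_query(window="1m"):
    """Build a search body counting recent rdbms log events,
    bucketed by error category and then by shard."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"channel": "rdbms"}},
                    # "@timestamp" is assumed to be the event time field
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
        "aggs": {
            "category": {
                "terms": {"field": "db_log_category"},
                "aggs": {"shard": {"terms": {"field": SHARD_FIELD}}},
            }
        },
    }


def to_prometheus(response, metric="mediawiki_db_errors"):
    """Flatten nested terms buckets into Prometheus text-format lines.
    The metric name is a placeholder, not the one used in production."""
    lines = []
    for cat in response["aggregations"]["category"]["buckets"]:
        for shard in cat["shard"]["buckets"]:
            labels = f'category="{cat["key"]}",shard="{shard["key"]}"'
            lines.append(f"{metric}{{{labels}}} {shard['doc_count']}")
    return lines


# Hand-written sample response, shaped like a nested terms aggregation result.
sample = {
    "aggregations": {
        "category": {
            "buckets": [
                {
                    "key": "deadlock",
                    "doc_count": 3,
                    "shard": {
                        "buckets": [
                            {"key": "s1", "doc_count": 2},
                            {"key": "s4", "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}

for line in to_prometheus(sample):
    print(line)
```

Keeping the wiki out of the label set, as the description asks, keeps the number of time series small: one series per (category, shard) pair rather than one per wiki.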

Change 745890 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Start adding db_log_category field to db logs

https://gerrit.wikimedia.org/r/745890

Marostegui renamed this task from Send metrics of db errors of mediawiki to promethues to Send metrics of db errors of mediawiki to prometheus. Dec 14 2021, 6:35 AM

Change 745890 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Start adding db_log_category field to db logs

https://gerrit.wikimedia.org/r/745890

Ladsgroup moved this task from In progress to Ready on the DBA board.

Won't be able to do it soon.

Change 825306 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] es_exporter: Add metrics collection for mediawiki's db errors

https://gerrit.wikimedia.org/r/825306

Change 825306 merged by Cwhite:

[operations/puppet@production] es_exporter: Add metrics collection for mediawiki's db errors

https://gerrit.wikimedia.org/r/825306

Ladsgroup claimed this task.
Ladsgroup moved this task from Ready to Done on the DBA board.

We now have something: https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&viewPanel=14 It's not perfect, but it's a good start.

Change 842935 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] es_exporter: Include channel=rdbms in query_log_mediawiki_mysql

https://gerrit.wikimedia.org/r/842935

Change 842935 merged by Cwhite:

[operations/puppet@production] es_exporter: Include channel=rdbms in query_log_mediawiki_mysql

https://gerrit.wikimedia.org/r/842935

mediawiki_errors_graph.png (404×1 px, 92 KB)

This is what I believe is a better graph from the same data, showing the instant (1 minute granularity) and daily counts of errors, including a comparison with last week's, in case anyone wants to use it: https://grafana.wikimedia.org/goto/gsZeT4yVk?orgId=1

Change 745619 abandoned by Ladsgroup:

[mediawiki/core@master] [WIP] Start of DbLogger to centralize logic of logging

Reason:

We went in a different direction

https://gerrit.wikimedia.org/r/745619