Page MenuHomePhabricator

Investigate using PMM (Percona Monitoring and Management) for slow-query analysis
Open, LowPublic

Description

According to https://www.percona.com/doc/percona-monitoring-and-management/2.x/using/query-analytics.html it can use the performance schema instead of the slow query log, which might make it feasible for our environment.

Event Timeline

Marostegui awarded a token.
Marostegui moved this task from Triage to Refine on the DBA board.

For history, we almost enable a similar solution to this (through grafana + prometheus_mysqld_exporter- not sure if PMM uses that for queries or it is a parallel implementation; I know it uses the exporter for regular metrics), but Filippo and I had to disable the query profiling and table size metrics logging to public prometheus due to privacy concerns towards grafana.wm.o.

But this looks exactly what we need, if setup privately on dbmonitor stack or similar.

Adding T143896 epic, even if one can argue that "query monitoring" is metrics or not, but to link it to an epic where this need was mentioned.

I'd like to mention that from a developer perspective, logging slow queries to Logstash and letting a Kibana dashboard surface and count common slow queries would likely suffice. We do this for various other kinds of warnings and code smells already and that seems to work fairly well and offers a lot of flexibility and multi-tennant perspectives on the same data.

I've rarely looked at Tendril over the years, and that might mean I'm not the right audience for it, and probably am not aware of the values and features it offers to others. If those exist and/or if other value adds are behind PMM that the DBAs would benefit from, then by all means ignore this comment :-)

I will not that even it was aimed at developers, having yet another place is by itself going to significantly reduce the amount of attention received, so integration with Kibana might be valuable even in addition to PPM. I'd also be happy to take some time to showcase Kibana if there are concerns, perceived limitations, or past frustrations that one might think make it a bad fit. You may very well be right, or there might be a trick or two up my sleeve.

We had some discussions about this today. On agreement on what @Krinkle has said. Logstash and TransactionProfiler seem a good replacement. Some notable differences between PMM and logstash are:

  • logstash is a central place and it would mean we have one less service to maintain.
  • logstash queries are a bit hard to find but we can build dashboards.
  • logstash would only cover mediawiki queries while pmm would cover other dbs
  • pmm would aggregate based on value but logstash wouldn't (which is a good think to debug slow queries)
  • pmm aggregates and order them based on total time (~=load on the db) but logstash is ordered by time, there might be ways to improve it.

The suggestion is to have both, but logstash for now to unblock dropping tendril and slowly deploy pmm but with lower priority for usecases logstash doesn't cover, we might revisit this if logstash turns out too good.

Here's link to logstash: https://logstash.wikimedia.org/goto/1e556159cd08643f6cda34e5572348c1 I feel it needs some tweaking to make it more useful to devs and DBAs but not too hard.

  • logstash would only cover mediawiki queries while pmm would cover other dbs

We don't really use tendril for any other queries that aren't MW. We do have misc but we don't care much about those queries as they are generally ok and misc has very low load in comparison with MW.

@Ladsgroup : does logstash actually contain the full query? E.g. i'm looking at https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2021.11.11?id=uP_FDX0Bw6ZjC92mWxWc, which seems clearly truncated.

I think this (and so many other things), need fixing before we can properly use it. The good thing is that most of these stuff is under our control.

LSobanski lowered the priority of this task from Medium to Low.Mon, Jan 10, 3:20 PM
LSobanski updated the task description. (Show Details)

Adjusting priority and removing the mention of this being a blocker for Tendril deprecation.