
Investigate slow query logging/digest for Beta Cluster
Open, Medium, Public

Description

HHVM's SlowTimer already logs grossly slow queries to logstash, but we might catch more pre-deploy performance regressions with a slow query digest for Beta Cluster. The raw log would also be useful for analyzing BC outages post-mortem. (See T116447: [postmortem] Beta Cluster outage: deployment-db2 disk filled up, locked db replication)

If such a digest proves valuable, we should consider making its review a formal part of the MW train deployment process.

Event Timeline

dduvall raised the priority of this task from to Needs Triage.
dduvall updated the task description.
dduvall added subscribers: dduvall, mmodell, hashar.

@jcrespo this is a follow-up task after the beta cluster outage (T116447).

Dan mentioned that the beta cluster databases do not log slow queries. We thought about enabling slow query logs on the beta cluster and having them summarized somewhere, so that slow queries can be investigated before they hit production.

HHVM does report some slow queries via SlowTimer, but Zend does not. Additionally, if a query is killed, HHVM is not going to report it.

So the whole purpose of this task is to set up a slow query analyzer on the beta cluster database and take its results into account when doing the deployment train.
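For illustration only, here is a minimal sketch of what such a digest could compute. It assumes the slow queries have already been parsed into (seconds, SQL) pairs; the function names, the one-second threshold, and the regex-based normalization are all hypothetical and are not what any existing tool on the cluster does.

```python
import re
from collections import defaultdict


def normalize(sql):
    """Collapse literals so queries that differ only in values share a fingerprint."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # single-quoted strings
    sql = re.sub(r"\b\d+\b", "?", sql)            # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def digest(entries, threshold=1.0):
    """Group slow queries by fingerprint.

    entries: iterable of (seconds, sql) pairs (hypothetical input shape).
    Returns (fingerprint, count, total_seconds) rows, largest total time first.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for seconds, sql in entries:
        if seconds < threshold:
            continue  # not slow enough to report
        row = totals[normalize(sql)]
        row[0] += 1
        row[1] += seconds
    rows = [(fp, count, total) for fp, (count, total) in totals.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

Sorting by total time rather than by count puts the queries that cost the database the most at the top, which is probably the ordering a pre-deploy review would want.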

Are you committing time to this?

We (RelEng) probably won't be able to commit any time to it right now --greg

I am asking because I can do what you tell me, or I can set up a better solution (the same one we are deploying to production, T99485).

hashar triaged this task as Medium priority. Nov 2 2015, 8:12 PM

Will poke @jcrespo about it; since we are in the same timezone, it is more convenient.

Poked Jaime about it by email.

Clarified with @jcrespo. We can just enable performance_schema, just like for production (T99485). The information will then be available in the beta cluster database instances (db1 / db2).

The instances are still Precise and hence come with MariaDB 5.5. performance_schema starts being useful with 5.6.

We will need a way to collect and send metrics to some central place. Production is apparently going to send metrics to Graphite so we can generate dashboards with Grafana.
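As a rough sketch of the collection side, the snippet below turns per-digest statistics into lines in Graphite's plaintext format (`path value timestamp`). It assumes the rows come from `performance_schema.events_statements_summary_by_digest` (columns DIGEST, COUNT_STAR, SUM_TIMER_WAIT, the last one in picoseconds); the function name, the metric prefix, and the metric naming scheme are hypothetical and not what production uses.

```python
def graphite_lines(rows, prefix, timestamp):
    """Format digest statistics as Graphite plaintext-protocol lines.

    rows: iterable of (digest_hash, count_star, sum_timer_wait_ps) tuples,
          assumed to be read from events_statements_summary_by_digest.
    prefix: metric path prefix (hypothetical naming, e.g. "beta.db2").
    timestamp: Unix time in seconds.
    """
    lines = []
    for digest_hash, count, total_ps in rows:
        total_ms = total_ps / 1e9  # picoseconds to milliseconds
        lines.append(f"{prefix}.{digest_hash}.count {count} {timestamp}")
        lines.append(f"{prefix}.{digest_hash}.total_ms {total_ms:.3f} {timestamp}")
    return lines
```

Each line would then be sent, newline-terminated, to the Graphite (carbon) plaintext listener. The digest hash is used in the metric path instead of the statement text because the text contains spaces and punctuation that do not fit in a metric name.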

After reading T119461 and checking the number of warnings on production (https://gerrit.wikimedia.org/r/#/c/198661/) we should go with performance_schema, which will also solve logging warnings for T119371 at the same time.