
Decide storage backend for performance schema monitoring stats
Closed, Resolved · Public

Description

While performance_schema has not yet been rolled out to all servers, now is the moment to make use of its metrics.

The two final steps of the process were:

  • Implement collectors
  • Generate graphs

The first one depends on the second. There are some doubts about how to implement this:

  • It could be added to the current system (db1011, the tendril DB with TokuDB), but I want to move away from it: while tendril and its database should stay, the graphing backend (MySQL) is not very good for large amounts of data, and it currently uses a Google API for graphing, which makes it neither private nor fast. Also, TokuDB and tendril crash about once a week due to the extreme load
  • We could add some metrics to Graphite, but I am unsure whether the current backend can handle the new load (+150 hosts, 5-minute resolution for 1 day, 1-hour resolution for 7 days, ~100 metrics with more coming, 400-500 GB compressed). Maybe it is time to convert db1011 into a dedicated Graphite host?
  • Graphite may be replaced in the future? I am OK with it being used as a test, as the collector daemons have yet to be written
  • Maybe MySQL can continue to be used, but then we only implement it as a frontend for Grafana and skip Graphite?

There are too many questions that we should answer, test, build a proof of concept for, etc.

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task from to High.
jcrespo updated the task description.
jcrespo added projects: SRE, Patch-For-Review, DBA.
jcrespo added subscribers: fgiunchedi, jcrespo.

Graphite side: each metric takes 309 KB on disk with the current retention of 1m:7d,5m:14d,15m:30d,1h:1y, so that would be ~40 MB/host with 100 metrics.
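For reference, the 309 KB figure can be reproduced from the Whisper on-disk format (12 bytes per datapoint, plus a small file header and one header entry per archive); a minimal sketch, assuming standard Whisper sizes:

```python
# Estimate the on-disk size of one Whisper file for the retention
# schedule quoted above: 1m:7d, 5m:14d, 15m:30d, 1h:1y.
POINT_SIZE = 12        # bytes per (timestamp, value) datapoint
HEADER = 16            # whisper file metadata header
ARCHIVE_HEADER = 12    # per-archive header entry

SECONDS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400, 'y': 31536000}

def points(spec):
    """'1m:7d' -> number of datapoints stored in that archive."""
    step, ttl = spec.split(':')
    to_s = lambda v: int(v[:-1]) * SECONDS[v[-1]]
    return to_s(ttl) // to_s(step)

def whisper_size(retentions):
    total_points = sum(points(r) for r in retentions)
    return HEADER + ARCHIVE_HEADER * len(retentions) + POINT_SIZE * total_points

size = whisper_size(['1m:7d', '5m:14d', '15m:30d', '1h:1y'])
print(size)        # -> 309088 bytes, i.e. ~309 KB per metric
print(size * 100)  # -> 30908800, ~31 MB per host with 100 metrics
```

So 100 metrics comes out at ~31 MB/host; the ~40 MB/host figure above simply leaves some headroom.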

What about performance (reads)? Is there room to grow right now?

It ultimately depends on the Graphite queries issued, of course, but I'm assuming we wouldn't be reading all metrics from all machines at the same time, all the time. So yes, there should be enough IOPS on the SSDs to sustain that.

CC'ing Isart, who was interested in maybe helping with the MySQL metrics.

I've added the following Diamond collector to send P_S metrics to Graphite. I'm not sure how to set the user/password, so I've set them to $user and $password in the config file.

https://gerrit.wikimedia.org/r/#/c/256007/
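For context, Diamond collectors are configured through per-collector INI files (typically under /etc/diamond/collectors/). A hypothetical sketch of how the $user/$password placeholders might look — the collector name and option names here are assumptions for illustration, not taken from the patch:

```ini
# Hypothetical /etc/diamond/collectors/PerformanceSchemaCollector.conf
# Option names are illustrative; the real ones are defined by the
# collector in the Gerrit change linked above.
enabled = True
interval = 300        ; matches the 5-minute resolution discussed above
host = localhost
port = 3306
user = $user          ; placeholders, to be substituted at deploy time
password = $password
```

In practice the placeholders would be filled in by configuration management (e.g. a puppet template), so the credentials never land in the repository.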

jcrespo changed the task status from Open to Stalled. (May 2 2016, 9:52 AM)
jcrespo removed jcrespo as the assignee of this task.
jcrespo changed the task status from Stalled to Open. (Aug 6 2016, 9:03 AM)
jcrespo moved this task from Backlog to In progress on the DBA board.

T126757 in progress.

jcrespo changed the task status from Open to Stalled. (Nov 3 2016, 9:48 AM)
jcrespo moved this task from In progress to Backlog on the DBA board.

Half of this went to the public Prometheus.

We cannot hold query data there, as it can contain PII. The solution is to create a private Prometheus instance and/or a MySQL metadata database. We will work on this next quarter, if possible.

jcrespo claimed this task.

Because of privacy concerns, we will finally go for a private Prometheus instance for the queries. This will be tracked in a separate task.