Document performance optimization of servermon and/or puppet reporting tools
Closed, Resolved · Public

Description

Today we discussed better indexing on the puppet tables. We believe the issue to be fixed, although we should keep monitoring further updates.

For starters, we see less contention on db1016:

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1016&from=1493916785809&to=now&panelId=19&fullscreen

And 150K/s fewer row (range-type) scans:
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1016&from=1493916785809&to=now&panelId=3&fullscreen

And unless I am imagining things, I see roughly 5x lower CPU usage on the whole server:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=7&fullscreen&orgId=1&var-server=db1016&var-network=eth0&from=1493982880253&to=now

We still have to check whether rows are being locked less now:
https://tendril.wikimedia.org/report/slow_queries?host=db1016&user=&schema=puppet&qmode=eq&query=&hours=24

The real purpose of this ticket is to make the fix repeatable in case the tools are reinstalled, etc. The fix was:

USE puppet;
ALTER TABLE fact_values
  DROP INDEX index_fact_values_on_host_id,
  ADD INDEX index_fact_values_on_host_id (host_id, fact_name_id);
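To make re-applying it safer, a minimal sketch for checking whether the composite index is already in place before running the ALTER (this check is my addition, not part of the original fix; the WHERE filter on SHOW INDEX works on both MySQL and MariaDB):

-- If this returns two rows (Seq_in_index 1 for host_id, 2 for fact_name_id),
-- the fix has already been applied and the ALTER can be skipped.
SHOW INDEX FROM fact_values WHERE Key_name = 'index_fact_values_on_host_id';
-- Alternatively, inspect the full table definition:
SHOW CREATE TABLE fact_values\G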

Details

Related Gerrit Patches:
[operations/puppet@production] Document servermon optimization

Event Timeline

jcrespo created this task. · May 5 2017, 5:10 PM
jcrespo updated the task description. · May 5 2017, 5:11 PM
jcrespo updated the task description. · May 5 2017, 5:16 PM

Change 354207 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Document servermon optimization

https://gerrit.wikimedia.org/r/354207

Change 354207 merged by Alexandros Kosiaris:
[operations/puppet@production] Document servermon optimization

https://gerrit.wikimedia.org/r/354207

The patch above documents everything in the servermon.rb reporter (which is the application that had the issue), so that part is done.

However, funnily enough, the last link

https://tendril.wikimedia.org/report/slow_queries?host=db1016&user=&schema=puppet&qmode=eq&query=&hours=24

still displays simple updates taking a long time (up to 50s).

An explain on the update listed in the link above says:

explain UPDATE hosts SET environment = 'production', updated_at = '2017-05-18 01:36:11 +0000', last_compile = '2017-05-18 01:36:11 +0000' WHERE name='lvs4003.ulsfo.wmnet'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hosts
         type: range
possible_keys: index_hosts_on_name
          key: index_hosts_on_name
      key_len: 257
          ref: NULL
         rows: 1
        Extra: Using where

whereas an explain on the equivalent SELECT (same WHERE clause on name) says:

           id: 1
  select_type: SIMPLE
        table: hosts
         type: ref
possible_keys: index_hosts_on_name
          key: index_hosts_on_name
      key_len: 257
          ref: const
         rows: 1
        Extra: Using index condition

So the SELECT is ref: const, using the correct index, which is as good as it gets IIRC.
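One guess for why the UPDATE is still planned as a range scan instead of a ref lookup (my speculation, not something established in this thread): stale index statistics on the hosts table. Refreshing them is cheap and harmless to try:

-- Speculative step: refresh index statistics on the hosts table,
-- then re-run the EXPLAIN above to see whether type changes from range to ref.
ANALYZE TABLE hosts;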

SHOW ENGINE INNODB STATUS does not show any deadlocks, so at least we are now OK on that front.
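For catching lock waits as they happen (rather than after the fact), a sketch using the InnoDB information_schema tables, assuming they are available on db1016 (they are present by default in MySQL 5.5+ and MariaDB 10):

-- Sketch: list transactions currently waiting on row locks, and who blocks them.
SELECT r.trx_id    AS waiting_trx,
       r.trx_query AS waiting_query,
       b.trx_id    AS blocking_trx,
       b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;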

@jcrespo: Any hints on how to debug this further?

Let's close this, as the scope is done as far as I am concerned, and open a new ticket at lower priority with basically your own last comment. This is personally a very interesting problem to me, although not high priority: it has been there for months or years and only showed up now that we have limited query time to 60 seconds everywhere.

akosiaris closed this task as Resolved. · May 18 2017, 1:13 PM
akosiaris claimed this task.

Agreed. Relevant stuff copied over to T165674 (marked lowest priority). Resolving this.

The "separate task" I usually suggest in these cases has a double reason- it makes clear the scope for people that first read it (instead of tickets that are only left with a small part of the scope), and it is personally a "moral" boost in "we have accompished something", leaving the rest of the things as a separate concern.