Page MenuHomePhabricator

Investigate slow servermon updating queries on db1016
Closed, DeclinedPublic

Description

Copied over from T164604

In https://tendril.wikimedia.org/report/slow_queries?host=db1016&user=&schema=puppet&qmode=eq&query=&hours=24

we see simple updates taking a long time (up to 50s). Those are very simple queries on a simple table that does not have FOREIGN_KEYS itself to any other table (other tables do have FOREIGN KEYS to it and are tables that are heavily updated)

The queries in question are of the form

UPDATE hosts SET environment = 'production', updated_at = '2017-05-18 01:36:11 +0000', last_compile = '2017-05-18 01:36:11 +0000' WHERE name='lvs4003.ulsfo.wmnet'

A SHOW CREATE TABLE on that table says

Create Table: CREATE TABLE `hosts` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `ip` varchar(255) DEFAULT NULL,
  `environment` varchar(255) DEFAULT NULL,
  `last_compile` datetime DEFAULT NULL,
  `last_freshcheck` datetime DEFAULT NULL,
  `last_report` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `source_file_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_hosts_on_source_file_id` (`source_file_id`),
  KEY `index_hosts_on_name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=42684 DEFAULT CHARSET=latin1

So the fields updated are not keys and a key is used to look up the specific row

An explain on the update listed in the link above says

explain UPDATE hosts SET environment = 'production', updated_at = '2017-05-18 01:36:11 +0000', last_compile = '2017-05-18 01:36:11 +0000' WHERE name='lvs4003.ulsfo.wmnet'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hosts
         type: range
possible_keys: index_hosts_on_name
          key: index_hosts_on_name
      key_len: 257
          ref: NULL
         rows: 1
        Extra: Using where

whereas an explain on the select

           id: 1
  select_type: SIMPLE
        table: hosts
         type: ref
possible_keys: index_hosts_on_name
          key: index_hosts_on_name
      key_len: 257
          ref: const
         rows: 1
        Extra: Using index condition

So the select is ref: const, using the correct index which is as good as it gets IIRC.

SHOW ENGINE INNODB status does not show any deadlocks, neither any recent foreign key errors.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 18 2017, 1:13 PM
akosiaris triaged this task as Lowest priority.May 18 2017, 1:14 PM
akosiaris added projects: Operations, DBA.

Note that this is occuring seldomly and not causing any issues whatsoever. It's mostly out of personal interest that we are looking into this, hence the very low priority.

Volans added a subscriber: Volans.May 22 2017, 6:41 PM

Per https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=7680e3d95eee2aa98b1c461dbc0dcc5c&host=db1016&user=&schema=puppet&hours=24 this seems to be occuring only on XX:3[567] timeframe. That is only on every 35th, 36th, 37th minute of the hour. Which times very well with servermon's make_updates cron. The code for that is at https://github.com/servermon/servermon/blob/master/servermon/updates/management/commands/make_updates.py. The entire command is practically one long running transaction (running for like 3minutes). The command runs across the entire set of hosts. Looking at SHOW ENGINE INNODB STATUS confirms this suspicion with the following output

---TRANSACTION 19225813873, ACTIVE 44 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1184, 2 row lock(s)
MySQL thread id 56589727, OS thread handle 0x7f222ca46700, query id 15314324582 10.64.16.159 puppet updating
UPDATE hosts SET             environment = 'production',             updated_at = '2017-05-23 09:35:20 +0000',             last_compile = '2017-05-23 09:35:20 +0000'             WHERE name='db2039.codfw.wmnet'
Trx #rec lock waits 1 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
------- TRX HAS BEEN WAITING 44 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1154 page no 34 n bits 264 index `PRIMARY` of table `puppet`.`hosts` trx table locks 1 total table locks 13  trx id 19225813873 lock_mode X locks rec but not gap waiting lock hold time 44 wait time before grant 0

Show if I read this correctly, the UPDATE awaits for a record lock X (can record locks be S? I don't think so, it sounds wrong) on the PRIMARY key of the table. However that lock is held by the make_updates command already given that this is a FOREIGN KEY for some heavily updated tables.

I 've reproduced this manually, running the command and trying an update command like the ones above.

Note that in the past the transcation was not held through the entire command but rather per host. That was changed back in https://github.com/servermon/servermon/commit/6b450d4f13e6085f4b41e458f0a431ee6652d09f as it would create missing entries due to race conditions with puppet runs.

Fun fact: The order hosts are traversed in is alphabetical and a record lock is not issued for a host until it's their turn which makes it a tad more difficult to reproduce the issue (it's easier to reproduce it manually on say acamar than on zosma).

I know this is old but maybe this was affected by: T166344 and its performance degradation?

I doubt that. Looking at tendril for the last 72 hours (>2 days before T166344 happened again), it's clear that this is happening around the time the make_updates cron is running. I am pretty certain it's a bug in the logic of the make_updates command and I will have to fix it, it's just low priority as it does not substantially affect day-to-day operations.

Cool - just wanted to make sure we had that in mind :-)

Marostegui moved this task from Triage to Backlog on the DBA board.Jun 14 2017, 2:05 PM
jcrespo moved this task from Backlog to Backlog (help welcome) on the DBA board.Feb 14 2018, 11:58 AM

I caught one slow query while doing other monitoring, putting it here for extra debugging:

db1063	22457446	puppet	dbproxy1001	puppet	31s
UPDATE hosts SET environment = 'production', updated_at = '2018-07-25 06:35:58 +0000', last_compile = '2018-07-25 06:35:58 +0000' WHERE name='elastic1039.eqiad.wmnet'

FYI servermon will probably be decommissioned soon, see T198939.

Any objections to decline this ticket as per volans comment above? (T165674#4449561)

jcrespo closed this task as Declined.Aug 29 2018, 10:09 AM