db1028 paged for high lag today
01:54 <kaldari> !log foreachwiki extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php
02:21 -icinga-wm:#wikimedia-operations- PROBLEM - MariaDB Slave Lag: s7 on db1028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds
03:38 <godog> kaldari: is it possible populateLocalAndGlobalIds.php is spamming centralauth ? we got a slave lag page
03:38 <godog> also traffic increased on s7 globally https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s7&var-role=All&from=1481233111803&to=1481254711803
03:38 <kaldari> godog: Maybe, I'll kill it for now....
03:39 <godog> kaldari: ok thanks! I'll keep an eye on it and see if that was the cause
03:40 <godog> only db1028 suffered though, the other slaves didn't have a problem with it
03:41 <kaldari> godog: killed it
03:42 <godog> yeah written rows are dropping
03:44 <kaldari> godog: that's not surprising, all the script does is write a lot of rows, but it's supposed to waitForSlaves after each batch of 1000.
03:45 <kaldari> it did about 4 million writes this afternoon before I killed it
03:47 <godog> kaldari: ah, db1028 is weighted at 0 in mw I wonder if that's related
03:53 <godog> I wonder if db1028 will get a chance of catching up on the lag now
03:53 <godog> kaldari: anyways thanks for killing it, so far my best lead is db1028 being at weight 0 and waitforslave not waiting for it
03:54 <kaldari> looks like the lag is leveling off at least (rather than continuing to climb)
03:55 <kaldari> and now actually went down a little bit
03:56 <kaldari> godog: thanks for pinging me, looks like we might be on the way back to normal now.
03:57 <kaldari> I was looking at the slave lag when I first started running the script, but hadn't checked it since
03:58 <godog> kaldari: no worries, yeah it took about 30m to page I think
03:58 <kaldari> hmm, now it went back up again
03:59 <kaldari> godog: https://tendril.wikimedia.org/chart?hosts=db1028&vars=seconds_behind_master&mode=value
04:01 <godog> indeed, I'm comparing it with e.g. db1062 in grafana
04:01 <godog> the "write query stats" panel, which it dropped for db1062 but not db1028
04:04 <godog> looks like it might be still flushing to disk, writing a lot
04:04 <kaldari> yeah, maybe it still has a lot to catch up on
04:05 -icinga-wm:#wikimedia-operations- RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
04:06 <godog> there we go
04:07 -icinga-wm:#wikimedia-operations- RECOVERY - MariaDB Slave Lag: s7 on db1028 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
04:08 <godog> I'll wait a bit to make sure db1028 is ok
So it looks like lag increased only on db1028, which is the only non-master at weight 0. The best lead is that waitForSlaves apparently does not wait for weight-0 slaves, so the script's batched writes were never throttled to db1028's pace.
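The suspected failure mode can be sketched as a small simulation. This is illustrative only, not MediaWiki's actual LoadBalancer or waitForSlaves code; the names (`Replica`, `wait_for_replicas`, the rates and thresholds) are made up. The point it demonstrates: if the writer's lag check skips replicas with weight 0, a depooled replica gets no catch-up time and its lag grows without bound while pooled replicas stay healthy.

```python
# Hypothetical simulation of the suspected failure mode: a batched writer
# throttles on replica lag, but only checks replicas with weight > 0.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    weight: int        # load-balancer weight; 0 = depooled from reads
    apply_rate: int    # rows the replica can apply per interval
    pending: int = 0   # replicated-but-unapplied rows (its "lag")

    def replicate(self, rows: int) -> None:
        # Receive rows from the master, apply what we can this interval.
        self.pending = max(0, self.pending + rows - self.apply_rate)

def wait_for_replicas(replicas, max_lag: int) -> None:
    # The bug being illustrated: weight-0 replicas are skipped entirely,
    # so their lag never slows the writer down.
    for r in replicas:
        if r.weight == 0:
            continue
        while r.pending > max_lag:
            r.replicate(0)  # "wait": the replica drains pending rows

def run_batches(replicas, batches: int, batch_size: int = 1000) -> None:
    # Like the maintenance script: write a batch, then wait for slaves.
    for _ in range(batches):
        for r in replicas:
            r.replicate(batch_size)
        wait_for_replicas(replicas, max_lag=500)

replicas = [
    Replica("db1062", weight=100, apply_rate=800),  # pooled: gets waited on
    Replica("db1028", weight=0, apply_rate=800),    # depooled: never waited on
]
run_batches(replicas, batches=50)
for r in replicas:
    print(r.name, r.pending)
```

With identical apply rates, the pooled replica's lag is repeatedly drained back under the threshold, while the weight-0 replica accumulates the full write-rate surplus, mirroring how only db1028 fell behind on s7.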