
db1028 increased lag after extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php
Closed, Resolved · Public

Description

db1028 paged for high lag today

01:54  <kaldari> !log foreachwiki extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php
02:21 -icinga-wm:#wikimedia-operations- PROBLEM - MariaDB Slave Lag: s7 on db1028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.01 seconds
03:38  <godog> kaldari: is it possible populateLocalAndGlobalIds.php is spamming centralauth ? we got a slave lag page
03:38  <godog> also traffic increased on s7 globally https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s7&var-role=All&from=1481233111803&to=1481254711803
03:38  <kaldari> godog: Maybe, I'll kill it for now....
03:39  <godog> kaldari: ok thanks! I'll keep an eye on it and see if that was the cause
03:40  <godog> only db1028 suffered though, the other slaves didn't have a problem with it
03:41  <kaldari> godog: killed it
03:42  <godog> yeah written rows are dropping
03:44  <kaldari> godog: that's not surprising, all the script does is write a lot of rows, but it's supposed to waitForSlaves after each batch of 1000.
03:45  <kaldari> it did about 4 million writes this afternoon before I killed it
03:47  <godog> kaldari: ah, db1028 is weighted at 0 in mw I wonder if that's related
03:53  <godog> I wonder if db1028 will get a chance of catching up on the lag now
03:53  <godog> kaldari: anyways thanks for killing it, so far my best lead is db1028 being at weight 0 and waitforslave not waiting for it
03:54  <kaldari> looks like the lag is leveling off at least (rather than continuing to climb)
03:55  <kaldari> and now actually went down a little bit
03:56  <kaldari> godog: thanks for pinging me, looks like we might be on the way back to normal now.
03:57  <kaldari> I was looking at the slave lag when I first started running the script, but hadn't checked it since
03:58  <godog> kaldari: no worries, yeah it took about 30m to page I think
03:58  <kaldari> hmm, now it went back up again
03:59  <kaldari> godog: https://tendril.wikimedia.org/chart?hosts=db1028&vars=seconds_behind_master&mode=value
04:01  <godog> indeed, I'm comparing it with e.g. db1062 in grafana
04:01  <godog> the "write query stats" panel, which it dropped for db1062 but not db1028
04:04  <godog> looks like it might be still flushing to disk, writing a lot
04:04  <kaldari> yeah, maybe it still has a lot to catch up on
04:05 -icinga-wm:#wikimedia-operations- RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
04:06  <godog> there we go
04:07 -icinga-wm:#wikimedia-operations- RECOVERY - MariaDB Slave Lag: s7 on db1028 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
04:08  <godog> I'll wait a bit to make sure db1028 is ok

So it looks like lag increased only on db1028, which is the only non-master at weight 0.
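
As the log shows, the script is supposed to write in batches and wait for replication after every 1000 rows. Below is a minimal sketch of that batch-and-wait pattern in a MediaWiki maintenance script; it is not the actual CentralAuth code, and the class, table and column names are placeholders. Per the working theory above, the wait call does not help for a replica pooled at weight 0, such as db1028.

```php
<?php
// Hypothetical sketch of the batch-and-wait pattern described above.
// Not the real populateLocalAndGlobalIds.php; table/column names are made up.
$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class PopulateIdsSketch extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Sketch: back-fill a column in batches, waiting for replication' );
		$this->setBatchSize( 1000 ); // the batch size kaldari mentions
	}

	public function execute() {
		$dbw = $this->getDB( DB_MASTER );
		$lastId = 0;

		do {
			// Fetch the next batch of unpopulated rows (placeholder schema).
			$res = $dbw->select(
				'some_table',
				[ 'id' ],
				[ 'global_id' => null, 'id > ' . $dbw->addQuotes( $lastId ) ],
				__METHOD__,
				[ 'ORDER BY' => 'id', 'LIMIT' => $this->mBatchSize ]
			);

			foreach ( $res as $row ) {
				$dbw->update(
					'some_table',
					[ 'global_id' => 0 ], // placeholder value
					[ 'id' => $row->id ],
					__METHOD__
				);
				$lastId = $row->id;
			}

			// Throttle on replication lag before the next batch. Note: per the
			// discussion in this task, this does not wait for a replica pooled
			// at weight 0 (the vslow/dump host), so db1028 can still lag.
			wfWaitForSlaves();
			$this->output( "Processed up to id $lastId\n" );
		} while ( $res->numRows() === $this->mBatchSize );
	}
}

$maintClass = 'PopulateIdsSketch';
require_once RUN_MAINTENANCE_IF_MAIN;
```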

Event Timeline

Thanks guys for taking care of this.
A quick HW check (run just to rule out hardware problems) reveals no issues with db1028.

db1028 has weight 0, as Filippo was saying, which in that sense means "we do not care" about its lag, so the script most likely kept running because it wasn't checking that host.
Apart from having weight 0, it is serving dump and vslow queries, which are normally quite heavy and can add some extra load to the server.

At the time of the issue, there were also some slow queries running: https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=8ed568786388834db93cdbca5a7d074d&host=%5Edb1028&user=&schema=&hours=9 which probably didn't help the server.

Would it be possible to reduce the batch size?
I am not sure what the implications are of having the vslow and dump servers count towards lag (weight=1)... so I am not completely sure we can just set db1028 to "1" so that it is taken into consideration when running scripts.
Let's see what @jcrespo thinks.

@kaldari I do not see this long-running script referenced on https://wikitech.wikimedia.org/wiki/Deployments#Week_of_December_5th. This was a requirement agreed between Operations and Release Engineering precisely to avoid these issues. Check the email sent to the Operations, Engineering, and Wikitech lists called "[Engineering] Long running tasks/scripts now included on [[wikitech:Deployments]]", which is also explained at the top of the Deployments page (Long running tasks/scripts). Failing to do so can lead to repeated issues like this, which can eventually cause bad outages.

db1028 has weight 0 on purpose so that it can lag under stress (there is one such host per shard, corresponding to vslow/dumps), because the dumping process puts it under so much stress that it would otherwise make all the other slaves slow down. Sadly, there is currently no way to turn off the alerts for depooled servers: the depooling logic lives in a MediaWiki configuration file, not in the puppet repo, and we cannot temporarily disable it in icinga because that role is very dynamic.
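
For reference, a hedged sketch of how that looks in the MediaWiki load-balancer configuration (the shape follows wmf-config/db-eqiad.php, but the host names other than db1028 and all the weights below are illustrative): the vslow/dump host gets weight 0 in the general section loads, so it receives no normal read traffic and is allowed to lag, while the 'vslow' and 'dump' query groups are pinned to it.

```php
// Illustrative fragment only; real host names and weights differ.
$wgLBFactoryConf['sectionLoads']['s7'] = [
	'db10AA' => 0,   // master, listed first (weight 0 by convention)
	'db1028' => 0,   // vslow/dump host: weight 0, no general read traffic
	'db10BB' => 200,
	'db10CC' => 400,
];

// db1028 still serves the special query groups:
$wgLBFactoryConf['groupLoadsBySection']['s7'] = [
	'vslow' => [ 'db1028' => 1 ],
	'dump'  => [ 'db1028' => 1 ],
];
```

Because the host takes no general read traffic, lag on it has no user-facing impact, which is why it is acceptable for it to fall behind during dumps.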

There are several efforts to make this better: on the MediaWiki side, Aaron has been working on T95799 (which would cause the same issues, alerting-wise), and then there is the epic T149617, which would be needed to disable alerting while a server is not pooled.

For this specific event, I would disable the alert and continue the script, if it does not affect the other slaves.

Please coordinate with me to do so.

@jcrespo: Thanks for reminding me to list long-running script runs on the calendar. I had completely forgotten about that.

> For this specific event, I would disable the alert and continue the script, if it does not affect the other slaves.

We didn't see any lag increase on the other slaves, just db1028. I estimate it will take about 10 hours to finish running the script on English Wikipedia. Most of the other large wikis are already finished: German, Spanish, French, Italian, Commons (but not Russian, Polish, or Portuguese). The remaining wikis should take much less time, though. Russian Wikipedia, for example, only has 2 million users compared to 29 million on English Wikipedia. I would estimate 15 hours total to finish all the wikis.

I don't know anything about how the database alerts work. Is that something that you could handle? If not, maybe you could point me in the right direction.

As far as scheduling, I'm flexible on date and time. When is your earliest availability?

@kaldari, this doesn't have to be synchronous. Please schedule a time with some advance notice on the Deployments page, give a heads up here, and I will handle silencing the alert (downtime) for, e.g., 24 hours on the vslow s7 slave. I will also avoid any schema changes there during that time (that is the main reason for the coordination on the Deployments page).

@jcrespo: OK, I've scheduled it for 18:30 UTC on Monday. If that's not enough advance notice, let me know.

ok - I have downtimed db1028 from Monday 12 Dec starting at 18:30:00 UTC until Tuesday 13 Dec finishing at 18:30:00 UTC.

@kaldari What is the status? Has the script finished, or is it still running? This is to decide whether to make the maintenance window larger or to close this ticket as resolved.

@jcrespo: The script is still running, but I expect it to finish by the end of the window (10 hours from now).

@kaldari how is it going? Do you reckon it will end soon? Just asking to see whether we need to extend the downtime or not - there is still 1:30h of icinga downtime left.

@Marostegui: Apparently, it's taking a long time to complete the script on loginwiki, which I had completely forgotten about. Please extend the window by 12 hours if possible.

@kaldari ok! downtimed db1028 for another 12h

> @Marostegui: Apparently, it's taking a long time to complete the script on loginwiki, which I had completely forgotten about. Please extend the window by 12 hours if possible.

Sorry to bother you again - just let us know whether this still needs a further downtime extension or the script is done.
Cheers!

I have extended the downtime just in case, and to avoid bothering our US folks with a page.

@Marostegui: It's taking much longer than I expected. One thing that I've learned is that I'm terrible at estimating maintenance script run times :P The script is currently running on specieswiki, which is wiki 708 out of 901 in alphabetical order. I haven't seen any lag spikes since it finished running on enwiki though.

No worries - I have extended the downtime for the lag checks until Monday. However, if the script finishes before then, please let me know so I can remove the downtime.

kaldari claimed this task.

@Marostegui: The script is finished. Feel free to reinstate the lag checks.

Thanks for the heads up - I have now removed the downtimes.