
x2-mainstash replication lag (25min Codfw, disk saturated)
Closed, ResolvedPublic

Description

Splitting off from:

A few hours after the last part of the roll-out for T113916, @jcrespo pointed out there was a significant replication lag spike in Codfw. It didn't occur to me to proactively look for replication issues, and by the time we found it, it had already subsided.

Not only that, by the time I signed off for the day, the first daily peak on the new backend was starting to form, and replag stayed nicely at zero:

Screenshot 2022-07-12 at 12.47.56.png (1×1 px, 172 KB)

Event Timeline

The next day, towards the actual peak, replication did start to become an issue in the secondary DC, again, and this time it stayed:

Screenshot 2022-07-12 at 14.48.20.png (1×1 px, 248 KB)

I tried to infer from the Write Query Stats and Write Handler Stats in Grafana which of the MainStash/SqlBagOStuff write operations might be responsible for this, but came up short. Best I can tell, there are no UPDATE queries, and all add/cas/set/incr/delete operations are represented as INSERT ... ON DUPLICATE KEY UPDATE. My assumption at this point is that the "write query stats" from the MariaDB Prometheus collector represent what the query effectively ended up doing: INSERT if it resulted in a created row, and UPDATE if it changed an existing row.
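
For illustration, such an upsert has roughly this shape (a sketch only; the column names follow the stock MediaWiki objectcache schema, the table name and literal values are made up, and this is not the exact query SqlBagOStuff emits):

```sql
-- Sketch of the upsert pattern described above; values are made up.
INSERT INTO objectstash (keyname, value, exptime)
VALUES ('enwiki:example-key', '...', '20220712000000')
ON DUPLICATE KEY UPDATE
  value = VALUES(value),
  exptime = VALUES(exptime);
```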

@Ladsgroup supplied the db2142 binlog to me on mwmaint1002 for analysis:

$ cat binlog | grep -ao 'SqlBagOStuff::[a-zA-Z]*' | sort | uniq -c | sort -n
    163 SqlBagOStuff::modifyTableSpecificBlobsForAdd
    206 SqlBagOStuff::deleteServerObjectsExpiringBefore
    254 SqlBagOStuff::modifyTableSpecificBlobsForIncrInit
    751 SqlBagOStuff::modifyTableSpecificBlobsForDelete
   5829 SqlBagOStuff::modifyTableSpecificBlobsForCas
  27570 SqlBagOStuff::modifyTableSpecificBlobsForSet
 256878 SqlBagOStuff::modifyTableSpecificBlobsForChangeTTL

modifyTableSpecificBlobsForChangeTTL
# 'enwiki:ResourceLoaderModule-dependencies###

modifyTableSpecificBlobsForSet
# 'enwiki:ResourceLoaderModule-dependencies###

modifyTableSpecificBlobsForCas
# 'enwiki:abusefilter:emergency:filter:###

modifyTableSpecificBlobsForDelete
# 'global:loginnotify:new:###','','20220712###
# 'ruwiki:captcha:###
# 'global:watchlist-recent-updates:metawiki:

modifyTableSpecificBlobsForIncrInit
# 'zhwiki:abusefilter:throttle:###:###','1','202207###'
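
For a sense of scale, the same counts expressed as a share of all captured events (a quick awk pass over the numbers copied from the grep output above):

```shell
# Share of x2 binlog events per SqlBagOStuff caller; counts copied from
# the grep output above.
breakdown="$(printf '%s\n' \
  '163 modifyTableSpecificBlobsForAdd' \
  '206 deleteServerObjectsExpiringBefore' \
  '254 modifyTableSpecificBlobsForIncrInit' \
  '751 modifyTableSpecificBlobsForDelete' \
  '5829 modifyTableSpecificBlobsForCas' \
  '27570 modifyTableSpecificBlobsForSet' \
  '256878 modifyTableSpecificBlobsForChangeTTL' |
awk '{count[$2]=$1; total+=$1}
     END {for (k in count) printf "%6.1f%%  %s\n", 100*count[k]/total, k}' |
sort -rn)"
printf '%s\n' "$breakdown"
```

ChangeTTL alone is ~88% of write events, consistent with the ResourceLoaderModule-dependencies keys dominating the samples above.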

The question on my mind was: Why is it that the Eqiad servers (both the primary and two replicas) seem fine, but the Codfw one gets lagged? There are no multi-statement transactions here. SqlBagOStuff runs entirely in auto-commit mode, including its async garbage collection cycles. And the individual write queries, as far as I know, each operate on a single row with some very unusual upsert-like logic, but otherwise nothing that would make them more complex to apply on one server than on another.

<Krinkle> do we have different replication modes intra vs inter-dc? e.g. they are both statement based right? for x2?
<Amir1> both are row-based

This seems problematic, as the design (with some details at T274174) depends on statement-based replication to ensure eventual consistency under the multi-master bi-directional model. This is fine for now, though, because 1) we're not originating writes from Codfw yet, and 2) to my knowledge, row-based replication actually makes query complexity less of a potential factor in the cause of the current lag we see.

ACTION: Ensure x2 replication is statement-based (at least between the primary DBs).

Switching to looking at hardware. Grafana: Host overview - db2142

Screenshot 2022-07-12 at 15.07.35.png (426×1 px, 71 KB)

This suggests to me the problem isn't query-related, in the sense that it isn't about the logical content of the queries but their sheer volume. I don't know yet whether it's binlog file writes or internal database writes, but either way, the disk is unable to keep up.

Interestingly, this is actually the case on the Eqiad host as well. Grafana: Host overview - db1151

Screenshot 2022-07-12 at 15.07.18.png (436×1 px, 69 KB)

Change 813120 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Remove DependencyStore::renew

https://gerrit.wikimedia.org/r/813120

The question on my mind was: Why is it that the Eqiad servers (both the primary and two replicas) seem fine, but the Codfw one gets lagged?

Disk utilisation on db1152 reached ~92% so it was probably pretty close to lagging. The disk write rate achieved was very similar between db2142 and db1152, on both the MySQL IOPS metric and bytes written.

These servers each have 10 SSDs. I am curious to know whether ~5k IOPS is a reasonable figure to expect out of such an array.

Change 813120 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Remove DependencyStore::renew

https://gerrit.wikimedia.org/r/813120

I tried the fsync loop benchmark suggested at https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/ . db2142 is fast by the standards of that article, ~40µs per iteration, and only 10µs of that is system time. So the RAID controller is doing its job. I can even see with megacli that the BBU voltage is 3.881V.
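
A quick way to reproduce a similar measurement without the article's fsync loop program is dd with oflag=dsync, which forces a flush on every write (not identical to an O_DIRECT fsync loop, but the same order of magnitude):

```shell
# Time 1000 synchronous 4K writes; dd (GNU coreutils) prints the elapsed
# time and throughput on its last line. oflag=dsync flushes each write,
# approximating one fsync per iteration.
result="$(dd if=/dev/zero of=./fsync-test bs=4k count=1000 oflag=dsync 2>&1 | tail -n 1)"
rm -f ./fsync-test
echo "$result"
```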

I wondered why the disk saturation metric on the host overview was zero even when the server was clearly overloaded. It should have been at least 1. It comes from field 9 of /proc/diskstats. My suspicion is that the metric is not correct when async IO is used.
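
For reference, assuming the panel reads /proc/diskstats, per-device stats field 9 (column 12 overall, after major/minor/device name) is "I/Os currently in progress"; illustrated here on a made-up sample line:

```shell
# Made-up /proc/diskstats line; column 12 (per-device stats field 9) is the
# in-flight I/O count that the saturation panel would graph.
line='   8       0 sda 123 4 5678 90 111 22 3333 44 7 55 66'
inflight="$(echo "$line" | awk '{ print $12 }')"
echo "sda in-flight: $inflight"
```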

5000 IOPS with a RAID controller running in write-back (WB) mode is quite possibly close to its max. I never thought we'd be doing so many.

Change 813670 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@wmf/1.39.0-wmf.19] ResourceLoader: Remove DependencyStore::renew

https://gerrit.wikimedia.org/r/813670

objectstash.ibd is only 532MB and most of the data was migrated from Redis. The server has 500GB of RAM. We could put the whole data directory except for the binlogs on a tmpfs.

Of course DependencyStore::renew was inefficient and its removal was warranted. I'm just saying, if we wanted to optimise x2, a fully in-memory MySQL would be one way to do it.

Change 813670 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.19] ResourceLoader: Remove DependencyStore::renew

https://gerrit.wikimedia.org/r/813670

The disk utilization has increased but not too much (at 70%):
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1151&var-datasource=thanos&var-cluster=mysql&from=1657552473013&to=1657789995779&viewPanel=6

We can probably take a look at improving write patterns even further, but I'm not sure it's worth it.

As I mentioned on IRC, I would prefer not to make these hosts special within their partitioning scheme and configuration-wise; that is, I would not like to have to run these hosts on tmpfs, as that would make them a very special snowflake (more than they already are in terms of config).
We've worked hard to eliminate snowflakes from our infra (especially HW-related ones), as they are very painful to maintain on a daily basis, making them very error-prone.

I would prefer to see if we can tweak sync_binlog and/or innodb_flush_log_at_trx_commit (which would already require having special config files for this section, but I would be OK with that).

Just ran the following on db2142:

root@db2142.codfw.wmnet[(none)]> set global sync_binlog=0;
Query OK, 0 rows affected (0.031 sec)

root@db2142.codfw.wmnet[(none)]> set global innodb_flush_log_at_trx_commit = 0;
Query OK, 0 rows affected (0.030 sec)

Codesearch: sync_binlog OR innodb_flush_log_at_trx_commit

https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#sync_binlog
https://mariadb.com/docs/reference/mdb/system-variables/innodb_flush_log_at_trx_commit/

The upstream default appears to be sync_binlog=0 as well, which means that instead of forcing a flush to disk every N transactions, it is left to the operating system to flush as it normally would. In production we seem to set this to 1 on most hosts, except for parsercache where it was already set to 0.

The upstream default for innodb_flush_log_at_trx_commit, however, is already 1, and thus more conservative. It is also set that way in most of prod, and, like the other setting, already set to 0 for parsercache.

As I understand it, this does not mean that binlogs become more ephemeral in general, but rather that roughly the "last second" of writes may be lost during a crash, which is fine for the purpose of MainStash.

@Marostegui works for me. Do you need me to puppetize that?

There is a patch already, but it's not done yet; I'm picking it up since Manuel is quite busy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/813917
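
Persisted, the mitigation would look roughly like this in the section's my.cnf (a sketch only; the actual change lives in the puppet patch above):

```ini
# Sketch -- the real config is managed by puppet for the x2 section.
# Both settings trade up to ~1s of durability on crash for far fewer fsyncs.
[mysqld]
sync_binlog                    = 0
innodb_flush_log_at_trx_commit = 0
```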

Krinkle triaged this task as High priority.

By the way, I made this change live everywhere so the mitigation is in place now. The puppet patch will make things consistent if a host gets rebooted, but the initial triage is done:

root@cumin1001:~# /home/marostegui/section x2 | while read host port; do echo "==== $host:$port ===="; db-mysql $host:$port -e "show global variables like '%sync_binlog%'; show global variables like 'innodb_flush_log_at_trx_commit'";done
==== db2144.codfw.wmnet:3306 ====
Variable_name	Value
sync_binlog	0
Variable_name	Value
innodb_flush_log_at_trx_commit	0
==== db2143.codfw.wmnet:3306 ====
Variable_name	Value
sync_binlog	0
Variable_name	Value
innodb_flush_log_at_trx_commit	0
==== db2142.codfw.wmnet:3306 ====
Variable_name	Value
sync_binlog	0
Variable_name	Value
innodb_flush_log_at_trx_commit	0
==== db1153.eqiad.wmnet:3306 ====
Variable_name	Value
sync_binlog	0
Variable_name	Value
innodb_flush_log_at_trx_commit	0
==== db1152.eqiad.wmnet:3306 ====
Variable_name	Value
sync_binlog	0
Variable_name	Value
innodb_flush_log_at_trx_commit	0
==== db1151.eqiad.wmnet:3306 ====
Variable_name	Value
sync_binlog	0
Variable_name	Value
innodb_flush_log_at_trx_commit	0

The upstream default appears to be sync_binlog=0 as well, which means that instead of forcing a flush to disk every N transactions, it is left to the operating system to flush as it normally would. In production we seem to set this to 1 on most hosts, except for parsercache where it was already set to 0.
[...]
As I understand it, this does not mean that binlogs become more ephemeral in general, but rather that roughly the "last second" of writes may be lost during a crash, which is fine for the purpose of MainStash.

FYI, if someone (e.g. @Krinkle) wants to dive deeper into the consequences of non-fully-reliable binlog settings: a friend of Manuel's did a full analysis of that in a MySQL/MariaDB presentation I saw at a Percona conference, and I think this is the equivalent of it in blog post form. Some of the behavior may be surprising/not great:

I skimmed those blog posts.

x2 has a few properties that help with crash recovery:

  • The whole data set is only 500MB so recloning is trivial.
  • Any single server can serve the full traffic load, so replication is not required during incident recovery.
  • After about 3 hours of downtime, there is no value in recovering the data; you may as well just start from empty.