Investigate if there is a replication issue for dbstore2001 and dbstore2002.
Description
Details
Subject | Repo | Branch | Lines +/-
--- | --- | --- | ---
Change replication filters on dbstore hosts to use one per line | operations/puppet | production | +8 -2
Related Objects
- Mentioned In
  - T106647: mariadb multi-source replication glitch with site_identifiers
  - rOMWC27b4e12f5b5c: Repool db2047 after maintenance
  - rOPUPb705e3a75e0c: Change replication filters on dbstore hosts to use one per line
  - rOMWCa6851485d061: Repool db2029; depool db2047 for maintenance
- Mentioned Here
  - M3: Structured data section on Commons
Event Timeline
Discarding obvious issues:
- Exec_Master_Log_Pos is growing
- The two masters on top of it have different server_id values
- log_slave_updates is enabled on db2029.codfw.wmnet
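The checks above correspond roughly to the following statements (a sketch; the 's7' connection name uses the MariaDB multi-source syntax seen elsewhere in this task):

-- Rough sketch of the checks listed above, run on dbstore2001/dbstore2002 and on the masters:
SHOW SLAVE 's7' STATUS\G            -- Exec_Master_Log_Pos should keep growing
SELECT @@global.server_id;          -- run on each master in the chain; values must differ
SELECT @@global.log_slave_updates;  -- must be 1 (ON) on the intermediate master db2029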
Changes on s3 and s1 are there; the ones on s7 are not, so it is not a global problem.
I can confirm a partial or total lack of replication working on dbstore2001 and dbstore2002:
root@db2029:~$ mysql hewiki
MariaDB PRODUCTION s7 localhost hewiki > create table __wmf_test(i int PRIMARY KEY);
Query OK, 0 rows affected (0.01 sec)

root@iron:~$ mysql -h db2040.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h db2047.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h db2054.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h dbstore2001.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
root@iron:~$ mysql -h dbstore2002.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
Confirmed the issue is only on these replication channels:
- db2029 -> dbstore2001 (s7)
- db2029 -> dbstore2002 (s7)
No other shards or servers are affected anywhere.
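For reference, on a multi-source replica like dbstore2001 all per-shard channels can be inspected in one go (a sketch, not taken from the original session):

SHOW ALL SLAVES STATUS\G   -- one block per connection (s1..s7, m3, ...); compare Connection_name,
                           -- Slave_IO_Running/Slave_SQL_Running and Exec_Master_Log_Pos per shard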
More confirmation:
root@iron:~$ mysql -h db2029.codfw.wmnet hewiki -e "select max(rev_id) FROM revision" && mysql -h dbstore2001.codfw.wmnet hewiki -e "select max(rev_id) FROM revision"
+-------------+
| max(rev_id) |
+-------------+
|    17165522 |
+-------------+
+-------------+
| max(rev_id) |
+-------------+
|    17006882 |
+-------------+
root@iron:~$ mysql -h db2029.codfw.wmnet hewiki -e "select max(rev_id) FROM revision" && mysql -h dbstore2001.codfw.wmnet hewiki -e "select max(rev_id) FROM revision"
+-------------+
| max(rev_id) |
+-------------+
|    17165525 |
+-------------+
+-------------+
| max(rev_id) |
+-------------+
|    17006882 |
+-------------+
The only problems on the error log are related to m4, not to s7.
Next step: restarting MySQL on dbstore2001 and dbstore2002. If you have time, @Springle, can you take it from here until I am back tomorrow?
Tried a restart of dbstore2002, but s7 replication behavior was unchanged: both replication threads show Yes/Yes, the master exec position keeps advancing, yet no changes are appearing.
Tried:
stop slave 's7';
set global s7.replicate_wild_do_table = '%wik%.%';
start slave 's7';
... and replication changes for s7 wikis started appearing again. Then:
stop slave 's7';
set global s7.replicate_wild_do_table = '%wik%.%,centralauth.%';
start slave 's7';
... and replication changes for both wikis and centralauth now appear normally (though with a massive gap in data, so s7 still needs reloading on dbstore200x).
Something odd to do with replication rules combined with multi-source replication? s1-6 have only %wik%.% while s7 has multiple rules %wik%.%,centralauth.%. The only other shard to have multiple rules is m3 (phabricator) with phab%.%,phlegal%.%; we should investigate and checksum it.
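A crude way to spot-check m3 for drift, short of a full pt-table-checksum run, is to compare per-table checksums on the m3 master and on dbstore200x (the table name below is only illustrative):

-- Run the same statement on both sides and compare the results
-- (any table matching the phab%.% / phlegal%.% filters will do):
CHECKSUM TABLE phabricator_repository.repository_commit;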
Additionally, for some reason, the rules appear to get doubled up for all shards after mysqld restart. Probably unrelated to this bug, since s1-6 seem unaffected.
S1: Replicate_Wild_Do_Table: %wik%.%,%wik%.%
S7: Replicate_Wild_Do_Table: %wik%.%,centralauth.%,%wik%.%,centralauth.%
M3: Replicate_Wild_Do_Table: phab%.%,phlegal%.%,phab%.%,phlegal%.%
Just a note here: the rules were already doubled before the restart.
The concern here is that if this is a bug, it may affect the dbstores in eqiad, but that is not happening right now. My hypothesis, thanks to your notes, would however be a configuration problem.
I do not think this is a valid conf:
s7.replicate_wild_do_table = '%wik%.%,centralauth.%'
I would say it has to be spread over several lines; otherwise the whole string is taken as a single filter.
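A minimal sketch of what "one filter per line" means for the generated my.cnf, assuming the connection-prefixed option style shown above (the actual puppet template layout may differ):

# Broken: one option line, the whole comma-joined string is treated as a single pattern
s7.replicate_wild_do_table = '%wik%.%,centralauth.%'

# Intended: repeat the option, one pattern per line
s7.replicate_wild_do_table = '%wik%.%'
s7.replicate_wild_do_table = 'centralauth.%'

m3.replicate_wild_do_table = 'phab%.%'
m3.replicate_wild_do_table = 'phlegal%.%'

With a single comma-joined value, the pattern never matches a real db.table name, which is consistent with nothing replicating on those channels.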
When I said it does not affect other shards, I had only tested s[1-7]; I would bet it happens for M3 too.
But dbstore1 has the same config. Either I am right and dbstore1 has not yet been restarted, or it is the bug you mentioned (multi-source + filtering).
Independently of that, I would:
- Get rid of filters (not easy, but doable); they are a source of all kinds of problems.
- Re-evaluate heartbeat or something similar (a rough sketch follows below).
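For the heartbeat idea, a minimal sketch of the usual approach (pt-heartbeat does this more robustly; the database and table names here are made up, and the heartbeat database must itself pass the replication filters to be useful):

-- On the master, a cron/event refreshes a timestamp row:
CREATE TABLE IF NOT EXISTS heartbeat.heartbeat (
  server_id INT UNSIGNED NOT NULL PRIMARY KEY,
  ts TIMESTAMP(6) NOT NULL
);
REPLACE INTO heartbeat.heartbeat VALUES (@@server_id, NOW(6));

-- On the replica, lag is the wall-clock age of the replicated timestamp;
-- unlike Seconds_Behind_Master, it keeps growing if events are silently filtered out or lost:
SELECT TIMESTAMPDIFF(SECOND, ts, NOW(6)) AS lag_seconds
FROM heartbeat.heartbeat;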
I confirm the issue is on m3 and s7, and it is due to not setting the replication filters over several lines.
Requires a puppet patch and a reload of m3 and s7.
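For reference, the per-channel reload on the dbstores would look roughly like this (host, binlog file and position are placeholders, not values from the actual maintenance):

STOP SLAVE 's7';
-- reimport the s7 databases from a fresh dump of db2029, then repoint the channel:
CHANGE MASTER 's7' TO
  MASTER_HOST = 'db2029.codfw.wmnet',
  MASTER_LOG_FILE = '<binlog file recorded by the dump>',
  MASTER_LOG_POS  = <binlog position recorded by the dump>;
START SLAVE 's7';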
Import finished and replication restarted for s7.
As a side note, Icinga seems to be OK with large lags for the dbstores:
"OK 2015-07-03 07:00:40 0d 15h 12m 4s 1/3 OK slave_sql_lag Seconds_Behind_Master: 52865"
Change 222562 had a related patch set uploaded (by Jcrespo):
Change replication filters on dbstore hosts to use one per line
Change 222562 merged by Jcrespo:
Change replication filters on dbstore hosts to use one per line