
Tables created on the s7 master have not been replicated to dbstore2001 and dbstore2002. Replication issue?
Closed, Resolved · Public

Description

Investigate if there is a replication issue for dbstore2001 and dbstore2002.

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added a project: DBA.
jcrespo added subscribers: jcrespo, Springle.

Discarding obvious issues (see the checks sketched after this list):

  • Exec_Master_Log_Pos is growing
  • The two masters on top of it have different server_ids
  • log_slave_updates is enabled on db2029.codfw.wmnet
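
A minimal sketch of how these basics can be checked, assuming MariaDB multi-source replication with a connection named 's7' on the dbstore hosts:

-- on dbstore2001 / dbstore2002: per-connection status for s7
SHOW SLAVE 's7' STATUS\G
-- fields of interest: Slave_IO_Running, Slave_SQL_Running, Exec_Master_Log_Pos

-- on db2029 and on the master above it: the server_id values must differ
SELECT @@server_id;

-- on db2029: must be ON so that replicated writes are written to its own binlog
SELECT @@log_slave_updates;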

Changes on s3 and s1 are there; the ones on s7 are not, so it is not a global problem.

I can confirm that replication to dbstore2001 and dbstore2002 is partially or completely broken:

root@db2029:~$ mysql hewiki
MariaDB PRODUCTION s7 localhost hewiki > create table __wmf_test(i int PRIMARY KEY);
Query OK, 0 rows affected (0.01 sec)

root@iron:~$ mysql -h db2040.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h db2047.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h db2054.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h dbstore2001.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
root@iron:~$ mysql -h dbstore2002.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"

Confirmed that the issue affects only these replication channels:

  • db2029 -> dbstore2001 (s7)
  • db2029 -> dbstore2002 (s7)

No other shards or servers are affected anywhere.

More confirmation:

root@iron:~$ mysql -h db2029.codfw.wmnet hewiki -e "select max(rev_id) FROM revision" && mysql -h dbstore2001.codfw.wmnet hewiki -e "select max(rev_id) FROM revision"
+-------------+
| max(rev_id) |
+-------------+
|    17165522 |
+-------------+
+-------------+
| max(rev_id) |
+-------------+
|    17006882 |
+-------------+
root@iron:~$ mysql -h db2029.codfw.wmnet hewiki -e "select max(rev_id) FROM revision" && mysql -h dbstore2001.codfw.wmnet hewiki -e "select max(rev_id) FROM revision"
+-------------+
| max(rev_id) |
+-------------+
|    17165525 |
+-------------+
+-------------+
| max(rev_id) |
+-------------+
|    17006882 |
+-------------+

The only problems in the error log are related to m4, not to s7.

Next step: restart mysql on dbstore2001 and dbstore2002. If you have time, @Springle, can you take it from here until my tomorrow?

Tried a restart of dbstore2002, but s7 replication behavior was unchanged: Yes/Yes for replication threads, master exec position advancing, yet no changes appearing.
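
For reference, a sketch of the status fields that looked healthy on dbstore2002 despite no s7 data arriving (connection name 's7' assumed; values illustrative rather than a verbatim capture):

SHOW SLAVE 's7' STATUS\G
-- Slave_IO_Running:    Yes
-- Slave_SQL_Running:   Yes
-- Exec_Master_Log_Pos: keeps advancing between runs
-- ...while new tables and rows from db2029 still never show up in the s7 wikis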

Tried:

stop slave 's7';
set global s7.replicate_wild_do_table = '%wik%.%';
start slave 's7';

... and replication changes for s7 wikis started appearing again. Then:

stop slave 's7';
set global s7.replicate_wild_do_table = '%wik%.%,centralauth.%';
start slave 's7';

... and replication changes for both wikis and centralauth now appear normally (though with a massive gap in data, so s7 still needs reloading on dbstore200x).

Something odd to do with replication filter rules combined with multi-source replication? s1-s6 have only %wik%.%, while s7 has multiple rules: %wik%.%,centralauth.%. The only other shard with multiple rules is m3 (phabricator), with phab%.%,phlegal%.%; we should investigate and checksum it.

Additionally, for some reason, the rules appear to get doubled up for all shards after mysqld restart. Probably unrelated to this bug, since s1-6 seem unaffected.

S1: Replicate_Wild_Do_Table: %wik%.%,%wik%.%
S7: Replicate_Wild_Do_Table: %wik%.%,centralauth.%,%wik%.%,centralauth.%
M3: Replicate_Wild_Do_Table: phab%.%,phlegal%.%,phab%.%,phlegal%.%
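
A sketch of one way to pull those values for every connection at once, using MariaDB's SHOW ALL SLAVES STATUS (host chosen as an example):

root@iron:~$ mysql -h dbstore2001.codfw.wmnet -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Replicate_Wild_Do_Table'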

Just a note here: the rules were already doubled before the restart.

The concern here is that, if this is a bug, it may affect the dbstores in eqiad, but that is not happening right now. My hypothesis, thanks to your notes, would however be a configuration problem.

I do not think this is a valid configuration:
s7.replicate_wild_do_table = '%wik%.%,centralauth.%'
I would say it has to be spread over several lines (otherwise it is taken as a single filter pattern).
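
A sketch of what the corrected configuration fragment could look like, with one pattern per line and per connection (the exact option spelling and quoting used in the puppet template may differ):

s7.replicate_wild_do_table = %wik%.%
s7.replicate_wild_do_table = centralauth.%
m3.replicate_wild_do_table = phab%.%
m3.replicate_wild_do_table = phlegal%.%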

When I said it does not affect other shards, I had only tested s[1-7]; I would bet it happens for m3 too.

But dbstore1 has the same config. Either I am right and that dbstore has not yet been restarted, or it is the bug you mentioned (multi-source + filtering).

Independently of that, I would:

  • Get rid of filters (not easy, but doable); they are a source of all kinds of problems.
  • Re-evaluate heartbeat or something similar (see the sketch below).
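
On the heartbeat point, the idea (pt-heartbeat style; the database and table names below are illustrative) is that the master updates a row every second, so replicas can measure lag from actual data flow rather than from Seconds_Behind_Master, which can look healthy even while filtered-out data silently diverges:

-- on the master, executed periodically (the real tool writes an explicit
-- timestamp instead of NOW() to be safe with statement-based replication):
UPDATE heartbeat.heartbeat SET ts = NOW() WHERE server_id = @@server_id;

-- on each replica, the observed lag in seconds:
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag_seconds FROM heartbeat.heartbeat;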

I confirm the issue is on m3 and s7, and it is due to not setting the replication filters over several lines.

Requires a puppet patch and a reload of m3 and s7.

Import of s7 is ongoing for dbstore2001 and dbstore2002.

Import finished and replication restarted for s7.

As a side note, Icinga seems to be OK with large lags on the dbstores:

"OK 2015-07-03 07:00:40 0d 15h 12m 4s 1/3 OK slave_sql_lag Seconds_Behind_Master: 52865"

Now importing s3 on dbstore2001.

s3 replication started on dbstore2001. Only puppet remains to be fixed now, I think.

Change 222562 had a related patch set uploaded (by Jcrespo):
Change replication filters on dbstore hosts to use one per line

https://gerrit.wikimedia.org/r/222562

Change 222562 merged by Jcrespo:
Change replication filters on dbstore hosts to use one per line

https://gerrit.wikimedia.org/r/222562

Good to know we discovered this before it hit production!