
Tables created on the s7 master have not been replicated to dbstore2001 and dbstore2002. Replication issue?
Closed, Resolved · Public

Description

Investigate if there is a replication issue for dbstore2001 and dbstore2002.

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added a project: DBA.
jcrespo added subscribers: jcrespo, Springle.

Discarding obvious issues (see the checks sketched after this list):

  • Exec_Master_Log_Pos is growing
  • The two masters on top of it have different server_ids
  • log_slave_updates is enabled on db2029.codfw.wmnet
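
A minimal sketch of how these basics can be checked, assuming MariaDB multi-source replication with a connection named 's7' on the dbstore hosts:

-- on dbstore2001 / dbstore2002: per-connection status for s7
SHOW SLAVE 's7' STATUS\G
-- fields of interest: Slave_IO_Running, Slave_SQL_Running, Exec_Master_Log_Pos

-- on db2029 and on the master above it: the server_id values must differ
SELECT @@server_id;

-- on db2029: must be ON so that replicated writes are written to its own binlog
SELECT @@log_slave_updates;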

Changes on s3 and s1 are there; the ones on s7 are not, so it is not a global problem.

I can confirm that replication to dbstore2001 and dbstore2002 is partially or completely broken:

root@db2029:~$ mysql hewiki
MariaDB PRODUCTION s7 localhost hewiki > create table __wmf_test(i int PRIMARY KEY);
Query OK, 0 rows affected (0.01 sec)

root@iron:~$ mysql -h db2040.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h db2047.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h db2054.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
+-------------------------------+
| Tables_in_hewiki (__wmf_test) |
+-------------------------------+
| __wmf_test                    |
+-------------------------------+
root@iron:~$ mysql -h dbstore2001.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"
root@iron:~$ mysql -h dbstore2002.codfw.wmnet hewiki -e "SHOW TABLES like '__wmf_test'"

Confirmed that the issue affects only these replication channels:

  • db2029 -> dbstore2001 (s7)
  • db2029 -> dbstore2002 (s7)

No other shards or servers are affected anywhere.

More confirmation:

root@iron:~$ mysql -h db2029.codfw.wmnet hewiki -e "select max(rev_id) FROM revision" && mysql -h dbstore2001.codfw.wmnet hewiki -e "select max(rev_id) FROM revision"
+-------------+
| max(rev_id) |
+-------------+
|    17165522 |
+-------------+
+-------------+
| max(rev_id) |
+-------------+
|    17006882 |
+-------------+
root@iron:~$ mysql -h db2029.codfw.wmnet hewiki -e "select max(rev_id) FROM revision" && mysql -h dbstore2001.codfw.wmnet hewiki -e "select max(rev_id) FROM revision"
+-------------+
| max(rev_id) |
+-------------+
|    17165525 |
+-------------+
+-------------+
| max(rev_id) |
+-------------+
|    17006882 |
+-------------+

The only problems in the error log are related to m4, not to s7.

Next step: restart mysql on dbstore2001 and dbstore2002. If you have time, @Springle, can you take it from here until my tomorrow?

Tried a restart of dbstore2002, but s7 replication behavior was unchanged: Yes/Yes for replication threads, master exec position advancing, yet no changes appearing.
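
For reference, a sketch of the status fields that looked healthy on dbstore2002 despite no s7 data arriving (connection name 's7' assumed; values illustrative rather than a verbatim capture):

SHOW SLAVE 's7' STATUS\G
-- Slave_IO_Running:    Yes
-- Slave_SQL_Running:   Yes
-- Exec_Master_Log_Pos: keeps advancing between runs
-- ...while new tables and rows from db2029 still never show up in the s7 wikis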

Tried:

stop slave 's7';
set global s7.replicate_wild_do_table = '%wik%.%';
start slave 's7';

... and replication changes for s7 wikis started appearing again. Then:

stop slave 's7';
set global s7.replicate_wild_do_table = '%wik%.%,centralauth.%';
start slave 's7';

... and replication changes for both wikis and centralauth now appear normally (though with a massive gap in data, so s7 still needs reloading on dbstore200x).

Something odd to do with replication filter rules combined with multi-source replication? s1-s6 have only %wik%.%, while s7 has multiple rules: %wik%.%,centralauth.%. The only other shard with multiple rules is m3 (phabricator), with phab%.%,phlegal%.%; we should investigate and checksum it.

Additionally, for some reason, the rules appear to get doubled up for all shards after mysqld restart. Probably unrelated to this bug, since s1-6 seem unaffected.

S1: Replicate_Wild_Do_Table: %wik%.%,%wik%.%
S7: Replicate_Wild_Do_Table: %wik%.%,centralauth.%,%wik%.%,centralauth.%
M3: Replicate_Wild_Do_Table: phab%.%,phlegal%.%,phab%.%,phlegal%.%
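
A sketch of one way to pull those values for every connection at once, using MariaDB's SHOW ALL SLAVES STATUS (host chosen as an example):

root@iron:~$ mysql -h dbstore2001.codfw.wmnet -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Replicate_Wild_Do_Table'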

Just a note here: the rules were already doubled before the restart.

The concern here is that, if this is a bug, it may affect the dbstores in eqiad, but that is not happening right now. My hypothesis, thanks to your notes, would however be a configuration problem.

I do not think this is a valid configuration:
s7.replicate_wild_do_table = '%wik%.%,centralauth.%'
I would say it has to be spread over several lines (otherwise it is taken as a single filter pattern).
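
A sketch of what the corrected configuration fragment could look like, with one pattern per line and per connection (the exact option spelling and quoting used in the puppet template may differ):

s7.replicate_wild_do_table = %wik%.%
s7.replicate_wild_do_table = centralauth.%
m3.replicate_wild_do_table = phab%.%
m3.replicate_wild_do_table = phlegal%.%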

When I said it does not affect other shards, I had only tested s[1-7]; I would bet it happens for m3 too.

But dbstore1 has the same config. Either I am right and that dbstore has not yet been restarted, or it is the bug you mentioned (multi-source + filtering).

Independently of that, I would:

  • Get rid of filters (not easy, but doable); they are a source of all kinds of problems.
  • Re-evaluate heartbeat or something similar (see the sketch below).
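
On the heartbeat point, the idea (pt-heartbeat style; the database and table names below are illustrative) is that the master updates a row every second, so replicas can measure lag from actual data flow rather than from Seconds_Behind_Master, which can look healthy even while filtered-out data silently diverges:

-- on the master, executed periodically (the real tool writes an explicit
-- timestamp instead of NOW() to be safe with statement-based replication):
UPDATE heartbeat.heartbeat SET ts = NOW() WHERE server_id = @@server_id;

-- on each replica, the observed lag in seconds:
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag_seconds FROM heartbeat.heartbeat;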

I confirm the issue is on m3 and s7, and it is due to not setting the replication filters over several lines.

Requires a puppet patch and a reload of m3 and s7.

Import of s7 is ongoing for dbstore2001 and dbstore2002.

Import finished and replication restarted for s7.

As a side note, Icinga seems to be OK with large lags on the dbstores:

"OK 2015-07-03 07:00:40 0d 15h 12m 4s 1/3 OK slave_sql_lag Seconds_Behind_Master: 52865"

Now importing s3 on dbstore2001.

s3 replication started on dbstore2001. Only puppet remains to be fixed now, I think.

Change 222562 had a related patch set uploaded (by Jcrespo):
Change replication filters on dbstore hosts to use one per line

https://gerrit.wikimedia.org/r/222562

Change 222562 merged by Jcrespo:
Change replication filters on dbstore hosts to use one per line

https://gerrit.wikimedia.org/r/222562

Good to know we discovered this before it hit production!