
Checksum data on s7
Closed, Resolved · Public

Description

Run mydumper+diff on s7

The following databases are placed there:

  • arwiki (fixed inconsistencies on: archive, tag_summary, change_tag)
  • cawiki (fixed inconsistencies on: change_tag, tag_summary, user_newtalk)
  • centralauth (no inconsistencies)
  • eswiki (fixed inconsistencies on: change_tag, tag_summary)
  • fawiki (fixed inconsistencies on: change_tag, tag_summary)
  • frwiktionary (fixed inconsistencies on: change_tag, tag_summary)
  • hewiki (fixed inconsistencies on: archive, change_tag, tag_summary)
  • huwiki (fixed inconsistencies on: change_tag, tag_summary)
  • kowiki (fixed inconsistencies on: change_tag, tag_summary)
  • metawiki (inconsistencies on: change_tag, tag_summary)
  • rowiki (inconsistencies on: archive, change_tag, tag_summary)
  • ukwiki (inconsistencies on: change_tag, tag_summary, user_newtalk)
  • viwiki (inconsistencies on: change_tag, tag_summary, user_newtalk)
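The mydumper+diff method referenced above amounts to dumping each table from two hosts and diffing the dump files. A minimal sketch of the idea, using in-memory sqlite3 databases as stand-ins for two replicas (the table name and row values here are hypothetical, not actual production data):

```python
import difflib
import sqlite3

def dump_table(conn, table):
    """Return the table's rows as sorted text lines, mimicking a per-table dump."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    return [repr(r) for r in rows]

def make_host(rows):
    """Build an in-memory database standing in for one replica."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tag_summary (ts_id INTEGER PRIMARY KEY, ts_tags TEXT)")
    conn.executemany("INSERT INTO tag_summary VALUES (?, ?)", rows)
    return conn

host_a = make_host([(1, "visualeditor"), (2, "mobile edit")])
host_b = make_host([(1, "visualeditor"), (2, "mobile web edit")])  # drifted row

# A non-empty diff flags the drifted rows, just as diffing two mydumper files would.
diff = list(difflib.unified_diff(dump_table(host_a, "tag_summary"),
                                 dump_table(host_b, "tag_summary"),
                                 lineterm=""))
print("\n".join(diff))
```

On the real hosts the equivalent is dumping each wiki's tables with mydumper and running diff over the resulting .sql files, which is what produced the per-table results listed above.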

Situation with db1041:

db1041's data has been checksummed against db1079. For those differences (see T163190#3540214), the data has been checked across all the hosts in eqiad and codfw. db1062 and db1041 have the same data, but the rest of the hosts have different data (consistent amongst themselves), meaning that we are serving the same content for reads: db1041 has been depooled for a long time, and db1062 is the master, which doesn't serve reads.
With that in mind, we should probably decommission db1041, as its data still exists on db1062 (which needs to be failed over at some point due to T172459). And if we decide the differences are, in the end, worth the time to fix, we could use that host as the source for running the changes needed.

db1041's data is at: dbstore1001.eqiad.wmnet:/srv/tmp/db1041.tar.gz

Event Timeline


I have finished with T165743, so I am going to attempt to run pt-table-checksum on frwiktionary again.

Mentioned in SAL (#wikimedia-operations) [2017-05-23T15:50:05Z] <marostegui> Stop replication on dbstore1002 s7 thread for maintenance - T163190

dbstore1002 was missed on T165743, so it messed up the checksum when it reached the revision table. I have fixed it and am now waiting for dbstore1002 to catch up before starting the run again.

And finally frwiktionary is done.
The only difference (among all the hosts) is on the archive table.

This shard is ready for compare.py to run.

Change 372830 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/372830

Change 372830 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/372830

Mentioned in SAL (#wikimedia-operations) [2017-08-21T12:52:16Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T163190 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2017-08-21T12:52:46Z] <marostegui> Stop replication on db1079 and db1041 to compare their data - T163190

Mentioned in SAL (#wikimedia-operations) [2017-08-21T14:52:33Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 - T163190 (duration: 00m 44s)

I am checking db1041 against db1079, although it looks like this shard is pretty consistent across slaves as per the comments on this task.

s7 is an interesting case. All the slaves look consistent among themselves.
However, for the differences found between db1041 and db1079 (e.g. arwiki.tag_summary), all the slaves look the same, while db1041 (old master 2) and db1062 (current master) match each other. db1033 (old master) even looks like the rest of the slaves.
So db1041 and db1062 do have the same data, whereas the old master db1033 and the rest of the slaves (including codfw) have the same data among them.
In some cases, db1034 has the same data as db1062 (master) and db1041 (old master 2).

These are the tables that differ:

arwiki.archive.sql.gz
arwiki.change_tag.sql.gz
arwiki.tag_summary.sql.gz
arwiki.user_newtalk.sql.gz
cawiki.archive.sql.gz
cawiki.change_tag.sql.gz
cawiki.tag_summary.sql.gz
cawiki.user_newtalk.sql.gz
eswiki.archive.sql.gz
eswiki.change_tag.sql.gz
eswiki.tag_summary.sql.gz
eswiki.user_newtalk.sql.gz
fawiki.change_tag.sql.gz
fawiki.tag_summary.sql.gz
fawiki.user_newtalk.sql.gz
frwiktionary.archive.sql.gz
frwiktionary.change_tag.sql.gz
frwiktionary.tag_summary.sql.gz
frwiktionary.user_newtalk.sql.gz
hewiki.change_tag.sql.gz
hewiki.tag_summary.sql.gz
hewiki.user_newtalk.sql.gz
huwiki.change_tag.sql.gz
huwiki.tag_summary.sql.gz
huwiki.user_newtalk.sql.gz
kowiki.archive.sql.gz
kowiki.change_tag.sql.gz
kowiki.tag_summary.sql.gz
kowiki.user_newtalk.sql.gz
metawiki.archive.sql.gz
metawiki.change_tag.sql.gz
metawiki.tag_summary.sql.gz
metawiki.user_newtalk.sql.gz
rowiki.archive.sql.gz
rowiki.change_tag.sql.gz
rowiki.tag_summary.sql.gz
rowiki.user_newtalk.sql.gz
ukwiki.change_tag.sql.gz
ukwiki.tag_summary.sql.gz
ukwiki.user_newtalk.sql.gz
viwiki.change_tag.sql.gz
viwiki.tag_summary.sql.gz
viwiki.user_newtalk.sql.gz

The differences are not really impactful, e.g. (tag_summary):

-(xxx,xxx,NULL,x,"visualeditor"),
-(xxx,xxx,NULL,xxx,"mobile edit,mobile web edit"),
+(xxx,xxx,NULL,xxx,"mobile edit,mobile web edit"),
+(xxx,xxx,NULL,xxx,"xxvisualeditor"),
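Drifts like the one above were typically repaired by replacing the drifted rows on the divergent host with the rows from the reference host. A sketch of that repair step, using in-memory sqlite3 databases as stand-ins (the table, column, and row values are hypothetical):

```python
import sqlite3

def fix_drift(reference, replica, table, pk, pk_values):
    """Overwrite the replica's drifted rows with the reference host's rows."""
    ph = ",".join("?" * len(pk_values))
    good = reference.execute(
        f"SELECT * FROM {table} WHERE {pk} IN ({ph})", pk_values).fetchall()
    replica.execute(f"DELETE FROM {table} WHERE {pk} IN ({ph})", pk_values)
    replica.executemany(
        f"INSERT INTO {table} VALUES ({','.join('?' * len(good[0]))})", good)

ref = sqlite3.connect(":memory:")
rep = sqlite3.connect(":memory:")
for c in (ref, rep):
    c.execute("CREATE TABLE tag_summary (ts_id INTEGER PRIMARY KEY, ts_tags TEXT)")
ref.execute("INSERT INTO tag_summary VALUES (2, 'mobile edit,mobile web edit')")
rep.execute("INSERT INTO tag_summary VALUES (2, 'visualeditor')")  # drifted row

fix_drift(ref, rep, "tag_summary", "ts_id", (2,))
```

On production the write would go to the host being fixed (with replication considerations in mind), not through a Python helper; this only illustrates the delete-and-reinsert shape of the fix.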

Considering the fact that only db1041 and db1062 (master) had the same data for a few rows, I would suggest we pick one of these options:

  1. Rebuild all the slaves from db1041 (including codfw)
  2. Assume that, as db1041 has been depooled for a long time and db1062 is the master (and hence doesn't get reads), all the slaves are correct: take a full dump of db1041, and once we have to do a switchover, rebuild db1062 from a slave.

I would be inclined towards option #2, basically because that divergent data has probably not been read in months (again, the host has been depooled for a long time).

Change 397351 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db103{4,9}: Disable notifications

https://gerrit.wikimedia.org/r/397351

Change 397351 merged by Marostegui:
[operations/puppet@production] db103{4,9}: Disable notifications

https://gerrit.wikimedia.org/r/397351

Mentioned in SAL (#wikimedia-operations) [2017-12-11T08:52:37Z] <marostegui> Stop replication in sync on db1034 and db1039 - T163190

Marostegui renamed this task from Run pt-table-checksum on s7 to Checksum data on s7.Dec 11 2017, 4:00 PM
Marostegui claimed this task.
Marostegui triaged this task as Medium priority.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2017-12-12T09:34:12Z] <marostegui> Stop replication in sync on db1034 and db1039 for data consistency check - T163190

Change 397756 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1086

https://gerrit.wikimedia.org/r/397756

Change 397756 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1086

https://gerrit.wikimedia.org/r/397756

Mentioned in SAL (#wikimedia-operations) [2017-12-12T09:43:52Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1086 - T163190 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2017-12-12T09:44:17Z] <marostegui> Stop db1039 and db1086 in sync - T163190

Mentioned in SAL (#wikimedia-operations) [2017-12-12T11:56:16Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1086 with low weight - T163190 (duration: 00m 55s)

The first iteration of checks for the s7 databases is complete.
I am going to go over it again and check it further using compare.py.

Fixed more drifts on the following tables:
arwiki.archive
eswiki.archive
frwiktionary.archive
kowiki.archive
metawiki.archive
rowiki.archive

Fixed drifts on change_tags on:
eswiki
fawiki
hewiki
huwiki
kowiki
metawiki
ukwiki
viwiki

Fixed drifts on tag_summary on:
arwiki
cawiki
eswiki
fawiki
hewiki
huwiki
kowiki
metawiki
ukwiki
viwiki

s7 is looking pretty good now after fixing drifts on the text, pagelinks, user_newtalk and watchlist tables on a few different wikis.
Still a few more tables to double check with compare.py but I reckon we are in a much better state already after all the fixes.

Mentioned in SAL (#wikimedia-operations) [2018-01-04T08:53:04Z] <marostegui> Fixing inconsistencies on s7 - T163190

Change 401925 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1101:3317

https://gerrit.wikimedia.org/r/401925

Change 401925 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1101:3317

https://gerrit.wikimedia.org/r/401925

Mentioned in SAL (#wikimedia-operations) [2018-01-04T09:43:33Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 - T163190 (duration: 03m 09s)

Change 401930 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/401930

Change 401930 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1079

https://gerrit.wikimedia.org/r/401930

Mentioned in SAL (#wikimedia-operations) [2018-01-04T10:39:05Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T163190 (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-01-04T10:39:17Z] <marostegui> Stop replication in sync on db1079 and db1101:3317 - T163190

Mentioned in SAL (#wikimedia-operations) [2018-01-04T10:51:45Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 - T163190 (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2018-01-04T13:35:55Z] <marostegui> Stop replication in sync db1079 db1101:3317 T163190

Mentioned in SAL (#wikimedia-operations) [2018-01-04T13:36:05Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1079 - T163190 (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-01-04T13:43:58Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1079 - T163190 (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2018-01-04T15:09:10Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 - T163190 (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-01-05T06:23:56Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1094 - T163190 (duration: 00m 51s)

Change 402182 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1098:3317

https://gerrit.wikimedia.org/r/402182

Change 402182 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1098:3317

https://gerrit.wikimedia.org/r/402182

Mentioned in SAL (#wikimedia-operations) [2018-01-05T06:48:16Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 - T163190 (duration: 00m 27s)

Mentioned in SAL (#wikimedia-operations) [2018-01-05T06:49:15Z] <marostegui> Stop replication in sync on db1039 and db1098:3317 - T163190

Mentioned in SAL (#wikimedia-operations) [2018-01-05T08:39:08Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1094 - T163190 (duration: 00m 28s)

Change 402315 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1098:3317

https://gerrit.wikimedia.org/r/402315

Change 402315 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1098:3317

https://gerrit.wikimedia.org/r/402315

Mentioned in SAL (#wikimedia-operations) [2018-01-05T09:30:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 - T163190 (duration: 00m 27s)

I have fixed all the drifts reported by pt-table-checksum as well as all the ones found by compare.py.
It is hard to say this is 100% fixed, but we are in a much better state than before, as lots of tables have been fixed.
After re-running all the tests again there are no apparent drifts anymore.
Going to consider this resolved for now.

db2040 and db1094 eswiki.archive had drifts. Checking them and correcting them.

This is bad (split brain between eqiad and codfw):

./compare.py eswiki archive ar_id db1062 db1069 db1079 db1086 db1094 db1098:3317 db1101:3317 db2040 db2029 db2047 db2054 db2061 db2068 db2077 db2086:3317 db2087:3317 --from-value=12680001 --to-value=12690000 --step=100
Starting comparison between id 12680001 and 12690000
DIFFERENCE on db2040.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2029.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2047.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2054.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2061.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2068.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2077.codfw.wmnet:3306: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2086.codfw.wmnet:3317: WHERE ar_id BETWEEN 12682901 AND 12683000
DIFFERENCE on db2087.codfw.wmnet:3317: WHERE ar_id BETWEEN 12682901 AND 12683000
2018-01-18T16:43:49.242416: row id 12689901/12690000, ETA: 00m00s, 1 chunk(s) found different
Execution ended, a total of 1 chunk(s) are different.
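The chunked approach shown in that output can be sketched as: hash every BETWEEN-range of rows on each host and report the chunks whose hashes disagree with the baseline host. A toy version over in-memory sqlite3 databases (the real compare.py's internals may differ; table and values here are hypothetical, with the two swapped ids emulating the codfw drift):

```python
import hashlib
import sqlite3

def chunk_hash(conn, table, pk, lo, hi):
    """Hash all rows whose pk falls in [lo, hi] -- one comparison chunk."""
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE {pk} BETWEEN ? AND ? ORDER BY {pk}",
        (lo, hi)).fetchall()
    return hashlib.sha256(repr(rows).encode()).hexdigest()

def compare(hosts, table, pk, start, stop, step):
    """Yield (lo, hi) for every chunk where some host disagrees with hosts[0]."""
    for lo in range(start, stop + 1, step):
        hi = min(lo + step - 1, stop)
        baseline = chunk_hash(hosts[0], table, pk, lo, hi)
        if any(chunk_hash(h, table, pk, lo, hi) != baseline for h in hosts[1:]):
            yield (lo, hi)

def make_host(swap=False):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE archive (ar_id INTEGER PRIMARY KEY, ar_title TEXT)")
    rows = [(i, f"title_{i}") for i in range(1, 401)]
    if swap:  # emulate two rows whose titles sit under swapped ids
        rows[19], rows[20] = (20, "title_21"), (21, "title_20")
    conn.executemany("INSERT INTO archive VALUES (?, ?)", rows)
    return conn

eqiad, codfw = make_host(), make_host(swap=True)
chunks = list(compare([eqiad, codfw], "archive", "ar_id", 1, 400, 100))
print(chunks)
```

Narrowing --step around a flagged chunk is how the divergence gets pinned down to individual rows, as with the ar_id range above.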

It is 2 rows that got their ids swapped, either caused by a bad query on the codfw master, or fixed on eqiad only:

mysql -h db2040.codfw.wmnet $db -e "SELECT * FROM $table WHERE $pk IN (12682920, 12682921)"

I will just swap them on the codfw master and let that replicate.
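Swapping the two primary-key values directly would collide with the unique key, so the usual trick is to route one row through a temporary id. A sketch with sqlite3 as a stand-in (ids taken from the example above; the temporary id and row contents are hypothetical, and on the real codfw master the three UPDATEs would simply replicate down to its slaves):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (ar_id INTEGER PRIMARY KEY, ar_title TEXT)")
conn.executemany("INSERT INTO archive VALUES (?, ?)",
                 [(12682920, "row_b"), (12682921, "row_a")])  # ids swapped

def swap_ids(conn, table, pk, id_a, id_b):
    """Swap two rows' primary keys via a temporary id to avoid a unique-key collision."""
    tmp = -1  # assumed to be an unused id
    conn.execute(f"UPDATE {table} SET {pk} = ? WHERE {pk} = ?", (tmp, id_a))
    conn.execute(f"UPDATE {table} SET {pk} = ? WHERE {pk} = ?", (id_a, id_b))
    conn.execute(f"UPDATE {table} SET {pk} = ? WHERE {pk} = ?", (id_b, tmp))

swap_ids(conn, "archive", "ar_id", 12682920, 12682921)
```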

Should be fixed now, I will wait until s7 check is complete to close the ticket.