run pt-tablechecksum on s5/s8
Closed, Resolved, Public

Description

s2 (T154485) and s6 (T160509) have been checksummed.
It is time to checksum s5 so that the following hosts can be decommissioned: db1049, db1026, db1045

The __wmf_checksums table is placed in every database of the shard. That means:

dewiki.__wmf_checksums
wikidatawiki.__wmf_checksums
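
To see which tables differ on a given replica, the checksum table can be queried per database. A minimal sketch, assuming `__wmf_checksums` uses the default pt-table-checksum column layout (`db`, `tbl`, `chunk`, `this_crc`, `this_cnt`, `master_crc`, `master_cnt`); the task does not confirm the schema, the sample rows are invented, and SQLite stands in for MariaDB so the example runs on its own:

```python
import sqlite3

# Query adapted from the pt-table-checksum documentation: a chunk differs when
# its row count or CRC on this replica does not match the master's values.
DIFF_QUERY = """
SELECT db, tbl, COUNT(*) AS chunks, SUM(this_cnt) AS total_rows
FROM __wmf_checksums
WHERE master_cnt <> this_cnt
   OR master_crc <> this_crc
   OR (master_crc IS NULL) <> (this_crc IS NULL)
GROUP BY db, tbl
ORDER BY db, tbl
"""


def differing_tables(conn):
    """Return (db, tbl, differing_chunks, rows_in_those_chunks) per table."""
    return conn.execute(DIFF_QUERY).fetchall()


def demo_connection():
    """Build an in-memory checksum table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE __wmf_checksums (
        db TEXT, tbl TEXT, chunk INTEGER,
        this_crc TEXT, this_cnt INTEGER,
        master_crc TEXT, master_cnt INTEGER,
        PRIMARY KEY (db, tbl, chunk))""")
    conn.executemany("INSERT INTO __wmf_checksums VALUES (?,?,?,?,?,?,?)", [
        ("dewiki", "archive", 1, "aaaa", 1000, "aaaa", 1000),      # matches
        ("dewiki", "archive", 2, "bbbb", 999, "cccc", 1000),       # count and CRC differ
        ("dewiki", "change_tag", 1, "dddd", 500, "dddd", 500),     # matches
        ("dewiki", "tag_summary", 1, "eeee", 200, "ffff", 200),    # CRC differs
    ])
    return conn


if __name__ == "__main__":
    for row in differing_tables(demo_connection()):
        print(row)
```

Against a real replica the same query would be run once per database, since each database of the shard carries its own copy of the table.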

Event Timeline

Change 398216 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109 and db1101:3318

https://gerrit.wikimedia.org/r/398216

Change 398216 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109 and db1101:3318

https://gerrit.wikimedia.org/r/398216

Mentioned in SAL (#wikimedia-operations) [2017-12-14T07:08:09Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 and db1109 - T161294 (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2017-12-14T07:08:49Z] <marostegui> Stop replication in sync on db1109 and db1101:3318 - T161294
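
The log entries do not say how two replicas were stopped in sync. One common approach, sketched here as an assumption and not as the procedure actually used, is to stop both replicas, read `Relay_Master_Log_File` and `Exec_Master_Log_Pos` from `SHOW SLAVE STATUS` on each, and let the one that is behind catch up with `START SLAVE UNTIL`. This only applies when both hosts replicate from the same master. The helper names and coordinates below are hypothetical:

```python
from typing import NamedTuple, Optional, Tuple


class Coord(NamedTuple):
    """Executed master binlog position, as reported by SHOW SLAVE STATUS."""
    log_file: str  # Relay_Master_Log_File
    log_pos: int   # Exec_Master_Log_Pos


def binlog_index(log_file: str) -> int:
    """Numeric suffix of a binlog name, e.g. 'db-bin.000123' -> 123."""
    return int(log_file.rsplit(".", 1)[1])


def catch_up_statement(a: Coord, b: Coord) -> Optional[Tuple[str, str]]:
    """Return (replica, sql) for the replica that is behind, or None if level.

    `replica` is the string 'a' or 'b'; `sql` is the statement to run on it.
    """
    key_a = (binlog_index(a.log_file), a.log_pos)
    key_b = (binlog_index(b.log_file), b.log_pos)
    if key_a == key_b:
        return None
    behind, target = ("a", b) if key_a < key_b else ("b", a)
    sql = (f"START SLAVE UNTIL MASTER_LOG_FILE = '{target.log_file}', "
           f"MASTER_LOG_POS = {target.log_pos};")
    return behind, sql


if __name__ == "__main__":
    print(catch_up_statement(Coord("db-bin.000123", 400),
                             Coord("db-bin.000123", 900)))
```

Once both replicas report the same coordinates, their tables can be compared or repaired without writes arriving in between.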

I have fixed dewiki.archive across the board.
Next on the list:
dewiki.change_tag
dewiki.geo_tags
dewiki.tag_summary

Once those are done, I will do another checksum to verify.

The following tables have been fixed on dewiki:

dewiki.change_tag
dewiki.geo_tags

Marostegui renamed this task from run pt-tablechecksum on s5 to run pt-tablechecksum on s5/s8. Dec 14 2017, 4:57 PM

I have fixed tag_summary too.

dewiki should be consistent now on s5 and s8 - but I will do some more checks tomorrow before starting with wikidatawiki.

Mentioned in SAL (#wikimedia-operations) [2017-12-15T06:26:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 01m 16s)

wikidatawiki.archive has been fixed.

tag_summary broke on db1100 and on most or all of codfw; we need to reload that table almost everywhere, and also check for existing code issues.

For the record, I did:

delete from tag_summary WHERE ts_id=17666046;
UPDATE tag_summary SET ts_id = 17666046 WHERE ts_rev_id = 182949546;

PKs are broken, and a row is probably missing now.
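
The effect of those two statements can be reproduced on a toy table. A minimal sketch, assuming a simplified `tag_summary` with only `ts_id` (primary key) and `ts_rev_id`; the `ts_rev_id` of the deleted row and the original `ts_id` of the renumbered row are invented, and SQLite stands in for MariaDB. Deleting the row that holds the key and then renumbering the other row avoids a duplicate-key error but leaves the table one row short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag_summary (ts_id INTEGER PRIMARY KEY, ts_rev_id INTEGER)"
)
conn.executemany("INSERT INTO tag_summary VALUES (?, ?)", [
    (17666046, 182949000),  # hypothetical row currently holding the key
    (17666047, 182949546),  # hypothetical original ts_id of the renumbered row
])
rows_before = conn.execute("SELECT COUNT(*) FROM tag_summary").fetchone()[0]

with conn:  # commit both statements together
    conn.execute("DELETE FROM tag_summary WHERE ts_id = 17666046")
    conn.execute(
        "UPDATE tag_summary SET ts_id = 17666046 WHERE ts_rev_id = 182949546"
    )

rows_after = conn.execute("SELECT COUNT(*) FROM tag_summary").fetchone()[0]
remaining = conn.execute("SELECT ts_id, ts_rev_id FROM tag_summary").fetchall()
print(rows_before, rows_after, remaining)
```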

I started with wikidatawiki.change_tag on Friday and am still fixing it; there are lots of differences across all the servers. Hopefully I will be done by tomorrow. Next on the list was actually wikidatawiki.tag_summary, which I was expecting to start on Monday. Too late, it looks like :-(

For the record, servers that crashed:

db1100
14:47 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.33 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 615.59 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.47 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.34 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.51 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.65 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.67 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.19 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.74 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.27 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.47 seconds
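
The alerts above can be grouped by section to see which replicas were affected. A minimal sketch; the regular expression is written against the line format shown in this log only and is not a general Icinga parser, and the function name is hypothetical:

```python
import re
from collections import defaultdict

LAG_RE = re.compile(
    r"MariaDB Slave Lag: (?P<section>s\d+) on (?P<host>\S+) is CRITICAL: "
    r".*Replication lag: (?P<lag>[\d.]+) seconds"
)


def lagging_hosts(lines):
    """Map section -> sorted list of (host, lag_seconds) found in the lines."""
    by_section = defaultdict(list)
    for line in lines:
        m = LAG_RE.search(line)
        if m:
            by_section[m["section"]].append((m["host"], float(m["lag"])))
    return {section: sorted(hosts) for section, hosts in by_section.items()}


# Three of the alert lines from the log above.
sample = [
    "14:47 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: "
    "CRITICAL slave_sql_lag Replication lag: 605.33 seconds",
    "14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: "
    "CRITICAL slave_sql_lag Replication lag: 620.34 seconds",
    "14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: "
    "CRITICAL slave_sql_lag Replication lag: 622.19 seconds",
]

if __name__ == "__main__":
    print(lagging_hosts(sample))
```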

wikidatawiki.change_tag is finished.

Tomorrow I will re-import tag_summary on all the hosts that failed yesterday and continue with the next wikidata tables. Once everything is done, I will run another diff to double-check for possible drift on both dewiki and wikidata.
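
A follow-up diff of this kind amounts to checksumming the same key ranges on two hosts and comparing the results. A minimal sketch of that idea, not the tool used in this task: rows are plain tuples whose first element is an integer primary key, the chunk size and sample data are invented, and CRC32 over `repr()` stands in for a real row checksum.

```python
import zlib


def chunk_checksums(rows, chunk_size):
    """Map chunk number -> (row_count, crc) for rows keyed by integer PK."""
    chunks = {}
    for row in sorted(rows):
        chunk = row[0] // chunk_size
        count, crc = chunks.get(chunk, (0, 0))
        # Feed each row into a running CRC for its chunk.
        chunks[chunk] = (count + 1, zlib.crc32(repr(row).encode(), crc))
    return chunks


def drifted_chunks(master_rows, replica_rows, chunk_size=1000):
    """Return sorted chunk numbers whose count or CRC differs between hosts."""
    m = chunk_checksums(master_rows, chunk_size)
    r = chunk_checksums(replica_rows, chunk_size)
    return sorted(c for c in m.keys() | r.keys() if m.get(c) != r.get(c))


master = [(1, "a"), (2, "b"), (1500, "c"), (2500, "d")]
replica = [(1, "a"), (2, "B"), (1500, "c")]  # one changed row, one missing row

if __name__ == "__main__":
    print(drifted_chunks(master, replica))
```

Only the chunks reported as drifted need a row-by-row comparison, which keeps the expensive step small on large tables.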

Also db1100 and dbstore1001.

Yeah, I noted db1100 above, right above the codfw hosts. As for dbstore1001, I was expecting it to crash today :(

I have fixed wikidatawiki.text

wikidatawiki.wb_items_per_site done.

Change 398789 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/398789

Change 398789 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/398789

Mentioned in SAL (#wikimedia-operations) [2017-12-18T06:17:21Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2017-12-18T06:17:32Z] <marostegui> Stop replication in sync on db1106 and db1100 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-18T06:28:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2017-12-18T08:58:50Z] <marostegui> Stop replication in sync on db1100 and db2052 - T161294

Change 398804 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109

https://gerrit.wikimedia.org/r/398804

Change 398804 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109

https://gerrit.wikimedia.org/r/398804

Mentioned in SAL (#wikimedia-operations) [2017-12-18T10:30:07Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1109 - T161294 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2017-12-18T10:30:19Z] <marostegui> Stop replication on db1109 and db2045 in sync - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-18T10:46:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1109 - T161294 (duration: 00m 56s)

Change 398827 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/398827

Change 398827 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/398827

Mentioned in SAL (#wikimedia-operations) [2017-12-18T12:58:03Z] <marostegui> Stop replication in sync on db1106 and db1100 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:00:47Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 03m 03s)

Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:10:13Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 03m 06s)

Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:18:36Z] <marostegui> Stop replication in sync on db1100 and db2052 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:29:56Z] <marostegui> Stop replication in sync on db1109 and db2045 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-18T15:37:32Z] <marostegui> Stop db1100 and dbstore1002 in sync - T161294

Change 399138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/399138

Change 399138 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106

https://gerrit.wikimedia.org/r/399138

Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:26:51Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:29:07Z] <marostegui> Stop replication in sync on db1100 and db1106 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:40:21Z] <marostegui> Stop replication in sync on db1106 and dbstore1002 s5 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:51:28Z] <marostegui> Stop replication in sync on db1106 and db2052 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-19T07:00:03Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-19T08:28:44Z] <marostegui> Stop replication in sync on db2045 and db1109 - T161294

I am fairly confident that dewiki is now really consistent.
I will keep running a few more tests to confirm it.

I am still fixing and checking wikidata.

Mentioned in SAL (#wikimedia-operations) [2017-12-19T08:56:53Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 51s)

Change 399148 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097:3315

https://gerrit.wikimedia.org/r/399148

Change 399148 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097:3315

https://gerrit.wikimedia.org/r/399148

Mentioned in SAL (#wikimedia-operations) [2017-12-19T09:12:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:01:41Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 52s)

Change 399191 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109, db1099:3318

https://gerrit.wikimedia.org/r/399191

Change 399191 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109, db1099:3318

https://gerrit.wikimedia.org/r/399191

Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:10:45Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 and db1109 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:10:54Z] <marostegui> Stop replication in sync on db1109 and db1099:3318 - https://phabricator.wikimedia.org/T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-19T16:54:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 - T161294 (duration: 00m 51s)

Change 399342 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1101:3318, repool s8 db1099

https://gerrit.wikimedia.org/r/399342

Change 399342 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1101:3318, repool s8 db1099

https://gerrit.wikimedia.org/r/399342

Mentioned in SAL (#wikimedia-operations) [2017-12-20T06:53:36Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 depool db1101:3318 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-20T06:53:47Z] <marostegui> Stop replication in sync on db1101:3318 and db1109 - T161294

Change 399343 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3315

https://gerrit.wikimedia.org/r/399343

Change 399343 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3315

https://gerrit.wikimedia.org/r/399343

Mentioned in SAL (#wikimedia-operations) [2017-12-20T07:08:22Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-20T07:09:01Z] <marostegui> Stop replication in sync on db1096:3315 and db1100 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-20T08:11:37Z] <marostegui> Stop replication in sync on db1100 and dbstore1002 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-20T11:11:19Z] <marostegui> Stop replication in sync on db1100 and db1071 - T161294

Change 399387 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad: Repool db1101:3318, db1109

https://gerrit.wikimedia.org/r/399387

Change 399387 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad: Repool db1101:3318, db1109

https://gerrit.wikimedia.org/r/399387

Mentioned in SAL (#wikimedia-operations) [2017-12-20T12:27:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-20T12:34:50Z] <marostegui> Enable notifications for db1100 - T161294

Change 399564 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109

https://gerrit.wikimedia.org/r/399564

Change 399564 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109

https://gerrit.wikimedia.org/r/399564

Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:33:43Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1109 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:35:37Z] <marostegui> Stop replication in sync db1100 and db1071 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:42:33Z] <marostegui> Stop replication in sync on db1100 - db2052 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:50:10Z] <marostegui> Stop replication in sync on dbstore1002 - db1100 - T161294

Change 399566 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109 and db1096:3315

https://gerrit.wikimedia.org/r/399566

Change 399566 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1109 and db1096:3315

https://gerrit.wikimedia.org/r/399566

Mentioned in SAL (#wikimedia-operations) [2017-12-21T07:16:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1109 and db1096:3315 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-21T07:37:23Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1100- T161294 (duration: 00m 52s)

Change 399581 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1100

https://gerrit.wikimedia.org/r/399581

Change 399581 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1100

https://gerrit.wikimedia.org/r/399581

Mentioned in SAL (#wikimedia-operations) [2017-12-21T08:34:50Z] <marostegui> Stop replication in sync on db1100 and dbstore1002 - T161294

Mentioned in SAL (#wikimedia-operations) [2017-12-21T08:35:20Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1100 - T161294 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-12-21T09:22:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1100 - T161294 (duration: 00m 51s)

Change 399590 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100

https://gerrit.wikimedia.org/r/399590

Change 399590 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100

https://gerrit.wikimedia.org/r/399590

Mentioned in SAL (#wikimedia-operations) [2017-12-21T09:39:51Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 - T161294 (duration: 00m 51s)

Change 399597 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100

https://gerrit.wikimedia.org/r/399597

Change 399597 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100

https://gerrit.wikimedia.org/r/399597

Mentioned in SAL (#wikimedia-operations) [2017-12-21T10:03:20Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 and start restoring original weight for db1082 - T161294 (duration: 00m 48s)

Change 399599 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1100,db1082

https://gerrit.wikimedia.org/r/399599

Change 399599 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1100,db1082

https://gerrit.wikimedia.org/r/399599

Mentioned in SAL (#wikimedia-operations) [2017-12-21T10:30:48Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore db1082 and db1100 original weight - T161294 (duration: 00m 51s)

After fixing lots of differences on wikidata tables, it is impossible to say that it is 100% fixed, but it is in a much more consistent state than it was before, and I believe it is quite consistent now.
I will do some more checks next week before closing this task.

The last tests I have done look good, so I am going to close this as resolved.
It is obviously impossible to say that this is 100% fixed, but after fixing lots and lots of rows on numerous tables, we are undoubtedly in a much better state than we were before.