Description
Details
Status | Assigned | Task
---|---|---
Resolved | None | T134476 Decommission old coredb machines (<=db1050)
Resolved | Marostegui | T161294 run pt-tablechecksum on s5/s8
Event Timeline
Change 398216 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109 and db1101:3318
Change 398216 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109 and db1101:3318
Mentioned in SAL (#wikimedia-operations) [2017-12-14T07:08:09Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 and db1109 - T161294 (duration: 01m 07s)
Mentioned in SAL (#wikimedia-operations) [2017-12-14T07:08:49Z] <marostegui> Stop replication in sync on db1109 and db1101:3318 - T161294
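The log entries do not say how the two hosts were stopped "in sync". One common approach, sketched here purely as an assumption, is to stop the SQL thread on both replicas, compare their executed master coordinates, and let the one that is behind run until it reaches the other's position. The sketch only builds the statement text. The host names and binlog file names are hypothetical, and it assumes both replicas read from the same direct master, so that their coordinates are comparable.

```python
def binlog_key(status):
    """Order positions by (binlog sequence number, byte offset)."""
    file_name = status["Relay_Master_Log_File"]
    sequence = int(file_name.rsplit(".", 1)[1])
    return sequence, status["Exec_Master_Log_Pos"]


def sync_statements(statuses):
    """Return {host: statement} for every replica behind the furthest one.

    `statuses` maps a host name to the two SHOW SLAVE STATUS fields used
    here. Hosts already at the target position need no statement.
    """
    target = max(statuses.values(), key=binlog_key)
    statements = {}
    for host, status in statuses.items():
        if binlog_key(status) < binlog_key(target):
            statements[host] = (
                "START SLAVE SQL_THREAD UNTIL MASTER_LOG_FILE = '%s', "
                "MASTER_LOG_POS = %d"
                % (target["Relay_Master_Log_File"], target["Exec_Master_Log_Pos"])
            )
    return statements


# Hypothetical coordinates, for illustration only.
example = {
    "replica-a": {"Relay_Master_Log_File": "master-bin.000012", "Exec_Master_Log_Pos": 500},
    "replica-b": {"Relay_Master_Log_File": "master-bin.000012", "Exec_Master_Log_Pos": 900},
}
for host, statement in sync_statements(example).items():
    print(host, "->", statement)
```

Once both replicas sit at the same position, their tables can be compared or copied without replication changing them underneath.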
I have fixed dewiki.archive across the board.
Next on the list:
dewiki.change_tag
dewiki.geo_tags
dewiki.tag_summary
Once those are done, I will do another checksum to verify.
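As a rough illustration of what such a verification pass does, the sketch below splits a table into primary-key ranges and compares a digest of each range between a reference copy and a replica copy. It is a simplified stand-in, not the algorithm pt-table-checksum uses. Rows are plain in-memory tuples with the primary key first, and the function names and sample data are hypothetical.

```python
import hashlib


def chunk_ranges(reference_rows, chunk_size):
    """Build half-open PK ranges [low, high) from the reference copy.

    The first range is open at the bottom and the last at the top, so a
    replica row with any PK value falls into exactly one range.
    """
    starts = sorted(row[0] for row in reference_rows)[::chunk_size]
    lows = [None] + starts[1:]
    highs = starts[1:] + [None]
    return list(zip(lows, highs))


def range_digest(rows, low, high):
    """Return (row count, SHA-1 digest) for the rows inside [low, high)."""
    selected = sorted(
        row for row in rows
        if (low is None or row[0] >= low) and (high is None or row[0] < high)
    )
    digest = hashlib.sha1(repr(selected).encode("utf-8")).hexdigest()
    return len(selected), digest


def differing_ranges(reference_rows, replica_rows, chunk_size=1000):
    """List the PK ranges whose count or digest differs between the copies."""
    return [
        (low, high)
        for low, high in chunk_ranges(reference_rows, chunk_size)
        if range_digest(reference_rows, low, high)
        != range_digest(replica_rows, low, high)
    ]


# Hypothetical data: the replica has drifted on the row with PK 4.
reference = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
replica = [(1, "a"), (2, "b"), (3, "c"), (4, "X"), (5, "e")]
print(differing_ranges(reference, replica, chunk_size=2))
```

Only the ranges reported as different then need a row-by-row comparison, which keeps the expensive part of the check small.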
I have fixed tag_summary too.
dewiki should be consistent now on s5 and s8 - but I will do some more checks tomorrow before starting with wikidatawiki.
Mentioned in SAL (#wikimedia-operations) [2017-12-15T06:26:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 01m 16s)
tag_summary broke on db1100 and on most or all of codfw. We need to reload that table almost everywhere, and also check for existing code issues.
For the record, I did:
DELETE FROM tag_summary WHERE ts_id = 17666046;
UPDATE tag_summary SET ts_id = 17666046 WHERE ts_rev_id = 182949546;
The PKs are broken, and a row is probably missing now.
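To see why the two statements above can leave the table one row short, here is a small simulation using an in-memory SQLite database rather than MariaDB. The table is reduced to the two columns named in the statements, and the starting rows are hypothetical, since the real contents before the fix are not recorded here.

```python
import sqlite3


def apply_manual_fix(conn):
    """Run the two statements quoted above against the given connection."""
    conn.execute("DELETE FROM tag_summary WHERE ts_id = 17666046")
    conn.execute(
        "UPDATE tag_summary SET ts_id = 17666046 WHERE ts_rev_id = 182949546"
    )


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM tag_summary").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag_summary (ts_id INTEGER PRIMARY KEY, ts_rev_id INTEGER)"
)
# Hypothetical starting state: one row already holds ts_id 17666046, and
# the row for revision 182949546 sits under a different ts_id.
conn.executemany(
    "INSERT INTO tag_summary VALUES (?, ?)",
    [
        (17666045, 182949545),
        (17666046, 182949540),
        (17666047, 182949546),
    ],
)

before = row_count(conn)
apply_manual_fix(conn)
after = row_count(conn)
remaining = conn.execute(
    "SELECT ts_id, ts_rev_id FROM tag_summary ORDER BY ts_id"
).fetchall()
print(before, after, remaining)
```

In this made-up starting state the DELETE removes the row that previously held the key and the UPDATE moves another row onto it, so the table ends with one row fewer than it started with.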
I started with wikidatawiki.change_tag on Friday and am still fixing it (lots of differences across all the servers). Hopefully I will be done by tomorrow. Next on the list was actually wikidatawiki.tag_summary, which I was expecting to start on Monday. Too late, it looks like :-(
For the record, servers that crashed:
db1100
14:47 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.33 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 615.59 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.47 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.34 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.51 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.65 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.67 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.19 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.74 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.27 seconds
14:48 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.47 seconds
wikidatawiki.change_tag is finished.
Tomorrow I will re-import tag_summary on all the hosts that failed yesterday and continue with the next wikidata tables. Once everything is done, I will run another diff to double-check for possible drift on both dewiki and wikidata.
Yeah, db1100: I wrote it down above, right above the codfw hosts. As for dbstore1001, I was expecting it to crash today :(
Change 398789 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106
Change 398789 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106
Mentioned in SAL (#wikimedia-operations) [2017-12-18T06:17:21Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 57s)
Mentioned in SAL (#wikimedia-operations) [2017-12-18T06:17:32Z] <marostegui> Stop replication in sync on db1106 and db1100 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-18T06:28:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 56s)
Mentioned in SAL (#wikimedia-operations) [2017-12-18T08:58:50Z] <marostegui> Stop replication in sync on db1100 and db2052 - T161294
Change 398804 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109
Change 398804 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109
Mentioned in SAL (#wikimedia-operations) [2017-12-18T10:30:07Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1109 - T161294 (duration: 00m 56s)
Mentioned in SAL (#wikimedia-operations) [2017-12-18T10:30:19Z] <marostegui> Stop replication on db1109 and db2045 in sync - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-18T10:46:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1109 - T161294 (duration: 00m 56s)
Change 398827 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106
Change 398827 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106
Mentioned in SAL (#wikimedia-operations) [2017-12-18T12:58:03Z] <marostegui> Stop replication in sync on db1106 and db1100 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:00:47Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 03m 03s)
Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:10:13Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 03m 06s)
Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:18:36Z] <marostegui> Stop replication in sync on db1100 and db2052 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-18T13:29:56Z] <marostegui> Stop replication in sync on db1109 and db2045 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-18T15:37:32Z] <marostegui> Stop db1100 and dbstore1002 in sync - T161294
Change 399138 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106
Change 399138 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1106
Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:26:51Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 53s)
Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:29:07Z] <marostegui> Stop replication in sync on db1100 and db1106 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:40:21Z] <marostegui> Stop replication in sync on db1106 and dbstore1002 s5 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-19T06:51:28Z] <marostegui> Stop replication in sync on db1106 and db2052 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-19T07:00:03Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-19T08:28:44Z] <marostegui> Stop replication in sync on db2045 and db1109 - T161294
I am fairly confident that dewiki is now really consistent.
I will keep running a few more tests to confirm it.
I am still fixing and checking wikidata.
Mentioned in SAL (#wikimedia-operations) [2017-12-19T08:56:53Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1106 - T161294 (duration: 00m 51s)
Change 399148 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097:3315
Change 399148 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1097:3315
Mentioned in SAL (#wikimedia-operations) [2017-12-19T09:12:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:01:41Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1106 - T161294 (duration: 00m 52s)
Change 399191 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109, db1099:3318
Change 399191 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109, db1099:3318
Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:10:45Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 and db1109 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:10:54Z] <marostegui> Stop replication in sync on db1109 and db1099:3318 - https://phabricator.wikimedia.org/T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-19T16:54:17Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 - T161294 (duration: 00m 51s)
Change 399342 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1101:3318, repool s8 db1099
Change 399342 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1101:3318, repool s8 db1099
Mentioned in SAL (#wikimedia-operations) [2017-12-20T06:53:36Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 depool db1101:3318 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-20T06:53:47Z] <marostegui> Stop replication in sync on db1101:3318 and db1109 - T161294
Change 399343 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3315
Change 399343 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3315
Mentioned in SAL (#wikimedia-operations) [2017-12-20T07:08:22Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-20T07:09:01Z] <marostegui> Stop replication in sync on db1096:3315 and db1100 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-20T08:11:37Z] <marostegui> Stop replication in sync on db1100 and dbstore1002 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-20T11:11:19Z] <marostegui> Stop replication in sync on db1100 and db1071 - T161294
Change 399387 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad: Repool db1101:3318, db1109
Change 399387 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad: Repool db1101:3318, db1109
Mentioned in SAL (#wikimedia-operations) [2017-12-20T12:27:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-20T12:34:50Z] <marostegui> Enable notifications for db1100 - T161294
Change 399564 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109
Change 399564 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109
Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:33:43Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1109 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:35:37Z] <marostegui> Stop replication in sync db1100 and db1071 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:42:33Z] <marostegui> Stop replication in sync on db1100 - db2052 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-21T06:50:10Z] <marostegui> Stop replication in sync on dbstore1002 - db1100 - T161294
Change 399566 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1109 and db1096:3315
Change 399566 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1109 and db1096:3315
Mentioned in SAL (#wikimedia-operations) [2017-12-21T07:16:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1109 and db1096:3315 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-21T07:37:23Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1100- T161294 (duration: 00m 52s)
Change 399581 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1100
Change 399581 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1100
Mentioned in SAL (#wikimedia-operations) [2017-12-21T08:34:50Z] <marostegui> Stop replication in sync on db1100 and dbstore1002 - T161294
Mentioned in SAL (#wikimedia-operations) [2017-12-21T08:35:20Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1100 - T161294 (duration: 00m 51s)
Mentioned in SAL (#wikimedia-operations) [2017-12-21T09:22:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1100 - T161294 (duration: 00m 51s)
Change 399590 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100
Change 399590 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100
Mentioned in SAL (#wikimedia-operations) [2017-12-21T09:39:51Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 - T161294 (duration: 00m 51s)
Change 399597 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100
Change 399597 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Increase traffic for db1100
Mentioned in SAL (#wikimedia-operations) [2017-12-21T10:03:20Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 and start restoring original weight for db1082 - T161294 (duration: 00m 48s)
Change 399599 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1100,db1082
Change 399599 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1100,db1082
Mentioned in SAL (#wikimedia-operations) [2017-12-21T10:30:48Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore db1082 and db1100 original weight - T161294 (duration: 00m 51s)
After fixing lots of differences on wikidata tables, it is impossible to say that it is 100% fixed, but it is in a much more consistent state than it was before, and I believe it is quite consistent now.
I will do some more checks next week before closing this task.
The last tests I have done look good, so I am going to close this as resolved.
It is obviously impossible to say that this is 100% fixed, but after fixing lots and lots of rows on numerous tables, we are undoubtedly in a much better state than we were before.