
db1082 power loss resulted in mysql crash
Closed, Resolved · Public

Description

See parent task for details.

  • db1082 crashed, may need to be reimaged.
  • The s5 eqiad sanitarium and its children (labsdb1009/10/11) have s5 replication stopped; they will catch up once db1082 is restarted or (preferably) switched over.

db1124 s5 replication should be stopped before attempting to start db1082, to prevent any drift from being replicated to the cloud dbs.

CC @Bstorm so that someone from cloud is aware of the s5 replication issue; nothing is needed from cloud at the moment.

Event Timeline

jcrespo triaged this task as High priority. Jan 7 2019, 6:36 PM
jcrespo created this task.
Restricted Application added a subscriber: Aklapper. Jan 7 2019, 6:36 PM
jcrespo claimed this task. Jan 7 2019, 6:36 PM
jcrespo added a subscriber: Cmjohnson.

I plan to take care of this tomorrow morning.

jcrespo moved this task from Triage to Next on the DBA board. Jan 7 2019, 6:37 PM
jcrespo removed a subscriber: Cmjohnson.

Maybe it is worth starting replication on db1082 (not on sanitarium) and letting it catch up. Once it is synced, run compare.py against the host you will reimage it from to make sure they are the same, so we avoid issues with the sanitarium host and row-based replication. Just to make doubly sure.
Doing this will of course take longer to resolve this task. It is only an idea, feel free to dismiss it :). I just want to avoid the same issue we had with the sanitarium host on s8 and the wikidata.pagelinks table (T212574).
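
To illustrate the kind of check suggested above, here is a minimal sketch of a chunked comparison between two copies of a table. It is not compare.py itself and makes no claim about how that tool works. It assumes rows are already fetched as tuples with an integer primary key in the first column, and the function names (`range_digest`, `differing_ranges`) are hypothetical.

```python
import hashlib


def range_digest(rows, low, high):
    """Hash all rows whose primary key (first column) falls in [low, high)."""
    chunk = sorted(r for r in rows if low <= r[0] < high)
    return hashlib.sha256(repr(chunk).encode("utf-8")).hexdigest()


def differing_ranges(rows_a, rows_b, chunk_size=1000):
    """Return the primary-key ranges in which the two row sets differ.

    Assumption: rows are tuples and the first element is a unique integer key.
    """
    keys = [r[0] for r in rows_a] + [r[0] for r in rows_b]
    if not keys:
        return []
    mismatches = []
    # Walk the key space in fixed-size ranges so a missing row on one side
    # only affects the range it belongs to.
    for low in range(min(keys), max(keys) + 1, chunk_size):
        high = low + chunk_size
        if range_digest(rows_a, low, high) != range_digest(rows_b, low, high):
            mismatches.append((low, high))
    return mismatches


if __name__ == "__main__":
    host_a = [(1, "x"), (2, "y"), (5, "z")]
    host_b = [(1, "x"), (2, "Y"), (5, "z")]  # row 2 has drifted
    print(differing_ranges(host_a, host_b, chunk_size=2))
```

Comparing by key range rather than by row offset keeps one missing or extra row from shifting every later chunk out of alignment.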

Mentioned in SAL (#wikimedia-operations) [2019-01-08T09:22:42Z] <jynus> stop replication on db1124:s5 T213108

db1124:s5 stopped at db1082-bin.002490:667685191

root@db1124[(none)]> show global variables like '%gtid%';      
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name          | Value                                                                                                                                                                                                                                                                                  |
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gtid_binlog_pos        | 171966558-171966558-99,171970577-171970577-67218,171978777-171978777-1405404331,180363367-180363367-133158799,180367364-180367364-67917352                                                                                                                                             |
| gtid_binlog_state      | 171966558-171966558-99,171970577-171970577-67218,171978777-171978777-1405404331,180363367-180363367-133158799,180367364-180367364-67917352                                                                                                                                             |
| gtid_current_pos       | 0-180359179-5734605861,171966558-171966558-99,171970577-171970577-67218,171970704-171970704-351094624,171974884-171974884-1473084269,171978768-171978768-202416,171978777-171978777-1405404332,180359179-180359179-96523837,180363367-180363367-133158799,180367364-180367364-67917352 |
| gtid_domain_id         | 171970577                                                                                                                                                                                                                                                                              |
| gtid_ignore_duplicates | OFF                                                                                                                                                                                                                                                                                    |
| gtid_slave_pos         | 0-180359179-5734605861,171966558-171966558-99,171970704-171970704-351094624,171974884-171974884-1473084269,171978768-171978768-202416,171978777-171978777-1405404332,180359179-180359179-96523837,180363367-180363367-133158799,180367364-180367364-67917352                           |
| gtid_strict_mode       | OFF                                                                                                                                                                                                                                                                                    |
| wsrep_gtid_domain_id   | 0                                                                                                                                                                                                                                                                                      |
| wsrep_gtid_mode        | OFF                                                                                                                                                                                                                                                                                    |
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
9 rows in set (0.00 sec)
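
The long GTID lists above are hard to compare by eye. As a rough aid, the sketch below parses MariaDB GTID lists (comma-separated `domain-server-sequence` triples) and reports the domains where two lists disagree. The helper names are hypothetical, and the sample strings are a shortened subset of the `gtid_binlog_pos` and `gtid_slave_pos` values shown above.

```python
def parse_gtid_list(value):
    """Parse 'domain-server-seq,...' into {domain_id: (server_id, seq_no)}."""
    positions = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        domain, server, seq = (int(part) for part in item.split("-"))
        positions[domain] = (server, seq)
    return positions


def diff_gtid(left, right):
    """Return {domain_id: (left_entry, right_entry)} for domains that differ.

    An entry is None when the domain is missing from that side.
    """
    a, b = parse_gtid_list(left), parse_gtid_list(right)
    return {
        domain: (a.get(domain), b.get(domain))
        for domain in sorted(a.keys() | b.keys())
        if a.get(domain) != b.get(domain)
    }


if __name__ == "__main__":
    # Shortened subset of the values from the output above.
    binlog_pos = "171966558-171966558-99,171978777-171978777-1405404331"
    slave_pos = (
        "0-180359179-5734605861,"
        "171966558-171966558-99,"
        "171978777-171978777-1405404332"
    )
    for domain, (in_binlog, in_slave) in diff_gtid(binlog_pos, slave_pos).items():
        print(domain, in_binlog, in_slave)
```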

This is mostly fixed, except that GTID must still be enabled on db1082 and db1124, and db1082 must be repooled.

Change 483102 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1082 with minimal traffic

https://gerrit.wikimedia.org/r/483102

Change 483102 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1082 with minimal traffic

https://gerrit.wikimedia.org/r/483102

Change 483108 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Fully repool db1082 after recovery

https://gerrit.wikimedia.org/r/483108

Change 483108 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Fully repool db1082 after recovery

https://gerrit.wikimedia.org/r/483108

jcrespo closed this task as Resolved. Jan 9 2019, 4:24 PM

db1082 is fully repooled, and GTID has been re-enabled on both it and db1124.