
es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)]
Closed, Resolved (Public)

Description

Caused by the row C switch maintenance (T331882).

I ran this:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.180 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.069 sec)

And it seems it's fixed now, but a data integrity check won't hurt.

Event Timeline

Will schedule an integrity check of the last million rows per wiki. Corruption would normally be very, very unusual, so feel free to leave it pooled for now.
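For illustration, a minimal sketch of the kind of per-wiki check meant here, assuming the usual external store layout (a blobs_cluster26 table per wiki database with blob_id and blob_text columns; enwiki used as the example wiki). The same query would be run on es1022 and on a known-good host and the results compared:

-- Checksum the last million blob rows of one wiki (sketch; table and column
-- names are assumptions based on the external store naming convention).
SELECT COUNT(*)              AS rows_checked,
       MIN(blob_id)          AS first_id,
       MAX(blob_id)          AS last_id,
       SUM(CRC32(blob_text)) AS checksum
  FROM enwiki.blobs_cluster26
 WHERE blob_id > (SELECT MAX(blob_id) - 1000000 FROM enwiki.blobs_cluster26);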

jcrespo triaged this task as Medium priority. Apr 4 2023, 1:43 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-04T13:44:16Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool es1022 T333961', diff saved to https://phabricator.wikimedia.org/P46027 and previous config saved to /var/cache/conftool/dbconfig/20230404-134415-ladsgroup.json

root@es1021[(none)]> show slave hosts;
+-----------+--------------------+------+-----------+
| Server_id | Host               | Port | Master_id |
+-----------+--------------------+------+-----------+
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171966665 | es1020.eqiad.wmnet | 3306 | 171970708 |
+-----------+--------------------+------+-----------+
27 rows in set (0.000 sec)
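Each row above corresponds to a replica registration held open by a binlog dump thread on es1021, so 26 identical es1022 entries suggest that es1022's repeated reconnect attempts left stale dump threads behind that were never cleaned up. A sketch of how to list them directly (10.64.32.102 is assumed to be es1022, going by the processlist pasted further down in this task):

-- List the connections es1022 still holds on es1021 (sketch).
SELECT id, user, host, command, time, state
  FROM information_schema.processlist
 WHERE host LIKE '10.64.32.102:%'
 ORDER BY time DESC;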
jcrespo renamed this task from Replication broke on es1022 (es4) to es1021-> es1022 ended up in a weird state after switch maintenance [was: Replication broke on es1022 (es4)]. Apr 4 2023, 2:34 PM

Change 905648 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/mediawiki-config@master] external store: Depool es4 (cluster26) from writes for maintenance

https://gerrit.wikimedia.org/r/905648

Change 905648 merged by jenkins-bot:

[operations/mediawiki-config@master] external store: Depool es4 (cluster26) from writes for maintenance

https://gerrit.wikimedia.org/r/905648

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:06:45Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:905648|external store: Depool es4 (cluster26) from writes for maintenance (T333961)]]

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:08:10Z] <ladsgroup@deploy2002> ladsgroup and jynus: Backport for [[gerrit:905648|external store: Depool es4 (cluster26) from writes for maintenance (T333961)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:18:16Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:905648|external store: Depool es4 (cluster26) from writes for maintenance (T333961)]] (duration: 11m 30s)

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:31:14Z] <jynus> restart es1021, several connections in a "stuck" state T333961

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:58:09Z] <jynus> restart es1021, several connections in a "stuck" state T333961

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:59:06Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on es1022.eqiad.wmnet with reason: T333961

Mentioned in SAL (#wikimedia-operations) [2023-04-04T15:59:21Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es1022.eqiad.wmnet with reason: T333961

processlist weirdness on es1021 showing the stuck processes from es1022 replication:

Screenshot_20230404_180333.png (1×2 px, 138 KB)

Change 905623 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/mediawiki-config@master] Revert "external store: Depool es4 (cluster26) from writes for maintenance"

https://gerrit.wikimedia.org/r/905623

Change 905623 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "external store: Depool es4 (cluster26) from writes for maintenance"

https://gerrit.wikimedia.org/r/905623

Mentioned in SAL (#wikimedia-operations) [2023-04-04T16:10:05Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:905623|Revert "external store: Depool es4 (cluster26) from writes for maintenance" (T333961)]]

Mentioned in SAL (#wikimedia-operations) [2023-04-04T16:11:30Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:905623|Revert "external store: Depool es4 (cluster26) from writes for maintenance" (T333961)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-04-04T16:17:37Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:905623|Revert "external store: Depool es4 (cluster26) from writes for maintenance" (T333961)]] (duration: 07m 31s)

jcrespo removed jcrespo as the assignee of this task. Edited Apr 4 2023, 4:51 PM
jcrespo raised the priority of this task from Medium to High.

Summary: es1022 is left shut down, as it is clearly the trigger of the stuck connections on es1021. For now es1020 and es1021 should provide enough redundancy.

I don't know what the cause is (I saw https://phabricator.wikimedia.org/T333961#8755045 but didn't comment on it because I was busy putting the fire out; now someone else can research the root cause), but let's not start es1022 / start replication there unless MW is fully depooled or we have a plan, if we don't want to affect es1021 again. The fact that this happened only on es1022 is very weird: while it is true that it is the only es host running 10.6, there are other 10.6 replicas that don't have this issue. I wouldn't rule out some weird regression in replication triggered by the loss of network (e.g. it might be worked around by resetting the replica and setting replication up again).

My biggest worry for now is backups. I don't want to start those while we have reduced redundancy, so I may copy them this one time from codfw.

This was a scary prospect; happily it was resolved with almost no user impact (200-300 read query errors at most, and no edit impact).

Info for debugging:

mysql(es1021)> SHOW PROCESSLIST;
+-----------+-----------------+----------------------+--------------+-------------+---------+-------------------------------->
| Id        | User            | Host                 | db           | Command     | Time    | State                          >
+-----------+-----------------+----------------------+--------------+-------------+---------+-------------------------------->
|         1 | system user     |                      | NULL         | Daemon      |    NULL | InnoDB purge coordinator       >
|         2 | system user     |                      | NULL         | Daemon      |    NULL | InnoDB shutdown handler        >
|  76834449 | root            | localhost            | heartbeat    | Sleep       |       0 |                                >
| 545138451 | event_scheduler | localhost            | NULL         | Daemon      | 5878344 | Waiting for next activation    >
| 741375900 | system user     |                      | NULL         | Slave_IO    | 2530733 | Waiting for master to send even>
| 741375901 | system user     |                      | NULL         | Slave_SQL   |       0 | Slave has read all relay log; w>
| 776676463 | orchestrator    | 208.80.155.103:43718 | NULL         | Sleep       |       0 |                                >
| 776676464 | orchestrator    | 208.80.155.103:43714 | NULL         | Sleep       |       0 |                                >
| 776676465 | orchestrator    | 208.80.155.103:43724 | NULL         | Sleep       |       0 |                                >
| 776676466 | <replication_user> | 10.64.0.201:51956    | NULL         | Binlog Dump |  604235 | Master has sent all binlog to s>     <-------------- Note es1020 io_thread worked, es1022 ones stuck
| 790945603 | <replication_user> | 10.64.32.102:44004   | NULL         | Killed      |    2231 | Init                           >
| 790945672 | <replication_user> | 10.64.32.102:49876   | NULL         | Killed      |    1991 | Init                           >
| 790945706 | <replication_user> | 10.64.32.102:47772   | NULL         | Killed      |    1871 | Waiting to finalize termination>
| 790945736 | <replication_user> | 10.64.32.102:54000   | NULL         | Killed      |    1783 | Init                           >
| 790945754 | <replication_user> | 10.64.32.102:48962   | NULL         | Killed      |    1723 | Init                           >
| 790945793 | <replication_user> | 10.64.32.102:46206   | NULL         | Killed      |    1603 | Init                           >
| 790945877 | <replication_user> | 10.64.32.102:42586   | NULL         | Killed      |    1363 | Init                           >
| 790945896 | <replication_user> | 10.64.32.102:58200   | NULL         | Killed      |    1313 | Waiting to finalize termination>
| 790945921 | <replication_user> | 10.64.32.102:44426   | NULL         | Killed      |    1253 | Init                           >
| 790945969 | <replication_user> | 10.64.32.102:46484   | NULL         | Killed      |    1133 | Init                           >
| 790946010 | <replication_user> | 10.64.32.102:38438   | NULL         | Killed      |    1013 | Init                           >
| 790946100 | <replication_user> | 10.64.32.102:53630   | NULL         | Killed      |     893 | Init                           >
| 790946144 | <replication_user> | 10.64.32.102:53798   | NULL         | Killed      |     773 | Init                           >
| 790946187 | <replication_user> | 10.64.32.102:35644   | NULL         | Killed      |     653 | Init                           >
| 790946226 | <replication_user> | 10.64.32.102:52170   | NULL         | Killed      |     532 | Init                           >
| 790946232 | <mw_user>       | 10.64.16.16:55734    | NULL         | Sleep       |     503 |                                >
| 790946233 | <mw_user>       | 10.64.16.16:55748    | wikidatawiki | Sleep       |       0 |                                >
| 790946265 | <replication_user>   | 10.64.32.102:34236   | NULL         | Killed      |     412 | Init                           >
| 790946305 | <replication_user>   | 10.64.32.102:56348   | NULL         | Killed      |     292 | Init                           >
| 790946354 | <replication_user>   | 10.64.32.102:49262   | NULL         | Killed      |     172 | Init                           >
| 790946391 | <replication_user>   | 10.64.32.102:41208   | NULL         | Killed      |      73 | Init                           >
| 790946405 | <mw_user>       | 10.64.16.16:36664    | NULL         | Sleep       |      33 |                                >
| 790946406 | <mw_user>       | 10.64.16.16:36670    | wikidatawiki | Sleep       |       0 |                                >
| 790946417 | <replication_user>   | 10.64.32.102:33410   | NULL         | Binlog Dump |      13 | Init                           >
| 790946422 | root            | localhost            | NULL         | Query       |       0 | Init                           >
+-----------+-----------------+----------------------+--------------+-------------+---------+-------------------------------->
35 rows in set (0.031 sec)

The killed threads were "unkillable".
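For reference, a sketch of what the kill attempts amount to (thread ids taken from the processlist above); KILL normally drops the connection, but these threads stayed in the Killed state:

-- Terminate one stuck replication connection (sketch).
KILL CONNECTION 790945603;

-- Or generate KILL statements for every connection coming from es1022
-- (10.64.32.102 assumed to be es1022), to be pasted back into the client.
SELECT CONCAT('KILL CONNECTION ', id, ';')
  FROM information_schema.processlist
 WHERE host LIKE '10.64.32.102:%';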

root@es1021[(none)]> show slave hosts;
+-----------+--------------------+------+-----------+
| Server_id | Host               | Port | Master_id |
+-----------+--------------------+------+-----------+
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171974758 | es1022.eqiad.wmnet | 3306 | 171970708 |
| 171966665 | es1020.eqiad.wmnet | 3306 | 171970708 |
+-----------+--------------------+------+-----------+
27 rows in set (0.000 sec)

Screenshot_20230404_180333.png (1×2 px, 138 KB)

io_thread stuck "Connecting" (no other errors shown):

root@es1022[(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting to reconnect after a failed master event read
                   Master_Host: es1021.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: es1021-bin.008673
           Read_Master_Log_Pos: 600107737
                Relay_Log_File: es1022-relay-bin.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File: es1021-bin.008673
              Slave_IO_Running: Connecting
             Slave_SQL_Running: Yes

Both the primary and the replica were restarted, so this is not a purely transient state; it happens again after replication is started.

I was thinking that, when es1022 gets restarted, it could replicate from another host (es1020, or even a replica in codfw), i.e. something other than an es4 master, so that if we have to restart its upstream again it is much easier (just a read repool) than with a master, as we had to struggle to do here.
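A sketch of what that re-pointing could look like, run on es1022 and assuming GTID replication is in use (options not listed, such as the replication user and password, are kept from the existing configuration by CHANGE MASTER TO):

-- Hypothetical: make es1022 replicate from es1020 instead of the es4
-- eqiad master es1021, so its upstream can be restarted with a read repool.
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'es1020.eqiad.wmnet',
    MASTER_PORT = 3306,
    MASTER_USE_GTID = slave_pos;
START SLAVE;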

Change 905893 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/905893

Change 905893 merged by Marostegui:

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/905893

I have disabled semi-sync on es1021 (the master) and then started es1022 (the affected host). Everything seems to be fine. I don't recall the details, but I have seen this happen before when there were network blips on these hosts (and only on esX hosts; I don't have an explanation for it).
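In MariaDB terms this roughly amounts to the following (a sketch based on my reading of the comment, not a paste; semi-sync is built into 10.3+):

-- On es1021 (eqiad master): stop waiting for semi-sync ACKs from replicas.
SET GLOBAL rpl_semi_sync_master_enabled = OFF;

-- On es1022 (the affected replica), once mariadb is up again:
START SLAVE;
SHOW SLAVE STATUS\G  -- Slave_IO_Running and Slave_SQL_Running should both be Yes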

So far everything seems to be fine.

  • es1022 is catching up nicely
  • es1021 (dc master) has no stuck connections
  • es2021 (primary master) isn't affected.

I don't want to leave es4 with reduced capacity over the weekend (most EU countries will have days off due to Easter).

Once es1022 is in sync, I will re-enable semi-sync on es1021 and we'll see.
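Re-enabling would then roughly be the reverse (again a sketch; the replica's I/O thread needs to reconnect for it to register as a semi-sync replica):

-- On es1021 (master):
SET GLOBAL rpl_semi_sync_master_enabled = ON;

-- On es1022 (replica): enable the replica side and reconnect the I/O thread.
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;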

Marostegui moved this task from Done to In progress on the DBA board.

Running a data check on es1020 against es1022

The first round of data checks on es1022 worked fine.
I have also re-enabled semi-sync on es1021 and es1022, and everything seems to be working fine as well.

Started another set of checks on es1022 to be doubly sure before slowly repooling it.

@Marostegui It's not repooled yet; shall I repool it, since the public holiday begins tomorrow?

@Ladsgroup No, there is still one check running (I want to be 150% sure).

Change 906803 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es1022: Enable notifications

https://gerrit.wikimedia.org/r/906803

Change 906803 merged by Marostegui:

[operations/puppet@production] es1022: Enable notifications

https://gerrit.wikimedia.org/r/906803