s8 replication on an-redacteddb1001 is broken
Closed, Resolved (Public)

Description

I restarted an-redacteddb1001 recently for T376800, but there may have been an active ALTER TABLE statement running against s8.
Upon restarting, all sections started cleanly and replication started on everything except s8.

The error in replication is shown below.

CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 2 of table 'wikidatawiki.revision' cannot be converted from type 'bigint' to type 'int(10) unsigned'

CRITICAL slave_sql_lag Replication lag: 766504.97 seconds
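
For scale, the lag figure in the alert is easier to read in days: 766504.97 seconds is roughly 8.9 days. A minimal sketch of that conversion follows; the helper name `format_lag` is made up for illustration and is not part of any monitoring tooling.

```python
from datetime import timedelta


def format_lag(seconds: float) -> str:
    """Render a replication lag given in seconds as days, hours and minutes."""
    lag = timedelta(seconds=seconds)
    # timedelta splits the value into whole days plus leftover seconds.
    hours, remainder = divmod(lag.seconds, 3600)
    minutes = remainder // 60
    return f"{lag.days}d {hours}h {minutes}m"


# Lag value copied from the slave_sql_lag alert above.
print(format_lag(766504.97))  # prints "8d 20h 55m"
```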

I know that @Marostegui had previously been working on this host, but I had thought that the work had been completed or reverted by the time of the reboot.

Event Timeline

The usual fix would be to stop replication, run the schema change to alter table and change the data type and restart replication but doing that takes two days :/

Two days would be fine, thanks. If we can be back in sync by the end of the month, that would be great.

I had thought that I was OK to restart it because Manuel said on November 7th:

an-redacteddb1001.eqiad.wmnet broke replication because of the schema change, I will revert it and we should be good (it will take around 2 days)

So I thought that on November 11th I would have a window to reboot it before trying again. Unfortunately, I didn't think to check the process list or ask your team before proceeding.

Would you like me to run the schema change, or would you prefer to do it @Ladsgroup ?

I can take care of it. I probably want to use INPLACE to speed things up if it's doable. Just need to double check it.

I can take care of it.

Thanks ever so much.

I probably want to use INPLACE to speed things up if it's doable. Just need to double check it.

At first glance, this doesn't look like a supported operation:

https://mariadb.com/kb/en/innodb-online-ddl-operations-with-the-inplace-alter-algorithm/#changing-the-data-type-of-a-column

...but maybe I'm missing something.

Yeah, it's not possible to do INPLACE or INSTANT :/

Started it in a screen, let's see how it goes.

root@an-redacteddb1001.eqiad.wmnet[wikidatawiki]> ALTER TABLE /*_*/revision
    ->   CHANGE rev_id rev_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    ->   CHANGE rev_comment_id rev_comment_id BIGINT UNSIGNED NOT NULL,
    ->   CHANGE rev_actor rev_actor BIGINT UNSIGNED NOT NULL,
    ->   CHANGE rev_parent_id rev_parent_id BIGINT UNSIGNED DEFAULT NULL;
Query OK, 2251657814 rows affected (2 days 15 hours 11 min 1.889 sec)
Records: 2251657814  Duplicates: 0  Warnings: 0

root@an-redacteddb1001.eqiad.wmnet[wikidatawiki]> start slave;

It still shows a new error now:

Last_SQL_Error: Column 0 of table 'wikidatawiki.revision' cannot be converted from type 'int' to type 'bigint(20) unsigned'

I wonder if we should just rebuild it from another wikireplica?
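
To compare this error with the one in the task description, here is a small sketch that pulls the column index and the two type names out of the message text. The regular expression is based only on the two messages quoted in this task, and the names `ConversionError` and `parse_conversion_error` are hypothetical helpers, not part of MariaDB or any WMF tooling.

```python
import re
from typing import NamedTuple, Optional


class ConversionError(NamedTuple):
    column: int     # column index as reported in the message
    table: str      # schema-qualified table name
    from_type: str  # type named after "from type"
    to_type: str    # type named after "to type"


# Pattern derived from the two error messages quoted in this task.
_PATTERN = re.compile(
    r"Column (\d+) of table '([^']+)' cannot be converted "
    r"from type '([^']+)' to type '([^']+)'"
)


def parse_conversion_error(message: str) -> Optional[ConversionError]:
    """Return the parsed fields, or None if the message has another shape."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    column, table, from_type, to_type = match.groups()
    return ConversionError(int(column), table, from_type, to_type)


before = parse_conversion_error(
    "Column 2 of table 'wikidatawiki.revision' cannot be converted "
    "from type 'bigint' to type 'int(10) unsigned'"
)
after = parse_conversion_error(
    "Column 0 of table 'wikidatawiki.revision' cannot be converted "
    "from type 'int' to type 'bigint(20) unsigned'"
)
print(before)
print(after)
```

Side by side, the first error names column 2 going from `bigint` to `int(10) unsigned`, while the new one names column 0 going from `int` to `bigint(20) unsigned`, so the mismatch has changed rather than gone away.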

Probably yes. Depool a clouddb* host and just copy the s8 section. That will take 2-3 hours. Altering the revision table takes way too long to experiment with.
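
For a sense of why altering is too slow to iterate on, the figures reported by the ALTER TABLE earlier in this task (2251657814 rows in 2 days 15 hours 11 min 1.889 sec) work out to just under 10,000 rows per second. A quick sketch of the arithmetic, using only those two numbers:

```python
from datetime import timedelta

# Both values are copied from the "Query OK" line of the ALTER TABLE above.
rows = 2_251_657_814
elapsed = timedelta(days=2, hours=15, minutes=11, seconds=1.889)

rows_per_second = rows / elapsed.total_seconds()
print(f"{elapsed.total_seconds():.3f} s, about {rows_per_second:.0f} rows/s")
```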

Depooled clouddb1020.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-11-26T13:54:17Z] <ladsgroup@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Reclone (T379724)

Mentioned in SAL (#wikimedia-operations) [2024-11-26T13:54:30Z] <ladsgroup@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Reclone (T379724)

Mentioned in SAL (#wikimedia-operations) [2024-11-26T13:54:40Z] <ladsgroup@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Reclone (T379724)

Mentioned in SAL (#wikimedia-operations) [2024-11-26T13:54:54Z] <ladsgroup@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Reclone (T379724)

The reclone is finished. It's catching up.

I'm not seeing any replication filters, so there isn't anything else to be done here. I just repooled clouddb1020:3318.

Ladsgroup edited projects, added DBA; removed Data-Persistence.
Ladsgroup moved this task from Triage to Done on the DBA board.