High replication lag for enwiki (db1154 s1 replication crashed)
Closed, Resolved, Public

Description

Previously: T362732 and T352010

This was resolved on 18 April 2024, but the issue has now returned. AFD stats have not been updating all day.

Event Timeline

Pppery renamed this task from Repeat of Task T362732 and Task T352010 to High replication lag for enwiki. Apr 21 2024, 9:08 PM
Pppery added a project: Data-Services.
Pppery updated the task description.
Pppery moved this task from Backlog to Wiki replicas on the Data-Services board.
taavi subscribed.
root@db1154:s1[(none)]> show slave status\G
                    Last_Error: Could not execute Write_rows_v1 event on table enwiki.pagelinks; Index for table 'pagelinks' is corrupt; try to repair it, Error_code: 1034; handler error HA_ERR_CRASHED; the event's master log db1196-bin.003644, end_log_pos 987579339
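For anyone triaging a similar crash, the Last_Error string above already names the affected table, the error code, and the binlog position. A minimal sketch of pulling those fields out, assuming the message keeps this exact shape (the pattern and the `parse_last_error` helper are illustrative, not part of any WMF tooling):

```python
import re

# Last_Error text copied from the SHOW SLAVE STATUS output above.
LAST_ERROR = (
    "Could not execute Write_rows_v1 event on table enwiki.pagelinks; "
    "Index for table 'pagelinks' is corrupt; try to repair it, "
    "Error_code: 1034; handler error HA_ERR_CRASHED; "
    "the event's master log db1196-bin.003644, end_log_pos 987579339"
)

# Assumed message layout; other error types may be worded differently.
PATTERN = re.compile(
    r"on table (?P<db>\w+)\.(?P<table>\w+);.*?"
    r"Error_code: (?P<code>\d+);.*?"
    r"handler error (?P<handler>\w+);.*?"
    r"master log (?P<binlog>[\w.-]+), end_log_pos (?P<pos>\d+)"
)


def parse_last_error(text):
    """Return the key fields of a replication error, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["code"] = int(info["code"])
    info["pos"] = int(info["pos"])
    return info


if __name__ == "__main__":
    print(parse_last_error(LAST_ERROR))
```

Here the parsed fields point at enwiki.pagelinks with error 1034 (HA_ERR_CRASHED), which is what makes this a corruption problem rather than ordinary lag.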
taavi renamed this task from High replication lag for enwiki to High replication lag for enwiki (db1154 s1 replication crashed). Apr 21 2024, 9:20 PM
taavi edited projects, added DBA; removed Data-Persistence.

I started an optimize table on pagelinks. It's unlikely it would fix the issue, but it's worth a try.

FYI comments complaining about how the lag is getting longer etc. aren't helpful. Everyone knows it is getting worse, and it will continue to get worse until it is fixed.

Well, my "complaints" are actually a request for information on when this problem will be fixed. Where else can I ask? I was directed to come here.

The comment included no question and no request. The problem will be fixed when this task's status is set to "resolved". Regularly posting "someone is working on it" is not a good use of anyone's time, and every comment creates notifications that someone needs to read. Thanks for your understanding.

FWIW, this is a data corruption issue. Last time it happened on sanitarium hosts, the whole system went down for a week. I'm trying to figure out if I can avoid re-cloning the whole host (which would take a lot of time, potentially weeks) and only reclone the corrupted table, but there is no easy way to do this AFAICS. I'll keep you posted.

Sorry to butt in, but, my understanding from the above is that the corrupted table coincidentally happens to be the same table which was just recently "normalized", correct?

> FWIW, this is a data corruption issue. Last time it happened on sanitarium hosts, the whole system went down for a week. I'm trying to figure out if I can avoid re-cloning the whole host (which would take a lot of time, potentially weeks) and only reclone the corrupted table, but there is no easy way to do this AFAICS. I'll keep you posted.

First, try to drop+create the problematic index.

If that doesn't work, another option is to force a full table rebuild: alter table pagelinks engine=InnoDB,force;

Worst case scenario, we can try to add a replication filter to skip that table, let replication catch up, and then reimport only that table from the master.
That way we don't have to reclone the host. Try those two things first and we can see how it goes.

By "those two things" I meant the alter table to rebuild the table and the index drop+creation.

We can evaluate the replication filter option once the above has been tried.
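The escalation order suggested above can be written down as an ordered list of statements. This is only a sketch: the index name and column passed in are hypothetical placeholders (the thread does not say which index is corrupt), the filter step assumes MariaDB's dynamic `replicate_ignore_table` variable is available on the host, and the final reimport from the master is left out.

```python
def recovery_steps(db, table, index_name, index_columns):
    """Return (label, [SQL statements]) pairs, least invasive first.

    index_name and index_columns are placeholders supplied by the caller;
    the thread does not identify the corrupt index.
    """
    cols = ", ".join(index_columns)
    return [
        (
            "drop and recreate the index",
            [
                f"ALTER TABLE {table} DROP INDEX {index_name};",
                f"ALTER TABLE {table} ADD INDEX {index_name} ({cols});",
            ],
        ),
        (
            "force a full table rebuild",
            [f"ALTER TABLE {table} ENGINE=InnoDB, FORCE;"],
        ),
        (
            # Assumption: replicate_ignore_table can be set at runtime
            # while the replication threads are stopped.
            "skip the table in replication, reimport it later",
            [
                "STOP SLAVE;",
                f"SET GLOBAL replicate_ignore_table = '{db}.{table}';",
                "START SLAVE;",
            ],
        ),
    ]


if __name__ == "__main__":
    # 'example_idx' and 'example_col' are hypothetical names.
    for label, statements in recovery_steps(
        "enwiki", "pagelinks", "example_idx", ["example_col"]
    ):
        print(f"-- {label}")
        for statement in statements:
            print(statement)
```

Each step is meant to be tried only if the previous one fails; the function just prints statements for review and runs nothing against a database.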

> I started an optimize table on pagelinks. It's unlikely it would fix the issue, but it's worth a try.

This actually seems to have fixed the issue. The replication is flowing so far.

> Sorry to butt in, but, my understanding from the above is that the corrupted table coincidentally happens to be the same table which was just recently "normalized", correct?

1- By normalization we mean this: https://en.wikipedia.org/wiki/Database_normalization
2- The alter table for that table finished around a week before the crash.
3- Hundreds of other hosts have had that alter, including thirty other s1 hosts (which hold the same data); only db1154:3311 crashed afterwards.
4- The normalization is not done yet; we have only changed the PK. The actual drop hasn't happened yet.

Does this mean they are not related? Not really. Usually such corruptions happen because there is an underlying issue (hardware, etc.) and then a large write causes the faulty section to crash. Imagine having a bad sector on the disk that triggers a crash if you read from it. You don't usually read it unless you have a large file that is spread everywhere.

In other words, they are related and not related at the same time.

Anyway, the lag is decreasing. It'll take two hours or so to fully recover (unless there is another replication crash). I will close this once the lag goes below one second.
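As a rough illustration of where an estimate like "two hours or so" can come from: a replica that applies events faster than the master produces them drains its backlog at the difference of the two rates. The numbers below are purely hypothetical and were not measured on db1154; `catch_up_hours` is an illustrative helper.

```python
def catch_up_hours(lag_seconds, apply_rate):
    """Estimate hours until replication lag reaches zero.

    apply_rate is the number of seconds of master activity the replica
    applies per wall-clock second. New events keep arriving at 1x, so
    the backlog shrinks at (apply_rate - 1) seconds per second.
    """
    if apply_rate <= 1.0:
        # The replica is not gaining on the master; it never catches up.
        return float("inf")
    return lag_seconds / (apply_rate - 1.0) / 3600.0


if __name__ == "__main__":
    # Hypothetical inputs: 6 hours of lag, replica applying at 4x real time.
    print(catch_up_hours(6 * 3600, 4.0))
```

The estimate assumes a steady apply rate and no further replication crashes, which is the same caveat given in the comment above.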

Ladsgroup claimed this task.
Ladsgroup moved this task from Triage to Done on the DBA board.

The rebuild fixed the issue. The replag is zero.