High replication lag for enwiki (db1154 s1 replication crashed)
Closed, Resolved · Public

Assigned To

Authored By

	Maile66
	Apr 21 2024, 8:56 PM

Description

Previously: T362732 and T352010

This was resolved 18 April 2024. The issue has now returned. AFD stats have not been updating all day.

Related Objects

Mentioned In: T363089: Links to some existing drafts and articles are red
T352010: Gradually drop old pagelinks columns
Mentioned Here: T352010: Gradually drop old pagelinks columns
T362732: enwiki_p database replica has stopped updating

Event Timeline

Maile66 created this task. · Apr 21 2024, 8:56 PM

Restricted Application added a subscriber: Aklapper. · Apr 21 2024, 8:56 PM

Pppery renamed this task from Repeat of Task T362732 and Task T352010 to High replication lag for enwiki. · Apr 21 2024, 9:08 PM

Pppery added a project: Data-Services.

Pppery updated the task description.

Pppery moved this task from Backlog to Wiki replicas on the Data-Services board.

RhinosF1 subscribed. · Apr 21 2024, 9:13 PM

root@db1154:s1[(none)]> show slave status\G
                    Last_Error: Could not execute Write_rows_v1 event on table enwiki.pagelinks; Index for table 'pagelinks' is corrupt; try to repair it, Error_code: 1034; handler error HA_ERR_CRASHED; the event's master log db1196-bin.003644, end_log_pos 987579339
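The `Last_Error` text above packs several facts into one line: the failing event type, the affected table, the error code, and the binlog position on the master. As a rough sketch (not a tool used in this task), the fields could be pulled apart like this in Python. The regular expression and the `parse_last_error` helper are assumptions written against this one sample message; other replication errors are worded differently and would not match.

```python
import re

# Sample text copied from the Last_Error field quoted above.
LAST_ERROR = (
    "Could not execute Write_rows_v1 event on table enwiki.pagelinks; "
    "Index for table 'pagelinks' is corrupt; try to repair it, "
    "Error_code: 1034; handler error HA_ERR_CRASHED; "
    "the event's master log db1196-bin.003644, end_log_pos 987579339"
)

# Hypothetical pattern, fitted to the message format shown above only.
PATTERN = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<db>\w+)\.(?P<table>\w+);"
    r".*?Error_code: (?P<code>\d+);"
    r".*?handler error (?P<handler>\w+);"
    r".*?master log (?P<binlog>[\w.-]+), end_log_pos (?P<pos>\d+)"
)


def parse_last_error(text):
    """Return the fields of a row-event replication error, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    info["code"] = int(info["code"])
    info["pos"] = int(info["pos"])
    return info


if __name__ == "__main__":
    print(parse_last_error(LAST_ERROR))
```

Here the parsed result identifies `enwiki.pagelinks` as the table and `1034` with `HA_ERR_CRASHED` as the error, which is what points to index corruption rather than ordinary lag.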

taavi renamed this task from High replication lag for enwiki to High replication lag for enwiki (db1154 s1 replication crashed). · Apr 21 2024, 9:20 PM

taavi edited projects, added DBA; removed Data-Persistence.

Pppery subscribed. · Apr 21 2024, 9:30 PM

NightWolf1223 subscribed. · Apr 21 2024, 9:47 PM

Ladsgroup mentioned this in T352010: Gradually drop old pagelinks columns. · Apr 21 2024, 10:46 PM

I started an optimize table on pagelinks. It's unlikely to fix the issue, but it's worth a try.

Now a 7-hour system lag.

FYI comments complaining about how the lag is getting longer etc. aren't helpful. Everyone knows it is getting worse, and it will continue to get worse until it is fixed.

Well, my "complaints" are actually a request for information on when this problem will be fixed. Where else can I ask? I was directed to come here.

The comment included no question and no request. The problem will be fixed when this task's status is set to "resolved". Regularly posting "someone is working on it" is not a good use of anyone's time, and every comment creates notifications that someone needs to read. Thanks for your understanding.

FWIW, this is a data corruption issue. The last time it happened on sanitarium hosts, the whole system went down for a week. I'm trying to figure out whether I can avoid re-cloning the whole host (which would take a lot of time, potentially weeks) and reclone only the corrupted table, but there is no easy way to do this AFAICS. I'll keep you posted.

Sorry to butt in, but, my understanding from the above is that the corrupted table coincidentally happens to be the same table which was just recently "normalized", correct?

In T363077#9732808, @Ladsgroup wrote:

FWIW, this is a data corruption issue. The last time it happened on sanitarium hosts, the whole system went down for a week. I'm trying to figure out whether I can avoid re-cloning the whole host (which would take a lot of time, potentially weeks) and reclone only the corrupted table, but there is no easy way to do this AFAICS. I'll keep you posted.

First, try to drop and recreate the problematic index.

If that doesn't work, another option is to force a full table rebuild: alter table pagelinks engine=InnoDB, force;

Chlod mentioned this in T363089: Links to some existing drafts and articles are red. · Apr 22 2024, 11:39 AM

Novem_Linguae subscribed. · Apr 22 2024, 11:40 AM

Chlod subscribed. · Apr 22 2024, 11:45 AM

Thanks, I will try that!

Worst-case scenario, we can try to add a replication filter to skip that table, let replication catch up, and then reimport only that table from the master.
That way we don't have to reclone the host. Try those two things and we can see how it goes.

By "those two things" I meant the alter table to rebuild the table and the index drop+creation.

We can evaluate the replication filter option once the above has been tried.
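The options discussed so far form an escalation ladder, from least to most invasive: optimize the table, drop and recreate the index, force a full rebuild, and only then filter the table out of replication. A minimal sketch of that ordering follows, written as a Python helper that only builds the SQL text and runs nothing. The `recovery_steps` function, the index name `example_index`, and the columns `col_a` and `col_b` are hypothetical placeholders, since the thread never names the corrupt index. The filter step assumes MariaDB, where `replicate_wild_ignore_table` can be changed at runtime while the replication threads are stopped; on a multi-source replica the statements would also need to target the right connection.

```python
def recovery_steps(db, table, index_name, index_columns):
    """Return (label, [SQL statements]) pairs, least invasive first.

    Nothing is executed here; the caller decides what to run and when.
    """
    cols = ", ".join(index_columns)
    return [
        ("optimize", [f"OPTIMIZE TABLE {table};"]),
        ("recreate index", [
            f"ALTER TABLE {table} DROP INDEX {index_name};",
            f"ALTER TABLE {table} ADD INDEX {index_name} ({cols});",
        ]),
        ("force rebuild", [f"ALTER TABLE {table} ENGINE=InnoDB, FORCE;"]),
        # Assumption: MariaDB-style dynamic replication filter.
        ("skip table in replication", [
            "STOP SLAVE;",
            f"SET GLOBAL replicate_wild_ignore_table = '{db}.{table}';",
            "START SLAVE;",
        ]),
    ]


if __name__ == "__main__":
    # example_index, col_a and col_b are placeholders, not the real schema.
    steps = recovery_steps("enwiki", "pagelinks", "example_index", ["col_a", "col_b"])
    for label, statements in steps:
        print(f"-- {label}")
        for statement in statements:
            print(statement)
```

The last step only gets replication moving again. As noted above, the skipped table would still have to be reimported from the master afterwards.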

Eejit43 subscribed. · Apr 22 2024, 1:35 PM

In T363077#9732470, @Ladsgroup wrote:

I started an optimize table on pagelinks. It's unlikely to fix the issue, but it's worth a try.

This actually seems to have fixed the issue. The replication is flowing so far.

In T363077#9732870, @Wbm1058 wrote:

Sorry to butt in, but, my understanding from the above is that the corrupted table coincidentally happens to be the same table which was just recently "normalized", correct?

1- By normalization we mean this: https://en.wikipedia.org/wiki/Database_normalization
2- The alter table for that table finished around a week before the crash.
3- Hundreds of other hosts have had that alter, including thirty other s1 hosts (which hold the same data). Only db1154:3311 crashed afterwards.
4- The normalization is not done yet; we have only changed the PK. The actual drop hasn't happened yet.

Does this mean they are not related? Not really. Such corruptions usually happen because there is an underlying issue (hardware, etc.) and then a large write causes the faulty section to crash. Imagine having a bad sector on the disk that triggers a crash if you read from it. You don't usually read it unless you have a large file that is spread all over the disk.

In other words, they are related and not related at the same time.

Anyway, the lag is decreasing. It'll take two hours or so to fully recover (unless there is another replication crash). I will close this once the lag goes below one second.
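For a feel of where an estimate like "two hours or so" can come from, here is a back-of-the-envelope sketch. It assumes the replica applies events at a steady multiple of the rate at which the master produces them, so the lag shrinks by `apply_rate - 1` seconds per second of wall time. The `catch_up_seconds` function and the numbers in the example are illustrative assumptions, not measurements from db1154.

```python
def catch_up_seconds(lag_seconds, apply_rate):
    """Estimate the wall-clock seconds until the replication lag reaches zero.

    apply_rate is how many seconds of master activity the replica replays
    per second of wall time. It must exceed 1.0, otherwise the replica
    never catches up.
    """
    if apply_rate <= 1.0:
        raise ValueError("replica is not applying faster than the master writes")
    return lag_seconds / (apply_rate - 1.0)


if __name__ == "__main__":
    # Illustrative numbers only: a 4-hour lag replayed at 3x real time.
    print(catch_up_seconds(4 * 3600, 3.0) / 3600, "hours")
```

A real replica rarely applies events at a constant rate, so an estimate like this is only a rough guide.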

The rebuild fixed the issue. The replag is zero.

Novem_Linguae awarded a token. · Apr 22 2024, 8:20 PM
