db1175 and db1189 corrupted pagelinks index - broke replication
Closed, Resolved | Public

Description

Creating this task retroactively.
Replication just broke on two s3 hosts (db1175 and db1189):

[15:40:15]  <+icinga-wm> PROBLEM - MariaDB Replica SQL: s3 #page on db1175 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: mswiktionary. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:40:16]  <+icinga-wm> PROBLEM - MariaDB Replica SQL: s3 #page on db1189 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table pagelinks is corrupt: try to repair it on query. Default database: mswiktionary. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

The issue was the same on both:

Last_SQL_Error: Error 'Index for table 'pagelinks' is corrupt; try to repair it' on query. Default database: 'mswiktionary'. Query: 'INSERT /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ IGNORE INTO `pagelinks` (pl_from_namespace,pl_namespace,pl_title,pl_target_id,pl_from) VALUES (0,6,'Nl-bekijken.ogg',213370,50848),(0,0,'be-',29101,50848),(0,0,'kijken',213089,50848),(0,0,'lihat',14601,50848),(0,0,'tonton',16843,50848),(0,4,'Abjad_Fonetik_Antarabangsa',39952,50848),(0,100,'Sebutan_bahasa_Belanda',39978,50848),(0,102,'Bahasa_Belanda/ɛi̯kən',213088,50848)'
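
For triage across several replicas, the affected database and table can be pulled out of the error text. This is a minimal sketch: the regular expression is inferred only from the single message quoted above, not from any documented MariaDB format, and `parse_corrupt_index_error` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one Last_SQL_Error message in this task (assumption).
CORRUPT_INDEX_RE = re.compile(
    r"Index for table '(?P<table>[^']+)' is corrupt.*?"
    r"Default database: '(?P<database>[^']+)'"
)


def parse_corrupt_index_error(last_sql_error):
    """Return (database, table) if the message reports a corrupt index, else None."""
    match = CORRUPT_INDEX_RE.search(last_sql_error)
    if match is None:
        return None
    return match.group("database"), match.group("table")


# Shortened copy of the message above; the query text is cut off on purpose.
sample = (
    "Error 'Index for table 'pagelinks' is corrupt; try to repair it' on query. "
    "Default database: 'mswiktionary'. Query: 'INSERT ...'"
)
print(parse_corrupt_index_error(sample))
```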

It was fixed by issuing the following on each host:

cumin2024@db1175.eqiad.wmnet[(none)]> set session sql_log_bin=0;
Query OK, 0 rows affected (0.001 sec)

cumin2024@db1175.eqiad.wmnet[mswiktionary]> alter table pagelinks engine=innodb,force;
Query OK, 0 rows affected (4.077 sec)
Records: 0  Duplicates: 0  Warnings: 0

cumin2024@db1175.eqiad.wmnet[mswiktionary]> stop slave; start slave;
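
The sequence above could be generated per database roughly as follows. This is only a sketch: `repair_statements` is a hypothetical helper, and the `USE` statement is an assumption, since the transcript shows the prompt switching from `(none)` to `mswiktionary` but not the command that did it.

```python
def repair_statements(database: str, table: str) -> list[str]:
    """Build the statement sequence from the transcript above for one replica."""

    def quote(identifier: str) -> str:
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    return [
        # Keep the rebuild out of the binary log so it stays local to this host.
        "SET SESSION sql_log_bin=0;",
        # Assumption: not shown in the transcript, implied by the prompt change.
        f"USE {quote(database)};",
        # FORCE rebuilds the table, which also rebuilds its indexes.
        f"ALTER TABLE {quote(table)} ENGINE=InnoDB, FORCE;",
        # Restart replication so the SQL thread retries the failed event.
        "STOP SLAVE;",
        "START SLAVE;",
    ]


for statement in repair_statements("mswiktionary", "pagelinks"):
    print(statement)
```

The statements have to run in a single session, because `sql_log_bin` is set at session scope here.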

Event Timeline

Marostegui updated the task description.

We should probably run the ALTER TABLE on all tables everywhere. The index got corrupted three times on s1 as well. It will eventually happen with T352010: Gradually drop old pagelinks columns.
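
A rough sketch of how such a rollout could be enumerated, one host at a time. `rebuild_plan` is a hypothetical helper, and the host and database lists below are only the ones named in this task; the real list of hosts, wikis and tables is not given here and would have to come from elsewhere.

```python
from itertools import product


def rebuild_plan(hosts, databases, table="pagelinks"):
    """Yield (host, database, statement) for every host/database pair.

    Pairs are ordered host by host, so one host is finished before the next.
    """
    for host, database in product(hosts, databases):
        statement = f"ALTER TABLE `{database}`.`{table}` ENGINE=InnoDB, FORCE;"
        yield host, database, statement


# Only the hosts and the database mentioned in this task (not a full inventory).
hosts = ["db1175.eqiad.wmnet", "db1189.eqiad.wmnet"]
databases = ["mswiktionary"]

for host, database, statement in rebuild_plan(hosts, databases):
    print(f"{host}: {statement}")
```

As in the fix above, each session would presumably need `sql_log_bin=0` set first so the rebuild is not written to the binary log.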