PROBLEM - MariaDB Replica SQL: s6 #page on db2217 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table archive is corrupt: try to repair it on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
Description
Event Timeline
Depooling: (https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica)
ssh cumin1002.eqiad.wmnet
sudo dbctl instance db2217 depool
sudo dbctl config commit -m "Depool db2217"
sudo cookbook sre.hosts.downtime --hours 72 -r "Corrupt Index" 'db2217*'
"db2193": 100, "db2193": 100, "db2217": 200, "db2224": 400, "db2224": 400, "db2229": 400 "db2229": 400 } } ] ] Enter y or yes to confirm: yes Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20241110-122532-slyngshede.json WARNING:conftool.announce:dbctl commit (dc=all): 'Depool db2217', diff saved to https://phabricator.wikimedia.org/P70997 and previous config saved to /var/cache/conftool/dbconfig/20241110-122532-slyngshede.json
I ran optimize table archive (11M records, seemed safe enough) after stopping the slave, and it seems to have recovered. I wasn't confident enough to declare it "ready for prod" so we decided not to repool, leaving the decision to data persistence :)
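For reference, the repair sequence described above, roughly as it would be run on the replica itself (a sketch only; it assumes root client access via sudo mysql and that frwiki is the affected schema, per the alert):

# Stop replication so the table is not being written to while it is rebuilt
sudo mysql -e "STOP SLAVE;"
# Rebuild the archive table and its indexes (~11M rows, so expect this to take a while)
sudo mysql frwiki -e "OPTIMIZE TABLE archive;"
# Resume replication and check that both threads come back up cleanly
sudo mysql -e "START SLAVE;"
sudo mysql -e "SHOW SLAVE STATUS\G"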
Thank you! If "start slave" is issued and it looks okay in Orchestrator, it's safe to repool (via the cookbook we have). You don't need to do it; I will take care of it soon if no one beats me to it.
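The repool itself would look roughly like this from a cumin host (a sketch based on the sre.mysql.pool cookbook run logged below; any flags controlling the gradual ramp-up are not shown in this task and are assumed to use the cookbook's defaults):

# Repool db2217, bringing the weight back up in stages rather than all at once
sudo cookbook sre.mysql.pool db2217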
Mentioned in SAL (#wikimedia-operations) [2024-11-12T09:59:54Z] <arnaudb@cumin1002> START - Cookbook sre.mysql.pool db2217 gradually with 4 steps - T379491
Mentioned in SAL (#wikimedia-operations) [2024-11-12T10:45:17Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2217 gradually with 4 steps - T379491