PROBLEM - MariaDB Replica SQL: s6 #page on db2217 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table archive is corrupt: try to repair it on query. Default database: frwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
Description
Event Timeline
Depooling: (https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica)
ssh cumin1002.eqiad.wmnet
sudo dbctl instance db2217 depool
sudo dbctl config commit -m "Depool db2217"
sudo cookbook sre.hosts.downtime --hours 72 -r "Corrupt Index" 'db2217*'
"db2193": 100, "db2193": 100, "db2217": 200, "db2224": 400, "db2224": 400, "db2229": 400 "db2229": 400 } } ] ] Enter y or yes to confirm: yes Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20241110-122532-slyngshede.json WARNING:conftool.announce:dbctl commit (dc=all): 'Depool db2217', diff saved to https://phabricator.wikimedia.org/P70997 and previous config saved to /var/cache/conftool/dbconfig/20241110-122532-slyngshede.json
I ran optimize table archive (11M records, seemed safe enough) after stopping the slave, and it seems to have recovered. I wasn't confident enough to declare it "ready for prod" so we decided not to repool, leaving the decision to data persistence :)
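For reference, the repair sequence described above, roughly as it would be run on the replica itself (a sketch only; it assumes root client access via sudo mysql and that frwiki is the affected schema, per the alert):

# Stop replication so the table is not being written to while it is rebuilt
sudo mysql -e "STOP SLAVE;"
# Rebuild the archive table and its indexes (~11M rows, so expect this to take a while)
sudo mysql frwiki -e "OPTIMIZE TABLE archive;"
# Resume replication and check that both threads come back up cleanly
sudo mysql -e "START SLAVE;"
sudo mysql -e "SHOW SLAVE STATUS\G"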
Thank you! If "start slave" is issued and it looks okay in Orchestrator, it's safe to repool (via the cookbook we have). You don't need to do it; I will take care of it soon if no one beats me to it.
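The repool itself would look roughly like this from a cumin host (a sketch based on the sre.mysql.pool cookbook run logged below; any flags controlling the gradual ramp-up are not shown in this task and are assumed to use the cookbook's defaults):

# Repool db2217, bringing the weight back up in stages rather than all at once
sudo cookbook sre.mysql.pool db2217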
Mentioned in SAL (#wikimedia-operations) [2024-11-12T09:59:54Z] <arnaudb@cumin1002> START - Cookbook sre.mysql.pool db2217 gradually with 4 steps - T379491
Mentioned in SAL (#wikimedia-operations) [2024-11-12T10:45:17Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2217 gradually with 4 steps - T379491