[misc] db1159 data corruption
Open, High, Public

Description

spotted in P69411 for T375186:

(1) db1159.eqiad.wmnet                                                                                                                                                                                                                  
----- OUTPUT of 'journalctl --no-...n 10 -u mariadb ' -----                                                                                                                                                                             
Aug 21 08:53:44 db1159 mysqld[31793]:  0: len 8; hex 000000000120040e; asc         ;;                                                                                                                                                   
Aug 21 08:53:44 db1159 mysqld[31793]:  1: len 6; hex 000000000000; asc       ;;                                                                                                                                                         
Aug 21 08:53:44 db1159 mysqld[31793]:  2: len 7; hex 80000000000000; asc        ;;
Aug 21 08:53:44 db1159 mysqld[31793]:  3: len 4; hex 0000000b; asc     ;;
Aug 21 08:53:44 db1159 mysqld[31793]:  4: len 4; hex 000480e3; asc     ;;
Aug 21 08:53:44 db1159 mysqld[31793]:  5: len 4; hex 0000a227; asc    ';;
Aug 21 08:53:44 db1159 mysqld[31793]:  6: len 8; hex 7fffffffffffffff; asc         ;;
Aug 21 08:53:44 db1159 mysqld[31793]:  7: len 4; hex 60f932da; asc ` 2 ;;
Aug 21 08:53:44 db1159 mysqld[31793]: 2024-08-21  8:53:44 2619239875 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
Aug 21 09:17:54 db1159 mysqld[31793]: 2024-08-21  9:17:54 2619336917 [ERROR] InnoDB: Flagged corruption of `key_dimension` in table `phabricator_fact`.`fact_intdatapoint` in CHECK TABLE; Wrong count
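For the broad log check, a minimal sketch of pulling the affected schema, table and index out of a "Flagged corruption" line like the one above. The function name and the regex are illustrative assumptions, not part of any existing tooling; the pattern is only modelled on the single message shown in this task.

```python
import re

# Modelled on the one "Flagged corruption" message quoted in this task;
# other InnoDB corruption messages may be worded differently (assumption).
FLAGGED = re.compile(
    r"InnoDB: Flagged corruption of `(?P<index>[^`]+)` "
    r"in table `(?P<schema>[^`]+)`\.`(?P<table>[^`]+)`"
)


def parse_flagged_corruption(line):
    """Return (schema, table, index) if the line reports a flagged index, else None.

    Hypothetical helper, for illustration only.
    """
    match = FLAGGED.search(line)
    if match is None:
        return None
    return match.group("schema"), match.group("table"), match.group("index")


sample = (
    "Aug 21 09:17:54 db1159 mysqld[31793]: 2024-08-21  9:17:54 2619336917 [ERROR] "
    "InnoDB: Flagged corruption of `key_dimension` in table "
    "`phabricator_fact`.`fact_intdatapoint` in CHECK TABLE; Wrong count"
)
print(parse_flagged_corruption(sample))
```

This only identifies what was flagged; it says nothing about whether the same index is affected on other hosts.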

Event Timeline

this instance hosts phabricator's db for reference

ABran-WMF triaged this task as Medium priority. Sep 25 2024, 8:21 AM
ABran-WMF moved this task from Triage to Pending comment on the DBA board.

@Ladsgroup wdyt about mariadb's suggested course of action?

We don't switchover misc cluster databases (dc switchover I mean not master) so I'm not sure why we checked it for dc switchover :D

If this corruption doesn't exist in db1217, we should switchover the primaries (it's a bit of work for misc clusters, we can't use most of the current tools I think) and then try to clone it from the replica of codfw (db2160) if it doesn't have the corruption either.
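A rough sketch of the "does the corruption exist elsewhere" check suggested above: it only builds the `CHECK TABLE` statement for the flagged table and prints one client command per host, without running anything. The helper names are hypothetical, and the plain `mysql -h ... -e ...` invocation is an assumption; the actual client wrapper and credentials used on these hosts may differ.

```python
import shlex


def check_table_statement(schema, table, extended=True):
    """Build a CHECK TABLE statement with backtick-quoted identifiers."""
    def quote(identifier):
        # Escape embedded backticks by doubling them.
        return "`" + identifier.replace("`", "``") + "`"

    statement = f"CHECK TABLE {quote(schema)}.{quote(table)}"
    if extended:
        statement += " EXTENDED"
    return statement + ";"


def check_commands(hosts, schema, table):
    """Return one copy/paste shell command per host (nothing is executed here)."""
    statement = check_table_statement(schema, table)
    # Assumption: a stock mysql client reachable with -h; adjust to local tooling.
    return [
        f"mysql -h {shlex.quote(host)} -e {shlex.quote(statement)}"
        for host in hosts
    ]


# Hosts and table names are the ones mentioned in this task.
for command in check_commands(["db1217", "db2160"], "phabricator_fact", "fact_intdatapoint"):
    print(command)
```

Printing the commands rather than executing them keeps the operator in control of what actually runs on each host.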

ABran-WMF moved this task from Pending comment to Ready on the DBA board.

We don't switchover misc cluster databases (dc switchover I mean not master) so I'm not sure why we checked it for dc switchover :D

it was spotted in the broad check of all instances' logs, not specifically targeted!

If this corruption doesn't exist in db1217, we should switchover the primaries (it's a bit of work for misc clusters, we can't use most of the current tools I think) and then try to clone it from the replica of codfw (db2160) if it doesn't have the corruption either.

ack, will check soon then

ABran-WMF renamed this task from db1159 data corruption to [misc] db1159 data corruption. Oct 8 2024, 6:48 AM

haven't had the time to get to it yet!

This is a master, so we definitely need to give this a lot more priority: if there are issues with the host, it could crash.

ABran-WMF raised the priority of this task from Medium to High.

will get to it

glue script to play this procedure has been written

Can we get the script reviewed through the normal channels before it is run in production?
Even though the idea is good, that script will not work in misc, and it is basically a wrapper of db-switchover (which does work with misc). I am failing to see what it does beyond that.

It is, indeed, a wrapper!
Please refresh the page as I think you might have reviewed an earlier stage of this script:
You'll be able to find the documentation you've linked me in the README, or at least its current implementation. Let's focus on improving both the wrapper and the procedure so this is added to the stack of automated procedures. The main goal is to provide a templated copy/paste source, not to run the code on production (or at least not without the consent of the operator). This is less error-prone than the written procedure and will save some time.

Can you add me as a reviewer so we can do it properly?
The script won't work, there are things that do not apply in misc:

  • dbctl
  • triggers

There is no mention of proxies, which is probably the most important thing to handle here. The SQL commands are basically the ones we use in db-switchover. I think you'd need to clarify what you want the final state to look like, as I don't understand the goal of this script.
If you want to provide a copy/paste template, you can probably gather/scrape the information from any phabricator ticket that mentions misc, as I've suggested before; they even have copy/paste commands depending on the misc section.

I will get the master replaced - we shouldn't go to the break with this master in this state

I will move db1213 here, replace db1159 and place db1159 in s5. (vslow and 150 in main traffic)

Change #1100040 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1213 to m3

https://gerrit.wikimedia.org/r/1100040

Change #1100041 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] instances.yaml: Remove db1213 from dbctl

https://gerrit.wikimedia.org/r/1100041

Change #1100040 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1213 to m3

https://gerrit.wikimedia.org/r/1100040

Change #1100041 merged by Marostegui:

[operations/puppet@production] instances.yaml: Remove db1213 from dbctl

https://gerrit.wikimedia.org/r/1100041

Mentioned in SAL (#wikimedia-operations) [2024-12-03T08:14:34Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Remove db1213 from dbctl T375593', diff saved to https://phabricator.wikimedia.org/P71489 and previous config saved to /var/cache/conftool/dbconfig/20241203-081434-marostegui.json

db1213 is now up and running, replicating in m3. I am going to give it two days and do a switchover on Thursday.