
Check for errors on all tables on some hosts
Closed, Resolved (Public)

Description

The hosts listed below logged InnoDB errors on some tables weeks ago, for example:

Mar 11 08:21:35 db2148 mysqld[3190]: 2021-03-11  8:21:35 9 [ERROR] InnoDB: Record in index `pl_namespace` of table `zhwiki`.`pagelinks` was not found on update: TUPLE (info_bits=0, 3 fields): {[4]    (0x80000000),[12]            (0xE799BDE6B2B3E6ACA1E9838E),[4] : #(0x003A2023)} at: COMPACT RECORD(info_bits=0, 3 fields): {[4]    (0x80000000),[12]            (0xE799BDE6B2B3E6ACA1E9838E),[4] 5D (0x0035440E)}
Mar 11 08:26:04 db2148 mysqld[3190]: 2021-03-11  8:26:04 0 [ERROR] InnoDB: Unable to find a record to delete-mark
Mar 11 08:26:04 db2148 mysqld[3190]: InnoDB: tuple DATA TUPLE: 2 fields;
Mar 11 08:26:04 db2148 mysqld[3190]:  0: len 4; hex 000016bd; asc     ;;
Mar 11 08:26:04 db2148 mysqld[3190]:  1: len 4; hex 00502339; asc  P#9;;
Mar 11 08:26:04 db2148 mysqld[3190]: InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 0
Mar 11 08:26:04 db2148 mysqld[3190]:  0: len 4; hex 000016bd; asc     ;;
Mar 11 08:26:04 db2148 mysqld[3190]:  1: len 4; hex 004f8ffc; asc  O  ;;
Mar 11 08:26:04 db2148 mysqld[3190]: 2021-03-11  8:26:04 0 [ERROR] InnoDB: page [page id: space=7711, page number=60672] (631 records, index id 28998).
Mar 11 08:26:04 db2148 mysqld[3190]: 2021-03-11  8:26:04 0 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

Errors like these usually point to tables that might have corrupted indexes.
Running mysqlcheck $database should report those tables as corrupted.
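
To narrow down which tables to look at first, the index and table names can be pulled out of the error log. This is only a minimal sketch: the regular expression is an assumption based on the two message styles quoted in this task (backticks in the 10.4 log, double quotes in the 10.1 log), the function name is made up, and the sample line is shortened.

```python
import re

# Assumed pattern, derived only from the messages quoted in this task:
#   index `pl_namespace` of table `zhwiki`.`pagelinks`
#   index "wl_user_notificationtimestamp" of table "metawiki"."watchlist"
INDEX_TABLE_RE = re.compile(
    r'index [`"](?P<index>[^`"]+)[`"] of table '
    r'[`"](?P<db>[^`"]+)[`"]\.[`"](?P<table>[^`"]+)[`"]'
)


def affected_tables(log_lines):
    """Return a sorted list of (db, table, index) tuples named in InnoDB messages."""
    found = set()
    for line in log_lines:
        match = INDEX_TABLE_RE.search(line)
        if match:
            found.add((match.group("db"), match.group("table"), match.group("index")))
    return sorted(found)


# Shortened sample lines in the same shape as the log excerpt above.
sample = [
    "Mar 11 08:21:35 db2148 mysqld[3190]: [ERROR] InnoDB: Record in index "
    "`pl_namespace` of table `zhwiki`.`pagelinks` was not found on update",
    "Mar 11 08:26:04 db2148 mysqld[3190]: [ERROR] InnoDB: Unable to find a "
    "record to delete-mark",
]

for db, table, index in affected_tables(sample):
    print(f"{db}.{table} (index {index})")
```

Messages that do not name a table, such as "Unable to find a record to delete-mark", are skipped, so a full mysqlcheck run is still needed to catch those.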

Let's check and rebuild the affected tables on these hosts:

  • db1134
  • db1150:3315
  • db1150:3314
  • db1166
  • db1168
  • db1175
  • db1146
  • db2092
  • db2145
  • db2146
  • db2116
  • db2102 (core test)
  • db2148
  • db2108
  • db2120
  • db2150
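
For a single table, the check and the rebuild could be expressed as the statements generated below. This is a sketch under assumptions: the task does not say which statement was used for the rebuilds, so ALTER TABLE ... ENGINE=InnoDB (which rewrites the table and all of its indexes) is used here as one common option, and the helper names are hypothetical.

```python
def check_sql(db, table):
    """Statement that verifies the table and its indexes."""
    return f"CHECK TABLE `{db}`.`{table}`;"


def rebuild_sql(db, table):
    """Statement that rebuilds the table in place.

    Assumption: a null ALTER is an acceptable way to rebuild here;
    the task itself does not name the statement that was used.
    """
    return f"ALTER TABLE `{db}`.`{table}` ENGINE=InnoDB;"


# zhwiki.pagelinks is the table named in the db2148 log excerpt.
for db, table in [("zhwiki", "pagelinks")]:
    print(check_sql(db, table))
    print(rebuild_sql(db, table))
```

Both statements are meant for a depooled host, as was done for each host in this task.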

Event Timeline

Marostegui triaged this task as Medium priority. Mar 8 2021, 6:19 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2021-03-08T06:23:51Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1166 T276742', diff saved to https://phabricator.wikimedia.org/P14649 and previous config saved to /var/cache/conftool/dbconfig/20210308-062350-marostegui.json

Marostegui renamed this task from Check all tables on some hosts to Check for errors on all tables on some hosts. Mar 8 2021, 6:29 AM
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-03-08T06:37:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1168 T276742', diff saved to https://phabricator.wikimedia.org/P14651 and previous config saved to /var/cache/conftool/dbconfig/20210308-063700-marostegui.json

On-going checks and fixes:

  • db1166
  • db1168
  • db2102

db2102 is getting some of its tables fixed (rebuilt)

Mentioned in SAL (#wikimedia-operations) [2021-03-09T05:16:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1175 for table check T276742', diff saved to https://phabricator.wikimedia.org/P14675 and previous config saved to /var/cache/conftool/dbconfig/20210309-051646-marostegui.json

db1175 and db2102 checked, now rebuilding some tables.

Mentioned in SAL (#wikimedia-operations) [2021-03-10T08:11:25Z] <marostegui> Check tables on db1150:3315 - T276742

Marostegui updated the task description.

All checked and cleaned

Marostegui reopened this task as Open. Mar 11 2021, 9:44 AM

Checking two more hosts: db2108 and db2148

Mentioned in SAL (#wikimedia-operations) [2021-03-12T06:50:08Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1146:3314 for table checking T276742', diff saved to https://phabricator.wikimedia.org/P14807 and previous config saved to /var/cache/conftool/dbconfig/20210312-065008-marostegui.json

Starting a table check on db1146:3314 after seeing some errors.

Change 671033 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2148: Disable notifications

https://gerrit.wikimedia.org/r/671033

Mentioned in SAL (#wikimedia-operations) [2021-03-12T07:02:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2148 T276742', diff saved to https://phabricator.wikimedia.org/P14809 and previous config saved to /var/cache/conftool/dbconfig/20210312-070219-marostegui.json

Change 671033 merged by Marostegui:
[operations/puppet@production] db2148: Disable notifications

https://gerrit.wikimedia.org/r/671033

Mentioned in SAL (#wikimedia-operations) [2021-03-12T07:16:28Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2108 T276742', diff saved to https://phabricator.wikimedia.org/P14811 and previous config saved to /var/cache/conftool/dbconfig/20210312-071628-marostegui.json

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-03-16T08:47:00Z] <marostegui> Check tables on db2150 db2120 T276742

@jcrespo I have noticed this on db2100 (10.1 backup source):

Apr 27 13:28:37 db2100 mysqld[883]: InnoDB: tried to purge sec index entry not marked for deletion in
Apr 27 13:28:37 db2100 mysqld[883]: InnoDB: index "wl_user_notificationtimestamp" of table "metawiki"."watchlist"
Apr 27 13:28:37 db2100 mysqld[883]: InnoDB: tuple DATA TUPLE: 3 fields;
Apr 27 13:28:37 db2100 mysqld[883]:  0: len 4; hex 00e79262; asc    b;;
Apr 27 13:28:37 db2100 mysqld[883]:  1: len 14; hex 3230323130343235303830343533; asc 20210425080453;;
Apr 27 13:28:37 db2100 mysqld[883]:  2: len 4; hex 02361091; asc  6  ;;
Apr 27 13:28:37 db2100 mysqld[883]: InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
Apr 27 13:28:37 db2100 mysqld[883]:  0: len 4; hex 00e79262; asc    b;;
Apr 27 13:28:37 db2100 mysqld[883]:  1: len 14; hex 3230323130343235303830343533; asc 20210425080453;;
Apr 27 13:28:37 db2100 mysqld[883]:  2: len 4; hex 02361091; asc  6  ;;
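
The hex dumps in these tuples can be decoded to see which row an entry belongs to. A small sketch, assuming the field is a string column stored as UTF-8 (integer fields need different handling and are not covered); the function name is made up for illustration.

```python
def decode_field(hex_str):
    """Decode the hex dump of a string field as UTF-8.

    Invalid byte sequences are replaced rather than raising, since the
    input comes from a log and may not be text at all.
    """
    return bytes.fromhex(hex_str).decode("utf-8", errors="replace")


# Field 1 of the metawiki.watchlist tuple above (the notification timestamp).
print(decode_field("3230323130343235303830343533"))  # 20210425080453
```

The same helper applies to the 12-byte field in the zhwiki.pagelinks tuple from db2148, which decodes to a four-character page title.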

Probably worth checking all the tables and rebuilding the ones with errors, or directly rebuilding this host from a logical dump.

Thanks for the heads up. The fact that this is recent and that the dumps showed no error or warning message makes me think it is not a fatal error and that the data was kept intact. I will make sure to rebuild it for 10.4 and, at the very least, repair and check the current 10.1 db.

It looks like a single extra entry in the index; forcing a table rebuild:

db2100[(none)]> check table metawiki.watchlist;
+--------------------+-------+----------+----------------------------------------------------------------------------------------------+
| Table              | Op    | Msg_type | Msg_text                                                                                     |
+--------------------+-------+----------+----------------------------------------------------------------------------------------------+
| metawiki.watchlist | check | Warning  | InnoDB: Index 'wl_user_notificationtimestamp' contains 46470882 entries, should be 46470881. |
| metawiki.watchlist | check | error    | Corrupt                                                                                      |
+--------------------+-------+----------+----------------------------------------------------------------------------------------------+

I checked all the other tables; they were good.
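
When checking many tables, the CHECK TABLE output can be summarized programmatically. The following is a rough sketch that assumes the tabular client output shown above is available as text; the function names and the entry-count pattern are assumptions taken from that one sample, not from any documented format.

```python
import re

# Assumed from the single warning quoted above.
COUNT_RE = re.compile(
    r"Index '(?P<index>[^']+)' contains (?P<actual>\d+) entries, "
    r"should be (?P<expected>\d+)"
)


def parse_check_output(text):
    """Turn the boxed CHECK TABLE output into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # border lines, prompts, blank lines
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4 or cells[0] == "Table":
            continue  # header row or unexpected shape
        rows.append(dict(zip(("table", "op", "msg_type", "msg_text"), cells)))
    return rows


def corrupt_tables(rows):
    """Tables with at least one row of type 'error'."""
    return sorted({r["table"] for r in rows if r["msg_type"].lower() == "error"})


def index_mismatches(rows):
    """(table, index, surplus) for each entry-count warning."""
    result = []
    for r in rows:
        m = COUNT_RE.search(r["msg_text"])
        if m:
            surplus = int(m["actual"]) - int(m["expected"])
            result.append((r["table"], m["index"], surplus))
    return result


output = """
| Table              | Op    | Msg_type | Msg_text |
| metawiki.watchlist | check | Warning  | InnoDB: Index 'wl_user_notificationtimestamp' contains 46470882 entries, should be 46470881. |
| metawiki.watchlist | check | error    | Corrupt |
"""
rows = parse_check_output(output)
print(corrupt_tables(rows))
print(index_mismatches(rows))
```

For the output above, the surplus is 1, which matches the single extra index entry.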