Run check table periodically on backup source hosts
Open, Medium, Public

Description

While discussing the InnoDB crashes with the MariaDB devs, they asked whether we were running CHECK TABLE periodically.
Maybe we should start running CHECK TABLE on a weekly (or similar) basis and report on the tables that are corrupted, at least on backup source hosts.

This would allow us to catch index or other corruption early.
I recloned db2125 a few days ago, and while running it across all its wikis, I noticed index corruption on some tables.
They are not necessarily present on the backup source hosts (I haven't checked); they could be the result of the upgrade, crashes or something else, but making sure we know the state of our snapshots would be good.

First manual run:

  • db1095:s2
  • db1095:s3
  • db1102:x1
  • db1116:s7
  • db1116:s8
  • db1139:s1
  • db1139:s6
  • db1140:s1
  • db1140:s6
  • db1145:s4
  • db1145:s5
  • db1150:s4
  • db1150:s5
  • db2098:s2
  • db2098:s3
  • db2097:s1
  • db2097:s6
  • db2099:s4 Index 'globalimagelinks_wiki' corrupt
  • db2099:s5
  • db2100:s7
  • db2100:s8
  • db2101:x1
  • db2139:s4
  • db2139:s5
  • db2141:s1
  • db2141:s6

Event Timeline

Implementing this should be relatively easy: just run mysqlcheck -c -A on the host. The question is how to deal with the potential replication delays.
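
For reference, a rough sketch of what a manual run could look like (the log path and the OK-filtering are illustrative, not our actual tooling):

  # -c = check only (read-only), -A = all databases on the instance.
  # Keep only the tables that report something other than OK.
  mysqlcheck -c -A 2>&1 | grep -v 'OK$' > /root/check_tables_$(date +%F).log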

That's why I suggested backup source hosts for now, as lag there isn't that big of a deal.

100% agree with this; sadly, the lag part is not yet configurable on the Icinga alerts. :-( We'll see what I can get done; for now I will just run it manually on all hosts to rule out ongoing issues.

Note that I think a table check is done on every upgrade (mysql_upgrade), so I don't expect any issues on 10.1 hosts.

But the backup source hosts have notifications disabled for lag, no?

Manually. They still show up on ongoing issues.

The error I got on s1 so far:

enwiki.profiling
note     : The storage engine for the table doesn't support check

Is s2 completed? It will be interesting to see if there was any corruption there, as I found some on the host I cloned (and upgraded) from it.

I was going to do s6 next, but I can prioritize s2 right now. Note it takes several hours (half a day) to do a full table check, longer than a full dump, probably due to our countless indexes.

No rush, it was more curiosity than anything else. s2 will take many hours too; it took me around 6h yesterday, I think.

@Marostegui s2 on codfw gave no errors. This is what I expected: given we had no issues in the past with 10.1, I think it has to be the combination of corruption and the upgrade to 10.4. We could test moving s2 to 10.4 on a test host and then run the check? Or we could make it standard after every upgrade (a full CHECK TABLE run, not only the one done for the upgrade, even if it takes longer).

Meanwhile, I will continue with full coverage of all sections.

We should drop the profiling table from source backup hosts before setting up the regular checking to prevent extra log spam.
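
Something like the following per wiki database should do (the profiling table name comes from the s1 output above; whether it exists on every wiki is an assumption):

  # Drop the profiling table (its engine doesn't support CHECK, per the s1 output above).
  # Repeat for each wiki database that has it.
  mysql enwiki -e "DROP TABLE IF EXISTS profiling;"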

We finally have a positive:

Warning  : InnoDB: Index 'globalimagelinks_wiki' contains 547521581 entries, should be 547521580.
error    : Corrupt

Sadly it also breaks replication:

Error 'Index globalimagelinks is corrupted' on query. Default database: 'commonswiki'. Query: 'DELETE /* GlobalUsage::deleteLinksFromPage  */

I will handle this tomorrow, it is very late today. CC @Kormat @Marostegui

Unfortunately, that is expected. A table rebuild (ALTER TABLE ... ENGINE=InnoDB, FORCE) fixes it in most cases, and then replication should flow again.
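
For the record, the rebuild is a null ALTER that forces a full table copy; a minimal sketch using the table from the error above:

  # Rebuilds the table and all of its indexes from scratch; on a table this size
  # it takes a long time and will add replication lag.
  mysql commonswiki -e "ALTER TABLE globalimagelinks ENGINE=InnoDB, FORCE;"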

I am dropping and then recreating the index in separate transactions, in the hope that it will be a bit faster than rebuilding the full table. I will run a CHECK TABLE at the end to confirm that fixes it.
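
A sketch of that approach; note the column list for globalimagelinks_wiki below is an assumption, the real definition should come from SHOW CREATE TABLE:

  # Drop and recreate only the corrupt secondary index, then verify.
  # (gil_wiki, gil_page) is assumed - confirm against SHOW CREATE TABLE globalimagelinks.
  mysql commonswiki -e "ALTER TABLE globalimagelinks DROP INDEX globalimagelinks_wiki;"
  mysql commonswiki -e "ALTER TABLE globalimagelinks ADD INDEX globalimagelinks_wiki (gil_wiki, gil_page);"
  mysql commonswiki -e "CHECK TABLE globalimagelinks;"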

After running mysqlcheck in a closely supervised way on almost all hosts, I can say this is not as easy as "just set up a cron and run it every week". Running CHECK TABLE on all tables can take up to 24 hours per host, and it has a big impact. We don't have the right monitoring configuration to handle this, and it makes backups fail frequently if both run concurrently (at least 3 snapshots failed because of ongoing checks).

I am not saying we shouldn't do this, but it is going to be harder to implement than even T104459, and a large project to get right, even if restricted to backup source hosts, due to its impact on lag and on backup generation.

We should probably schedule a long-term project/goal to implement this AND the data and schema checking.
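
If/when this becomes periodic, it will probably need to be less a plain cron and more a wrapper that refuses to start while a snapshot or dump is running; a purely hypothetical sketch (process names and schedule are assumptions, not our actual backup tooling):

  # Hypothetical weekly /etc/cron.d entry: skip the run if a backup process is active,
  # and only keep output for tables that are not OK.
  0 4 * * 0 root pgrep -f 'mariabackup|mydumper' >/dev/null || mysqlcheck -c -A 2>&1 | grep -v 'OK$' | logger -t check_tables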

Note that I never said it was an easy task - I suggested running it on a weekly basis (rather than daily, yearly, or never, which is roughly what we were doing).
I think we both underestimated how difficult this can be given our current infrastructure complexity (T265866#6560115).

I think once it has run on all the backup sources, that's a good start, as we will then know their state: either good, or fixed if they were bad. That is already a good starting point for me.
Also, we can include these table checks in the action items we follow when a slave crashes, same as we do with compare.py: run a CHECK TABLE pass to make sure the host is fully OK, and if not, reclone it.

The first run has been done on all hosts; all are clean now as far as mysqlcheck / CHECK TABLE is concerned (only commonswiki on db2099 had a bad index, now fixed and rechecked).

Now the hard part is left: making this periodic.