Run check table periodically on backup source hosts
Open, Medium, Public

Description

While discussing the InnoDB crashes with the MariaDB devs, they asked whether we were running CHECK TABLE periodically.
Maybe we should start running CHECK TABLE on a weekly (or similar) basis and report on the tables that are corrupted, at least on backup source hosts.

This would allow us to catch index or other corruption early.
I recloned db2125 a few days ago, and while running it across all its wikis, I noticed index corruption on some tables.
They are not necessarily present on the backup source hosts (I haven't checked); they could be the result of the upgrade, crashes or something else, but making sure we know the state of our snapshots would be good.

First manual run:

  • db1095:s2
  • db1095:s3
  • db1102:x1
  • db1116:s7
  • db1116:s8
  • db1139:s1
  • db1139:s6
  • db1140:s1
  • db1140:s6
  • db1145:s4
  • db1145:s5
  • db1150:s4
  • db1150:s5
  • db2098:s2
  • db2098:s3
  • db2097:s1
  • db2097:s6
  • db2099:s4 Index 'globalimagelinks_wiki' corrupt
  • db2099:s5
  • db2100:s7
  • db2100:s8
  • db2101:x1
  • db2139:s4
  • db2139:s5
  • db2141:s1
  • db2141:s6

Event Timeline

Implementing this should be relatively easy: just run mysqlcheck -c -A on the host. The question is how to deal with the potential replication delays.
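
For reference, a rough sketch of what a manual run could look like (the log path and the OK-filtering are illustrative, not our actual tooling):

  # -c = check only (read-only), -A = all databases on the instance.
  # Keep only the tables that report something other than OK.
  mysqlcheck -c -A 2>&1 | grep -v 'OK$' > /root/check_tables_$(date +%F).log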

That's why I suggested backup source hosts for now, as lag there isn't that big of a deal.

100% agree with this; sadly, the lag part is not yet configurable on the Icinga alerts. :-( We'll see what I can get done; for now I will just run it manually on all hosts to rule out ongoing issues.

Note that I think a table check is done on every upgrade (mysql_upgrade), so I don't expect any issues on 10.1 hosts.

But the backup source hosts have notifications disabled for lag, no?

Manually. They still show up on ongoing issues.

The error I got on s1 so far:

enwiki.profiling
note     : The storage engine for the table doesn't support check

Is s2 completed? It will be interesting to see if there was any corruption there, as I found some on the host I cloned (and upgraded) from it.

I was going to do s6 next, but I can prioritize s2 right now. Note it takes several hours (half a day) to do a full table check, longer than a full dump, probably due to our countless indexes.

No rush, it was more curiosity than anything else. s2 will take many hours too; it took me around 6h yesterday, I think.

@Marostegui s2 on codfw gave no errors. This is what I expected: given we had no issues in the past with 10.1, I think it has to be the combination of corruption and the upgrade to 10.4. We could test moving s2 to 10.4 on a test host and then run the check? Or we could make it standard after every upgrade (a full CHECK TABLE run, not only the one done for the upgrade, even if it takes longer).

Meanwhile, I will continue with full coverage of all sections.

We should drop the profiling table from source backup hosts before setting up the regular checking to prevent extra log spam.
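
Something like the following per wiki database should do (the profiling table name comes from the s1 output above; whether it exists on every wiki is an assumption):

  # Drop the profiling table (its engine doesn't support CHECK, per the s1 output above).
  # Repeat for each wiki database that has it.
  mysql enwiki -e "DROP TABLE IF EXISTS profiling;"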

We finally have a positive:

Warning  : InnoDB: Index 'globalimagelinks_wiki' contains 547521581 entries, should be 547521580.
error    : Corrupt

Sadly it also breaks replication:

Error 'Index globalimagelinks is corrupted' on query. Default database: 'commonswiki'. Query: 'DELETE /* GlobalUsage::deleteLinksFromPage  */

I will handle this tomorrow, it is very late today. CC @Kormat @Marostegui

Unfortunately, that is expected. A table rebuild (ALTER TABLE ... ENGINE=InnoDB, FORCE) fixes it in most cases, and then replication should flow again.
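
For the record, the rebuild is a null ALTER that forces a full table copy; a minimal sketch using the table from the error above:

  # Rebuilds the table and all of its indexes from scratch; on a table this size
  # it takes a long time and will add replication lag.
  mysql commonswiki -e "ALTER TABLE globalimagelinks ENGINE=InnoDB, FORCE;"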

I am dropping and then recreating the index in separate transactions, in the hope that it will be a bit faster than rebuilding the full table. I will run a CHECK TABLE at the end to confirm that fixes it.
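
A sketch of that approach; note the column list for globalimagelinks_wiki below is an assumption, the real definition should come from SHOW CREATE TABLE:

  # Drop and recreate only the corrupt secondary index, then verify.
  # (gil_wiki, gil_page) is assumed - confirm against SHOW CREATE TABLE globalimagelinks.
  mysql commonswiki -e "ALTER TABLE globalimagelinks DROP INDEX globalimagelinks_wiki;"
  mysql commonswiki -e "ALTER TABLE globalimagelinks ADD INDEX globalimagelinks_wiki (gil_wiki, gil_page);"
  mysql commonswiki -e "CHECK TABLE globalimagelinks;"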

After running mysqlcheck in a closely supervised way on almost all hosts, I can say this is not as easy as "just set up a cron and run it every week". Running CHECK TABLE on all tables can take up to 24 hours per host, and it has a big impact. We don't have the right monitoring configuration to handle this, and it makes backups fail frequently if both run concurrently (at least 3 snapshots failed because of ongoing checks).

I am not saying we shouldn't do this, but it is going to be harder to implement than even T104459, and a large project to get right, even if restricted to backup source hosts, due to its impact on lag and on backup generation.

We should probably schedule a long-term project/goal to implement this AND the data and schema checking.
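
If/when this becomes periodic, it will probably need to be less a plain cron and more a wrapper that refuses to start while a snapshot or dump is running; a purely hypothetical sketch (process names and schedule are assumptions, not our actual backup tooling):

  # Hypothetical weekly /etc/cron.d entry: skip the run if a backup process is active,
  # and only keep output for tables that are not OK.
  0 4 * * 0 root pgrep -f 'mariabackup|mydumper' >/dev/null || mysqlcheck -c -A 2>&1 | grep -v 'OK$' | logger -t check_tables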

Note that I never said it was an easy task - I suggested running it on a weekly basis (rather than daily, yearly, or never, which is roughly what we were doing).
I think we both underestimated how difficult this can be given our current infrastructure complexity (T265866#6560115).

I think once it has run on all the backup sources, that's a good start, as we will then know their state: either good, or fixed if they were bad. That is already a good starting point for me.
Also, we can include these table checks in the action items we follow when a slave crashes, same as we do with compare.py: run a CHECK TABLE pass to make sure the host is fully OK, and if not, reclone it.

The first run has been done on all hosts; all are clean now as far as mysqlcheck / CHECK TABLE is concerned (only commonswiki on db2099 had a bad index, now fixed and rechecked).

Now the hard part is left: making this periodic.