Thu, Sep 7
db1100 has been cloned from db1049 and migrated to file per table.
It is now catching up. Once it has caught up I will create the decommission task for db1049, but I won't act on it as I will be gone for two weeks; I will just create it as a reminder that the host is ready to be decommissioned.
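For context, a minimal sketch of what a file-per-table migration involves (the exact procedure used for db1100 may have differed; the database and table names are placeholders):

# Sketch only: enable file-per-table, then rebuild each table so it gets its own .ibd file
mysql -e "SET GLOBAL innodb_file_per_table = 1;"
# Rebuilding a table moves it out of the shared ibdata tablespace into its own file
# ($wiki and pagelinks are placeholders; every table needs rebuilding)
mysql "$wiki" -e "ALTER TABLE pagelinks ENGINE=InnoDB;"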
This is now all done
@Papaul please change the disk whenever you can.
Wed, Sep 6
Is this meant to be a database schema change? https://wikitech.wikimedia.org/wiki/Schema_changes
If it is: can you please follow this template for the task, so DBAs can better understand it and get it resolved faster and without any issues?: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change
Truncation is still happening on s3. I have throttled it quite a bit, because db1095 (sanitarium) was struggling to replicate without delay.
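A hedged sketch of how such a throttled run could look, assuming it is being done one wiki at a time (the wiki list, table name and sleep interval are placeholders; the real procedure may differ):

# Illustrative only: truncate the table on one s3 wiki at a time, pausing between wikis
# so db1095 (sanitarium) can keep replicating without delay
for wiki in $(cat s3_wikis.txt); do                    # hypothetical list of s3 wikis
    mysql "$wiki" -e "TRUNCATE TABLE some_table;"      # table name is a placeholder
    sleep 60
done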
Dropped from all the shards except s3, which is now being done slowly.
I will migrate db1100 to file per table once the copy is done. db1049 was not using it.
This is all good now, thanks a lot Chris!
Tue, Sep 5
Thanks for the initial list!
This is the complete list of wikis where it exists:
This table exists on:
Do you think we need to back them up (in case they are not empty)?
I was thinking that maybe we can reclone db1092 (for example) using db1049's data (the old master), so we can just decommission it...
root@db1083:/srv/sqldata/enwiki# ls -lh pagelinks.ibd
-rw-rw---- 1 mysql mysql 146G Sep 5 15:29 pagelinks.ibd
devwikiinternal and rel13testwiki, and the last write happened in 2015.
I have taken a mysqldump from those two databases and placed them at:
root@dbstore1001:/srv/tmp/T118764# pwd
/srv/tmp/T118764
root@dbstore1001:/srv/tmp/T118764# ls -lh
total 420K
-rw-r--r-- 1 root root 208K Sep 5 15:07 devwikiinternal.sql
-rw-r--r-- 1 root root 212K Sep 5 15:07 rel13testwiki.sql
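For the record, the dumps would have been produced with something along these lines (the exact mysqldump options used are an assumption):

# Illustrative invocations only; the actual flags used may have differed
mysqldump devwikiinternal > /srv/tmp/T118764/devwikiinternal.sql
mysqldump rel13testwiki   > /srv/tmp/T118764/rel13testwiki.sql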
Actually, we could also optimize the table without depooling, as it is an INPLACE operation (and I remember optimizing a big table not long ago without any lag): https://dev.mysql.com/doc/refman/5.6/en/innodb-create-index-overview.html
We can double-check anyway.
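A minimal sketch of the in-place rebuild mentioned above (the database and table names are illustrative; a table rebuild is an online operation on MySQL 5.6/MariaDB 10, so the host would not need to be depooled):

# Sketch only: online defragmentation/rebuild; verify ALGORITHM/LOCK support before running
mysql enwiki -e "ALTER TABLE pagelinks ENGINE=InnoDB, ALGORITHM=INPLACE, LOCK=NONE;"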
Tables dropped from s5, s6 and s7.
Tables dropped from s3
Tables dropped from s2
Sounds good to me. I will finish db1083 first, as the defragmentation is already started on pagelinks.
This is templatelinks after the defragmentation:
root@db1083:/srv/sqldata/enwiki# ls -lh templatelinks.ibd
-rw-rw---- 1 mysql mysql 85G Sep 5 11:33 templatelinks.ibd
The ALTER itself takes seconds to run (less than 10 seconds on an SSD host).
root@PRODUCTION s1 slave[enwiki]> show create table pagelinks\G show create table templatelinks\G
*************************** 1. row ***************************
       Table: pagelinks
Create Table: CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
  KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
1 row in set (0.00 sec)
I just realised that db1089 (enwiki) does not have those indexes.
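If they really are missing on db1089, a hedged sketch of recreating one of them online, based on the definition shown above (whether templatelinks needs the same treatment would have to be checked as well):

# Sketch only: add the missing index as an online operation on the affected slave
mysql enwiki -e "ALTER TABLE pagelinks
  ADD KEY pl_backlinks_namespace (pl_from_namespace, pl_namespace, pl_title, pl_from),
  ALGORITHM=INPLACE, LOCK=NONE;"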
Mon, Sep 4
Closing this as this has not happened again in months.
If it happens again, let's reopen and follow up
If no one disagrees, I am going to close this, as we are already using pt-online-schema-change and, when available, the INPLACE operations supported by MariaDB.
As Jaime pointed out, the most popular tool out there now is gh-ost (along with pt-online-schema-change).
We could give gh-ost a go, but so far we don't have much bandwidth for it; we can track it in a specific task for all the related tests.
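For context, a pt-online-schema-change run looks roughly like this (the database, table and --alter argument are placeholders):

# Illustrative pt-online-schema-change invocation: it copies the table in chunks and
# swaps it in at the end, throttling itself based on replica lag
pt-online-schema-change --alter "ADD COLUMN new_col INT" \
  D=enwiki,t=pagelinks --execute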
MySQL is stopped.
This host is now ready for the remaining DC Ops steps to be completed
This server is ready to be fully decommissioned and is only pending the DC Ops steps, so I am assigning it to @Cmjohnson for those.
@MarkTraceur I am going to re-assign this task back to you to see if you can arrange a read-only time for commonswiki.
I will be on holiday from the 7th to the 25th, so any time after the 25th of Sept (and probably best to avoid a Friday) would be good for me.
The new 3D enum value was missing on the newly created wiki kbpwiki on s3, probably because the patch was merged after the wiki was created.
The rest of the wikis look good across all the shards.
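The fix for kbpwiki would be the same enum change applied everywhere else; a hedged sketch, assuming the column involved is image.img_media_type and that the existing value list matches current MediaWiki tables.sql:

# Sketch only: extend the media type enum with the new '3D' value
# (table/column names and the existing value list are assumptions)
mysql kbpwiki -e "ALTER TABLE image
  MODIFY img_media_type ENUM('UNKNOWN','BITMAP','DRAWING','AUDIO','VIDEO','MULTIMEDIA',
                             'OFFICE','TEXT','EXECUTABLE','ARCHIVE','3D') DEFAULT NULL;"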
Just for the record, I am checking again that the alter has been done across all the hosts and all the wikis, and the only pending one would be commonswiki on the s4 master - db1068.
I have done the alter table on the enwiki master (db1052) and it has gone fine. I pre-warmed the big tables (image and filearchive) beforehand and it went through relatively quickly.
Before running the alter I checked the tables' last modification times and they indeed had low traffic at this time. While the alter was running I didn't see any connections waiting or errors happening, so it looks like it didn't impact anyone.
The image table took around 1 minute to be altered, and filearchive around 3 minutes.
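The pre-warming can be as simple as reading the tables once so their pages are already in the buffer pool when the ALTER rebuilds them; a minimal sketch (the exact queries used are an assumption):

# Illustrative warm-up before the ALTER, so the rebuild doesn't pull cold pages from disk
mysql enwiki -e "SELECT count(*) FROM image FORCE INDEX(PRIMARY);
                 SELECT count(*) FROM filearchive FORCE INDEX(PRIMARY);"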
Sat, Sep 2
@Cmjohnson please go ahead and replace the disk when you can
Fri, Sep 1
I have renamed the table on enwiki, on db1089, just to make sure nothing breaks.
root@db1089[enwiki]> show tables like 'T17%';
+-------------------------+
| Tables_in_enwiki (T17%) |
+-------------------------+
| T174782_pr_index        |
+-------------------------+
1 row in set (0.00 sec)
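A sketch of the rename itself (the original table name is an assumption based on the result shown above):

# Illustrative rename: prefix the table with the task ID instead of dropping it outright,
# so it can be restored quickly if something does break
mysql enwiki -e "RENAME TABLE pr_index TO T174782_pr_index;"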
These empty tables currently exist on the following wikis, and should be dropped from there:
db1026 is now ready to be decommissioned and all the pending steps are DC Ops ones, so I am handing this over to @Cmjohnson
All s4 hosts have been upgraded to 10.0.32.
Obviously not the master (see below)
After rebooting the server again, everything looks good and I see no more HW errors.
I have started mysql and replication and everything is looking ok.
Yeah - I thought about creating it on enwikisource first to unblock this ticket, and then slowly create it on the big wikis and so forth.
I have rebooted the server because it was basically unresponsive, and it has apparently come back fine. I will do some more checks before starting MySQL.
From the logs:
/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=09/01/2017
    time=06:18
    description=Drive Array Controller Failure (Slot 0)