TZ: UTC +1/+2
Update from Jaime, 18th Oct 16:05: all s8 core hosts have now been fixed (labs still pending)
If it fails again, I suggest we go for a DC failover.
So I guess it should go on Phase 0 before #4, somewhere between #3 and #4?
Or maybe even before #2 - if it is not connected, maybe it is better to abort before wasting time on the next steps?
Leave it open until it finally gets rebuilt. They fail quite often unfortunately, especially on old hosts, and they need Papaul or Chris to pull the disk out and then back in.
I have created T207385 so we can follow the discussion there.
@Volans ^ is that something we can do on the dc switchover script?
Is this alert fully deployed?
For the record, this table only has 4 rows, so it can probably be done directly on the master with replication (once we are out of the woods with s8)
Update from Jaime on the 17th at 19:04:
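As a rough sketch only (the table name and values below are hypothetical, not the actual statement used): with so few rows, the fix can be applied once on the master with the binary log left on, so it replicates everywhere, instead of the usual per-host fixes done with sql_log_bin disabled.

-- Hypothetical example: run once on the s8 master and let it replicate.
-- (Per-host fixes would instead start with: SET SESSION sql_log_bin = 0;)
REPLACE INTO small_table (id, value) VALUES (1, 'fixed');  -- hypothetical table and values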
all tables except wb_terms, which is half done, should be equal on the s8 master
Wed, Oct 17
Where is the compiler run link?
This needs some thought in order to make it effective:
Let's close it and if something breaks we can reopen.
Update from Jaime at 12:10 UTC
All tables on s8 master fixed except pagelinks and wb_terms
So user_id,user_ip as PK would require changes to MW?
@Krinkle @Reedy could you give this some priority so we can get a PK in place for user_newtalk? Right now we are dealing with recoveries on T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared"), and not having a PK here slows down the recoveries on that table a bit, so that is another reason to have a PK for it :-)
Keep in mind that the disk space usage stays roughly constant on the pc hosts: https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&orgId=1&var-server=pc1005&var-network=eth0&from=1532045819383&to=1535494432566
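For reference, a minimal sketch of the kind of DDL being asked for, assuming (user_id, user_ip) is unique on user_newtalk (the exact schema change would of course need to be agreed on the MW side first):

-- Sketch only, assuming (user_id, user_ip) uniquely identifies a row:
ALTER TABLE user_newtalk
  ADD PRIMARY KEY (user_id, user_ip);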
If it grows then that is normally an indication of an issue.
Update from yesterday at around 14:00 UTC
@jcrespo has done an amazing job of manually checking and fixing most of the tables on db1087 (which is the labs master, and is not as easy to reclone as the others). He's gone through all the tables to check and fix them.
Right now, all the tables apart from pagelinks and wb_terms are supposed to be fixed (although replication is broken on db1124 (sanitarium master) for wb_items_per_site, which is kind of expected with such an amount of rows to check - we are looking at it).
I have added it to tendril.
As it happened before, this recovered itself - closing for now:
04:26 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
Tue, Oct 16
I am going to decline this, as there are no plans to use ANALYZE as a substitute for EXPLAIN. Instead, SHOW EXPLAIN, maybe in combination with https://tools.wmflabs.org/tools-info/optimizer.py, should be used. T141095#2895716
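For reference, a small sketch of how SHOW EXPLAIN works in MariaDB: it returns the plan of a statement already running in another connection, identified by its thread id from SHOW PROCESSLIST.

-- Find the connection running the query of interest, then ask for its plan.
SHOW PROCESSLIST;          -- note the Id of the long-running query, e.g. 12345
SHOW EXPLAIN FOR 12345;    -- the thread id here is just an example value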
If anyone feels this needs to be re-opened, please do so.
I've ack'ed the alerts for 3 hours on db1107 and db1108.
You can probably disconnect pc1005 and pc1006 today already, but keep running the truncation without binlog, just in case.
Considering we are at around 70% at eqiad, and should remain like that, I don't think it is worth the hassle and the risk of depooling hosts at this point. So probably just truncating codfw tables and leaving replication codfw -> eqiad disconnected is the way to go here.
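A minimal sketch of truncating without writing to the binlog (the table name below is only illustrative):

-- Skip the binary log for this session so the TRUNCATE does not replicate.
SET SESSION sql_log_bin = 0;
TRUNCATE TABLE pc000;          -- illustrative name; repeat for each parsercache table
SET SESSION sql_log_bin = 1;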
Pick some of the biggest tables and try to alter them to see how much you get and then we can probably extrapolate.
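One possible way to pick the candidates (a sketch using the standard information_schema views):

-- Biggest tables by on-disk size in the current database, to choose what to test-alter.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY (data_length + index_length) DESC
LIMIT 10;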
If it is not worth running on eqiad, that is good to know. I would suggest codfw should still be truncated after your tests.
Make sure you enable notifications first and update zarcillo DB before repooling.
what's pending here?
Incident report (please feel free to add or modify whatever you feel needs some changes!): https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover
Mon, Oct 15
Yeah, S3 has like 90 wikis!
Feel free to resolve this ticket if this is fully done from your side!
It also needs to run for the S3 shard.
Battery replaced by Chris - thank you!:
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
So, as Jaime said at T206743#4666169, all the pooled replicas in eqiad now have the content from codfw, so that is now consistent.
What is pending is db1087 (depooled), which is the master for db1124 (labsdb master) and labsdb, and of course the master, db1071, which doesn't receive reads.
So labsdb still has inconsistencies.
All the pooled replicas now have compressed tables - can you confirm from your end if this is good?
Now that I see it, we are even more than 50% done, as we have also done lots of sections in both DCs already, so only 4 sections in eqiad are pending. So I don't think we have delayed this that much, considering the ticket was ready for us on the 13th Sept.
Also, as per T206740#4659202, let's create a separate task for the replication check addition, so we can just focus on the immediate triage on this task, I would say.
Another test with db1104 before and after compressing:
firstname.lastname@example.org[wikidatawiki]> show create table wb_changes_dispatch\G
*************************** 1. row ***************************
       Table: wb_changes_dispatch
Create Table: CREATE TABLE `wb_changes_dispatch` (
  `chd_site` varbinary(32) NOT NULL,
  `chd_db` varbinary(32) NOT NULL,
  `chd_seen` int(11) NOT NULL DEFAULT '0',
  `chd_touched` varbinary(14) NOT NULL DEFAULT '00000000000000',
  `chd_lock` varbinary(64) DEFAULT NULL,
  `chd_disabled` tinyint(3) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`chd_site`),
  KEY `wb_changes_dispatch_chd_seen` (`chd_seen`),
  KEY `wb_changes_dispatch_chd_touched` (`chd_touched`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
1 row in set (0.00 sec)
@hoo after db1109 has been recloned (and now has compressed tables):
email@example.com[wikidatawiki]> SELECT COUNT(*) FROM wb_changes_dispatch;
+----------+
| COUNT(*) |
+----------+
|      577 |
+----------+
1 row in set (0.00 sec)
@Banyek please double check the key purge has finished on mwmaint1002 and keep going with the rest of the pending things to do here.
Probably after re-enabling the normal config options you might want to check if it is worth doing the table rebuild by doing it on codfw (make sure not to replicate the changes) and checking the before/after space, so we can evaluate if it is worth the time or not (as we are in good shape now). See the sketch below.
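A minimal sketch of such a test rebuild, assuming it is run directly on the codfw host with the binlog disabled for the session so nothing replicates (table name illustrative):

-- Size before.
SELECT data_length + index_length AS bytes_before
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'pc000';   -- illustrative name

-- Null-rebuild the table locally only.
SET SESSION sql_log_bin = 0;
ALTER TABLE pc000 ENGINE=InnoDB;
SET SESSION sql_log_bin = 1;

-- Size after, to see how much space was reclaimed.
SELECT data_length + index_length AS bytes_after
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'pc000';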
This is no longer about dbstore2002 but about db2042, so let's follow on that task: T202051
dbstore2002 is good for now, so let's close this and re-open if necessary:
Sun, Oct 14
eqiad db API hosts return the query in less than a second as they have the correct schema:
firstname.lastname@example.org[wikidatawiki]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1104     |
+------------+
1 row in set (0.00 sec)
Sat, Oct 13
Fri, Oct 12
root@neodymium:/home/marostegui# ./section s8 | while read host port; do echo $host; mysql.py -h$host:$port wikidatawiki -BN -e "select page_id, page_title, page_latest from page where page_id = 99480;"; done
labsdb1011.eqiad.wmnet
99480  Q97215  745463455
labsdb1010.eqiad.wmnet
99480  Q97215  745463455
labsdb1009.eqiad.wmnet
99480  Q97215  745463455
dbstore2001.codfw.wmnet
99480  Q97215  745463455
dbstore1002.eqiad.wmnet
99480  Q97215  745463455
db2094.codfw.wmnet
99480  Q97215  745463455
db2086.codfw.wmnet
99480  Q97215  745463455
db2085.codfw.wmnet
99480  Q97215  745463455
db2083.codfw.wmnet
99480  Q97215  745463455
db2082.codfw.wmnet
99480  Q97215  745463455
db2081.codfw.wmnet
99480  Q97215  745463455
db2080.codfw.wmnet
99480  Q97215  745463455
db2079.codfw.wmnet
99480  Q97215  745463455
db2045.codfw.wmnet
99480  Q97215  745463455
db1124.eqiad.wmnet
99480  Q97215  745463455
db1116.eqiad.wmnet
99480  Q97215  745463455
db1109.eqiad.wmnet
99480  Q97215  745463455
db1104.eqiad.wmnet
99480  Q97215  745463455
db1101.eqiad.wmnet
99480  Q97215  745463455
db1099.eqiad.wmnet
99480  Q97215  745463455
db1092.eqiad.wmnet
99480  Q97215  745463455
db1087.eqiad.wmnet
99480  Q97215  745463455
db1071.eqiad.wmnet
99480  Q97215  745463455
We are also fully restoring the eqiad hosts from codfw, which has all the good data.
The fix done on Friday was a quick fix to get all the data in there.
Thu, Oct 11
We have had unexpected fires that might last a while, so this might be delayed beyond next week. We will do whatever we can, but I cannot promise this will get done next week or the week after.
This is no longer "unbreak now"
pc1004
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1004--vg-srv xfs   2.2T  1.6T  644G  71% /srv