I have provisioned dbproxy2001 into m1 codfw - with notifications disabled as it is not an active proxy (or even service)
Jul 22 2019
So, for the table drop, this is what I will do:
Drop the table from enwiki on a codfw replica (passive DC) to make sure there are no writes (if there are, replication will break and we'll know).
I will leave it for a few days and, if nothing breaks, I will rename the table on an eqiad (active DC) enwiki replica and monitor the error log to make sure nothing reads from it.
If there are also no issues, I will go ahead and start dropping it everywhere.
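The staged plan above can be sketched as a small Python helper that only builds the statements for each stage and runs nothing. This is an illustration rather than the tooling actually used: the table name `example_table`, the `_to_drop` suffix and the `Step` structure are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    where: str       # which host(s) the statement is meant for
    statement: str   # SQL to run there
    watch_for: str   # signal that means "stop, something still uses it"


def staged_drop_plan(table: str, renamed_suffix: str = "_to_drop") -> List[Step]:
    """Build the three stages described above; nothing is executed here."""
    renamed = f"{table}{renamed_suffix}"  # hypothetical naming convention
    return [
        Step(
            where="one enwiki replica in codfw (passive DC)",
            statement=f"DROP TABLE `{table}`;",
            watch_for="replication breaks: something still writes to the table",
        ),
        Step(
            where="one enwiki replica in eqiad (active DC)",
            statement=f"RENAME TABLE `{table}` TO `{renamed}`;",
            watch_for="errors in the error log: something still reads the table",
        ),
        Step(
            where="all remaining hosts",
            # IF EXISTS keeps the statement safe on hosts handled earlier;
            # the renamed copy on the eqiad test replica needs its own drop.
            statement=f"DROP TABLE IF EXISTS `{table}`;",
            watch_for="nothing expected at this point",
        ),
    ]


for step in staged_drop_plan("example_table"):
    print(f"{step.where}: {step.statement}  [{step.watch_for}]")
```

Keeping the plan as data makes it easy to paste each stage into the ticket and to leave the waiting period between stages as a manual decision.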
As spoken, I am going to close this as the scope of the ticket is done.
I will create a new one to re-image this host with buster+10.3 and rebuild its data
Jul 19 2019
Jul 18 2019
So, from my side, this is the way I read those tags:
I have altered db2116 for now, to make sure nothing writes to that column (if something does, it will break replication there, but won't impact the users). I will leave it for a few days before altering a slave on eqiad (the active DC, where we can monitor for another few days whether something reads from it).
firstname.lastname@example.org[enwiki]> show create table abuse_filter_log\G
*************************** 1. row ***************************
       Table: abuse_filter_log
Create Table: CREATE TABLE `abuse_filter_log` (
  `afl_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `afl_filter` varbinary(64) NOT NULL DEFAULT '',
  `afl_user` bigint(20) unsigned NOT NULL DEFAULT '0',
  `afl_user_text` varbinary(255) NOT NULL DEFAULT '',
  `afl_ip` varbinary(255) NOT NULL DEFAULT '',
  `afl_action` varbinary(255) NOT NULL DEFAULT '',
  `afl_actions` varbinary(255) NOT NULL DEFAULT '',
  `afl_var_dump` blob NOT NULL,
  `afl_timestamp` varbinary(14) NOT NULL DEFAULT '',
  `afl_namespace` int(11) NOT NULL,
  `afl_title` varbinary(255) NOT NULL DEFAULT '',
  `afl_wiki` varbinary(64) DEFAULT NULL,
  `afl_deleted` tinyint(1) NOT NULL DEFAULT '0',
  `afl_patrolled_by` int(10) unsigned NOT NULL DEFAULT '0',
  `afl_rev_id` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`afl_id`),
  KEY `afl_timestamp` (`afl_timestamp`),
  KEY `afl_rev_id` (`afl_rev_id`),
  KEY `user_timestamp` (`afl_user`,`afl_user_text`,`afl_timestamp`),
  KEY `filter_timestamp` (`afl_filter`,`afl_timestamp`),
  KEY `page_timestamp` (`afl_namespace`,`afl_title`,`afl_timestamp`),
  KEY `ip_timestamp` (`afl_ip`,`afl_timestamp`),
  KEY `wiki_timestamp` (`afl_wiki`,`afl_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=24431867 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
1 row in set (0.04 sec)
This host is ready for DC-Ops to start their decommissioning steps
All good - thanks!
root@es2003:/usr/local/lib/nagios/plugins# megacli -LDPDInfo -aAll
All good - thanks @Papaul!
root@db2044:~# hpssacli controller all show config
Jul 17 2019
Thanks - I can see it rebuilding:
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding)
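For watching a rebuild like this without reading the full controller output by eye, a line of that shape can be parsed with a short Python sketch. It assumes the status is always the last comma-separated field inside the parentheses and that healthy drives report `OK`; the second sample line below is made up for illustration.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding)
DRIVE_RE = re.compile(r"physicaldrive\s+(?P<id>\S+)\s+\((?P<details>[^)]*)\)")


def parse_drives(output):
    """Return one dict per physicaldrive line found in the given text."""
    drives = []
    for match in DRIVE_RE.finditer(output):
        fields = [f.strip() for f in match.group("details").split(",")]
        drives.append({
            "id": match.group("id"),
            "status": fields[-1],    # assumption: status is the last field
            "details": fields[:-1],
        })
    return drives


def drives_not_ok(output):
    """Drives whose status is anything other than 'OK' (assumed healthy value)."""
    return [d for d in parse_drives(output) if d["status"] != "OK"]


sample = """
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding)
"""
for drive in drives_not_ok(sample):
    print(drive["id"], drive["status"])
```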
Going to close this ticket as I have created the decommission one: T228281: decommission db2045.codfw.wmnet
No point in spending time on this old host, I will start its decommissioning process.
Window reserved on the deployments page: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1832674&oldid=1832612
Email sent to ops and to wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2019-July/092308.html
We should have a bunch of disks from the decommissioned hosts, no?
Let's replace it with a USED one for now, that host will go away "soonish"
Thanks for confirming. I am removing the Blocked-on-schema-change tag as there is nothing blocked on this removal (please correct me if I am wrong). So this is part of our clean-up backlog.
What I will do is rename the column on an enwiki host and leave it for a few days to make sure nothing really uses it.
Great work, a lot fewer files to edit when provisioning/moving/decommissioning hosts, which was very error prone!
This host is ready for DC-Ops to decommission.
Glad to hear @MusikAnimal - we are trying a different approach whilst still compressing tables, which requires less depooling time. We will, however, still need to depool once it is time for the biggest wikis to be compressed (enwiki, commons, wikidata..), but hopefully for hours instead of days :)
Jul 10 2019
Jul 9 2019
We've discussed this (and the wider implication of the entire feature) in the Engineering meeting, and agreed it's time to revisit whether the product is worth the significant effort here. We tried to estimate, generally, how long things may take.
Here is our rough estimate of what it would take to be able to store pageviews in the database and sort the results by them, erring on the side of caution:
- Add an indexed integer column to the PageTriage table. Since the table isn't extremely large, it will probably not take too long, but it does require DBA review and assistance. Estimated time: a couple of weeks (with some risk of this being months)
Let's wait a few days before actually starting to decommission it.
I have disabled notifications though
This was done successfully.
root@cumin1001:/home/marostegui# mysql.py -hdb1117:3322 -e "show slave status\G"
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: db1065.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1065-bin.000251
Read_Master_Log_Pos: 437347077
Relay_Log_File: db1117-relay-bin.000002
Relay_Log_Pos: 1278
Relay_Master_Log_File: db1065-bin.000251
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 436828506
Relay_Log_Space: 520148
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 171978772
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos: 0-171970569-1006906062,171970636-171970636-23122305,171970569-171970569-156638323,171978772-171978772-139561525
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
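A status dump like this one is easier to check with a few lines of Python than by eye. The sketch below is only an illustration, not an existing tool: it assumes the output arrives with one `Field: value` pair per line, as the mysql client prints it for `\G`, and the sample is a handful of fields copied from the output above.

```python
def parse_vertical(output):
    """Parse `\\G`-style output (one 'Field: value' per line) into a dict."""
    fields = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):  # skip blanks and the row banner
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def replication_problems(status):
    """List the obvious reasons why replication is not healthy."""
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    for field in ("Last_IO_Error", "Last_SQL_Error"):
        if status.get(field):
            problems.append(f"{field}: {status[field]}")
    return problems


# A few fields copied from the output above.
SAMPLE = """
*************************** 1. row ***************************
Master_Host: db1065.eqiad.wmnet
Slave_IO_Running: Yes
Slave_SQL_Running: No
Seconds_Behind_Master: NULL
Last_IO_Error:
Last_SQL_Error:
"""

print(replication_problems(parse_vertical(SAMPLE)))
```

For the fields shown, the only finding is the stopped SQL thread; both error fields are empty.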
@Papaul and myself chatted about this and the plan is to:
- Clear logs (I just did)
- Upgrade firmware, BIOS etc
- Leave this task open for a week to see if it happens again and if not close it for now.
As per my chat with @Papaul I rebooted the host a second time and the previous error didn't show up.
Jul 8 2019
I have restarted db1109 to pick up STATEMENT as the binlog format. db1109 will be the candidate master once db1104 (current candidate master) gets promoted to master.
After this big batch of wiki compression, only 3555 tables were left to be compressed. I am now trying to compress medium-size wikis, between 20GB and 100GB (a total of 3000 tables). If this goes fine, only the bigger wikis (just one table for enwiki, ruwiki, wikidata) would be left to be compressed, so we would only need to depool for them. That would reduce the number of days that we'd need to have 1009 depooled, and the service won't be as degraded.
Will report back once this new batch is done.
Um, it has? I just found it on meta, though
email@example.com(metawiki)> select * from edit_page_tracking;
Empty set (0.00 sec)
I am going to remove the DBA tag from here as there is nothing for us to do yet.
Once this is ready to go, please follow the template to create a schema change request and we'll take care of it: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change
Jul 5 2019
In order to cause less disruption to the service I am trying a different approach with labsdb1009.
I am compressing around 50k tables from almost 700 wikis that are smaller than 10GB, without depooling the host, so the load won't be as high as we've seen on the other hosts. We'll need to depool the hosts only for the really big wikis.
Those tables are small enough that replication isn't a problem, and they are compressed so fast that metadata locking isn't an issue either (also because those wikis aren't used that much).
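The batching idea can be sketched in Python as below. This is a hedged illustration, not the script that was run: the wiki names and sizes are invented, the 10GB limit is taken from the comment above and left as a parameter, and the `ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8` options are an assumption about the target format.

```python
def split_by_size(wiki_sizes_gb, online_limit_gb=10.0):
    """Split wikis into those compressed while pooled and those needing a depool."""
    online, needs_depool = [], []
    # Smallest first, so the cheapest and least risky work happens early.
    for wiki, size in sorted(wiki_sizes_gb.items(), key=lambda item: item[1]):
        if size < online_limit_gb:
            online.append(wiki)
        else:
            needs_depool.append(wiki)
    return online, needs_depool


def compress_statement(db, table, key_block_size=8):
    """Build the ALTER for one table; the row format options are an assumption."""
    return (
        f"ALTER TABLE `{db}`.`{table}` "
        f"ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE={key_block_size};"
    )


# Hypothetical wikis and sizes, for illustration only.
sizes = {"smallwiki": 2.5, "mediumwiki": 40.0, "bigwiki": 900.0}
online, needs_depool = split_by_size(sizes)
print("compress while pooled:", online)
print("depool first:", needs_depool)
print(compress_statement("smallwiki", "example_table"))
```

Raising `online_limit_gb` moves more wikis into the first group, at the cost of longer ALTERs running while the host is still serving traffic.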