The original plan (AFAIK) is for one of the labsdb hosts to be a master for tools-db and a slave for postgres, and the other to be a master for postgres and a slave for tools-db. Right now tools-db is neither puppetized nor replicated anywhere. It should be.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Populate labsdb1004 with mariadb | operations/puppet | production | +33 -7 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | coren | T88234 Puppetize & fix tools-db |
| Resolved | | yuvipanda | T94643 Planned labs maintenance on tools-db: Puppetization + log file change |
| Resolved | | yuvipanda | T105720 Labs team reliability goal for Q1 2015/16 |
| Resolved | | Andrew | T105723 Eliminate SPOFs in Labs infrastructure |
| Resolved | | • jcrespo | T88718 Make sure tools-db is replicated somewhere |
Event Timeline
Tools-db is now properly puppetized; I'm going to create a changeset for review by Sean to create the matching slave this week.
tools-db is currently on labsdb1005, with a slave planned on labsdb1004. tools-db is the user mysql database for tool-labs, set up by role::labs::db::master in puppet. This involves setting up role::labs::db::slave on labsdb1004 and making sure replication works properly.
While it is already a MariaDB 5.5 configured as a master, it has 175 GB of Aria/MyISAM tables, so it will require downtime:
```
root@labsdb1005:~$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 49496738
Server version: 5.5.39-MariaDB-log Source distribution

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB TOOLSDB localhost (none) > SELECT
    ->   COUNT(*) as '# TABLES',
    ->   CONCAT(ROUND(sum(data_length) / ( 1024 * 1024 * 1024 ), 2), 'G') as DATA,
    ->   CONCAT(ROUND(sum(index_length) / ( 1024 * 1024 * 1024 ), 2), 'G') as INDEXES,
    ->   CONCAT(sum(ROUND(( data_length + index_length ) / ( 1024 * 1024 * 1024 ), 2)), 'G') as TOTAL,
    ->   engine as ENGINE
    -> FROM information_schema.TABLES
    -> GROUP BY engine;
+----------+---------+---------+---------+--------------------+
| # TABLES | DATA    | INDEXES | TOTAL   | ENGINE             |
+----------+---------+---------+---------+--------------------+
|       23 | NULL    | NULL    | NULL    | NULL               |
|      182 | 108.66G | 22.43G  | 131.00G | Aria               |
|        2 | 0.00G   | 0.00G   | 0.00G   | CSV                |
|     1480 | 128.03G | 65.50G  | 193.35G | InnoDB             |
|       71 | 0.00G   | 0.00G   | 0.00G   | MEMORY             |
|      276 | 38.13G  | 5.99G   | 44.01G  | MyISAM             |
|       17 | 0.00G   | 0.00G   | 0.00G   | PERFORMANCE_SCHEMA |
+----------+---------+---------+---------+--------------------+
7 rows in set (50.22 sec)
```
datadir: /srv/labsdb/data
As xtrabackup doesn't work with Aria, I would go for mydumper (slower, but only requires read-only mode) or an rsync + cold backup (faster, but requires full downtime for a short period). There isn't much fragmentation.
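To make the tradeoff concrete, a minimal sketch of both options, assuming a backup directory under /srv/backups; the flags are standard mydumper and rsync options, not the exact commands to be run on labsdb1005:

```
# Option 1: logical dump with mydumper while the server is kept read only
mydumper --host=localhost --user=root --password='...' \
         --threads=8 --compress \
         --outputdir=/srv/backups/toolsdb-$(date +%F)

# Option 2: rsync + cold backup; requires stopping MariaDB for the duration of the copy
service mysql stop
rsync -a /srv/labsdb/data/ /srv/backups/toolsdb-cold/
service mysql start
```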
@jcrespo I think we should go for the full downtime over a shorter period; I suspect most tools relying on tools-db don't work particularly well when read only.
@yuvipanda Let me start working on labsdb1004 and prepare everything else, and we can start thinking about a maintenance date while I measure how much time it will need.
Change 218874 had a related patch set uploaded (by Jcrespo):
Adding mariadb to labsdb1004 as a slave of labsdb1005
@yuvipanda @coren The estimated downtime needed is 10 minutes, with some minor degradation from extra disk and network use before and after, plus lower mysql performance for some hours after the restart. Ready to agree on a maintenance window at the appropriate time.
Ideally this should happen post Wikimania, so after 20th July...
Or it can happen on Monday if @coren is ok with doing that? A bit short notice, though.
My suggestion: let's wait until after Wikimania, and give a lot of advance notice (but you, Coren and Yuvi are the bosses here).
If you feel ok with it, let me take a "bad" (inconsistent) backup without locking and put it on the slave, so that in the worst-case scenario we have at least something to go back to. (PS: technically, I already have a local backup in preparation for the real one.)
@yuvipanda: Yeah, Monday is not enough notice, even if the outage is expected to be brief. Let's schedule that for after our return from Wikimania.
I can try something without downtime, but I need the gerrit:218874 patch applied. Please +1 to apply it.
I've been updating Mark on this. The backup is running, but a series of complications came up here:
- This host has MyISAM tables, which cannot be backed up in a hot way AND then be used for a replica
- mydumper (fast logical backup) segfaulted on it
- mysqldump is being extremely slow due to the load on the original host
- Even after all of that, the copy will be imperfect, and the replica setup may fail
We really need to set up downtime for a one-time proper copy. After that, everything will work (backups, failover). But I cannot guarantee having this soon.
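For reference, a simplified sketch of what that one-time proper copy could look like (hostnames are the ones from this task; the replication user, password, and binlog coordinates are placeholders, the slave is assumed to use the same datadir path, and the real procedure would have more steps):

```
# On labsdb1005 (master): clean shutdown so the datadir is consistent,
# copy it to the slave, then bring the master back up
service mysql stop
rsync -a /srv/labsdb/data/ labsdb1004.eqiad.wmnet:/srv/labsdb/data/
service mysql start

# On labsdb1004 (slave): point replication at the binlog position the master
# had at shutdown, then start the slave threads
mysql -e "CHANGE MASTER TO
            MASTER_HOST='labsdb1005.eqiad.wmnet',
            MASTER_USER='repl', MASTER_PASSWORD='...',
            MASTER_LOG_FILE='<binlog file at shutdown>', MASTER_LOG_POS=<position>;
          START SLAVE;"
```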
Imperfect replica is up and slowly catching up. It has been added to monitoring:
https://tendril.wikimedia.org/host/view/labsdb1004.eqiad.wmnet/3306
I am blocked by @akosiaris because I need to open port 3306 to 10.x via ferm for proper monitoring (include role::mariadb::ferm), but that cannot be done yet (if I remember correctly).
Clarifications on "Imperfect replica":
- Data consistency cannot be 100% guaranteed until slave health is checked with pt-table-checksum (see the sketch after this list). This is due to the hot MyISAM backup.
- Binary log has not been activated on the slave, for performance reasons
- The slave currently lacks the database s51071__templatetiger_p. Due to the size of its indexes, it could not be copied logically. It will be imported physically once the slave catches up.
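As referenced above, a hedged sketch of how the slave health check could be run with pt-table-checksum (the DSN and the checksum database are illustrative, not necessarily what will be used here):

```
# Run on the master; checksums replicate to labsdb1004 through the binlog and
# differences can then be inspected (or repaired with pt-table-sync)
pt-table-checksum --replicate=percona.checksums \
                  --no-check-binlog-format \
                  h=labsdb1005.eqiad.wmnet,u=root,p='...'
```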
My apologies, @akosiaris, I've just seen that ferm has already been applied here via the postgres role, so you are not blocking me.
This is almost finished, but the following tables failed to be imported:
```
s51071__templatetiger_p/desort5wiki
s51071__templatetiger_p/desort5wiki_para
s51071__templatetiger_p/desort5wiki_sum
s51071__templatetiger_p/frsort3wiki
s51071__templatetiger_p/frsort3wiki_para
s51071__templatetiger_p/frsort3wiki_sum
s51071__templatetiger_p/info
s51230__linkwatcher/coibot_regexmonitor
```
As a consequence, replication may break easily. Will fix soon.
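One possible way to re-import an individual missing table, shown only as an illustration (the actual fix described in the next comment may have been done differently):

```
# Dump a single failed table from the master and load it on the slave with
# replication stopped, to avoid conflicting with incoming binlog events
mysql -h labsdb1004.eqiad.wmnet -e "STOP SLAVE;"
mysqldump -h labsdb1005.eqiad.wmnet s51230__linkwatcher coibot_regexmonitor \
  | mysql -h labsdb1004.eqiad.wmnet s51230__linkwatcher
mysql -h labsdb1004.eqiad.wmnet -e "START SLAVE;"
```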
I've manually imported the above tables. There are still some (minor) issues, but the general scope of the ticket is done: replication with all tools data is up and running. I will track the other issues in a separate ticket once I can research them.
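For completeness, replication health on labsdb1004 can be confirmed with the standard MariaDB status output (nothing specific to this task):

```
# Both slave threads should be running and the lag should be low or zero
mysql -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```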