Make sure tools-db is replicated somewhere
Closed, ResolvedPublic

Description

The original plan (AFAIK) is for one of the labsdb hosts to be the master for tools-db and a slave for postgres, and the other to be the master for postgres and a slave for tools-db. Right now tools-db isn't puppetized and doesn't have a slave anywhere. It should.

Event Timeline

yuvipanda assigned this task to coren.
yuvipanda raised the priority of this task from to High.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: faidon, mark, Aklapper, yuvipanda.

@coren: can we get an update on this? Has there been any progress at all?

coren set Security to None.
coren moved this task from Backlog to Doing on the Labs-Q4-Sprint-1 board.

Tools-db is now properly puppetized; I'll create a changeset this week, for review by Sean, to set up the matching slave.

yuvipanda removed a project: Labs-Sprint-101.

tools-db is currently on labsdb1005, with a slave planned on labsdb1004. tools-db is the user MySQL database for Tool Labs, set up by role::labs::db::master in puppet. This task involves setting up role::labs::db::slave on labsdb1004 and making sure replication works properly.
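
Once the slave role is applied, replication health on labsdb1004 would be checked roughly as follows (a minimal sketch; the account and expected output are assumptions, not the actual procedure):

root@labsdb1004:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# Expect Slave_IO_Running: Yes, Slave_SQL_Running: Yes,
# and Seconds_Behind_Master trending towards 0.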

mysql: labsdb1005 -> labsdb1004 (labstore1004 is a mistake)

While it is already a MariaDB 5.5 configured as a master, it has 175 GB of Aria/MyISAM tables, so setting up the slave will require downtime:

root@labsdb1005:~$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 49496738
Server version: 5.5.39-MariaDB-log Source distribution

Copyright (c) 2000, 2014, Oracle, Monty Program Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB TOOLSDB localhost (none) > SELECT
    ->   COUNT(*) as '# TABLES',
    ->   CONCAT(ROUND(sum(data_length) / ( 1024 * 1024 * 1024 ), 2), 'G') as DATA,
    ->   CONCAT(ROUND(sum(index_length) / ( 1024 * 1024 * 1024 ), 2), 'G') as INDEXES,
    ->   CONCAT(sum(ROUND(( data_length + index_length ) / ( 1024 * 1024 * 1024 ), 2)), 'G') as TOTAL,
    ->   engine as ENGINE
    -> FROM information_schema.TABLES
    -> GROUP BY engine;
+----------+---------+---------+---------+--------------------+
| # TABLES | DATA    | INDEXES | TOTAL   | ENGINE             |
+----------+---------+---------+---------+--------------------+
|       23 | NULL    | NULL    | NULL    | NULL               |
|      182 | 108.66G | 22.43G  | 131.00G | Aria               |
|        2 | 0.00G   | 0.00G   | 0.00G   | CSV                |
|     1480 | 128.03G | 65.50G  | 193.35G | InnoDB             |
|       71 | 0.00G   | 0.00G   | 0.00G   | MEMORY             |
|      276 | 38.13G  | 5.99G   | 44.01G  | MyISAM             |
|       17 | 0.00G   | 0.00G   | 0.00G   | PERFORMANCE_SCHEMA |
+----------+---------+---------+---------+--------------------+
7 rows in set (50.22 sec)

datadir: /srv/labsdb/data

As xtrabackup doesn't work with Aria, I would go for mydumper (slower, but only requires read-only mode) or an rsync + cold backup (faster, but requires full downtime for a short period). There isn't much fragmentation.
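
The read-only route would be a mydumper invocation along these lines (a sketch only; thread count, credentials handling and output path are made up for illustration, not the command actually used):

# Parallel logical dump while mysqld stays up; clients keep read access.
root@labsdb1005:~$ mydumper --host=localhost --user=root \
      --threads=4 --compress --outputdir=/srv/backups/toolsdb-dump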

@jcrespo I think we should go with the full downtime for a shorter period - I suspect most tools relying on tools-db don't work particularly well when read-only.

@yuvipanda Let me start working on labsdb1004 and prepare everything else, and we can start thinking about a maintenance date while I measure how much time it will need.

Change 218874 had a related patch set uploaded (by Jcrespo):
Adding mariadb to labsdb1004 as a slave of labsdb1005

https://gerrit.wikimedia.org/r/218874

@yuvipanda @coren The estimated downtime needed is 10 minutes, with some minor degradation from extra disk and network use before and after, plus lower MySQL performance for some hours after the restart. Ready to agree on a maintenance window at the appropriate time.

Ideally this should happen post Wikimania, so after 20th July...

Or it can happen on Monday, if @coren is ok with doing that? A bit short notice, though.

My suggestion: let's wait until after Wikimania and give a lot of advance notice (but you, Coren and Yuvi are the bosses here).

If you feel ok with it, let me take a "bad" backup (an inconsistent one) without locking and put it on the slave, so that, worst case, we have at least something to go back to. (PS: technically, I already have a local backup in preparation for the real one.)

@yuvipanda: Yeah, Monday is not enough notice, even if the outage is expected to be brief. Let's schedule that for after our return from Wikimania.

I can try something without downtime, but I need the gerrit:218874 patch applied. Please +1 to apply it.

Change 218874 merged by Jcrespo:
Populate labsdb1004 with mariadb

https://gerrit.wikimedia.org/r/218874

I've been updating Mark on this. The backup is running, but a series of complications came up here:

  • This host has MyISAM tables, which cannot be backed up hot AND be used for a replica
  • mydumper (fast logical backup) segfaulted on it
  • mysqldump is extremely slow due to the load on the original host
  • Even after all of that, the copy will be imperfect and the replica setup may fail (see the sketch below)
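
For illustration, the kind of hot dump that yields such an imperfect copy is a plain mysqldump with binlog coordinates (generic options, not the exact command being run):

# --single-transaction only gives a consistent snapshot for InnoDB tables;
# MyISAM/Aria tables can keep changing while they are dumped, hence the
# imperfect copy. --master-data=2 records the binlog coordinates as a
# comment so replication can later be pointed at them.
root@labsdb1005:~$ mysqldump --all-databases --single-transaction --master-data=2 \
      | gzip > /srv/backups/toolsdb-hot.sql.gz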

We really need to set up downtime for a one-time proper copy. After that, everything will work (backups, failover). But I cannot guarantee having this soon.
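
For the record, the one-time proper copy during the downtime window would look roughly like this (a sketch under assumptions; the replication user, password and binlog coordinates are placeholders):

# On the master (labsdb1005): block (non-SUPER) writes, note the binlog
# coordinates, then cold-copy the datadir while mysqld is stopped.
root@labsdb1005:~$ mysql -e "SET GLOBAL read_only = 1; SHOW MASTER STATUS\G"
root@labsdb1005:~$ service mysql stop
root@labsdb1005:~$ rsync -a /srv/labsdb/data/ labsdb1004.eqiad.wmnet:/srv/labsdb/data/
root@labsdb1005:~$ service mysql start

# On the slave (labsdb1004): point replication at the noted coordinates.
root@labsdb1004:~$ mysql
MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='labsdb1005.eqiad.wmnet',
    ->   MASTER_USER='repl', MASTER_PASSWORD='...',
    ->   MASTER_LOG_FILE='<noted file>', MASTER_LOG_POS=<noted position>;
MariaDB [(none)]> START SLAVE;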

Imperfect replica is up and slowly catching up. It has been added to monitoring:

https://tendril.wikimedia.org/host/view/labsdb1004.eqiad.wmnet/3306

I am blocked by @akosiaris because I need to open port 3306 to 10.x via ferm for proper monitoring (include role::mariadb::ferm), but that cannot be done yet (if I remember correctly).
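
For context, what role::mariadb::ferm needs to open is simply the MariaDB port to the internal 10.x ranges; the resulting firewall rule would be roughly equivalent to this (illustrative, not the actual ferm output):

# Allow MariaDB (3306/tcp) from internal 10.x addresses, e.g. for monitoring.
iptables -A INPUT -p tcp --dport 3306 -s 10.0.0.0/8 -j ACCEPT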

Clarifications on "Imperfect replica":

  • Data consistency cannot be guaranteed to be 100% until slave health is checked with pt-table-checksum (see the sketch after this list). This is due to the hot MyISAM backup.
  • The binary log has not been activated on the slave, for performance reasons
  • The slave currently lacks the database s51071__templatetiger_p. Due to the size of its indexes, it could not be copied logically; it will be imported physically once the slave catches up.
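
The consistency check mentioned in the first point would be run from the master along these lines (a sketch; the checksum table location and credentials are assumptions):

# Checksums are computed on the master and replicated; differences then show
# up on the slave and can be reported per table.
root@labsdb1005:~$ pt-table-checksum --replicate=percona.checksums h=localhost,u=root
root@labsdb1005:~$ pt-table-checksum --replicate=percona.checksums --replicate-check-only h=localhost,u=root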

My apologies, @akosiaris: I've just seen that ferm has already been applied here via the postgres role, so you are not blocking me.

This is almost finished, but the following tables failed to be imported:

s51071__templatetiger_p/desort5wiki
s51071__templatetiger_p/desort5wiki_para
s51071__templatetiger_p/desort5wiki_sum
s51071__templatetiger_p/frsort3wiki
s51071__templatetiger_p/frsort3wiki_para
s51071__templatetiger_p/frsort3wiki_sum
s51071__templatetiger_p/info

s51230__linkwatcher/coibot_regexmonitor

As a consequence, replication may break easily. Will fix soon.
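
For reference, a manual re-import of one of these tables would look roughly like this (illustrative commands; the actual fix may have differed):

# On the slave: pause replication, reload the table from a fresh dump taken
# from the master, then resume.
root@labsdb1004:~$ mysql -e "STOP SLAVE"
root@labsdb1004:~$ mysqldump -h labsdb1005.eqiad.wmnet s51071__templatetiger_p info | mysql s51071__templatetiger_p
root@labsdb1004:~$ mysql -e "START SLAVE"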

I've manually imported the above tables. There are still some (minor) issues, but the general scope of the ticket is done: replication with all tools data is up and running. I will track the other issues in a separate ticket once I can research them.