One of the VMs affected by today's Cloud VPS hypervisor problems was deployment-db04. After the Cloud VPS folks ran fsck, it came back up, but with a broken enwiki.logging table: P8078
Not being sure how to repair that exactly, and since it's an old, lone jessie VM that appears to be a SPOF for MediaWiki (which doesn't seem willing to fall back to reading from the master?), I've created a new (stretch) slave named deployment-db05.
When possible (the master, deployment-db03, is also down for related reasons but should be in an okay state), I intend to attempt https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data - this is probably a 40GB copy operation over the network, which I'll coordinate with the cloud admins.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
deployment-prep: swap db master / slave | operations/mediawiki-config | master | +10 -10
Related Objects
- Mentioned In
  - T242607: Create in-cloud puppetmaster for codfw1dev
  - T219088: Ensure we are unlikely to have both deployment-prep DB instances hosted together again in future
  - T219087: Get rid of deployment-db0[34]
  - T216635: MySQL database on deployment-db03 does not start due to InnoDB issue
  - T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start
- Mentioned Here
  - T216635: MySQL database on deployment-db03 does not start due to InnoDB issue
  - P8078 deployment-db04 enwiki.logging corruption 2019-02-13
Event Timeline
The whole of beta (MediaWiki) is down because the slave is broken. I think the master can safely be disabled to do such a transfer, without causing any more disruption than there already is.
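For reference, the core of the wikitech "Transferring Data" procedure linked in the description is a cold copy of the MySQL datadir over netcat. A minimal sketch, with both daemons stopped; the host names are from this task, but the port and exact tar/nc flags are assumptions, not the documented commands:

```shell
# Sketch of a tar-over-netcat cold copy of the datadir. Port 4444 is an
# arbitrary choice; run the receiver first, then the sender.

# On the receiving host (deployment-db05), with mysql stopped and an
# empty /srv/sqldata, listen and unpack the incoming stream:
nc -l -p 4444 | tar xf - -C /srv/sqldata

# On the sending host (deployment-db04), with mysql stopped, stream the
# whole datadir across the network:
tar cf - -C /srv/sqldata . | nc deployment-db05 4444
```

Stopping mysqld on the source first is what makes this a consistent copy; a live datadir copied this way would need a backup tool like mariabackup instead.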
Strange, since there is a CuminExecution.py in the same directory as transfer.py.
So wmfmariadbpy is a WIP and has not yet been productionized. It will likely search for those modules in /usr/lib/python/wmfmariadbpy. Either place them there or, easier until we have a proper .deb package, apply

```
-from wmfmariadbpy.RemoteExecution import RemoteExecution, CommandReturn
+from RemoteExecution import RemoteExecution, CommandReturn
```

in CuminExecution.py, and

```
-from wmfmariadbpy.CuminExecution import CuminExecution as RemoteExecution
+from CuminExecution import CuminExecution as RemoteExecution
```

in transfer.py so it can be found on .
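An alternative to editing the imports when running straight from a checkout is to point PYTHONPATH at the directory containing the package, since Python prepends it to the module search path. A small self-contained demonstration; the directory and module names below are hypothetical stand-ins, not real wmfmariadbpy files:

```shell
# Demonstrate that PYTHONPATH makes a module in an arbitrary directory
# importable without touching its import statements. demo_module is a
# made-up placeholder for something like CuminExecution.py.
demo_dir=$(mktemp -d)
printf 'VALUE = 42\n' > "$demo_dir/demo_module.py"
PYTHONPATH="$demo_dir" python3 -c 'import demo_module; print(demo_module.VALUE)'
```

The same idea applied here would be something like `PYTHONPATH=~/wmfmariadbpy python3 transfer.py ...`, avoiding any source edits.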
I have yet to productionize this properly, that is why it is not yet deployed in production.
PS: I welcome contributors, as this will help us DBAs, but also other people, manage and transfer files and databases.
When I go to the directory and do the import manually I get:
```
ariel@deployment-cumin:~/wmfmariadbpy/wmfmariadbpy$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import CuminExecution.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "CuminExecution.py", line 3, in <module>
    import cumin
ImportError: No module named cumin
>>>
```
so that's the problem I guess.
Correct, it needed python3. The directory exists, and with python3 the error is different; it's likely the error jcrespo mentioned earlier.
I'm wondering if it's possible that the evacuation of db04 from the old broken hypervisor to the new one might have solved the problem.
Mentioned in SAL (#wikimedia-releng) [2019-02-18T11:29:28Z] <hasharAway> beta: tried to start instance deployment-db03 172.16.5.23 --> ERROR | T216067
I have tried to start deployment-db03.deployment-prep.eqiad.wmflabs via Horizon, but it errors out. I don't have access to the VM logs via Horizon. deployment-db03 is the master database.
The first of these makes sense, the second... there is no "from wmfmariadbpy" in transfer.py.
Mentioned in SAL (#wikimedia-releng) [2019-02-18T12:00:46Z] <Krenair> T216067 Stopping mysql on -db04 to begin copy to -db05. Note crashed tables centralauth.globaluser and centralauth.localuser
Transfer running the documented way, screen import on -db05 and export on -db04.
Once this is done I'll probably make deployment-db06 and copy stuff there too, then make MediaWiki treat one of them as a master.
Ignore the log above; it does require mysql to be running (I assumed the opposite).
And of course I can't find the package from db03/db04 because their apt status is corrupted...
If they are using our wmf-mariadb101 package, the binary is called "mariabackup"; it was added to the PATH in the latest versions.
It doesn't appear to work the same way:
```
krenair@deployment-db05:~$ sudo mariabackup --apply-log --use-memory=12G /srv/sqldata
Info: Using unique option prefix 'apply-log' is error-prone and can break in the future. Please use the full name 'apply-log-only' instead.
mariabackup: Error: unknown argument: '/srv/sqldata'
krenair@deployment-db05:~$ sudo mariabackup --apply-log --use-memory=12G --datadir /srv/sqldata
Info: Using unique option prefix 'apply-log' is error-prone and can break in the future. Please use the full name 'apply-log-only' instead.
mariabackup based on MariaDB server 10.1.38-MariaDB Linux (x86_64)
Open source backup tool for InnoDB and XtraDB
Copyright (C) 2009-2015 Percona LLC and/or its affiliates.
Portions Copyright (C) 2000, 2011, MySQL AB & Innobase Oy. All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You can download full text of the license on http://www.gnu.org/licenses/gpl-2.0.txt

Usage: mariabackup [--defaults-file=#] [--backup | --prepare | --copy-back | --move-back] [OPTIONS] [...]
```
Mentioned in SAL (#wikimedia-releng) [2019-02-20T00:25:14Z] <greg-g> disabled beta-update-databases-eqiad in the jenkins UI - T216067
Mentioned in SAL (#wikimedia-releng) [2019-02-20T17:26:29Z] <hashar> For beta cluster the MySQL master database has some innodb issue T216635 , the MySQL slave has an issue as well T216067
I worked around that by restoring a backup of the dpkg status file. Two of the backup files were corrupted as well, so I had to use dpkg.status.2.gz:

```
sudo zcat /var/backups/dpkg.status.2.gz | sudo tee /var/lib/dpkg/status
```
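After restoring the status file, a quick sanity check with standard dpkg/apt tooling (nothing specific to this task) confirms the database parses again and flags any packages left half-configured:

```shell
# List packages that dpkg considers broken or only partially installed;
# prints nothing when the restored status database is consistent.
sudo dpkg --audit

# Ask apt to verify that the package cache and dependency state are sane.
sudo apt-get check
```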
On deployment-db03, I found a bunch of binaries in /opt/wmf-mariadb10/bin
Sounds good. deployment-db03 (the master) seems heavily corrupted T216635#4969813 :-/ I have stalled the task pending the rebuild that is happening here.
Subscribing... not being able to test writes in beta is blocking some other stuff.
Thank you for working on this!
Is there any time estimate for this? Not to annoy, I'm just trying to understand the timeline.
Repeatedly asking for ETAs does not help get the issue fixed faster. We are working on it, but it is not a simple issue.
@greg I understand, and I'm sure it's hard, and I'm sorry; just for a guesstimate, are we talking about hours, days, or weeks?
Again, I don't want to annoy; it's just information we need to make decisions, so a minor clarification would be nice.
I plan to work on this today (and tomorrow if I can't get it running today)
@Krenair: how far did you get with the slave?
Mariadb is now running on db05!
```
root@deployment-db05:/srv/sqldata# chown mysql.mysql * .
root@deployment-db05:/srv/sqldata# chown mysql.mysql * -R
root@deployment-db05:/srv/sqldata# systemctl start mariadb
root@deployment-db05:/srv/sqldata# systemctl status mariadb.service
● mariadb.service - mariadb database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-02-25 18:18:26 UTC; 3s ago
 Main PID: 12016 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 27 (limit: 4915)
   CGroup: /system.slice/mariadb.service
           └─12016 /opt/wmf-mariadb101/bin/mysqld
```
I wasn't able to complete the documented apply-log step due to the issue I posted above.
@Krenair: it seems like your transfer completed (in screen) but I'm not sure which procedure you were following?
The one in the description, https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data
The bit it says to do before starting mysql.
Mentioned in SAL (#wikimedia-releng) [2019-02-25T23:32:10Z] <twentyafterfour> root@deployment-db05# mariabackup --innobackupex --apply-log --use-memory=10G /srv/sqldata # T216067
So I figured out how to do innobackupex with our currently installed packages: mariabackup has an innobackupex compatibility mode. [1]
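For anyone hitting the same "unknown argument" error, the same prepare step can also be expressed in mariabackup's native interface rather than the innobackupex compatibility mode. A sketch, assuming the MariaDB 10.1 mariabackup shipped in wmf-mariadb101; the key difference is `--target-dir` instead of a positional datadir, and the memory figure is just what was used in this task:

```shell
# Native-mode equivalent of `mariabackup --innobackupex --apply-log ...`:
# apply the redo log to the copied datadir so it is consistent for startup.
sudo mariabackup --prepare --use-memory=10G --target-dir=/srv/sqldata
```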
So after running that, I got:
190225 23:30:41 completed OK!
Now I guess the thing to do is make db04 the new master and db05 a slave to db04?
Change master on db05 to point to db04
```
CHANGE MASTER to MASTER_USER='repl', MASTER_PASSWORD='...', MASTER_PORT=3306, MASTER_HOST='deployment-db04', MASTER_LOG_FILE='deployment-db04-bin.000136', MASTER_LOG_POS=368;
Query OK, 0 rows affected (0.02 sec)
```
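To complete the pairing after CHANGE MASTER, replication still has to be started and checked. A hedged sketch of the follow-up from a root shell on db05 (connection credentials/socket options omitted; the grep just pulls out the two replication-thread status lines, which should both say Yes):

```shell
# Start the replication threads on the new slave.
mysql -e "START SLAVE;"

# Confirm both the IO thread (pulling binlogs from db04) and the SQL
# thread (applying them) are running.
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running'
```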
Change 492938 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/mediawiki-config@master] deployment-prep: swap db master / slave
Change 492938 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: swap db master / slave
Beta logstash looks good. We're back, folks.
Ping @RazShuty, @Ottomata just to alert you that this is resolved.
Also, thanks to @Krenair, @hashar, @Marostegui, @jcrespo, @ArielGlenn for working on this before I came along.
I have a request: can the procedure for setting up the replica/new master/whatever it was that finally happened here, from start to finish, get put on beta's wikitech pages someplace, with any little tricks or tweaks needed?
@ArielGlenn it's mostly covered on https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data but not very clearly... I'll make an attempt to organize it better on a new wikitech page. It definitely needs to be easier to understand.
I'm not sure; I got to this kind of late. I'm still tweaking the docs by re-reading the history on this task.
Page moved and links updated to point to the correct title: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/MariaDB_Slave_instance_setup