
Recover from corrupted beta MySQL slave (deployment-db04)
Closed, ResolvedPublic

Assigned To
None
Authored By
Krenair
Feb 13 2019, 8:12 PM
Referenced Files
None
Tokens
"Orange Medal" token, awarded by kostajh."Like" token, awarded by RhinosF1."Stroopwafel" token, awarded by debt."Baby Tequila" token, awarded by mmodell."Party Time" token, awarded by Jdlrobson."Love" token, awarded by Ryasmeen."Evil Spooky Haunted Tree" token, awarded by Krenair.

Description

One of the VMs affected by today's Cloud VPS hypervisor problems was deployment-db04. After the Cloud VPS peeps ran fsck it came back up but with a broken enwiki.logging table: P8078
Not being sure how to repair that exactly, and it being an old single jessie VM that appears to be a SPOF for MediaWiki (which doesn't appear willing to fall back to reading from master?), I've created a new (stretch) slave named deployment-db05.
When possible (the master, deployment-db03, is also down for related reasons but should be in an okay state) I intend to attempt https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data - this is probably a 40GB copy operation over the network, which I'll coordinate with the cloud admins.
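A minimal sketch of what such a copy can look like: a cold copy of the datadir streamed over netcat, with illustrative port and hostnames. (The documented procedure actually uses a hot backup tool and needs mysql running, as noted further down; this is just the simplest variant.)

# on the receiving host (deployment-db05), into an empty /srv/sqldata:
nc -l -p 4444 | tar xzf - -C /srv/sqldata
# on the sending host (deployment-db04), with mysql stopped:
tar czf - -C /srv/sqldata . | nc deployment-db05.deployment-prep.eqiad.wmflabs 4444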

Event Timeline

@Marostegui: /srv/sqldata has 39G on it on db04; presumably that's pretty close to the amount of data on the master.

If the master cannot be taken down to do a binary transfer, taking a mysqldump on the master and then importing it on the new instance is probably a good way to rebuild it.
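A minimal sketch of that fallback, using the usual flags for a consistent InnoDB dump that also records the master's binlog position (file name illustrative):

# on the master:
mysqldump --all-databases --single-transaction --master-data=2 > beta.sql
# on the new instance:
mysql < beta.sql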

The whole of beta (MediaWiki) is off because of the slave being broken. I think the master can safely (without causing any more disruption than there already is) be disabled to do such a transfer.

Strange, since there is a CuminExecution.py in the same directory as transfer.py

So WMFMariadbpy is WIP, and has not yet been productionized. It will likely search for those modules in /usr/lib/python/wmfmariadbpy. Either place them there or, easier until we have a proper .deb package, apply:

-from wmfmariadbpy.RemoteExecution import RemoteExecution, CommandReturn
+from RemoteExecution import RemoteExecution, CommandReturn

in CuminExecution.py

and

-from wmfmariadbpy.CuminExecution import CuminExecution as RemoteExecution
+from CuminExecution import CuminExecution as RemoteExecution

in transfer.py so the module can be found.

I have yet to productionize this properly, that is why it is not yet deployed in production.

PS: I welcome contributors, as this will help not just us DBAs but also other people manage and transfer files and databases.
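An alternative sketch that avoids editing the files at all: put the checkout on the module search path so the package-style imports resolve. The paths mirror the checkout used later in this task, and the --help invocation is illustrative:

PYTHONPATH=~/wmfmariadbpy python3 ~/wmfmariadbpy/wmfmariadbpy/transfer.py --help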

When I go to the directory and do the import manually I get:

ariel@deployment-cumin:~/wmfmariadbpy/wmfmariadbpy$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import CuminExecution.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "CuminExecution.py", line 3, in <module>
    import cumin
ImportError: No module named cumin
>>>

so that's the problem I guess.

you probably need to run python3

Check that /usr/lib/python3/dist-packages/cumin/__init__.py exists; it does in production.

Correct, it needed python3. The directory exists, and with python3 the error is different; it's likely the error jcrespo mentioned earlier.
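A quick sanity check that the module is importable under the interpreter you intend to use, as a sketch:

python3 -c 'import cumin; print(cumin.__file__)'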

Did someone fix deployment-db04? It appears to have MySQL running now...

I'm wondering if it's possible that the evacuation of db04 from the old broken hypervisor to the new one might have solved the problem.

All I did was move it and run fsck. If the db is up now, then... great?

Mentioned in SAL (#wikimedia-releng) [2019-02-18T11:29:28Z] <hasharAway> beta: tried to start instance deployment-db03 172.16.5.23 --> ERROR | T216067

I have tried to start deployment-db03.deployment-prep.eqiad.wmflabs via Horizon, but it errors out. I don't have access to the VM logs via Horizon. deployment-db03 is the master database.

Regarding jcrespo's two suggested import changes above:

The first of these makes sense, the second... there is no "from wmfmariadbpy" in transfer.py.

Mentioned in SAL (#wikimedia-releng) [2019-02-18T12:00:46Z] <Krenair> T216067 Stopping mysql on -db04 to begin copy to -db05. Note crashed tables centralauth.globaluser and centralauth.localuser

Transfer running the documented way, with a screen session for the import on -db05 and the export on -db04.
Once this is done I'll probably make deployment-db06 and copy stuff there too, then make MediaWiki treat one of them as the master.
Ignore the log above; it does require mysql to be running (I assumed the opposite).

-bash: innobackupex-1.5.1: command not found
-bash: innobackupex: command not found

And of course I can't find the package from db03/db04 because their apt status is corrupted...

If they are using our wmf-mariadb101 package, the binary is called "mariabackup"; it was put on the path in the latest versions.
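To see what a given install actually ships, something like this works (the package path is the one that turns up later in this task):

command -v mariabackup
ls /opt/wmf-mariadb101/bin/ | grep -i backup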

It doesn't appear to work the same way:

krenair@deployment-db05:~$ sudo mariabackup --apply-log --use-memory=12G /srv/sqldata
Info: Using unique option prefix 'apply-log' is error-prone and can break in the future. Please use the full name 'apply-log-only' instead.
mariabackup: Error: unknown argument: '/srv/sqldata'
krenair@deployment-db05:~$ sudo mariabackup --apply-log --use-memory=12G --datadir /srv/sqldata
Info: Using unique option prefix 'apply-log' is error-prone and can break in the future. Please use the full name 'apply-log-only' instead.
mariabackup based on MariaDB server 10.1.38-MariaDB Linux (x86_64) 
Open source backup tool for InnoDB and XtraDB

Copyright (C) 2009-2015 Percona LLC and/or its affiliates.
Portions Copyright (C) 2000, 2011, MySQL AB & Innobase Oy. All Rights Reserved.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation version 2
of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You can download full text of the license on http://www.gnu.org/licenses/gpl-2.0.txt

Usage: mariabackup [--defaults-file=#] [--backup | --prepare | --copy-back | --move-back] [OPTIONS]
[...]

Mentioned in SAL (#wikimedia-releng) [2019-02-20T00:25:14Z] <greg-g> disabled beta-update-databases-eqiad in the jenkins UI - T216067

Mentioned in SAL (#wikimedia-releng) [2019-02-20T17:26:29Z] <hashar> For beta cluster the MySQL master database has some innodb issue T216635 , the MySQL slave has an issue as well T216067

> And of course I can't find the package from db03/db04 because their apt status is corrupted...

I worked around that by restoring a backup of the dpkg status file. Two backup files were corrupted as well; I had to use dpkg.status.2.gz:

sudo zcat /var/backups/dpkg.status.2.gz | sudo tee /var/lib/dpkg/status

On deployment-db03, I found a bunch of binaries in /opt/wmf-mariadb10/bin
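With the status file restored, dpkg can confirm which package owns those binaries; a sketch using the path found above:

dpkg -S /opt/wmf-mariadb10/bin/mysqld
dpkg -l | grep -i mariadb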

Sounds good. deployment-db03 (the master) seems heavily corrupted (T216635#4969813) :-/ I have stalled that task pending the rebuild that is happening here.

greg triaged this task as High priority.Feb 22 2019, 12:14 AM

(setting to high to reflect reality)

Subscribing...not being able to test writes in beta is blocking some other stuff.

Thank you for working on this!

Any updates here? I see beta still in readonly...

Is there any time estimation for this? Not to annoy; I'm just trying to understand the timeline.

Would be good to have an ETA on the fix.
It's a big thing to be down.

Repeatedly asking for ETAs does not help get the issue fixed faster. We are working on it, but it is not a simple issue.

@greg I understand, and I'm sure it's hard, and I'm sorry. Just to get a rough guesstimate: are we talking about hours, days, or weeks?

Again, I don't want to annoy; it's just information we need to make decisions. So basically, just a minor clarification would be nice.

There is no explicit ETA at this time.

I plan to work on this today (and tomorrow if I can't get it running today)

@Krenair: how far did you get with the slave?

ok it looks like the data copy finished without errors.

Mariadb is now running on db05!

root@deployment-db05:/srv/sqldata# chown mysql.mysql * .
root@deployment-db05:/srv/sqldata# chown mysql.mysql * -R
root@deployment-db05:/srv/sqldata# systemctl start mariadb
root@deployment-db05:/srv/sqldata# systemctl status mariadb.service
● mariadb.service - mariadb database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-02-25 18:18:26 UTC; 3s ago
 Main PID: 12016 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 27 (limit: 4915)
   CGroup: /system.slice/mariadb.service
           └─12016 /opt/wmf-mariadb101/bin/mysqld

> @Krenair: how far did you get with the slave?

I wasn't able to complete the documented apply-log instruction due to the issue I posted above

@Krenair: it seems like your transfer completed (in screen) but I'm not sure which procedure you were following?

Mentioned in SAL (#wikimedia-releng) [2019-02-25T23:32:10Z] <twentyafterfour> root@deployment-db05# mariabackup --innobackupex --apply-log --use-memory=10G /srv/sqldata # T216067

So I figured out how to do innobackupex with our currently installed packages: mariabackup has an innobackupex compatibility mode. [1]

So after running that, I got:

190225 23:30:41 completed OK!

Now I guess the thing to do is make db04 the new master and db05 a slave to db04?

  1. https://mariadb.com/kb/en/library/mariabackup-options/#-innobackupex
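For reference, the native mariabackup spelling of that apply-log step should be something like the following, per the option docs linked above (memory size and target dir mirror the ones used here):

mariabackup --prepare --use-memory=10G --target-dir=/srv/sqldata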

@Krenair: hope you don't mind me claiming this one?

Change master on db05 to point to db04

CHANGE MASTER to MASTER_USER='repl', MASTER_PASSWORD='...', MASTER_PORT=3306, MASTER_HOST='deployment-db04', MASTER_LOG_FILE='deployment-db04-bin.000136', MASTER_LOG_POS=368;
Query OK, 0 rows affected (0.02 sec)
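The usual follow-up is to start replication and check that both replication threads come up; a sketch of the standard commands:

sudo mysql -e 'START SLAVE'
sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'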

Change 492938 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/mediawiki-config@master] deployment-prep: swap db master / slave

https://gerrit.wikimedia.org/r/492938

Change 492938 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: swap db master / slave

https://gerrit.wikimedia.org/r/492938

it appears that we're out of read-only and replicating properly

beta logstash looks good. we're back folks.

Ping @RazShuty, @Ottomata just to alert you that this is resolved.

Also, thanks to @Krenair, @hashar, @Marostegui, @jcrespo, @ArielGlenn for working on this before I came along.

Yay! Thanks so much @mmodell and everyone else involved in fixing this.

Omg you're all amazing! Thanks a lot!!!! Much much much appreciated!

I have a request: can the procedure for setting up the replica/new master/whatever it was that finally happened here get put on beta's wikitech pages someplace, from start to finish, with any little tricks or tweaks needed?

@ArielGlenn it's mostly covered on https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data but not very clearly... I'll make an attempt to organize it better on a new wikitech page. It definitely needs to be easier to understand.

The package doesn't come from puppet?

I'm not sure; I got to this kinda late. I'm still tweaking the docs by re-reading the history on this task.

The page title looks wrong: 'Nove Resource:' instead of 'Nova Resource:'.

Page moved and links updated to point to the correct title: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/MariaDB_Slave_instance_setup