
Recover from corrupted beta MySQL slave (deployment-db04)
Closed, ResolvedPublic

Assigned To
None
Authored By
Krenair
Feb 13 2019, 8:12 PM
Referenced Files
None
Tokens
"Orange Medal" token, awarded by kostajh."Like" token, awarded by RhinosF1."Stroopwafel" token, awarded by debt."Baby Tequila" token, awarded by mmodell."Party Time" token, awarded by Jdlrobson."Love" token, awarded by Ryasmeen."Evil Spooky Haunted Tree" token, awarded by Krenair.

Description

One of the VMs affected by today's Cloud VPS hypervisor problems was deployment-db04. After the Cloud VPS peeps ran fsck it came back up but with a broken enwiki.logging table: P8078
Not being sure how to repair that exactly, and it being an old single jessie VM that appears to be a SPOF for MediaWiki (which doesn't appear willing to fall back to reading from master?), I've created a new (stretch) slave named deployment-db05.
When possible (the master, deployment-db03, is also down for related reasons but should be in an okay state) I intend to attempt https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data - this is probably a 40GB copy operation over the network, which I'll coordinate with the cloud admins.
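A minimal sketch of what such a copy can look like: a cold copy of the datadir streamed over netcat, with illustrative port and hostnames. (The documented procedure actually uses a hot backup tool and needs mysql running, as noted further down; this is just the simplest variant.)

# on the receiving host (deployment-db05), into an empty /srv/sqldata:
nc -l -p 4444 | tar xzf - -C /srv/sqldata
# on the sending host (deployment-db04), with mysql stopped:
tar czf - -C /srv/sqldata . | nc deployment-db05.deployment-prep.eqiad.wmflabs 4444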

Event Timeline

@Marostegui: /srv/sqldata has 39G on it on db04; presumably that's pretty close to the amount of data on the master.

If the master cannot be taken down to do a binary transfer, taking a mysqldump on the master and then importing it on the new instance is probably a good way to rebuild it.
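A minimal sketch of that fallback, using the usual flags for a consistent InnoDB dump that also records the master's binlog position (file name illustrative):

# on the master:
mysqldump --all-databases --single-transaction --master-data=2 > beta.sql
# on the new instance:
mysql < beta.sql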

The whole of beta (MediaWiki) is off because of the slave being broken. I think the master can safely (without causing any more disruption than there already is) be disabled to do such a transfer.

Strange, since there is a CuminExecution.py in the same directory as transfer.py

So WMFMariadbpy is WIP, and has not yet been productionized. It will likely search for those modules in /usr/lib/python/wmfmariadbpy. Either place them there or, easier until we have a proper .deb package, apply:

-from wmfmariadbpy.RemoteExecution import RemoteExecution, CommandReturn
+from RemoteExecution import RemoteExecution, CommandReturn

in CuminExecution.py

and

-from wmfmariadbpy.CuminExecution import CuminExecution as RemoteExecution
+from CuminExecution import CuminExecution as RemoteExecution

in transfer.py so the module can be found.

I have yet to productionize this properly, that is why it is not yet deployed in production.

PS: I welcome contributors, as this will help not just us DBAs but also other people manage and transfer files and databases.
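An alternative sketch that avoids editing the files at all: put the checkout on the module search path so the package-style imports resolve. The paths mirror the checkout used later in this task, and the --help invocation is illustrative:

PYTHONPATH=~/wmfmariadbpy python3 ~/wmfmariadbpy/wmfmariadbpy/transfer.py --help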

When I go to the directory and do the import manually I get:

ariel@deployment-cumin:~/wmfmariadbpy/wmfmariadbpy$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import CuminExecution.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "CuminExecution.py", line 3, in <module>
    import cumin
ImportError: No module named cumin
>>>

so that's the problem I guess.

you probably need to run python3

Check that /usr/lib/python3/dist-packages/cumin/__init__.py exists; it does in production.

Correct, it needed python3. The directory exists, and with python3 the error is different; it's likely the error jcrespo mentioned earlier.
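A quick sanity check that the module is importable under the interpreter you intend to use, as a sketch:

python3 -c 'import cumin; print(cumin.__file__)'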

Did someone fix deployment-db04? It appears to have MySQL running now...

I'm wondering if it's possible that the evacuation of db04 from the old broken hypervisor to the new one might have solved the problem.

All I did was move it and run fsck. If the db is up now, then... great?

Mentioned in SAL (#wikimedia-releng) [2019-02-18T11:29:28Z] <hasharAway> beta: tried to start instance deployment-db03 172.16.5.23 --> ERROR | T216067

I have tried to start deployment-db03.deployment-prep.eqiad.wmflabs via Horizon, but it errors out. I don't have access to the VM logs via Horizon. deployment-db03 is the master database.

Regarding jcrespo's two suggested import changes above:

The first of these makes sense, the second... there is no "from wmfmariadbpy" in transfer.py.

Mentioned in SAL (#wikimedia-releng) [2019-02-18T12:00:46Z] <Krenair> T216067 Stopping mysql on -db04 to begin copy to -db05. Note crashed tables centralauth.globaluser and centralauth.localuser

Transfer running the documented way, with a screen session for the import on -db05 and the export on -db04.
Once this is done I'll probably make deployment-db06 and copy stuff there too, then make MediaWiki treat one of them as the master.
Ignore the log above; it does require mysql to be running (I assumed the opposite).

-bash: innobackupex-1.5.1: command not found
-bash: innobackupex: command not found

And of course I can't find the package from db03/db04 because their apt status is corrupted...

If they are using our wmf-mariadb101 package, the binary is called "mariabackup"; it was put on the path in the latest versions.
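To see what a given install actually ships, something like this works (the package path is the one that turns up later in this task):

command -v mariabackup
ls /opt/wmf-mariadb101/bin/ | grep -i backup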

It doesn't appear to work the same way:

krenair@deployment-db05:~$ sudo mariabackup --apply-log --use-memory=12G /srv/sqldata
Info: Using unique option prefix 'apply-log' is error-prone and can break in the future. Please use the full name 'apply-log-only' instead.
mariabackup: Error: unknown argument: '/srv/sqldata'
krenair@deployment-db05:~$ sudo mariabackup --apply-log --use-memory=12G --datadir /srv/sqldata
Info: Using unique option prefix 'apply-log' is error-prone and can break in the future. Please use the full name 'apply-log-only' instead.
mariabackup based on MariaDB server 10.1.38-MariaDB Linux (x86_64) 
Open source backup tool for InnoDB and XtraDB

Copyright (C) 2009-2015 Percona LLC and/or its affiliates.
Portions Copyright (C) 2000, 2011, MySQL AB & Innobase Oy. All Rights Reserved.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation version 2
of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You can download full text of the license on http://www.gnu.org/licenses/gpl-2.0.txt

Usage: mariabackup [--defaults-file=#] [--backup | --prepare | --copy-back | --move-back] [OPTIONS]
[...]

Mentioned in SAL (#wikimedia-releng) [2019-02-20T00:25:14Z] <greg-g> disabled beta-update-databases-eqiad in the jenkins UI - T216067

Mentioned in SAL (#wikimedia-releng) [2019-02-20T17:26:29Z] <hashar> For beta cluster the MySQL master database has some innodb issue T216635 , the MySQL slave has an issue as well T216067

> And of course I can't find the package from db03/db04 because their apt status is corrupted...

I worked around that by restoring a backup of the dpkg status file. Two backup files were corrupted as well; I had to use dpkg.status.2.gz:

sudo zcat /var/backups/dpkg.status.2.gz | sudo tee /var/lib/dpkg/status

On deployment-db03, I found a bunch of binaries in /opt/wmf-mariadb10/bin
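With the status file restored, dpkg can confirm which package owns those binaries; a sketch using the path found above:

dpkg -S /opt/wmf-mariadb10/bin/mysqld
dpkg -l | grep -i mariadb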

Sounds good. deployment-db03 (the master) seems heavily corrupted (T216635#4969813) :-/ I have stalled that task pending the rebuild that is happening here.

greg triaged this task as High priority.Feb 22 2019, 12:14 AM

(setting to high to reflect reality)

Subscribing...not being able to test writes in beta is blocking some other stuff.

Thank you for working on this!

Any updates here? I see beta still in readonly...

Is there any time estimation for this? Not to annoy; I'm just trying to understand the timeline.

Would be good to have an ETA on the fix.
It's a big thing to be down.

Repeatedly asking for ETAs does not help get the issue fixed faster. We are working on it, but it is not a simple issue.

@greg I understand, and I'm sure it's hard, and I'm sorry. Just to get a rough guesstimate: are we talking about hours, days, or weeks?

Again, I don't want to annoy; it's just information we need to make decisions. So basically, just a minor clarification would be nice.

There is no explicit ETA at this time.

I plan to work on this today (and tomorrow if I can't get it running today)

@Krenair: how far did you get with the slave?

ok it looks like the data copy finished without errors.

Mariadb is now running on db05!

root@deployment-db05:/srv/sqldata# chown mysql.mysql * .
root@deployment-db05:/srv/sqldata# chown mysql.mysql * -R
root@deployment-db05:/srv/sqldata# systemctl start mariadb
root@deployment-db05:/srv/sqldata# systemctl status mariadb.service
● mariadb.service - mariadb database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-02-25 18:18:26 UTC; 3s ago
 Main PID: 12016 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 27 (limit: 4915)
   CGroup: /system.slice/mariadb.service
           └─12016 /opt/wmf-mariadb101/bin/mysqld

> @Krenair: how far did you get with the slave?

I wasn't able to complete the documented apply-log instruction due to the issue I posted above

@Krenair: it seems like your transfer completed (in screen) but I'm not sure which procedure you were following?

Mentioned in SAL (#wikimedia-releng) [2019-02-25T23:32:10Z] <twentyafterfour> root@deployment-db05# mariabackup --innobackupex --apply-log --use-memory=10G /srv/sqldata # T216067

So I figured out how to do innobackupex with our currently installed packages: mariabackup has an innobackupex compatibility mode. [1]

So after running that, I got:

190225 23:30:41 completed OK!

Now I guess the thing to do is make db04 the new master and db05 a slave to db04?

  1. https://mariadb.com/kb/en/library/mariabackup-options/#-innobackupex
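For reference, the native mariabackup spelling of that apply-log step should be something like the following, per the option docs linked above (memory size and target dir mirror the ones used here):

mariabackup --prepare --use-memory=10G --target-dir=/srv/sqldata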

@Krenair: hope you don't mind me claiming this one?

Change master on db05 to point to db04

CHANGE MASTER to MASTER_USER='repl', MASTER_PASSWORD='...', MASTER_PORT=3306, MASTER_HOST='deployment-db04', MASTER_LOG_FILE='deployment-db04-bin.000136', MASTER_LOG_POS=368;
Query OK, 0 rows affected (0.02 sec)
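The usual follow-up is to start replication and check that both replication threads come up; a sketch of the standard commands:

sudo mysql -e 'START SLAVE'
sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'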

Change 492938 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/mediawiki-config@master] deployment-prep: swap db master / slave

https://gerrit.wikimedia.org/r/492938

Change 492938 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: swap db master / slave

https://gerrit.wikimedia.org/r/492938

it appears that we're out of read-only and replicating properly

beta logstash looks good. we're back folks.

Ping @RazShuty, @Ottomata just to alert you that this is resolved.

Also, thanks to @Krenair, @hashar, @Marostegui, @jcrespo, @ArielGlenn for working on this before I came along.

Yay! Thanks so much @mmodell and everyone else involved in fixing this.

Omg you're all amazing! Thanks a lot!!!! Much much much appreciated!

I have a request: can the procedure for setting up the replica/new master/whatever it was that finally happened here get put on beta's wikitech pages someplace, from start to finish, with any little tricks or tweaks needed?

@ArielGlenn it's mostly covered on https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data but not very clearly... I'll make an attempt to organize it better on a new wikitech page. It definitely needs to be easier to understand.

The package doesn't come from puppet?

I'm not sure; I got to this kinda late. I'm still tweaking the docs by re-reading the history on this task.

The page title looks wrong: 'Nove Resource:' instead of 'Nova Resource:'.

Page moved and links updated to point to the correct title: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/MariaDB_Slave_instance_setup