Reclone db2054 and db2068
Closed, Resolved, Public

Description

These two servers from s7 got stuck while running this query: https://phabricator.wikimedia.org/P7535
Their MySQL processes had to be killed manually to get the query through.

They probably need recloning, or at least need their data checked.
Once that is done, they can be repooled into s7.

Cloning process:

  • db2054
  • db2068

Event Timeline

Marostegui triaged this task as High priority. Sep 12 2018, 2:37 PM
Marostegui created this task.
Restricted Application added a subscriber: Aklapper. Sep 12 2018, 2:37 PM
Marostegui moved this task from Triage to In progress on the DBA board. Sep 12 2018, 3:41 PM

We could probably reclone one of these hosts (for example db2054) from an eqiad slave, and then move it under codfw master. That way we don't have to depool an active codfw s7 slave, as that might be too much, 3 hosts out.
Once db2054 is up to date and under codfw master, we can reclone db2068 from it.

I can work on this if somebody shows me how to clone hosts (normally I'd use xtrabackup -> tar -> netcat -> netcat -> tar but I think that is a no-no with mariadb)

> We could probably reclone one of these hosts (for example db2054) from an eqiad slave, and then move it under codfw master. That way we don't have to depool an active codfw s7 slave, as that might be too much, 3 hosts out.
> Once db2054 is up to date and under codfw master, we can reclone db2068 from it.

If only there was some kind of "provisioning server", you know, a place where we take backups and recover them to production! ;-D Spoiler: there is one, we call it dbstore200X! No need for a slow process of cloning cross-DC. This is exactly the reason why we set up dbstore. We use it right now for logical backups, but we can use it for a cold copy too :-)

> I can work on this if somebody shows me how to clone hosts (normally I'd use xtrabackup -> tar -> netcat -> netcat -> tar but I think that is a no-no with mariadb)

That is mostly the process, but it is already partially automated (transfer.py). We cannot use xtrabackup since we migrated away from MySQL, because of MariaDB specifics (it segfaults). Testing mariabackup (xtrabackup linked with MariaDB libs) is still pending, so for now we can do it the cold way: dbstores are not part of the production pool, so we will stop them and just use transfer.py to provision the faulty hosts.
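
For readers unfamiliar with the "tar -> netcat -> netcat -> tar" idea mentioned above, here is a minimal Python sketch of streaming a directory as a tar archive over a socket. It is not transfer.py and does not reproduce its interface; the function names (send_dir, receive_dir, copy_locally) are hypothetical, and it assumes the database on both ends is stopped (a cold copy), as described in the comment above.

```python
import socket
import tarfile
import threading


def send_dir(src_dir, sock):
    """Pack src_dir as an uncompressed tar stream and write it to sock."""
    with sock.makefile("wb") as out:
        # "w|" is tarfile's streaming write mode for non-seekable outputs.
        with tarfile.open(fileobj=out, mode="w|") as tar:
            tar.add(src_dir, arcname=".")


def receive_dir(sock, dest_dir):
    """Read a tar stream from sock and unpack it under dest_dir."""
    with sock.makefile("rb") as inp:
        # "r|" is the matching streaming read mode.
        with tarfile.open(fileobj=inp, mode="r|") as tar:
            tar.extractall(dest_dir)


def copy_locally(src_dir, dest_dir):
    """Demo only: run both ends in one process over a socket pair.

    In a real transfer the two ends would be separate hosts joined by
    a TCP connection instead of socket.socketpair().
    """
    tx, rx = socket.socketpair()
    sender = threading.Thread(target=send_dir, args=(src_dir, tx))
    sender.start()
    try:
        receive_dir(rx, dest_dir)
    finally:
        sender.join()
        tx.close()
        rx.close()
```

Calling copy_locally("/path/to/source", "/path/to/target") copies a directory tree through the stream; both paths are placeholders.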

Marostegui added a comment. Edited Sep 13 2018, 5:10 AM

> We could probably reclone one of these hosts (for example db2054) from an eqiad slave, and then move it under codfw master. That way we don't have to depool an active codfw s7 slave, as that might be too much, 3 hosts out.
> Once db2054 is up to date and under codfw master, we can reclone db2068 from it.

> If only there was some kind of "provisioning server", you know, a place where we take backups and recover them to production! ;-D Spoiler: there is one, we call it dbstore200X! No need for a slow process of cloning cross-DC. This is exactly the reason why we set up dbstore. We use it right now for logical backups, but we can use it for a cold copy too :-)

I know about dbstore, present and future. Reminder: we wrote the 30-page backup document together.
The only reason I specifically left dbstore out of this equation is that it might have been cloned from one of those two hosts. I suppose it wasn't, as it would probably have crashed too.

Mentioned in SAL (#wikimedia-operations) [2018-09-13T05:11:19Z] <marostegui> Stop MySQL on db2054 and dbstore2001:3317 to clone db2054 - T204127

Marostegui updated the task description. Sep 13 2018, 5:31 AM

Change 460304 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Slowly repool db2054

https://gerrit.wikimedia.org/r/460304

Marostegui updated the task description. Sep 13 2018, 10:05 AM

db2054 has been recloned and is catching up. Once it has synced with its master, I will remove its downtime and repool it into s7.
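
The thread does not say how "synced with its master" was verified. As one possible illustration, the sketch below checks a single row of SHOW SLAVE STATUS output, represented as a dict. The field names are the standard MariaDB ones; the function name is_caught_up and the default threshold are assumptions, and fetching the row from the server is left out.

```python
def is_caught_up(status, max_lag_seconds=0):
    """Return True if a replica looks healthy and within max_lag_seconds.

    `status` is one row of SHOW SLAVE STATUS as a dict, for example
    fetched with a dictionary cursor (the fetching is not shown here).
    """
    # Both replication threads must be running.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # NULL (None) means the lag is unknown, so do not treat it as synced.
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds
```

Seconds_Behind_Master is only one possible signal; a heartbeat-based lag measurement could be substituted without changing the shape of the check.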

Change 460304 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Slowly repool db2054

https://gerrit.wikimedia.org/r/460304

Mentioned in SAL (#wikimedia-operations) [2018-09-13T10:33:24Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Slowly repool db2054 - T204127 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2018-09-13T13:24:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 49s)

Change 460369 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Increase weight for db2054

https://gerrit.wikimedia.org/r/460369

Change 460369 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Increase weight for db2054

https://gerrit.wikimedia.org/r/460369

Mentioned in SAL (#wikimedia-operations) [2018-09-13T14:47:06Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2018-09-13T15:25:48Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 49s)
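
The log entries above show db2054 being repooled in several steps, each raising its weight. The actual weights used in db-codfw.php are not given in this task, so the following is only a sketch of the idea with hypothetical numbers: ramp_weights and its default percentages are assumptions, not the values that were deployed.

```python
def ramp_weights(target_weight, percentages=(10, 25, 50, 100)):
    """Return a strictly increasing list of weights ending at target_weight.

    Each step is the given percentage of the target, with a floor of 1
    so that the host always receives at least some traffic.
    """
    if target_weight < 1:
        raise ValueError("target_weight must be at least 1")
    weights = []
    for pct in percentages:
        weight = max(1, target_weight * pct // 100)
        # Skip steps that would not actually increase the weight.
        if not weights or weight > weights[-1]:
            weights.append(weight)
    return weights
```

Each value in the returned list would correspond to one config change and sync, with time between steps for the host to warm up its caches.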

Change 460390 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db2068 with load load after recloning it

https://gerrit.wikimedia.org/r/460390

jcrespo claimed this task. Sep 13 2018, 4:27 PM
jcrespo updated the task description.

db2068 has been recloned, but needs time to catch up replication and then be slowly repooled with the above patch.

Change 460390 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db2068 with load load after recloning it

https://gerrit.wikimedia.org/r/460390

> db2068 has been recloned, but needs time to catch up replication and then be slowly repooled with the above patch.

I have merged and deployed this patch.

Mentioned in SAL (#wikimedia-operations) [2018-09-14T05:06:36Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Increase weight for db2068 - T204127 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2018-09-14T05:10:18Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Increase weight for db2054 - T204127 (duration: 00m 50s)

jcrespo reassigned this task from jcrespo to Marostegui. Sep 14 2018, 6:50 AM

Mentioned in SAL (#wikimedia-operations) [2018-09-14T07:11:09Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: T204127: Weight Adjust db2068 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2018-09-14T08:15:31Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: T204127: Weight Adjust db2068 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2018-09-14T12:51:43Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: T204127: Weight Adjust db2068 (duration: 00m 50s)

Banyek closed this task as Resolved. Sep 14 2018, 12:53 PM