
Replace db1077 with db1112
Closed, Resolved (Public)

Description

db1077 is definitely having BBU issues (T225391 - T225391#5261662), and it serves production traffic on s3 (it is also the sanitarium master).

db1112 is now serving on the test cluster as a replica.
This set of hosts has minimal traffic (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1112&var-port=9104&from=now-6M&to=now), so running db1077 there as a test-cluster slave without a BBU should cause no performance issues.

The idea is to exchange db1077 with db1112, so that db1112, which has a healthy BBU, serves in production and acts as sanitarium master.

Event Timeline

test-cluster users have been notified that on Thursday the replica will go offline to be replaced by db1077.

Marostegui triaged this task as Medium priority. Jun 18 2019, 5:41 AM

Change 517589 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow reimage db1112

https://gerrit.wikimedia.org/r/517589

Mentioned in SAL (#wikimedia-operations) [2019-06-18T05:54:05Z] <marostegui> Stop slave and mysql on db1112 to copy its content to dbstore1001:/srv/tmp/db1112 - T225981

Mentioned in SAL (#wikimedia-operations) [2019-06-18T06:19:20Z] <marostegui> Stop slave and mysql on db1112 to copy its content to dbprov1001:/srv/backups/tmp/db1112 - T225981

Change 517589 merged by Marostegui:
[operations/puppet@production] install_server: Allow reimage db1112

https://gerrit.wikimedia.org/r/517589

Change 517799 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1112: Move to s3

https://gerrit.wikimedia.org/r/517799

Change 517799 merged by Marostegui:
[operations/puppet@production] db1112: Move to s3

https://gerrit.wikimedia.org/r/517799

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1112.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906190638_marostegui_237082.log.

Change 517801 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/517801

Change 517801 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/517801

Mentioned in SAL (#wikimedia-operations) [2019-06-19T06:53:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1077 T225981 (duration: 01m 06s)

Completed auto-reimage of hosts:

['db1112.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-06-19T06:57:35Z] <marostegui> Stop MySQL on db1077 to transfer its data to db1112 - T225981

Mentioned in SAL (#wikimedia-operations) [2019-06-19T07:12:05Z] <marostegui> s3 will be lagging on labsdb hosts due to maintenance on db1077 - T225981

Mentioned in SAL (#wikimedia-operations) [2019-06-19T09:14:51Z] <marostegui> Start MySQL on db1077 - s3 labsdb lag should start catching up T225981

db1112 is now cloned from db1077. I am going to let it replicate for 24h before switching sanitarium to replicate from it and pooling it in s3.
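The task does not record how the 24h soak period was checked. As a minimal sketch, assuming the output of SHOW SLAVE STATUS has already been fetched into a dict, a health check on the freshly cloned replica could look like the following. The function name, the lag threshold, and the sample values are hypothetical; only the field names come from MariaDB.

```python
def replica_is_healthy(status, max_lag_seconds=1):
    """Decide whether a replica looks safe to pool.

    `status` is assumed to be one row of SHOW SLAVE STATUS as a dict.
    `max_lag_seconds` is a hypothetical threshold, not one from the task.
    """
    # Both replication threads must be running.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    # Seconds_Behind_Master is NULL (None) when lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    return lag is not None and int(lag) <= max_lag_seconds


# Hypothetical sample rows, for illustration only.
caught_up = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
             "Seconds_Behind_Master": 0}
lagging = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Seconds_Behind_Master": 5400}
stopped = {"Slave_IO_Running": "No", "Slave_SQL_Running": "No",
           "Seconds_Behind_Master": None}
```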

Change 517813 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1077

https://gerrit.wikimedia.org/r/517813

Change 517813 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1077

https://gerrit.wikimedia.org/r/517813

Mentioned in SAL (#wikimedia-operations) [2019-06-19T09:24:55Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly repool db1077 T225981 (duration: 01m 00s)

Change 517818 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1077

https://gerrit.wikimedia.org/r/517818

Change 517818 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Fully repool db1077

https://gerrit.wikimedia.org/r/517818

Change 517963 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1112: Enable notifications

https://gerrit.wikimedia.org/r/517963

Change 517963 merged by Marostegui:
[operations/puppet@production] db1112: Enable notifications

https://gerrit.wikimedia.org/r/517963

Change 517964 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/517964

Change 517964 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/517964

Mentioned in SAL (#wikimedia-operations) [2019-06-20T04:52:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1077 T225981 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-06-20T04:53:03Z] <marostegui> Stop replication in sync on db1112 and db1077 to move db1124 under db1112 - T225981
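Stopping two sibling replicas "in sync" means both have executed events up to the same coordinates on their shared master, so a replica below one of them can be re-pointed to the other. The exact commands used are not recorded in this task. The sketch below only shows the verification step, assuming each host's SHOW SLAVE STATUS row is available as a dict; the function name and sample coordinates are hypothetical.

```python
def stopped_at_same_position(status_a, status_b):
    """Return True if both replicas are stopped at identical master coordinates."""
    for status in (status_a, status_b):
        # Replication must be fully stopped on both hosts before comparing.
        if status["Slave_IO_Running"] != "No" or status["Slave_SQL_Running"] != "No":
            return False
    # Compare the position in the master's binlog that each host has executed.
    pos_a = (status_a["Relay_Master_Log_File"], int(status_a["Exec_Master_Log_Pos"]))
    pos_b = (status_b["Relay_Master_Log_File"], int(status_b["Exec_Master_Log_Pos"]))
    return pos_a == pos_b


# Hypothetical coordinates, for illustration only.
db1077_status = {"Slave_IO_Running": "No", "Slave_SQL_Running": "No",
                 "Relay_Master_Log_File": "binlog.000123",
                 "Exec_Master_Log_Pos": 456789}
db1112_status = dict(db1077_status)
db1112_ahead = dict(db1077_status, Exec_Master_Log_Pos=456999)
```

Only when the check passes would it be safe to point the child replica (db1124 here) at the new parent using the new parent's own binlog coordinates.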

Mentioned in SAL (#wikimedia-operations) [2019-06-20T05:04:34Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1077 T225981 (duration: 00m 55s)

Change 517966 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Pool db1112 into s3

https://gerrit.wikimedia.org/r/517966

Change 517966 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Pool db1112 into s3

https://gerrit.wikimedia.org/r/517966

Mentioned in SAL (#wikimedia-operations) [2019-06-20T05:22:16Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Slowly pool db1112 into s3 T225981 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2019-06-20T05:23:19Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Slowly pool db1112 into s3 T225981 (duration: 00m 55s)

Change 517967 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1077: Allow reimage

https://gerrit.wikimedia.org/r/517967

Change 517967 merged by Marostegui:
[operations/puppet@production] db1077: Allow reimage

https://gerrit.wikimedia.org/r/517967

Mentioned in SAL (#wikimedia-operations) [2019-06-20T05:40:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to db1112 in s3 T225981 (duration: 00m 56s)

Change 517969 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db1077 from s3 to test-s4

https://gerrit.wikimedia.org/r/517969

Mentioned in SAL (#wikimedia-operations) [2019-06-20T05:54:55Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to db1112 in s3 T225981 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-06-20T06:09:38Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to db1112 in s3 T225981 (duration: 00m 57s)

db1112 is now the sanitarium master for s3.

Mentioned in SAL (#wikimedia-operations) [2019-06-20T06:16:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to db1112 in s3 T225981 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2019-06-20T06:31:02Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: More traffic to db1112 in s3 T225981 (duration: 00m 56s)
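The SAL entries above show db1112 being pooled into s3 in small steps rather than all at once. As a rough illustration of the pacing, this sketch computes the gap between consecutive db-eqiad.php syncs using the timestamps logged above; the variable and function names are made up for the example, and the weights applied at each step are not recorded in this task.

```python
from datetime import datetime

# Timestamps of the db-eqiad.php syncs that pooled db1112, copied from the SAL entries.
SYNC_TIMES = [
    "2019-06-20T05:23:19Z",  # Slowly pool db1112 into s3
    "2019-06-20T05:40:25Z",  # More traffic to db1112
    "2019-06-20T05:54:55Z",
    "2019-06-20T06:09:38Z",
    "2019-06-20T06:16:25Z",
    "2019-06-20T06:31:02Z",
]


def gaps_in_seconds(stamps):
    """Return the number of seconds between each consecutive pair of timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ") for s in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(SYNC_TIMES)
```

The gaps come out between roughly 7 and 17 minutes, which leaves time to watch the host under each increase in traffic before sending it more.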

Change 517974 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db1077

https://gerrit.wikimedia.org/r/517974

Change 517974 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db1077

https://gerrit.wikimedia.org/r/517974

Mentioned in SAL (#wikimedia-operations) [2019-06-20T06:43:47Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool and remove from config db1077 T225981 (duration: 00m 54s)

Change 517969 merged by Marostegui:
[operations/puppet@production] mariadb: Move db1077 from s3 to test-s4

https://gerrit.wikimedia.org/r/517969

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906200652_marostegui_2936.log.

Completed auto-reimage of hosts:

['db1077.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-06-20T07:15:53Z] <marostegui> Transfer dbprov1001:/srv/backups/tmp/db1112/sqldata to db1077 T225981

After the reboot, the battery failed completely (T226154):

Battery/Capacitor Count: 0

Mentioned in SAL (#wikimedia-operations) [2019-06-20T09:25:41Z] <marostegui> Remove dbprov1001:/srv/backups/tmp/db1112 - T225981

db1077 is now replicating from db1111 in the test-s4 cluster.
The temporary data has also been removed from dbprov1001:

root@dbprov1001:/srv/backups/tmp# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T  5.0T  6.0T  46% /srv

Users of the test-s4 cluster have been notified via email and documentation has been updated: https://wikitech.wikimedia.org/w/index.php?title=MariaDB&type=revision&diff=1829875&oldid=1829454
Resolving this.