
Prepare to decommission 2 codfw x1 hosts db2033 and db2034
Closed, ResolvedPublic

Description

The following two hosts are in x1 and are partially broken. They should be decommissioned.

db2033 has broken BBUs: T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master))
db2034 has had a long history of HW issues: T150233: db2034 crashes meta ticket, T149553: db2034: investigate its crash and reimage

db2033 is a slave ready for DCOPs to decommission T220070: Decommission db2033
db2034 is ready for DCOPs to decommission T223216: Decommission db2034

Event Timeline

Marostegui moved this task from Triage to Backlog on the DBA board.

I am not adding the DCOps tasks yet to avoid unnecessary noise for them as this is not ready to go yet.

Marostegui renamed this task from Decommission codfw x1 host to Decommission 2 codfw x1 hosts db2033 and db2034. Mar 28 2019, 10:26 AM

I have finished running compare.py for wikishared between db2033 and db2069, and the data is identical on both hosts.

Also ran a compare.py for the following tables across all the databases:

echo_event
echo_email_batch
echo_notification 
echo_target_page

I have found some differences in the echo_notification table for some wikis, which I am going to check further and fix. The check is still running and will probably take a few more hours.

bnwikisource FIXED
diqwiki FIXED
elwiktionary FIXED
eswikisource FIXED
fawikiquote FIXED
frwikinews FIXED
idwiktionary FIXED
ilowiki FIXED
itwikiversity FIXED
jvwiki FIXED
kowikisource
kuwiki FIXED
kvwiki FIXED
pagwiki FIXED
pswiki FIXED
ruwikiquote FIXED
scnwiki FIXED
scwiki FIXED
slwikisource FIXED
suwiki FIXED
tawikisource FIXED
thwiktionary FIXED
trwiktionary FIXED
ukwikiquote FIXED
wuuwiki FIXED
yiwiki FIXED
zhwikisource FIXED
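For readers unfamiliar with this kind of check, here is a minimal sketch of how a chunked table comparison between two hosts can work. It is not the actual compare.py: the function names, the in-memory row format (tuples with an integer primary key first), and the `step` parameter are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple  # assumption: first element is an integer primary key


def range_checksums(rows: Iterable[Row], step: int) -> Dict[int, str]:
    """Group rows by primary-key range and checksum each group."""
    buckets: Dict[int, List[Row]] = {}
    for row in sorted(rows, key=lambda r: r[0]):
        buckets.setdefault(row[0] // step, []).append(row)
    return {
        bucket: hashlib.sha256(repr(group).encode("utf-8")).hexdigest()
        for bucket, group in buckets.items()
    }


def differing_ranges(rows_a: Iterable[Row], rows_b: Iterable[Row],
                     step: int = 1000) -> List[Tuple[int, int]]:
    """Return the (first_id, last_id) ranges whose contents differ."""
    sums_a = range_checksums(rows_a, step)
    sums_b = range_checksums(rows_b, step)
    return sorted(
        (bucket * step, (bucket + 1) * step - 1)
        for bucket in sums_a.keys() | sums_b.keys()
        if sums_a.get(bucket) != sums_b.get(bucket)
    )


if __name__ == "__main__":
    # Hypothetical rows standing in for one table read from two hosts.
    host_a = [(1, "x"), (2, "y"), (1500, "z")]
    host_b = [(1, "x"), (2, "CHANGED"), (1500, "z")]
    print(differing_ranges(host_a, host_b))  # prints [(0, 999)]
```

Bucketing by primary-key range rather than by row position keeps a single missing or changed row from shifting every later chunk, so only the affected range is reported and needs a closer look.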

Mentioned in SAL (#wikimedia-operations) [2019-04-03T05:57:29Z] <marostegui> Fix data drifts on bnwikisource on x1 - T219493

Change 500895 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1120

https://gerrit.wikimedia.org/r/500895

Change 500895 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1120

https://gerrit.wikimedia.org/r/500895

Mentioned in SAL (#wikimedia-operations) [2019-04-03T07:07:33Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1120 T219493 (duration: 01m 13s)

Mentioned in SAL (#wikimedia-operations) [2019-04-03T07:09:22Z] <marostegui> Stop replication in sync on db1120 and db2034 (x1 codfw master) - T219493

Mentioned in SAL (#wikimedia-operations) [2019-04-03T07:26:08Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool db1120 T219493 (duration: 00m 57s)

Change 501134 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2033: Disable notifications

https://gerrit.wikimedia.org/r/501134

Change 501134 merged by Marostegui:
[operations/puppet@production] db2033: Disable notifications

https://gerrit.wikimedia.org/r/501134

Change 501135 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2033

https://gerrit.wikimedia.org/r/501135

Change 501135 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2033

https://gerrit.wikimedia.org/r/501135

Mentioned in SAL (#wikimedia-operations) [2019-04-04T05:18:26Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Remove db2033 for decommission T219493 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-04-04T05:19:32Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db2033 for decommission T219493 (duration: 00m 59s)

Change 501136 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: db2033 set to spare

https://gerrit.wikimedia.org/r/501136

Mentioned in SAL (#wikimedia-operations) [2019-04-04T05:32:05Z] <marostegui> Remove db2033 from tendril and zarcillo - T219493

Change 501136 merged by Marostegui:
[operations/puppet@production] mariadb: db2033 set to spare

https://gerrit.wikimedia.org/r/501136

Mentioned in SAL (#wikimedia-operations) [2019-04-04T05:39:38Z] <marostegui> Stop MySQL on db2033 for decommission - T219493
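The db2033 steps logged above can be summarised as an ordered checklist. This is only a sketch reconstructed from the timeline entries in this task, not an official procedure; the step labels and the `next_step` helper are hypothetical.

```python
from typing import Iterable, Optional

# Order taken from the db2033 entries in this task's timeline (assumption:
# the same order is a reasonable default for similar hosts).
DECOMMISSION_STEPS = [
    "disable notifications (puppet)",
    "remove host from MediaWiki db config (db-eqiad.php, db-codfw.php)",
    "remove host from tendril and zarcillo",
    "set host to spare (puppet)",
    "stop MySQL",
    "hand over to DCOps",
]


def next_step(done: Iterable[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when all are done."""
    completed = set(done)
    for step in DECOMMISSION_STEPS:
        if step not in completed:
            return step
    return None


if __name__ == "__main__":
    # With the first two steps done, the tendril/zarcillo removal is next.
    print(next_step(DECOMMISSION_STEPS[:2]))
```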

I will do an x1 codfw dc failover at some point, so we can depool db2034 too (db2033 is now in DCOps' hands for decommissioning). That would free up 4U for the new hosts. It doesn't make sense to keep these two old, broken hosts online any longer.

Change 506350 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Reorganize s8

https://gerrit.wikimedia.org/r/506350

Change 506350 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Reorganize s8

https://gerrit.wikimedia.org/r/506350

db2045 has been compared with db2080 and it is the same. It can now be moved to x1 to replace db2034.

Change 506937 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db2045 from s8 to x1

https://gerrit.wikimedia.org/r/506937

Change 506937 merged by Marostegui:
[operations/puppet@production] mariadb: Move db2045 from s8 to x1

https://gerrit.wikimedia.org/r/506937

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2045.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904290543_marostegui_29166.log.

Completed auto-reimage of hosts:

['db2045.codfw.wmnet']

Of which those FAILED:

['db2045.codfw.wmnet']

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2045.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904290715_marostegui_49061.log.

Completed auto-reimage of hosts:

['db2045.codfw.wmnet']

and were ALL successful.

Change 506944 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Move db2045 to x1

https://gerrit.wikimedia.org/r/506944

Mentioned in SAL (#wikimedia-operations) [2019-04-29T07:44:44Z] <marostegui> Stop replication on db2034 (x1 master) for maintenance - T219493

Mentioned in SAL (#wikimedia-operations) [2019-04-29T07:47:21Z] <marostegui> Stop mysql on db2034 (lag will happen on x1 codfw) - T219493

Change 506944 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Move db2045 to x1

https://gerrit.wikimedia.org/r/506944

Mentioned in SAL (#wikimedia-operations) [2019-04-29T07:58:40Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Move db2045 from s8 to x1 T219493 (duration: 00m 55s)

Change 507243 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2045: Enable notifications

https://gerrit.wikimedia.org/r/507243

Change 507243 merged by Marostegui:
[operations/puppet@production] db2045: Enable notifications

https://gerrit.wikimedia.org/r/507243

Change 508168 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db2045 to codfw x1 master

https://gerrit.wikimedia.org/r/508168

Change 508321 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Promote db2045 to master

https://gerrit.wikimedia.org/r/508321

Mentioned in SAL (#wikimedia-operations) [2019-05-07T05:12:47Z] <marostegui> Change topology on x1 codfw to promote db2045 to master T219493

Change 508168 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db2045 to codfw x1 master

https://gerrit.wikimedia.org/r/508168

Change 508321 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Promote db2045 to master

https://gerrit.wikimedia.org/r/508321

Marostegui renamed this task from Decommission 2 codfw x1 hosts db2033 and db2034 to Prepare to decommission 2 codfw x1 hosts db2033 and db2034. May 7 2019, 5:23 AM
Marostegui removed a subscriber: RobH.

Mentioned in SAL (#wikimedia-operations) [2019-05-07T05:23:33Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Promote db2045 to codfw x1 master T219493 (duration: 00m 55s)

db2034 is no longer a master. I will give it 24h before starting the decommissioning steps.

Change 510074 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Prepare to decommission db2034

https://gerrit.wikimedia.org/r/510074

Change 510083 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2034

https://gerrit.wikimedia.org/r/510083

Change 510083 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db2034

https://gerrit.wikimedia.org/r/510083

Mentioned in SAL (#wikimedia-operations) [2019-05-14T09:51:38Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Remove db2034 from config T219493 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2019-05-14T09:51:59Z] <marostegui> Remove db2034 from tendril and zarcillo - T219493

Change 510074 merged by Marostegui:
[operations/puppet@production] mariadb: Prepare to decommission db2034

https://gerrit.wikimedia.org/r/510074

Mentioned in SAL (#wikimedia-operations) [2019-05-14T09:55:09Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db2034 from config T219493 (duration: 00m 49s)

Marostegui updated the task description.

Both hosts have been handed over to DCOps for decommissioning:
T220070: Decommission db2033
T223216: Decommission db2034