
Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission)
Closed, Resolved · Public

Description

CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '275496282' for key 'PRIMARY'' on query. Default database: 'frwiki'. [Query snipped]
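For reference, Errno 1062 is MySQL's duplicate-key error (ER_DUP_ENTRY): the replica tried to apply a row whose primary key already existed locally. A minimal sketch of pulling the relevant fields out of an alert like the one above follows; the regular expression and the `parse_alert` helper are illustrative assumptions, not part of any existing monitoring tooling.

```python
import re

# Alert text as reported in the task description (the query itself was snipped).
ALERT = (
    "CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, "
    "Errmsg: Error 'Duplicate entry '275496282' for key 'PRIMARY'' on query. "
    "Default database: 'frwiki'. [Query snipped]"
)

# Hypothetical pattern written against this one alert; other alert
# formats would need their own pattern.
ALERT_RE = re.compile(
    r"Slave_SQL_Running: (?P<running>\w+), "
    r"Errno: (?P<errno>\d+), "
    r"Errmsg: Error 'Duplicate entry '(?P<entry>[^']*)' for key '(?P<key>[^']*)''"
    r".*?Default database: '(?P<database>[^']*)'"
)


def parse_alert(text):
    """Return the alert fields as a dict, or None if the text does not match."""
    match = ALERT_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["errno"] = int(fields["errno"])
    return fields


if __name__ == "__main__":
    print(parse_alert(ALERT))
```

Here the fields show that the SQL thread stopped on a duplicate primary key in `frwiki`, which means the replica's data had already diverged from its master.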

Event Timeline

jcrespo created this task. Apr 25 2017, 11:16 AM
Restricted Application added a subscriber: Aklapper. Apr 25 2017, 11:16 AM
jcrespo moved this task from Triage to Backlog on the DBA board. Apr 28 2017, 10:01 AM

I would put one of the new servers in its place and decommission this one as soon as we can.

Marostegui renamed this task from db1022 broke while changing topology on s6- evaluate if to fix or directly decomission to Decommission db1022 (Was: db1022 broke while changing topology on s6- evaluate if to fix or directly decommission). May 4 2017, 11:18 AM

Change 351813 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Get ready to decomission db1022

https://gerrit.wikimedia.org/r/351813

Change 351814 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Decommission db1022

https://gerrit.wikimedia.org/r/351814

Change 351815 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s6.hosts: Decommission db1022

https://gerrit.wikimedia.org/r/351815

Change 351813 merged by Marostegui:
[operations/puppet@production] mariadb: Get ready to decomission db1022

https://gerrit.wikimedia.org/r/351813

Change 351814 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Decommission db1022

https://gerrit.wikimedia.org/r/351814

Mentioned in SAL (#wikimedia-operations) [2017-05-04T11:36:56Z] <marostegui@naos> Synchronized wmf-config/db-codfw.php: Remove db1022 from config files as it will be decommissioned - T163778 (duration: 01m 25s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T11:38:36Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Remove db1022 from config files as it will be decommissioned - T163778 (duration: 01m 06s)

Change 351815 merged by jenkins-bot:
[operations/software@master] s6.hosts: Decommission db1022

https://gerrit.wikimedia.org/r/351815

Marostegui moved this task from Backlog to Done on the DBA board.
Marostegui edited projects, added ops-eqiad; removed Patch-For-Review.
Marostegui added a subscriber: Cmjohnson.

The host is ready to be decommissioned.
What I have done:

Removed it from Prometheus, added it as a spare in site.pp and removed it from the DHCP list: https://gerrit.wikimedia.org/r/#/c/351813
Removed it from php config files: https://gerrit.wikimedia.org/r/#/c/351814/
Removed it from the software repo: https://gerrit.wikimedia.org/r/#/c/351815/
MySQL is stopped.
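A quick way to double-check steps like the ones above is to scan local checkouts of the affected repositories for leftover mentions of the host. The following is only a sketch: `find_references` is a hypothetical helper, and the checkout paths in the usage note are assumptions.

```python
import re
from pathlib import Path


def find_references(root, hostname):
    """Return (path, line_number, line) for each line under root naming hostname.

    The match is on word boundaries, so "db1022" matches
    "db1022.eqiad.wmnet" but not "db10220".
    """
    pattern = re.compile(r"\b" + re.escape(hostname) + r"\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

For example, `find_references("mediawiki-config/wmf-config", "db1022")` (path assumed) should return an empty list once the config change is merged and the checkout is up to date.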

@Cmjohnson the host is all yours now.
Thanks!

Restricted Application added a project: Operations. May 4 2017, 11:46 AM
Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. May 8 2017, 4:47 PM
Cmjohnson moved this task from Not urgent to Up next on the ops-eqiad board. May 10 2017, 4:27 PM
Cmjohnson moved this task from Up next to Not urgent on the ops-eqiad board. May 30 2017, 4:33 PM
Cmjohnson moved this task from Not urgent to Decommission on the ops-eqiad board. Jul 20 2017, 3:24 PM

This is probably not fully decommissioned yet (dns, puppet, etc.), but I am going to try to remove it from puppet and icinga so it doesn't generate noise in the alerts.

Actually, I cannot do all the steps (network changes) without coordinating with DC ops. I have no blocker on this getting fully decommed soon, but I would like to remove it from icinga. Let me know if removing db1022 and db1023 from puppet and icinga only sounds good enough for you; dns and network ports can be done later.

Cmjohnson closed this task as Resolved. Oct 16 2017, 5:07 PM

db1022 is no more: removed from site.pp, all dns removed, killed in puppet, racktables updated, disks wiped, and removed from the rack.

JFTR: The host was still showing up in puppetdb (e.g. via https://servermon.wikimedia.org/hosts/). I ran "puppet node deactivate db1022.eqiad.wmnet" on puppetmaster1001; that should properly remove it.