
Decom db2001-db2009
Closed, Resolved · Public

Description

db2001-db2009 are unused; decide what to do with them (probably decommission them).

Current state:

No longer in use; used to contain private information
depooled from mediawiki
not present in puppet site.pp
puppet keys revoked
salt keys revoked
dns/install entries pending removal

I noticed this ticket when checking for db servers without base::firewall enabled: Summarising:

  • db2008/db2009 were removed from mediawiki in https://gerrit.wikimedia.org/r/#/c/288945/, the change to remove them from site.pp is pending in https://gerrit.wikimedia.org/r/#/c/286172/
  • db2007 is currently used for tests by Daniel
  • db2006 is not present in site.pp, puppet or salt, but the box is currently still powered on. As such, it can probably simply be unracked and decommissioned.
  • db2001-db2005 are not present in site.pp, but are managed via puppet/salt. They are also listed in wmf-config.

decom checklist (a sketch of the puppet/salt cleanup commands follows the per-host lists below):

db2001:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2002:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2003:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2004:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2005:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2006:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2007:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2008:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked

db2009:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked
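
A minimal sketch of the "puppet node clean, puppet node deactivate, salt key removed" step above, using db2001.codfw.wmnet as an example host. These are standard puppet/salt CLI invocations run on the relevant masters, not a transcript of what was actually executed:

# on the host itself, before starting the non-interruptible steps
puppet agent --disable "decommissioning, T125827"

# on the puppetmaster: revoke the host certificate and deactivate the node,
# which removes it from stored configs (the "nuke from stored configs / icinga"
# step mentioned in the IRC log further down this task)
puppet node clean db2001.codfw.wmnet
puppet node deactivate db2001.codfw.wmnet

# on the salt master: delete the minion key
salt-key -d db2001.codfw.wmnet -y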

Event Timeline


@RobH I need a second confirmation that these servers are "assigned to me" administratively, and not to fundraising or someone else (sometimes the same names get reused). I made a first check and they seem to be idle and not in use according to my monitoring. If that is true, I will start decommissioning them and reusing their parts (please also confirm that they are ours and not leases/donations).

@RobH @mark I think there is a mistake in the 5-year planning. I made a comment on the spreadsheet. Luckily, most of these do not need replacement.

RobH reassigned this task from RobH to jcrespo. Mar 9 2016, 10:09 PM (edited)

@RobH @mark I think there is a mistake in the 5-year planning. I made a comment on the spreadsheet. Luckily, most of these do not need replacement.

Do you mean they don't need replacement because we already have enough capacity and don't need these old systems?

Comment on the sheet from @jcrespo:

Are you sure this data is right? It makes no sense that we bought db2001 two years after db2012, and Papaul told me they are out of warranty.

The sheet had an incorrect date: it listed the hardware warranty expiration in the purchase-date column. It's been fixed. However, racktables has the correct info on these, linking to the old purchase ticket https://rt.wikimedia.org/Ticket/Display.html?id=1600 (which was an order for 11 systems).

db2001-db2011 are old PowerEdge R510s, originally purchased for Tampa. At the time of the migration, they were still under warranty, and thus were shipped to codfw.

As such, these are well past their warranty expiration.

I'm not sure who else would ever have db systems assigned to them; as far as I know these are for your use in the db cluster. If you don't have a use for them, I'd think we would decommission them given their age, but we would need to check with @mark.

Assigning back to Jaime for his input (I imagine he'll advise we decommission them). If so, should we expand this to all the db R510s or just db2001-db2008?

Do you mean they don't need replacement because we already have enough capacity and don't need these old systems?

Yes.

I would like to decommission the first 8 for parts (mainly disks); their use has been shifted to newer machines.

I have in use:
db2009: x1 remote server (there are more replicas in the same datacenter)
db2010: m1 remote server
db2011: m2 remote server

for misc/non-core services. As these 3 are backup servers, and parts are available for them, I am not in a hurry to replace them (they do not really need performance, and we can use dbstore for them if needed).

It doesn't have to be all of them, just enough to avoid purchasing new disks. Having out-of-warranty spare servers can be helpful for non-critical roles. These 3 services will have to be replaced eventually (no hurry).

I'm editing this task because I'm taking db2008.codfw.wmnet back into use for T130098, so it must not be decommissioned.

Volans renamed this task from Investigate/decom db2001-db2008 to Investigate/decom db2001-db2007. Mar 16 2016, 10:27 AM
Volans updated the task description.

Maybe edit site.pp so that the actually unused ones are removed from puppet but the one still in use stays in it. Then there is less ambiguity, and we can move forward with the decom by revoking puppet certs and salt keys and shutting them down to save energy.

Dzahn added a subscriber: Papaul.

Change 278338 had a related patch set uploaded (by Dzahn):
remove db200[1-7] from DHCP

https://gerrit.wikimedia.org/r/278338

"the actually unused ones are removed from puppet"

It was like that until Moritz re-added a bunch of them to get security updates.

Please wait until I see the final destination of all of these.

Moritz said he is applying the updates because the servers are up. This might be a catch-22.

So what's the actual blocker? Is there really one, since Moritz says he only re-added them so they get updates?

I need to know the destination of the disks so that we have at least complete working servers before the failover.

These and some es2 hosts have to be checked to try to solve the codfw disk issues (I cannot remember the ticket numbers).

Got it, thanks for explaining.

Change 278338 abandoned by Dzahn:
remove db200[1-7] from DHCP

https://gerrit.wikimedia.org/r/278338

Andrew triaged this task as Medium priority. Apr 14 2016, 7:58 PM
Andrew removed a project: Patch-For-Review.
jcrespo renamed this task from Investigate/decom db2001-db2007 to Investigate/decom db2001-db2009. Apr 29 2016, 3:41 PM
jcrespo updated the task description.

db2008 and db2009 are in theory still in use, but they are ready to be decommed, as they have been replaced by the larger db2033.

Change 286172 had a related patch set uploaded (by Jcrespo):
Retire db2008 and db2009 as x1 nodes

https://gerrit.wikimedia.org/r/286172

Change 288945 had a related patch set uploaded (by Jcrespo):
Remove all mentions to db1027, db2008 and db2009 from mediawiki

https://gerrit.wikimedia.org/r/288945

Change 288945 merged by Jcrespo:
Remove all mentions to db1027, db2008 and db2009 from mediawiki

https://gerrit.wikimedia.org/r/288945

After we talked on IRC, I am using db2007 to test upgrading RT (T119112), which involves a schema change. It's in Icinga as a host, but mariadb/mysql was already removed and it is not part of the cluster in any way. The mariadb-server currently on it was installed by me, and I will remove it again as well.

Change 289725 had a related patch set uploaded (by Dzahn):
temp. setup to use db2007 for RT upgrade test

https://gerrit.wikimedia.org/r/289725

Change 289725 merged by Dzahn:
temp. setup to use db2007 for RT upgrade test

https://gerrit.wikimedia.org/r/289725

I noticed this ticket when checking for db servers without base::firewall enabled: Summarising:

  • db2008/db2009 were removed from mediawiki in https://gerrit.wikimedia.org/r/#/c/288945/, the change to remove them from site.pp is pending in https://gerrit.wikimedia.org/r/#/c/286172/
  • db2007 is currently used for tests by Daniel
  • db2006 is not present in site.pp, puppet or salt, but the box is currently still powered on. As such, it can probably simply be unracked and decommissioned.
  • db2001-db2005 are not present in site.pp, but are managed via puppet/salt. They are also listed in wmf-config.

I would remove them all when Daniel finishes his work.

Change 292397 had a related patch set uploaded (by Dzahn):
remove db2007 from site.pp, done with testing

https://gerrit.wikimedia.org/r/292397

Change 292397 merged by Dzahn:
remove db2007 from site.pp, done with testing

https://gerrit.wikimedia.org/r/292397

11:23 < mutante> !log db2007 shutdown, schedule eternal downtime
11:24 < mutante> !log db2007, revoke puppet cert, delete salt key, nuke from stored configs / icinga

Change 286172 merged by Jcrespo:
Retire db2008 and db2009 as x1 nodes

https://gerrit.wikimedia.org/r/286172

jcrespo renamed this task from Investigate/decom db2001-db2009 to Decom db2001-db2009. Aug 11 2016, 3:17 PM
jcrespo edited projects, added ops-codfw; removed Patch-For-Review.
jcrespo updated the task description.

@Papaul, @RobH These servers are ready to go; they have been wiped from icinga/puppet/salt. DNS and tftpboot entries are still active.

I leave the decision on their final destination to you.
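
A quick way to double-check that state (a sketch only; db2001 is an example host and these are standard CLI checks, not a record of what was run):

# DNS still resolves at this point; the entries are removed further down in this task
host db2001.codfw.wmnet

# on the puppetmaster: no signed cert should remain
puppet cert list --all | grep db2001 || echo "no cert, as expected"

# on the salt master: no minion key should remain
salt-key -L | grep db2001 || echo "no key, as expected"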

I'm re-assigning this to @mark for his approval to decommission db2001-db2009. All 9 of these systems had their warranties expire on 2014-11-10. These are old Dell PowerEdge R510 systems, shipped from their initial use in our Tampa deployment.

All 9 of these systems are located in rack a6-codfw. This will free up 18U of space.

Please advise whether we can decommission these entirely or whether we need to reclaim them to spares, and assign this back to me; I'll handle triage and next steps from there.

Thanks!

Switch ports disabled; diff below, since the port info will be needed once these systems are unracked. (A sketch of how such ports are typically disabled follows the diff.)

[edit interfaces ge-6/0/0]
-  enable;
+  disable;
[edit interfaces ge-6/0/1]
-  enable;
+  disable;
[edit interfaces ge-6/0/2]
-  enable;
+  disable;
[edit interfaces ge-6/0/3]
-  enable;
+  disable;
[edit interfaces ge-6/0/4]
-  enable;
+  disable;
[edit interfaces ge-6/0/5]
-  enable;
+  disable;
[edit interfaces ge-6/0/6]
-  enable;
+  disable;
[edit interfaces ge-6/0/7]
-  enable;
+  disable;
[edit interfaces ge-6/0/8]
-  enable;
+  disable;
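
For reference, a sketch of the configuration-mode session that would produce a diff like the one above (the exact commands used are not recorded on this task):

robh@asw-a-codfw# set interfaces ge-6/0/0 disable
robh@asw-a-codfw# set interfaces ge-6/0/1 disable
robh@asw-a-codfw# set interfaces ge-6/0/2 disable
robh@asw-a-codfw# set interfaces ge-6/0/3 disable
robh@asw-a-codfw# set interfaces ge-6/0/4 disable
robh@asw-a-codfw# set interfaces ge-6/0/5 disable
robh@asw-a-codfw# set interfaces ge-6/0/6 disable
robh@asw-a-codfw# set interfaces ge-6/0/7 disable
robh@asw-a-codfw# set interfaces ge-6/0/8 disable
robh@asw-a-codfw# show | compare
robh@asw-a-codfw# commit comment T125827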

Change 341582 had a related patch set uploaded (by robh):
[operations/puppet] decom of db2001-db2009

https://gerrit.wikimedia.org/r/341582

Change 341582 merged by RobH:
[operations/puppet] decom of db2001-db2009

https://gerrit.wikimedia.org/r/341582

Change 341585 had a related patch set uploaded (by robh):
[operations/dns] decom of db2001-db2009

https://gerrit.wikimedia.org/r/341585

Change 341585 merged by RobH:
[operations/dns] decom of db2001-db2009

https://gerrit.wikimedia.org/r/341585

RobH updated the task description.
RobH removed projects: Patch-For-Review, DBA.

Ok, this is now ready for on-site disk wipes of all the systems. Assigning to @Papaul for followup.

Change 342841 had a related patch set uploaded (by Papaul):
[operations/dns] DNS/Decom: Remove DNS entries for db200[1-9]

https://gerrit.wikimedia.org/r/342841

Switch port information (all servers are in row A, rack A6):
db2001 ge-6/0/0
db2002 ge-6/0/1
db2003 ge-6/0/2
db2004 ge-6/0/3
db2005 ge-6/0/4
db2006 ge-6/0/5
db2007 ge-6/0/6
db2008 ge-6/0/7
db2009 ge-6/0/8

robh@asw-a-codfw# show | compare
[edit interfaces ge-6/0/0]
-  description db2001;
[edit interfaces ge-6/0/1]
-  description db2002;
[edit interfaces ge-6/0/2]
-  description db2003;
[edit interfaces ge-6/0/3]
-  description db2004;
[edit interfaces ge-6/0/4]
-  description db2005;
[edit interfaces ge-6/0/5]
-  description db2006;
[edit interfaces ge-6/0/6]
-  description db2007;
[edit interfaces ge-6/0/7]
-  description db2008;
[edit interfaces ge-6/0/8]
-  description db2009;

{master:2}[edit]
robh@asw-a-codfw# commit comment T125827

switch port description removal done.
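
For completeness, a sketch of the configuration-mode commands that would produce the compare output above (again, not a transcript of the actual session):

robh@asw-a-codfw# delete interfaces ge-6/0/0 description
robh@asw-a-codfw# delete interfaces ge-6/0/1 description
robh@asw-a-codfw# delete interfaces ge-6/0/2 description
robh@asw-a-codfw# delete interfaces ge-6/0/3 description
robh@asw-a-codfw# delete interfaces ge-6/0/4 description
robh@asw-a-codfw# delete interfaces ge-6/0/5 description
robh@asw-a-codfw# delete interfaces ge-6/0/6 description
robh@asw-a-codfw# delete interfaces ge-6/0/7 description
robh@asw-a-codfw# delete interfaces ge-6/0/8 description
robh@asw-a-codfw# show | compare
robh@asw-a-codfw# commit comment T125827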

Change 342841 merged by RobH:
[operations/dns] DNS/Decom: Remove DNS entries for db200[1-9]

https://gerrit.wikimedia.org/r/342841

RobH updated the task description.