decom promethium/WMF3571
Closed, ResolvedPublic

Description

Decommissioning of this system was started, but never finished, on T164395.

This task tracks the decommissioning of server promethium.

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, dependent on server configuration and age.)

promethium:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as it will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host (someone already did this before @RobH started working steps)
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
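
The debmonitor removal step in the checklist above can be sketched as a small dry-run helper. This is only an illustration: the URL and certificate paths are copied from the checklist line, while the function name `debmonitor_delete_cmd`, the print-instead-of-run behaviour, and the example FQDN are assumptions. The checklist notes that wmf-decommission-host normally handles this step.

```shell
#!/usr/bin/env bash
# Sketch: build the debmonitor removal command from the checklist for one host.
# The function name and the dry-run behaviour are assumptions for illustration.
debmonitor_delete_cmd() {
    local host_fqdn="$1"
    if [ -z "$host_fqdn" ]; then
        echo "usage: debmonitor_delete_cmd HOST_FQDN" >&2
        return 1
    fi
    # Print the command instead of running it, so it can be reviewed first.
    printf 'sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/%s --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key\n' "$host_fqdn"
}

# Example with a hypothetical FQDN (the real one is not stated in this task).
debmonitor_delete_cmd "promethium.example.org"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the output would still have to be run on a host that holds the debmonitor certificate and key.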

Details

Related Gerrit Patches:
[operations/puppet@production] DHCP: Remove MAC address entry for promethium
[operations/puppet@production] DHCP: remove promethium
[operations/dns@master] DNS: Remove promethium from DNS

Event Timeline

RobH triaged this task as Medium priority. Apr 3 2018, 10:00 PM
RobH created this task.
Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Apr 10 2018, 2:00 PM

This host looks weird because it's on the wmcs vlan and uses the wmcs puppetmaster. I'm currently trying to confirm that it's no longer used so we can decom it (and I can rip out some special-purpose code that still supports it.)

@ssastry can you confirm that you're all done with this host?

RobH reassigned this task from RobH to ssastry. Edited Aug 1 2018, 7:43 PM

@ssastry: I'm assigning this to you for feedback, please confirm this host is no longer used and can be decommissioned. (Then assign back to me for followup, thanks!)

ssastry added a subscriber: Arlolra. Edited Aug 2 2018, 3:19 PM

@ssastry: I'm assigning this to you for feedback, please confirm this host is no longer used and can be decommissioned. (Then assign back to me for followup, thanks!)

We use it, but on an irregular basis - about once a week (see http://mw-expt-tests.wmflabs.org/commits). We need this for one of our ongoing projects (https://phabricator.wikimedia.org/T118517 that @Arlolra is actively working on). So, if you want to decommission this host, I should provision a true labs VM with similar hardware resources. @Andrew, I've forgotten the process for requesting these largish VMs, can you point me to the wiki page? Or, I can catch you in person tomorrow at a coffee shop? :)

We used this actively for the Tidy replacement project (that we completed in July). Now that that is done, we are going to switch testing databases and use it for that RFC 118517 that I linked there -- and for that, we are going to use it semi-regularly, a few weeks at a time every 2-3 months .. for different wikitext / parser change projects. And, this testing is pretty crucial to those deployments.

RobH added a comment. Aug 2 2018, 3:25 PM

So promethium/WMF3571 was purchased in January of 2013. It is very old, and very out of warranty.

If this host is going to continue to be used for work, we should look at replacing it with newer hardware. It sounds like this bit of hardware is critical, so having a 5+ year old server is non-ideal.

Is it projected to need this system for another year?

Yes, at least. I am happy to explore getting a true labs VM for this. Will chat with Andrew about that.

Andrew added a comment. Edited Oct 15 2018, 4:52 PM

Note that everything is weird about this host. It's on a cloud VM network, and isn't monitored by icinga, and is managed by the cloud puppetmaster. So probably most of the steps above are moot, you can just switch it off and pull the plug. It has prod DNS entries, though, so those will need cleanup.

Andrew removed ssastry as the assignee of this task. Oct 15 2018, 4:53 PM

(also, the labs_metal hiera var in hieradata/common.yaml could be emptied and the metaldns/metal_resolver stuff jettisoned from puppet)

faidon assigned this task to Andrew. Edited Oct 24 2018, 10:23 PM
faidon added a subscriber: Cmjohnson.

@Andrew, promethium's hostname, IP and MAC address are still referenced in a number of places in the puppet tree, including e.g. hardcoded in Python code (proxyleaks.py) that I'd rather not have DC Ops touch :)

Could you or someone else from WMCS remove those, and essentially go through and check the first 5 bullets in the checklist above?

Then please assign to @RobH or @Cmjohnson for finishing up the rest of the decom :)
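
A minimal sketch of how leftover references like these could be located in a local checkout, assuming GNU or BSD grep. The function name `find_host_refs` and the checkout path in the usage comment are hypothetical; nothing here is taken from WMF tooling.

```shell
#!/usr/bin/env bash
# Sketch: list files under a checkout that still mention a host.
# find_host_refs NEEDLE [TREE] -- the function name is hypothetical.
find_host_refs() {
    local needle="$1" tree="${2:-.}"
    # -r recurse, -I skip binary files, -l print file names only,
    # -i ignore case, -F match the needle as a fixed string so the dots
    # in an IP address are not treated as regex wildcards.
    grep -rIliF --exclude-dir=.git -- "$needle" "$tree" | sort
}

# Usage (run once per identifier: hostname, IP, MAC):
#   find_host_refs promethium /path/to/puppet-checkout
```

Searching separately for the hostname, the IP and the MAC address matters because, as noted above, some of them are hardcoded in code rather than in host definitions.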

Andrew reassigned this task from Andrew to RobH. Oct 25 2018, 4:04 PM
Andrew updated the task description.

I made some patches getting rid of promethium stuff, then realised part of it would actually likely be covered by Rob in this ticket, so have abandoned https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470102/ and https://gerrit.wikimedia.org/r/#/c/operations/dns/+/470103/

RobH updated the task description. Feb 14 2019, 6:06 PM
RobH reassigned this task from RobH to ayounsi. Feb 14 2019, 6:14 PM

So, trying to disable the switch port:

robh@asw2-b-eqiad# show | compare
[edit interfaces interface-range disabled]
     member ge-4/0/38 { ... }
+    member ge-3/0/19;
[edit interfaces interface-range vlan-cloud-instances1-b-eqiad]
-    member ge-3/0/19;

{master:2}[edit]
robh@asw2-b-eqiad# show interfaces ge-3/0/19 | display inheritance
error: interface-range 'vlan-cloud-instances1-b-eqiad' has no member/member-range statements
error: interface-ranges expansion failed

{master:2}[edit]

Arzhel, can you clear up what we should do here to disable this switch port?

RobH updated the task description. Feb 14 2019, 6:15 PM
ayounsi reassigned this task from ayounsi to Andrew. Feb 19 2019, 12:40 AM
ayounsi added a subscriber: ayounsi.

This is because ge-3/0/19 is the last interface in the interface-range vlan-cloud-instances1-b-eqiad.

@Andrew is vlan cloud-instances1-b-eqiad/1102 still in use? IIRC it's the one carrying instance to instance traffic on the old nova setup.

If so, we can delete only the interface-range vlan-cloud-instances1-b-eqiad on this switch and keep the vlan itself.

If not, we can delete all references to cloud-instances1-b-eqiad.
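
The failure explained in this comment can be modelled with a small check: taking the last member out of an interface-range leaves a range with no member statements, which the switch rejected above, so the whole range has to be deleted instead. This is a toy sketch of that decision only. The function name `plan_range_removal` and its output strings are hypothetical, and it does not talk to a switch.

```shell
#!/usr/bin/env bash
# Sketch: decide how to take an interface out of an interface-range.
# plan_range_removal IFACE MEMBER... -- the function name is hypothetical.
# Prints "delete-member" if other members remain, "delete-range" if IFACE
# is the only member, and "not-a-member" (exit status 1) otherwise.
plan_range_removal() {
    local iface="$1"
    shift
    local remaining=0 found=0 m
    for m in "$@"; do
        if [ "$m" = "$iface" ]; then
            found=1
        else
            remaining=$((remaining + 1))
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "not-a-member"
        return 1
    fi
    if [ "$remaining" -eq 0 ]; then
        # Removing the last member would leave an empty range.
        echo "delete-range"
    else
        echo "delete-member"
    fi
}

# The situation on this switch: ge-3/0/19 is the only member of the range.
plan_range_removal ge-3/0/19 ge-3/0/19
```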

@ayounsi I assume you're talking about this?

labs-instances1-b-eqiad:
  ipv4: 10.68.16.0/21
  ipv6: 2620:0:861:202::/64

If so, that range is definitely still in use -- it'll be a couple of months before we're able to shut down that region entirely.

Andrew reassigned this task from Andrew to ayounsi. Feb 26 2019, 6:30 PM
ayounsi closed this task as Resolved. Feb 26 2019, 7:49 PM

Ok, thanks, deleting the interface range from that switch only:

[edit interfaces]
-   interface-range vlan-cloud-instances1-b-eqiad {
-       member ge-3/0/19;
-       mtu 9192;
-       unit 0 {
-           family ethernet-switching {
-               interface-mode access;
-               vlan {
-                   members cloud-instances1-b-eqiad;
-               }
-           }
-       }
-   }
[edit interfaces interface-range disabled]
     member ge-4/0/38 { ... }
+    member ge-3/0/19;
[edit interfaces]
-   ge-3/0/19 {
-       description promethium;
-   }
ayounsi reopened this task as Open. Feb 26 2019, 7:50 PM
ayounsi reassigned this task from ayounsi to RobH.
RobH reassigned this task from RobH to Cmjohnson. Apr 18 2019, 9:15 PM
RobH updated the task description.
Jclark-ctr updated the task description. Dec 12 2019, 10:40 PM

Change 561923 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove promethium from DNS

https://gerrit.wikimedia.org/r/561923

Change 561923 merged by Papaul:
[operations/dns@master] DNS: Remove promethium from DNS

https://gerrit.wikimedia.org/r/561923

Papaul closed this task as Resolved. Jan 3 2020, 11:29 PM
Papaul updated the task description.
Papaul added a subscriber: Papaul.

Complete

Dzahn added a subscriber: Dzahn. Jan 3 2020, 11:30 PM

It is still in DHCP config.

Change 561924 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove promethium

https://gerrit.wikimedia.org/r/561924

Change 561924 merged by Dzahn:
[operations/puppet@production] DHCP: remove promethium

https://gerrit.wikimedia.org/r/561924

Change 561926 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Remove MAC address entry for promethium

https://gerrit.wikimedia.org/r/561926

Change 561926 abandoned by Papaul:
DHCP: Remove MAC address entry for promethium

https://gerrit.wikimedia.org/r/561926