
decom promethium/WMF3571
Closed, ResolvedPublic

Description

Decommissioning of this system was started (and then not finished) on T164395.

This task will track the decommission-hardware process for server promethium.

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

promethium:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's site.pp entry (replace with role(spare::system) if the system isn't shut down immediately during this process.)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.
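The site.pp swap in the second-to-last step could look like the following sketch. The node name uses promethium's assumed production FQDN (promethium.eqiad.wmnet), which is illustrative, not copied from the repo:

```shell
# Hypothetical target stanza for the "replace with role(spare::system)" step.
# The FQDN is an assumption for illustration; check the real site.pp entry.
stanza='node '\''promethium.eqiad.wmnet'\'' {
    role(spare::system)
}'
echo "$stanza"
```

The host keeps a defined role until it is actually powered down, so puppet runs stay clean in the interim.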

Steps for DC-Ops:

The following steps cannot be interrupted; stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host (someone already did this before @RobH started working steps)
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
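The scriptable part of the non-interrupt steps can be sketched as below. The FQDN is an assumption for illustration, and the destructive commands are left commented out; in practice wmf-decommission-host wraps the puppet and debmonitor steps:

```shell
# Sketch only: assumed FQDN, not confirmed against production DNS.
HOST_FQDN="promethium.eqiad.wmnet"

# On the puppetmaster (commented out here because they are destructive):
# sudo puppet node clean "${HOST_FQDN}"
# sudo puppet node deactivate "${HOST_FQDN}"

# Debmonitor entry removal, as run from neodymium/sarin:
url="https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN}"
echo "DELETE ${url}"
# sudo curl -X DELETE "${url}" \
#   --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key
```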

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

RobH triaged this task as Medium priority. Apr 3 2018, 10:00 PM
RobH created this task.

This host looks weird because it's on the wmcs vlan and uses the wmcs puppetmaster. I'm currently trying to confirm that it's no longer used so we can decom it (and I can rip out some special-purpose code that still supports it.)

@ssastry can you confirm that you're all done with this host?

RobH reassigned this task from RobH to ssastry. Edited: Aug 1 2018, 7:43 PM

@ssastry: I'm assigning this to you for feedback, please confirm this host is no longer used and can be decommissioned. (Then assign back to me for followup, thanks!)

We use it, but on an irregular basis - about once a week (see http://mw-expt-tests.wmflabs.org/commits). We need it for one of our ongoing projects (https://phabricator.wikimedia.org/T118517, which @Arlolra is actively working on). So, if you want to decommission this host, I should provision a true labs VM with similar hardware resources. @Andrew, I've forgotten the process for requesting these largish VMs - can you point me to the wiki page? Or I can catch you in person tomorrow at a coffee shop? :)

We used this actively for the Tidy replacement project (which we completed in July). Now that that is done, we are going to switch testing databases and use it for the RFC (T118517) that I linked above - and for that, we are going to use it semi-regularly, a few weeks at a time every 2-3 months, for different wikitext / parser change projects. This testing is pretty crucial to those deployments.

So promethium/WMF3571 was purchased in January of 2013. It is very old and well out of warranty.

If this host is going to continue to be used, we should look at replacing it with newer hardware. It sounds like this machine is critical, so relying on a 5+ year old server is not ideal.

Do you expect to need this system for another year?

Yes, at least. I am happy to explore getting a true labs VM for this. Will chat with Andrew about that.

Note that everything is weird about this host: it's on a cloud VM network, isn't monitored by icinga, and is managed by the cloud puppetmaster. So most of the steps above are probably moot; you can just switch it off and pull the plug. It has prod DNS entries, though, so those will need cleanup.

(also, the labs_metal hiera var in hieradata/common.yaml could be emptied and the metaldns/metal_resolver stuff jettisoned from puppet)

faidon added a subscriber: Cmjohnson.

@Andrew, promethium's hostname, IP and MAC address are still referenced in a number of places in the puppet tree, including e.g. hardcoded in Python code (proxyleaks.py) that I'd rather not have DC Ops touch :)

Could you or someone else from WMCS remove those, and essentially go through and check the first 5 bullets in the checklist above?

Then please assign to @RobH or @Cmjohnson for finishing up the rest of the decom :)
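The reference hunt described above can be sketched with a recursive grep. The stand-in directory and file contents below are hypothetical; the real check would run against a checkout of the puppet tree:

```shell
# Hypothetical sketch: a tiny stand-in tree with one hardcoded host entry,
# mimicking the proxyleaks.py case mentioned above.
tmpdir="$(mktemp -d)"
cat > "${tmpdir}/proxyleaks.py" <<'EOF'
# stand-in for a real file with a hardcoded host
SPECIAL_HOSTS = ['promethium']
EOF

# List files still referencing the host; each hit needs a cleanup patch.
grep -rl 'promethium' "${tmpdir}"
```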

Andrew updated the task description. (Show Details)

I made some patches getting rid of promethium stuff, then realised part of it would actually likely be covered by Rob in this ticket, so have abandoned https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470102/ and https://gerrit.wikimedia.org/r/#/c/operations/dns/+/470103/

So, trying to disable the switch port:

robh@asw2-b-eqiad# show | compare
[edit interfaces interface-range disabled]
     member ge-4/0/38 { ... }
+    member ge-3/0/19;
[edit interfaces interface-range vlan-cloud-instances1-b-eqiad]
-    member ge-3/0/19;

{master:2}[edit]
robh@asw2-b-eqiad# show interfaces ge-3/0/19 | display inheritance
error: interface-range 'vlan-cloud-instances1-b-eqiad' has no member/member-range statements
error: interface-ranges expansion failed

{master:2}[edit]

Arzhel, can you clear up what we should do here to disable this switch port?

ayounsi subscribed.

This is because ge-3/0/19 is the last interface in the interface-range vlan-cloud-instances1-b-eqiad.

@Andrew is vlan cloud-instances1-b-eqiad/1102 still in use? IIRC it's the one carrying instance to instance traffic on the old nova setup.

If so, we can only delete the interface-range vlan-cloud-instances1-b-eqiad.

If not, we can delete all references to cloud-instances1-b-eqiad.

@ayounsi I assume you're talking about this?

labs-instances1-b-eqiad:
  ipv4: 10.68.16.0/21
  ipv6: 2620:0:861:202::/64

If so, that range is definitely still in use -- it'll be a couple of months before we're able to shut down that region entirely.

Ok, thanks, deleting the interface range from that switch only:

[edit interfaces]
-   interface-range vlan-cloud-instances1-b-eqiad {
-       member ge-3/0/19;
-       mtu 9192;
-       unit 0 {
-           family ethernet-switching {
-               interface-mode access;
-               vlan {
-                   members cloud-instances1-b-eqiad;
-               }
-           }
-       }
-   }
[edit interfaces interface-range disabled]
     member ge-4/0/38 { ... }
+    member ge-3/0/19;
[edit interfaces]
-   ge-3/0/19 {
-       description promethium;
-   }
ayounsi reassigned this task from ayounsi to RobH.
RobH updated the task description. (Show Details)

Change 561923 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove promethium from DNS

https://gerrit.wikimedia.org/r/561923

Change 561923 merged by Papaul:
[operations/dns@master] DNS: Remove promethium from DNS

https://gerrit.wikimedia.org/r/561923

Papaul updated the task description. (Show Details)
Papaul subscribed.

Complete

Change 561924 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove promethium

https://gerrit.wikimedia.org/r/561924

Change 561924 merged by Dzahn:
[operations/puppet@production] DHCP: remove promethium

https://gerrit.wikimedia.org/r/561924

Change 561926 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Remove MAC address entry for promethium

https://gerrit.wikimedia.org/r/561926

Change 561926 abandoned by Papaul:
DHCP: Remove MAC address entry for promethium

https://gerrit.wikimedia.org/r/561926