
Decommission task for old cp hosts (cp1075-1090)
Closed, Resolved · Public

Description

Now that T349244 is completed, we can move on to decommissioning the old eqiad cp hosts: cp1075-1090

cp1075

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task> (see the example after this list). This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site
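
For example (a hypothetical invocation using this task's ID; adjust the host and task to match):

  cookbook sre.hosts.decommission cp1075.eqiad.wmnet -t T352253

The cookbook also accepts the cumin range syntax used later in this task to cover all sixteen hosts in one run:

  cookbook sre.hosts.decommission 'cp[1075-1090].eqiad.wmnet' -t T352253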

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1076

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1077

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1078

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1079

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1080

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1081

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1082

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1083

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1084

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1085

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1086

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1087

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1088

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1089

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

cp1090

Steps for service owner:

  • all system services confirmed offline from production use
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove from site.pp; replacing with role(spare::system) is recommended to ensure services stay offline, but not 100% required as long as the decom cookbook below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This wipes the bootloader, powers down the host, updates Netbox to decommissioning status, runs puppet node clean and puppet node deactivate, removes the host from DebMonitor, and runs Homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to a DC-Ops team member and add the site project (ops-eqiad) matching the server's site

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.

Event Timeline

Change 977702 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] decom cp1075-1090

https://gerrit.wikimedia.org/r/977702

Change 977702 merged by Fabfur:

[operations/puppet@production] decom cp1075-1090

https://gerrit.wikimedia.org/r/977702

Mentioned in SAL (#wikimedia-operations) [2023-11-29T10:36:32Z] <fabfur> decommissioning cp1075-1090 (T352253)

cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: cp[1075-1090].eqiad.wmnet

  • cp1075.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1076.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1077.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1078.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1079.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1080.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1081.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1082.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1083.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1084.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1085.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1086.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1087.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1088.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1089.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cp1090.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
VRiley-WMF claimed this task.
VRiley-WMF updated the task description.

Hi dc-ops team, quick question: have these hosts already been hardware-decommissioned?

For further context: we have a request from @dr0ptp4kt to run a Blazegraph experiment, and we are trying to free up a cp node for him. So we were wondering: if this hardware has not yet been hardware-decommissioned, we could just bring up a host here.

Hi @ssingh - the hardware should still be around, and we should be able to reallocate one of them for testing purposes. Can you open a new Phabricator task for us with all the necessary details (hostname, racking info, network setup, raid/partitioning, OS, and main POC)? Also, do you know how long Adam would need it for?

Thanks,
Willy

After setup, I would be interested in using it for 6 weeks (hopefully things would only take 4 weeks, but there's some PTO, and real-life stuff always comes up). Would that be okay?

We're presently running Debian 11 with backported Java 8 from our APT repository on the wdqsNNNN hosts, so for simplicity that should be the target OS.

What we're attempting to do is validate the performance effect of running with a data-center-quality, high-speed, physically attached NVMe disk for the case of needing to repopulate Blazegraph. We have some promising indicators from my Alienware i7-8700 (I had installed a consumer M.2 NVMe) and from physically attached NVMes in AWS (though there's still a layer of virtualization abstraction), but we're hoping to see real-world bare-metal numbers as we plan next FY's server refreshes for a number of WDQS nodes.

Looking at {T193911}, I suspect we may want to see whether it's possible to bundle a couple of NVMes and a couple of SATA SSDs onto one host, so that we can verify both non-RAIDed and RAIDed performance. The dump ingestion process we want to test can occupy up to 3 TB in source file(s) (these would sit on the SATA SSDs) and up to 1.3 TB for the destination file (this would sit on the NVMe SSDs).
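
To quantify the raw device difference before full ingestion runs, a quick throughput check could help. A minimal sketch with fio (assuming fio is installed and the SATA SSD and NVMe are mounted at the hypothetical paths /srv/sata and /srv/nvme):

  # sequential write, 1 MiB blocks, direct I/O to bypass the page cache
  fio --name=sata-seq-write --filename=/srv/sata/fio.test --rw=write \
      --bs=1M --size=20G --direct=1 --ioengine=libaio --iodepth=16
  fio --name=nvme-seq-write --filename=/srv/nvme/fio.test --rw=write \
      --bs=1M --size=20G --direct=1 --ioengine=libaio --iodepth=16

Running the same parameters against each mount gives a like-for-like device baseline, independent of the Blazegraph ingestion code itself.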

I don't know if it's possible to move NVMes to another wdqsNNNN host, but that may be a good idea so that we can have an apples-to-apples comparison with runs on similarly spec'd CPUs. Technically, it would be possible to run ingestion processes with SATA SSD1 -> SATA SSD2 and then SATA SSD1 -> NVMe(s) to compare the differences, as we already know that clock speed is also a factor in performance here. I'll be discussing this a bit more with @bking and @RKemper tomorrow, and hopefully we can all close the loop soon.

Thanks!

@bking, @RKemper, and I met today. @bking has an action item on this ticket (@bking LMK in case I need to chime in on anything!). Thanks!

@wiki_willy I'm going to take over this work from @dr0ptp4kt. I'll make a phab task with the data you requested shortly.

Sounds good @bking, thanks!
