decom wtp2005 (was: wtp2005 hardware issue)
Closed, ResolvedPublic

Description

wtp2005 went down. Logging in to the mgmt interface, the console shows:

Alert!  System fatal error during previous boot
 Cache and Core Box, Last Level Cache Error

And in the racadm logs (racadm getsel):

-------------------------------------------------------------------------------
Record:      12
Date/Time:   07/14/2020 10:22:01
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
[...SNIP...]
-------------------------------------------------------------------------------
Record:      26
Date/Time:   07/14/2020 10:23:06
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
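
For triage, SEL output in this shape can be split into records and filtered by severity. A minimal sketch, assuming the `racadm getsel` text has already been captured into a string; `parse_sel` and `critical_records` are hypothetical helper names, not part of racadm:

```python
def parse_sel(text: str) -> list[dict[str, str]]:
    """Split dash-delimited SEL output into one dict per record."""
    records: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            # A separator line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so "Date/Time" values keep their colons.
        key, sep, value = stripped.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_records(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only the records whose Severity field is exactly 'Critical'."""
    return [r for r in records if r.get("Severity") == "Critical"]
```

Lines without a colon, such as `[...SNIP...]`, are ignored rather than treated as errors.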

And from racadm lclog view:

--------------------------------------------------------------------------------
SeqNumber       = 212
Message ID      = CPU0000
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2020-07-14 10:32:14
Message         = CPU 2 has an internal error (IERR).
Message Arg   1 = 2
RawEventData    = 0x1A,0x00,0x02,0x8A,0x87,0x0D,0x5F,0xB1,0x00,0x04,0x07,0x09,0x6F,0xA0,0x02,0x37

FQDD            =
--------------------------------------------------------------------------------
SeqNumber       = 211
Message ID      = CPU9000
Category        = System
AgentID         = SEL
Severity        = Information
Timestamp       = 2020-07-14 10:32:13
Message         = An OEM diagnostic event occurred.
RawEventData    = 0x19,0x00,0x02,0x8A,0x87,0x0D,0x5F,0xB1,0x00,0x04,0xC1,0x28,0x7E,0x00,0x20,0xBE

FQDD            = System.Embedded.1
--------------------------------------------------------------------------------
[...SNIP...]
--------------------------------------------------------------------------------
SeqNumber       = 207
Message ID      = CPU0704
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2020-07-14 10:23:06
Message         = CPU 2 machine check error detected.
Message Arg   1 = 2
RawEventData    = 0x15,0x00,0x02,0x8A,0x87,0x0D,0x5F,0xB1,0x00,0x04,0x07,0x0D,0x07,0xA6,0x02,0x37

FQDD            = CPU.Socket.1
--------------------------------------------------------------------------------
[...SNIP...]
--------------------------------------------------------------------------------
SeqNumber       = 197
Message ID      = CPU0704
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2020-07-14 10:22:01
Message         = CPU 2 machine check error detected.
Message Arg   1 = 2
RawEventData    = 0x0C,0x00,0x02,0x49,0x87,0x0D,0x5F,0xB1,0x00,0x04,0x07,0x0D,0x07,0xA6,0x02,0x36

FQDD            = CPU.Socket.1
--------------------------------------------------------------------------------
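
The `RawEventData` fields can be decoded to cross-check the lclog entries against the SEL records. A minimal sketch, assuming the 16 bytes follow the standard IPMI system event record layout (little-endian record ID, record type, 32-bit timestamp, generator ID, EvM revision, sensor type, sensor number, event type, three event data bytes) and that the timestamp is seconds since the Unix epoch; `decode_sel_raw` is a hypothetical helper name:

```python
import struct
from datetime import datetime, timezone


def decode_sel_raw(raw: str) -> dict:
    """Decode a comma-separated RawEventData string into named fields."""
    data = bytes(int(tok, 16) for tok in raw.split(","))
    if len(data) != 16:
        raise ValueError(f"expected 16 bytes, got {len(data)}")
    # Assumed layout: standard 16-byte IPMI SEL system event record.
    (record_id, record_type, timestamp, generator_id, evm_rev,
     sensor_type, sensor_number, event_dir_type,
     data1, data2, data3) = struct.unpack("<HBIH7B", data)
    return {
        "record_id": record_id,
        "record_type": record_type,
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "generator_id": generator_id,
        "evm_rev": evm_rev,
        "sensor_type": sensor_type,  # 0x07 is "Processor" in the IPMI sensor type table
        "sensor_number": sensor_number,
        "event_dir_type": event_dir_type,
        "event_data": (data1, data2, data3),
    }


rec = decode_sel_raw(
    "0x1A,0x00,0x02,0x8A,0x87,0x0D,0x5F,0xB1,0x00,0x04,0x07,0x09,0x6F,0xA0,0x02,0x37"
)
print(rec["record_id"], rec["timestamp"].isoformat())
# prints: 26 2020-07-14T10:23:06+00:00
```

Under these assumptions, SeqNumber 212 decodes to SEL record 26 at 10:23:06, which matches the `racadm getsel` entry rather than the lclog `Timestamp` of 10:32:14. The time zone label is only as reliable as the clock setting on the management controller.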

Event Timeline

Volans created this task.Jul 14 2020, 10:44 AM
Restricted Application added a subscriber: Aklapper.Jul 14 2020, 10:44 AM

Depoled from confctl and marked as failed on Netbox.

Mentioned in SAL (#wikimedia-operations) [2020-07-14T10:52:00Z] <volans> powerdown wtp2005, hardware issue - T257903

wiki_willy assigned this task to Papaul.Jul 14 2020, 9:59 PM
wiki_willy added subscribers: akosiaris, wiki_willy.

@akosiaris - looks like this server is past the 5yr server life cycle, and was due to be refreshed via T231255. Let us know if we can ignore this alert. Thanks, Willy

> @akosiaris - looks like this server is past the 5yr server life cycle, and was due to be refreshed via T231255. Let us know if we can ignore this alert. Thanks, Willy

Yeah I think we can for now. The replacing hosts have been racked and have the role(insetup) applied so we can take it from here. Thanks!

akosiaris triaged this task as Low priority.Jul 15 2020, 7:24 AM

Handling this now as part of T243112

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board.Aug 17 2020, 11:45 PM
Papaul added a subscriber: Dzahn.Sep 23 2020, 12:10 AM

@Dzahn
I noticed that this server is not present in Icinga and has status "failed" in Netbox. Can we turn this task into a decommission task if the server is no longer needed in production, now that T247441 is done?

Thanks

Dzahn claimed this task.Sep 23 2020, 8:05 PM
Dzahn added a comment.Sep 23 2020, 8:12 PM

> Yeah I think we can for now. The replacing hosts have been racked and have the role(insetup) applied so we can take it from here. Thanks!

Meanwhile, the replacement hosts are in production.

Removing wtp2005 from conftool data and DHCP.

Dzahn raised the priority of this task from Low to Medium.Sep 23 2020, 8:13 PM

Change 629468 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom wtp2005

https://gerrit.wikimedia.org/r/629468

Change 629468 merged by Dzahn:
[operations/puppet@production] decom wtp2005.codfw.wmnet

https://gerrit.wikimedia.org/r/629468

Dzahn added a comment.Sep 23 2020, 8:33 PM

> I noticed that this server is not present in Icinga and has status "failed" in Netbox.

When I tried to run the decom cookbook I got "ATTENTION: the query does not match any host in PuppetDB or failed". So it was already gone from PuppetDB, which explains why it was not in Icinga.

But it was still in conftool data with status "inactive" and in DHCP.

Removed both of those (see above).

> Can we turn this task into a decommission task if the server is no longer needed in production, now that T247441 is done?

Yep. It makes sense. Doing.

Dzahn added a comment.Sep 23 2020, 8:34 PM

Arr.. it still shows up in MediaWiki config:

Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
wmf-config/InitialiseSettings.php:		'10.192.16.47' => true, # wtp2005.codfw.wmnet
Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
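
The same kind of leftover-reference check can be reproduced locally against a checkout. A minimal sketch, assuming a plain substring search over every file under a directory; `find_references` is a hypothetical helper and is not the cookbook's own implementation:

```python
from pathlib import Path


def find_references(root: str, needles: list[str]) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each line containing any needle."""
    matches: list[tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Undecodable bytes are replaced so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(needle in line for needle in needles):
                matches.append((str(path.relative_to(root)), lineno, line.strip()))
    return matches
```

Searching for both the hostname and the IP address, for example `find_references(checkout_dir, ["wtp2005", "10.192.16.47"])`, catches entries like the one above where only the IP is a config key and the hostname appears in a comment. This sketch also walks `.git` directories, so a real check would likely want to skip them.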

Change 629475 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/mediawiki-config@master] remove wtp2005 from wgLinterSubmitterWhitelist

https://gerrit.wikimedia.org/r/629475

Dzahn added a comment.Sep 23 2020, 8:46 PM
Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
Type "done" to proceed
> done
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['wtp2005.codfw.wmnet']
**Failed downtime host on Icinga (likely already removed)**
Management Password: 
Found physical host
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['wtp2005.mgmt.codfw.wmnet']
Skipped downtime management interface on Icinga (likely already removed)
**Unable to connect to the host, wipe of bootloaders will not be performed**: Cumin execution failed (exit_code=2)
Running IPMI command: ipmitool -I lanplus -H wtp2005.mgmt.codfw.wmnet -U root -E chassis power off
Powered off
Deleting interface eno1 and related IPs
Deleting interface eno2 and related IPs
Netbox status updated for host wtp2005 Failed -> decommissioning
Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
Sleeping for 20s to avoid race conditions...
Removed host wtp2005.codfw.wmnet from Debmonitor
Removed from DebMonitor
Removed from Puppet master and PuppetDB
Generating the DNS records from Netbox data. It will take a couple of minutes.

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: wtp2005.codfw.wmnet

  • wtp2005.codfw.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)
    • Not all affected DC(s) have been migrated to automatic DNS, a manual patch to the operations/dns repository is required

ERROR: some step on some host failed, check the bolded items above
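
The IPMI power-off step in the log above can be sketched as a small wrapper. This is an illustration under assumptions, not the cookbook's actual code: the helper names are hypothetical, and the password is passed through the `IPMI_PASSWORD` environment variable, which is what ipmitool's `-E` flag reads.

```python
import os
import subprocess


def build_power_off_command(mgmt_host: str, user: str = "root") -> list[str]:
    """Build the ipmitool argv for a chassis power off over lanplus."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", mgmt_host,
        "-U", user,
        "-E",  # take the password from the IPMI_PASSWORD environment variable
        "chassis", "power", "off",
    ]


def power_off(mgmt_host: str, password: str, user: str = "root") -> subprocess.CompletedProcess:
    """Run the power-off command; raises CalledProcessError on a non-zero exit."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    return subprocess.run(
        build_power_off_command(mgmt_host, user),
        env=env,
        check=True,
        capture_output=True,
        text=True,
    )
```

Passing the password through the environment keeps it out of the process argument list, which is presumably why the logged command uses `-E` rather than `-P`.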

Change 629476 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] decom wtp2005.codfw.wmnet

https://gerrit.wikimedia.org/r/629476

Change 629476 merged by Dzahn:
[operations/dns@master] decom wtp2005.codfw.wmnet

https://gerrit.wikimedia.org/r/629476

Dzahn renamed this task from wtp2005 hardware issue to decom wtp2005 (was: wtp2005 hardware issue).Sep 23 2020, 9:08 PM
Dzahn reassigned this task from Dzahn to Papaul.
Dzahn removed a project: Patch-For-Review.
Dzahn added a comment.Sep 23 2020, 9:14 PM

@Papaul This should now be ready for decom. After some initial issues with the DNS removal, the DNS entries are now gone. The only thing left is a MW config change, but that can go out anytime later; it's just a whitelist of hosts.

[edit interfaces interface-range disabled]
     member ge-3/0/2 { ... }
+    member ge-4/0/21;
[edit interfaces]
-   ge-4/0/21 {
-       description wtp2005;
-       enable;
-   }
Dzahn added a comment.Sep 23 2020, 9:34 PM

> Arr.. it still shows up in MediaWiki config:

The removal has been added to today's evening deploy window (formerly SWAT):

https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_September_23

Papaul added a comment.EditedSep 23 2020, 9:42 PM

Removed mgmt DNS. What's left is just to remove the disks from the server and unrack it.

Change 629475 merged by jenkins-bot:
[operations/mediawiki-config@master] remove wtp2005 from wgLinterSubmitterWhitelist

https://gerrit.wikimedia.org/r/629475

Mentioned in SAL (#wikimedia-operations) [2020-09-23T23:27:19Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: 22382a97ec252488a346fbf0c3d40bc974d0cdbe: remove wtp2005 from wgLinterSubmitterWhitelist (T257903) (duration: 01m 04s)

Papaul closed this task as Resolved.Sep 24 2020, 4:46 PM

Disks removed from the server, and the server has been unracked.