
mw2231 is down and unable to reboot
Closed, Resolved (Public)

Description

The server went down on Saturday, August 24th, and didn't come back after a power cycle and a hard reset. The console is completely blank.

It should be inspected on-site.

Event Timeline

Joe triaged this task as Medium priority. Aug 26 2019, 9:52 AM

Mentioned in SAL (#wikimedia-operations) [2019-08-26T09:54:19Z] <_joe_> codfw/appserver/*/mw2231.codfw.wmnet: pooled changed yes => inactive T231192

System Event Log
Severity Date/Time Description

	 	Mon Aug 26 2019 15:16:40	CPU 1 has an internal error (IERR).	
	 	Sat Aug 24 2019 09:27:51	CPU 1 has an internal error (IERR).	
	 	Sat Aug 24 2019 09:27:04	CPU 2 has an internal error (IERR).
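
For anyone who wants to tally entries like these from a saved log, here is a minimal sketch. It assumes the export is plain text with one entry per line in the same "weekday month day year time, then description" layout pasted above, and that month and weekday names are in English. The names `parse_sel` and `ierr_counts` are made up for this example and are not part of any Dell or Wikimedia tool.

```python
import re
from collections import Counter
from datetime import datetime

# The three entries pasted in this task, used as sample input.
SAMPLE = """\
\t \tMon Aug 26 2019 15:16:40\tCPU 1 has an internal error (IERR).\t
\t \tSat Aug 24 2019 09:27:51\tCPU 1 has an internal error (IERR).\t
\t \tSat Aug 24 2019 09:27:04\tCPU 2 has an internal error (IERR).
"""

# Assumed line layout: optional whitespace, timestamp, whitespace, description.
SEL_LINE = re.compile(r"^\s*(\w{3} \w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2})\s+(.*?)\s*$")
CPU_IERR = re.compile(r"CPU (\d+) has an internal error \(IERR\)")


def parse_sel(text):
    """Return (timestamp, description) tuples, oldest first."""
    entries = []
    for line in text.splitlines():
        match = SEL_LINE.match(line)
        if not match:
            continue  # skip headers, blank lines, and anything unrecognised
        when = datetime.strptime(match.group(1), "%a %b %d %Y %H:%M:%S")
        entries.append((when, match.group(2)))
    return sorted(entries)


def ierr_counts(entries):
    """Count IERR entries per CPU number."""
    counts = Counter()
    for _, description in entries:
        match = CPU_IERR.search(description)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    entries = parse_sel(SAMPLE)
    for when, description in entries:
        print(when.isoformat(), description)
    print(dict(ierr_counts(entries)))
```

Sorting oldest first makes it easier to see that CPU 2 logged the first error on Aug 24, followed by CPU 1 less than a minute later.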

The server won't boot. It goes through the Dell logo screen, then we get the message "stuck on initializing intel quickpath interconnect", and after a couple of minutes it reboots again.

Got this after swapping CPU 1 with CPU 2:


	 	Mon Aug 26 2019 16:58:15	CPU 2 has an internal error (IERR).	
	 	Mon Aug 26 2019 16:57:52	CPU 1 has an internal error (IERR).

@Joe Looks like we are having CPU issues on this system and the system is out of warranty as well.

The switch port to that server has flapped 2000 times in 12h. I'm disabling it.
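
To put that flap count in perspective, a quick back-of-the-envelope calculation follows. It assumes the flaps were spread evenly over the 12 hours, which the task does not state. The function name `flap_rate` is only for illustration.

```python
def flap_rate(flaps, window_hours):
    """Return (flaps per hour, average seconds between flaps)."""
    per_hour = flaps / window_hours
    seconds_between = window_hours * 3600 / flaps
    return per_hour, seconds_between


if __name__ == "__main__":
    per_hour, seconds_between = flap_rate(2000, 12)
    print(f"{per_hour:.1f} flaps/hour, one every {seconds_between:.1f} s on average")
```

That works out to roughly 167 flaps per hour, or one about every 22 seconds on average, which would be consistent with a host stuck in a reboot loop.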

That server is fairly freshly out of warranty (May 2019). Do we have a spare CPU of that type on site, or shall we buy a replacement part?

I second what @MoritzMuehlenhoff suggested. The system is not scheduled for replacement for another 2 years, so if we can salvage it somehow, that'd be great.

@wiki_willy this system has been out of warranty since May 2019, which is about 4 months, and we do not have a spare. I ran some tests on the system again today.

  • Swapped CPU 1 with CPU 2: error showing on CPU 1
  • Swapped DIMM A1 with DIMM B1: error showing on CPU 1

This really doesn't look like a CPU problem, but more like a main-board problem.
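
The reasoning behind the swap tests can be sketched as follows: if the error stays on the same socket after a part is swapped, the part itself is probably fine and the socket or board becomes the suspect; if the error follows the part, the part is the suspect. This is only an illustration of that rule of thumb, not a diagnostic tool. It assumes the error was reported on CPU 1 before each swap as well, and `SwapTest`, `interpret`, and `diagnose` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SwapTest:
    component: str          # what was swapped, e.g. "CPU" or "DIMM"
    error_slot_before: str  # where the error was reported before the swap
    error_slot_after: str   # where the error was reported after the swap


def interpret(test):
    """Return "slot" if the error stayed put, "component" if it moved."""
    if test.error_slot_before == test.error_slot_after:
        return "slot"
    return "component"


def diagnose(tests):
    """Combine several swap tests into a rough suspicion."""
    if not tests:
        return "no data"
    verdicts = {t.component: interpret(t) for t in tests}
    if all(v == "slot" for v in verdicts.values()):
        return "suspect main board"
    moved = sorted(c for c, v in verdicts.items() if v == "component")
    return "suspect " + ", ".join(moved)


if __name__ == "__main__":
    # The two tests listed above, with the "before" slot assumed to be CPU1.
    tests = [
        SwapTest("CPU", "CPU1", "CPU1"),
        SwapTest("DIMM", "CPU1", "CPU1"),
    ]
    print(diagnose(tests))
```

With the two tests from this task the sketch points at the main board, matching the conclusion above.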

Please advise on what to do, since we have no spare CPU or main board on site.

@MoritzMuehlenhoff @Joe I talked to @wiki_willy on IRC about what needs to be done for this system.

The system has:
2x500GB 2.5" SATA disks
E5-2650 v3 @ 2.3 GHz
2x32GB of memory

We are about to put graphite2002 (T200210), which is an R430 too, back in the spare pool. That system has:

4x1.6TB 2.5" SSD disks
E5-2640 v3 2.6GHz
2x32GB of memory

What we can do is replace mw2231 with this system: remove the SSDs and use the SATA disks from mw2231. Let me know.

@Papaul that looks fine. I don't think we need to swap out the SSDs, so only do that if we have a better use for those disks (they're pretty useless on an appserver).

@Joe I am afraid I didn't get the comment on the SSDs.

Do you want to use the SSDs or keep the SATA disks?

mw2231 has 2.5" SATA disks
graphite2002 has 1.6TB SSDs

What Giuseppe meant: For the app server use case it doesn't matter whether we use SSD or SATA, they do very little I/O. If you have other use for the SSDs (e.g. because we can use these as spare parts for other SSD-based servers or so), then we can use the old SATA drives from mw2231 in graphite2002, but if not, then it's probably easier and less work for you if you simply repurpose graphite2002 with the SSDs as the new app server and don't bother swapping disks between servers.

Change 535212 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Change asset tag DNS for mw2231

https://gerrit.wikimedia.org/r/535212

Change 535214 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address for mw2231

https://gerrit.wikimedia.org/r/535214

Change 535214 merged by Muehlenhoff:
[operations/puppet@production] DHCP: Change MAC address for mw2231

https://gerrit.wikimedia.org/r/535214

System replacement complete

  • update Netbox
  • Switch port re-enable

@MoritzMuehlenhoff the system is ready for re-image.

Mentioned in SAL (#wikimedia-operations) [2019-09-10T07:42:12Z] <moritzm> reimaging mw2231 after hardware maintenance T231192

I reimaged mw2231 and repooled it.

Change 535212 merged by Dzahn:
[operations/dns@master] DNS: Change asset tag DNS for mw2231

https://gerrit.wikimedia.org/r/535212