
cp5001 unreachable since 2018-07-14 17:49:21
Closed, Resolved (Public)

Description

cp5001 lost network connectivity at 2018-07-14 17:49:21. librenms shows the network port as down: https://librenms.wikimedia.org/device/device=163/tab=port/port=15429/

The server is still reachable via the mgmt interface, but the console shows nothing but the string "Startin".


Event Timeline

cp5001 began to complain about memory errors at Jul 14 17:39:19:

vgutierrez@cp5001:~$ fgrep "section_type: memory error" /var/log/syslog
Jul 14 17:39:19 cp5001 kernel: [4071097.227959] {1}[Hardware Error]:   section_type: memory error
Jul 14 17:40:05 cp5001 kernel: [4071143.323703] {2}[Hardware Error]:   section_type: memory error
Jul 14 17:40:32 cp5001 kernel: [4071170.926275] {3}[Hardware Error]:   section_type: memory error
Jul 14 17:42:04 cp5001 kernel: [4071262.120688] {4}[Hardware Error]:   section_type: memory error
Jul 14 17:43:07 cp5001 kernel: [4071325.427453] {5}[Hardware Error]:   section_type: memory error
Jul 14 17:47:11 cp5001 kernel: [4071569.871201] {6}[Hardware Error]:   section_type: memory error

Complete error:

Jul 14 17:39:19 cp5001 kernel: [4071097.227953] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Jul 14 17:39:19 cp5001 kernel: [4071097.227955] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jul 14 17:39:19 cp5001 kernel: [4071097.227956] {1}[Hardware Error]: event severity: corrected
Jul 14 17:39:19 cp5001 kernel: [4071097.227958] {1}[Hardware Error]:  Error 0, type: corrected
Jul 14 17:39:19 cp5001 kernel: [4071097.227958] {1}[Hardware Error]:  fru_text: B4
Jul 14 17:39:19 cp5001 kernel: [4071097.227959] {1}[Hardware Error]:   section_type: memory error
Jul 14 17:39:19 cp5001 kernel: [4071097.227960] {1}[Hardware Error]:   error_status: 0x0000000000000400
Jul 14 17:39:19 cp5001 kernel: [4071097.227961] {1}[Hardware Error]:   physical_address: 0x000000535fc1c880
Jul 14 17:39:19 cp5001 kernel: [4071097.227963] {1}[Hardware Error]:   node: 1 card: 3 module: 0 rank: 1 bank: 1 row: 11256 column: 800
Jul 14 17:39:19 cp5001 kernel: [4071097.227964] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Jul 14 17:39:19 cp5001 kernel: [4071097.227981] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jul 14 17:39:19 cp5001 kernel: [4071097.227986] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Jul 14 17:39:19 cp5001 kernel: [4071097.227992] EDAC sbridge MC0: TSC 66397a3df61eb2
Jul 14 17:39:19 cp5001 kernel: [4071097.228000] EDAC sbridge MC0: ADDR 535fc1c880
Jul 14 17:39:19 cp5001 kernel: [4071097.228006] EDAC sbridge MC0: MISC 0
Jul 14 17:39:19 cp5001 kernel: [4071097.228012] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1531589959 SOCKET 0 APIC 0
Jul 14 17:39:19 cp5001 kernel: [4071097.228031] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x535fc1c offset:0x880 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:2 rank:1)
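
(For reference, the corrected/uncorrected counts above can also be cross-checked straight from the EDAC sysfs counters instead of grepping syslog. A minimal sketch, assuming the stock EDAC sysfs layout on these kernels; older kernels expose csrow*/ce_count instead of the per-DIMM files:)

# per-memory-controller totals: corrected (ce) and uncorrected (ue) errors
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
# per-DIMM labels and corrected counts, to confirm which module is accumulating errors
grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count
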
Vgutierrez moved this task from Backlog to Hardware on the Traffic board.

Both the kernel log and the server event log show issues on DIMM B4:

3 | 07/14/2018 | 17:49:17 | Memory ECC Uncorr Err | Uncorrectable ECC (UnCorrectable ECC |  DIMMB4) | Asserted
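
(The SEL entry above was pulled via the mgmt interface; for reference, a minimal sketch of reading it remotely with ipmitool, with the mgmt hostname and credentials left as placeholders:)

# dump the full system event log over IPMI-over-LAN
ipmitool -I lanplus -H <cp5001-mgmt-host> -U <user> -P <password> sel elist
# optionally clear it afterwards so any new assertions stand out
ipmitool -I lanplus -H <cp5001-mgmt-host> -U <user> -P <password> sel clear
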
BBlack raised the priority of this task from Medium to High. Jul 16 2018, 3:33 PM
BBlack subscribed.

Turning priority to "high" for this and the 5006 ticket (T187157), as between the two of them they leave the upload@eqsin cluster at its design limit of 4 reliable nodes.

BBlack lowered the priority of this task from High to Medium. Jul 25 2018, 8:25 PM

I've opened case 977580870 to coordinate getting a Dell Tech dispatched to eqsin with a replacement part.

Update from email thread: I've started to arrange for the Dell Onsite Engineer to visit next Monday, August 6th. We'll need to ensure cp5001 is still offline this Friday in anticipation of the work.

I'm still getting the details on who the exact engineer going onsite will be (since I have to list them in an access/escort ticket with Equinix.)

Just need to check with Traffic: is having this offline Friday through Monday OK?

It's already depooled, should be fine!

Excellent, I'll continue coordinating with Dell support and Equinix to file the tasks for this repair.

Dell finally replied to me (3 days later) with a list of 4 engineers who could go onsite. They keep doing that (listing more people than will actually go), so now I have to figure out which of them Dell will actually send and file the proper ticket.

This is taking days longer than I'd like.

Scheduled visit for next Monday, EQ ticket 1-165318922260, Dell dispatch 91912127436.

Engineers (Wong Kee Heng & Kelvin Goh Keng Yew) from Unisys (sub-contracted by Dell for Pro support) will be onsite on Monday, August 13th between 1500 and 1700 Singapore local.

They'll be swapping out the bad DIMM and booting the server back up (to put the system through POST memory testing). If all goes well, the system will be ready to go back into service by the end of their work.

Email to EQ:

Shaun,

Our Equinix portal lists you as our account rep for SG3, so I'm hoping you can assist me in a recent issue we're having.

We have a defective Dell server, and we're attempting to get Dell Onsite Engineers escorted to the rack. I've opened 2-165443475311, but it has now been placed on hold.

When I email support asking about it, I get no reply.

Can you investigate why 2-165443475311 was denied? We basically need to file a ticket to get onsite access to two Dell/Unisys Engineers, listed on that ticket (2-165443475311).

Can you advise what needs to be done for this to be processed and to allow the engineers onsite access? I've filled out the form, and it says accepted, and then a few days later SG3 denies or puts the ticket on hold. We need to get some folks inside to fix the system in question.

Please advise,

EQ Reply:

Hi Rob,

Sorry for the inconvenience caused.

Let me check and get back to you tomorrow.

Thanks

My reply back:

Shaun,

I appreciate it! We basically need someone from EQ to escort the Dell/Unisys engineer(s) to our rack, unlock the rack, and then supervise them as they work on a single host in the rack. They'll be changing out a memory stick in a single machine.

Unfortunately, there seems to have been some miscommunication on it, since the ticket keeps being delayed or denied. It may be I entered it into the system incorrectly. If possible, please advise on the best practice for me to file a ticket to achieve the above.

Thanks!

Equinix Support,

Please note this ticket is for both a site access/visitor ticket for two Unisys Engineers as well as a SmartHands Escort & Supervise ticket for those two same visitors.

On Monday, 2018-04-09 between 1300-1500 Singapore Local, two Unisys engineers will visit SG3 to work on one of our Dell systems. The engineers are:

Name: Wong Kee Heng
IC#: <redacted from phab posting>
Mobile #: <redacted from phab posting>

Name: Kelvin Goh Keng Yew
IC#: <redacted from phab posting>
Mobile# <redacted from phab posting>

They will need to be escorted to our rack, 06:040020:0603, and given access to it to work on the system labeled cp5001/WMF7174/2B9L9M2, in U17. They will be removing one memory DIMM from the mainboard and replacing it.

I filed the above as part of my work order ticket, listing both of the techs above, which resulted in Equinix ticket 1-166706276918. I then emailed this ticket info back to sg.dsp@sg.unisys.com, which is the generic Unisys support address they email/CC all the Dell Singapore Support dispatches from. (It seems they contract out to Unisys for the actual work there.)

There was some back and forth with both Unisys and Equinix since Unisys says they were denied entry, but Equinix says no one ever showed up. Trying to figure out who is at fault is counter-productive to getting cp5001 working though, so I've pretty much ignored the 'who did what wrong' blame game and focused on getting them onsite again.

I did email our EQ rep team with the new ticket # for them to double check and ensure things go smoothly.

Hopefully they'll show up next Monday and do the work!

@RobH ping? This has been pending since July, with the last update being Aug 27(!?)

Ok, picking this back up!

I emailed our support case 91912127436:

Support,

This was dropped and not picked back up, so I'm trying to determine the status now.

We need to have the memory replaced on this host. Do I need to file a new task or can we reopen this one? Please advise,

The old ticket was too old to reopen, so new ticket 19131684 has been opened. I'm working this (sending over all the old info and logs) and will schedule another onsite attempt.

Hi Rob,
Good day to you.
I am replying on behalf of my colleague Marco, with whom you spoke earlier.

I have created a new case for the issue with the server's memory module, which needs to be replaced.
Case number: 19131684.

Would you be able to share the TSR logs from the server's iDRAC for me to review before I arrange the service?
Also, please provide the details below for dispatch purposes.

Address of server location:-
Onsite contact person name:-
Onsite contact person number:-
Onsite contact person email:-

Thanks

Best Regards,
Hari Mohan Rao
Senior Enterprise Analyst
Dell EMC | Global Support and Deployment
Hari_mohan_rao_s@dell.com
How am I doing? Please contact my manager, Sze_Leong_Kok@Dell.com with any feedback.

I sent back all the info:

Support,

The server is located at:
<address redacted from task>

Please note we do NOT have an on-site technician in Singapore, and instead I'll be the point of contact. I am located in San Francisco, CA, USA.
Rob Halsell - +1.727.255.4597 rhalsell@wikimedia.org

The system had a single defective DIMM direct from Dell; it has never worked. We just need that one DIMM replaced.

Log:

/admin1-> racadm getsel
Record: 1
Date/Time: 10/31/2017 12:52:22
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 07/14/2018 17:49:17
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 3
Date/Time: 07/14/2018 17:49:17
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B4.

I've also attached the TSR. Please note that Equinix requires 3 business days' advance notice for onsite visits, so we'll need the full names of the visiting engineers ahead of time. I'll open a ticket with Equinix for this visit.

Thanks in advance,
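
(For future reference, the TSR can also be collected and pulled remotely via racadm rather than through the iDRAC web UI. A rough sketch, assuming iDRAC8-era firmware; the exact techsupreport syntax varies by racadm/iDRAC version, so check "racadm help techsupreport" first. Host and credentials are placeholders:)

# start a Tech Support Report collection job on the iDRAC
racadm -r <cp5001-mgmt-host> -u <user> -p <password> techsupreport collect -t SysInfo,TTYLog
# once the collection job completes, export the resulting archive locally
racadm -r <cp5001-mgmt-host> -u <user> -p <password> techsupreport export -f cp5001-tsr.zip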

So Dell wants us to update the BIOS and return this to service to see if the error happens again. I'll flash the BIOS and attempt to run memtest remotely to see if that works.

The latency in pushing things to the mgmt network is pretty high, but it is working.

Updated the iDRAC firmware from 2.50 to 2.60; now updating the BIOS from 2.5.4 to 2.8.0.
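
(Both updates were pushed over the mgmt network with remote racadm; a minimal sketch of that approach, with host, credentials, and the Dell Update Package filenames as placeholders. Older iDRAC firmware may only support "racadm fwupdate" rather than "update":)

# queue the iDRAC and BIOS Dell Update Packages (the BIOS update applies on the next reboot)
racadm -r <cp5001-mgmt-host> -u <user> -p <password> update -f <idrac-2.60-dup.exe>
racadm -r <cp5001-mgmt-host> -u <user> -p <password> update -f <bios-2.8.0-dup.exe>
# watch the update jobs
racadm -r <cp5001-mgmt-host> -u <user> -p <password> jobqueue view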

BIOS updated; now running memtest86+ via the Dell diagnostics boot option.

Mentioned in SAL (#wikimedia-operations) [2018-11-21T21:16:43Z] <robh> cp5001 is offline running hardware tests after firmware updates to see if memory error still exists. ref: T199675

Update from the SRE meeting today: memtest was successful, and we've been asked to put it back into production to see whether the error recurs. Re-pooling!
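
(Depool/repool here is the usual conftool step; a sketch assuming the standard confctl selector for this host, which may not match the exact production object names:)

# repool once memtest comes back clean; the same command with pooled=no was used to depool
sudo confctl select 'name=cp5001.eqsin.wmnet' set/pooled=yes
# confirm the current state
sudo confctl select 'name=cp5001.eqsin.wmnet' get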

No new EDAC errors reported since repooling; all we can do is assume it's OK for now, I think.