
mw2256 - hardware issue
Closed, Resolved · Public

Description

Follow-up of T155180: the server mw2256 periodically freezes completely. The server is now depooled as "inactive".

The server was part of the batch mw2251-mw2260, which as far as I know are identical servers.

Summary of what has been done so far:

Event Timeline

Dzahn created this task.Apr 19 2017, 4:10 PM
Restricted Application added a subscriber: Aklapper.Apr 19 2017, 4:10 PM

@Papaul: we replaced one of the RAM banks on mw2256 a while ago; is it possible that we are seeing more similar issues?

@elukey HW check came out with no error.

I just repooled mw2256, and pybal is now sending health checks. I didn't find any trace of a recurrence of the error; let's keep this task open for a little longer.

Papaul lowered the priority of this task from Medium to Low.May 26 2017, 12:01 AM

Report from today (UTC timings):

10:30 <ema> !log mw2256 down, console stuck on 'Starti'. power cycled.

Mentioned in SAL (#wikimedia-operations) [2017-08-06T13:17:13Z] <elukey> powercycle mw2256 - com2 frozen - T163346

elukey added a comment.EditedAug 6 2017, 1:20 PM

@Papaul the host keeps getting in a frozen state, we'd need to re-check what's wrong :(

Papaul added a comment.Aug 7 2017, 1:52 PM

@elukey do you have any log for me?

Mentioned in SAL (#wikimedia-operations) [2017-08-08T14:16:23Z] <elukey> set mw2256 pooled=inactive + downtime to allow BIOS upgrade - T163346

Papaul added a comment.Aug 8 2017, 3:01 PM

@elukey done
Update firmware from 2.40 to 2.41
Update BIOS from 2.3.4 to 2.4.2

jcrespo raised the priority of this task from Low to Medium.Aug 9 2017, 4:08 AM
jcrespo added a subscriber: jcrespo.

[44534.817426] B

I would say kernel panic again based on the above output, but who knows.

Sadly, I cannot find relevant software or hardware logs.

Papaul added a comment.Aug 9 2017, 3:07 PM

@jcrespo that is not good; it doesn't help me, but I will contact Dell and get their opinion.

Thanks.

Papaul added a comment.Aug 9 2017, 6:01 PM

@elukey @jcrespo I opened a case with Dell and we are working on the issue.

The Dell engineer asked me to generate an sosreport, a SupportAssist collection, and the RAID controller log. He is reviewing them and will get back in touch with me.

Hello Papaul,

case number SR 952124690

I’m your case owner and primary point of contact through resolution of this issue. Here are the best ways to contact me:

· Email: Dan.Coulter@Dell.com (Preferred)

· Direct Extension: 800-945-3355 ext: 5135128

· My working hours: 9:00am to 6:00pm CT Monday - Friday

· If I am not available, please contact My Backup Team

As discussed, I will contact you throughout the life of this case via email whenever possible.

Remember your satisfaction with our support is my responsibility; please inform me of any issues or concerns so I can address them immediately.

Mentioned in SAL (#wikimedia-operations) [2017-08-10T06:45:49Z] <elukey> powercycle mw2256 - T163346

Papaul applied thermal paste to the CPU, since it was basically not present, and sent an sosreport to Dell to get their support. I just re-pooled the host; let's see if it freezes again.

Mentioned in SAL (#wikimedia-operations) [2017-08-12T15:25:04Z] <elukey> powercycle mw2256 (able to use com2 but not to login as root, regular ssh hanging) - T163346

This time the host showed a sudden increase in load average and I can see this in the syslog at around the same time:

Aug 11 21:21:38 mw2256 kernel: [109231.690343] BUG: stack guard page was hit at ffff9f1a88107fd8 (stack is ffff9f1a88108000..ffff9f1a8810bfff)
Aug 11 21:21:38 mw2256 kernel: [109231.701317] kernel stack overflow (double-fault): 0000 [#1] SMP
Aug 11 21:21:38 mw2256 kernel: [109231.708018] Modules linked in: binfmt_misc 8021q garp mrp stp llc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_raw ip6_tables xt_pkttype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter xt_tcpudp xt_CT iptable_raw ip_tables x_tables intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 ttm kvm drm_kms_helper drm irqbypass crct10dif_pclmul i2c_algo_bit crc32_pclmul dcdbas evdev ghash_clmulni_intel mxm_wmi iTCO_wdt iTCO_vendor_support pcspkr ipmi_si mei_me mei shpchp lpc_ich mfd_core button wmi ipmi_devintf ipmi_msghandler nf_conntrack autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci crc32c_intel aesni_intel libahci aes_x86_64 glue_helper ehci_pci lrw tg3 gf128mul ablk_helper ptp ehci_hcd cryptd libata pps_core usbcore libphy
Aug 11 21:21:38 mw2256 kernel: [109231.787526]  scsi_mod usb_common
Aug 11 21:21:38 mw2256 kernel: [109231.789690] CPU: 28 PID: 18879 Comm: exim4 Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.25-1~bpo8+3
Aug 11 21:21:38 mw2256 kernel: [109231.799979] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Aug 11 21:21:38 mw2256 kernel: [109231.808425] task: ffff8e14d6755140 task.stack: ffff9f1a88108000
Aug 11 21:21:38 mw2256 kernel: [109231.815127] RIP: 0010:[<ffffffffbb2069bd>]  [<ffffffffbb2069bd>] page_fault+0xd/0x30
Aug 11 21:21:38 mw2256 kernel: [109231.823878] RSP: 0018:ffff9f1a88107fe8  EFLAGS: 00010087
Aug 11 21:21:38 mw2256 kernel: [109231.829900] RAX: 0000000000000000 RBX: ffff9f1a88108108 RCX: ffffffffc014b430
Aug 11 21:21:38 mw2256 kernel: [109231.837959] RDX: ffffffffbad02ac8 RSI: ffffffffc016dc40 RDI: ffffffffc016dc4c
Aug 11 21:21:38 mw2256 kernel: [109231.846016] RBP: 000000000000000e R08: aaaaaaaaaaaaaaab R09: 0000000000000000
Aug 11 21:21:38 mw2256 kernel: [109231.854073] R10: ffff8e0e9de0c000 R11: ffff8e14d5d47c10 R12: 0000000000000003
Aug 11 21:21:38 mw2256 kernel: [109231.862131] R13: ffff8e14d6755140 R14: 000000000000000b R15: 0000000000030001
Aug 11 21:21:38 mw2256 kernel: [109231.870189] FS:  0000000000000000(0000) GS:ffff8e14dfb80000(0000) knlGS:0000000000000000
Aug 11 21:21:38 mw2256 kernel: [109231.879313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 11 21:21:38 mw2256 kernel: [109231.885819] CR2: ffff9f1a88107fd8 CR3: 0000000857b54000 CR4: 00000000003406e0
Aug 11 21:21:38 mw2256 kernel: [109231.893876] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 21:21:38 mw2256 kernel: [109231.901935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 11 21:21:38 mw2256 kernel: [109231.909993] Stack:
Aug 11 21:21:38 mw2256 kernel: [109231.912332] Call Trace:
Aug 11 21:21:38 mw2256 kernel: [109231.915160]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109231.922348]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109231.928855]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109231.934976]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109231.941008]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109231.948291]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109231.955478]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109231.962665]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109231.969172]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109231.975292]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109231.981320]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109231.988604]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109231.995791]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.002983]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109232.009504]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109232.015638]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109232.021689]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109232.028998]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.036229]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.043446]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109232.049967]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109232.056094]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109232.062131]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109232.069421]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.076617]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.083814]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109232.090331]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109232.096459]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109232.102499]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109232.109793]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.116993]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.124190]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109232.130722]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109232.136864]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109232.142917]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109232.150229]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.157433]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.164640]  [<ffffffffbac60a18>] ? fixup_exception+0x18/0x40
Aug 11 21:21:38 mw2256 kernel: [109232.171177]  [<ffffffffbac5f095>] ? no_context+0x45/0x400
Aug 11 21:21:38 mw2256 kernel: [109232.177309]  [<ffffffffbb2069d8>] ? page_fault+0x28/0x30
Aug 11 21:21:38 mw2256 kernel: [109232.183355]  [<ffffffffc014b430>] ? scsi_ioctl+0x1b0/0x3e0 [scsi_mod]
Aug 11 21:21:38 mw2256 kernel: [109232.190647]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
Aug 11 21:21:38 mw2256 kernel: [109232.197850]  [<ffffffffbad02ac8>] ? search_module_extables+0x68/0x70
[..looong stacktrace..]
Aug 11 21:21:38 mw2256 kernel: [109234.161695] Code: 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff ff e8 88 97 a5 ff e9 53 02 00 00 0f 1f 00 0f 01 ca 66 0f 1f 44 00 00 48 83 c4 88 <e8> 7e 01 00 00 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff
Aug 11 21:21:38 mw2256 kernel: [109234.183453] RIP  [<ffffffffbb2069bd>] page_fault+0xd/0x30
Aug 11 21:21:38 mw2256 kernel: [109234.189588]  RSP <ffff9f1a88107fe8>
Aug 11 21:21:38 mw2256 kernel: [109234.199011] ---[ end trace 6c2481d9a50ca780 ]---
Aug 11 21:21:38 mw2256 kernel: [109234.264961] Fixing recursive fault but reboot is needed!

Hi Papaul,

I’m back in the office today and getting caught up on my emails and case backlog. I’m in the process of reviewing the case status and will provide feedback / update shortly.

Regards,

Dan Coulter

Hi Papaul,

I had a chance to review the logs you provided and found the following:

  • Previous entries in SEL indicate memory errors
  • Replacement 32GB DIMM was dispatched on previous case to address memory issues
  • Nothing currently in logs to indicate HW error
  • BIOS was updated from 2.3.4 to 2.4.2 on 8/8/17
  • From LCC log, system shows that it was turned off / shutdown prior to restart

2017-08-09 04:13:38 LOG007 The previous log entry was repeated 1 times.

2017-08-09 04:13:23 SYS1003 System CPU Resetting.

2017-08-09 04:12:46 LOG007 The previous log entry was repeated 1 times.

2017-08-09 04:10:16 SYS1003 System CPU Resetting.

2017-08-09 04:10:14 SYS1000 System is turning on.

2017-08-09 04:10:07 RAC0702 Requested system powercycle.

2017-08-09 04:10:06 SYS1003 System CPU Resetting.

2017-08-09 04:10:06 SYS1001 System is turning off.

  • kern.log entries prior to system being restarted

Aug 8 16:18:16 mw2256 kernel: [ 4689.558803] perf: interrupt took too long (2533 > 2500), lowering kernel.perf_event_max_sample_rate to 78750

Aug 8 16:49:45 mw2256 kernel: [ 6578.997313] perf: interrupt took too long (3171 > 3166), lowering kernel.perf_event_max_sample_rate to 63000

Aug 8 17:49:54 mw2256 kernel: [10188.432044] perf: interrupt took too long (3964 > 3963), lowering kernel.perf_event_max_sample_rate to 50250

Aug 8 19:57:30 mw2256 kernel: [17844.977396] perf: interrupt took too long (4960 > 4955), lowering kernel.perf_event_max_sample_rate to 40250

There is only the current syslog that begins with the last restart on 8/9/17, so I am unable to see any events reported prior to the restart.

I’ve seen references to similar behavior with different distros of Linux and updates / patches have been suggested to resolve. However, it looks like you are already at 4.9.2-10 which looks to be the most recent release for that version.

I will continue to look to see what I can find, but it does not appear to be a hardware issue from the information provided.

Regards,

Dan Coulter

@elukey

Good morning Papaul,

I would suggest booting the system to our Support Live ISO and running the stressapptest for an extended period of time.

The ISO can be downloaded from the following link:

Dell Support Live Image Version 2.2

http://www.dell.com/support/home/us/en/4/Drivers/DriversDetails?driverId=CWF92

You can mount the ISO using virtual media from the iDRAC and boot to it.

I would suggest the following:

  1. Clear the system event log from the iDRAC interface
  2. Mount the ISO using virtual media and boot to it
  3. Open a terminal session and run the following command: stressapptest -s <time in seconds> -W

Running the stressapptest will bring the CPU utilization to 95% and keep it there for the duration of the test.

Let’s run the test for as long as possible to see if we can induce the same behavior you are seeing with the installed OS.

Regards,

Dan Coulter
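Dell's three steps above can be sketched as a small script; since `racadm` and `stressapptest` are only present in the iDRAC / Support Live ISO environment, this sketch computes the duration and prints the commands rather than executing them (the six-hour figure is illustrative, not from the ticket):

```shell
#!/bin/sh
# Sketch of the suggested test sequence; commands are printed, not executed,
# because racadm/stressapptest exist only in the Live ISO environment.
DURATION=$((6 * 3600))   # six hours, expressed in seconds (illustrative)
echo "racadm clrsel                      # 1. clear the SEL via the iDRAC"
echo "(mount the ISO via iDRAC virtual media and boot it)   # 2."
echo "stressapptest -s ${DURATION} -W    # 3. CPU-heavy memory stress"
```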

@Papaul sounds good to me. The host is now in maintenance for Icinga (until Sept 9th), and depooled from any service. You are free to do the test whenever you prefer :)

@elukey the system is not booting up with the live CD provided by Dell; I get the error message below. I am working with Dell on this issue at the moment.

I was able this time to boot off the live CD, but after the OS loaded the system got into a frozen state, so I couldn't run the stress command.
The last option now is to run the extended HW test; if no problem is found, the main board will be replaced, according to what the Dell engineer said.

So I am running the extended HW test now; it will take 3 to 6 hours.

Papaul added a comment.EditedAug 21 2017, 2:55 PM

Extended test came out with no HW problem. I will be contacting Dell once again to follow up on the case.

I will be receiving a replacement main board for this system this Wednesday.

Your Service Request
SR#: 952124690

Contact Us | Support Library | Download Center | SupportAssist | Community Forums

Dear Papaul Tshibamba,

This e-mail is to update you on the status of your Dell Service Request.

Current Status:

The Dell replacement part(s) for your PowerEdge R430 Server have been shipped by FedEx with tracking number 744122987458.

Papaul reassigned this task from Papaul to elukey.Aug 22 2017, 5:37 PM

@elukey main board replacement complete.

@Papaul: Luca's out this week. I've tried to connect to the host via SSH, but can't. The mgmt interface works fine; can you check the cabling, please?

@MoritzMuehlenhoff the cable is connected. Just keep in mind new main board = new MAC address.

racadm getsysinfo reports:

Embedded NIC MAC Addresses:
NIC.Embedded.1-1-1      Ethernet                = 14:18:77:5F:43:64
NIC.Embedded.2-1-1      Ethernet                = 14:18:77:5F:43:65
NIC.Embedded.3-1-1      Ethernet                = 14:18:77:5F:43:66
NIC.Embedded.4-1-1      Ethernet                = 14:18:77:5F:43:67

Meanwhile this is the puppet config:

host mw2256 {
    hardware ethernet 18:66:DA:83:16:47;
    fixed-address mw2256.codfw.wmnet;
}
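The mismatch between the two snippets above is easy to miss by eye; a minimal sanity-check sketch (the MAC values are the ones from this task, the comparison logic is hypothetical):

```shell
#!/bin/sh
# Compare the MAC that racadm reports for the primary NIC with the one
# hard-coded in the puppet DHCP entry; a mismatch means DHCP will never
# answer the new main board's requests.
racadm_mac="14:18:77:5F:43:64"   # NIC.Embedded.1-1-1 per `racadm getsysinfo`
dhcp_mac="18:66:DA:83:16:47"     # from the linux-host-entries stanza
norm() { echo "$1" | tr 'A-F' 'a-f'; }   # normalize case before comparing
if [ "$(norm "$racadm_mac")" = "$(norm "$dhcp_mac")" ]; then
    echo "MACs match"
else
    echo "MAC mismatch: update linux-host-entries before reimaging"
fi
```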

Change 374168 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] linux-host-entries: change MAC address of mw2256

https://gerrit.wikimedia.org/r/374168

Change 374168 merged by Elukey:
[operations/puppet@production] linux-host-entries: change MAC address of mw2256

https://gerrit.wikimedia.org/r/374168

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw2256.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708281031_elukey_27779.log.

Completed auto-reimage of hosts:

['mw2256.codfw.wmnet']

Of which those FAILED:

set(['mw2256.codfw.wmnet'])

Host reimaged and re-pooled, everything looks good. Let's keep this task open for a couple of days to see if anything weird comes up but I'd say that we are good.

elukey closed this task as Resolved.Aug 29 2017, 8:04 AM

Closing this task since the hw issue should have been resolved. Will re-open if necessary. Thanks @Papaul for the work done!

elukey reopened this task as Open.Aug 29 2017, 10:05 AM
jcrespo removed a subscriber: jcrespo.Aug 29 2017, 10:07 AM

Host frozen again, not responding to ssh and pings, com2 shows [82623.895993] g

@elukey I think I will take your advice to burn mw2256 down lol.
Here is what I want you to do for me: configure the system to generate a kernel crash dump. When the system gets into the frozen state, do not power cycle it. We will use the NMI button to power down the system; this will generate a kernel crash dump that can help determine what is causing the system to become unresponsive, and then I can get back to Dell.

Thanks.
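Papaul's crash-dump request can be sketched roughly as follows for Debian jessie. The package and sysctl names (`kdump-tools`, `kernel.unknown_nmi_panic`, `crashkernel=`) are real Debian/Linux knobs, but the reserved-memory size is a guess and none of this was run on mw2256; the setup steps are shown as comments, and only the "is kdump armed?" check actually executes:

```shell
#!/bin/sh
# Setup steps (comments only: they need root and a reboot):
#   apt-get install kdump-tools              # Debian's kdump capture tooling
#   # set USE_KDUMP=1 in /etc/default/kdump-tools
#   # add crashkernel=256M to GRUB_CMDLINE_LINUX_DEFAULT, then update-grub
#   sysctl -w kernel.unknown_nmi_panic=1     # NMI button -> panic -> dump
# Check whether a capture kernel is currently loaded:
flag=/sys/kernel/kexec_crash_loaded
if [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]; then
    echo "kdump armed: the NMI button should produce a crash dump"
else
    echo "kdump not armed: check crashkernel= and kdump-tools first"
fi
```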

elukey added a comment.Sep 2 2017, 6:52 AM

@Papaul the host froze again, all yours :)

@elukey I followed the steps that were given to me by the Dell engineer to power down the server using the NMI button; holding the button down doesn't power the server down. So I contacted Dell once again, and they asked me to upload another SupportAssist report, which I did.
After reviewing the report, they came up with 2 things, according to them:

1- Not all the firmware on the server is up to date
2- The OS (Debian Jessie) running on the server is not compatible with the hardware.

They sent me another link to download the new firmware. So after updating the firmware, if we still have the same problem, I think we will have to escalate this issue to @mark or @faidon.

Thanks

Papaul added a comment.Sep 5 2017, 4:34 PM

@elukey The ISO file is about 3.3 GB; I cannot use the MiFi to download it, so I will download it once home and bring it to the DC tomorrow to update the firmware.

Thanks.

elukey updated the task description. (Show Details)
elukey added a comment.Sep 6 2017, 9:29 AM

@Papaul one thing that we could tell Dell is that, as far as I can see, mw2251->60 are identical, so our software is almost surely not the problem. I put a summary in this task of what has been done so far, and it looks like the host has already been through a lot of hw failures, so everything seems to point to another hw problem that we haven't seen yet.

About the firmwares: do we have a quick way to compare the versions across mw2251->60?
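One possible answer to the question above, as a hypothetical sketch: loop over the batch's mgmt interfaces and query each one. The `<host>.mgmt.codfw.wmnet` naming is an assumption, and the per-host racadm call is left commented out since credentials and connectivity vary:

```shell
#!/bin/sh
# Build the mgmt host list for the mw2251-mw2260 batch and, per host,
# print the name; the firmware query itself is commented out.
for n in $(seq 2251 2260); do
    host="mw${n}.mgmt.codfw.wmnet"
    echo "$host"
    # ssh root@"$host" racadm getversion   # BIOS/iDRAC versions per host
done
```

Collecting the ten outputs side by side would show immediately whether mw2256 diverges from its supposedly identical siblings.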

Papaul added a comment.Sep 6 2017, 3:13 PM

@elukey
I spoke today with one of the Dell managers on this case. He assured me that he will personally follow this case with the engineer working with me. He asked that I go ahead and update the firmware; then they will send me:

1- control panel
2- control panel cable

Hello Papaul,

I’m a Resolution Manager within the Dell Enterprise Expert Center and I am contacting you about incident # 952124690. My role as an RM is to serve as an advocate for you and to ensure that this incident is resolved to your satisfaction. Please let me know if you prefer to be contacted via e-mail or phone.

Please contact me if you are not completely satisfied with the level of support you are receiving so I may act on your behalf or if you have any questions or concerns that I can assist with.

Regards,

Brad Smith

Enterprise Resolution Manager

Dell EMC | Enterprise Support Resolution Team

office +1 800 945 3355 x 4340497

mobile +1 405 306 9423

Bradley.Smith@Dell.com

Papaul added a comment.Sep 6 2017, 4:12 PM

In the process of updating the firmware on the server, the server again got into a frozen state: nothing on the monitor, and no keyboard response either.

Dzahn removed a subscriber: Dzahn.Sep 6 2017, 5:36 PM
Dzahn added a subscriber: Dzahn.Sep 6 2017, 5:39 PM

Given all the work that has gone into this single host, and it still being dead after all this, I suggest we just give up on it and permanently decommission it. It probably costs us less in the end that way.

Papaul added a comment.Sep 7 2017, 6:19 PM

@elukey I got a call from the Dell manager support team; here is what is going to happen next.
They will send out:

1- Another main board
2 - 2 CPU's
3 - controller panel
4 - controller panel cable

Once I have all the parts on site, they will dispatch a tech to come on site to install them and troubleshoot the issue. If the parts are shipped today, I will get them by tomorrow and the tech will be on site by Monday.

Papaul added a comment.Sep 8 2017, 5:02 PM

Hi Papaul,

Here is the dispatch information for service to address system unresponsiveness of the PowerEdge R430 with service tag FXLPND2.

Next business day service dispatch

Service date deferred by customer to Monday, 9/11/17

DPS 330180974

Parts:

CN7X8 - ASSY,PWA,PLN,SV,R530/R430,V3

1GKDW - (qty 2) PRC SVC Kit (Contains PRC E52650V4, 2.2, 30M, M0)

R3GGP - ASSY Front I/O Module, (for X10 Hot Plug Config)

KRHM8 - ASSY Motherboard to X10 Front Panel USB Cable

397G1 - ASSY,CBL,INTFC,MB,CTL,R420/320

All parts show a status of shipped. The technician will be contacting you directly to schedule a specific time for service once he has received the parts.

After replacing the components, we’d like for the technician to confirm the system can boot to the Support Live ISO, run the Linux stress app and confirm the he can crash the system with the NMI button while booted to the live environment. Let me know if you need for me to resend the link to the Support Live ISO to have on hand.

Please let us know if you have any questions or concerns. Otherwise, I will keep an eye on the dispatch and follow up with you after service has been completed to check status.

Regards,

Dan Coulter

Papaul claimed this task.Sep 11 2017, 2:47 PM

Tech arrived on site at 10:13am and started working on the server at 10:30.

After replacing the main board, the server came up with an error on the PSUs (PSU Mismatch).
After troubleshooting with Dell, they told him to remove the new board and put back the old board.

12:30 pm Replacing new board with old board
1:13 pm Replacing new board with old board complete
After replacing the new board with the old board, no errors (problem with the new board)
1:19 pm Start stress test on the server
1:48 pm Stress test complete with no error

part replaced :

  • both CPU's
  • Control panel
  • control panel cable
Papaul reassigned this task from Papaul to elukey.Sep 13 2017, 2:14 PM

@elukey I think we can put the system back into production to test it out, now that the Dell engineer has replaced all the parts above.

Thanks.

MoritzMuehlenhoff closed this task as Resolved.Sep 15 2017, 7:28 AM

Agreed. I ran a "scap pull" and repooled the server. Closing the task, we can reopen if it crashes again.