
VMs on cloudvirt1015 crashing - bad mainboard/memory
Open, Normal, Public

Description

I put cloudvirt1015 into service on Monday the 8th. Yesterday (the 11th), tools-prometheus-01 crashed with a kernel panic. On Friday the 12th, tools-worker-1023 also crashed.
On Saturday the 13th, @Zppix's puppet-lta.lta-tracker.eqiad.wmflabs crashed.

We've replaced lots of parts in this box, to no avail:

Event Timeline


@RobH and/or @Cmjohnson, I'm hoping the above is enough to pass on to Dell for a replacement part. Let us know if you need more details.

Restricted Application added a project: Operations. Apr 13 2019, 9:53 PM

I finished draining cloudvirt1015 and put it in downtime, so it's ready for whatever reboots/rebuilds/hardware changes might be needed.

Andrew updated the task description. Apr 15 2019, 3:08 PM
bd808 updated the task description. Apr 15 2019, 3:10 PM

I just deleted product-analytics-test and product-analytics-bayes so y'all don't need to worry about those instances :)

colewhite triaged this task as Normal priority.Apr 16 2019, 6:07 PM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board.Apr 16 2019, 6:14 PM

*bump* Chris, do you have any thoughts about what we should do next here?

Mentioned in SAL (#wikimedia-operations) [2019-05-15T20:20:29Z] <robh> rebooting cloudvirt1015 into dell hardware tests per T220853

RobH added a comment.May 15 2019, 8:41 PM

Ok, so this has had CPU issues from the get-go, tracked on both T215012 and T171473. It seems that the CPUs have been swapped, but not the mainboard. Considering it's still throwing CPU errors after all the CPU swaps, I advise we swap the mainboard next.

I'm attempting to run Dell hw test to get a failure code.

RobH claimed this task.May 15 2019, 8:42 PM
RobH reassigned this task from RobH to Cmjohnson.May 15 2019, 8:50 PM
RobH moved this task from Cloud Tasks to Hardware Failure / Troubleshoot on the ops-eqiad board.

Error output:

The event log indicates degraded or disabled ECC functionality.  Memory testing cannot continue until the problems are corrected, the log cleared and the system rebooted.

Technical Support will need this information to diagnose the problem.
Please record the information below.

Service Tag : 31R9KH2
Error Code : 2000-0125                                      
Validation : 126785

The error logs in question are:

-------------------------------------------------------------------------------
Record:      10
Date/Time:   05/15/2019 20:41:12
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   05/15/2019 20:41:45
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------

So either the memory is bad, or the mainboard is bad. Since the mainboard is also throwing CPU errors (see above), we should try to get both mainboard and memory dimm replaced.
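
For the OS-side view, a quick sanity check (a rough sketch, assuming the kernel's EDAC drivers such as sb_edac/edac_core are loaded on this host) is to read the per-DIMM correctable-error counters, grep dmesg for machine-check noise, and compare against the iDRAC SEL:

# Rough sketch; sysfs paths assume the EDAC drivers (sb_edac/edac_core) are loaded.
grep -H . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count    # correctable errors per DIMM
sudo dmesg -T | grep -iE 'edac|mce|machine check|hardware error' # kernel-side error messages
racadm getsel   # from the iDRAC shell, for comparison with the records above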

RobH renamed this task from VMs on cloudvirt1015 crashing to VMs on cloudvirt1015 crashing - bad mainboard/memory.May 15 2019, 8:50 PM
mpopov removed a subscriber: mpopov.May 17 2019, 4:57 PM

Swapped DIMM B3 with DIMM A3 and cleared the log.

Andrew added a comment.Jun 5 2019, 6:23 PM

Any update about this? Are parts on the way?

@Andrew what parts? Nothing on the server side suggests a CPU problem. I reseated and moved the DIMM and that error has not returned; it may very well have been a poorly seated DIMM. I checked dmesg and do not see any more errors related to memory or CPU. Try putting it back into production and let's see if anything comes back. Unfortunately, I need to demonstrate there is a problem before Dell will do anything, and right now I do not have anything to give them.

I've put eight test VMs on 1015 and will let them run for a few days, then see if they're still up :)

Mentioned in SAL (#wikimedia-cloud) [2019-06-17T13:59:22Z] <andrewbogott> moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853 )

Cmjohnson closed this task as Resolved.Jul 11 2019, 5:47 PM

I am resolving this task.

aborrero reopened this task as Open.Jul 22 2019, 4:56 PM
aborrero added a subscriber: aborrero.

The server just died again. I found this in the mgmt console:

[4576846.406213] Code: 00 75 73 48 83 c4 38 5b 5d c3 48 8d 74 24 10 48 89 d1 89 df 48 89 ea e8 28 fe ff ff 8b 54 24 28 83 e2 01 74 0b f3 90 8b 54 24 28 <83> e2 01 75 f5 eb c1 8b 05 c4 9a db 00 85 c0 75 83 80 3d 94 49 
[4576846.509999] NMI watchdog: BUG: soft lockup - CPU#54 stuck for 22s! [CPU 1/KVM:72246]
[4576846.510028] Modules linked in: cpuid xt_multiport nf_conntrack_netlink ebt_arp ebt_among ip6table_raw ip6table_mangle nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat xt_connmark iptable_mangle xt_mac xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_physdev xt_set xt_conntrack nf_conntrack ip_set_hash_net ip_set nfnetlink vhost_net vhost macvtap macvlan tun binfmt_misc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iptable_raw 8021q garp mrp xfs intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas kvm irqbypass mgag200 crct10dif_pclmul ttm crc32_pclmul drm_kms_helper ghash_clmulni_intel drm i2c_algo_bit sg pcspkr mei_me lpc_ich mfd_core mei evdev shpchp ipmi_si wmi button ib_iser
[4576846.510050]  rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nbd ipmi_devintf ipmi_msghandler br_netfilter bridge stp llc ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear md_mod dm_mod sd_mod aesni_intel aes_x86_64 ehci_pci glue_helper ahci ehci_hcd bnx2x lrw gf128mul libahci ablk_helper ptp cryptd pps_core libata megaraid_sas mdio usbcore libcrc32c crc32c_generic usb_common crc32c_intel scsi_mod
[4576846.510052] CPU: 54 PID: 72246 Comm: CPU 1/KVM Tainted: G      D      L  4.9.0-9-amd64 #1 Debian 4.9.168-1+deb9u2
[4576846.510053] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018
[4576846.510054] task: ffffa0bed38c40c0 task.stack: ffffabedf4fe4000
[4576846.510057] RIP: 0010:[<ffffffff9b6feaa7>]  [<ffffffff9b6feaa7>] smp_call_function_single+0xd7/0x130
[4576846.510058] RSP: 0018:ffffabedf4fe7ab0  EFLAGS: 00000202
[4576846.510059] RAX: 0000000000000000 RBX: 0000000000000026 RCX: ffffabedf4a5fac0
[4576846.510060] RDX: 0000000000000003 RSI: ffffabedf4fe7ac0 RDI: ffffabedf4fe7ac0
[4576846.510061] RBP: ffffffffc0c35620 R08: 001042497cc14308 R09: 0000000000000014
[4576846.510062] R10: 0000000000000cb7 R11: 0000000000000ef8 R12: 0000000000000026
[4576846.510063] R13: ffffa0cca1528040 R14: ffffa0eb7816d400 R15: 0000000000000036
[4576846.510064] FS:  00007f6fb1728700(0000) GS:ffffa0bcbfac0000(0000) knlGS:0000000000000000
[4576846.510065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4576846.510066] CR2: 00007f4e3b13a398 CR3: 00000056c771e000 CR4: 0000000000362670
[4576846.510067] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4576846.510068] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[4576846.510069] Stack:
[4576846.510071]  0000001a0000000c 0000000100000024 ffffabedf4a5fac0 ffffffffc0c35620
[4576846.510073]  ffffa0cca152c5d8 0000000000000003 d38dee9ce6e66bab 0000000000000036
[4576846.510075]  ffffabedf4fe7c00 ffffffffc0c34abf 0000000000000000 0000000000000016
[4576846.510075] Call Trace:
[4576846.510079]  [<ffffffffc0c35620>] ? update_debugctlmsr+0x20/0x20 [kvm_intel]
[4576846.510083]  [<ffffffffc0c34abf>] ? vmx_vcpu_load+0x9f/0x350 [kvm_intel]
[4576846.510085]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510086]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510088]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510089]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510090]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510092]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510093]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510095]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510096]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510098]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510099]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510101]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510102]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510104]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510105]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510107]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510108]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510122]  [<ffffffffc0da2bd6>] ? kvm_arch_vcpu_load+0x46/0x290 [kvm]
[4576846.510124]  [<ffffffff9b6a1de7>] ? finish_task_switch+0x137/0x210
[4576846.510126]  [<ffffffff9bc15ab1>] ? __schedule+0x241/0x6f0
[4576846.510129]  [<ffffffffc0c33ed0>] ? vmx_set_supported_cpuid+0x20/0x20 [kvm_intel]
[4576846.510132]  [<ffffffffc0c31be0>] ? vmx_set_tss_addr+0x130/0x130 [kvm_intel]
[4576846.510134]  [<ffffffff9bc15f92>] ? schedule+0x32/0x80
[4576846.510145]  [<ffffffffc0d8c45a>] ? kvm_vcpu_block+0x8a/0x2f0 [kvm]
[4576846.510158]  [<ffffffffc0da8c8a>] ? kvm_arch_vcpu_ioctl_run+0x44a/0x16d0 [kvm]
[4576846.510160]  [<ffffffff9b814a2b>] ? pipe_write+0x29b/0x3e0
[4576846.510171]  [<ffffffffc0d8e665>] ? kvm_vcpu_ioctl+0x315/0x5e0 [kvm]
[4576846.510173]  [<ffffffff9b820312>] ? do_vfs_ioctl+0xa2/0x620
[4576846.510175]  [<ffffffff9b820904>] ? SyS_ioctl+0x74/0x80
[4576846.510176]  [<ffffffff9b603b7d>] ? do_syscall_64+0x8d/0x100
[4576846.510178]  [<ffffffff9bc1a88e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6

This only seems to fail when under load. I've thought it was 'fixed' about four times, only to have it crash and cause downtime each time. That just bit me again today. I'm ready to just throw this server in the trash.

Andrew added a parent task: Unknown Object (Task).Jul 23 2019, 3:10 PM
Andrew updated the task description. Jul 23 2019, 3:18 PM
RobH added a comment.Jul 23 2019, 8:13 PM
/admin1-> racadm getsel
Record:      1
Date/Time:   05/30/2019 17:38:49
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/22/2019 17:03:54
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/22/2019 19:43:10
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
RobH added a comment.Jul 23 2019, 8:14 PM

Bad DIMM. Chris moved it from B3 to A3:

Swapped DIMM B3 with DIMM A3 and cleared the log.

And now it has the above errors on DIMM A3, so the error followed the module rather than the slot.

@Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, and I'll bring it up during my next sync up meeting with them.

Thanks,
Willy

RobH added a comment (edited). Jul 24 2019, 1:17 PM

@Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, and I'll bring it up during my next sync up meeting with them.
Thanks,
Willy

This should be enough. I'm going to run the memtest, and also put in a self dispatch for new memory to (hopefully) arrive this week.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T13:25:39Z] <robh> rebooting cloudvirt1015 into memtest for dell support repair via T220853

RobH added a comment.Jul 24 2019, 1:47 PM

Ok, this failed with another memory error in the SEL for dimm A3 (the one in question this entire time). I've entered self dispatch SR995043467 with Dell to get a new dimm dispatched.

It should arrive on Thursday or Friday and I can swap it out.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T13:49:08Z] <robh> rebooting cloudvirt1015 into OS, memory error confirmed. new memory replacement dispatch entered via T220853

RobH added a comment.Jul 25 2019, 1:08 AM

Dear Rob Halsell,
Your dispatch shipped on 7/24/2019 7:50 PM
What's Next?

If you need to make any changes to the dispatch contact information, please visit our Support Center or Click Here to chat with a live support representative.
For expedited service to our premium tech agents please use Express Service Code when calling Dell. The Express Service Code is located under your Portables or on the back of desktop.
You may also check for updates via our Online Status page.

Please see below for important information.
Dispatch Number: 713921885
Work Order Number: SR995043467
Waybill Number: 109793257685
Service Tag: 31R9KH2
PO/Reference: T220853

RobH added a comment (edited). Jul 25 2019, 1:11 AM

Parts arrival for Thursday has EQ inbound shipment ticket 1-191287024247.

Mentioned in SAL (#wikimedia-operations) [2019-07-25T13:35:24Z] <robh> cloudvirt1015 offline for ram swap via T220853

RobH added a comment (edited). Jul 25 2019, 1:48 PM

Copying the SEL to this task before I erase it (the dump/clear commands are sketched after the records below):

Record:      1
Date/Time:   07/24/2019 13:23:07
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/24/2019 13:32:47
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/24/2019 13:33:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/25/2019 13:47:03
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/25/2019 13:47:08
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
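For the record, the dump-and-clear itself is just a couple of racadm commands from the iDRAC shell shown earlier (sketched here for reference):

# From the iDRAC racadm shell (the /admin1-> prompt seen above).
racadm getsel   # dump the System Event Log, as pasted in this comment
racadm clrsel   # clear it so only new errors appear after the DIMM swap
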
RobH reassigned this task from Cmjohnson to Andrew (edited). Jul 25 2019, 1:50 PM

@Andrew:

We've swapped out the failed memory DIMM on this system and the new one hasn't reported any errors (as of yet).

Can you return this to service (perhaps with those test VMs you mentioned) and see if any other issues crop up?

Please note: We will either resolve this task soon, or remove the ops-eqiad tag. Either way, we want to clear it off our workboard for onsite tasks. If you want to keep this task open for your own reference, please just remove ops-eqiad.

Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vacation next week.

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T14:06:58Z] <jeh> create 4 testing VMs on cloudvirt1015 T220853

Created these VMs

openstack server list --project testlabs --long -c ID -c Name -c Host| grep cv1015
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |

Each VM has stress-ng running with the following command:

sudo screen -d -m /usr/bin/stress-ng --timeout 600 --fork 4 --cpu 4 --vm 30 --vm-bytes 1G --vm-method all --verify
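
To kick off the same run on all four test VMs at once, something like this works (a sketch only; it assumes the cv1015-testing0N hostnames listed above resolve and that ssh access to them is already set up):

# Sketch: start the identical stress-ng run on each test VM. Hostnames are the
# instances listed above; working ssh access to them is assumed.
for vm in cv1015-testing01 cv1015-testing02 cv1015-testing03 cv1015-testing04; do
  ssh "$vm" 'sudo screen -d -m /usr/bin/stress-ng --timeout 600 --fork 4 --cpu 4 --vm 30 --vm-bytes 1G --vm-method all --verify'
done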

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T14:49:50Z] <jeh> running cpu and ram stress tests on cloudvirt1015 T220853

Cmjohnson closed this task as Resolved.Jul 25 2019, 6:34 PM

@Andrew: Resolving this task (again). If the same issue returns, please reopen; if it's a different issue, please create a new task.

Andrew reopened this task as Open.Jul 27 2019, 10:27 PM
Andrew reassigned this task from Andrew to wiki_willy.

I put this system under a realistic load today (running ~80 VMs) and it crashed before long. I had to reboot it to regain access. I don't see anything in the syslog that presaged the crash...

Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.364 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.376 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.434 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 07:58:01 cloudvirt1015 CRON[41276]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 07:58:20 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:20.504 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Auditing locally available compute resources for node cloudvirt1015.eqiad.wmnet
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.264 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Total usable vcpus: 72, total allocated vcpus: 90
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.265 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.277 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.343 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 07:59:01 cloudvirt1015 CRON[42583]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 07:59:20 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:20.501 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Auditing locally available compute resources for node cloudvirt1015.eqiad.wmnet
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.670 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Total usable vcpus: 72, total allocated vcpus: 90
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.671 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.681 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.742 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 08:00:01 cloudvirt1015 CRON[43916]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'br_netfilter'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'ipmi_devintf'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'nbd'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'iscsi_tcp'

I haven't dug up much else on account of it being Saturday :)

RobH added a comment.Jul 29 2019, 7:27 PM

I don't see any errors in the System Event Log (SEL):

/admin1-> racadm getsel
Record:      1
Date/Time:   07/25/2019 13:49:05
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------

It just has the entry from me clearing the last error after the memory replacement. I don't really see any errors in the comment above either, so I'm going to reboot it (it appears locked up at this time) into the Dell hardware test suite.

RobH added a comment.Jul 29 2019, 7:29 PM

The ePSA (Pre-boot System Assessment) is now running; I will update the task with the results.

wiki_willy reassigned this task from wiki_willy to RobH.Jul 30 2019, 10:51 PM

Assigning to @RobH for results from ePSA pre-boot system assessment, before determining the next steps.

RobH added a comment.Aug 7 2019, 10:13 PM

I neglected to update this, but it passed all Dell ePSA tests without crashing.

If all we have is the log from T220853#5371114, then there really isn't much to go on. I suppose we can insist to our Dell team that they send us a new mainboard, since we've tried everything else.

wiki_willy reassigned this task from RobH to Cmjohnson.Aug 7 2019, 10:17 PM

Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If they give you pushback, let me know and I can try escalating with our account manager.

Thanks,
Willy

Submitted the ticket with Dell. We will see what happens.

You have successfully submitted request SR996138617.

Dell approved my ticket. I talked to the technician today and he will be out Monday morning to replace the motherboard.

Thanks Chris, hopefully this will solve things.

Did the technician replace the mainboard?

Board arrived DOA...need another one

bd808 added a comment.Aug 22 2019, 9:21 PM

Board arrived DOA...need another one

The haunting extends to replacement parts too. Maybe we need to consult an exorcist. ;)

Motherboard replaced; set iDRAC and password.

Cmjohnson closed this task as Resolved.Aug 23 2019, 2:57 PM

Finished the iDRAC setup. On-site work is complete.

Andrew claimed this task.Sep 4 2019, 2:27 PM

I'll see if I can make it crash again!

Mentioned in SAL (#wikimedia-operations) [2019-09-04T14:51:57Z] <andrewbogott> reimaging cloudvirt1015 for T220853

Andrew added a comment.Sep 4 2019, 3:06 PM

btw, @Cmjohnson, did you restore BIOS settings after replacing the board?

Andrew added a comment.Sep 4 2019, 3:11 PM

(I just now enabled virtualization in the BIOS.)

Andrew reopened this task as Open.Sep 4 2019, 8:25 PM
Andrew reassigned this task from Andrew to wiki_willy.

I can still make this crash -- my process is to schedule 80 VMs on the host and then get them all busy, like this:

andrew@labpuppetmaster1001:~$ sudo cumin --force --timeout 500 -o json  "name:stresstest1015" "/usr/bin/stress-ng --fork 4 --cpu 1 --vm 30 --vm-bytes 1G --vm-method all --verify"
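
A quick liveness check over the same cumin host selection (a sketch reusing only the query shown above) shows whether the guests themselves survived even when the hypervisor locked up:

# Sketch: reuse the same cumin selection to see if the stress-test guests respond.
sudo cumin --force --timeout 60 "name:stresstest1015" "uptime"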

I was tailing the syslog during the last crash; it looks like this:

https://phabricator.wikimedia.org/P9042

Meanwhile, the console is very busy (even after the system became unreachable):

https://phabricator.wikimedia.org/P9043

wiki_willy added a comment (edited). Sep 4 2019, 8:54 PM

Hi @Andrew - I mentioned the ongoing issues with this machine to our Dell account rep last week, since we've basically replaced every CPU/DIMM/MB on this box. They mentioned we could install Live Optics to evaluate load, but I'm not sure this is something we want to run on our hardware. Do you have another cloudvirt machine up and running right now on the same hardware specs? Essentially running at the same CPU usage...mainly so we can compare and try to isolate any other type of config differences between them.

Thanks,
Willy

Andrew added a comment.Sep 4 2019, 9:05 PM

@wiki_willy, the parent task of this task is the procurement for four identical systems: cloudvirt1015, 1016, 1017, and 1018. 1018 has had some problems as well, but I don't see a lot of issues for 1016 or 1017 in Phab history.

Thanks @Andrew - I'll reach out to our account rep to see if something else can be done.

Emailed our Dell account rep, who responded that they will look into what our options are and get back to us. Thanks, Willy

wiki_willy reassigned this task from wiki_willy to Cmjohnson.Sep 9 2019, 5:20 PM

Here's the response I got from Dell (pasted below). @Cmjohnson or @Jclark-ctr : can one of you guys call Dell at 1-800-456-3355, explain to them the numerous parts we've already replaced (and that it continues to crash on load) and get them to analyze the logs for the system? Let me know how it goes.

Thanks,
Willy

Here are the cases that were created on behalf of ST 31R9KH2:

SR 996138617 Created 8/14/19
SR 995043467 Created 7/24/19
SR 986941687 Created 2/25/19
SR 955632952 Created 10/23/17
SR 953656459 Created 9/11/17

None of these cases had case owners because they were parts dispatches through our Tech Direct system.

I had a person in our Tech Support team analyze these cases and there is not much to go on because at Dell we didn’t receive logs. Tech Direct system has its advantages and disadvantages. Getting parts such as drives, Psu, Dimms and such quickly are the advantage. The disadvantage is proper troubleshooting doesn’t always occur and some issues get parts thrown at them.

Tech support suggestion is to open a case with an actual person in tech support and have them analyze the logs for the system. This system does have a Basic warranty so your techs would need call 1-800-456-3355, Monday through Friday 7am to 7pm CST (5am to 5pm PST).

What's the status? Was there a reply from Dell?

@wiki_willy, is there any update on this issue? We're still a bit short on capacity due to missing this host and cloudvirt1024.

Hi @Andrew - apologies for the delay. Chris has been out, but @Jclark-ctr is going to follow up on this. Thanks, Willy

Dell EMC SR # 1000122167 || Service Tag: 31R9KH2 || Server Crashes under Load

Opened an SR with Dell and forwarded the TSR report for further diagnostics. The rep advised that the host only has a basic warranty (we do not have ProSupport), so further diagnostics may be required on Dell's part.