
VMs on cloudvirt1015 crashing - bad mainboard/memory
Open, Normal, Public

Description

I put cloudvirt1015 into service on Monday the 8th. Yesterday (the 11th) tools-prometheus-01 crashed with a kernel panic. On Friday the 12th, tools-worker-1023 also crashed.
On Saturday the 13th, @Zppix's puppet-lta.lta-tracker.eqiad.wmflabs crashed.

We've replaced lots of parts in this box, to no avail.

Event Timeline


Change 503501 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Depool cloudvirt1015

https://gerrit.wikimedia.org/r/503501

from tools-worker-1023:

tools-worker-1023 login: [239649.270695] general protection fault: 0000 [#1] SMP
[239649.272361] Modules linked in: veth xfs xt_nat ipt_REJECT nf_reject_ipv4 xt_comment xt_mark br_netfilter xt_addrtype bridge xfrm_user xfrm_algo dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp xt_pkttype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter ip_tables x_tables nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache sch_ingress cls_u32 sch_htb qxl crct10dif_pclmul ttm crc32_pclmul evdev drm_kms_helper ghash_clmulni_intel virtio_console pcspkr serio_raw virtio_balloon drm button ifb act_mirred nf_conntrack autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache dm_mod ata_generic virtio_blk virtio_net crc32c_intel aesni_intel aes_x86_64 glue_helper psmouse lrw gf128mul ablk_helper cryptd ata_piix uhci_hcd virtio_pci libata virtio_ring ehci_hcd i2c_piix4 virtio usbcore scsi_mod usb_common
[239649.302023] CPU: 0 PID: 14120 Comm: kubelet Not tainted 4.9.0-0.bpo.6-amd64 #1 Debian 4.9.88-1+deb9u1~bpo8+1
[239649.304368] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1 04/01/2014
[239649.306306] task: ffff90b8e1349140 task.stack: ffffb9e482fcc000
[239649.307743] RIP: 0010:[<ffffffff82ca3b2b>]  [<ffffffff82ca3b2b>] sched_set_stop_task+0x1b/0xb0
[239649.309850] RSP: 0018:ffffb9e482fcf880  EFLAGS: 00010046
[239649.311168] RAX: 0000d9f5b19b36ff RBX: ffff90b93fc18ec0 RCX: 0000000000000000
[239649.312867] RDX: 0000000000000001 RSI: ffff90b8e1349140 RDI: ffff90b93fff1000
[239649.314491] RBP: ffffb9e482fcf8e8 R08: 000000000000060a R09: ffffb9e482fcfd90
[239649.316132] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90b8e1349140
[239649.317764] R13: 0000000000018ec0 R14: ffff90b93fc18ec0 R15: ffff90b8e13496f8
[239649.319382] FS:  00007f8a818a6700(0000) GS:ffff90b93fc00000(0000) knlGS:0000000000000000
[239649.321234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[239649.322543] CR2: 000000c82189b000 CR3: 0000000201c9c000 CR4: 0000000000360670
[239649.324191] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[239649.325707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[239649.327202] Stack:
[239649.327661]  ffff90b8e1349140 000000013fc18ec0 ffff90b93fc18ec0 ffffffff8320eb47
[239649.329355]  ffffffff82ceaa9a 0000000000000000 0000000000014d80 0000000000000000
[239649.331026]  ffff90b8e1349140 000000000000c350 0000000000000000 0000000000000000
[239649.332697] Call Trace:
[239649.333254]  [<ffffffff8320eb47>] ? __schedule+0x417/0x6f0
[239649.334417]  [<ffffffff82ceaa9a>] ? hrtimer_start_range_ns+0x1aa/0x3d0
[239649.335808]  [<ffffffff8320ee52>] ? schedule+0x32/0x80
[239649.336830]  [<ffffffff832127b5>] ? schedule_hrtimeout_range_clock+0xa5/0x130
[239649.338232]  [<ffffffff82cea660>] ? hrtimer_init+0x110/0x110
[239649.339355]  [<ffffffff82e1ce26>] ? poll_schedule_timeout+0x46/0x70
[239649.340608]  [<ffffffff82e1d829>] ? do_select+0x579/0x7e0
[239649.341683]  [<ffffffff82cad173>] ? account_entity_dequeue+0xa3/0xc0
[239649.342937]  [<ffffffff82cb0a60>] ? dequeue_entity+0xe0/0x400
[239649.344086]  [<ffffffff82cb1867>] ? put_prev_entity+0x47/0x850
[239649.345241]  [<ffffffff82c257c9>] ? __switch_to+0x2c9/0x730
[239649.346346]  [<ffffffff8320e97b>] ? __schedule+0x24b/0x6f0
[239649.347434]  [<ffffffff8320ee52>] ? schedule+0x32/0x80
[239649.348456]  [<ffffffff82cfa44b>] ? futex_wait_queue_me+0xeb/0x140
[239649.349612]  [<ffffffff82cfb402>] ? futex_wait+0x232/0x260
[239649.350641]  [<ffffffff82ca4904>] ? try_to_wake_up+0x54/0x390
[239649.351721]  [<ffffffff82e1dc61>] ? core_sys_select+0x1d1/0x2d0
[239649.352877]  [<ffffffff82cfce17>] ? do_futex+0x2b7/0xb40
[239649.353923]  [<ffffffff82c59eda>] ? kvm_clock_get_cycles+0x1a/0x20
[239649.355095]  [<ffffffff82cefbf6>] ? ktime_get_ts64+0x46/0xf0
[239649.356180]  [<ffffffff82c59eda>] ? kvm_clock_get_cycles+0x1a/0x20
[239649.357420]  [<ffffffff82cefbf6>] ? ktime_get_ts64+0x46/0xf0
[239649.358489]  [<ffffffff82e1de1a>] ? SyS_select+0xba/0x110
[239649.359521]  [<ffffffff82c03bd1>] ? do_syscall_64+0x91/0x1a0
[239649.360626]  [<ffffffff832137ce>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[239649.361999] Code: 34 24 eb 91 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 41 54 49 89 f4 55 53 48 63 ef 48 c7 c3 c0 8e 01 00 48 83 ec <10> 48 8b 14 ed c0 e4 66 83 65 48 8b 04 25 28 00 00 00 48 89 44 
[239649.367173] RIP  [<ffffffff82ca3b2b>] sched_set_stop_task+0x1b/0xb0
[239649.368355]  RSP <ffffb9e482fcf880>

Change 503501 merged by Andrew Bogott:
[operations/puppet@production] Depool cloudvirt1015

https://gerrit.wikimedia.org/r/503501

Zppix moved this task from Backlog to Monitor on the User-Zppix board.
Zppix added a subscriber: Zppix.

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T18:25:56Z] <bd808> Restarted nova-compute service on cloudvirt1015 (T220853)

$ nova list --all-tenants --host cloudvirt1015
+--------------------------------------+----------------------------+-----------------+--------+------------+-------------+----------------------------------------+
| ID                                   | Name                       | Tenant ID       | Status | Task State | Power State | Networks                               |
+--------------------------------------+----------------------------+-----------------+--------+------------+-------------+----------------------------------------+
| bf38a115-ec86-4f72-8593-26e7d480cb1f | canary1015-01              | testlabs        | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.3.113 |
| 5db3a3ba-2137-4b32-9205-d42a6655135f | product-analytics-bayes    | discovery-stats | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.98  |
| a8f427fe-258e-48fa-92e4-afc3385736ae | product-analytics-test     | discovery-stats | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.107 |
| 8b764eb9-2dca-4902-a9c5-ed54fa3fc57d | puppet-lta                 | lta-tracker     | ERROR  | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.132 |
| adcfb94d-be6f-4d90-8d16-8fa6bdbb2419 | tools-clushmaster-02       | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.183 |
| 3077fa2a-853d-472b-871b-8938ba92982b | tools-elastic-02           | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.74  |
| de3e7366-f6f0-464e-a682-b3d615b97127 | tools-flannel-etcd-02      | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.145 |
| cdd9a533-0f70-4a0e-baf2-4cc13b1138e8 | tools-k8s-etcd-03          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.149 |
| 6c516b3c-8aa0-425b-ac02-5b14a5a089c9 | tools-paws-worker-1002     | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.188 |
| 581bd80b-60b0-4b17-acba-85cf1a29729e | tools-prometheus-01        | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.194 |
| 9911d41c-076d-4d25-b1eb-cd84e6175f6e | tools-worker-1014          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.3.94  |
| e875c8e9-155c-40ca-8e19-a5e60a52c742 | tools-worker-1015          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.3.95  |
| b2682ab1-fa37-4463-86be-71dfe553a101 | tools-worker-1016          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.12  |
| c9188b8f-b1a2-4946-9568-006bdf8be270 | tools-worker-1017          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.125 |
| 351e3410-6b7a-4e42-81b1-4cc5e44fba89 | tools-worker-1020          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.158 |
| 7e99f250-c643-475b-b67d-183a24e18c8d | tools-worker-1021          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.192 |
| 9b07f16f-0080-48e7-8a0f-ee9dfcd350d1 | tools-worker-1023          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.196 |
| a03c8dd0-7d34-4f58-93f3-0f840a5725bd | toolsbeta-flannel-etcd-01  | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.177 |
| db180047-318c-45fe-bd92-cca4c154ad84 | toolsbeta-k8s-etcd-01      | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.170 |
| 268ff37d-f5eb-4cbc-9c28-6782f3a94f50 | toolsbeta-paws-master-01   | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.187 |
| 496f5c76-ee3d-49d1-a8db-20bf33b30153 | toolsbeta-paws-worker-1002 | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.181 |
| 7fef05bd-a384-45a4-9d2a-8d15e6423e32 | toolsbeta-proxy-01         | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.18  |
| 7d7e96a7-59bb-4cc6-bd32-17560c4c500c | toolsbeta-services-01      | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.168 |
| ba3a0660-fb86-4fd7-86ee-a8e28a737ad0 | toolsbeta-worker-1001      | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.179 |
+--------------------------------------+----------------------------+-----------------+--------+------------+-------------+----------------------------------------+
Krenair updated the task description. Apr 13 2019, 6:56 PM
[18:29]  <    bd808>	!log admin Rebooting cloudvirt1015 (T220853)

This reboot hung on shutdown. @Bstorm looked via serial console and found:

[***   ] A stop job is running for Suspend/R…bvirt Guests (8min 38s / no limit)[681756.640740] INFO: rcu_sched detected stalls on CPUs/tasks:
[681756.646977]         52-...: (0 ticks this GP) idle=bea/0/0 softirq=7083989/7083989 fqs=0
[681756.655525]         53-...: (694 GPs behind) idle=40e/0/0 softirq=2166089/2166094 fqs=0
[681756.663975]         55-...: (929 GPs behind) idle=032/0/0 softirq=2253600/2253602 fqs=0
[681756.672426]         57-...: (1043 GPs behind) idle=432/0/0 softirq=2199382/2199382 fqs=0
[681756.680971]         59-...: (780 GPs behind) idle=5fc/0/0 softirq=2237008/2237011 fqs=0
[681756.689420]         (detected by 18, t=5264 jiffies, g=17526939, c=17526938, q=588632)
[681756.697853] rcu_sched kthread starved for 5266 jiffies! g17526939 c17526938 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  *** ] A stop job is running for Suspend/R…ibvirt Guests (9min 7s / no limit)
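
For context, the "Suspend/Resume Running libvirt Guests" stop job is the libvirt-guests systemd unit trying to deal with every running VM before the host shuts down; with no timeout configured it can hang indefinitely when the host is already wedged, which matches what the serial console showed. A minimal sketch of the relevant settings, assuming the Debian default config path (values here are illustrative, not necessarily what cloudvirt1015 runs):

# /etc/default/libvirt-guests (illustrative values)
ON_SHUTDOWN=shutdown      # shut guests down instead of suspending them to disk
SHUTDOWN_TIMEOUT=300      # give up after 300s rather than waiting with no limit
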
Krenair added a subscriber: mpopov. (Edited) Apr 13 2019, 6:57 PM

FYI @mpopov
Apr 12 22:12:25 <andrewbogott> bearloga: I've just seen two VMs crash in a row on cloudvirt1015, so I'm a bit worried that the processor there is making mistakes (we recently replaced a processor there and I thought it was fixed but now I'm not so sure…) product-analytics-bayes and product-analytics-test are on that host.
Apr 12 22:13:02 <andrewbogott> To be on the safe side I can move them, or you could delete and rebuild if that's easy for you. cloudvirt1015 is depooled now so if you recreate they'll get scheduled someplace else
Apr 12 22:13:05 <andrewbogott> lmk your preference

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T19:51:56Z] <bstorm_> started migrating tools-flannel-etcd-02 to cloudvirt1013 T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T19:58:50Z] <bstorm_> started migrating tools-k8s-etcd-03 to cloudvirt1012 T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:17:43Z] <bstorm_> migrated product-analytics-bayes to cloudvirt1009 for T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:28:39Z] <bstorm_> migrated product-analytics-test to cloudvirt1009 for T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:36:14Z] <bstorm_> moving tools-elastic-02 to cloudvirt1009 for T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:43:59Z] <bstorm_> Moving puppet-lta to a new server because of hardware problems T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T21:08:05Z] <bstorm_> Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for T220853

After forcing a powercycle from the console, I have been evacuating the more important things. Once the last three migrations finish, what's left will be some k8s workers (plus one paws worker) and a bunch of toolsbeta stuff. I'll leave those there for testing and such, but the tools-workers are depooled at this time (most are also still down after the reboot). The paws worker is off as well.

The downtime will expire soon. After that, the three VMs that are up right now should keep it from sending pages. I figure the rest is probably OK to stay on the server until Monday, unless someone wants to move the toolsbeta things, which wouldn't hurt in case the server gets worse.

A more complete version of what I copied out of the console before I rebooted can be found at P8394
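
The exact commands behind those SAL migration entries aren't recorded here; with the nova client used elsewhere in this task, evacuating an instance to another hypervisor typically looks something like the following (instance and target host names are examples, and whether block migration or shared storage applies is an assumption):

$ nova live-migration --block-migrate tools-flannel-etcd-02 cloudvirt1013   # live-migrate a running VM, copying its local disk
$ nova migrate puppet-lta                                                    # cold-migrate a VM that is already shut down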

Bstorm added a comment. (Edited) Apr 13 2019, 9:45 PM

From dmesg:

[ 5829.292669] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 6289.446681] brq7425e328-56: port 4(tap1f652fcb-db) entered disabled state
[ 6289.447040] device tap1f652fcb-db left promiscuous mode
[ 6289.447044] brq7425e328-56: port 4(tap1f652fcb-db) entered disabled state
[ 6609.259082] perf: interrupt took too long (3146 > 3136), lowering kernel.perf_event_max_sample_rate to 63500
[ 6871.260131] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 45.361 msecs
[ 6871.260138] perf: interrupt took too long (354548 > 3932), lowering kernel.perf_event_max_sample_rate to 500
[ 6871.327471] INFO: NMI handler (ghes_notify_nmi) took too long to run: 112.076 msecs
[ 6872.646908] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[ 6872.646911] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 6872.646913] {1}[Hardware Error]: event severity: corrected
[ 6872.646915] {1}[Hardware Error]:  Error 0, type: corrected
[ 6872.646917] {1}[Hardware Error]:  fru_text: B3
[ 6872.646918] {1}[Hardware Error]:   section_type: memory error
[ 6872.646920] {1}[Hardware Error]:   error_status: 0x0000000000000400
[ 6872.646921] {1}[Hardware Error]:   physical_address: 0x00000077b4ea7800
[ 6872.646925] {1}[Hardware Error]:   node: 1 card: 2 module: 0 rank: 0 bank: 2 row: 45906 column: 992
[ 6872.646927] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[ 6872.646949] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 6872.646951] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[ 6872.646952] EDAC sbridge MC0: TSC eb28091adcc
[ 6872.646954] EDAC sbridge MC0: ADDR 77b4ea7800
[ 6872.646956] EDAC sbridge MC0: MISC 0
[ 6872.646958] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1555188932 SOCKET 0 APIC 0
[ 6872.646982] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x77b4ea7 offset:0x800 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:0)
[ 7164.888310] mce: [Hardware Error]: Machine check events logged
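
The EDAC lines above already attribute the corrected single-bit error to a specific DIMM; a quick way to corroborate and watch the error rate, assuming edac-utils is installed (sysfs paths vary a bit by kernel version):

$ edac-util -v                                              # per-controller/per-DIMM corrected and uncorrected error counts
$ grep . /sys/devices/system/edac/mc/mc*/csrow*/ce_count    # raw corrected-error counters straight from sysfs
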
Andrew added subscribers: Cmjohnson, RobH.

@RobH and/or @Cmjohnson, I'm hoping the above is enough to pass on to Dell for a replacement part. Let us know if you need more details.

Restricted Application added a project: Operations. Apr 13 2019, 9:53 PM

I finished draining cloudvirt1015 and put it in downtime, so it's ready for whatever reboots/rebuilds/hardware changes might be needed.

Andrew updated the task description. Apr 15 2019, 3:08 PM
bd808 updated the task description. Apr 15 2019, 3:10 PM

I just deleted product-analytics-test and product-analytics-bayes so y'all don't need to worry about those instances :)

colewhite triaged this task as Normal priority. Apr 16 2019, 6:07 PM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Apr 16 2019, 6:14 PM

*bump* Chris, do you have any thoughts about what we should do next here?

Mentioned in SAL (#wikimedia-operations) [2019-05-15T20:20:29Z] <robh> rebooting cloudvirt1015 into dell hardware tests per T220853

RobH added a comment. May 15 2019, 8:41 PM

Ok, so this has had CPU issues from the get-go, tracked on both T215012 and T171473. It seems that the CPUs have been swapped, but not the mainboard. Considering it's throwing CPU errors after all the CPU swaps, I advise we swap the mainboard next.

I'm attempting to run Dell hw test to get a failure code.

RobH claimed this task. May 15 2019, 8:42 PM
RobH reassigned this task from RobH to Cmjohnson. May 15 2019, 8:50 PM
RobH moved this task from Cloud Tasks to Hardware Failure / Troubleshoot on the ops-eqiad board.

Error output:

The event log indicates degraded or disabled ECC functionality.  Memory testing cannot continue until the problems are corrected, the log cleared and the system rebooted.

Technical Support will need this information to diagnose the problem.
Please record the information below.

Service Tag : 31R9KH2
Error Code : 2000-0125                                      
Validation : 126785

The error logs in question are:

-------------------------------------------------------------------------------
Record:      10
Date/Time:   05/15/2019 20:41:12
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   05/15/2019 20:41:45
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------

So either the memory is bad, or the mainboard is bad. Since the mainboard is also throwing CPU errors (see above), we should try to get both the mainboard and the memory DIMM replaced.

RobH renamed this task from VMs on cloudvirt1015 crashing to VMs on cloudvirt1015 crashing - bad mainboard/memory. May 15 2019, 8:50 PM
mpopov removed a subscriber: mpopov. May 17 2019, 4:57 PM

Swapped DIMM B3 with DIMM A3 and cleared the log.
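
"Cleared the log" here refers to the iDRAC System Event Log; for reference, the corresponding racadm commands (run from the mgmt console, as shown later in this task) are:

/admin1-> racadm getsel    # dump the System Event Log
/admin1-> racadm clrsel    # clear it after the swap so any new DIMM errors stand out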

Andrew added a comment. Jun 5 2019, 6:23 PM

Any update about this? Are parts on the way?

@Andrew what parts? There is nothing that suggests it is a CPU problem on the server side of things. I reseated and moved the DIMM and that error has not returned. It may very well have been a poorly seated DIMM. I checked dmesg and do not see any more errors related to memory or CPU. Try putting it back into production and let's see if anything comes back. Unfortunately, I need to demonstrate and prove there is a problem for Dell to do anything, and right now I do not have anything to give them.

I've put eight test VMs on 1015 and will let them run for a few days, then see if they're still up :)
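
The exact commands for those test VMs aren't in the log; pinning an instance to a specific hypervisor as an admin usually looks roughly like this (flavor, image, and instance name are placeholders):

$ openstack server create --flavor m1.small --image debian-9.8-stretch --availability-zone nova:cloudvirt1015 canary1015-02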

Mentioned in SAL (#wikimedia-cloud) [2019-06-17T13:59:22Z] <andrewbogott> moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853 )

Cmjohnson closed this task as Resolved. Jul 11 2019, 5:47 PM

I am resolving this task.

aborrero reopened this task as Open. Mon, Jul 22, 4:56 PM
aborrero added a subscriber: aborrero.

The server just died again. I found this in the mgmt console:

[4576846.406213] Code: 00 75 73 48 83 c4 38 5b 5d c3 48 8d 74 24 10 48 89 d1 89 df 48 89 ea e8 28 fe ff ff 8b 54 24 28 83 e2 01 74 0b f3 90 8b 54 24 28 <83> e2 01 75 f5 eb c1 8b 05 c4 9a db 00 85 c0 75 83 80 3d 94 49 
[4576846.509999] NMI watchdog: BUG: soft lockup - CPU#54 stuck for 22s! [CPU 1/KVM:72246]
[4576846.510028] Modules linked in: cpuid xt_multiport nf_conntrack_netlink ebt_arp ebt_among ip6table_raw ip6table_mangle nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat xt_connmark iptable_mangle xt_mac xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_physdev xt_set xt_conntrack nf_conntrack ip_set_hash_net ip_set nfnetlink vhost_net vhost macvtap macvlan tun binfmt_misc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iptable_raw 8021q garp mrp xfs intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas kvm irqbypass mgag200 crct10dif_pclmul ttm crc32_pclmul drm_kms_helper ghash_clmulni_intel drm i2c_algo_bit sg pcspkr mei_me lpc_ich mfd_core mei evdev shpchp ipmi_si wmi button ib_iser
[4576846.510050]  rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nbd ipmi_devintf ipmi_msghandler br_netfilter bridge stp llc ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear md_mod dm_mod sd_mod aesni_intel aes_x86_64 ehci_pci glue_helper ahci ehci_hcd bnx2x lrw gf128mul libahci ablk_helper ptp cryptd pps_core libata megaraid_sas mdio usbcore libcrc32c crc32c_generic usb_common crc32c_intel scsi_mod
[4576846.510052] CPU: 54 PID: 72246 Comm: CPU 1/KVM Tainted: G      D      L  4.9.0-9-amd64 #1 Debian 4.9.168-1+deb9u2
[4576846.510053] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018
[4576846.510054] task: ffffa0bed38c40c0 task.stack: ffffabedf4fe4000
[4576846.510057] RIP: 0010:[<ffffffff9b6feaa7>]  [<ffffffff9b6feaa7>] smp_call_function_single+0xd7/0x130
[4576846.510058] RSP: 0018:ffffabedf4fe7ab0  EFLAGS: 00000202
[4576846.510059] RAX: 0000000000000000 RBX: 0000000000000026 RCX: ffffabedf4a5fac0
[4576846.510060] RDX: 0000000000000003 RSI: ffffabedf4fe7ac0 RDI: ffffabedf4fe7ac0
[4576846.510061] RBP: ffffffffc0c35620 R08: 001042497cc14308 R09: 0000000000000014
[4576846.510062] R10: 0000000000000cb7 R11: 0000000000000ef8 R12: 0000000000000026
[4576846.510063] R13: ffffa0cca1528040 R14: ffffa0eb7816d400 R15: 0000000000000036
[4576846.510064] FS:  00007f6fb1728700(0000) GS:ffffa0bcbfac0000(0000) knlGS:0000000000000000
[4576846.510065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4576846.510066] CR2: 00007f4e3b13a398 CR3: 00000056c771e000 CR4: 0000000000362670
[4576846.510067] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4576846.510068] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[4576846.510069] Stack:
[4576846.510071]  0000001a0000000c 0000000100000024 ffffabedf4a5fac0 ffffffffc0c35620
[4576846.510073]  ffffa0cca152c5d8 0000000000000003 d38dee9ce6e66bab 0000000000000036
[4576846.510075]  ffffabedf4fe7c00 ffffffffc0c34abf 0000000000000000 0000000000000016
[4576846.510075] Call Trace:
[4576846.510079]  [<ffffffffc0c35620>] ? update_debugctlmsr+0x20/0x20 [kvm_intel]
[4576846.510083]  [<ffffffffc0c34abf>] ? vmx_vcpu_load+0x9f/0x350 [kvm_intel]
[4576846.510085]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510086]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510088]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510089]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510090]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510092]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510093]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510095]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510096]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510098]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510099]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510101]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510102]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510104]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510105]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510107]  [<ffffffff9bc1a9b0>] ? __switch_to_asm+0x40/0x70
[4576846.510108]  [<ffffffff9bc1a9a4>] ? __switch_to_asm+0x34/0x70
[4576846.510122]  [<ffffffffc0da2bd6>] ? kvm_arch_vcpu_load+0x46/0x290 [kvm]
[4576846.510124]  [<ffffffff9b6a1de7>] ? finish_task_switch+0x137/0x210
[4576846.510126]  [<ffffffff9bc15ab1>] ? __schedule+0x241/0x6f0
[4576846.510129]  [<ffffffffc0c33ed0>] ? vmx_set_supported_cpuid+0x20/0x20 [kvm_intel]
[4576846.510132]  [<ffffffffc0c31be0>] ? vmx_set_tss_addr+0x130/0x130 [kvm_intel]
[4576846.510134]  [<ffffffff9bc15f92>] ? schedule+0x32/0x80
[4576846.510145]  [<ffffffffc0d8c45a>] ? kvm_vcpu_block+0x8a/0x2f0 [kvm]
[4576846.510158]  [<ffffffffc0da8c8a>] ? kvm_arch_vcpu_ioctl_run+0x44a/0x16d0 [kvm]
[4576846.510160]  [<ffffffff9b814a2b>] ? pipe_write+0x29b/0x3e0
[4576846.510171]  [<ffffffffc0d8e665>] ? kvm_vcpu_ioctl+0x315/0x5e0 [kvm]
[4576846.510173]  [<ffffffff9b820312>] ? do_vfs_ioctl+0xa2/0x620
[4576846.510175]  [<ffffffff9b820904>] ? SyS_ioctl+0x74/0x80
[4576846.510176]  [<ffffffff9b603b7d>] ? do_syscall_64+0x8d/0x100
[4576846.510178]  [<ffffffff9bc1a88e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6

This only seems to fail under load. I've thought it was 'fixed' about four times, only to have it crash and cause downtime each time. That just bit me again today. I'm ready to just throw this server in the trash.

Andrew added a parent task: Unknown Object (Task). Tue, Jul 23, 3:10 PM
Andrew updated the task description. Tue, Jul 23, 3:18 PM
RobH added a comment. Tue, Jul 23, 8:13 PM
/admin1-> racadm getsel
Record:      1
Date/Time:   05/30/2019 17:38:49
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/22/2019 17:03:54
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/22/2019 19:43:10
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
RobH added a comment. Tue, Jul 23, 8:14 PM

Bad DIMM. Chris moved it from B3 to A3:

Swapped DIMM B3 with DIMM A3 and cleared the log.

And now it has the above errors on DIMM A3.

@Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, and I'll bring it up during my next sync up meeting with them.

Thanks,
Willy

RobH added a comment. (Edited) Wed, Jul 24, 1:17 PM

@Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, and I'll bring it up during my next sync up meeting with them.
Thanks,
Willy

This should be enough. I'm going to run the memtest, and also put in a self-dispatch for new memory to (hopefully) arrive this week.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T13:25:39Z] <robh> rebooting cloudvirt1015 into memtest for dell support repair via T220853

RobH added a comment. Wed, Jul 24, 1:47 PM

Ok, this failed with another memory error in the SEL for DIMM A3 (the one in question this entire time). I've entered self-dispatch SR995043467 with Dell to get a new DIMM dispatched.

It should arrive on Thursday or Friday and I can swap it out.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T13:49:08Z] <robh> rebooting cloudvirt1015 into OS, memory error confirmed. new memory replacement dispatch entered via T220853

RobH added a comment. Thu, Jul 25, 1:08 AM

Dear Rob Halsell,
Your dispatch shipped on 7/24/2019 7:50 PM
What's Next?

If you need to make any changes to the dispatch contact information, please visit our Support Center or Click Here to chat with a live support representative.
For expedited service to our premium tech agents please use Express Service Code when calling Dell. The Express Service Code is located under your Portables or on the back of desktop.
You may also check for updates via our Online Status page.

Please see below for important information.
Dispatch Number: 713921885
Work Order Number: SR995043467
Waybill Number: 109793257685
Service Tag: 31R9KH2
PO/Reference: T220853

RobH added a comment. (Edited) Thu, Jul 25, 1:11 AM

Parts arrival for Thursday has EQ inbound shipment ticket 1-191287024247.

Mentioned in SAL (#wikimedia-operations) [2019-07-25T13:35:24Z] <robh> cloudvirt1015 offline for ram swap via T220853

RobH added a comment. (Edited) Thu, Jul 25, 1:48 PM

Copying the SEL to this task before I erase it:

Record:      1
Date/Time:   07/24/2019 13:23:07
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/24/2019 13:32:47
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/24/2019 13:33:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/25/2019 13:47:03
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/25/2019 13:47:08
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
RobH reassigned this task from Cmjohnson to Andrew. (Edited) Thu, Jul 25, 1:50 PM

@Andrew:

We've swapped out the failed memory DIMM on this system and the new one hasn't reported any errors (as of yet).

Can you return this to service (perhaps with those test VMs you mention) and see if any other issues crop up?

Please note: We will either resolve this task soon, or remove the ops-eqiad tag. Either way, we want to clear it off our workboard for onsite tasks. If you want to keep this task open for your own reference, please just remove ops-eqiad.

Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vacation next week.

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T14:06:58Z] <jeh> create 4 testing VMs on cloudvirt1015 T220853

Created these VMs:

$ openstack server list --project testlabs --long -c ID -c Name -c Host | grep cv1015
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |

Each VM has stress-ng running with the command:

sudo screen -d -m /usr/bin/stress-ng --timeout 600 --fork 4 --cpu 4 --vm 30 --vm-bytes 1G --vm-method all --verify
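
For context on those flags: 4 fork workers, 4 CPU workers, and 30 vm workers each cycling 1 GB through every vm stressor method for 600 seconds, with --verify making each worker check the data it wrote so bad memory shows up as an explicit stress-ng failure rather than a silent bit flip. A quick way to check the outcome afterwards (a sketch; assumes the detached screen sessions are still around):

$ sudo screen -ls            # list the detached stress-ng sessions
$ sudo screen -r <session>   # reattach and look for verify failures in the stress-ng summary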

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T14:49:50Z] <jeh> running cpu and ram stress tests on cloudvirt1015 T220853

Cmjohnson closed this task as Resolved. Thu, Jul 25, 6:34 PM

@Andrew Resolving this task (again). If the same issue returns, please reopen. If it's a different issue, please create a new task.

Andrew reopened this task as Open. Sat, Jul 27, 10:27 PM
Andrew reassigned this task from Andrew to wiki_willy.

I put this system under a realistic load today (running ~80 VMs) and it crashed before long. I had to reboot in order to get access. I don't see anything in the syslog that presaged the crash...

Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.364 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.376 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.434 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 07:58:01 cloudvirt1015 CRON[41276]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 07:58:20 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:20.504 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Auditing locally available compute resources for node cloudvirt1015.eqiad.wmnet
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.264 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Total usable vcpus: 72, total allocated vcpus: 90
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.265 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.277 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.343 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 07:59:01 cloudvirt1015 CRON[42583]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 07:59:20 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:20.501 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Auditing locally available compute resources for node cloudvirt1015.eqiad.wmnet
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.670 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Total usable vcpus: 72, total allocated vcpus: 90
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.671 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.681 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.742 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 08:00:01 cloudvirt1015 CRON[43916]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'br_netfilter'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'ipmi_devintf'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'nbd'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'iscsi_tcp'

I haven't dug up much else on account of it being Saturday :)

RobH added a comment. Mon, Jul 29, 7:27 PM

I don't see any errors in the System Event Log:

/admin1-> racadm getsel
Record:      1
Date/Time:   07/25/2019 13:49:05
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------

It just has the entry from when I cleared it after the memory replacement. I don't really see any errors in the above comment either, so I'm going to reboot it (it appears locked up at this time) into the Dell hardware test suite.

RobH added a comment. Mon, Jul 29, 7:29 PM

The ePSA Pre-boot System Assessment is now running; I'll update the task with results.

wiki_willy reassigned this task from wiki_willy to RobH. Tue, Jul 30, 10:51 PM

Assigning to @RobH for the results of the ePSA pre-boot system assessment, before determining next steps.

RobH added a comment. Wed, Aug 7, 10:13 PM

I neglected to update this, but it passed all Dell ePSA tests without crashing.

If all we have is the log from T220853#5371114, then it really isn't much to go on. I suppose we can insist to our Dell team that they send us a new mainboard, since we've tried everything else.

wiki_willy reassigned this task from RobH to Cmjohnson. Wed, Aug 7, 10:17 PM

Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If they give you pushback, let me know and I can try escalating with our account manager.

Thanks,
Willy

Submitted the ticket with Dell. We will see what happens.

You have successfully submitted request SR996138617.

Dell approved my ticket. I talked to the technician today and he will be out Monday morning to replace the motherboard.

Thanks Chris, hopefully this will solve things.