
VMs on cloudvirt1015 crashing - bad mainboard/memory
Open, Normal, Public

Description

I put cloudvirt1015 into service on Monday the 8th. Yesterday (the 11th) tools-prometheus-01 crashed with a kernel panic. On Friday the 12th, tools-worker-1023 also crashed.
On Saturday the 13th, @Zppix's puppet-lta.lta-tracker.eqiad.wmflabs crashed.

We've replaced lots of parts in this box, to no avail.

Event Timeline

Andrew created this task. · Apr 12 2019, 8:50 PM
Restricted Application added a subscriber: Aklapper. · Apr 12 2019, 8:50 PM

Change 503501 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Depool cloudvirt1015

https://gerrit.wikimedia.org/r/503501

From tools-worker-1023:

tools-worker-1023 login: [239649.270695] general protection fault: 0000 [#1] SMP
[239649.272361] Modules linked in: veth xfs xt_nat ipt_REJECT nf_reject_ipv4 xt_comment xt_mark br_netfilter xt_addrtype bridge xfrm_user xfrm_algo dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp xt_pkttype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter ip_tables x_tables nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache sch_ingress cls_u32 sch_htb qxl crct10dif_pclmul ttm crc32_pclmul evdev drm_kms_helper ghash_clmulni_intel virtio_console pcspkr serio_raw virtio_balloon drm button ifb act_mirred nf_conntrack autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache dm_mod ata_generic virtio_blk virtio_net crc32c_intel aesni_intel aes_x86_64 glue_helper psmouse lrw gf128mul ablk_helper cryptd ata_piix uhci_hcd virtio_pci libata virtio_ring ehci_hcd i2c_piix4 virtio usbcore scsi_mod usb_common
[239649.302023] CPU: 0 PID: 14120 Comm: kubelet Not tainted 4.9.0-0.bpo.6-amd64 #1 Debian 4.9.88-1+deb9u1~bpo8+1
[239649.304368] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1 04/01/2014
[239649.306306] task: ffff90b8e1349140 task.stack: ffffb9e482fcc000
[239649.307743] RIP: 0010:[<ffffffff82ca3b2b>]  [<ffffffff82ca3b2b>] sched_set_stop_task+0x1b/0xb0
[239649.309850] RSP: 0018:ffffb9e482fcf880  EFLAGS: 00010046
[239649.311168] RAX: 0000d9f5b19b36ff RBX: ffff90b93fc18ec0 RCX: 0000000000000000
[239649.312867] RDX: 0000000000000001 RSI: ffff90b8e1349140 RDI: ffff90b93fff1000
[239649.314491] RBP: ffffb9e482fcf8e8 R08: 000000000000060a R09: ffffb9e482fcfd90
[239649.316132] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90b8e1349140
[239649.317764] R13: 0000000000018ec0 R14: ffff90b93fc18ec0 R15: ffff90b8e13496f8
[239649.319382] FS:  00007f8a818a6700(0000) GS:ffff90b93fc00000(0000) knlGS:0000000000000000
[239649.321234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[239649.322543] CR2: 000000c82189b000 CR3: 0000000201c9c000 CR4: 0000000000360670
[239649.324191] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[239649.325707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[239649.327202] Stack:
[239649.327661]  ffff90b8e1349140 000000013fc18ec0 ffff90b93fc18ec0 ffffffff8320eb47
[239649.329355]  ffffffff82ceaa9a 0000000000000000 0000000000014d80 0000000000000000
[239649.331026]  ffff90b8e1349140 000000000000c350 0000000000000000 0000000000000000
[239649.332697] Call Trace:
[239649.333254]  [<ffffffff8320eb47>] ? __schedule+0x417/0x6f0
[239649.334417]  [<ffffffff82ceaa9a>] ? hrtimer_start_range_ns+0x1aa/0x3d0
[239649.335808]  [<ffffffff8320ee52>] ? schedule+0x32/0x80
[239649.336830]  [<ffffffff832127b5>] ? schedule_hrtimeout_range_clock+0xa5/0x130
[239649.338232]  [<ffffffff82cea660>] ? hrtimer_init+0x110/0x110
[239649.339355]  [<ffffffff82e1ce26>] ? poll_schedule_timeout+0x46/0x70
[239649.340608]  [<ffffffff82e1d829>] ? do_select+0x579/0x7e0
[239649.341683]  [<ffffffff82cad173>] ? account_entity_dequeue+0xa3/0xc0
[239649.342937]  [<ffffffff82cb0a60>] ? dequeue_entity+0xe0/0x400
[239649.344086]  [<ffffffff82cb1867>] ? put_prev_entity+0x47/0x850
[239649.345241]  [<ffffffff82c257c9>] ? __switch_to+0x2c9/0x730
[239649.346346]  [<ffffffff8320e97b>] ? __schedule+0x24b/0x6f0
[239649.347434]  [<ffffffff8320ee52>] ? schedule+0x32/0x80
[239649.348456]  [<ffffffff82cfa44b>] ? futex_wait_queue_me+0xeb/0x140
[239649.349612]  [<ffffffff82cfb402>] ? futex_wait+0x232/0x260
[239649.350641]  [<ffffffff82ca4904>] ? try_to_wake_up+0x54/0x390
[239649.351721]  [<ffffffff82e1dc61>] ? core_sys_select+0x1d1/0x2d0
[239649.352877]  [<ffffffff82cfce17>] ? do_futex+0x2b7/0xb40
[239649.353923]  [<ffffffff82c59eda>] ? kvm_clock_get_cycles+0x1a/0x20
[239649.355095]  [<ffffffff82cefbf6>] ? ktime_get_ts64+0x46/0xf0
[239649.356180]  [<ffffffff82c59eda>] ? kvm_clock_get_cycles+0x1a/0x20
[239649.357420]  [<ffffffff82cefbf6>] ? ktime_get_ts64+0x46/0xf0
[239649.358489]  [<ffffffff82e1de1a>] ? SyS_select+0xba/0x110
[239649.359521]  [<ffffffff82c03bd1>] ? do_syscall_64+0x91/0x1a0
[239649.360626]  [<ffffffff832137ce>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[239649.361999] Code: 34 24 eb 91 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 41 54 49 89 f4 55 53 48 63 ef 48 c7 c3 c0 8e 01 00 48 83 ec <10> 48 8b 14 ed c0 e4 66 83 65 48 8b 04 25 28 00 00 00 48 89 44 
[239649.367173] RIP  [<ffffffff82ca3b2b>] sched_set_stop_task+0x1b/0xb0
[239649.368355]  RSP <ffffb9e482fcf880>

Change 503501 merged by Andrew Bogott:
[operations/puppet@production] Depool cloudvirt1015

https://gerrit.wikimedia.org/r/503501
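
For context, depooling here just removes the hypervisor from the scheduling pool so no new VMs land on it; the VMs already running stay put until migrated. The puppet change above does this for the WMCS fleet; the generic nova-side equivalent would be something like the following sketch (not what the patch itself contains):

$ openstack compute service set --disable \
      --disable-reason 'T220853: suspect hardware' \
      cloudvirt1015 nova-compute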

Zppix moved this task from Backlog to Monitor on the User-Zppix board.
Zppix added a subscriber: Zppix.

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T18:25:56Z] <bd808> Restarted nova-compute service on cloudvirt1015 (T220853)

bd808 added a subscriber: bd808. · Apr 13 2019, 6:55 PM

$ nova list --all-tenants --host cloudvirt1015
+--------------------------------------+----------------------------+-----------------+--------+------------+-------------+----------------------------------------+
| ID                                   | Name                       | Tenant ID       | Status | Task State | Power State | Networks                               |
+--------------------------------------+----------------------------+-----------------+--------+------------+-------------+----------------------------------------+
| bf38a115-ec86-4f72-8593-26e7d480cb1f | canary1015-01              | testlabs        | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.3.113 |
| 5db3a3ba-2137-4b32-9205-d42a6655135f | product-analytics-bayes    | discovery-stats | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.98  |
| a8f427fe-258e-48fa-92e4-afc3385736ae | product-analytics-test     | discovery-stats | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.107 |
| 8b764eb9-2dca-4902-a9c5-ed54fa3fc57d | puppet-lta                 | lta-tracker     | ERROR  | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.132 |
| adcfb94d-be6f-4d90-8d16-8fa6bdbb2419 | tools-clushmaster-02       | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.183 |
| 3077fa2a-853d-472b-871b-8938ba92982b | tools-elastic-02           | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.74  |
| de3e7366-f6f0-464e-a682-b3d615b97127 | tools-flannel-etcd-02      | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.145 |
| cdd9a533-0f70-4a0e-baf2-4cc13b1138e8 | tools-k8s-etcd-03          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.149 |
| 6c516b3c-8aa0-425b-ac02-5b14a5a089c9 | tools-paws-worker-1002     | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.188 |
| 581bd80b-60b0-4b17-acba-85cf1a29729e | tools-prometheus-01        | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.5.194 |
| 9911d41c-076d-4d25-b1eb-cd84e6175f6e | tools-worker-1014          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.3.94  |
| e875c8e9-155c-40ca-8e19-a5e60a52c742 | tools-worker-1015          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.3.95  |
| b2682ab1-fa37-4463-86be-71dfe553a101 | tools-worker-1016          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.12  |
| c9188b8f-b1a2-4946-9568-006bdf8be270 | tools-worker-1017          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.125 |
| 351e3410-6b7a-4e42-81b1-4cc5e44fba89 | tools-worker-1020          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.158 |
| 7e99f250-c643-475b-b67d-183a24e18c8d | tools-worker-1021          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.192 |
| 9b07f16f-0080-48e7-8a0f-ee9dfcd350d1 | tools-worker-1023          | tools           | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.4.196 |
| a03c8dd0-7d34-4f58-93f3-0f840a5725bd | toolsbeta-flannel-etcd-01  | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.177 |
| db180047-318c-45fe-bd92-cca4c154ad84 | toolsbeta-k8s-etcd-01      | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.170 |
| 268ff37d-f5eb-4cbc-9c28-6782f3a94f50 | toolsbeta-paws-master-01   | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.187 |
| 496f5c76-ee3d-49d1-a8db-20bf33b30153 | toolsbeta-paws-worker-1002 | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.181 |
| 7fef05bd-a384-45a4-9d2a-8d15e6423e32 | toolsbeta-proxy-01         | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.18  |
| 7d7e96a7-59bb-4cc6-bd32-17560c4c500c | toolsbeta-services-01      | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.168 |
| ba3a0660-fb86-4fd7-86ee-a8e28a737ad0 | toolsbeta-worker-1001      | toolsbeta       | ACTIVE | -          | Running     | lan-flat-cloudinstances2b=172.16.6.179 |
+--------------------------------------+----------------------------+-----------------+--------+------------+-------------+----------------------------------------+
Krenair updated the task description. · Apr 13 2019, 6:56 PM
bd808 added a subscriber: Bstorm. · Apr 13 2019, 6:57 PM
[18:29]  <    bd808>	!log admin Rebooting cloudvirt1015 (T220853)

This reboot hung on shutdown. @Bstorm looked via the serial console and found:

[***   ] A stop job is running for Suspend/R…bvirt Guests (8min 38s / no limit)[681756.640740] INFO: rcu_sched detected stalls on CPUs/tasks:
[681756.646977]         52-...: (0 ticks this GP) idle=bea/0/0 softirq=7083989/7083989 fqs=0
[681756.655525]         53-...: (694 GPs behind) idle=40e/0/0 softirq=2166089/2166094 fqs=0
[681756.663975]         55-...: (929 GPs behind) idle=032/0/0 softirq=2253600/2253602 fqs=0
[681756.672426]         57-...: (1043 GPs behind) idle=432/0/0 softirq=2199382/2199382 fqs=0
[681756.680971]         59-...: (780 GPs behind) idle=5fc/0/0 softirq=2237008/2237011 fqs=0
[681756.689420]         (detected by 18, t=5264 jiffies, g=17526939, c=17526938, q=588632)
[681756.697853] rcu_sched kthread starved for 5266 jiffies! g17526939 c17526938 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[  *** ] A stop job is running for Suspend/R…ibvirt Guests (9min 7s / no limit)
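
The open-ended stop job is systemd waiting on libvirt-guests.service, which tries to suspend or shut down every guest before the host goes down. As an aside (not a change made on this task), on Debian that wait can be bounded in /etc/default/libvirt-guests; the values below are illustrative:

# /etc/default/libvirt-guests
ON_SHUTDOWN=shutdown    # shut guests down instead of suspending them
SHUTDOWN_TIMEOUT=300    # give up after 300 seconds rather than waiting forever
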
Krenair added a subscriber: mpopov. (Edited) · Apr 13 2019, 6:57 PM

FYI @mpopov
Apr 12 22:12:25 <andrewbogott> bearloga: I've just seen two VMs crash in a row on cloudvirt1015, so I'm a bit worried that the processor there is making mistakes (we recently replaced a processor there and I thought it was fixed but now I'm not so sure…) product-analytics-bayes and product-analytics-test are on that host.
Apr 12 22:13:02 <andrewbogott> To be on the safe side I can move them, or you could delete and rebuild if that's easy for you. cloudvirt1015 is depooled now so if you recreate they'll get scheduled someplace else
Apr 12 22:13:05 <andrewbogott> lmk your preference

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T19:51:56Z] <bstorm_> started migrating tools-flannel-etcd-02 to cloudvirt1013 T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T19:58:50Z] <bstorm_> started migrating tools-k8s-etcd-03 to cloudvirt1012 T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:17:43Z] <bstorm_> migrated product-analytics-bayes to cloudvirt1009 for T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:28:39Z] <bstorm_> migrated product-analytics-test to cloudvirt1009 for T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:36:14Z] <bstorm_> moving tools-elastic-02 to cloudvirt1009 for T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T20:43:59Z] <bstorm_> Moving puppet-lta to a new server because of hardware problems T220853

Mentioned in SAL (#wikimedia-cloud) [2019-04-13T21:08:05Z] <bstorm_> Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for T220853

After forcing a power cycle from the console, I have been evacuating the things that are more important. Once the last three finish, what's left will be some k8s workers (one paws worker) and a bunch of toolsbeta stuff. I'll leave it at that for testing and such, but the tools-workers are depooled at this time (most are also still down after the reboot). The paws worker is off as well.

The downtime will expire soon. After that, the three VMs that are up right now should keep it from sending pages. I figure the rest is probably OK to stay on the server until Monday, unless someone wants to move the toolsbeta things too, which wouldn't hurt in case the server gets worse.

A more complete version of what I copied out of the console before I rebooted can be found at P8394.
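
The moves above use standard nova admin tooling. The exact invocations aren't recorded on the task (nor whether they were live or cold migrations), but migrating tools-prometheus-01 to cloudvirt1009 would look something like:

$ nova live-migration 581bd80b-60b0-4b17-acba-85cf1a29729e cloudvirt1009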

Bstorm added a comment. (Edited) · Apr 13 2019, 9:45 PM

From dmesg:

[ 5829.292669] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 6289.446681] brq7425e328-56: port 4(tap1f652fcb-db) entered disabled state
[ 6289.447040] device tap1f652fcb-db left promiscuous mode
[ 6289.447044] brq7425e328-56: port 4(tap1f652fcb-db) entered disabled state
[ 6609.259082] perf: interrupt took too long (3146 > 3136), lowering kernel.perf_event_max_sample_rate to 63500
[ 6871.260131] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 45.361 msecs
[ 6871.260138] perf: interrupt took too long (354548 > 3932), lowering kernel.perf_event_max_sample_rate to 500
[ 6871.327471] INFO: NMI handler (ghes_notify_nmi) took too long to run: 112.076 msecs
[ 6872.646908] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[ 6872.646911] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 6872.646913] {1}[Hardware Error]: event severity: corrected
[ 6872.646915] {1}[Hardware Error]:  Error 0, type: corrected
[ 6872.646917] {1}[Hardware Error]:  fru_text: B3
[ 6872.646918] {1}[Hardware Error]:   section_type: memory error
[ 6872.646920] {1}[Hardware Error]:   error_status: 0x0000000000000400
[ 6872.646921] {1}[Hardware Error]:   physical_address: 0x00000077b4ea7800
[ 6872.646925] {1}[Hardware Error]:   node: 1 card: 2 module: 0 rank: 0 bank: 2 row: 45906 column: 992
[ 6872.646927] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[ 6872.646949] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 6872.646951] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[ 6872.646952] EDAC sbridge MC0: TSC eb28091adcc
[ 6872.646954] EDAC sbridge MC0: ADDR 77b4ea7800
[ 6872.646956] EDAC sbridge MC0: MISC 0
[ 6872.646958] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1555188932 SOCKET 0 APIC 0
[ 6872.646982] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x77b4ea7 offset:0x800 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:0)
[ 7164.888310] mce: [Hardware Error]: Machine check events logged
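
Beyond dmesg, the EDAC driver keeps running corrected/uncorrected error counters in sysfs; watching whether ce_count climbs is a quick way to tell a recurring fault from a one-off (a generic check, not a command recorded here):

$ grep . /sys/devices/system/edac/mc/mc*/ce_count \
         /sys/devices/system/edac/mc/mc*/ue_count
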
Andrew added subscribers: Cmjohnson, RobH.

@RobH and/or @Cmjohnson, I'm hoping the above is enough to pass on to Dell for a replacement part. Let us know if you need more details.

Restricted Application added a project: Operations. · Apr 13 2019, 9:53 PM

I finished draining cloudvirt1015 and put it in downtime, so it's ready for whatever reboots/rebuilds/hardware changes might be needed.

Andrew updated the task description. · Apr 15 2019, 3:08 PM
bd808 updated the task description. · Apr 15 2019, 3:10 PM

I just deleted product-analytics-test and product-analytics-bayes so y'all don't need to worry about those instances :)

colewhite triaged this task as Normal priority. · Apr 16 2019, 6:07 PM
Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. · Apr 16 2019, 6:14 PM

*bump* Chris, do you have any thoughts about what we should do next here?

Mentioned in SAL (#wikimedia-operations) [2019-05-15T20:20:29Z] <robh> rebooting cloudvirt1015 into dell hardware tests per T220853

RobH added a comment. · May 15 2019, 8:41 PM

OK, so this box has had CPU issues from the get-go, tracked on both T215012 and T171473. It seems that the CPUs have been swapped, but not the mainboard. Considering it's still throwing CPU errors after all the CPU swaps, I advise we swap the mainboard next.

I'm attempting to run the Dell hardware test to get a failure code.

RobH claimed this task. · May 15 2019, 8:42 PM
RobH reassigned this task from RobH to Cmjohnson. · May 15 2019, 8:50 PM
RobH moved this task from Cloud Tasks to Hardware Failure / Troubleshoot on the ops-eqiad board.

Error output:

The event log indicates degraded or disabled ECC functionality.  Memory testing cannot continue until the problems are corrected, the log cleared and the system rebooted.

Technical Support will need this information to diagnose the problem.
Please record the information below.

Service Tag : 31R9KH2
Error Code : 2000-0125                                      
Validation : 126785

The error logs in question are:

-------------------------------------------------------------------------------
Record:      10
Date/Time:   05/15/2019 20:41:12
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   05/15/2019 20:41:45
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------

So either the memory is bad or the mainboard is bad. Since the host is also throwing CPU errors (see above), we should try to get both the mainboard and the memory DIMM replaced.
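
For reference, those records live in the iDRAC System Event Log, which can be dumped and cleared remotely with Dell's racadm; a generic sketch with placeholder host and credentials (not necessarily how it was done here):

$ racadm -r <idrac-address> -u root -p <password> getsel   # dump the SEL
$ racadm -r <idrac-address> -u root -p <password> clrsel   # clear it before re-testing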

RobH renamed this task from "VMs on cloudvirt1015 crashing" to "VMs on cloudvirt1015 crashing - bad mainboard/memory". · May 15 2019, 8:50 PM
mpopov removed a subscriber: mpopov. · May 17 2019, 4:57 PM

Swapped DIMM B3 with DIMM A3 and cleared the log. If the error follows the module to slot A3, the memory is bad; if it stays on B3, the board is suspect.

Andrew added a comment. · Wed, Jun 5, 6:23 PM

Any update about this? Are parts on the way?

@Andrew what parts? There is nothing on the server side of things that suggests it is the CPU. I reseated and moved the DIMM, and that error has not returned; it may very well have been a poorly seated DIMM. I checked dmesg and do not see any more errors related to memory or CPU. Try putting it back into production and let's see if anything comes back. Unfortunately, I need to demonstrate and prove there is a problem for Dell to do anything, and right now I do not have anything to give them.
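
A quick way to repeat that check while test VMs run is to scan the kernel log for fresh hardware complaints; empty output is what we want to see before repooling (a generic one-liner, not one recorded on the task):

$ dmesg -T | grep -iE 'hardware error|edac|mce'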

I've put eight test VMs on 1015 and will let them run for a few days, then see if they're still up :)

Mentioned in SAL (#wikimedia-cloud) [2019-06-17T13:59:22Z] <andrewbogott> moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853 )