
cloudvirt1015: apparent hardware errors in CPU/Memory
Closed, ResolvedPublic

Description

Apparent memory issues:

[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: TSC 2c44127a3c3 
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: ADDR 77b4ea7000 
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1548935183 SOCKET 0 APIC 0
[Thu Jan 31 11:46:23 2019] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x77b4ea7 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:0)
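The EDAC label in the last line already encodes the physical location of the failing DIMM (socket, home agent, channel, slot). As a small illustrative sketch (the parser and the slot-naming comment are my own, not part of EDAC tooling), the location can be pulled out like this:

```python
import re

# Abbreviated EDAC line from the dmesg paste above; the label format is
# CPU_SrcID#<socket>_Ha#<home agent>_Chan#<channel>_DIMM#<slot>.
LINE = ("EDAC MC1: 0 CE memory read error on "
        "CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 ...)")

def parse_edac_label(line):
    """Extract socket/home-agent/channel/slot from an EDAC DIMM label."""
    m = re.search(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)", line)
    if not m:
        return None
    socket, ha, chan, slot = map(int, m.groups())
    return {"socket": socket, "ha": ha, "channel": chan, "slot": slot}

print(parse_edac_label(LINE))
```

Socket 1 here lines up with the B-bank DIMM (DIMM_B3) flagged in the SEL below, assuming Dell's usual convention that the second socket's slots carry the B labels.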

In racadm:

Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.

-------------------------------------------------------------------------------
Record:      2
Date/Time:   11/16/2018 19:16:14
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   11/16/2018 19:16:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------

Perhaps similar to T175585: cp4021 memory hardware issue - DIMM B1.

Netbox link: https://netbox.wikimedia.org/dcim/devices/1359/
The system is under warranty with Dell until 2020-05-06.

Details

Related Gerrit Patches:
operations/puppet (production): cloudvirt1015: reimage as Debian Stretch
operations/puppet (production): cloudvirt1015: disable notifications

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 31 2019, 11:23 AM
aborrero triaged this task as High priority. · Jan 31 2019, 11:23 AM
aborrero moved this task from Inbox to Important on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-operations) [2019-01-31T11:24:32Z] <arturo> T215012 reboot cloudvirt1015

Mentioned in SAL (#wikimedia-operations) [2019-01-31T11:30:29Z] <arturo> T215012 icinga downtime cloudvirt1015 for 4h while investigating issues

After the reboot, I manually booted the VM instances hosted on this hypervisor:

root@cloudcontrol1004:~# nova list --all-tenants --host cloudvirt1015
+--------------------------------------+-------------------------------+-----------------------------+---------+------------+-------------+------------------------------------------------------+
| ID                                   | Name                          | Tenant ID                   | Status  | Task State | Power State | Networks                                             |
+--------------------------------------+-------------------------------+-----------------------------+---------+------------+-------------+------------------------------------------------------+
| 6aa9f540-216f-4fa6-be97-2729201b09aa | accounts-appserver5           | account-creation-assistance | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.8                 |
| caaebb61-a6b2-4d86-8351-43a2dccd1fdd | canary1015-01                 | testlabs                    | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.153               |
| 57dd2a5e-c54c-4356-8efb-44ede3887d63 | deployment-deploy01           | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.18                |
| 0f039fb4-1b81-4e38-ab33-16bc2638f4fc | deployment-deploy02           | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.19                |
| d62d1274-b304-42d5-b1a7-471ae90fbc78 | deployment-fluorine02         | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.71                |
| 7309d324-5a77-4124-a89f-1b6e2e945b72 | deployment-kafka-jumbo-2      | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.47                |
| 6952e747-9c47-410f-8145-b8f25c5e9658 | deployment-kafka-main-1       | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.116               |
| e469cff8-0791-4a83-aa86-3ba9bf3780d9 | deployment-maps04             | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.10                |
| 843a28b3-14b1-4755-8eba-5c692ba42318 | deployment-mcs01              | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.64                |
| 7698fd54-d917-439f-ba83-fa6595dfad24 | deployment-mediawiki-09       | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.106               |
| 5856aecf-0a41-4fcf-a1fb-042e78311847 | deployment-memc04             | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.76                |
| 9e2a099b-845d-41b8-837e-c0eab353f433 | deployment-ms-be03            | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.51                |
| 5abd39f5-3738-4aee-a182-95c27c968af5 | deployment-ms-fe02            | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.66                |
| 3690cdde-e85b-4d96-bfe1-7f257a077fac | deployment-parsoid09          | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.63                |
| 44019fe9-c42b-48ca-a692-5f4c9fe3b80e | deployment-sca04              | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.54                |
| 307f1873-c6ad-4ada-b1bf-18b997865e88 | deployment-webperf12          | deployment-prep             | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.24                |
| 3f5a2697-7fdc-43c0-8961-1b5b15aaa497 | drmf                          | math                        | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.69                |
| 8113d2c5-6788-43f6-beeb-123b0b717af3 | drmf-beta                     | math                        | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.62                |
| 169b3260-4f7e-43dc-94c2-e699308a3426 | ecmabot                       | webperf                     | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.90                |
| 29e875e3-15d5-4f74-9716-c0025c2ea098 | encoding02                    | video                       | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.13                |
| 1b2b8b50-d463-4b7f-a3a9-6363eeb3ca8b | encoding03                    | video                       | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.201               |
| 5421f938-7a11-499c-bc6a-534da1f4e27d | hafnium                       | rcm                         | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.3.175               |
| 041d42b9-df36-4176-9f5d-a508989bbebc | hound-app-01                  | hound                       | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.2.120               |
| 6149375b-8a08-4f03-882a-6fc0f5f77499 | integration-slave-docker-1044 | integration                 | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.109               |
| 4d64b032-d93a-4a8c-a7e5-569c17e5063f | integration-slave-docker-1046 | integration                 | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.115               |
| ad48959a-9eb9-46a9-bec4-a2bf23cdf655 | integration-slave-docker-1047 | integration                 | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.116               |
| 21644632-0972-448f-83d0-b76f9d1d28e0 | ldfclient-new                 | wikidata-query              | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.9                 |
| c2a30fe0-2c87-4b01-be53-8e2a3d0f40a7 | math-docker                   | math                        | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.6.140               |
| df8f17fb-03fe-4725-b9cf-3d9fe76f4654 | mediawiki2latex               | collection-alt-renderer     | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.62                |
| d73f36e6-7534-4910-9a6e-64a6b9088d1e | neon                          | rcm                         | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.3.180               |
| 2d035965-ba53-41b3-b6ef-d2ebbe50656a | novaadminmadethis             | quotatest                   | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.5.198               |
| c84f61c0-4fd2-47a5-b6ab-dd6b5ea98d41 | ores-puppetmaster-01          | ores                        | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.6.134               |
| 585bb328-8078-4437-b076-9e555683e27d | ores-sentinel-01              | ores                        | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.6.135               |
| 0538bfed-d7b5-4751-9431-8feecbaf78c0 | oxygen                        | rcm                         | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.3.176, 185.15.56.39 |
| e8090d9e-7529-46a9-b1e1-c4ba523a2898 | packaging                     | thumbor                     | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.107               |
| c7fe4663-7f2b-4d23-a79b-1a2e01c80d93 | twlight-prod                  | twl                         | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.25                |
| 2370b38f-7a65-4ccf-a635-7a2fa5e12b3e | twlight-staging               | twl                         | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.63                |
| 464577c6-86f0-42f9-9c49-86f9ec9a0210 | twlight-tracker               | twl                         | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.1.23                |
| 5325322d-a57e-4a9b-85b7-37643f03bfea | wikidata-misc                 | wikidata-dev                | SHUTOFF | -          | Shutdown    | lan-flat-cloudinstances2b=172.16.4.123               |
+--------------------------------------+-------------------------------+-----------------------------+---------+------------+-------------+------------------------------------------------------+
root@cloudcontrol1004:~# for i in $(nova list --all-tenants --host cloudvirt1015 | grep SHUTOFF | awk -F' ' '{print $2}') ; do nova start $i ; sleep 10 ; done
[...]
Request to start server 307f1873-c6ad-4ada-b1bf-18b997865e88 has been accepted.
Request to start server 8113d2c5-6788-43f6-beeb-123b0b717af3 has been accepted.
Request to start server 169b3260-4f7e-43dc-94c2-e699308a3426 has been accepted.
[...]
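The one-off `grep SHUTOFF | awk '{print $2}'` pipeline works because awk's default field splitting makes the leading `|` field one and the UUID field two. As an illustrative alternative (hypothetical helper, not part of the nova CLI), parsing the table columns explicitly is less fragile if the column layout shifts:

```python
def shutoff_ids(nova_table):
    """Return instance UUIDs for `nova list` rows whose status is SHUTOFF.

    Mirrors the grep/awk pipeline above, but splits on the table's '|'
    separators instead of relying on whitespace field positions.
    """
    ids = []
    for line in nova_table.splitlines():
        if "SHUTOFF" not in line:
            continue
        # Data rows look like: | <uuid> | <name> | ... | SHUTOFF | ... |
        cols = [c.strip() for c in line.strip().strip("|").split("|")]
        if cols and cols[0]:
            ids.append(cols[0])
    return ids

sample = ("| 6aa9f540-216f-4fa6-be97-2729201b09aa | accounts-appserver5 "
          "| account-creation-assistance | SHUTOFF | - | Shutdown "
          "| lan-flat-cloudinstances2b=172.16.5.8 |")
print(shutoff_ids(sample))
```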

After the machines are back to life, I'm seeing these weird messages in the kernel log:

[  566.942931] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x34
[  588.358456] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x611
[  588.367048] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x639
[  588.375631] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x641
[  588.384196] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x619
[  588.436435] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x611
[  588.445002] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x639
[  588.453562] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x641
[  588.462122] kvm [9027]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x619
[  656.827034] kvm [10583]: vcpu0, guest rIP: 0xffffffffa5c5a6b2 unhandled rdmsr: 0x34
[  659.782994] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x34
[  661.849391] kvm [10802]: vcpu0, guest rIP: 0xffffffffa365a6b2 unhandled rdmsr: 0x34
[  665.991972] kvm [11034]: vcpu0, guest rIP: 0xffffffff92c5a6b2 unhandled rdmsr: 0x34
[  668.469732] kvm [11139]: vcpu0, guest rIP: 0xffffffffa605a6b2 unhandled rdmsr: 0x34
[  669.093288] kvm [10917]: vcpu0, guest rIP: 0xffffffff9185a6e3 unhandled rdmsr: 0x34
[  670.282268] kvm [11253]: vcpu0, guest rIP: 0xffffffffa125a6b2 unhandled rdmsr: 0x34
[  674.673440] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x34
[  677.894087] kvm [11387]: vcpu0, guest rIP: 0xffffffff9f05a6e3 unhandled rdmsr: 0x34
[  678.252428] kvm [10583]: vcpu0, guest rIP: 0xffffffffa5c5a6b2 unhandled rdmsr: 0x611
[  678.261141] kvm [10583]: vcpu0, guest rIP: 0xffffffffa5c5a6b2 unhandled rdmsr: 0x639
[  678.269811] kvm [10583]: vcpu0, guest rIP: 0xffffffffa5c5a6b2 unhandled rdmsr: 0x641
[  678.278552] kvm [10583]: vcpu0, guest rIP: 0xffffffffa5c5a6b2 unhandled rdmsr: 0x619
[  681.280826] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x611
[  681.289503] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x639
[  681.298168] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x641
[  681.306827] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x619
[  681.370332] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x611
[  681.378982] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x639
[  681.387641] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x641
[  681.396297] kvm [10678]: vcpu0, guest rIP: 0xffffffffa425a6b2 unhandled rdmsr: 0x619
[  681.980292] kvm [11609]: vcpu0, guest rIP: 0xffffffffa425a6e3 unhandled rdmsr: 0x34
[  683.303012] kvm [10802]: vcpu0, guest rIP: 0xffffffffa365a6b2 unhandled rdmsr: 0x611
[  687.308563] kvm [11034]: vcpu0, guest rIP: 0xffffffff92c5a6b2 unhandled rdmsr: 0x611
[  687.317232] kvm [11034]: vcpu0, guest rIP: 0xffffffff92c5a6b2 unhandled rdmsr: 0x639
[  687.325902] kvm [11034]: vcpu0, guest rIP: 0xffffffff92c5a6b2 unhandled rdmsr: 0x641
[  687.334569] kvm [11034]: vcpu0, guest rIP: 0xffffffff92c5a6b2 unhandled rdmsr: 0x619
[  688.577341] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x34
[  689.832274] kvm [11139]: vcpu0, guest rIP: 0xffffffffa605a6b2 unhandled rdmsr: 0x611
[  689.840953] kvm [11139]: vcpu0, guest rIP: 0xffffffffa605a6b2 unhandled rdmsr: 0x639
[  689.849616] kvm [11139]: vcpu0, guest rIP: 0xffffffffa605a6b2 unhandled rdmsr: 0x641
[  689.858262] kvm [11139]: vcpu0, guest rIP: 0xffffffffa605a6b2 unhandled rdmsr: 0x619
[  690.193179] kvm [10917]: vcpu0, guest rIP: 0xffffffff9185a6e3 unhandled rdmsr: 0x611
[  695.813870] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x611
[  695.822534] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x639
[  695.831194] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x641
[  695.839865] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x619
[  695.883375] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x611
[  695.892034] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x639
[  695.900704] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x641
[  695.909373] kvm [11506]: vcpu0, guest rIP: 0xffffffffbe45a6b2 unhandled rdmsr: 0x619
[  698.979497] kvm [11387]: vcpu0, guest rIP: 0xffffffff9f05a6e3 unhandled rdmsr: 0x611
[  698.988188] kvm [11387]: vcpu0, guest rIP: 0xffffffff9f05a6e3 unhandled rdmsr: 0x639
[  703.015023] kvm [11609]: vcpu0, guest rIP: 0xffffffffa425a6e3 unhandled rdmsr: 0x611
[  703.023724] kvm [11609]: vcpu0, guest rIP: 0xffffffffa425a6e3 unhandled rdmsr: 0x639
[  703.033417] kvm [11609]: vcpu0, guest rIP: 0xffffffffa425a6e3 unhandled rdmsr: 0x641
[  703.042711] kvm [11609]: vcpu0, guest rIP: 0xffffffffa425a6e3 unhandled rdmsr: 0x619
[  705.977601] kvm [11758]: vcpu0, guest rIP: 0xffffffff9ca5a6e3 unhandled rdmsr: 0x611
[  705.986299] kvm [11758]: vcpu0, guest rIP: 0xffffffff9ca5a6e3 unhandled rdmsr: 0x639
[  705.994966] kvm [11758]: vcpu0, guest rIP: 0xffffffff9ca5a6e3 unhandled rdmsr: 0x641
[  706.003626] kvm [11758]: vcpu0, guest rIP: 0xffffffff9ca5a6e3 unhandled rdmsr: 0x619
[  706.070431] kvm [11758]: vcpu0, guest rIP: 0xffffffff9ca5a6e3 unhandled rdmsr: 0x611
[  706.079138] kvm [11758]: vcpu0, guest rIP: 0xffffffff9ca5a6e3 unhandled rdmsr: 0x639
[  709.714351] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x611
[  709.723048] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x639
[  709.731737] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x641
[  709.740410] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x619
[  709.770160] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x611
[  709.778828] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x639
[  709.787499] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x641
[  709.796153] kvm [12060]: vcpu0, guest rIP: 0xffffffffaea5a6e3 unhandled rdmsr: 0x619
[  711.808846] kvm [12195]: vcpu0, guest rIP: 0xffffffffb5a5a6e3 unhandled rdmsr: 0x611
[  711.817517] kvm [12195]: vcpu0, guest rIP: 0xffffffffb5a5a6e3 unhandled rdmsr: 0x639
[  757.702915] kvm [14244]: vcpu0, guest rIP: 0xffffffff9785a6b2 unhandled rdmsr: 0x34
[  779.165616] kvm [14244]: vcpu0, guest rIP: 0xffffffff9785a6b2 unhandled rdmsr: 0x611
[  779.174288] kvm [14244]: vcpu0, guest rIP: 0xffffffff9785a6b2 unhandled rdmsr: 0x639
[  779.182951] kvm [14244]: vcpu0, guest rIP: 0xffffffff9785a6b2 unhandled rdmsr: 0x641
[  779.191622] kvm [14244]: vcpu0, guest rIP: 0xffffffff9785a6b2 unhandled rdmsr: 0x619
[  782.174671] kvm [14994]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x34
[  800.109797] kvm [15218]: vcpu0, guest rIP: 0xffffffffabe5a6e3 unhandled rdmsr: 0x34
[  803.705520] kvm [14994]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x611
[  803.714198] kvm [14994]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x639
[  803.722863] kvm [14994]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x641
[  803.731518] kvm [14994]: vcpu0, guest rIP: 0xffffffffb6c5a6b2 unhandled rdmsr: 0x619
[  812.402839] kvm [15647]: vcpu0, guest rIP: 0xffffffffa565a6e3 unhandled rdmsr: 0x34
[  821.303298] kvm [15218]: vcpu0, guest rIP: 0xffffffffabe5a6e3 unhandled rdmsr: 0x611
[  821.312005] kvm [15218]: vcpu0, guest rIP: 0xffffffffabe5a6e3 unhandled rdmsr: 0x639
[  821.320676] kvm [15218]: vcpu0, guest rIP: 0xffffffffabe5a6e3 unhandled rdmsr: 0x641
[  821.329360] kvm [15218]: vcpu0, guest rIP: 0xffffffffabe5a6e3 unhandled rdmsr: 0x619
[  823.968414] kvm [16067]: vcpu0, guest rIP: 0xffffffffa9e5a6e3 unhandled rdmsr: 0x34
[  833.731290] kvm [15647]: vcpu0, guest rIP: 0xffffffffa565a6e3 unhandled rdmsr: 0x611
[  833.739970] kvm [15647]: vcpu0, guest rIP: 0xffffffffa565a6e3 unhandled rdmsr: 0x639
[  833.748631] kvm [15647]: vcpu0, guest rIP: 0xffffffffa565a6e3 unhandled rdmsr: 0x641
[  833.757284] kvm [15647]: vcpu0, guest rIP: 0xffffffffa565a6e3 unhandled rdmsr: 0x619
[  836.389505] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x34
[  845.080354] kvm [16067]: vcpu0, guest rIP: 0xffffffffa9e5a6e3 unhandled rdmsr: 0x611
[  845.089026] kvm [16067]: vcpu0, guest rIP: 0xffffffffa9e5a6e3 unhandled rdmsr: 0x639
[  845.097712] kvm [16067]: vcpu0, guest rIP: 0xffffffffa9e5a6e3 unhandled rdmsr: 0x641
[  845.106409] kvm [16067]: vcpu0, guest rIP: 0xffffffffa9e5a6e3 unhandled rdmsr: 0x619
[  849.579324] kvm [16706]: vcpu0, guest rIP: 0xffffffff8fe5a6e3 unhandled rdmsr: 0x34
[  857.322427] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x611
[  857.331102] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x639
[  857.339776] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x641
[  857.348449] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x619
[  857.395147] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x611
[  857.404049] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x639
[  857.412778] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x641
[  857.421489] kvm [16458]: vcpu0, guest rIP: 0xffffffffa365a6e3 unhandled rdmsr: 0x619
[  861.727806] kvm [16996]: vcpu0, guest rIP: 0xffffffff8105a6e3 unhandled rdmsr: 0x34
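These `unhandled rdmsr` warnings are generally considered noise rather than a symptom of the hardware fault: the guest kernels probe model-specific registers that KVM does not emulate (KVM returns zero and logs the read). Going by the Intel SDM / Linux `msr-index.h` numbering (worth double-checking there; this mapping is mine, not from the logs), the MSRs seen here are the SMI counter and the RAPL energy counters:

```python
# Reference decoding of the rdmsr numbers seen in the dmesg paste above.
# Names per the Intel SDM / Linux msr-index.h; verify against the SDM.
KVM_UNHANDLED_MSRS = {
    0x34:  "MSR_SMI_COUNT (SMI counter)",
    0x611: "MSR_PKG_ENERGY_STATUS (RAPL package energy)",
    0x619: "MSR_DRAM_ENERGY_STATUS (RAPL DRAM energy)",
    0x639: "MSR_PP0_ENERGY_STATUS (RAPL core energy)",
    0x641: "MSR_PP1_ENERGY_STATUS (RAPL graphics energy)",
}

def describe_rdmsr(msr):
    """Map an 'unhandled rdmsr' number to a human-readable MSR name."""
    return KVM_UNHANDLED_MSRS.get(msr, "unknown MSR 0x%x" % msr)

print(describe_rdmsr(0x611))
```

So the bursts at guest boot are most likely the guests' power-management drivers probing energy counters, unrelated to the DIMM/CPU errors above.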

Actually, with a bit more context:

[...]
[Thu Jan 31 11:42:01 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:01 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:04 2019] kvm [18714]: vcpu0, guest rIP: 0xffffffffa4e5b652 unhandled rdmsr: 0x34
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x611
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x639
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x641
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x619
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x611
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x639
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x641
[Thu Jan 31 11:42:12 2019] kvm [18449]: vcpu0, guest rIP: 0xffffffff8d25a6b2 unhandled rdmsr: 0x619
[Thu Jan 31 11:42:13 2019] brq7425e328-56: port 32(tap148656c1-84) entered blocking state
[Thu Jan 31 11:42:13 2019] brq7425e328-56: port 32(tap148656c1-84) entered disabled state
[Thu Jan 31 11:42:13 2019] device tap148656c1-84 entered promiscuous mode
[Thu Jan 31 11:42:13 2019] brq7425e328-56: port 32(tap148656c1-84) entered blocking state
[Thu Jan 31 11:42:13 2019] brq7425e328-56: port 32(tap148656c1-84) entered forwarding state
[Thu Jan 31 11:42:14 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:14 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:16 2019] kvm [19141]: vcpu0, guest rIP: 0xffffffffa845b652 unhandled rdmsr: 0x34
[Thu Jan 31 11:42:25 2019] kvm [18714]: vcpu0, guest rIP: 0xffffffffa4e5b652 unhandled rdmsr: 0x611
[Thu Jan 31 11:42:25 2019] kvm [18714]: vcpu0, guest rIP: 0xffffffffa4e5b652 unhandled rdmsr: 0x639
[Thu Jan 31 11:42:25 2019] kvm [18714]: vcpu0, guest rIP: 0xffffffffa4e5b652 unhandled rdmsr: 0x641
[Thu Jan 31 11:42:25 2019] kvm [18714]: vcpu0, guest rIP: 0xffffffffa4e5b652 unhandled rdmsr: 0x619
[Thu Jan 31 11:42:26 2019] brq7425e328-56: port 33(tap66272afa-29) entered blocking state
[Thu Jan 31 11:42:26 2019] brq7425e328-56: port 33(tap66272afa-29) entered disabled state
[Thu Jan 31 11:42:26 2019] device tap66272afa-29 entered promiscuous mode
[Thu Jan 31 11:42:26 2019] brq7425e328-56: port 33(tap66272afa-29) entered blocking state
[Thu Jan 31 11:42:26 2019] brq7425e328-56: port 33(tap66272afa-29) entered forwarding state
[Thu Jan 31 11:42:26 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:26 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:28 2019] kvm [19769]: vcpu0, guest rIP: 0xffffffffa8e5b652 unhandled rdmsr: 0x34
[Thu Jan 31 11:42:37 2019] kvm [19141]: vcpu0, guest rIP: 0xffffffffa845b652 unhandled rdmsr: 0x611
[Thu Jan 31 11:42:37 2019] kvm [19141]: vcpu0, guest rIP: 0xffffffffa845b652 unhandled rdmsr: 0x639
[Thu Jan 31 11:42:37 2019] kvm [19141]: vcpu0, guest rIP: 0xffffffffa845b652 unhandled rdmsr: 0x641
[Thu Jan 31 11:42:37 2019] kvm [19141]: vcpu0, guest rIP: 0xffffffffa845b652 unhandled rdmsr: 0x619
[Thu Jan 31 11:42:38 2019] brq7425e328-56: port 34(tap48c0c270-95) entered blocking state
[Thu Jan 31 11:42:38 2019] brq7425e328-56: port 34(tap48c0c270-95) entered disabled state
[Thu Jan 31 11:42:38 2019] device tap48c0c270-95 entered promiscuous mode
[Thu Jan 31 11:42:38 2019] brq7425e328-56: port 34(tap48c0c270-95) entered blocking state
[Thu Jan 31 11:42:38 2019] brq7425e328-56: port 34(tap48c0c270-95) entered forwarding state
[Thu Jan 31 11:42:38 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:38 2019] kvm: zapping shadow pages for mmio generation wraparound
[Thu Jan 31 11:42:40 2019] kvm [20051]: vcpu0, guest rIP: 0xffffffff98c5b652 unhandled rdmsr: 0x34
[...]

For the record, I captured the racadm logs:

/admin1-> racadm getsel
Record:      1
Date/Time:   10/29/2018 17:26:21
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   11/16/2018 19:16:14
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   11/16/2018 19:16:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   01/19/2019 01:57:00
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/19/2019 02:47:50
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   01/31/2019 11:08:27
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->
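For sifting through SEL dumps like this one, a throwaway parser helps pull out the records that matter (a hypothetical helper around the `racadm getsel` text format shown above, not a racadm feature):

```python
def parse_getsel(text):
    """Split `racadm getsel` output into a list of record dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("-----"):
            if current:
                records.append(current)
                current = {}
        elif ":" in line:
            # Split on the first colon only, so 'Date/Time' values survive.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records

sample = """Record:      2
Date/Time:   11/16/2018 19:16:14
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      6
Severity:    Critical
Description: CPU 1 machine check error detected.
"""
recs = parse_getsel(sample)
print(sum(r["Severity"] == "Critical" for r in recs))
```

Applied to the full dump, that confirms the only Critical entries are the DIMM_B3 error-rate record and the two CPU 1 machine check records.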

Worth noting this event from today, at the same time as the original DOWN report from icinga:

Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.

Could this be the reason for the machine going down? Adding DC-Ops folks to the ticket for further advice.

Neither /var/log/syslog nor /var/log/kern.log contains relevant information about the issue, which further points to a hardware problem.

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:04:56Z] <arturo> VM instances accounts-appserver5, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:05:14Z] <arturo> VM instances deployment-deploy01,deployment-deploy02,deployment-fluorine02,deployment-kafka-jumbo-2,deployment-kafka-main-1,deployment-maps04,deployment-mcs01,deployment-mediawiki-09,deployment-memc04,deployment-ms-be03,deployment-ms-fe02,deployment-parsoid09,deployment-sca04,deployment-webperf12, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:05:26Z] <arturo> VM instances hound-app-01, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:05:34Z] <arturo> VM instances integration-slave-docker-1044,integration-slave-docker-1046,integration-slave-docker-1047, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:05:41Z] <arturo> VM instances drmf,drmf-beta,math-docker, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:05:48Z] <arturo> VM instances ores-puppetmaster-01,ores-sentinel-01, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:05:56Z] <arturo> VM instances novaadminmadethis, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:06:05Z] <arturo> VM instances hafnium,neon,oxygen, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:06:16Z] <arturo> VM instances canary1015-01, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:06:24Z] <arturo> VM instances packaging, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:06:33Z] <arturo> VM instances twlight-prod,twlight-staging,twlight-tracker, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:06:40Z] <arturo> VM instances encoding02,encoding03, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:06:51Z] <arturo> VM instances deployment-webperf12,ecmabot, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:07:00Z] <arturo> VM instances wikidata-misc, were stopped briefly due to issue in hypervisor (T215012).

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:07:07Z] <arturo> VM instances ldfclient-new, were stopped briefly due to issue in hypervisor (T215012)

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:07:58Z] <arturo> VM instances mediawiki2latex, were stopped briefly due to issue in hypervisor (T215012)

TL;DR:

Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.

I'm asking @RobH and @Cmjohnson for advice.

More interesting dmesg entries:

[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]: event severity: corrected
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:  fru_text: B3
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:   section_type: memory error
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:   physical_address: 0x00000077b4ea7000
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:   node: 1 card: 2 module: 0 rank: 0 bank: 2 row: 45906 column: 960 
[Thu Jan 31 11:46:23 2019] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: TSC 2c44127a3c3 
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: ADDR 77b4ea7000 
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 31 11:46:23 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1548935183 SOCKET 0 APIC 0
[Thu Jan 31 11:46:23 2019] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x77b4ea7 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:0)
[Thu Jan 31 11:47:36 2019] mce: [Hardware Error]: Machine check events logged
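The `EDAC MC1` line above encodes the physical location of the failing DIMM. As a minimal parsing sketch (a hypothetical helper, not part of any tooling used in this task), the socket/channel/slot fields can be pulled out so they can be matched against iDRAC's `DIMM_B3` label:

```python
import re

# Hypothetical helper: extract the DIMM location from an EDAC corrected-error
# (CE) line as logged by the sb_edac driver. Field names follow the log format
# pasted above; this is an illustration, not an exhaustive EDAC parser.
EDAC_CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): \d+ CE .* on "
    r"CPU_SrcID#(?P<socket>\d+)_Ha#(?P<ha>\d+)_Chan#(?P<chan>\d+)_DIMM#(?P<dimm>\d+)"
)

def parse_edac_ce(line):
    """Return {mc, socket, ha, chan, dimm} as ints, or None if no match."""
    m = EDAC_CE_RE.search(line)
    return {k: int(v) for k, v in m.groupdict().items()} if m else None

sample = ("[Thu Jan 31 11:46:23 2019] EDAC MC1: 0 CE memory read error on "
          "CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x77b4ea7 "
          "offset:0x0 grain:32 syndrome:0x0)")
print(parse_edac_ce(sample))  # socket 1, home agent 1, channel 0, DIMM slot 0
```

Here socket 1 / channel 0 / slot 0 is consistent with the B-bank DIMM that iDRAC flags as `DIMM_B3` having exceeded its correctable-error rate.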
aborrero renamed this task from cloudvirt1015: server down to cloudvirt1015: apparent hardware errors in CPU/Memory. Jan 31 2019, 12:29 PM

Mentioned in SAL (#wikimedia-cloud) [2019-01-31T12:44:29Z] <arturo> T215012 depooling cloudvirt1015 and migrating all VMs to cloudvirt1024

Similar to T175585 (memory error, DIMM needs replacement)

Change 487370 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvirt1015: disable notifications

https://gerrit.wikimedia.org/r/487370

Change 487370 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvirt1015: disable notifications

https://gerrit.wikimedia.org/r/487370

Andrew added a subscriber: Andrew. Feb 4 2019, 3:01 PM

I see (and have seen) those 'unhandled rdmsr' messages all over the place, and I'm pretty sure they're harmless:

root@cloudvirt1013:~# dmesg | grep "unhandled rdmsr" | wc
    248    2492   22098
root@cloudvirt1022:~# dmesg | grep "unhandled rdmsr" | wc
   2459   24590  222966

That said, the 'Hardware error from APEI' looks like it might be the real deal.
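The triage rule described above can be sketched as a trivial line classifier: treat `unhandled rdmsr` lines as known noise and flag APEI/EDAC/machine-check lines as real hardware signals. The keyword lists below are illustrative assumptions, not an exhaustive rule set:

```python
# Hedged sketch of the triage heuristic: 'unhandled rdmsr' is harmless KVM
# noise seen on many hypervisors; APEI/EDAC/machine-check lines are the ones
# worth escalating. Keyword lists are assumptions for illustration only.
NOISE_KEYS = ("unhandled rdmsr",)
HARDWARE_KEYS = ("Hardware Error", "EDAC", "Machine check", "machine check")

def classify(line):
    if any(k in line for k in NOISE_KEYS):
        return "noise"
    if any(k in line for k in HARDWARE_KEYS):
        return "hardware"
    return "other"

for line in (
    "kvm [1234]: vcpu0 unhandled rdmsr: 0x611",
    "{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4",
    "EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
):
    print(classify(line), "|", line)
```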

Restricted Application added a project: Operations. · View Herald Transcript Feb 4 2019, 3:10 PM
Andrew added a comment. Feb 4 2019, 3:16 PM

Since this host is empty we should rebuild it with Stretch before putting any real VMs back on it. Maybe best to resolve the hardware issue first though.

aborrero reassigned this task from aborrero to RobH. Feb 7 2019, 4:44 PM
aborrero updated the task description.
RobH added a comment. Feb 7 2019, 4:46 PM
root@cloudvirt1015.mgmt.eqiad.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   10/29/2018 17:26:21
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   11/16/2018 19:16:14
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   11/16/2018 19:16:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   01/19/2019 01:57:00
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/19/2019 02:47:50
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   01/31/2019 11:08:27
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->
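For long `racadm getsel` dumps like the one above, it can help to filter records by severity rather than read them top to bottom. A minimal sketch that parses the pasted record format (layout inferred from this output; the helper name is hypothetical):

```python
# Illustrative helper: parse `racadm getsel` text output (record layout as
# pasted above: "Key: value" lines separated by dashed rules) and select the
# Critical records. This is a sketch against this paste, not a general parser.
def parse_sel(text):
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            if current:
                records.append(current)
                current = {}
        elif ":" in line:
            key, _, val = line.partition(":")
            current[key.strip()] = val.strip()
    if current:
        records.append(current)
    return records

sel = """Record:      2
Date/Time:   11/16/2018 19:16:14
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------"""

critical = [r for r in parse_sel(sel) if r.get("Severity") == "Critical"]
print(critical)  # the "CPU 1 machine check error detected" record
```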
aborrero updated the task description. Feb 7 2019, 4:46 PM
RobH updated the task description. Feb 7 2019, 4:52 PM

Since this host is empty we should rebuild it with Stretch before putting any real VMs back on it. Maybe best to resolve the hardware issue first though.

Seems this can be taken down for troubleshooting, though it should be put into maint mode in icinga first!

RobH reassigned this task from RobH to Cmjohnson. Feb 7 2019, 4:53 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH updated the task description.
RobH removed a project: Patch-For-Review.
RobH set Due Date to Feb 14 2019, 12:00 AM. Feb 7 2019, 4:56 PM
Restricted Application changed the subtype of this task from "Task" to "Deadline". · View Herald Transcript Feb 7 2019, 4:56 PM
RobH removed Due Date. Feb 7 2019, 4:57 PM
Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald Transcript Feb 7 2019, 4:57 PM
Andrew added a comment. Feb 7 2019, 4:59 PM

Note that this isn't the first time we've had issues with 1015: T171473

Requested a new CPU from Dell

You have successfully submitted request SR986941687.

Andrew moved this task from Epics to Blocked on the cloud-services-team (Kanban) board.

While we wait for the new CPU, this server is not serving any purpose, so I will use it to test some puppet changes (including a reimage from Jessie to Stretch).
This should not affect any DC operations, since the host can still be shut down as soon as required.

Change 496201 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvirt1015: reimage as Debian Stretch

https://gerrit.wikimedia.org/r/496201

Change 496201 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvirt1015: reimage as Debian Stretch

https://gerrit.wikimedia.org/r/496201

@aborrero the CPU is here...let me know when it's safe for me to change.

@aborrero the CPU is here...let me know when it's safe for me to change.

There's nothing important on that server, so you can change the CPU any time. You might want to ping arturo before you do it in case he's working on it (but he's definitely out for the rest of the day).

@aborrero the CPU is here...let me know when it's safe for me to change.

Thanks, you can do it anytime.

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM
Cmjohnson closed this task as Resolved. Mar 26 2019, 3:59 PM

The CPU has been swapped out and the log cleared.

Return tracking:
USPS 9202 3945 5301 2441 1123 60
FedEx 9611918 2393026 77862388

Andrew added a parent task: Unknown Object (Task). Jul 23 2019, 3:10 PM