
cloudvirt1016 crash
Closed, ResolvedPublic

Description

We just got an alert that cloudvirt1016 was offline. I logged into mgmt and checked the console, which appeared to be hung, so I power-cycled the host.

We've lost cloudvirt1013, 1014, and 1016 in recent weeks. Note that 1016 is a Dell, unlike 1013 and 1014, which are HPs.

Details

Related Gerrit Patches:
[operations/puppet@production] nova: depool cloudvirt1016

Related Objects

| Status   | Subtype | Assigned   | Task |
| Resolved |         | Andrew     |      |
| Resolved |         | Jclark-ctr |      |

Event Timeline

Andrew created this task. Jan 4 2020, 2:18 PM
Restricted Application added a subscriber: Aklapper. Jan 4 2020, 2:18 PM
Andrew added a comment. Jan 4 2020, 2:23 PM

Again, nothing interesting in syslog, just a sudden stop: the last entry before the crash is at 14:11:01, and boot messages resume at 14:18:43.

Jan  4 14:02:01 cloudvirt1016 CRON[14615]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:03:01 cloudvirt1016 CRON[15769]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:04:01 cloudvirt1016 CRON[17058]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:05:01 cloudvirt1016 CRON[18259]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan  4 14:05:01 cloudvirt1016 CRON[18260]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:06:01 cloudvirt1016 CRON[19489]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:07:01 cloudvirt1016 CRON[20808]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:08:01 cloudvirt1016 CRON[21978]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:09:01 cloudvirt1016 CRON[23321]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:10:01 cloudvirt1016 CRON[24681]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:11:01 cloudvirt1016 CRON[25842]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan  4 14:18:43 cloudvirt1016 systemd-modules-load[1401]: Inserted module 'br_netfilter'
Jan  4 14:18:43 cloudvirt1016 systemd-modules-load[1401]: Inserted module 'ipmi_devintf'
Jan  4 14:18:43 cloudvirt1016 systemd-modules-load[1401]: Inserted module 'nbd'
Jan  4 14:18:43 cloudvirt1016 lvm[1420]:   1 logical volume(s) in volume group "tank" monitored
Jan  4 14:18:43 cloudvirt1016 systemd-modules-load[1401]: Inserted module 'iscsi_tcp'
Jan  4 14:18:43 cloudvirt1016 systemd-modules-load[1401]: Inserted module 'ib_iser'
Jan  4 14:18:43 cloudvirt1016 systemd-sysctl[1447]: Couldn't write '262144' to 'net/netfilter/nf_conntrack_max', ignoring: No such file or directory
Jan  4 14:18:43 cloudvirt1016 systemd-sysctl[1447]: Couldn't write '65' to 'net/netfilter/nf_conntrack_tcp_timeout_time_wait', ignoring: No such file or directory
Jan  4 14:18:43 cloudvirt1016 systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Jan  4 14:18:43 cloudvirt1016 systemd[1]: Started Load/Save Random Seed.
Jan  4 14:18:43 cloudvirt1016 systemd[1]: Started Create Static Device Nodes in /dev.
Jan  4 14:18:43 cloudvirt1016 systemd[1]: Started Apply Kernel Variables.
Jan  4 14:18:43 cloudvirt1016 systemd[1]: Starting udev Kernel Device Manager...
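
The "sudden stop" in the excerpt above shows up as a hole in an otherwise once-a-minute cron cadence. A minimal sketch of how such a hole could be located automatically is given below. It assumes the classic syslog timestamp prefix seen in the excerpt; the function name `find_gaps`, the two-minute threshold, and the hard-coded year are illustrative assumptions, not part of any tooling mentioned in this task.

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(minutes=2), year=2020):
    """Yield (previous, current) timestamp pairs that are further apart than threshold.

    Assumes each line starts with a classic syslog stamp such as "Jan  4 14:02:01".
    That stamp carries no year, so the year is supplied by the caller (assumption).
    """
    prev = None
    for line in lines:
        stamp = line[:15]  # fixed-width "Mon dd HH:MM:SS" prefix
        try:
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if prev is not None and ts - prev > threshold:
            yield prev, ts
        prev = ts
```

Fed the syslog lines from this comment, the generator would be expected to report a single gap between the last cron entry and the first boot message. The threshold only needs to exceed the normal spacing between entries on the host being checked.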
Andrew added a comment. Jan 4 2020, 2:26 PM

Affected VMs:

| 4a85bfa1-f920-489c-badb-e8cc9f8d5692 | wikilabels-backups-01        | ACTIVE  | lan-flat-cloudinstances2b=172.16.0.147               | debian-10.0-buster (deprecated 2019-12-18)  |
| 8ce285c5-1b12-4209-a751-74ec7ab0919b | canary1016-01                | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.225               | debian-10.0-buster (deprecated 2019-07-29)  |
| e4e71a4b-7121-45ef-844b-775017691e37 | language-eg                  | ACTIVE  | lan-flat-cloudinstances2b=172.16.5.253               | debian-9.6-stretch (deprecated 2019-01-22)  |
| 123949a5-8b58-40e9-97db-a52709a80d5c | tools-sgegrid-master         | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.197               | debian-9.6-stretch (deprecated 2019-01-22)  |
| 4a87c9dd-d42c-42ff-ad5f-8b59125c49e0 | wcdo                         | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.141               | debian-9.3-stretch (deprecated 2018-06-05)  |
| 9e0f875e-5a12-49b9-b2e9-8d2943e4bb97 | medbox3-iiab                 | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.131, 185.15.56.46 | debian-9.3-stretch (deprecated 2018-06-05)  |
| 89ef3df5-2a09-499e-a73d-682f4454c449 | utrs-database2               | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.37                | debian-9.6-stretch (deprecated 2019-01-22)  |
| 275ccc6d-3730-42c8-8a05-5293ef0db44a | labs-bootstrapvz-jessie      | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.154               | debian-8.3-jessie (deprecated 2016-06-13)   |
| 10798c2b-d5da-462c-b318-9b2ba2ff2c7d | mwoffliner2                  | ACTIVE  | lan-flat-cloudinstances2b=172.16.7.219               | debian-9.6-stretch (deprecated 2019-01-22)  |
| 58cf8c32-5af5-4313-8bcd-1d48124faf09 | toolsbeta-sgebastion-04      | ACTIVE  | lan-flat-cloudinstances2b=172.16.3.240               | debian-9.6-stretch (deprecated 2019-01-22)  |
| a5c14b0d-5013-4ec6-995a-728e1d5eb49a | petscan3                     | ACTIVE  | lan-flat-cloudinstances2b=172.16.3.171               | debian-9.5-stretch (deprecated 2018-11-22)  |
| 78d6bf30-a972-47cd-9b07-8354730ead64 | dumps-1                      | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.79                | debian-8.7-jessie (deprecated 2017-04-07)   |
| 12c8bd52-2bbc-42fb-b974-b67b9d46cf3b | deployment-server            | ACTIVE  | lan-flat-cloudinstances2b=172.16.5.168               | debian-9.5-stretch (deprecated 2018-11-22)  |
| f0a0822d-4a84-493d-a28b-df985bf739ba | deployment-elastic05         | ACTIVE  | lan-flat-cloudinstances2b=172.16.5.136               | debian-9.5-stretch (deprecated 2018-11-22)  |
| 4d078e3f-4dfe-420a-a632-d33ea32e8ed5 | deployment-kafka-main-2      | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.100               | debian-9.3-stretch (deprecated 2018-06-05)  |
| 9d0fee92-58e4-44a6-8562-081c598b75ed | wmde-wikidiff2-patched       | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.232               | debian-8.7-jessie (deprecated 2017-04-07)   |
| 14ad3fb6-6394-4fce-b2d2-7370f3202019 | sqltest02                    | ACTIVE  | lan-flat-cloudinstances2b=172.16.4.226               | debian-8.10-jessie (deprecated 2018-06-05)  |
| e9b465c6-8b84-433b-8a3b-a1a472ca5265 | toolsbeta-sgeexec-0901       | ACTIVE  | lan-flat-cloudinstances2b=172.16.3.149               | debian-9.5-stretch (deprecated 2018-11-22)  |
| e4fd5e76-7d45-4cee-a5e1-7e4029a6ee54 | mcr-full                     | SHUTOFF | lan-flat-cloudinstances2b=172.16.2.112               | debian-9.3-stretch (deprecated 2018-06-05)  |
| 31697c87-0a10-4036-9cfd-15531014800e | language-mleb-master         | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.104               | debian-9.4-stretch (deprecated 2018-08-01)  |
| f312cd3f-f547-42fa-8ef8-607ad133e0ac | quarry-web-01                | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.100               | debian-9.5-stretch (deprecated 2018-11-22)  |
| 582c08ea-5a75-459d-b88e-8dc77ac7fd0f | quarry-worker-02             | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.97                | debian-9.5-stretch (deprecated 2018-11-22)  |
| fdb069da-4c57-4703-bc63-b0f1a7640bb0 | wbregistry-01                | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.78                | debian-9.3-stretch (deprecated 2018-06-05)  |
| 3b753fef-5d9d-4c13-af1e-c0e2d0d4ee97 | mc-clusterB-1                | SHUTOFF | lan-flat-cloudinstances2b=172.16.2.72                | debian-9.3-stretch (deprecated 2018-06-05)  |
| d5f3298b-c7ab-4a87-ba60-aec7306c19d8 | readingwebstaging            | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.64                | debian-9.4-stretch (deprecated 2018-08-01)  |
| 4b3ea8d4-b7e0-4665-a681-51601dae6db4 | pluggableauth-server         | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.43                | debian-9.0-stretch (deprecated 2017-09-27)  |
| 8a6b1007-5341-4c45-98c9-5d6838a51f50 | mwstake                      | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.40, 185.15.56.26  | debian-9.1-stretch (deprecated  2017-11-16) |
| 20cb72ca-41f6-4f30-ba57-c168a594d084 | apps-team-tools              | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.37                | debian-9.5-stretch (deprecated 2018-11-22)  |
| 9baabe5e-2c2a-4b5f-9579-5f9b08903e48 | huggle-wl                    | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.30                | debian-9.1-stretch (deprecated  2017-11-16) |
| 3c209f09-1a74-49f8-8d54-6942cf220942 | fastcci-worker2              | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.24                | debian-9.5-stretch (deprecated 2018-11-22)  |
| 44418c81-d480-4c95-a711-69c88137f80b | cyberbot-exec-01             | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.21                | debian-9.1-stretch (deprecated  2017-11-16) |
| 0d836fb8-be3b-4a2a-af52-eb4d532f04ff | antiharassment-web1          | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.5                 | debian-9.5-stretch (deprecated 2018-11-22)  |
| e54bc408-220e-4ec5-ac4b-04264477e123 | wdsearch2                    | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.229               | debian-9.5-stretch (deprecated 2018-11-22)  |
| f1d52375-67b2-42f2-8089-aa2095c4aad3 | traffic-puppetmaster         | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.186               | debian-8.7-jessie (deprecated 2017-07-19)   |
| 54e36627-6f0c-46e7-a035-994a2526a0ab | diffscan                     | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.185               | debian-9.0-stretch (deprecated 2017-07-19)  |
| 18fcdcc4-3368-4222-b9e8-8ed0fa87c284 | builder01                    | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.175               | debian-9.3-stretch (deprecated 2018-06-05)  |
| 20e6a5e6-a4a3-4494-8d4e-8ceec78d90c9 | af-puppetmaster02            | ACTIVE  | lan-flat-cloudinstances2b=172.16.2.166               | debian-8.6-jessie (deprecated 2017-02-24)   |
| 8745343e-fa35-493e-b628-62f2542e48d2 | mx-out01                     | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.239, 185.15.56.18 | debian-9.5-stretch (deprecated 2018-11-22)  |
| 35ef6320-6ceb-4fe6-93f2-e3594ed4e9db | hashtags-prod                | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.222               | debian-9.5-stretch (deprecated 2018-11-22)  |
| f5c621c1-56c1-4062-833b-c36bba7bc178 | gerrit-test                  | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.182               | debian-8.5-jessie (deprecated 2016-08-01)   |
| 382e322e-493a-48cb-8e34-7a4358d090ce | phab-tin                     | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.170               | debian-8.5-jessie (deprecated 2016-08-01)   |
| 90de440d-03f3-495d-a4ec-b2202fabe089 | parsing-qa-01                | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.159               | debian-9.5-stretch (deprecated 2018-11-22)  |
| b3edc934-0fb2-4d46-a805-7d02358d227a | bastion-restricted-eqiad1-01 | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.135, 185.15.56.14 | debian-9.5-stretch (deprecated 2018-11-22)  |
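
For triage, a pasted listing like the one above can be summarized with a few lines of Python. The sketch below assumes the five columns are ID, name, status, networks, and image, since the header row is not included in the paste. The column names and the helper names (`parse_row`, `addresses`, `summarize`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumed column order; the pasted table has no header row.
COLUMNS = ("id", "name", "status", "networks", "image")


def parse_row(row):
    """Split one pipe-delimited table row into a dict, or return None if malformed."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        return None
    return dict(zip(COLUMNS, cells))


def addresses(vm):
    """Return the list of addresses after 'network=' in the networks cell."""
    _, _, addrs = vm["networks"].partition("=")
    return [a.strip() for a in addrs.split(",") if a.strip()]


def summarize(rows):
    """Count instances per status and list the names that carry more than one address."""
    vms = [vm for vm in map(parse_row, rows) if vm is not None]
    by_status = Counter(vm["status"] for vm in vms)
    multi_address = [vm["name"] for vm in vms if len(addresses(vm)) > 1]
    return by_status, multi_address
```

Calling `summarize` on the rows above would give a per-status count and the names of instances with a second address listed, which may be the ones worth checking first once the hypervisor is back.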
Bstorm added a subscriber: Bstorm. Edited Jan 4 2020, 2:39 PM

The iDRAC web console reports a memory error on DIMM A8.

Change 561985 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: depool cloudvirt1016

https://gerrit.wikimedia.org/r/561985

Change 561985 merged by Andrew Bogott:
[operations/puppet@production] nova: depool cloudvirt1016

https://gerrit.wikimedia.org/r/561985

Bstorm added a comment. Jan 6 2020, 6:06 PM

Note for DCOps: this is not yet evacuated. Please get the DIMM on the way, but coordinate with WMCS before shutting down.

wiki_willy added projects: DC-Ops, ops-eqiad.
Restricted Application added a project: Operations. Jan 6 2020, 6:11 PM

@Jclark-ctr, it looks like this one is under warranty until May 2020, so you can just RMA this with Dell. Thanks, Willy

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Jan 6 2020, 6:13 PM

Confirmed: Service Request 1009577756 was successfully submitted.

JHedden added a subscriber: JHedden. Fri, Feb 7, 5:39 PM

This host has been taken out of service, so we can perform maintenance on it at any time.

Mentioned in SAL (#wikimedia-cloud) [2020-02-07T18:11:26Z] <jeh> shutdown cloudvirt1016 for hardware maintenance T241882

Replaced failed DIMM A8.

JHedden closed this task as Resolved. Fri, Feb 7, 6:35 PM

Thanks, @Jclark-ctr, for replacing the DIMM. I verified that the host looks good now.