
es2019 is not responsive
Closed, Resolved, Public

Description

The host is down and the serial console is unresponsive.
I was not able to reset the host through iDRAC, so I am depooling it now and will then continue the investigation.
The host is not in service.
According to Grafana (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=es2019&var-port=9104&from=1546494156394&to=1546504956394), the host became unresponsive at 2019-01-03 07:40.

Event Timeline

Banyek triaged this task as Unbreak Now! priority. Jan 3 2019, 8:23 AM

Change 481989 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] mariadb: depool es2019

https://gerrit.wikimedia.org/r/481989

Change 481989 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: depool es2019

https://gerrit.wikimedia.org/r/481989

Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:24Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: depool es2019, host is unsresponsible - T212833 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:49Z] <banyek> depooled es2019 as host was unsresponsive - T212833

Banyek lowered the priority of this task from Unbreak Now! to High. Jan 3 2019, 8:44 AM
Banyek updated the task description.

I triaged this as 'High' rather than 'Unbreak Now!' because the host wasn't in service.

Following https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30, I reset the host with
racadm serveraction hardreset
and the console is now available.

After the hard reset, I didn't find anything in the logs that explains the outage.
/var/log/syslog:

Jan  3 07:35:01 es2019 CRON[16225]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan  3 07:35:01 es2019 CRON[16226]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'nf_conntrack'
Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'ipmi_devintf'
Jan  3 09:07:53 es2019 systemd-sysctl[1081]: Couldn't write '65' to 'net/netfilter/nf_conntrack_tcp_timeout_time_wait', ignoring: No such file or directory
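
The run of ^@ characters is how NUL bytes are displayed, which is typical of a log file that was being written when the machine went down uncleanly. As a rough way to bracket the outage, the sketch below looks for the last timestamp before the NUL run and the first one after it. It assumes the traditional "Mmm dd hh:mm:ss" syslog prefix, an English locale for month names, and a year supplied by the caller (the log itself does not record one). The function name find_crash_gap is made up for this example.

```python
import re
from datetime import datetime

# Either a run of NUL bytes, or a classic syslog timestamp such as "Jan  3 07:35:01".
TOKEN_RE = re.compile(r"(\x00+)|([A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2})")


def find_crash_gap(text, year=2019):
    """Return (last_stamp_before_nuls, first_stamp_after_nuls), or None if no gap is found.

    `year` is an assumption supplied by the caller: syslog lines carry no year.
    """
    last_stamp = None
    nul_pending = False
    for match in TOKEN_RE.finditer(text):
        if match.group(1):
            # A NUL run only marks a gap if a timestamp was seen before it.
            nul_pending = last_stamp is not None
            continue
        # Collapse the space-padded day ("Jan  3") before parsing.
        cleaned = " ".join(match.group(2).split())
        stamp = datetime.strptime(f"{year} {cleaned}", "%Y %b %d %H:%M:%S")
        if nul_pending:
            return last_stamp, stamp
        last_stamp = stamp
    return None
```

For the excerpt above this gives 07:35:01 and 09:07:53, a window of about an hour and a half that contains both the 07:40 drop seen in Grafana and the hard reset. Timestamps that happen to appear inside message bodies would also match, so treat the result as a hint rather than proof.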

MariaDB started without any problems, and replication has resumed.

Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:26:06Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: repool es2019 - T212833 (duration: 01m 33s)

Banyek claimed this task.
Banyek updated the task description.
Marostegui added a subscriber: Marostegui.

Please run a full check of the tables to make sure the data is OK.
It should be easy, as there is only one table per DB.

I'll start with this in the morning

Marostegui added a subscriber: Papaul.

The cause of the crash was apparently memory related

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
		RecordFormat = string Description
		RecordID = 8
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record2

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
		RecordFormat = string Description
		RecordID = 7
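
When there are more than a couple of System Event Log records, it can help to pull out the affected DIMM slots programmatically. The following is a minimal sketch that parses console output in the shape shown above ("show recordN" followed by "Key = Value" lines). The function names parse_sel_records and failing_dimms are invented for this example, and the parsing assumes the output looks exactly like the excerpt here.

```python
import re


def parse_sel_records(text):
    """Split the console output into one dict of properties per 'show recordN' block."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("/admin1/") and "show record" in line:
            # A new prompt line starts a new record.
            current = {}
            records.append(current)
        elif current is not None and " = " in line:
            key, _, value = line.partition(" = ")
            current[key.strip()] = value.strip()
    return records


def failing_dimms(records):
    """Return the sorted DIMM slot names mentioned in memory-error records."""
    dimms = set()
    for record in records:
        data = record.get("RecordData", "")
        if "memory error" in data.lower():
            dimms.update(re.findall(r"DIMM_[A-Z]\d+", data))
    return sorted(dimms)
```

Fed the two records above, failing_dimms returns DIMM_A1 and DIMM_B2, both logged with the same CreationTimestamp.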

Let's make sure data is ok before repooling.
Let's also upgrade MySQL, kernel, BIOS and firmware? @Papaul can you help us with the firmware and BIOS part?

On cumin2001 I have a comparison running inside a screen session in /home/banyek.
The script used is the following:

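#!/bin/bash
# Compare blobs_cluster25 (keyed on blob_id) between es2018 and es2019 for every database.

for db in $(mysql.py -h es2018 -BN -e "SHOW DATABASES"); do
  echo "checking database $db" >> compare_es.log
  # Append the comparison output to the same log as the header line above.
  ./wmfmariadbpy/wmfmariadbpy/compare.py "$db" blobs_cluster25 blob_id es2018 es2019 >> compare_es.log
done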
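
For readers unfamiliar with this kind of check, the general idea of comparing one table on two hosts by ranges of a key column can be sketched as below. This is not the compare.py implementation and makes no claim about how it works internally. It is an illustration that uses in-memory sqlite3 connections as stand-ins for the two hosts, and the names chunk_digest, compare_table and step are made up for the example.

```python
import hashlib
import sqlite3


def chunk_digest(conn, table, key, low, high):
    """Hash every row whose key is in [low, high), in key order."""
    # Table and column names are interpolated directly, so they must be trusted.
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE {key} >= ? AND {key} < ? ORDER BY {key}",
        (low, high),
    )
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def compare_table(conn_a, conn_b, table, key, step=1000):
    """Return the inclusive (low, high) key ranges whose contents differ."""
    bounds = [
        conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
        for conn in (conn_a, conn_b)
    ]
    lows = [b[0] for b in bounds if b[0] is not None]
    highs = [b[1] for b in bounds if b[1] is not None]
    if not lows:
        return []  # both tables are empty
    mismatches = []
    low, end = min(lows), max(highs)
    while low <= end:
        high = low + step
        if chunk_digest(conn_a, table, key, low, high) != chunk_digest(
            conn_b, table, key, low, high
        ):
            mismatches.append((low, high - 1))
        low = high
    return mismatches
```

Working in ranges keeps each query small and narrows any difference down to a range of blob_id values instead of a single yes/no answer for the whole table.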

The comparison finished, and the data is OK.

Change 482814 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool es2019

https://gerrit.wikimedia.org/r/482814

Change 482814 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool es2019

https://gerrit.wikimedia.org/r/482814

Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:10:22Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool es2019 - T212833 (duration: 00m 44s)

I have depooled es2019 so it is ready to be powered off once @Papaul is ready for it

Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:32:35Z] <marostegui> Stop MySQL on es2019 for upgrade - T212833

Update:

BIOS from 2.4.3 to 2.8.0
iDRAC from 2.40 to 2.61

The system is powered on.

Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:34:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool es2019 - T212833 (duration: 02m 51s)

Thank you! I have repooled the server!