
es2019 is not responsive
Closed, Resolved · Public

Description

The host is down and the serial console is unresponsive.
I was not able to reset the host with iDRAC, so I am depooling it now and will then continue the investigation.
The host is not in service.
According to Grafana (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=es2019&var-port=9104&from=1546494156394&to=1546504956394), the host became unresponsive at 2019-01-03 07:40.

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] db-codfw.php: Depool es2019
[operations/mediawiki-config@master] mariadb: depool es2019

Event Timeline

Banyek created this task. Jan 3 2019, 8:23 AM
Restricted Application added a subscriber: Aklapper. Jan 3 2019, 8:23 AM
Banyek triaged this task as Unbreak Now! priority. Jan 3 2019, 8:23 AM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Jan 3 2019, 8:23 AM

Change 481989 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] mariadb: depool es2019

https://gerrit.wikimedia.org/r/481989

Change 481989 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: depool es2019

https://gerrit.wikimedia.org/r/481989

Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:24Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: depool es2019, host is unsresponsible - T212833 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:49Z] <banyek> depooled es2019 as host was unsresponsive - T212833

Banyek lowered the priority of this task from Unbreak Now! to High. Jan 3 2019, 8:44 AM
Banyek updated the task description.

I triaged this as High rather than Unbreak Now! because the host was not in service.

Banyek added a comment. Jan 3 2019, 9:07 AM

Following https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30, I reset the host with
racadm serveraction hardreset, and the console is now available.

Banyek added a comment. Jan 3 2019, 9:14 AM

After the hard reset, I didn't find anything useful in the logs.
/var/log/syslog:

Jan  3 07:35:01 es2019 CRON[16225]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan  3 07:35:01 es2019 CRON[16226]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'nf_conntrack'
Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'ipmi_devintf'
Jan  3 09:07:53 es2019 systemd-sysctl[1081]: Couldn't write '65' to 'net/netfilter/nf_conntrack_tcp_timeout_time_wait', ignoring: No such file or directory
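
The syslog excerpt above jumps from 07:35:01 to 09:07:53, and the space in between is filled with ^@, the caret notation for NUL bytes. As a rough way to locate such gaps in other logs, here is a minimal Python sketch. The function names and the 16-byte threshold are illustrative assumptions, and the sample data is abbreviated from the excerpt rather than read from the real file.

```python
import re

# One or more consecutive NUL bytes (shown as ^@ by many pagers and editors).
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of NUL bytes at least min_len long."""
    return [
        (m.start(), len(m.group()))
        for m in NUL_RUN.finditer(data)
        if len(m.group()) >= min_len
    ]


def last_line_before(data: bytes, offset: int) -> str:
    """Return the last line of text that precedes the given byte offset."""
    head = data[:offset].rstrip(b"\n")
    return head.rsplit(b"\n", 1)[-1].decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Abbreviated stand-in for /var/log/syslog; not the real file contents.
    sample = (
        b"Jan  3 07:35:01 es2019 CRON[16225]: (root) CMD (...)\n"
        + b"\x00" * 64
        + b"Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'nf_conntrack'\n"
    )
    for offset, length in find_nul_runs(sample):
        print(f"{length} NUL bytes at offset {offset}")
        print(f"last entry before the gap: {last_line_before(sample, offset)}")
```

The last entry before the gap only gives an upper bound on when logging stopped; it does not by itself say why the host went down.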
Banyek added a comment. Jan 3 2019, 9:14 AM

MariaDB started without any problems, and replication has resumed.

Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:18:39Z] <banyek> repooling es2019 - T212833

Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:26:06Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: repool es2019 - T212833 (duration: 01m 33s)

Banyek closed this task as Resolved. Jan 3 2019, 9:27 AM
Banyek claimed this task.
Banyek updated the task description.
Marostegui reopened this task as Open. Jan 3 2019, 7:27 PM
Marostegui added a subscriber: Marostegui.

Please run a full check of the tables to make sure the data is OK.
This should be easy, as there is only one table per DB.

Banyek added a comment. Jan 3 2019, 9:51 PM

I'll start with this in the morning.

Banyek moved this task from Triage to In progress on the DBA board. Jan 3 2019, 10:35 PM
Marostegui added a subscriber: Papaul.

The cause of the crash was apparently memory related

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
		RecordFormat = string Description
		RecordID = 8
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record2

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
		RecordFormat = string Description
		RecordID = 7
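
Both records above name a DIMM slot in their RecordData field. When there are many such records, pulling the slots out by hand gets tedious, so here is a small Python sketch that extracts them from pasted `show record` output. The helper names and regular expressions are assumptions made for this example, based only on the two records shown here; other record formats may need different patterns.

```python
import re

# "Key = value" lines as they appear under "properties" in the output above.
FIELD = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")
# Slot names in the style seen in this task (DIMM_A1, DIMM_B2).
DIMM = re.compile(r"DIMM_[A-Z]\d+")


def parse_properties(record_text: str) -> dict:
    """Collect the key/value pairs of one pasted 'show recordN' block."""
    props = {}
    for line in record_text.splitlines():
        match = FIELD.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def affected_dimms(record_text: str) -> list:
    """Return the DIMM slots mentioned in the record's RecordData field."""
    return DIMM.findall(parse_properties(record_text).get("RecordData", ""))


if __name__ == "__main__":
    record = """/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
		RecordFormat = string Description
		RecordID = 8
"""
    print(parse_properties(record)["RecordID"], affected_dimms(record))
```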

Let's make sure data is ok before repooling.
Let's also upgrade MySQL, kernel, BIOS and firmware? @Papaul can you help us with the firmware and BIOS part?

Restricted Application added a project: Operations. Jan 4 2019, 3:20 PM
Banyek added a comment. Jan 4 2019, 3:44 PM

On cumin2001 I have a comparison running inside a screen session in /home/banyek.
The script used is the following:

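#!/bin/bash
# Compare the blobs table of every database on es2018 against es2019.

for db in $(mysql.py -h es2018 -BN -e "SHOW DATABASES"); do
  echo "checking database $db" >> compare_es.log
  # Positional arguments as used here: database, table, key column, then the two hosts.
  ./wmfmariadbpy/wmfmariadbpy/compare.py "$db" blobs_cluster25 blob_id es2018 es2019 >> compare_es.log
done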
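
For readers unfamiliar with this kind of check, the following Python sketch illustrates one common way to compare a table on two hosts: hash the rows in fixed ranges of the key column and report the ranges whose hashes differ. It is not the implementation of compare.py. It uses in-memory SQLite databases as stand-ins for es2018 and es2019, and the blob_text column, the helper names and the chunk size are assumptions made for the example.

```python
import hashlib
import sqlite3


def chunk_digest(conn, table, key, lo, hi):
    """Hash all rows whose key is in [lo, hi), read in key order."""
    digest = hashlib.sha256()
    # Table and column names are interpolated directly: trusted input only.
    query = f"SELECT * FROM {table} WHERE {key} >= ? AND {key} < ? ORDER BY {key}"
    for row in conn.execute(query, (lo, hi)):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def differing_chunks(conn_a, conn_b, table, key, max_key, step=1000):
    """Return the (lo, hi) key ranges whose rows differ between the two connections."""
    diffs = []
    for lo in range(0, max_key + 1, step):
        hi = lo + step
        if chunk_digest(conn_a, table, key, lo, hi) != chunk_digest(conn_b, table, key, lo, hi):
            diffs.append((lo, hi))
    return diffs


def make_db(rows):
    """Build an in-memory stand-in for one host (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blobs_cluster25 (blob_id INTEGER PRIMARY KEY, blob_text BLOB)")
    conn.executemany("INSERT INTO blobs_cluster25 VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    rows = [(i, f"payload-{i}".encode()) for i in range(1, 26)]
    reference = make_db(rows)
    # Simulate a single corrupted row on the second host.
    suspect = make_db([(i, b"corrupt" if i == 17 else text) for i, text in rows])
    print(differing_chunks(reference, suspect, "blobs_cluster25", "blob_id", max_key=25, step=10))
```

Comparing by range keeps each query small and narrows a mismatch down to a limited span of keys, which can then be inspected row by row.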
Banyek added a comment. Jan 7 2019, 9:32 AM

The comparison finished, and the data is OK.

Change 482814 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool es2019

https://gerrit.wikimedia.org/r/482814

Change 482814 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool es2019

https://gerrit.wikimedia.org/r/482814

Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:10:22Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool es2019 - T212833 (duration: 00m 44s)

I have depooled es2019, so it can be powered off as soon as @Papaul is ready.

Marostegui reassigned this task from Banyek to Papaul. Jan 8 2019, 3:21 PM

Assigning to @Papaul as per our chat

Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:32:35Z] <marostegui> Stop MySQL on es2019 for upgrade - T212833

Papaul reassigned this task from Papaul to Marostegui. Jan 8 2019, 4:19 PM

Update

BIOS from 2.4.3 to 2.8.0
iDRAC from 2.40 to 2.61

The system is powered on.

Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:34:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool es2019 - T212833 (duration: 02m 51s)

Marostegui closed this task as Resolved. Jan 8 2019, 4:34 PM

Thank you! I have repooled the server!