es2019 is not responsive
Closed, ResolvedPublic

Assigned To

Authored By

	• Banyek
	Jan 3 2019, 8:23 AM

Description

The host is down and the serial console is unresponsive.
I was not able to reset the host with iDRAC, so I am depooling it now and will then continue the investigation.
The host is not in service.
According to Grafana (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=es2019&var-port=9104&from=1546494156394&to=1546504956394), the host became unresponsive at 2019-01-03 07:40.

Details

	Subject	Repo	Branch	Lines +/-
	db-codfw.php: Depool es2019	operations/mediawiki-config	master	+1 -1
	mariadb: depool es2019	operations/mediawiki-config	master	+1 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T130702 Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March
		Resolved		Marostegui	T212833 es2019 is not responsive

Event Timeline

• Banyek created this task. Jan 3 2019, 8:23 AM

Restricted Application added a subscriber: Aklapper. Jan 3 2019, 8:23 AM

• Banyek triaged this task as Unbreak Now! priority. Jan 3 2019, 8:23 AM

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Jan 3 2019, 8:23 AM

Change 481989 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] mariadb: depool es2019

https://gerrit.wikimedia.org/r/481989

gerritbot added a project: Patch-For-Review. Jan 3 2019, 8:25 AM

Change 481989 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: depool es2019

https://gerrit.wikimedia.org/r/481989

Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:24Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: depool es2019, host is unsresponsible - T212833 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:49Z] <banyek> depooled es2019 as host was unsresponsive - T212833

I am triaging this as 'High' rather than 'Unbreak Now!', because the host wasn't in service.

Following https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30, I reset the host with
racadm serveraction hardreset, and the console is now available.

After the hard reset, I didn't find anything in the logs. From /var/log/syslog:

Jan  3 07:35:01 es2019 CRON[16225]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan  3 07:35:01 es2019 CRON[16226]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'nf_conntrack'
Jan  3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'ipmi_devintf'
Jan  3 09:07:53 es2019 systemd-sysctl[1081]: Couldn't write '65' to 'net/netfilter/nf_conntrack_tcp_timeout_time_wait', ignoring: No such file or directory
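The run of ^@ characters above is a block of NUL bytes between the last entry before the crash and the first entry after the reboot. A minimal sketch for locating such gaps in a log file, assuming that a run of 16 or more NUL bytes is a reasonable threshold (the threshold and the function name find_nul_gaps are illustrative, not part of any existing tooling):

```python
import re

# Assumption: 16 or more consecutive NUL bytes mark a gap left by an unclean shutdown.
NUL_RUN = re.compile(rb"\x00{16,}")


def find_nul_gaps(data: bytes):
    """Return (offset, length, last_line_before, first_line_after) for each NUL run."""
    gaps = []
    for match in NUL_RUN.finditer(data):
        # Last complete log line written before the NUL run.
        before = data[:match.start()].rstrip(b"\n").rsplit(b"\n", 1)[-1]
        # First log line that follows the NUL run (usually from the next boot).
        after = data[match.end():].split(b"\n", 1)[0]
        gaps.append((
            match.start(),
            match.end() - match.start(),
            before.decode("utf-8", "replace"),
            after.decode("utf-8", "replace"),
        ))
    return gaps
```

It could be run on the bytes of /var/log/syslog (opened in binary mode); the two lines returned for each gap bracket the window in which the host went down.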

MariaDB started without any problems, and replication has resumed.

Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:18:39Z] <banyek> repooling es2019 - T212833

Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:26:06Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: repool es2019 - T212833 (duration: 01m 33s)

• Banyek closed this task as Resolved. Jan 3 2019, 9:27 AM

• Banyek claimed this task.

• Banyek updated the task description.

Please run a full check of the tables to make sure data is ok
Should be easy as there is only one table per DB

I'll start with this in the morning

• Banyek moved this task from Triage to In progress on the DBA board. Jan 3 2019, 10:35 PM

The cause of the crash was apparently memory related

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
		RecordFormat = string Description
		RecordID = 8
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record2

	properties
		CreationTimestamp = 20190103073754.000000-360
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
		RecordFormat = string Description
		RecordID = 7
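When there are many records, the affected DIMM slots can be pulled out of the pasted output mechanically. A rough sketch, assuming each record starts with a CreationTimestamp line and that slot names look like DIMM_A1 (both assumptions are taken from the two records above, not from any iDRAC documentation; parse_sel and failing_dimms are hypothetical helper names):

```python
import re

# Matches "Key = value" property lines from the "show recordN" output.
FIELD = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")
# Assumption: slot names follow the DIMM_<letter><number> pattern seen above.
DIMM = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split pasted SEL output into one dict of properties per record."""
    records = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue  # prompts, section names and verbs carry no "Key = value" pair
        key, value = match.groups()
        if key == "CreationTimestamp":
            records.append({})  # assumption: this field opens every record
        if records:
            records[-1][key] = value
    return records


def failing_dimms(records):
    """Return the sorted, de-duplicated DIMM slots named in RecordData."""
    slots = set()
    for record in records:
        slots.update(DIMM.findall(record.get("RecordData", "")))
    return sorted(slots)
```

For the two records above this would report DIMM_A1 and DIMM_B2.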

Let's make sure data is ok before repooling.
Let's also upgrade MySQL, kernel, BIOS and firmware? @Papaul can you help us with the firmware and BIOS part?

Restricted Application added a project: SRE. Jan 4 2019, 3:20 PM

Marostegui added a parent task: T130702: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March. Jan 4 2019, 3:20 PM

On cumin2001 I have a comparison running inside a screen session, started from /home/banyek.
The script used is the following:

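#!/bin/bash
# Compare the blobs table of every database on es2019 against es2018.

for db in $(mysql.py -h es2018 -BN -e "SHOW DATABASES"); do
  echo "checking database $db" >> compare_es.log
  # Arguments: database, table, key column, first host, second host
  ./wmfmariadbpy/wmfmariadbpy/compare.py "$db" blobs_cluster25 blob_id es2018 es2019 >> compare_es.log
done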

The comparison finished, and the data is OK.
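For readers unfamiliar with this kind of check, the general idea of comparing a table between two hosts in key ranges can be sketched as follows. This is only an illustration over in-memory dictionaries and is not how compare.py is implemented; the function names, the CRC32 checksum and the chunk size are all assumptions made for the example:

```python
import zlib


def chunk_checksum(rows, lo, hi):
    """CRC32 over the rows whose key falls in [lo, hi), visited in key order."""
    crc = 0
    for key in sorted(k for k in rows if lo <= k < hi):
        crc = zlib.crc32(f"{key}:".encode() + rows[key], crc)
    return crc


def differing_chunks(rows_a, rows_b, chunk_size=1000):
    """Return the (lo, hi) key ranges whose checksums differ between two hosts.

    rows_a and rows_b map an integer key (such as blob_id) to the row bytes.
    """
    keys = set(rows_a) | set(rows_b)
    if not keys:
        return []
    end = max(keys) + 1
    diffs = []
    for lo in range(min(keys), end, chunk_size):
        hi = min(lo + chunk_size, end)
        if chunk_checksum(rows_a, lo, hi) != chunk_checksum(rows_b, lo, hi):
            diffs.append((lo, hi))
    return diffs
```

An empty result means every range matched; a non-empty one narrows down where to look row by row.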

Change 482814 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool es2019

https://gerrit.wikimedia.org/r/482814

Change 482814 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool es2019

https://gerrit.wikimedia.org/r/482814

Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:10:22Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool es2019 - T212833 (duration: 00m 44s)

I have depooled es2019 so it is ready to be powered off once @Papaul is ready for it

Assigning to @Papaul as per our chat

Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:32:35Z] <marostegui> Stop MySQL on es2019 for upgrade - T212833

Update

BIOS from 2.4.3 to 2.8.0
IDRAC from 2.40 to 2.61

The system is powered on.

Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:34:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool es2019 - T212833 (duration: 02m 51s)

Thank you! I have repooled the server!

Liuxinyu970226 unsubscribed. Jan 16 2019, 1:33 AM
