The host is down and the serial console is unresponsive.
I was not able to reset the host with iDRAC, so I am depooling it now and will then continue the investigation.
The host is not in service.
According to Grafana (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=es2019&var-port=9104&from=1546494156394&to=1546504956394), the host became unresponsive at 2019-01-03 07:40.
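The from and to parameters in the Grafana URL are Unix timestamps in milliseconds. A minimal sketch for turning them into readable UTC times, assuming GNU date is available (the function name epoch_ms_to_utc is made up for illustration):

```bash
#!/bin/bash
# Convert a Unix timestamp in milliseconds (as used in Grafana URLs)
# to a UTC date string. Relies on GNU date's "-d @SECONDS" syntax.
epoch_ms_to_utc() {
    date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S'
}

# Usage (values taken from the URL above):
#   epoch_ms_to_utc 1546494156394
#   epoch_ms_to_utc 1546504956394
```

For the URL above this gives a dashboard window from 05:42:36 to 08:42:36 UTC on 2019-01-03.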
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
db-codfw.php: Depool es2019 | operations/mediawiki-config | master | +1 -1
mariadb: depool es2019 | operations/mediawiki-config | master | +1 -1
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T130702 Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March
Resolved | | Marostegui | T212833 es2019 is not responsive
Event Timeline
Change 481989 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] mariadb: depool es2019
Change 481989 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: depool es2019
Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:24Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: depool es2019, host is unsresponsible - T212833 (duration: 00m 49s)
Mentioned in SAL (#wikimedia-operations) [2019-01-03T08:35:49Z] <banyek> depooled es2019 as host was unsresponsive - T212833
Following https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30, I reset the host with
racadm serveraction hardreset
and the console is now available.
After the hard reset, I didn't find anything in the logs:
/var/log/syslog
Jan 3 07:35:01 es2019 CRON[16225]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 3 07:35:01 es2019 CRON[16226]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan 3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'nf_conntrack'
Jan 3 09:07:53 es2019 systemd-modules-load[1062]: Inserted module 'ipmi_devintf'
Jan 3 09:07:53 es2019 systemd-sysctl[1081]: Couldn't write '65' to 'net/netfilter/nf_conntrack_tcp_timeout_time_wait', ignoring: No such file or directory
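The ^@ sequences in the excerpt above are the usual caret notation for NUL bytes. A block of them between the last entry before the crash and the first entry after boot is commonly seen after an unclean shutdown. As a rough sketch (the function name count_nul_bytes is hypothetical, not an existing tool), the number of NUL bytes in a log file could be counted like this:

```bash
#!/bin/bash
# Count NUL bytes in a file: tr deletes every byte except NUL,
# wc counts what is left. The final tr strips padding some wc
# implementations add around the number.
count_nul_bytes() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Usage:
#   count_nul_bytes /var/log/syslog
```

A non-zero result only indicates that the file contains NUL bytes. It does not say anything about why the host went down.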
Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:18:39Z] <banyek> repooling es2019 - T212833
Mentioned in SAL (#wikimedia-operations) [2019-01-03T09:26:06Z] <banyek@deploy1001> Synchronized wmf-config/db-codfw.php: repool es2019 - T212833 (duration: 01m 33s)
Please run a full check of the tables to make sure the data is OK.
It should be easy, as there is only one table per DB.
The cause of the crash was apparently memory-related:
/admin1/system1/logs1/log1-> show record1
properties
  CreationTimestamp = 20190103073754.000000-360
  ElementName = System Event Log Entry
  RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
  RecordFormat = string Description
  RecordID = 8
associations
targets
verbs
  cd show help version
/admin1/system1/logs1/log1-> show record2
properties
  CreationTimestamp = 20190103073754.000000-360
  ElementName = System Event Log Entry
  RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
  RecordFormat = string Description
  RecordID = 7
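When there are more event log records than the two shown above, the affected DIMM slots can be pulled out of saved console output. The following is only a sketch: it assumes the output has been saved as text and that the RecordData wording matches the records above. The function name list_failed_dimms is invented for illustration.

```bash
#!/bin/bash
# Read saved event log output on stdin and print the unique DIMM
# locations mentioned in "Multi-bit memory errors" records.
list_failed_dimms() {
    grep 'Multi-bit memory errors' | grep -oE 'DIMM_[A-Z][0-9]+' | sort -u
}

# Usage (sel.txt is a hypothetical file holding the pasted output):
#   list_failed_dimms < sel.txt
```

For the two records above this would list DIMM_A1 and DIMM_B2.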
Let's make sure data is ok before repooling.
Let's also upgrade MySQL, kernel, BIOS and firmware? @Papaul can you help us with the firmware and BIOS part?
On cumin2001, I have a comparison script running inside a screen session in /home/banyek.
The script used is the following:
#!/bin/bash
for db in $(mysql.py -h es2018 -BN -e "SHOW DATABASES"); do
    echo "checking database $db" >> compare_es.log
    ./wmfmariadbpy/wmfmariadbpy/compare.py $db blobs_cluster25 blob_id es2018 es2019 >>compare_es.log
done
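Since the script writes a "checking database" line to compare_es.log before each comparison, progress can be followed by counting those lines. A small sketch, relying only on that echo line and not on the output format of compare.py (the function name count_checked_dbs is hypothetical):

```bash
#!/bin/bash
# Count how many databases the comparison loop has started on,
# based on the "checking database <name>" lines it writes.
count_checked_dbs() {
    grep -c '^checking database ' "$1"
}

# Usage, from the directory where the script runs:
#   count_checked_dbs compare_es.log
```

Comparing this number with the number of databases returned by SHOW DATABASES on es2018 gives a rough idea of how far along the run is.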
Change 482814 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool es2019
Change 482814 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool es2019
Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:10:22Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool es2019 - T212833 (duration: 00m 44s)
Mentioned in SAL (#wikimedia-operations) [2019-01-08T15:32:35Z] <marostegui> Stop MySQL on es2019 for upgrade - T212833
Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:34:17Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool es2019 - T212833 (duration: 02m 51s)