db2047 got rebooted
Closed, Resolved, Public

Description

db2047 got rebooted today around 11:50 UTC and paged. Things I've quickly checked:

  • Nothing particularly anomalous in grafana dashboards (both machine and MySQL stats)
  • Nothing in syslog (see below)
  • Nothing in mysql error log (last entry from the 21st)
  • All 14 event log entries in the management console seem to be post-reboot, starting at 12:07:17 UTC, and I've checked that the current time reported by the management console is correct.

Last entries from syslog show a standard puppet run before the reboot:

Sep 24 11:46:02 db2047 puppet-agent[47855]: Retrieving pluginfacts
Sep 24 11:46:02 db2047 puppet-agent[47855]: Retrieving plugin
Sep 24 11:46:03 db2047 puppet-agent[47855]: Loading facts
Sep 24 11:46:10 db2047 puppet-agent[47855]: Caching catalog for db2047.codfw.wmnet
Sep 24 11:46:12 db2047 puppet-agent[47855]: Applying configuration version '1506253566'
Sep 24 11:46:16 db2047 crontab[48240]: (root) LIST (root)
Sep 24 11:46:16 db2047 crontab[48242]: (root) LIST (prometheus)
Sep 24 11:46:19 db2047 puppet-agent[47855]: Finished catalog run in 7.89 seconds
Sep 24 11:50:20 db2047 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="979" x-info="http://www.rsyslog.com"] start
Sep 24 11:50:20 db2047 systemd-modules-load[459]: Inserted module 'nf_conntrack'
Sep 24 11:50:20 db2047 systemd-modules-load[459]: Inserted module 'ipmi_devintf'
Sep 24 11:50:20 db2047 systemd[1]: Started Load/Save Random Seed.
Sep 24 11:50:20 db2047 systemd[1]: Started Apply Kernel Variables.
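
In a longer syslog the same check can be scripted: find each rsyslogd start marker and measure the silence before it. A minimal Python sketch, assuming the traditional syslog timestamp format shown above; the function name is hypothetical, and the year has to be supplied by the caller because these syslog lines do not carry one.

```python
from datetime import datetime


def find_restart_gaps(lines, year=2017):
    """Return (last_line_before, start_line, gap_seconds) per rsyslogd start marker.

    Assumes each line begins with a 15-character timestamp such as
    'Sep 24 11:46:19'. Lines that do not parse are skipped.
    """
    results = []
    prev = None  # (timestamp, line) of the previous parsed line
    for raw in lines:
        line = raw.rstrip("\n")
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        is_start = "rsyslogd:" in line and line.rstrip().endswith("start")
        if is_start and prev is not None:
            gap = (ts - prev[0]).total_seconds()
            results.append((prev[1], line, gap))
        prev = (ts, line)
    return results


if __name__ == "__main__":
    sample = [
        "Sep 24 11:46:19 db2047 puppet-agent[47855]: Finished catalog run in 7.89 seconds",
        'Sep 24 11:50:20 db2047 rsyslogd: [origin software="rsyslogd"] start',
    ]
    for before, start, gap in find_restart_gaps(sample):
        print(f"{gap:.0f}s of silence before: {start}")
```

A gap of a few minutes with no shutdown messages before the start marker is consistent with an unclean reset rather than an orderly reboot.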

Since db2047 is a slave of s7 in codfw, I'm just ACKing the alarms on Icinga and referring them to this task for now WITHOUT starting MySQL, so that the DBA can check its data consistency and evaluate whether it needs reimporting.

Event Timeline

Volans triaged this task as High priority. Sep 24 2017, 12:13 PM

Logs:

769	 Informational	iLO 4	09/24/2017 11:49	09/24/2017 11:49	1	On-board clock set; was 09/24/2017 11:32:42.
768	 Caution	iLO 4	09/24/2017 11:30	09/24/2017 11:30	1	Server reset.
767	 Informational	iLO 4	09/24/2017 11:30	09/24/2017 11:30	1	Embedded Flash/SD-CARD: Restarted.
766	 Informational	iLO 4	09/24/2017 11:30	09/24/2017 11:30	2	Server power restored.
765	 Caution	iLO 4	09/24/2017 11:30	09/24/2017 11:30	1	Server reset.
764	 Informational	iLO 4	09/24/2017 11:30	09/24/2017 11:30	1	Server power removed.
763	 Informational	iLO 4	09/24/2017 11:30	09/24/2017 11:30	1	Power on request received by: Automatic Power Recovery.
...
760	 Informational	iLO 4	05/24/2017 17:07	05/24/2017 17:07	1	On-board clock set; was 05/24/2017 16:16:16.
759	 Informational	iLO 4	05/24/2017 16:13	05/24/2017 16:13	1	Embedded Flash/SD-CARD: Restarted.
758	 Informational	iLO 4	05/24/2017 16:13	05/24/2017 16:13	1	Server power restored.
757	 Caution	iLO 4	05/24/2017 16:13	05/24/2017 16:13	1	Server reset.

It looks like a power loss to me (?).
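
To pick the non-informational entries out of a longer paste like the one above, the rows can be parsed into records. A rough Python sketch, assuming the rows are tab-separated with seven fields in the order they appear above (the ID, severity, source, two timestamps, count and description labels are my reading of the paste, not taken from iLO documentation); all names here are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class IloEvent(NamedTuple):
    event_id: int
    severity: str
    source: str
    last_update: datetime
    count: int
    description: str


def parse_ilo_events(text):
    """Parse tab-separated iLO event rows; skip anything that does not fit."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        # Assumed layout: id, severity, source, last update, initial update,
        # count, description. Rows such as '...' are ignored.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        events.append(
            IloEvent(
                event_id=int(fields[0]),
                severity=fields[1],
                source=fields[2],
                last_update=datetime.strptime(fields[3], "%m/%d/%Y %H:%M"),
                count=int(fields[5]),
                description=fields[6],
            )
        )
    return events


def non_informational(events):
    """Keep only entries whose severity is not 'Informational'."""
    return [e for e in events if e.severity != "Informational"]


if __name__ == "__main__":
    sample = (
        "768\t Caution\tiLO 4\t09/24/2017 11:30\t09/24/2017 11:30\t1\tServer reset.\n"
        "767\t Informational\tiLO 4\t09/24/2017 11:30\t09/24/2017 11:30\t1\t"
        "Embedded Flash/SD-CARD: Restarted.\n"
    )
    for event in non_informational(parse_ilo_events(sample)):
        print(event.event_id, event.severity, event.description)
```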

Actually:

Critical	Environment	09/24/2017 11:30	09/24/2017 11:30	1	Critical Temperature Threshold Exceeded (Temperature Sensor 17, Location System, Temperature 127C)

127C, is the datacenter in flames?

@jcrespo interesting, I guess the documentation in https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0#Show_system_event_log_entries needs to be updated to include the command to show those other logs too. In the event logs there was no event reported before the reboot AFAICT.
There were other entries referring to the temperature after the reboot too, but the Icinga check was OK, so I assumed it was a false reading during the reboot process, like the other errors that referred to failed disks. But I might have misunderstood them.

Change 380311 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2047, hardware issues

https://gerrit.wikimedia.org/r/380311

Change 380311 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2047, hardware issues

https://gerrit.wikimedia.org/r/380311

This is what I did:

</system1/log1>hpiLO-> show record14

status=0
status_tag=COMMAND COMPLETED
Mon Sep 25 04:27:07 2017



/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=09/24/2017
    time=11:30
    description=Critical Temperature Threshold Exceeded (Temperature Sensor 17, Location System, Temperature 127C)
  Verbs
    cd version exit show set


</system1/log1>hpiLO->

I have checked other hosts in the same rack (C6) and there are no warnings on iLO or anything related.
@Papaul can you visually check the rack to see if there are any temperature warnings somewhere?
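
Checking several records or hosts for the same kind of entry is easier if the `show record` output above is turned into structured data. A minimal Python sketch, assuming the output has already been captured as text (for example from an SSH session); parse_ilo_record is a hypothetical helper and not part of any iLO tooling.

```python
import re


def parse_ilo_record(text):
    """Collect key=value lines from 'show recordN' output into a dict.

    If the description mentions 'Temperature <n>C', the reading is also
    stored as an integer under 'temperature_c'.
    """
    props = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and re.fullmatch(r"[A-Za-z_]+", key):
            props[key] = value
    match = re.search(r"Temperature (\d+)C", props.get("description", ""))
    if match:
        props["temperature_c"] = int(match.group(1))
    return props


if __name__ == "__main__":
    sample = """\
/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=09/24/2017
    time=11:30
    description=Critical Temperature Threshold Exceeded (Temperature Sensor 17, Location System, Temperature 127C)
"""
    record = parse_ilo_record(sample)
    print(record["severity"], record.get("temperature_c"))
```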

The temperature in C6 is about 106 F and the iLO log shows Critical Temperature Threshold Exceeded on 9-24 at 11:30 (see attachment)

Selection_007.png (113×1 px, 22 KB)

Thanks

Yep, that is what we saw on that host.
106 F is about 41 C, so that looks normal.
Could that have been a temporary thing for that particular host, assuming all the fans and so on are working fine?
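
For reference, the unit conversion behind the comparison above, as a small Python sketch (the function names are mine). It puts the rack reading and the iLO reading on the same scale.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


def celsius_to_fahrenheit(deg_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return deg_c * 9 / 5 + 32


if __name__ == "__main__":
    # Rack C6 reading reported in the thread.
    print(round(fahrenheit_to_celsius(106), 1))  # 41.1
    # Sensor 17 reading from the iLO record.
    print(round(celsius_to_fahrenheit(127), 1))  # 260.6
```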

Looking at the PDU temperature graphs, I cannot see anything odd there, so it might have been just a one-off event on this host.

Change 380687 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2047

https://gerrit.wikimedia.org/r/380687

The rack looks fine and so do the PDUs and their temperature graphs.
Going to repool this host; if it happens again we will really need to look into this, as either the rack or the hardware itself might have issues.

Change 380687 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2047

https://gerrit.wikimedia.org/r/380687

Marostegui claimed this task.

Mentioned in SAL (#wikimedia-operations) [2017-09-26T05:48:33Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2047 - T176573 (duration: 00m 44s)