Page MenuHomePhabricator

mw2448.codfw.wmnet is down
Closed, ResolvedPublicBUG REPORT

Description

While running scap for a security deploy, mw2448.codfw.wmnet had php-fpm fail to restart. I ran scap for a second time and received the same error. sync-apaches also failed on mw2448.codfw.wmnet. Log is below:

22:41:05 Started sync-apaches
22:44:27 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--include', 'php-1.42.0-wmf.9', '--include', 'php-1.42.0-wmf.9/extensions', '--include', 'php-1.42.0-wmf.9/extensions/PageTriage', '--include', 'php-1.42.0-wmf.9/extensions/PageTriage/modules', '--include', 'php-1.42.0-wmf.9/extensions/PageTriage/modules/***', 'mw2289.codfw.wmnet', 'mw1366.eqiad.wmnet', 'mw1420.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw2300.codfw.wmnet', 'mw2259.codfw.wmnet', 'deploy1002.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'mw1398.eqiad.wmnet'] (ran as mwdeploy@mw2448.codfw.wmnet) returned [255]: ssh: connect to host mw2448.codfw.wmnet port 22: Connection timed out

22:44:27 sync-apaches: 100% (in-flight: 0; ok: 358; fail: 1; left: 0)
22:44:27 Per-host sync duration: average 2.0s, median 1.0s
22:44:27 rsync transfer: average 0 bytes/host, total 0 bytes
22:44:27 1 apaches had sync errors
22:44:27 Finished sync-apaches (duration: 03m 22s)
22:44:27 Started php-fpm-restarts
22:44:27 Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807' on 295 host(s)
22:48:27 /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 (ran as mwdeploy@mw2448.codfw.wmnet) returned [255]: ssh: connect to host mw2448.codfw.wmnet port 22: Connection timed out

22:48:27 php-fpm-restart: 100% (in-flight: 0; ok: 294; fail: 1; left: 0)
22:48:27 1 hosts had failures restarting php-fpm

Event Timeline

bd808 renamed this task from Php-fpm failed to start on one host during Scap to Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out).Dec 18 2023, 11:30 PM

Thanks for filing this!

There have been several reports of mw2448 going offline this year: T345597, T334429 + T334420 (probably the same event)

-------------------------------------------------------------------------------
Record:      3
Date/Time:   12/18/2023 19:48:23
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------

Mentioned in SAL (#wikimedia-operations) [2023-12-18T23:40:58Z] <taavi> conftool codfw/appserver/nginx/mw2448.codfw.wmnet: pooled changed yes => inactive # T353679, not sure why it was not logged automatically

I've set this host as pooled=inactive in conftool. Tagging serviceops since this is their host and ops-codfw directly given the history of this host.

taavi renamed this task from Php-fpm failed to start on one host during Scap (ssh to mw2448.codfw.wmnet timed out) to mw2448.codfw.wmnet is down.Dec 18 2023, 11:45 PM

Thanks @taavi for setting this host to inactive. The CPU 1 machine check error was also logged one more time,

-------------------------------------------------------------------------------
Record:      9
Date/Time:   12/18/2023 20:52:16
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------

surrounded by a lot of An OEM diagnostic event occurred. as well as 1 instance of

-------------------------------------------------------------------------------
Record:      8
Date/Time:   12/18/2023 20:52:16
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------

Side note: I love how the detection of a problem has a severity of OK.

RAC log also reports these 2 interesting entries. (ractime is correct, so timestamps should be trusteable)

--------------------------------------------------------------------------------
SeqNumber       = 6926
Message ID      = SYS1005
Category        = Audit
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2023-12-18 19:49:41
Message         = The server power action is initiated because the host device initiated a warm-reset operation.
Message Arg   1 = the host device initiated a warm-reset operation
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 6925
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2023-12-18 19:48:55
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------

@Jhancock.wm, @Papaul, I 've refrained from taking any action like rebooting the server so you can debug it further if you wish. I should note that this is the 3rd ticket for this host in 8 months and apparently the 4th occurrence (I see we had 2 events in 1 ticket).

Should the icinga alert for host down and related service alerts in icinga and alertmanager be silenced given it's known and there is a task?

Icinga downtime and Alertmanager silence (ID=c526ca54-768b-461b-9bc7-1666a80b4153) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hw failure

mw2448.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=caa5cb6a-6361-4151-9946-cb47394c0eec) set by cgoubert@cumin2002 for 14 days, 0:00:00 on 1 host(s) and their services with reason: hw failure

mw2448.codfw.wmnet

I've put in a dispatch request with Dell to replace the CPU. I have power cycled the server and replaced the CMOS battery. which fixed one of the errors. will update when part arrives.

Jhancock.wm claimed this task.

Got the part replaced. The bios settings got wiped when the CMOS battery got swapped out. I believe everything is back to how it should be. the idrac and bios versions are current.

Going to close this for now but please let me know if I missed anything in the bios that impacts the server and we can schedule downtime to remedy it.

@Jhancock.wm the easiest and safest way to reconfigure a BIOS is to run the sre.hosts.provision cookbook like it was a new host just with some options to skip unnecessary steps like --no-dhcp, --no-switch, --no-users. See the --help for more details.

thank you. I'll remember that one.

This comment was removed by sbassett.

Mentioned in SAL (#wikimedia-operations) [2023-12-23T15:40:26Z] <taavi> fix date-time on mw2448 (which thought it is the year 2098) by manually setting it once and then restarting systemd-timesyncd.service after bios was reset in T353679

The machine isn't pooled yet into traffic. There is an alert for frequent changes due to puppet run. Indeed the following happens at every puppet run

Notice: /Stage[main]/Cpufrequtils/Exec[cpufrequtils_reload]/returns: executed successfully (corrective)

cpufreq-info, the binary that is executed as the check triggering that reload spews out the following 48 times (1 per CPU)

analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  maximum transition latency: 4294.55 ms.

I 've run the sre.hosts.provision cookbook as instructed by @Volans and indeed it applied

[snip]
Updated value for attribute BIOS.Setup.1-1 -> ProcPwrPerf (marked Set On Import to True): SysDbpm => OsDbpm 
Updated value for attribute BIOS.Setup.1-1 -> SysProfile: PerfPerWattOptimizedDapc => PerfPerWattOptimizedOs 
Updated value for attribute BIOS.Setup.1-1 -> UsbPorts: AllOn => OnlyBackPortsOn
Updated value for attribute iDRAC.Embedded.1 -> WebServer.1#HostHeaderCheck: Disabled => Enabled

A reboot later (which was probably not warranted, I just did it anyway) I see everything ok.

I ran scap pull to synchronize the deployment of mediawiki with other mw hosts since the host was as as inactive and was getting the latest and I am pooling all services on mw2448.