
servers freeze across the caching cluster
Open, High, Public

Description

During the last 9 days three caching nodes went down with the same symptoms:

  • Nothing on the SEL
  • KVM unresponsive
  • Network down
  • Nothing on the logs

A power cycle fixed them.

So far the affected systems are PowerEdge R440:

  • cp3053 - T239041
  • cp1077 - T238289
  • cp3057 - T237348 T239502
  • cp3065 - T238032 and 2020-01-05
  • db2125 - T239042 Kernel at the time of the crash: Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 GNU/Linux
  • cp3063 - T239310
  • cp1087 - T239449
  • cp3055 - T240425 (twice, same task, I think the firmware has not yet been updated)
  • backup2001 - T240177 T237730 T240177#5773711 (crashed 3 times, the second crash happened with the firmware running the latest version)
  • cp3051 - T241306
  • cp3061 - crashed 2019-12-28T23:36

Maybe a kernel upgrade or a CPU microcode update is messing with them?

Event Timeline

Restricted Application added a project: Operations. Nov 14 2019, 9:36 AM
Restricted Application added a subscriber: Aklapper.

I tried to narrow this down a bit, but no real luck:

  • These haven't seen a microcode update yet (and the previous microcode update round dates back quite a while).
  • All three affected servers are now running 4.9.189, but cp1077 has only been doing so since its reboot (where it picked up .189; it was previously running 4.9.168). cp3065 and cp3057, on the other hand, were already running 4.9.189 when they crashed. So it doesn't seem to be related to the new 4.9.189 kernel either.

cp1077 might also be a totally different issue than the cp3* hosts (which are from the same model/generation/ordering batch). In kern.log on cp1077 there are two oopses from Nov 5; it's not unlikely that these corrupted some internal state, which eventually made the server crash later.

Peachey88 updated the task description. Nov 14 2019, 11:28 AM
CDanis added a subscriber: CDanis. Nov 25 2019, 3:43 AM

db2125 crashed too and it is a new R440: T239042: db2125 crashed

Marostegui updated the task description. Nov 25 2019, 8:39 AM
Marostegui updated the task description. Nov 25 2019, 8:56 AM
Volans added a subscriber: Volans. Nov 25 2019, 1:52 PM

If needed, full list of R440 available here: https://puppetboard.wikimedia.org/fact/productname/PowerEdge+R440 (intentionally not mentioning their count here)

BBlack added a comment (edited). Nov 25 2019, 5:43 PM

It was observed earlier in the traffic meeting that we're fairly certain none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon. In that case, it's also quite likely this can be pre-empted on the hosts that haven't crashed yet by giving them a reboot (e.g. something deep has changed while the servers are live, and they stabilize once they've done a fresh boot with it, possibly a live update of some microcode or firmware?)

Vgutierrez updated the task description. Nov 27 2019, 7:05 AM
06:13:59 <+icinga-wm> PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100%

Could this be another case of an R440 going down?

Vgutierrez updated the task description.

It was observed earlier in the traffic meeting that we're fairly certain none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon. In that case, it's also quite likely this can be pre-empted on the hosts that haven't crashed yet by giving them a reboot (e.g. something deep has changed while the servers are live, and they stabilize once they've done a fresh boot with it, possibly a live update of some microcode or firmware?)

Please note that this is no longer the case, cp3057 has been affected twice already in less than a month, see T237348 and T239502

And: [10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%. cp3053 had already failed once before: T239041.

ema added a comment. Dec 11 2019, 9:12 AM

On 2019-12-10 cp3055 went down too:

19:33 <+icinga-wm> PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100%

Depooled and power-cycled by @elukey on 2019-12-11T08:04.

ema updated the task description. Dec 11 2019, 9:13 AM

Mentioned in SAL (#wikimedia-operations) [2019-12-11T09:14:47Z] <ema> repool cp3055 T238305

ema updated the task description. Dec 11 2019, 9:19 AM

See: T240177 T237730. backup2001 was updated to a new BIOS the last time it crashed.

Do we have somewhere to collect the kernel versions of the hosts and whether they were upgraded before/after the crash?
I upgraded db2125's kernel when it crashed to:

root@db2125:~# uname -a
Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

And as I listed on the task description, the running kernel at the time of the crash: Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 GNU/Linux
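Since these `uname -a` banners are what we have for comparing hosts before and after the crashes, a small helper can pull out the Debian kernel package version for tabulation. This is just a sketch; the sed pattern assumes the standard Debian banner format shown above:

```shell
# Extract the Debian kernel package version (e.g. 4.9.189-3+deb9u1)
# from a `uname -a` banner. Assumes the "... Debian <version> (<date>) ..." layout.
kernel_pkg_version() {
    printf '%s\n' "$1" | sed -n 's/.*Debian \([^ ]*\).*/\1/p'
}

kernel_pkg_version 'Linux db2125 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 GNU/Linux'
# 4.9.189-3+deb9u1
```

Run over `uname -a` output collected from each host, this gives a single sortable column to correlate kernel versions with crash dates.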

Marostegui updated the task description. Dec 11 2019, 9:50 AM

Some observations:

  • I'm pretty sure this is unrelated to the kernel; we've seen these crashes with both 4.9 and 4.19
  • backup2001 had latest firmware when it crashed
  • backup2001 had almost the latest CPU microcode when it crashed (the 2019-11-12 release). There is a 2019-11-15 release, but some CPUs are failing to reboot with that microcode and there are reports of overheating; last night there was an update in Debian unstable which rolled back the microcode for one CPU type, but 2019-11-15 could still be an option to test. Intel doesn't really say what they changed between 2019-11-12 and 2019-11-15, but they went through the hassle of doing a new release, so there must be some reason...
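To record which microcode revision a host was actually running at crash time, the `microcode` field of /proc/cpuinfo can be read. A minimal sketch (the revision value in the comment is made up for illustration):

```shell
# Print the loaded microcode revision from /proc/cpuinfo-style input;
# the relevant line looks like: "microcode	: 0x2000065" (example value).
microcode_rev() {
    awk -F': *' '/^microcode/ {print $2; exit}' "$1"
}

# On a live host: microcode_rev /proc/cpuinfo
```

Capturing this per host alongside the kernel version would make it easy to see whether the crashed hosts share a microcode revision.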

I also found this thread, which sounds very similar: https://www.dell.com/community/PowerEdge-OS-Forum/Random-Reboot-R740/td-p/5169703/page/3

I think we could try two different things independently of each other (to see whether each is effective by itself):

  • Disable the C / C1E states in Performance settings on 2-3 affected servers
  • Upgrade 2-3 affected servers to the 2019-11-15 microcode
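For the first option, besides the BIOS performance settings, deeper C-states can also be capped from the kernel side. A sketch of the relevant GRUB fragment (an assumption for illustration, not necessarily how it would be done on these hosts):

```shell
# /etc/default/grub: cap C-states from the kernel side
# (run update-grub and reboot afterwards).
# intel_idle.max_cstate=1 limits the intel_idle driver to C1;
# processor.max_cstate=1 does the same for the ACPI idle driver.
GRUB_CMDLINE_LINUX="intel_idle.max_cstate=1 processor.max_cstate=1"
```

Doing it from the kernel command line rather than the BIOS has the advantage of being reversible and auditable via /proc/cmdline.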
ema added a comment. Dec 11 2019, 11:06 AM

See: T240177 T237730. backup2001 was updated to a new BIOS the last time it crashed.

cp3053 too (T239041) and has been running fine since, FWIW.

See: T240177 T237730. backup2001 was updated to a new BIOS the last time it crashed.

cp3053 too (T239041) and has been running fine since, FWIW.

To answer the "too": backup2001 crashed before being upgraded, and again afterwards.

faidon added a subscriber: faidon. Dec 13 2019, 3:55 PM

Note that R440s comprise 23.5% of the whole fleet, 84.1% of all servers purchased in the last 12 months, and 67.5% of all servers purchased in the last 24 months (I wish I had a graph!). Given this sample size, this may be merely correlated with R440s rather than specifically tied to them.

jcrespo added a comment (edited). Dec 13 2019, 4:44 PM

I believe the main concern is that it seems to happen only on recent batches, if I am not mistaken. Indeed, the correlation could be based on CPU models or something else (kernel version).

Volans updated the task description. Dec 21 2019, 11:27 PM
ema updated the task description. Sun, Dec 29, 10:55 AM

Mentioned in SAL (#wikimedia-operations) [2019-12-29T10:57:03Z] <ema> repool cp3061 T238305

ema added a comment. Sun, Dec 29, 11:08 AM

cp3061 crashed today, yet another cache_upload node in esams, continuing the trend mentioned in T241306#5759233. DC-Ops: is there anything you can think of that differentiates esams upload hosts, cp30(5[13579]|6[135]), from text cp30(5[02468]|6[024])? An obvious one is network utilization, significantly higher on upload hosts, but maybe there's something else hardware-related that we're overlooking?
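For reference, the upload/text split encoded in the hostname patterns above can be expressed as a small helper, which is handy when eyeballing icinga alerts. This is just a sketch of the naming convention as stated in this comment:

```shell
# Classify an esams cp host by the hostname convention above:
# cp30(5[13579]|6[135]) are cache_upload, cp30(5[02468]|6[024]) are cache_text.
cp3_role() {
    case "$1" in
        cp305[13579]|cp306[135]) echo upload ;;
        cp305[02468]|cp306[024]) echo text ;;
        *) echo unknown ;;
    esac
}

cp3_role cp3061   # upload
cp3_role cp3054   # text
```

So far every crashed esams host maps to "upload" under this function, which is the trend being described.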

ema updated the task description. Sun, Dec 29, 11:14 AM
Marostegui updated the task description. Fri, Jan 3, 1:14 PM

In going through all the affected systems in this task, I'd like to treat db2125 and backup2001 separately, since they seem like one-offs and could very well be hardware issues (db2125 was 1 of 10 systems in its batch, and backup2001 was ordered 1.5 years ago). Even the two eqiad systems, cp1077 and cp1087, have been around since May 2018, so those could also be related or unrelated to the cp crashes in esams.

But if we were to focus on just the cp machines in esams, there's nothing from a racking/cabling perspective that we did differently between the odd-numbered (upload) and even-numbered (text) hostnames. They share the same racks and the same asw switches. So if only the upload hosts are seeing issues, it seems like it might be something from that side. I can definitely reach out to Dell as well, to see if there are any known firmware issues, etc., though I can tell you the first thing they're going to want us to do beforehand is upgrade the firmware on everything and then send them the logs from diagnostics testing. Let me know if you want us to go that route. We could also focus on just cp3055 for now, since its firmware has already been upgraded and it is still seeing issues.

Thanks,
Willy

Mentioned in SAL (#wikimedia-operations) [2020-01-05T23:56:00Z] <effie> powecycle cp3065.esams.wmnet T238305

Mentioned in SAL (#wikimedia-operations) [2020-01-06T00:06:11Z] <effie> pool cp3065 T238305

ema updated the task description. Mon, Jan 6, 1:24 PM

In going through all the affected systems in this task, I'd like to treat db2125 and backup2001 separately, since they seem like one-offs and could very well be hardware issues (db2125 was 1 of 10 systems in its batch, and backup2001 was ordered 1.5 years ago). Even the two eqiad systems, cp1077 and cp1087, have been around since May 2018, so those could also be related or unrelated to the cp crashes in esams.

backup2001 has crashed 3 times already (even with up-to-date BIOS and firmware), so I am not fully sure it should be treated separately. The crashes follow the exact same pattern we've seen so far (no OS logs and no HW logs either).
I don't think there is much else we can do without Dell's assistance (T240177#5727654), as there are no logs to provide or send.

Papaul added a subscriber: Papaul. Tue, Jan 7, 11:50 PM

backup2001 is at BIOS version 1.3.7; last time we did only the iDRAC upgrade, since sometimes, when the iDRAC version is not up to date, we might not see any log at a system crash. So I think let's start by getting all those servers to the latest BIOS and iDRAC firmware and go from there (see comment on T237730).

Thanks for the clarification. My understanding was that we had upgraded the BIOS as well. Let's start with that, indeed.

ema added a comment. Wed, Jan 8, 7:41 AM

sometimes, when the iDRAC version is not up to date, we might not see any log at a system crash

Interesting!

so I think let's start by getting all those servers to the latest BIOS and iDRAC firmware and go from there (see comment on T237730)

+1, thanks @Papaul

Jan 12 22:51:15 <icinga-wm>	PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100%
Jan 12 22:53:51 <icinga-wm>	PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%

Perhaps a little too close together in timing?

Mentioned in SAL (#wikimedia-operations) [2020-01-13T00:22:18Z] <effie> depool and restart cp3065 cp3061 - T238305

jijiki added a subscriber: jijiki. Mon, Jan 13, 12:59 AM

prometheus-trafficserver-tls-exporter.service initially failed to start on both cp3065 and cp3061 after reboot

Mentioned in SAL (#wikimedia-operations) [2020-01-18T04:15:53Z] <cdanis> cp3065.mgmt: /admin1-> racadm serveraction hardreset T238305

03:16:58	<+icinga-wm>	PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100%

Nothing in racadm getsel or racadm lclog view (the latter just shows me logging in over SSH).

Mentioned in SAL (#wikimedia-operations) [2020-01-19T00:46:30Z] <cdanis> T238305 cp3053.mgmt /admin1-> racadm serveraction hardreset

CDanis added a comment (edited). Sun, Jan 19, 12:46 AM
22:22:06	<+icinga-wm>	PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%

Nothing in the logs, as usual.

Is there any action plan to investigate these issues?

Is there any action plan to investigate these issues?

Currently, T242579 is our only hope of getting more information about this issue.

18:17:56 <+icinga-wm> PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%
This might be another case...

Mentioned in SAL (#wikimedia-traffic) [2020-01-20T08:07:05Z] <ema> powercycle cp3061 T238305