
investigate lead hardware issue
Closed, Resolved (Public)

Description

Due to issues on system lead, hardware request task T147596 was created.

Lead experienced hardware issues, as documented at https://etherpad.wikimedia.org/p/gerrit-outage-20161006 (the contents of that pad have been copied below):

Responders: Chad, Alex, Brandon, Daniel

UTC on Oct 5
23:06 <     yurik> gerrit is really unhappy today :( 
23:22 <+   greg-g> yurik: first I heard, can you say more? otherwise I'll just have to ignore the comment
23:52 <   Krenair> seems fine to me yurik
23:52 <     yurik> greg-g, Krenair, sorry, just saw your replies. For some reason it took ~4-5 min for "git review" to go throuw
23:52 <     yurik> though
23:53 <     yurik> might have been just my connection, but IRC and other sites seem to be okayish
23:54 <     yurik> actually just checked - git pull takes considerable time, even though gerrit.wikimedia.org opens pretty fast 

* Starting at 17:49 UTC on 6 Oct, gerrit started becoming unresponsive. CPU usage was through the roof
** Puppet halted, default error page for Gerrit being shown

* Other symptoms:
** Apache using far too much CPU
** IO appears fine
** Too much sys cpu?
** Network traffic seems normal

* Restarting gerrit & apache had little effect
** Gerrit suffering from unacceptably slow startup times (logging module?)
* Rebooted server once, did not help
** No hardware errors appeared on reboot
* Rebooting a second time with older kernel
** Also did not seem to help


[22:59:10]        <bblack> the cpu cores are all running at like 200mhz right now
[23:00:12]        <bblack> root@lead:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq 
[23:00:14]        <bblack> 185253
[23:00:22]        <bblack> yeah they're all running at ~200Mhz
-- end etherpad --
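
The single-core check quoted in the pad can be extended to every core. The following is only a minimal sketch: the function name `read_cur_freqs_mhz` and the overridable `sysfs_root` parameter are made up for illustration, and it assumes the cpufreq sysfs files report kHz (so 185253 is roughly 185 MHz). Reading `cpuinfo_cur_freq` typically needs root.

```python
from pathlib import Path


def read_cur_freqs_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: current_frequency_in_MHz} for each core with cpufreq.

    sysfs_root is overridable (hypothetical parameter) so the function can be
    exercised against a fake directory tree instead of a live host.
    """
    freqs = {}
    pattern = "cpu[0-9]*/cpufreq/cpuinfo_cur_freq"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        cpu = path.parent.parent.name        # e.g. "cpu0"
        khz = int(path.read_text().strip())  # sysfs value is in kHz
        freqs[cpu] = khz / 1000.0
    return freqs
```

On a host in the state described above, every entry in the returned mapping would be expected to sit near 200 rather than at the nominal clock speed.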

Event Timeline

lead has been removed from puppet site.pp, i revoked the puppet cert and salt key.

i used "puppet node clean" which revoked the cert and also claimed it removed storedconfigs, but lead is still showing up in Icinga as of now. i did disable notifications.

lead does not get any traffic, since gerrit.wm.org definitely points to cobalt, so it's fine to shut it down or do whatever else is needed.

Since the migration to PuppetDB, we've had to slightly alter the "puppet node clean" procedure: one more command has been added.

puppet node deactivate <fqdn>

See https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS

It's unfortunate that we still need both, but for now that's it :-(. I've executed the command for lead.
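
As a rough illustration of the two-step cleanup described above, the sketch below only builds and prints the two commands rather than running them. The helper names `decommission_commands` and `as_shell` are hypothetical; the two `puppet node` subcommands are the ones named in this thread, and the authoritative procedure remains the Server_Lifecycle page linked above.

```python
import shlex


def decommission_commands(fqdn):
    """Return the two puppet commands mentioned in this task, as argv lists."""
    return [
        ["puppet", "node", "clean", fqdn],       # revokes the cert, cleans stored data
        ["puppet", "node", "deactivate", fqdn],  # extra step needed since PuppetDB
    ]


def as_shell(fqdn):
    """Render the commands as shell-quoted strings for review (dry run only)."""
    return [shlex.join(argv) for argv in decommission_commands(fqdn)]


if __name__ == "__main__":
    for line in as_shell("lead.wikimedia.org"):
        print(line)
```

Keeping the commands as argv lists means they could later be handed to `subprocess.run` on the puppetmaster without any shell quoting concerns.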

lead.wikimedia.org is still up and running, with puppet disabled and a stern warning not to re-enable it. We left it up Just In Case we found some issue that we need to go back to its data for. I'm guessing we're past that point and we should re-role it to spare or something and turn puppet back on before going for a hardware fix?

Yes, i'll talk to Chad about it next week when they are back from their offsite.

Mentioned in SAL (#wikimedia-operations) [2016-10-26T01:36:39Z] <mutante> lead - (formerly gerrit) - shutdown -h now (T147905)

Change 318033 had a related patch set uploaded (by Dzahn):
remove lead.wikimedia.org, keep lead.mgmt.eqiad

https://gerrit.wikimedia.org/r/318033

Change 318035 had a related patch set uploaded (by Dzahn):
decom lead (ex-gerrit)

https://gerrit.wikimedia.org/r/318035

Change 318033 merged by Dzahn:
remove lead.wikimedia.org, keep lead.mgmt.eqiad

https://gerrit.wikimedia.org/r/318033

@RobH @Cmjohnson lead has been removed from puppet/install/DNS and shut down. mgmt DNS has been kept. I am now moving the ticket to ops-eqiad to follow up on it with the vendor. After that we can return it to spares, reinstall it with a new name, repurpose it, or decom it forever, depending on what makes sense in this case.

Added lead to spares list (google tracking sheet) (@RobH )

I'm re-opening this task, as there were CPU frequency issues on this host that were never resolved. This came up today, as mira had similar frequency issues that were resolved by a reboot.

This machine should likely be reinstalled and then put under some kind of load to see if it still has the issue.
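
One possible way to do such a load test is sketched below. This is an assumption-laden example, not an agreed procedure: the function name `run_under_load`, its parameters, and the idea of using busy-looping Python child processes as the load are all made up for illustration. The caller supplies a `sample` callable, for example one that reads a core's current frequency from sysfs.

```python
import os
import subprocess
import sys
import time


def run_under_load(sample, duration_s=60.0, interval_s=5.0, workers=None):
    """Spin up busy-loop processes, call sample() periodically, return readings.

    sample     -- zero-argument callable returning a reading (hypothetical hook)
    duration_s -- how long to keep the load running
    interval_s -- pause between readings
    workers    -- number of busy processes; defaults to one per CPU
    """
    workers = workers or os.cpu_count() or 1
    busy = [sys.executable, "-c", "while True: pass"]
    procs = [subprocess.Popen(busy) for _ in range(workers)]
    readings = []
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            readings.append(sample())
            time.sleep(interval_s)
    finally:
        # Always stop the load generators, even if sample() raises.
        for proc in procs:
            proc.terminate()
        for proc in procs:
            proc.wait()
    return readings
```

If the readings stay far below the nominal clock while all cores are busy, that would suggest the throttling is still present.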

Mentioned in SAL (#wikimedia-operations) [2017-03-07T19:08:46Z] <bblack> rebooting baham (ns1) - low cpu frequencies issues like T147905

Mentioned in SAL (#wikimedia-operations) [2017-03-07T19:20:30Z] <bblack> rebooting baham (ns1) AGAIN - low cpu frequencies issues like T147905 - checking bios/idrac stuff

What am I supposed to be doing with this task?

This sounds similar to the other tickets linked to "tracking" task T162850. We have observed downthrottling to 200MHz on other servers before. The interesting part is that they were all R320s while lead is an R420 (or racktables says so, is it really?). That would be unfortunate if it means more models are potentially affected by this bug.

What we have done is blacklist the acpi_pad kernel module and add Icinga monitoring for CPU frequency under 600MHz. But... so far only for R320s.
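
For illustration, a frequency check of that kind could look roughly like the sketch below. This is not the actual Icinga check used in production: the function name `check_cpu_freq` and its input format (a mapping of core name to MHz) are assumptions. It follows the usual Nagios/Icinga plugin convention of exit code 0 for OK, 2 for CRITICAL and 3 for UNKNOWN.

```python
def check_cpu_freq(freqs_mhz, critical_below_mhz=600.0):
    """Return (exit_code, message) in Nagios/Icinga plugin style.

    freqs_mhz -- mapping of core name to current frequency in MHz
                 (hypothetical input format chosen for this sketch)
    """
    if not freqs_mhz:
        return 3, "UNKNOWN: no CPU frequency data"
    slow = {cpu: mhz for cpu, mhz in freqs_mhz.items()
            if mhz < critical_below_mhz}
    if slow:
        detail = ", ".join(f"{cpu}={mhz:.0f}MHz"
                           for cpu, mhz in sorted(slow.items()))
        return 2, (f"CRITICAL: {len(slow)} core(s) below "
                   f"{critical_below_mhz:.0f}MHz: {detail}")
    return 0, (f"OK: all {len(freqs_mhz)} core(s) at or above "
               f"{critical_below_mhz:.0f}MHz")
```

A wrapper script would print the message and call `sys.exit` with the returned code so that the monitoring system picks up the state.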

There is probably not much that we can do besides bringing it back up, running it, and seeing if it ever happens again.

Heh, I was hoping T162850 would have solved it. It's a bit concerning that an R420 (it is indeed an R420) has possibly exhibited the same symptoms. The box will be 3 years old next Monday (May 1st). Apart from looking at the thermal paste and trying to figure out if it has enough, I honestly can't think of a good enough way to get a "YES" or "NO" that does not involve powering the box and observing it (and potentially doing the stuff in T162850).

Heh, I was hoping T162850 would have solved it. It's a bit concerning that an R420 (it is indeed an R420) has possibly exhibited the same symptoms. The box will be 3 years old next Monday (May 1st). Apart from looking at the thermal paste and trying to figure out if it has enough, I honestly can't think of a good enough way to get a "YES" or "NO" that does not involve powering the box and observing it (and potentially doing the stuff in T162850).

Eh, i didn't mean to quote all that, i just wanted to change T162580 to T162850 (the latter is the right one)

Thanks, I've updated my comment.