investigate lead hardware issue
Open, Normal, Public

Description

Due to issues on system lead, hardware request task T147596 was created.

Lead experienced hardware issues, as documented on https://etherpad.wikimedia.org/p/gerrit-outage-20161006. The contents of that pad have been copied below:

Responders: Chad, Alex, Brandon, Daniel

UTC on Oct 5
23:06 <     yurik> gerrit is really unhappy today :( 
23:22 <+   greg-g> yurik: first I heard, can you say more? otherwise I'll just have to ignore the comment
23:52 <   Krenair> seems fine to me yurik
23:52 <     yurik> greg-g, Krenair, sorry, just saw your replies. For some reason it took ~4-5 min for "git review" to go throuw
23:52 <     yurik> though
23:53 <     yurik> might have been just my connection, but IRC and other sites seem to be okayish
23:54 <     yurik> actually just checked - git pull takes considerable time, even though gerrit.wikimedia.org opens pretty fast 

* Starting at 17:49UTC on 6 Oct gerrit started becoming unresponsive. CPU usage was through the roof
** Puppet halted, default error page for Gerrit being shown

* Other symptoms:
** Apache using far too much CPU
** IO appears fine
** Too much sys cpu?
** Network traffic seems normal

* Restarting gerrit & apache had little effect
** Gerrit suffering from unacceptably slow startup times (logging module?)
* Rebooted server once, did not help
** No hardware errors appeared on reboot
* Rebooting a second time with older kernel
** Also did not seem to help


[22:59:10]        <bblack> the cpu cores are all running at like 200mhz right now
[23:00:12]        <bblack> root@lead:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq 
[23:00:14]        <bblack> 185253
[23:00:22]        <bblack> yeah they're all running at ~200Mhz
-- end etherpad --
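
The cpufreq files under /sys report values in kHz, so the 185253 read above is roughly 185 MHz. A minimal sketch of checking every core the same way follows. The `min_ratio` threshold and the fallback to `scaling_cur_freq` (used when `cpuinfo_cur_freq` is not readable) are assumptions made for illustration, not part of the original investigation. An idle core can legitimately report a low frequency, so a hit only matters when the host is under load.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # standard Linux sysfs location


def read_khz(path):
    """Return the integer kHz value stored in a cpufreq sysfs file, or None."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def throttled_cores(root=CPU_ROOT, min_ratio=0.5):
    """Map core name -> (current MHz, max MHz) for cores below min_ratio of max.

    min_ratio is an arbitrary illustrative threshold, not a value from the incident.
    """
    slow = {}
    for cpufreq in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        cur = read_khz(cpufreq / "cpuinfo_cur_freq")
        if cur is None:
            # cpuinfo_cur_freq is often root-only; fall back to the governor's view.
            cur = read_khz(cpufreq / "scaling_cur_freq")
        top = read_khz(cpufreq / "cpuinfo_max_freq")
        if cur is None or not top:
            continue
        if cur < min_ratio * top:
            slow[cpufreq.parent.name] = (cur / 1000, top / 1000)
    return slow


if __name__ == "__main__":
    for core, (cur_mhz, max_mhz) in throttled_cores().items():
        print(f"{core}: {cur_mhz:.0f} MHz (max {max_mhz:.0f} MHz)")
```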
RobH created this task. Oct 12 2016, 12:19 AM
Restricted Application added subscribers: Southparkfan, Aklapper. Oct 12 2016, 12:19 AM

lead has been removed from puppet site.pp, i revoked the puppet cert and salt key.

i used "puppet node clean" which revoked the cert and also claimed it removed storedconfigs, but lead is still showing up in Icinga as of now. i did disable notifications.

lead does not get any traffic, since gerrit.wm.org definitely points to cobalt, so it's fine to shut it down or whatever is needed though

lead has been removed from puppet site.pp, i revoked the puppet cert and salt key.

i used "puppet node clean" which revoked the cert and also claimed it removed storedconfigs, but lead is still showing up in Icinga as of now. i did disable notifications.

Since the migration to PuppetDB, we've had to slightly alter that procedure: one more command has been added.

puppet node deactivate <fqdn>

See https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS

It's unfortunate that we still need both, but for now that's it :-(. I've executed the command for lead.
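
A minimal sketch of the two-step removal described in this comment, written as a dry run that only prints the commands for review. The helper name `decom_commands` is hypothetical; the only commands it emits are the two named in this task, `puppet node clean` and `puppet node deactivate`.

```python
import shlex


def decom_commands(fqdn):
    """Return the two puppet commands named in this task, as argv lists."""
    if not fqdn or any(ch.isspace() for ch in fqdn):
        raise ValueError(f"not a usable FQDN: {fqdn!r}")
    return [
        # Per the comment above: revokes the cert and reports removing storedconfigs.
        ["puppet", "node", "clean", fqdn],
        # Extra step needed since the migration to PuppetDB.
        ["puppet", "node", "deactivate", fqdn],
    ]


if __name__ == "__main__":
    # Dry run only: print the commands instead of executing them.
    for argv in decom_commands("lead.wikimedia.org"):
        print(shlex.join(argv))
```

Keeping both commands in one list makes it harder to run the first and forget the second, which is how a host can linger in Icinga after a cleanup.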

lead.wikimedia.org is still up and running, with puppet disabled and a stern warning not to re-enable it. We left it up Just In Case we found some issue that we need to go back to its data for. I'm guessing we're past that point and we should re-role it to spare or something and turn puppet back on before going for a hardware fix?

Dzahn claimed this task. Oct 21 2016, 12:38 AM

Yes, i'll talk to Chad about it next week when they are back from their offsite.

Mentioned in SAL (#wikimedia-operations) [2016-10-26T01:36:39Z] <mutante> lead - (formerly gerrit) - shutdown -h now (T147905)

Change 318033 had a related patch set uploaded (by Dzahn):
remove lead.wikimedia.org, keep lead.mgmt.eqiad

https://gerrit.wikimedia.org/r/318033

Change 318035 had a related patch set uploaded (by Dzahn):
decom lead (ex-gerrit)

https://gerrit.wikimedia.org/r/318035

Change 318035 merged by Dzahn:
decom lead (ex-gerrit)

https://gerrit.wikimedia.org/r/318035

Change 318033 merged by Dzahn:
remove lead.wikimedia.org, keep lead.mgmt.eqiad

https://gerrit.wikimedia.org/r/318033

Dzahn added a subscriber: Cmjohnson. Edited Oct 26 2016, 7:35 PM

@RobH @Cmjohnson lead has been removed from puppet/install/DNS and shut down. mgmt DNS has been kept. I am now moving the ticket to ops-eqiad to follow up on it with the vendor. After that we can return it to spares/reinstall with a new name/repurpose/decom forever, depending on what makes sense in this case.

Dzahn reassigned this task from Dzahn to Cmjohnson.
Cmjohnson closed this task as "Resolved". Dec 2 2016, 5:15 PM

Added lead to spares list (google tracking sheet) (@RobH)

RobH reopened this task as "Open". Jan 9 2017, 6:51 PM

I'm re-opening this task, as there were CPU frequency issues on this host that were never resolved. This came up today, as mira had similar frequency issues that were resolved by a reboot.

This machine should likely be reinstalled and then put under some kind of load to see if it still has the issue.
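
One possible way to run such a load test is sketched below: busy-loop on every core for a few seconds and sample the reported frequencies while the load is running. This is an illustration only. The function names, the durations, and the use of `scaling_cur_freq` are assumptions, not an agreed test plan for lead.

```python
import multiprocessing
import time
from pathlib import Path


def _burn(seconds):
    """Busy-loop on one core for roughly `seconds` seconds."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass


def default_reader():
    """Return current per-core frequencies in kHz from sysfs (empty if unavailable)."""
    values = []
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for f in Path("/sys/devices/system/cpu").glob(pattern):
        try:
            values.append(int(f.read_text().strip()))
        except (OSError, ValueError):
            pass
    return values


def sample_under_load(seconds=5.0, interval=1.0, workers=None, reader=default_reader):
    """Load the CPUs and return the frequency samples taken while they are busy."""
    if workers is None:
        workers = multiprocessing.cpu_count()
    # Burners run one interval longer so the final sample is still taken under load.
    procs = [
        multiprocessing.Process(target=_burn, args=(seconds + interval,))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    samples = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        time.sleep(interval)
        samples.append(reader())
    for p in procs:
        p.join()
    return samples


if __name__ == "__main__":
    for i, khz in enumerate(sample_under_load()):
        if khz:
            print(f"sample {i}: min {min(khz) / 1000:.0f} MHz, max {max(khz) / 1000:.0f} MHz")
```

If the frequencies stay near a few hundred MHz while every core is busy, the problem seen in October is still present. The `reader` parameter exists so the sampling logic can be exercised without real sysfs access.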

greg added a subscriber: greg. Jan 9 2017, 6:52 PM

Mentioned in SAL (#wikimedia-operations) [2017-03-07T19:08:46Z] <bblack> rebooting baham (ns1) - low cpu frequencies issues like T147905

Mentioned in SAL (#wikimedia-operations) [2017-03-07T19:20:30Z] <bblack> rebooting baham (ns1) AGAIN - low cpu frequencies issues like T147905 - checking bios/idrac stuff