scs-c1-eqiad CPU usage over 85%
Open, High, Public

Description

Since 01:08 UTC, scs-c1-eqiad has been reporting 100% CPU usage, as can be seen in LibreNMS: https://librenms.wikimedia.org/device/device=158/tab=health/metric=processor/

CPU usage suddenly jumped from about 15% on average to 100%.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-11-12T16:21:02Z] <XioNoX> reboot scs-c1-eqiad.mgmt.eqiad.wmnet - T238036

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
1855 root      20   0  4576 2084  876 S  2.7  0.8  26623:27 portmanager

Did a kill -9 on portmanager just in case, but it didn't change anything (the process restarted with the same 2% CPU load).
Then killed snmpd, which lowered the CPU usage for a bit, but it then went back up.
Trying a reboot.

ayounsi claimed this task.

CPU is back to normal.

ayounsi removed ayounsi as the assignee of this task.
ayounsi triaged this task as High priority.
ayounsi edited projects, added DC-Ops; removed netops.

This has been alerting for a few days. It might be worth following up with the vendor instead of rebooting the console servers every few months.

Cmjohnson added subscribers: RobH, Cmjohnson.

@ayounsi I am not sure there is a vendor to follow up with on this. Checking with @RobH.

Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:12:46Z] <robh> updating firmware on scs-c1-eqiad via T238036

Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:14:52Z] <robh> updating firmware on scs-c1-eqiad via T238036

scs-a1-eqiad firmware was 3.16.6u4, newest stable at this time is 4.9.0u1, updating

Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:20:14Z] <robh> scs-c1-eqiad firmware update complete and back online T238036

RobH removed RobH as the assignee of this task.

Firmware updated to the newest version. If it happens again, we can reopen and investigate with OpenGear.

So the firmware on scs-c1-codfw is 4.5.0, current release is 4.9.0, upgrading now.

Mentioned in SAL (#wikimedia-operations) [2020-10-13T18:01:47Z] <robh> scs-c1-codfw firmware update via T238036

Mentioned in SAL (#wikimedia-operations) [2020-10-13T18:09:42Z] <robh> scs-c1-codfw mgmt firmware updated, updating scs-a1-codfw T238036

I've successfully upgraded the scs firmware fleetwide, with the exception of two devices:

  • future-scs-a8-eqiad - new CM7148, cannot upgrade until it's racked in place of the older model it's replacing.
RobH claimed this task.

If this happens again on any scs device, other than scs-a8-eqiad, it means the firmware update to 4.9.0u1 (fleetwide) doesn't fix the CPU spike issue.
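To check which devices are still below the target firmware, the version strings quoted in this task can be compared numerically rather than as plain text. This is only a rough sketch: the function names are hypothetical, and treating the trailing `uN` as an update counter (with a missing suffix counting as 0) is an assumption about the version scheme, not something confirmed here.

```python
import re


def parse_firmware(version: str) -> tuple[int, int, int, int]:
    """Parse 'X.Y.Z' or 'X.Y.ZuN' into a tuple that compares numerically.

    Assumption: a missing 'uN' suffix is treated as update 0.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:u(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware version: {version!r}")
    major, minor, patch, update = match.groups()
    return int(major), int(minor), int(patch), int(update or 0)


def needs_upgrade(current: str, target: str = "4.9.0u1") -> bool:
    """Return True if `current` is older than `target`."""
    return parse_firmware(current) < parse_firmware(target)


if __name__ == "__main__":
    # Version strings taken from the comments in this task.
    for version in ("3.16.6u4", "4.5.0", "4.9.0u1"):
        print(version, "needs upgrade:", needs_upgrade(version))
```

Comparing tuples of integers avoids the lexicographic trap where a string like "4.10.0" would sort before "4.9.0".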

So far, whenever a firmware update has been applied to an scs, that scs hasn't spiked again, but the spikes are not exactly regular or easily reproduced, so who knows.

RobH removed RobH as the assignee of this task. · Oct 14 2020, 4:25 PM