
scs-c1-eqiad CPU usage over 85%
Closed, Resolved · Public

Description

Since 01:08 UTC, scs-c1-eqiad has been reporting 100% CPU usage, as can be seen in LibreNMS: https://librenms.wikimedia.org/device/device=158/tab=health/metric=processor/

CPU usage suddenly jumped from about 15% on average to 100%.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-11-12T16:21:02Z] <XioNoX> reboot scs-c1-eqiad.mgmt.eqiad.wmnet - T238036

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
1855 root      20   0  4576 2084  876 S  2.7  0.8  26623:27 portmanager

Did a kill -9 on portmanager just in case, but it didn't change anything (the process restarted with the same 2% CPU load).
Then killed snmpd, which lowered the CPU usage for a bit, but it then went back up.
Trying a reboot.

ayounsi claimed this task.

CPU is back to normal.

ayounsi removed ayounsi as the assignee of this task.
ayounsi triaged this task as High priority.
ayounsi edited projects, added DC-Ops; removed netops.

This has been alerting for a few days. It might be worth following up with the vendor instead of rebooting the console servers every few months.

Cmjohnson added subscribers: RobH, Cmjohnson.

@ayounsi I am not sure if there is a vendor to follow up with on this. Checking with @RobH.

Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:12:46Z] <robh> updating firmware on scs-c1-eqiad via T238036

Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:14:52Z] <robh> updating firmware on scs-c1-eqiad via T238036

scs-a1-eqiad firmware was 3.16.6u4, newest stable at this time is 4.9.0u1, updating

Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:20:14Z] <robh> scs-c1-eqiad firmware update complete and back online T238036

RobH removed RobH as the assignee of this task.

Firmware updated to the newest version. If it happens again, we can reopen and investigate with OpenGear.

So the firmware on scs-c1-codfw is 4.5.0, current release is 4.9.0, upgrading now.
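
The version checks quoted in this task (3.16.6u4 and 4.5.0 against 4.9.0u1) can be sketched in Python as below. This is only an illustration, not part of any Wikimedia or Opengear tooling: the function names are hypothetical, and it assumes the trailing `uN` is an update counter that sorts after the base release.

```python
import re


def parse_firmware(version: str) -> tuple:
    """Split a string such as '4.9.0u1' into (major, minor, patch, update).

    Assumption: a missing 'uN' suffix means update 0.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:u(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised firmware version: {version!r}")
    major, minor, patch, update = match.groups()
    return int(major), int(minor), int(patch), int(update or 0)


def needs_upgrade(installed: str, target: str) -> bool:
    """Return True when the installed firmware is older than the target."""
    return parse_firmware(installed) < parse_firmware(target)
```

Comparing the parsed tuples rather than the raw strings matters because a plain string comparison would sort "4.11.0" before "4.9.0".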

Mentioned in SAL (#wikimedia-operations) [2020-10-13T18:01:47Z] <robh> scs-c1-codfw firmware update via T238036

Mentioned in SAL (#wikimedia-operations) [2020-10-13T18:09:42Z] <robh> scs-c1-codfw mgmt firmware updated, updating scs-a1-codfw T238036

I've successfully upgraded the scs firmware fleetwide, with the exception of two devices:

  • future-scs-a8-eqiad - new CM7148, cannot upgrade until it's racked in place of the older model it's replacing.
RobH claimed this task.

If this happens again on any scs device, other than scs-a8-eqiad, it means the firmware update to 4.9.0u1 (fleetwide) doesn't fix the CPU spike issue.

So far when a firmware update is applied to any scs, that scs hasn't spiked again, BUT the spikes are not exactly regular or easily reproduced so who knows.

RobH removed RobH as the assignee of this task. (Oct 14 2020, 4:25 PM)

I am not sure what needs to be done with this task. There really isn't anything actionable other than to replace the scs with something else.

Next step is to open a ticket with the vendor if possible.

https://netbox.wikimedia.org/dcim/devices/1955/ was purchased on 2017-10-01 and has a 4-year warranty, expiring on 2021-10-01.
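
As a quick sanity check of the warranty arithmetic above, here is a minimal Python sketch. The function names are hypothetical and nothing here queries Netbox; the purchase date and term are simply the values quoted in this task.

```python
from datetime import date


def warranty_expiry(purchased: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return purchased.replace(year=purchased.year + years)
    except ValueError:
        # Only reachable for a Feb 29 purchase when the target year is not a leap year.
        return purchased.replace(year=purchased.year + years, day=28)


def days_left(expiry: date, today: date) -> int:
    """Days remaining until expiry; negative once the warranty has lapsed."""
    return (expiry - today).days


expiry = warranty_expiry(date(2017, 10, 1), 4)
```

Calling `days_left(expiry, date.today())` gives the time remaining to open a vendor ticket under warranty.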

https://opengear.com/support/contact-tech-support

A support ticket can be opened without calling, to provide details. I'll leave the ticket opening to either @Cmjohnson or @Jclark-ctr, but they should do so before the warranty expires!

A ticket has been submitted:

Your request (#82025) has been received, and is being reviewed by our support staff.

For questions concerning Opengear's Console Server products, please submit a support report if you have not already done so.

To generate a support report, go to the web interface of the Opengear Console Server. In the left hand menu groups go to the Status group and click on Support Report. Copy and paste the entire contents into a .txt file and attach to a reply to this e-mail.

To review the status of the request and add additional comments, follow the link below:
http://opengear.zendesk.com/hc/requests/82025

Sent the requested report to their tech support team.

Opengear's response was for me to update the firmware. There appears to be a newer version than the one RobH had installed.
The newest version is cm71xx-4.11.0.flash and has been installed. I will resolve this task for now; if the error returns, please open a task.