Since 01:08 UTC, scs-c1-eqiad has been reporting 100% CPU usage, as can be seen in LibreNMS: https://librenms.wikimedia.org/device/device=158/tab=health/metric=processor/
CPU usage suddenly jumped from about 15% on average to 100%.
Mentioned in SAL (#wikimedia-operations) [2019-11-12T16:21:02Z] <XioNoX> reboot scs-c1-eqiad.mgmt.eqiad.wmnet - T238036
 PID USER PR NI VIRT  RES SHR S %CPU %MEM    TIME+ COMMAND
1855 root 20  0 4576 2084 876 S  2.7  0.8 26623:27 portmanager
Did a kill -9 on portmanager just in case, but it didn't change anything (the process restarted with the same 2% CPU load).
Then killed snmpd, which lowered the CPU usage for a bit, but it then went back up.
Trying a reboot.
This has been alerting for a few days. It might be worth following up with the vendor instead of rebooting the console servers every few months.
Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:12:46Z] <robh> updating firmware on scs-c1-eqiad via T238036
Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:14:52Z] <robh> updating firmware on scs-c1-eqiad via T238036
Mentioned in SAL (#wikimedia-operations) [2020-09-02T19:20:14Z] <robh> scs-c1-eqiad firmware update complete and back online T238036
Firmware updated to the newest version. If it happens again, we can reopen and investigate with OpenGear.
This has been alerting again, this time for scs-c1-codfw. See https://librenms.wikimedia.org/graphs/device=170/type=device_processor/from=1601991300/legend=yes/popup_title=CPU+Usage/to=1602596100/
Re-opening to keep context.
Mentioned in SAL (#wikimedia-operations) [2020-10-13T18:01:47Z] <robh> scs-c1-codfw firmware update via T238036
Mentioned in SAL (#wikimedia-operations) [2020-10-13T18:09:42Z] <robh> scs-c1-codfw mgmt firmware updated, updating scs-a1-codfw T238036
I've successfully upgraded the scs firmware fleetwide, with the exception of two devices:
If this happens again on any scs device, other than scs-a8-eqiad, it means the firmware update to 4.9.0u1 (fleetwide) doesn't fix the CPU spike issue.
So far, no scs that has received a firmware update has spiked again, BUT the spikes are not exactly regular or easily reproduced, so who knows.
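Since the spikes are irregular, one way to tell a real recurrence from a one-off blip is to look for several consecutive high readings. The following is only a rough sketch: the function name, the 90% threshold, the window of 3 samples, and the sample values are all hypothetical and are not taken from our LibreNMS alert rules.

```python
from typing import Optional, Sequence


def first_sustained_spike(
    samples: Sequence[float],
    threshold: float = 90.0,  # hypothetical cutoff, not the LibreNMS alert value
    window: int = 3,          # hypothetical number of consecutive high samples
) -> Optional[int]:
    """Return the index where the first run of `window` consecutive
    samples at or above `threshold` begins, or None if there is no such run."""
    run = 0
    for i, value in enumerate(samples):
        # Extend the current run on a high reading, otherwise reset it.
        run = run + 1 if value >= threshold else 0
        if run == window:
            return i - window + 1
    return None


# Made-up CPU percentages: ~15% baseline, then pinned at 100%.
cpu = [14.0, 16.0, 15.0, 15.0, 100.0, 100.0, 100.0, 100.0]
print(first_sustained_spike(cpu))
```

A single high sample followed by a return to baseline resets the run, so it would not count as a recurrence under these assumptions.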
This is alerting again: https://librenms.wikimedia.org/device/device=158/tab=health/metric=processor/
I am not sure what needs to be done with this task. There really isn't anything actionable other than to replace the scs with something else.
https://netbox.wikimedia.org/dcim/devices/1955/ was purchased on 2017-10-01, and has a 4 year warranty, expiring on 2021-10-01.
https://opengear.com/support/contact-tech-support
A support ticket can be opened without calling, to provide details. I'll leave the ticket opening to either @Cmjohnson or @Jclark-ctr, but they should do so before the warranty expires!
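For reference, the warranty arithmetic above (purchase date plus 4 years) can be checked with a few lines of Python. This is just a sketch: the helper names are made up for illustration, and the "today" value in the example is an arbitrary date, not when this comment was written.

```python
from datetime import date


def warranty_expiry(purchased: date, years: int) -> date:
    """Return the date `years` after `purchased`."""
    try:
        return purchased.replace(year=purchased.year + years)
    except ValueError:
        # Purchased on Feb 29 and the target year is not a leap year:
        # fall back to Feb 28.
        return purchased.replace(year=purchased.year + years, day=28)


def days_remaining(expiry: date, today: date) -> int:
    """Days left until `expiry`; negative once the warranty has lapsed."""
    return (expiry - today).days


expiry = warranty_expiry(date(2017, 10, 1), 4)
print(expiry)                                    # 2021-10-01
print(days_remaining(expiry, date(2021, 9, 1)))  # 30 (arbitrary example date)
```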
A ticket has been submitted
Your request (#82025) has been received, and is being reviewed by our support staff.
For questions concerning Opengear's Console Server products, please submit a support report if you have not already done so.
To generate a support report, go to the web interface of the Opengear Console Server. In the left hand menu groups go to the Status group and click on Support Report. Copy and paste the entire contents into a .txt file and attach to a reply to this e-mail.
To review the status of the request and add additional comments, follow the link below:
http://opengear.zendesk.com/hc/requests/82025
Opengear's response was for me to update the firmware. Their recommended version appears to be newer than the one Robh had installed.
The newest version is cm71xx-4.11.0.flash and has been installed. I will resolve this task for now; if the error returns, please open a task.