db1114 (stretch, s1 API slave) has been suffering from mysterious connection errors, observed on logstash and grafana (T191996). The investigation identified the following facts:
- Connection errors every 10 minutes (to the second) shown on logstash: T191996#4129039
- Packets being dropped: T191996#4129334
- Multiple cores going to 100% usage: T191996#4139429
After a lot of investigation on both the HW side and the MySQL side, atop was found to be causing the issue: the times at which the CPUs were overloaded matched the times of the errors (T191996#4139494).
Stopping atop made the errors disappear. Starting it again brought them back, and stopping it again made them go away, so the correlation was clear.
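The "every 10 minutes, to the second" pattern is what pointed towards a periodic job. A minimal sketch of how such a periodicity could be checked against a list of error timestamps follows. The function name, the tolerance and the timestamps are hypothetical and are not taken from the logstash data.

```python
from datetime import datetime


def gaps_match_interval(timestamps, interval_seconds=600, tolerance_seconds=1):
    """Return True if every gap between consecutive timestamps is a
    multiple of interval_seconds, within tolerance_seconds."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    for earlier, later in zip(times, times[1:]):
        gap = (later - earlier).total_seconds()
        remainder = gap % interval_seconds
        # Distance to the nearest multiple of the interval.
        if min(remainder, interval_seconds - remainder) > tolerance_seconds:
            return False
    return True


# Hypothetical error timestamps, for illustration only.
errors = ["2018-04-18 10:00:00", "2018-04-18 10:10:00", "2018-04-18 10:30:01"]
print(gaps_match_interval(errors))
```

A missed sample (a 20-minute gap) still counts as matching, since only multiples of the interval are checked.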
db1114 is an s1 API slave along with db1060 and db1080. Errors were _only_ happening on db1114.
db1114: stretch - atop version 2.2.6
db1060: jessie - atop version 1.26
db1080: jessie - atop version 1.26
This is how atop runs (T191996#4139494):
/usr/bin/atop -a -R -w /var/log/atop/atop_20180418 600
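The trailing 600 in that command is the sampling interval in seconds, which lines up with the 10-minute error pattern. As far as the atop man page describes it, -R makes atop gather the proportional memory size of each process, which is noted there as time-consuming. A minimal sketch for pulling the flags and interval out of such a command line, for example to audit other hosts, could look like this. The helper name and the returned dictionary layout are hypothetical, and the parsing only covers the options seen in the command above.

```python
import shlex


def parse_atop_cmdline(cmdline):
    """Split an atop command line into flags, raw file and interval.

    Only handles the shape seen above: single-letter flags, -w with a
    file argument, and a trailing interval in seconds.
    """
    tokens = shlex.split(cmdline)[1:]  # drop the binary path
    flags, rawfile, positionals = set(), None, []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-w" and i + 1 < len(tokens):
            rawfile = tokens[i + 1]  # -w takes the raw log file as argument
            i += 2
            continue
        if tok.startswith("-"):
            flags.add(tok)
        else:
            positionals.append(tok)
        i += 1
    interval = int(positionals[0]) if positionals else None
    return {"flags": flags, "rawfile": rawfile, "interval_seconds": interval}


info = parse_atop_cmdline("/usr/bin/atop -a -R -w /var/log/atop/atop_20180418 600")
print(info["interval_seconds"], "-R" in info["flags"])
```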
We don't have any other slaves running stretch with the same amount of load as db1114, so we have not observed elsewhere the connection errors that triggered the investigation on db1114.
This might be affecting other stretch hosts, so it is worth a proper, more general investigation across the fleet.
For now, atop on db1114 has been stopped.
The full investigation is documented on T191996.
Proposal: either drop the package from all production hosts or reconfigure it to run without -R.