Since 2019-03-26 we've seen an increased rate of kubelet operational latency alerts from Kubernetes. These usually recover quickly, but in a number of cases, especially lately, they seem to be flapping a lot. Rough numbers indicate 525 individual alerts from that date to 2019-04-12, most of which are informational: indicative of an issue, but not directly actionable. We should re-evaluate how we currently alert on this and implement better alerts.
For posterity's sake: those alerts were introduced on 2017-12-11, in the spirit of "We have no experience with this, we don't know what exactly to alert on, so let's monitor and alert on all latency increases and improve from there".
The thresholds have been bumped a number of times since then, namely in 50fc9afe2489a4 and bacbc62d909, but more in a reactive manner than as part of a re-evaluation. T219696 was also opened and has since been resolved, as the root cause was identified (the latter git commit above was the resolution).