We have been working on T372878, in which Kubernetes nodes have been changing IP addresses. Each such change triggers updates to ferm rules across the nodes of our cluster.
While looking into T374025, we noted that many of the memcached errors observed by MediaWiki occurred during Puppet runs that included changes to ferm rules.
After discussing with @JMeybohm, we concluded that there is a brief window of time during which the following happens:
- Ferm recreates all iptables rules
- Calico realises that the current iptables rules are not what it expects
- Calico re-applies the missing rules
- Errors stop
Calico logs from such an occurrence can be found here: https://logstash.wikimedia.org/goto/95a27dfbc3a90a960c53a20d4ade76bf
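One way to test this hypothesis would be to check how many of the observed errors fall between the ferm reload and the moment Calico re-applies its rules. The sketch below is only an illustration: the function name `errors_in_window` and all timestamps are made up, and in practice the two boundary timestamps would have to be read from the Puppet and Calico logs.

```python
from datetime import datetime, timedelta


def errors_in_window(error_times, ferm_reload, calico_resync, slack=timedelta(0)):
    """Return the error timestamps that fall between the ferm reload and
    Calico re-applying its rules (hypothetical helper, not an existing tool).

    `slack` widens the window on both sides to allow for clock skew.
    """
    start = ferm_reload - slack
    end = calico_resync + slack
    return [t for t in error_times if start <= t <= end]


# Made-up timestamps, for illustration only.
ferm_reload = datetime(2024, 9, 4, 10, 0, 0)
calico_resync = datetime(2024, 9, 4, 10, 0, 8)
errors = [
    datetime(2024, 9, 4, 9, 59, 30),  # before the ferm reload
    datetime(2024, 9, 4, 10, 0, 2),   # inside the window
    datetime(2024, 9, 4, 10, 0, 7),   # inside the window
    datetime(2024, 9, 4, 10, 1, 0),   # after Calico re-applied its rules
]

hits = errors_in_window(errors, ferm_reload, calico_resync)
print(f"{len(hits)} of {len(errors)} errors fall inside the window")
```

If most errors on a node land inside such windows, that would support the explanation above; errors outside them would point to another cause.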
I assume that similar "connectivity" errors may be observed from other applications running on k8s.
Part of the problem would probably go away with T365687, but not all of it.