We had an outage that is most likely related to mwscript creating a large number of k8s resources. From the last two days there are:
root@deploy1003:~# helm -n mw-script list |wc -l
257
root@deploy1003:~# kubectl -n mw-script get secret |wc -l
1883
root@deploy1003:~# kubectl -n mw-script get networkpolicies. |wc -l
1882
root@deploy1003:~# kubectl -n mw-script get jobs |wc -l
1877
root@deploy1003:~# kubectl -n mw-script get pods |wc -l
1875
root@deploy1003:~# kubectl -n mw-script get configmaps |wc -l
13170
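Note that each of these counts includes the header line printed by kubectl and helm, so the real object counts are one lower. The helm figure may additionally be capped, since helm list fetches at most 256 releases by default (its --max option). A minimal sketch of a header-free count follows; the function name count_resources is hypothetical, and it assumes kubectl is already configured for the cluster:

```shell
# Hypothetical helper: count objects per resource type in one namespace.
# Usage: count_resources NAMESPACE KIND [KIND...]
count_resources() {
    ns="$1"
    shift
    for kind in "$@"; do
        # --no-headers drops the column header that wc -l would otherwise count
        n=$(kubectl -n "$ns" get "$kind" --no-headers | wc -l)
        # $((n)) strips any padding that some wc implementations add
        printf '%s\t%s\n' "$kind" "$((n))"
    done
}
```

For example, `count_resources mw-script secret networkpolicies. jobs pods configmaps` would print one tab-separated line per resource type, using the same resource names as the commands above.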
We suspect that:
- The high number of networkpolicies led to a calico outage
- The high number of secret objects led to helm-state-metrics being OOM killed (LIST /secrets calls piling up on the API servers)
- Multiple cert-manager components were also OOM killed (not sure why yet)
Most of the objects are potentially redundant and identical for each job; we should try to consolidate them in some way.
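To check how redundant the objects really are, one option is to group them by content. The following is only a sketch for configmaps, assuming jq is installed on the deploy host; the function name duplicate_configmaps is hypothetical, and it compares only the .data field, not labels or annotations:

```shell
# Hypothetical helper: show how many ConfigMaps in a namespace share
# identical .data, most common payload first.
# jq -S sorts object keys and -c prints each .data object on one line,
# so identical payloads become identical lines that uniq -c can count.
duplicate_configmaps() {
    ns="$1"
    kubectl -n "$ns" get configmaps -o json |
        jq -S -c '.items[].data' |
        sort | uniq -c | sort -rn
}
```

Running `duplicate_configmaps mw-script | head` would list the most frequently repeated payloads with their counts. A small number of distinct payloads with very high counts would support sharing those objects across jobs.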
It was also noted that foreachwiki currently launches mwscript once for each wiki, which further increases the number of jobs in the mw-script namespace.