For roughly 7 hours (`2022-02-06T23:00:00` to `2022-02-07T06:20:00`) the streaming updater in `eqiad` stopped working properly, preventing edits from flowing to the wdqs machines in eqiad.
The update lag started to rise in eqiad, causing edits to be throttled during this period:
{F34944091}
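For reference, a sketch of the kind of query that could sit behind the lag panel above; the metric name `blazegraph_lastupdated` and its labels are assumptions here, not confirmed from the dashboard:

```
# Hypothetical: seconds since each wdqs host last applied an update
# (metric name and labels are assumed, not taken from the panel).
time() - blazegraph_lastupdated{cluster="wdqs"}
```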
Investigations:
* the streaming updater for WCQS went down from `2022-02-06T16:32:00` to `2022-02-06T23:00:00`
* the streaming updater for WDQS went down from `2022-02-06T23:00:00` to `2022-02-07T06:20:00`
* the total number of task slots dropped from 24 to 20 (4 task slots == 1 pod) between `2022-02-06T16:32:00` and `2022-02-07T06:20:00`, causing resource starvation and preventing both jobs from running at the same time (`flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"}`)
* kubernetes1014 (T301099) appears to have had problems during this same period (`2022-02-06T16:32:00` to `2022-02-07T06:20:00`)
* the deployment used by the updater had one pod (`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a`) scheduled on kubernetes1014
* the flink session cluster regained its 24 slots after `1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` came back (at `2022-02-07T08:07:00`); this pod then disappeared again in favor of another one and the service successfully restarted.
* during the whole incident the k8s and flink metrics disagreed (see the query sketch after this list):
** flink said it lost 4 task managers (1 pod)
** k8s consistently reported at least 6 pods (`count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"})`)
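One way to visualize the disagreement directly is to chart both views together, reusing the two metrics quoted above (dividing slots by 4 relies on the 4-slots-per-pod ratio noted earlier):

```
# Pods' worth of task slots as seen by flink (4 slots per pod):
flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"} / 4

# Task manager pods as seen by k8s:
count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"})
```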
Questions:
- why do the flink and k8s metrics disagree (number of active pods vs number of task managers)?
- why was a new pod not created after kubernetes1014 went down (making `1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` unavailable to the deployment)?
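To dig into the second question, one starting point would be checking what phases the namespace's pods reported over the incident window; a sketch, assuming kube-state-metrics is scraped for this cluster:

```
# Hypothetical: pods in the namespace sitting in a non-Running phase
# (kube_pod_status_phase is 1 for the pod's current phase, 0 otherwise).
kube_pod_status_phase{namespace="rdf-streaming-updater", phase!="Running"} > 0
```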
What could we have done better:
- we could have routed wdqs traffic to codfw during the outage and avoided throttling edits