For a little over 7 hours (`2022-02-06T23:00:00` to `2022-02-07T06:20:00`) the streaming updater in `eqiad` stopped working properly, preventing edits from flowing to all the wdqs machines in eqiad.
The lag started to rise in eqiad and caused edits to be throttled during this period:
{F34944091}
Investigations:
* the streaming updater for WCQS went down from `2022-02-06T16:32:00` to `2022-02-06T23:00:00`
* the streaming updater for WDQS went down from `2022-02-06T23:00:00` to `2022-02-07T06:20:00`
* the total number of task slots went down from 24 to 20 (4 task slots == 1 pod) between `2022-02-06T16:32:00` and `2022-02-07T06:20:00`, causing resource starvation and preventing both jobs from running at the same time (`flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"}`)
* kubernetes1014 (T301099) seems to have shown problems during this same period (`2022-02-06T16:32:00` to `2022-02-07T06:20:00`)
* the deployment used by the updater used one POD (`1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a`) from kubernetes1014
* the flink session cluster was able to regain its 24 slots after `1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` came back (at `2022-02-07T08:07:00`), then this POD disappeared again in favor of another one and the service successfully restarted.
* during the whole incident k8s metrics & flink metrics seem to disagree:
** flink says that it lost 4 task slots (1 POD)
** k8s always reports at least 6 PODS (`count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"})`)
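The disagreement between the two metrics can be expressed as simple arithmetic. The sketch below is illustrative only: the function name `missing_pods` is hypothetical, and it assumes every pod contributes exactly 4 task slots, as stated above.

```python
SLOTS_PER_POD = 4  # from this report: 4 task slots == 1 pod


def missing_pods(task_slots_total: int, k8s_pod_count: int,
                 slots_per_pod: int = SLOTS_PER_POD) -> int:
    """Number of pods that k8s counts but whose slots flink does not see."""
    expected_slots = k8s_pod_count * slots_per_pod
    missing_slots = expected_slots - task_slots_total
    # Round up, so a pod with any missing slot counts as missing;
    # never report a negative number of pods.
    return max(0, -(-missing_slots // slots_per_pod))


# Values observed during the incident: flink saw 20 slots, k8s saw 6 pods.
print(missing_pods(task_slots_total=20, k8s_pod_count=6))  # prints 1
```

With the values seen during the incident, flink is short by exactly one pod's worth of slots even though k8s still reports all 6 pods.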
Questions (answered):
- why do flink and k8s metrics disagree (active PODs vs number of task managers)?
-- Flink could not contact the container running on kubernetes1014 and thus freed its resources (task slots). k8s attempted to kill the container as well but did not fully reclaim the resources (PODs) allocated to it
- why was a new POD not created after kubernetes1014 went down (making `1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a` unavailable to the deployment)?
-- From the k8s point of view, kubernetes1014 was flapping between the ready and not ready states, so k8s preferred to restart the containers there
What could we have done better:
- we could have routed wdqs traffic to codfw during the outage and avoided throttling edits
Action items:
- create an incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-02-06_wdqs_updater
- T305068: Alert if the number of flink task slots goes below what we expect
- T293063: adapt/create runbooks for the streaming updater and take this incident into account (esp. we should have reacted to the alert and routed all wdqs traffic to codfw)
- To be discussed with service ops:
-- Investigate and address the reasons why, after a node failure, k8s did not fulfill its promise of making sure that the rdf-streaming-updater deployment has 6 working replicas
-- If the above is not possible could we mitigate this problem by over-allocating resources (increase the number of replicas) to the deployment to increase the chances of proper recovery if this situation happens again?
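Two of the action items above (the task slot alert and the over-allocation of replicas) can be sketched as follows. This is a minimal illustration under stated assumptions, not the alert that will be deployed: the function names, the sample format (unix timestamp, slot count) and the 300 second threshold are all hypothetical, and a real alert would more likely be an alerting rule on `flink_jobmanager_taskSlotsTotal`.

```python
import math


def replicas_needed(required_slots: int, slots_per_pod: int,
                    tolerated_pod_losses: int) -> int:
    """Replicas to request so that losing some pods still leaves enough slots."""
    return math.ceil(required_slots / slots_per_pod) + tolerated_pod_losses


def slots_alert_firing(samples, expected_slots: int, min_duration_s: int) -> bool:
    """Return True if the slot count has been below expected_slots
    continuously for at least min_duration_s up to the latest sample.

    samples: list of (unix_timestamp, slot_count), sorted by time (assumed format).
    """
    below_since = None
    for ts, slots in samples:
        if slots < expected_slots:
            if below_since is None:
                below_since = ts  # start of the current low period
        else:
            below_since = None  # recovered, reset the low period
    if below_since is None:
        return False
    return samples[-1][0] - below_since >= min_duration_s


# 24 slots required, 4 slots per pod, tolerate the loss of 1 pod.
print(replicas_needed(24, 4, 1))  # prints 7
```

Under these assumptions, 7 replicas would have kept 24 slots available when the POD on kubernetes1014 was lost, and an alert on a sustained drop below 24 slots would have fired shortly after `2022-02-06T16:32:00`.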