
The WDQS streaming updater was unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00)
Closed, Resolved · Public · 3 Estimated Story Points

Description

For 7 hours (2022-02-06T23:00:00 to 2022-02-07T06:20:00) the streaming updater in eqiad stopped working properly, preventing edits from flowing to the wdqs machines in eqiad.
The lag started to rise in eqiad and caused edits to be throttled during this period:

[Screenshot Capture d’écran du 2022-02-07 11-40-08.png: eqiad update lag rising during the incident]

Investigations:

  • the streaming updater for WCQS went down from 2022-02-06T16:32:00 to 2022-02-06T23:00:00
  • the streaming updater for WDQS went down from 2022-02-06T23:00:00 to 2022-02-07T06:20:00
  • the total number of task slots went down from 24 to 20 (4 task slots == 1 pod) between 2022-02-06T16:32:00 and 2022-02-07T06:20:00, causing resource starvation and preventing both jobs from running at the same time (flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"})
  • kubernetes1014 (T301099) seems to have shown problems during this same period (2022-02-06T16:32:00 to 2022-02-07T06:20:00)
  • the deployment used by the updater had one POD (1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a) running on kubernetes1014
  • the flink session cluster was able to regain its 24 slots after 1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a came back (at 2022-02-07T08:07:00), then this POD disappeared again in favor of another one and the service successfully restarted.
  • during the whole incident the k8s metrics & flink metrics seemed to disagree (see the query sketch after this list):
    • flink says that it lost 4 task slots (i.e. 1 taskmanager POD)
    • k8s always reports at least 6 PODS (count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"}))
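
A rough way to surface this disagreement from Prometheus, using only the two metrics quoted above (a sketch; it assumes 4 task slots per taskmanager POD, as observed in this incident, and the label names shown in the queries above):

  # flink's view: taskmanager PODs implied by the registered task slots (4 slots per POD)
  flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"} / 4

  # k8s's view: taskmanager containers still reporting cAdvisor metrics
  count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"})

  # difference between the two views: 0 when they agree, 1 during this incident (k8s saw 6 PODs, flink only 5)
  # (scalar() assumes taskSlotsTotal is a single series)
  count(container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"})
    - scalar(flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"} / 4)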

Questions (answered):

  • why do flink and k8s metrics disagree (active PODs vs number of task managers)?
    • Flink could not contact the container running on kubernetes1014 and thus freed its resources (task slots); k8s attempted to kill the container as well but did not fully reclaim the resources (PODs) allocated to it
  • why was a new POD not created after kubernetes1014 went down (making 1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a unavailable to the deployment)?
    • From the k8s point of view kubernetes1014 was flapping between the Ready and NotReady states, so k8s preferred to restart the containers there rather than reschedule the PODs elsewhere (see the flapping-detection sketch after this list)
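
On the flapping itself, assuming kube-state-metrics is available in our Prometheus, something along these lines could be used to spot a node bouncing between Ready and NotReady (a sketch only; the window and threshold are arbitrary):

  # nodes whose Ready condition changed more than 4 times over the last hour
  changes(kube_node_status_condition{condition="Ready", status="true"}[1h]) > 4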

What could we have done better:

  • we could have routed wdqs traffic to codfw during the outage and avoided throttling edits

Action items:

  • create an incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-02-06_wdqs_updater
  • T305068: Alert if the number of flink task slots goes below what we expect (see the expression sketch after this list)
  • T293063: adapt/create runbooks for the streaming updater and take this incident into account (esp. we should have reacted to the alert and routed all wdqs traffic to codfw)
  • To be discussed with service ops:
    • Investigate and address the reasons why, after a node failure, k8s did not fulfill its promise of making sure that the rdf-streaming-updater deployment has 6 working replicas
    • If the above is not possible, could we mitigate this problem by over-allocating resources (increasing the number of replicas) to the deployment, to increase the chances of proper recovery if this situation happens again?
  • T277876: to possibly improve the resiliency of the k8s nodes
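
For T305068, a minimal alert expression could look like the following (a sketch only; 24 is the slot count expected from the current 6-taskmanager deployment and would have to be kept in sync with it, and the actual rule belongs wherever our alerting rules live):

  # fire when the flink session cluster has fewer task slots than the expected 6 taskmanagers * 4 slots
  flink_jobmanager_taskSlotsTotal{kubernetes_namespace="rdf-streaming-updater"} < 24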

Event Timeline

@JMeybohm we're still investigating why the application did not properly recover when kubernetes1014 went down, but if you have ideas on the two questions in the ticket description that would be very helpful, thanks!

k8s seems to have tried to kill the container for the whole period according to messages like: Container flink-session-cluster-main-taskmanager failed liveness probe, will be restarted (searching for k8s_event.involvedObject.uid:"1db45eb6-2405-4aa3-bec1-71fcdbbe4f9a").
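
If the liveness-probe restarts are what we suspect, the restart counter from kube-state-metrics (assuming it is scraped for this cluster) should show the same churn over the incident window (a sketch; the 8h window roughly covers the incident):

  # container restarts accumulated by the taskmanager PODs over the incident window
  increase(kube_pod_container_status_restarts_total{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"}[8h])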

Unfortunately I'm not exactly sure what happened to the node. What I know is that the system load surged (potentially due to high iowait), leaving running processes practically starving, but the system was still responding to ICMP and the kubernetes status heartbeats still (mostly) worked, leaving the node flipping between the Ready and NotReady states.
That means the node was not actually down from the k8s POV, which is why no new Pods were created until I drained the node and then powercycled it (evicting pods was actually hanging as well, as k8s tries to be nice and the node was still in its overloaded state).

Thanks! I've updated the task description with a few action items, please let us know if you see something else we should do to improve this.

@RKemper or @bking will create an incident report from this ticket. If any actionables are identified, they will be tracked in their own tasks.

Gehel set the point value for this task to 3. (Feb 28 2022, 4:47 PM)

Discussion with service ops will happen on this ticket. Other action items will be tracked separately.

Tentatively moving this ticket to needs review as I'm not sure we can do much more from the search team's perspective.
I think the last point to discuss was to investigate the reasons why a single misbehaving k8s node could make a deployment unstable.
@JMeybohm do you see any additional action items that would improve the resilience of k8s in such scenario?

To be discussed with service ops:

  • Investigate and address the reasons why, after a node failure, k8s did not fulfill its promise of making sure that the rdf-streaming-updater deployment has 6 working replicas

The problem was more that the node did not really fail (to its full extent). It was heavily overloaded (for an unknown reason) and that's potentially why the containers/processes running there seemed dead. But from the K8s perspective the Pods were still running, and a new pod was scheduled as soon as I power cycled the node (i.e. K8s was able to detect a mismatch between desired and existing replicas).

  • If the above is not possible, could we mitigate this problem by over-allocating resources (increasing the number of replicas) to the deployment, to increase the chances of proper recovery if this situation happens again?

If that makes sense from your POV you could do that, of course. I can't speak to how problematic this situation was compared to the potential waste of resources another pod means. But if the current workload is already maxing out the capacity of the 6 replicas you have, maybe bumping that to 7 might be smart anyway to account for peaks?

@JMeybohm do you see any additional action items that would improve the resilience of k8s in such scenario?

Unfortunately we don't have any data on what went wrong on the node. I think T277876 would be a step in the right direction but I also doubt it would have fully prevented this issue (ultimately I can't say).

Thanks for the quick answer! (response inline)

  • If the above is not possible, could we mitigate this problem by over-allocating resources (increasing the number of replicas) to the deployment, to increase the chances of proper recovery if this situation happens again?

If that makes sense from your POV you could do that, of course. I can't speak to how problematic this situation was compared to the potential waste of resources another pod means. But if the current workload is already maxing out the capacity of the 6 replicas you have, maybe bumping that to 7 might be smart anyway to account for peaks?

The additional PODs won't be used, as a flink job does not automatically scale, so it would be a pure waste of resources (2.5G of reserved mem per additional POD). I guess it would only help to improve redundancy in this scenario if k8s assigns every POD to a distinct machine; in that case, even with a single machine misbehaving, flink would have enough redundancy to allocate the job to the spare POD. If k8s does allocation randomly, or if there are not enough k8s worker nodes (1 spare POD in our case would mean spreading the PODs over 8 different machines), then it's probably not worth the waste of resources.
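
To check whether the taskmanager PODs actually land on distinct machines, something like the query below should work, assuming the hosting node is exposed as a label on the cAdvisor series (called node here; the actual label name in our setup may differ):

  # distinct k8s nodes currently hosting a taskmanager POD; 6 means one POD per machine
  count(count by (node) (container_memory_usage_bytes{namespace="rdf-streaming-updater", container="flink-session-cluster-main-taskmanager"}))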

@JMeybohm do you see any additional action items that would improve the resilience of k8s in such scenario?

Unfortunately we don't have any data on what went wrong on the node. I think T277876 would be a step in the right direction but I also doubt it would have fully prevented this issue (ultimately I can't say).

Thanks, I'm adding it to the ticket description as a possible improvement.

The additional PODs won't be used, as a flink job does not automatically scale, so it would be a pure waste of resources (2.5G of reserved mem per additional POD). I guess it would only help to improve redundancy in this scenario if k8s assigns every POD to a distinct machine; in that case, even with a single machine misbehaving, flink would have enough redundancy to allocate the job to the spare POD. If k8s does allocation randomly, or if there are not enough k8s worker nodes (1 spare POD in our case would mean spreading the PODs over 8 different machines), then it's probably not worth the waste of resources.

K8s will try to schedule replicas of one Deployment onto different Nodes by default, and we can also force it to do so. But tbh I would not do that in this case, as in most cases it should be just fine. I expect this situation to be a rare exception (and I probably jinxed that now) as we have not seen it before or since. So as long as it's not super critical, I would refrain from trying to optimize the workload for this type of failure. Ultimately this should be taken care of by k8s, so we should invest there - especially if it should happen again.