
Tolerate dump task pods being OOM killed
Closed, Resolved · Public

Description

I have noticed that the 4th attempt of the afwiki metahistorybz2dump task was blocked on:

...
[2025-04-09, 17:56:16 UTC] {pod_manager.py:656} INFO - Pod mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c has phase Running
[2025-04-09, 17:56:18 UTC] {pod_manager.py:656} INFO - Pod mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c has phase Running
...

The pod was in the following state:

brouberol@deploy1003:~$ k get pod -w
NAME                                                       READY   STATUS      RESTARTS   AGE
mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c   2/3     OOMKilled   0          4h36m
mediawiki-dumps-legacy-toolbox-66b8dc5599-x29r2            2/2     Running     0          6d8h
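
To see which of the three containers was the one that got OOM-killed, one can inspect the per-container statuses on the pod. Below is a minimal sketch using the official kubernetes Python client; the pod and namespace names are the ones from this task, and this is only an illustration, not part of the operator code:

from kubernetes import client, config

# Sketch: list each container of the dump pod with its ready flag and, if it
# terminated, the termination reason (e.g. OOMKilled).
config.load_kube_config()  # use load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(
    name="mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c",
    namespace="mediawiki-dumps-legacy",
)
for status in pod.status.container_statuses or []:
    terminated = status.state.terminated or (
        status.last_state.terminated if status.last_state else None
    )
    reason = terminated.reason if terminated else "running"
    print(f"{status.name}: ready={status.ready} reason={reason}")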

The Pod Details dashboard shows that the php process was OOM-killed by the kernel, which we can confirm from the dmesg output on dse-k8s-worker1007.eqiad.wmnet:

[Wed Apr  9 15:50:58 2025] php8.1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=980
[Wed Apr  9 15:50:58 2025] CPU: 38 PID: 500339 Comm: php8.1 Not tainted 6.1.0-31-amd64 #1  Debian 6.1.128-1
[Wed Apr  9 15:50:58 2025] Hardware name: Dell Inc. PowerEdge R740xd/0WRPXK, BIOS 2.13.3 12/13/2021
[Wed Apr  9 15:50:58 2025] Call Trace:
[Wed Apr  9 15:50:58 2025]  <TASK>
[Wed Apr  9 15:50:58 2025]  dump_stack_lvl+0x44/0x5c
[Wed Apr  9 15:50:58 2025]  dump_header+0x4a/0x211
[Wed Apr  9 15:50:58 2025]  oom_kill_process.cold+0xb/0x10
[Wed Apr  9 15:50:58 2025]  out_of_memory+0x1fd/0x4c0
[Wed Apr  9 15:50:58 2025]  mem_cgroup_out_of_memory+0x134/0x150
[Wed Apr  9 15:50:58 2025]  try_charge_memcg+0x696/0x780
[Wed Apr  9 15:50:58 2025]  charge_memcg+0x39/0xf0
[Wed Apr  9 15:50:58 2025]  __mem_cgroup_charge+0x28/0x80
[Wed Apr  9 15:50:58 2025]  __handle_mm_fault+0x95c/0xfa0
[Wed Apr  9 15:50:58 2025]  ? psi_task_switch+0xb3/0x1f0
[Wed Apr  9 15:50:58 2025]  handle_mm_fault+0xdb/0x2d0
[Wed Apr  9 15:50:58 2025]  do_user_addr_fault+0x191/0x550
[Wed Apr  9 15:50:58 2025]  exc_page_fault+0x70/0x170
[Wed Apr  9 15:50:58 2025]  asm_exc_page_fault+0x22/0x30
[Wed Apr  9 15:50:58 2025] RIP: 0033:0x7f01fdb8fe94
[Wed Apr  9 15:50:58 2025] Code: 02 62 e1 fe 28 6f 5e 03 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 62 e1 7d 28 e7 07 62 e1 7d 28 e7 4f 01 62 e1 7d 28 e7 57 02 <62> e1 7d 28 e7 5f 03 48 81 c7 80 00 00 00 48 81 fa 80 00 00 00 77
[Wed Apr  9 15:50:58 2025] RSP: 002b:00007ffd7664a308 EFLAGS: 00010202
[Wed Apr  9 15:50:58 2025] RAX: 00007f00d2400000 RBX: 00007f01fb800040 RCX: 00007f00d63fffe0
[Wed Apr  9 15:50:58 2025] RDX: 0000000002565fe0 RSI: 00007f016309a020 RDI: 00007f00d3e99fa0
[Wed Apr  9 15:50:58 2025] RBP: 00007f0161600000 R08: ffffffffffffffe0 R09: 0000000000000000
[Wed Apr  9 15:50:58 2025] R10: 00007f00d6400000 R11: 00007f00d2400000 R12: 00000001169752d8
[Wed Apr  9 15:50:58 2025] R13: 00007f00d2400000 R14: 00007f01fb81ac30 R15: 00007f01f2476ac0
[Wed Apr  9 15:50:58 2025]  </TASK>
[Wed Apr  9 15:50:58 2025] memory: usage 4745216kB, limit 4745216kB, failcnt 1490074
[Wed Apr  9 15:50:58 2025] swap: usage 0kB, limit 0kB, failcnt 0
[Wed Apr  9 15:50:58 2025] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0327d1fe_6b08_4cb5_a682_d754be2ecb6b.slice/cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope:
[Wed Apr  9 15:50:58 2025] anon 4840988672
                           file 40960
                           kernel 17981440
                           kernel_stack 163840
                           pagetables 10047488
                           sec_pagetables 0
                           percpu 2664
                           sock 90112
                           vmalloc 8192
                           shmem 0
                           zswap 0
                           zswapped 0
                           file_mapped 0
                           file_dirty 0
                           file_writeback 0
                           swapcached 0
                           anon_thp 3665821696
                           file_thp 0
                           shmem_thp 0
                           inactive_anon 4840865792
                           active_anon 28672
                           inactive_file 0
                           active_file 0
                           unevictable 0
                           slab_reclaimable 6814792
                           slab_unreclaimable 782792
                           slab 7597584
                           workingset_refault_anon 0
                           workingset_refault_file 1024
                           workingset_activate_anon 0
                           workingset_activate_file 0
                           workingset_restore_anon 0
                           workingset_restore_file 0
                           workingset_nodereclaim 168
                           pgscan 374301
                           pgsteal 373693
                           pgscan_kswapd 0
                           pgscan_direct 374301
                           pgsteal_kswapd 0
                           pgsteal_direct 373693
                           pgfault 1195391
                           pgmajfault 0
                           pgrefill 19
                           pgactivate 85
                           pgdeactivate 19
                           pglazyfree 0
                           pglazyfreed 0
                           zswpin 0
                           zswpout 0
                           thp_fault_alloc 24835
                           thp_collapse_alloc 0
[Wed Apr  9 15:50:58 2025] Tasks state (memory values in pages):
[Wed Apr  9 15:50:58 2025] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Apr  9 15:50:58 2025] [ 500157]    33 500157     1513      912    57344        0           980 worker
[Wed Apr  9 15:50:58 2025] [ 500170]    33 500170    64134     5558   126976        0           980 python3
[Wed Apr  9 15:50:58 2025] [ 500317]    33 500317    56918    22173   290816        0           980 php8.1
[Wed Apr  9 15:50:58 2025] [ 500327]    33 500327      620      145    49152        0           980 sh
[Wed Apr  9 15:50:58 2025] [ 500330]    33 500330     2443     2175    57344        0           980 bzip2
[Wed Apr  9 15:50:58 2025] [ 500338]    33 500338      620      128    45056        0           980 sh
[Wed Apr  9 15:50:58 2025] [ 500339]    33 500339  1229433  1167930  9494528        0           980 php8.1
[Wed Apr  9 15:50:58 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0327d1fe_6b08_4cb5_a682_d754be2ecb6b.slice/cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0327d1fe_6b08_4cb5_a682_d754be2ecb6b.slice/cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope,task=php8.1,pid=500339,uid=33
[Wed Apr  9 15:50:58 2025] Memory cgroup out of memory: Killed process 500339 (php8.1) total-vm:4917732kB, anon-rss:4644612kB, file-rss:27108kB, shmem-rss:0kB, UID:33 pgtables:9272kB oom_score_adj:980

At the time of the OOM kill, we see an error in the dump logs, but the overall process seems to be fairly resilient to failure and happily continues:

[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] 2025-04-09 15:50:35: afwiki (ID 11) 178839 pages (19.5|100136.8/sec all|curr), 2121000 revs (230.9|559.9/sec all|curr), ETA 2025-04-09 16:36:53 [max 2762535]
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] 2025-04-09 15:50:37: afwiki (ID 11) 179046 pages (19.5|96253.0/sec all|curr), 2122000 revs (231.0|537.6/sec all|curr), ETA 2025-04-09 16:36:50 [max 2762535]
2025-04-09T15:50:37.848012001Z
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] getting/checking text es:DB://cluster26/172664?flags=utf-8,gzip failed (Generic error while obtaining text for id es:DB://cluster26/172664?flags=utf-8,gzip) for revision 2497262
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump]       328
[2025-04-09, 15:50:42 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump]        (Will retry 4 more times)
[2025-04-09, 15:50:47 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] Spawning database subprocess: '/usr/bin/php8.1' '/srv/mediawiki/php-1.44.0-wmf.21/../multiversion/MWScript.php' 'fetchText.php' '--wiki' 'afwiki'

However, from that point on, the Pod remained in the OOMKilled state, and Airflow never acknowledged the task as succeeded, even though the dump process completed.

We can do several things here:

  • ask for much higher memory limits than requests, which should reduce the number of OOM kills, as many small dumps will never come close to their limit
  • tweak the WikimediaDumpOperator method in charge of checking whether the pod has finished, and have it return True when the pod is in an n-1/n OOMKilled state (see the sketch after this list)
  • maintain a list of known wikis/jobs that require extra memory
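
As an illustration of the second option, here is a hypothetical helper (not the code that was eventually merged, see the related GitLab changes below) that treats the pod as finished when the base container terminated with the OOMKilled reason while the other containers are still ready:

from kubernetes.client.models import V1Pod


def pod_finished_despite_oomkill(pod: V1Pod, base_container_name: str) -> bool:
    """Return True for the n-1/n OOMKilled state: the base container was
    OOM-killed after doing its work, while the sidecars are still ready."""
    statuses = pod.status.container_statuses or []
    base = next((s for s in statuses if s.name == base_container_name), None)
    if base is None or base.state.terminated is None:
        return False  # base container not reported yet, or still running
    if base.state.terminated.reason != "OOMKilled":
        return False  # regular success/failure handling applies
    return all(s.ready for s in statuses if s.name != base_container_name)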

Details

Related Changes in GitLab:
  • test_k8s/dumpsv1: don't raise an exception cleaning up when the base container was oomkilled (repos/data-engineering/airflow-dags!1258, brouberol, T391510 → main)
  • test_k8s/dumpsv1: fix OOMKill evaluation (repos/data-engineering/airflow-dags!1233, brouberol, T391510 → main)
  • test_k8s/dumpsv1: The base container is not always the first one in the statuses (repos/data-engineering/airflow-dags!1230, brouberol, T391510 → main)
  • test_k8s/dumpsv1: add more defenses in how we determine if a pod was oomkilled (repos/data-engineering/airflow-dags!1229, brouberol, T391510 → main)
  • test_k8s/dumpsv1: tolerate completed pods with the OOMKilled status (repos/data-engineering/airflow-dags!1227, brouberol, T391510 → main)

Event Timeline

brouberol triaged this task as Medium priority.
brouberol updated the task description.

I've been able to reproduce the issue with a small pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: mediawiki-dumps-legacy
spec:
  restartPolicy: Never
  containers:
  - name: oom-container
    image: docker-registry.wikimedia.org/repos/data-engineering/airflow:2025-03-31-153238-8a72661f8984fde4846ffd4545c6a3b7e2b965d4
    resources:
      limits:
        memory: "100Mi"
    command: ["python", "-c"]
    args:
    - |
      import os
      filename = '/tmp/large_file'
      with open(filename, 'wb') as f:
        # Write 150MB of data
        f.write(b'0' * 150 * 1024 * 1024)
      # Read the file back into memory, pushing usage past the 100Mi limit
      with open(filename, 'rb') as f:
        data = f.read()
      # Keep the process running
      while True:
        pass
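
Applying this manifest and watching it with k get pod oom-demo -w should show the pod move to the OOMKilled status once the 150 MB file is read back into memory, well above the 100Mi limit.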

Something is a bit off with the requests/limits of the dump pod. We see that the memory limit is ~4.5 GB:

[Screenshot 2025-04-10 at 08.55.12.png: Pod Details dashboard showing a ~4.5 GB memory limit]

And indeed, looking at the YAML spec, we see:

Containers:
  mediawiki-sql-xml-dump:
    Container ID:  containerd://6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5
    Image:         docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli:2025-03-20-170003-publish-81
    Image ID:      docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli@sha256:6a2a672e28e215b0aa10ea076209788d1a090a683a49dfffbc0e27dd6dd2138b
    Port:          <none>
    ...
    Restart Count:  0
    Limits:
      cpu:     5
      memory:  4634Mi
    Requests:
      cpu:     4500m
      memory:  2634Mi
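
Note that 4634Mi × 1024 = 4745216kB, which matches the memory cgroup limit reported by the kernel in the dmesg output above, so this is indeed the limit that was hit.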

However, we define the following in our code:

LARGE_WIKI_POD_RESOURCES = k8s_client.V1ResourceRequirements(
    limits={"cpu": "4000m", "memory": "8Gi"}, requests={"cpu": "2000m", "memory": "4Gi"}
)
DEFAULT_POD_RESOURCES = k8s_client.V1ResourceRequirements(
    limits={"cpu": "1000m", "memory": "2Gi"}, requests={"cpu": "500m", "memory": "1Gi"}
)

After a cursory search in our DAG code, it seems that we stopped assigning these resources to the container a while back, so the container is probably assigned some default resources.
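
For reference, here is a hedged sketch of what re-attaching these resources could look like, assuming the dump task is ultimately built on the cncf.kubernetes KubernetesPodOperator; the task_id, image and the container_resources parameter name (used in recent provider releases) are assumptions, not the actual DAG code:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical sketch: pass the V1ResourceRequirements defined above to the pod.
# Older cncf.kubernetes provider releases named this parameter `resources`.
dump_task = KubernetesPodOperator(
    task_id="afwiki-metahistorybz2dump",
    name="mediawiki-afwiki-sql-xml-dump-metahistorybz2dump",
    namespace="mediawiki-dumps-legacy",
    image="docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli:2025-03-20-170003-publish-81",
    container_resources=DEFAULT_POD_RESOURCES,  # LARGE_WIKI_POD_RESOURCES for big wikis
)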

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1229

test_k8s/dumpsv1: add more defenses in how we determine if a pod was oomkilled

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1230

test_k8s/dumpsv1: The base container is not always the first one in the statuses

Now that https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1233 has been merged, we should be able to resolve this once and for all. I'll monitor the DAGs for any additional errors and will close this task once OOM kills are being tolerated.

I'm re-opening this one.
We've had another occurrence of a Pod being OOM-killed: Airflow detected it, but still crashed, as it considers OOMKilled a failed state.

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1258

test_k8s/dumpsv1: don't raise an exception cleaning up when the base container was oomkilled
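
Purely as an illustration of the kind of tolerance described in that change (this is not the content of !1258), a cleanup override on a KubernetesPodOperator subclass could swallow the OOMKilled state of the base container instead of raising:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator


class OOMTolerantPodOperator(KubernetesPodOperator):
    def cleanup(self, pod, remote_pod):
        # "base" is the default base container name in the provider; adjust if
        # the operator is configured differently.
        base_name = getattr(self, "base_container_name", "base")
        statuses = (remote_pod.status.container_statuses or []) if remote_pod else []
        base = next((s for s in statuses if s.name == base_name), None)
        if (
            base is not None
            and base.state.terminated is not None
            and base.state.terminated.reason == "OOMKilled"
        ):
            self.log.warning("Base container was OOM-killed; not failing the task")
            return
        super().cleanup(pod, remote_pod)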

Let's wait until the next occurrence of an OOM kill to validate that the mitigation works.