I have noticed that the 4th attempt of the afwiki metahistorybz2dump task was blocked, with the Airflow logs repeatedly showing:
...
[2025-04-09, 17:56:16 UTC] {pod_manager.py:656} INFO - Pod mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c has phase Running
[2025-04-09, 17:56:18 UTC] {pod_manager.py:656} INFO - Pod mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c has phase Running
...

The pod was in the following state:
brouberol@deploy1003:~$ k get pod -w
NAME                                                       READY   STATUS      RESTARTS   AGE
mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c   2/3     OOMKilled   0          4h36m
mediawiki-dumps-legacy-toolbox-66b8dc5599-x29r2            2/2     Running     0          6d8h
Looking at the Pod Details dashboard, we can see that the php process was OOM-killed by the kernel, which is confirmed by dmesg on dse-k8s-worker1007.eqiad.wmnet:
[Wed Apr 9 15:50:58 2025] php8.1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=980
[Wed Apr 9 15:50:58 2025] CPU: 38 PID: 500339 Comm: php8.1 Not tainted 6.1.0-31-amd64 #1 Debian 6.1.128-1
[Wed Apr 9 15:50:58 2025] Hardware name: Dell Inc. PowerEdge R740xd/0WRPXK, BIOS 2.13.3 12/13/2021
[Wed Apr 9 15:50:58 2025] Call Trace:
[Wed Apr 9 15:50:58 2025] <TASK>
[Wed Apr 9 15:50:58 2025] dump_stack_lvl+0x44/0x5c
[Wed Apr 9 15:50:58 2025] dump_header+0x4a/0x211
[Wed Apr 9 15:50:58 2025] oom_kill_process.cold+0xb/0x10
[Wed Apr 9 15:50:58 2025] out_of_memory+0x1fd/0x4c0
[Wed Apr 9 15:50:58 2025] mem_cgroup_out_of_memory+0x134/0x150
[Wed Apr 9 15:50:58 2025] try_charge_memcg+0x696/0x780
[Wed Apr 9 15:50:58 2025] charge_memcg+0x39/0xf0
[Wed Apr 9 15:50:58 2025] __mem_cgroup_charge+0x28/0x80
[Wed Apr 9 15:50:58 2025] __handle_mm_fault+0x95c/0xfa0
[Wed Apr 9 15:50:58 2025] ? psi_task_switch+0xb3/0x1f0
[Wed Apr 9 15:50:58 2025] handle_mm_fault+0xdb/0x2d0
[Wed Apr 9 15:50:58 2025] do_user_addr_fault+0x191/0x550
[Wed Apr 9 15:50:58 2025] exc_page_fault+0x70/0x170
[Wed Apr 9 15:50:58 2025] asm_exc_page_fault+0x22/0x30
[Wed Apr 9 15:50:58 2025] RIP: 0033:0x7f01fdb8fe94
[Wed Apr 9 15:50:58 2025] Code: 02 62 e1 fe 28 6f 5e 03 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 62 e1 7d 28 e7 07 62 e1 7d 28 e7 4f 01 62 e1 7d 28 e7 57 02 <62> e1 7d 28 e7 5f 03 48 81 c7 80 00 00 00 48 81 fa 80 00 00 00 77
[Wed Apr 9 15:50:58 2025] RSP: 002b:00007ffd7664a308 EFLAGS: 00010202
[Wed Apr 9 15:50:58 2025] RAX: 00007f00d2400000 RBX: 00007f01fb800040 RCX: 00007f00d63fffe0
[Wed Apr 9 15:50:58 2025] RDX: 0000000002565fe0 RSI: 00007f016309a020 RDI: 00007f00d3e99fa0
[Wed Apr 9 15:50:58 2025] RBP: 00007f0161600000 R08: ffffffffffffffe0 R09: 0000000000000000
[Wed Apr 9 15:50:58 2025] R10: 00007f00d6400000 R11: 00007f00d2400000 R12: 00000001169752d8
[Wed Apr 9 15:50:58 2025] R13: 00007f00d2400000 R14: 00007f01fb81ac30 R15: 00007f01f2476ac0
[Wed Apr 9 15:50:58 2025] </TASK>
[Wed Apr 9 15:50:58 2025] memory: usage 4745216kB, limit 4745216kB, failcnt 1490074
[Wed Apr 9 15:50:58 2025] swap: usage 0kB, limit 0kB, failcnt 0
[Wed Apr 9 15:50:58 2025] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0327d1fe_6b08_4cb5_a682_d754be2ecb6b.slice/cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope:
[Wed Apr 9 15:50:58 2025] anon 4840988672
file 40960
kernel 17981440
kernel_stack 163840
pagetables 10047488
sec_pagetables 0
percpu 2664
sock 90112
vmalloc 8192
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 3665821696
file_thp 0
shmem_thp 0
inactive_anon 4840865792
active_anon 28672
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 6814792
slab_unreclaimable 782792
slab 7597584
workingset_refault_anon 0
workingset_refault_file 1024
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 168
pgscan 374301
pgsteal 373693
pgscan_kswapd 0
pgscan_direct 374301
pgsteal_kswapd 0
pgsteal_direct 373693
pgfault 1195391
pgmajfault 0
pgrefill 19
pgactivate 85
pgdeactivate 19
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 24835
thp_collapse_alloc 0
[Wed Apr 9 15:50:58 2025] Tasks state (memory values in pages):
[Wed Apr 9 15:50:58 2025] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Apr 9 15:50:58 2025] [ 500157] 33 500157 1513 912 57344 0 980 worker
[Wed Apr 9 15:50:58 2025] [ 500170] 33 500170 64134 5558 126976 0 980 python3
[Wed Apr 9 15:50:58 2025] [ 500317] 33 500317 56918 22173 290816 0 980 php8.1
[Wed Apr 9 15:50:58 2025] [ 500327] 33 500327 620 145 49152 0 980 sh
[Wed Apr 9 15:50:58 2025] [ 500330] 33 500330 2443 2175 57344 0 980 bzip2
[Wed Apr 9 15:50:58 2025] [ 500338] 33 500338 620 128 45056 0 980 sh
[Wed Apr 9 15:50:58 2025] [ 500339] 33 500339 1229433 1167930 9494528 0 980 php8.1
[Wed Apr 9 15:50:58 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0327d1fe_6b08_4cb5_a682_d754be2ecb6b.slice/cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0327d1fe_6b08_4cb5_a682_d754be2ecb6b.slice/cri-containerd-6d14689841acbe3c1d69edf3f851148407a88b82baae2c1fe3b9321f3bb076d5.scope,task=php8.1,pid=500339,uid=33
[Wed Apr 9 15:50:58 2025] Memory cgroup out of memory: Killed process 500339 (php8.1) total-vm:4917732kB, anon-rss:4644612kB, file-rss:27108kB, shmem-rss:0kB, UID:33 pgtables:9272kB oom_score_adj:980

At the time of the OOM kill, we see an error in the dump logs, but the overall process seems fairly resilient to failures and happily continues:
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] 2025-04-09 15:50:35: afwiki (ID 11) 178839 pages (19.5|100136.8/sec all|curr), 2121000 revs (230.9|559.9/sec all|curr), ETA 2025-04-09 16:36:53 [max 2762535]
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] 2025-04-09 15:50:37: afwiki (ID 11) 179046 pages (19.5|96253.0/sec all|curr), 2122000 revs (231.0|537.6/sec all|curr), ETA 2025-04-09 16:36:50 [max 2762535]
2025-04-09T15:50:37.848012001Z
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] getting/checking text es:DB://cluster26/172664?flags=utf-8,gzip failed (Generic error while obtaining text for id es:DB://cluster26/172664?flags=utf-8,gzip) for revision 2497262
[2025-04-09, 15:50:37 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] 328
[2025-04-09, 15:50:42 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] (Will retry 4 more times)
[2025-04-09, 15:50:47 UTC] {pod_manager.py:477} INFO - [mediawiki-sql-xml-dump] Spawning database subprocess: '/usr/bin/php8.1' '/srv/mediawiki/php-1.44.0-wmf.21/../multiversion/MWScript.php' 'fetchText.php' '--wiki' 'afwiki'

However, from that point on, the Pod remained in the OOMKilled state, and Airflow never marked the task as succeeded, even though the dump process completed.
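For reference, the per-container termination reasons can be read straight from the pod status with the official Kubernetes Python client. This is a minimal sketch; the namespace is an assumption on my part, and the pod name is taken from the kubectl output above:

# Sketch: list which containers of the pod were OOMKilled vs terminated cleanly.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(
    name="mediawiki-afwiki-sql-xml-dump-metahistorybz2dump-v7fv07c",
    namespace="mediawiki-dumps-legacy",  # assumed namespace
)
for status in pod.status.container_statuses or []:
    terminated = status.state.terminated
    if terminated:
        print(f"{status.name}: reason={terminated.reason}, exit_code={terminated.exit_code}")
    else:
        print(f"{status.name}: still running or waiting")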
We can do several things here:
- ask for much higher memory limits than requests, which should reduce the number of OOMKills, as many small dumps will never get anywhere near their limit
- tweak the method of the WikimediaDumpOperator in charge of checking whether the pod has finished, so that it returns True when the pod is in an (n-1)/n OOMKilled state (see the sketch after this list)
- maintain a list of known wikis/jobs that require extra memory
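To make the second option concrete, here is a rough sketch of the kind of check I have in mind. This is not the actual WikimediaDumpOperator code, just a hypothetical helper operating on the V1Pod object returned by the Kubernetes API: it treats the pod as successfully finished when all containers have terminated, exactly one was OOMKilled, and every other one exited 0.

# Hypothetical helper, not the actual WikimediaDumpOperator implementation:
# decide whether a pod that ended up in an (n-1)/n OOMKilled state should
# still be considered successful.
from kubernetes.client.models import V1Pod

def finished_despite_oomkill(pod: V1Pod) -> bool:
    statuses = pod.status.container_statuses or []
    terminated = [s.state.terminated for s in statuses if s.state and s.state.terminated]
    if not statuses or len(terminated) != len(statuses):
        return False  # at least one container is still running or waiting
    oom_killed = [t for t in terminated if t.reason == "OOMKilled"]
    clean = [t for t in terminated if t.reason != "OOMKilled" and t.exit_code == 0]
    # accept exactly one OOMKilled container as long as every other one exited cleanly
    return len(oom_killed) == 1 and len(clean) == len(terminated) - 1

Whether this should only accept an OOMKill of a specific container (rather than any single one) is open for discussion.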
