
kevinbazira (Kevin Bazira, KBazira)
Software Engineer (Machine Learning)

User Details

User Since
Aug 3 2019, 6:58 AM (332 w, 2 d)
Availability
Available
IRC Nick
kevinbazira
LDAP User
Kevin Bazira
MediaWiki User
KBazira (WMF)

Recent Activity

Yesterday

kevinbazira added a comment to P86608 Prototype serving Qwen3 embeddings with KServe using HF Transformers and ROCm-compatible FlashAttention-2.

Nice implementation!
Does last token pooling come from qwenlm?

Yes, I saw it here when I was testing the Qwen3-Embedding-4B example that shows how to use HF transformers for embeddings inference.
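For context, last-token pooling takes the hidden state of the final non-padding token as the sequence embedding. A minimal pure-Python sketch (list-based stand-in for the tensors in the Qwen3 example; the helper name is illustrative, not the paste's actual code):

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token per sequence.

    hidden_states: [batch][seq_len][dim] floats
    attention_mask: [batch][seq_len] of 0/1 (1 = real token)
    Assumes right-padding, the HF tokenizer default.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1  # index of the last real token
        pooled.append(states[last])
    return pooled
```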

Mon, Dec 15, 11:53 AM · Machine-Learning-Team
kevinbazira added a comment to T412338: Semantic Search - Embeddings Service for MVP.
  • Implementation: choose one of the following:
    • sentence embeddings.
    • vllm (Clarify with Kevin and Dawid)
    • investigate more options.
Mon, Dec 15, 8:30 AM · Machine-Learning-Team
kevinbazira added a comment to P86608 Prototype serving Qwen3 embeddings with KServe using HF Transformers and ROCm-compatible FlashAttention-2.

The prototype above can be run on ml-lab using the steps below:

Mon, Dec 15, 8:17 AM · Machine-Learning-Team
kevinbazira created P86608 Prototype serving Qwen3 embeddings with KServe using HF Transformers and ROCm-compatible FlashAttention-2.
Mon, Dec 15, 7:30 AM · Machine-Learning-Team

Thu, Dec 11

kevinbazira added a comment to P72019 Build & Test vllm wheel using Python bdist_wheel.

I resolved the issue in P72019#288792 by installing typing_extensions==4.15.0 in vllm_from_bdist_wheel_venv and adjusting the PYTHONPATH so that this venv's site-packages (with typing_extensions 4.15.0) appear before the ROCm PyTorch venv (with typing_extensions 4.9.0) in the search path:

export PYTHONPATH=/home/kevinbazira/test_aya/build_vllm/test_wheel/vllm_from_bdist_wheel_venv/lib/python3.11/site-packages:/srv/pytorch-rocm/venv/lib/python3.11/site-packages/
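Why the ordering fix works: Python imports the first match it finds on sys.path, so whichever site-packages directory comes first in PYTHONPATH wins. A self-contained sketch (a fake mypkg module standing in for the two typing_extensions copies; names are illustrative):

```python
import os
import sys
import tempfile

def make_pkg(root, version):
    """Write a fake single-file module declaring a __version__."""
    with open(os.path.join(root, "mypkg.py"), "w") as f:
        f.write(f"__version__ = {version!r}\n")

wheel_venv = tempfile.mkdtemp()  # stands in for vllm_from_bdist_wheel_venv
rocm_venv = tempfile.mkdtemp()   # stands in for the ROCm PyTorch venv
make_pkg(wheel_venv, "4.15.0")
make_pkg(rocm_venv, "4.9.0")

# Directories earlier on sys.path win: placing wheel_venv first mirrors
# putting its site-packages first in the PYTHONPATH export above.
sys.path.insert(0, rocm_venv)
sys.path.insert(0, wheel_venv)

import mypkg
print(mypkg.__version__)  # prints "4.15.0"
```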
Thu, Dec 11, 4:37 AM · Machine-Learning-Team
kevinbazira edited P72019 Build & Test vllm wheel using Python bdist_wheel.
Thu, Dec 11, 4:26 AM · Machine-Learning-Team

Wed, Dec 3

kevinbazira moved T410906: Update Aya LLM model-server to run on LiftWing GPUs from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Wed, Dec 3, 7:27 AM · Essential-Work, Machine-Learning-Team
kevinbazira closed T410906: Update Aya LLM model-server to run on LiftWing GPUs, a subtask of T403599: Setup & experiments for MI300x GPUs used for LiftWing, as Resolved.
Wed, Dec 3, 7:27 AM · Machine-Learning-Team
kevinbazira closed T410906: Update Aya LLM model-server to run on LiftWing GPUs as Resolved.
Wed, Dec 3, 7:27 AM · Essential-Work, Machine-Learning-Team

Tue, Dec 2

kevinbazira added a comment to T409388: Test liftwing wikidata revert risk API for scale and latency.

@prabhat, has the WME team had a chance to run scale and latency tests on the revertrisk-wikidata inference service? Does this service meet your performance requirements?

Tue, Dec 2, 6:20 AM · Wikimedia Enterprise (WME Kanban), Machine-Learning-Team

Mon, Dec 1

kevinbazira added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

In T410906#11415517, we successfully tested the llm model-server on LiftWing with MI300X GPU.

Mon, Dec 1, 5:26 AM · Machine-Learning-Team
kevinbazira added a subtask for T403599: Setup & experiments for MI300x GPUs used for LiftWing: T410906: Update Aya LLM model-server to run on LiftWing GPUs.
Mon, Dec 1, 5:15 AM · Machine-Learning-Team
kevinbazira added a parent task for T410906: Update Aya LLM model-server to run on LiftWing GPUs: T403599: Setup & experiments for MI300x GPUs used for LiftWing.
Mon, Dec 1, 5:15 AM · Essential-Work, Machine-Learning-Team

Fri, Nov 28

kevinbazira added a comment to T394778: Build and push images to the docker registry from ml-lab.

I would like to resume this discussion and take a practical stab at making ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:

  • wipe the machine and manage the basics with puppet

Thank you for picking this up, @DPogorzelski-WMF. If you proceed with the plan to wipe ml-lab1001, could you please move the contents of my (and/or other people's) home directory to ml-lab1002? Thanks in advance.

I will tar-gzip all home folders separately on 01 and copy them into the corresponding home folders on 02. Individual users can then untar and pick what they need.
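The proposed copy could look roughly like this with Python's tarfile module (paths and helper names are illustrative, not an agreed procedure):

```python
import os
import tarfile

def archive_home(home_dir, out_path):
    """Create a .tar.gz of home_dir, keeping the top-level folder name."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(home_dir, arcname=os.path.basename(home_dir))

def extract_home(archive_path, dest_dir):
    """Unpack an archive produced by archive_home into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```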

Fri, Nov 28, 1:37 PM · Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

Finally, as shown below, the llm model-server using an MI300X GPU in LiftWing production is able to serve the aya-expanse-8B model:

$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
aya-llm-predictor-00015-deployment-65b4577748-6wh2c          3/3     Running   0          2m11s
Fri, Nov 28, 12:00 PM · Essential-Work, Machine-Learning-Team

Thu, Nov 27

kevinbazira edited P85707 Build and test bitsandbytes wheel with multiple ROCm targets (gfx90a and gfx942).
Thu, Nov 27, 3:02 PM · Machine-Learning-Team
kevinbazira added a comment to P71677 Build & Test Flash Attention 2 wheel using Python bdist_wheel.

In T410906#11409323, we found that the flashattention2 wheel built above didn't support MI300X GPUs. In P85813, we built a wheel from source that supports both gfx90a and gfx942 ROCm targets.

Thu, Nov 27, 1:30 PM · Machine-Learning-Team
kevinbazira added a comment to T394778: Build and push images to the docker registry from ml-lab.

I would like to resume this discussion and take a practical stab at making ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:

  • wipe the machine and manage the basics with puppet
  • the machine will have docker installed
  • the machine will be enrolled into gitlab as a gitlab runner
  • the machine should be able to push images to the current WMF registry (we can revisit a proper registry solution once the build machine is ready; otherwise there are too many topics flying around)
  • SSH root access for ML SREs and non-root access for the ML team. This should be an exception, however: most of the time the builder can be used via plain GitLab pipelines, so SSH shouldn't be needed. We can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere, so that the ML team can experiment with build steps more freely (WMF needs to learn to trust the people it hires, and security needs to work in function of the teams/projects, not the other way around)

If the above is fine, I'm going to start looking at the first steps.
Feel free to comment or add interested parties to the discussion.

Thu, Nov 27, 9:45 AM · Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

In P85813, we built a flash-attention2 wheel that supports both gfx90a and gfx942 ROCm targets. Now the llm model-server no longer throws the error from T410906#11409323 during inference, but it returns only <PAD> tokens:

$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
aya-llm-predictor-00014-deployment-65ccd57d6d-pf79b          3/3     Running   0          2m23s
Thu, Nov 27, 9:16 AM · Essential-Work, Machine-Learning-Team
kevinbazira created P85813 Build and test flashattention2 wheel with multiple ROCm targets (gfx90a and gfx942).
Thu, Nov 27, 5:01 AM · Machine-Learning-Team

Wed, Nov 26

kevinbazira added a comment to P71986 Build & Test bitsandbytes wheel using Python bdist_wheel.

In T410906#11408603, we found that the bitsandbytes wheel built above didn't support MI300X GPUs. In P85707, we built a wheel from source that supports both gfx90a and gfx942 ROCm targets.

Wed, Nov 26, 1:10 PM · Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

In P85707, we built a bitsandbytes wheel that supports both gfx90a and gfx942 ROCm targets. Now the llm model-server starts without errors, but inference runs into a familiar invalid device function error caused by flash-attention2:

$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
aya-llm-predictor-00013-deployment-cb8c6b54f-52d4t           3/3     Running   0          95s
langid-predictor-default-00014-deployment-5894db899b-hhn2r   3/3     Running   0          11d
Wed, Nov 26, 1:08 PM · Essential-Work, Machine-Learning-Team
kevinbazira created P85707 Build and test bitsandbytes wheel with multiple ROCm targets (gfx90a and gfx942).
Wed, Nov 26, 11:13 AM · Machine-Learning-Team
kevinbazira edited P71986 Build & Test bitsandbytes wheel using Python bdist_wheel.
Wed, Nov 26, 10:28 AM · Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

The llm model-server is no longer throwing the OOM issue after using BITSANDBYTES_DTYPE="int4" and packages built from source, as we did in P85433#343273. However, although the built package runs on the ML-Lab MI200 GPU (gfx90a), as shown in T410906#11405817, it's not compatible with MI300X GPUs (gfx942):

+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
++ export OMP_NUM_THREADS=6
CPU count detected from get_cpu_count: 6
OMP_NUM_THREADS set to: 6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Wed, Nov 26, 9:39 AM · Essential-Work, Machine-Learning-Team
kevinbazira closed T348156: Goal: Increase the number of models hosted on Lift Wing as Resolved.
Wed, Nov 26, 4:19 AM · Goal, Machine-Learning-Team
kevinbazira closed T347387: Add JavaScript examples to LiftWing API gateway docs as Resolved.
Wed, Nov 26, 4:17 AM · Documentation, Machine-Learning-Team

Tue, Nov 25

kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

To reduce the VRAM usage that is causing the OOM issue in T410906#11404979, I am going to revert to BITSANDBYTES_DTYPE="int4", which had caused the error in T410906#11404187 but for which we have a solution in P85433#343273. Below, I have tested the solution on ML-Lab and will try a similar approach in LiftWing.

$ MODEL_NAME=aya-expanse-8B LLM_CLASS=llm.Aya MODEL_PATH="/home/kevinbazira/.cache/huggingface/hub/models--CohereForAI--aya-expanse-8b/snapshots/554c52e22d0f713bab9d3e360734d25cd15dda16/" BITSANDBYTES_DTYPE="int4" DEVICE=auto ATTN_IMPLEMENTATION="flash_attention_2" DTYPE="float16" python3 src/models/llm/model.py
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
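For illustration, the BITSANDBYTES_DTYPE setting in the command above could map to model-load kwargs along these lines (a hypothetical helper and mapping, not the actual model.py code):

```python
import os

def quantization_kwargs(dtype):
    """Map a BITSANDBYTES_DTYPE value to hypothetical model-load kwargs.

    "int4"/"int8" enable bitsandbytes quantization; None/""/"None" loads
    the model unquantized at the configured DTYPE precision.
    """
    if dtype == "int4":
        return {"load_in_4bit": True}
    if dtype == "int8":
        return {"load_in_8bit": True}
    if dtype in (None, "", "None"):
        return {}
    raise ValueError(f"unsupported BITSANDBYTES_DTYPE: {dtype!r}")

# A model-server would read the value from the environment, e.g.:
kwargs = quantization_kwargs(os.environ.get("BITSANDBYTES_DTYPE"))
```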
Tue, Nov 25, 3:09 PM · Essential-Work, Machine-Learning-Team
kevinbazira added a comment to T409388: Test liftwing wikidata revert risk API for scale and latency.

The revertrisk-wikidata inference service production endpoint uses similar scaling configs that other revertrisk inference-services use: https://github.com/wikimedia/operations-deployment-charts/blob/8412fc655d3b1e10b38cf0c954d910b820e93a05/helmfile.d/ml-services/revertrisk/values.yaml#L145-L150

Tue, Nov 25, 12:31 PM · Wikimedia Enterprise (WME Kanban), Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

It looks like torch 2.5.1+rocm6.1, which the llm model-server image currently uses, doesn't support expandable_segments:

kevinbazira@deploy2002:~$ kubectl logs aya-llm-predictor-00009-deployment-54ccf6ddc6-b5r9w
+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
CPU count detected from get_cpu_count: 6
OMP_NUM_THREADS set to: 6
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
++ export OMP_NUM_THREADS=6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
/opt/lib/venv/lib/python3.11/site-packages/accelerate/utils/modeling.py:841: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
  _ = torch.tensor([0], device=i)
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/srv/app/src/models/llm/model.py", line 145, in <module>
    model = llm_class(model_name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/srv/app/src/models/llm/aya/aya.py", line 12, in __init__
    super().__init__(model_name)
  File "/srv/app/src/models/llm/model.py", line 32, in __init__
    self.model, self.tokenizer = self.load()
                                 ^^^^^^^^^^^
  File "/srv/app/src/models/llm/aya/aya.py", line 21, in load
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4400, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
    caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5799, in caching_allocator_warmup
    _ = torch.empty(byte_count // factor, dtype=torch.float16, device=device, requires_grad=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.91 GiB. GPU 0 has a total capacity of 24.00 GiB of which 23.38 GiB is free. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
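The 16.91 GiB allocation is roughly what float16 weights for an ~8B-parameter model require, which is why 4-bit quantization is attractive here. Back-of-envelope arithmetic (parameter count approximate):

```python
# Rough VRAM math for aya-expanse-8B (≈8e9 parameters, approximate).
params = 8.0e9
GIB = 1024 ** 3

fp16_gib = params * 2 / GIB    # float16: 2 bytes per parameter
int4_gib = params * 0.5 / GIB  # 4-bit quantization: 0.5 bytes per parameter
gpu_gib = 24.0                 # GPU capacity reported in the traceback

print(round(fp16_gib, 1), round(int4_gib, 1))  # prints "14.9 3.7"
```

float16 weights alone nearly fill a 24 GiB device once the loader's warmup factor and activations are added, while int4 leaves ample headroom.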
Tue, Nov 25, 11:30 AM · Essential-Work, Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

The above error was fixed by setting BITSANDBYTES_DTYPE to None. Now we are running into the OOM issue shown below:

kevinbazira@deploy2002:~$ kubectl logs aya-llm-predictor-00008-deployment-9759d96d5-jf5r8
+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
++ export OMP_NUM_THREADS=6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
CPU count detected from get_cpu_count: 6
OMP_NUM_THREADS set to: 6
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/srv/app/src/models/llm/model.py", line 145, in <module>
    model = llm_class(model_name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/srv/app/src/models/llm/aya/aya.py", line 12, in __init__
    super().__init__(model_name)
  File "/srv/app/src/models/llm/model.py", line 32, in __init__
    self.model, self.tokenizer = self.load()
                                 ^^^^^^^^^^^
  File "/srv/app/src/models/llm/aya/aya.py", line 21, in load
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4400, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
    caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5799, in caching_allocator_warmup
    _ = torch.empty(byte_count // factor, dtype=torch.float16, device=device, requires_grad=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.91 GiB. GPU 0 has a total capacity of 24.00 GiB of which 23.38 GiB is free. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Tue, Nov 25, 9:01 AM · Essential-Work, Machine-Learning-Team
kevinbazira added a comment to T410906: Update Aya LLM model-server to run on LiftWing GPUs.

The first deployment shows the model-server in CrashLoopBackOff:

kevinbazira@deploy2002:~$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS      AGE
aya-llm-predictor-00007-deployment-84dd44b649-lh9vb          1/3     CrashLoopBackOff   4 (26s ago)   4m2s
Tue, Nov 25, 8:23 AM · Essential-Work, Machine-Learning-Team

Mon, Nov 24

kevinbazira added a project to T410906: Update Aya LLM model-server to run on LiftWing GPUs: Essential-Work.
Mon, Nov 24, 3:57 PM · Essential-Work, Machine-Learning-Team
kevinbazira moved T410906: Update Aya LLM model-server to run on LiftWing GPUs from Unsorted to In Progress on the Machine-Learning-Team board.
Mon, Nov 24, 3:50 PM · Essential-Work, Machine-Learning-Team
kevinbazira created T410906: Update Aya LLM model-server to run on LiftWing GPUs.
Mon, Nov 24, 3:50 PM · Essential-Work, Machine-Learning-Team
kevinbazira created P85522 Error when running Aya LLM model-server using image version: 2025-11-18-132643-publish.
Mon, Nov 24, 2:49 PM · Machine-Learning-Team
kevinbazira edited P85486 Steps to run the Aya LLM model-server on ML-Lab.
Mon, Nov 24, 10:52 AM · Machine-Learning-Team
kevinbazira edited P85486 Steps to run the Aya LLM model-server on ML-Lab.
Mon, Nov 24, 8:31 AM · Machine-Learning-Team
kevinbazira created P85486 Steps to run the Aya LLM model-server on ML-Lab.
Mon, Nov 24, 8:30 AM · Machine-Learning-Team
kevinbazira added a project to T409388: Test liftwing wikidata revert risk API for scale and latency: Machine-Learning-Team.
Mon, Nov 24, 5:37 AM · Wikimedia Enterprise (WME Kanban), Machine-Learning-Team
kevinbazira added a comment to T409388: Test liftwing wikidata revert risk API for scale and latency.

As we worked on T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing, we conducted locust load tests on the revertrisk-wikidata inference service staging endpoint. These tests ran for 120 seconds with 2 users, each sending requests at intervals between 1 and 5 seconds, using sample Wikidata revision IDs obtained from the Research team's expert_sample.csv.

Mon, Nov 24, 5:32 AM · Wikimedia Enterprise (WME Kanban), Machine-Learning-Team

Fri, Nov 21

kevinbazira added a comment to P85433 Error when testing bitsandbytes installed using bitsandbytes-1.0.0-py3-none-manylinux_2_24_x86_64.whl.

This error doesn't occur when I use the bitsandbytes wheel I built from source in P71986:

kevinbazira@ml-lab1002:~/test_aya$ source .venv/bin/activate
(.venv) kevinbazira@ml-lab1002:~/test_aya$ python3 -m bitsandbytes
g++ (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Fri, Nov 21, 12:32 PM · Machine-Learning-Team
kevinbazira created P85433 Error when testing bitsandbytes installed using bitsandbytes-1.0.0-py3-none-manylinux_2_24_x86_64.whl.
Fri, Nov 21, 12:25 PM · Machine-Learning-Team

Thu, Nov 20

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

The revertrisk-wikidata inference service is now live in LiftWing production. It can be accessed through:
1. External endpoint:

$ curl "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H "Content-Type: application/json" --http1.1

2. Internal endpoint:

$ curl "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H  "Host: revertrisk-wikidata.revertrisk.wikimedia.org" -H "Content-Type: application/json" --http1.1

3. Documentation:

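The external endpoint can also be called from Python. A sketch using only the standard library, building the request from the curl example above without sending it:

```python
import json
from urllib import request

# External-endpoint request, mirroring the curl example above.
url = ("https://api.wikimedia.org/service/lw/inference/v1/models/"
       "revertrisk-wikidata:predict")
body = json.dumps({"rev_id": 1945516043}).encode()
req = request.Request(url, data=body,
                      headers={"Content-Type": "application/json"},
                      method="POST")
# response = request.urlopen(req)  # uncomment to actually send the request
```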
Thu, Nov 20, 5:16 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

@Trokhymovych, here are resources to help you create a comprehensive model card for the revertrisk-wikidata model:

  1. Section to create model card: https://meta.wikimedia.org/wiki/Machine_learning_models#Create_a_model_card
  2. FAQs to answer in the model card: https://docs.google.com/document/d/1Q5aJGGBJB4LN3dXS8_-IjZYi0a3T1MIWtDXDwZeEins/edit
  3. Model card template: https://meta.wikimedia.org/wiki/Machine_learning_models/Model_card_template
Thu, Nov 20, 4:32 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Wed, Nov 19

kevinbazira created P85386 revertrisk-wikidata deployed in LiftWing prod.
Wed, Nov 19, 12:32 PM · Machine-Learning-Team
kevinbazira created P85371 rrwikidata load test results for latest model-server image (2025-11-17-105041-publish) deployed in LiftWing staging.
Wed, Nov 19, 4:51 AM · Machine-Learning-Team

Tue, Nov 18

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

@Trokhymovych, thank you for reviewing the revertrisk-wikidata model-server and sharing detailed feedback (that's super useful).

Tue, Nov 18, 6:46 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira created P85352 rrwikidata latest model-server image (2025-11-17-105041-publish) deployed in LiftWing staging.
Tue, Nov 18, 6:27 AM · Machine-Learning-Team

Mon, Nov 17

kevinbazira created P85340 metrics (ROC AUC) comparison of revertrisk-wikidata model-server predictions vs expert_sample.csv predictions .
Mon, Nov 17, 10:07 AM · Machine-Learning-Team

Nov 13 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

I have run locust load tests on the revertrisk-wikidata staging isvc for 120 seconds with 2 users, each sending requests at intervals between 1 and 5 seconds, using sample Wikidata revision IDs that were shared in the expert_sample.csv in T406179#11333762. Results show an average response time of 568 ms with a 0% failure rate over 66 requests.

$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
...
MODEL=revertrisk_wikidata my_locust_venv/bin/locust --headless --csv results/revertrisk_wikidata
[2025-11-13 04:53:43,557] stat1008/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-13 04:53:43,557] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-11-13 04:53:43,558] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-11-13 04:53:43,559] stat1008/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 2} (2 total users)
[2025-11-13 04:55:42,893] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-13 04:55:43,001] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/revertrisk-wikidata:predict                                            66     0(0.00%) |    568     375     886    550 |    0.56        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                        66     0(0.00%) |    568     375     886    550 |    0.56        0.00
Nov 13 2025, 6:02 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Nov 12 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

As we prepare to run load tests, the revertrisk-wikidata isvc has been deployed in LiftWing staging:

# pod running in revision-models ns staging
$ kube_env revision-models ml-staging-codfw
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-wikidata-predictor-00001-deployment-6fff6dbcbf-mxgmg   3/3     Running   0          77s
Nov 12 2025, 10:45 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

@Trokhymovych, following up on T406179#11353371, the revertrisk-wikidata model-server is now live in LiftWing's experimental namespace. Please test it by adjusting the rev_id in the curl command below and let us know whether it's returning correct predictions:

# ssh into WMF stat machine
$ ssh stat1008.eqiad.wmnet
Nov 12 2025, 5:26 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Nov 11 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

I have added unit tests for critical components of the model-server to make sure future changes do not break functionality. Here is the output when I build the test image and run the tests:

$ docker buildx build --target test -f .pipeline/revertrisk_wikidata/blubber.yaml --platform=linux/amd64 . -t rrw_unit_test
$ docker run --rm rrw_unit_test
...
Initialized empty Git repository in /srv/revertrisk_wikidata/.git/
ci-lint: install_deps> python -I -m pip install pre-commit
ci-lint: commands[0]> pre-commit run --all-files --show-diff-on-failure
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
ruff (legacy alias)......................................................Passed
ruff format..............................................................Passed
ci-lint: OK ✔ in 11.8 seconds
ci-unit: install_deps> python -I -m pip install -r /srv/revertrisk_wikidata/requirements-test.txt
ci-unit: commands[0]> pytest test/unit
============================= test session starts ==============================
platform linux -- Python 3.11.2, pytest-9.0.0, pluggy-1.6.0
cachedir: .tox/ci-unit/.pytest_cache
rootdir: /srv/revertrisk_wikidata
configfile: tox.ini
plugins: anyio-4.11.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 5 items
Nov 11 2025, 10:12 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Nov 7 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

The revertrisk-wikidata model-server has been deployed in the LiftWing experimental namespace. It is currently available through an internal endpoint that can only be accessed by tools running within the WMF infrastructure (e.g. deploy2002, stat1008):

# pod running in experimental ns
$ kube_env experimental ml-staging-codfw
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-wikidata-predictor-default-00019-deployment-557bkfdk   3/3     Running   0          96s
Nov 7 2025, 1:20 PM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

The revertrisk-wikidata model-server has been containerized and integrated into the CI/CD pipeline, which published it successfully to the Wikimedia Docker registry:

docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-revertrisk-wikidata:2025-11-07-042629-publish
Nov 7 2025, 5:02 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Nov 6 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

As we prepare to publish the revertrisk-wikidata model-server image to the Wikimedia Docker registry, here is a summary of the image layers:

$ docker history b601e2d84c63
IMAGE          CREATED         CREATED BY                                      SIZE      COMMENT
b601e2d84c63   2 minutes ago   [production] 📂 [common_settings.sh] -> comm…   1.36kB    buildkit.exporter.image.v0
<missing>      2 minutes ago   [production] 📂 [model_server_entrypoint.sh]…   303B      buildkit.exporter.image.v0
<missing>      2 minutes ago   [production] 📦 {build}[/opt/lib/venv/lib/py…   1.81GB    buildkit.exporter.image.v0
<missing>      4 minutes ago   [production] 📂 [python] -> python/             33.1kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   [production] 📂 [src/models/revertrisk_wikid…   30.7kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c (getent group "…   9.07kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c (getent group "…   8.88kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c apt-get update …   59.2MB    buildkit.exporter.image.v0
<missing>      11 days ago     /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      11 days ago     /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B        
<missing>      11 days ago     /bin/sh -c #(nop) ADD file:abeaf73dbbde23882…   74.8MB

The largest layer is ~1.81GB, which is well within the Wikimedia Docker registry's 4GB compressed layer size limit.
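The layer-size check above can be scripted. Below is a sketch that parses `docker history`-style sizes and compares them against the 4GB registry limit; note `docker history` reports uncompressed sizes, so this is a conservative sanity check (pushed layers are compressed and smaller):

```python
import re

# Multipliers for the human-readable sizes that `docker history` prints
# (docker uses decimal units: kB = 10^3, MB = 10^6, GB = 10^9).
UNITS = {"B": 1, "kB": 10**3, "KB": 10**3, "MB": 10**6, "GB": 10**9}

def parse_size(text: str) -> int:
    """Convert a docker-style size string like '1.81GB' to bytes."""
    match = re.fullmatch(r"([\d.]+)\s*([A-Za-z]+)", text.strip())
    if not match:
        raise ValueError(f"unparsable size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * UNITS[unit])

def largest_layer_ok(sizes: list[str], limit: str = "4GB") -> bool:
    """True if every layer is within the registry's layer size limit."""
    return max(parse_size(s) for s in sizes) <= parse_size(limit)

# Layer sizes taken from the `docker history b601e2d84c63` output above.
layers = ["1.36kB", "303B", "1.81GB", "33.1kB", "30.7kB",
          "9.07kB", "8.88kB", "59.2MB", "0B", "0B", "74.8MB"]
print(largest_layer_ok(layers))
```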

Nov 6 2025, 5:22 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Nov 4 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

Thanks for the review @Trokhymovych. I have updated the model-server to remove the reliance on the static full_labels_2024-04_text_en.csv file.

Nov 4 2025, 8:39 PM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

@Trokhymovych thank you for sharing the full model and detailed instructions. The model has been uploaded to swift:

$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/revertrisk/wikidata/20251104121312/
                    DIR  s3://wmf-ml-models/revertrisk/wikidata/20251104121312/data/
2025-11-04 06:43   631M  s3://wmf-ml-models/revertrisk/wikidata/20251104121312/wikidata_revertrisk_graph2text_v2.pkl

and is also publicly accessible via the analytics portal: https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/wikidata/20251104121312/

Nov 4 2025, 10:03 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Oct 30 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

What would be the extra work required to deploy the full model?

Hi @Miriam, the plan is to deploy the full model. When the Research Team provides the full model, we shall build a model-server and host it on LiftWing.

Oct 30 2025, 10:12 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Oct 29 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

The revertrisk-wikidata-metadata model-server prototype has been updated in P84312 based on the discussion we had regarding feature processing. We have fixed the following:

1. user_is_bot: expected as a string instead of an integer ("0" or "1")
2. event_user_groups: expected as a string instead of a float ("0.0" or "1.0")
3. user_age and page_age: should be expressed in years instead of seconds
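The three fixes amount to a small feature pre-processing step. A minimal sketch (the function and raw field names here are hypothetical; only the type and unit conventions come from the list above):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def preprocess_features(raw: dict) -> dict:
    """Coerce raw revision features into the types the model expects."""
    return {
        # 1. user_is_bot: string "0"/"1" instead of an integer
        "user_is_bot": str(int(raw["user_is_bot"])),
        # 2. event_user_groups: string "0.0"/"1.0" instead of a float
        "event_user_groups": str(float(raw["event_user_groups"])),
        # 3. ages expressed in years instead of seconds
        "user_age": raw["user_age_seconds"] / SECONDS_PER_YEAR,
        "page_age": raw["page_age_seconds"] / SECONDS_PER_YEAR,
    }
```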

Oct 29 2025, 2:05 PM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira edited P84312 revertrisk-wikidata-metadata prototype.
Oct 29 2025, 12:14 PM · Machine-Learning-Team

Oct 28 2025

kevinbazira added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

Thank you for sharing this information, @Trokhymovych.

Oct 28 2025, 2:10 PM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira created P84312 revertrisk-wikidata-metadata prototype.
Oct 28 2025, 2:07 PM · Machine-Learning-Team

Oct 24 2025

kevinbazira added a comment to T398970: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model.
  • Ran tone-check training job locally with model-ready training data to determine memory usage, as there was an 8Gi memory limit in WMF Airflow that caused the job to fail. (T407212#11280133)
  • Tested GPU node labels that were set up by SRE. (T373806#11275873)
  • While using the GPU node selector, the tone-check model training task completed in 8 hrs running in staging on an MI210 GPU with 64G VRAM (P83966)
  • Pushed an MR to prod that has the single DAG for the tone-check training pipeline.
  • Copied tone-check base model from HDFS to PVC in prod. (MR)
  • Followed up to confirm that the triggerer process enabled by DPE SRE works as expected in both airflow-ml and airflow-devenv (T406958#11288441)
  • The tone-check training DAG ran and completed end-to-end in prod (T407212#11288359)

Tone-Check model training DAG succeeded in airflow-ml (Screenshot from 2025-10-20 07-36-31).png (1×1 px, 366 KB)

Oct 24 2025, 5:37 AM · Goal, Machine-Learning-Team

Oct 23 2025

kevinbazira updated subscribers of T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

Hi @Trokhymovych, just following up on our discussion from yesterday. When you have a moment, please share the demo code that shows model inference (input/output) using the Wikidata revert-risk model and a link to the model binary. Thanks!

Oct 23 2025, 7:14 AM · OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team
kevinbazira added a comment to P84087 DAG task failed when using in_cluster=True in deferrable operator.

Fixed by DPE SRE here.

Oct 23 2025, 5:59 AM · Machine-Learning-Team
kevinbazira added a comment to P84073 DAG task fails to resume after deferred pod completion.

Fixed as shown here.

Oct 23 2025, 5:57 AM · Machine-Learning-Team
kevinbazira added a comment to P83875 train_tone_check task failed because of OOM issue in airflow-devenv.

This OOM issue was resolved as shown here.

Oct 23 2025, 5:52 AM · Machine-Learning-Team
kevinbazira added a comment to P83880 train_tone_check task failed because it exceeded 8Gi memory limit.

This was fixed as shown here.

Oct 23 2025, 5:51 AM · Machine-Learning-Team
kevinbazira moved T406302: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Oct 23 2025, 5:40 AM · Essential-Work, Machine-Learning-Team
kevinbazira closed T406302: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator, a subtask of T396495: Build model training pipeline for tone check using WMF ML Airflow instance, as Resolved.
Oct 23 2025, 5:39 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Editing-team (Tracking), Machine-Learning-Team
kevinbazira closed T406302: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator as Resolved.

Since using the TriggerDagRunOperator for cross-DAG orchestration requires one to always check that all DAGs are unpaused, the ML team decided to merge the tone-check pipeline DAGs into a single DAG for simplified orchestration (T407212)

Oct 23 2025, 5:39 AM · Essential-Work, Machine-Learning-Team

Oct 22 2025

kevinbazira updated subscribers of T396495: Build model training pipeline for tone check using WMF ML Airflow instance.

In T407212#11288359, we finally have a tone-check model training pipeline that runs end-to-end in the airflow-ml production instance.

Oct 22 2025, 5:51 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Editing-team (Tracking), Machine-Learning-Team

Oct 20 2025

kevinbazira added a comment to T406958: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv.

Feel free to destroy/redeploy your whole devenv in about ~5 minutes, and check whether things work OOTB now!

Oct 20 2025, 9:25 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Machine-Learning-Team
kevinbazira added a comment to T406958: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv.

Super! The DAG task with a deferrable operator has now succeeded in the airflow-devenv.

Screenshot from 2025-10-20 11-52-47.png (837×1 px, 270 KB)

Oct 20 2025, 8:54 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Machine-Learning-Team
kevinbazira added a comment to T406958: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv.

Can you try to pass the in_cluster=True argument to the WMFKubernetesPodOperator that also has deferrable=True ?

Oct 20 2025, 7:49 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Machine-Learning-Team
kevinbazira created P84087 DAG task failed when using in_cluster=True in deferrable operator.
Oct 20 2025, 7:44 AM · Machine-Learning-Team
kevinbazira added a comment to T406958: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv.

@brouberol, thank you for enabling the triggerer process in Airflow. I tested it using this DAG, and the WMFKubernetesPodOperator deferred the training task, launched a GPU-enabled pod, and the pod ran and completed successfully.
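For context, a deferrable operator hands the wait over to an async "trigger" run by the triggerer process, so no worker slot is held while the pod runs. A conceptual sketch of that polling loop (not Airflow's actual implementation; names are illustrative):

```python
import asyncio

async def wait_for_pod(get_phase, poll_interval=0.01):
    """Poll the pod phase until it reaches a terminal state."""
    while True:
        phase = get_phase()
        if phase in ("Succeeded", "Failed"):
            return phase
        # Yield control instead of blocking a worker slot.
        await asyncio.sleep(poll_interval)

def defer_until_done(get_phase):
    """Run the trigger to completion, as the triggerer process would."""
    return asyncio.run(wait_for_pod(get_phase))
```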

Oct 20 2025, 6:17 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Machine-Learning-Team
kevinbazira updated the title for P84073 DAG task fails to resume after deferred pod completion from DAG task fails to resume after deffered pod completion to DAG task fails to resume after deferred pod completion.
Oct 20 2025, 5:55 AM · Machine-Learning-Team
kevinbazira created P84073 DAG task fails to resume after deferred pod completion.
Oct 20 2025, 5:54 AM · Machine-Learning-Team
kevinbazira moved T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration from In Progress to 2025-2026 Q1 Done on the Machine-Learning-Team board.
Oct 20 2025, 4:48 AM · Essential-Work, Machine-Learning-Team
kevinbazira closed T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration, a subtask of T406302: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator, as Resolved.
Oct 20 2025, 4:47 AM · Essential-Work, Machine-Learning-Team
kevinbazira closed T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration as Resolved.
Oct 20 2025, 4:47 AM · Essential-Work, Machine-Learning-Team
kevinbazira added a comment to T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration.

With the configurations described in T407212#11280133, the tone_check_training_dag ran end-to-end in production just like it did in staging:

Tone-Check model training DAG succeeded in airflow-ml (Screenshot from 2025-10-20 07-36-31).png (1×1 px, 366 KB)

Oct 20 2025, 4:46 AM · Essential-Work, Machine-Learning-Team

Oct 16 2025

kevinbazira added a comment to T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration.

Following T407212#11275074, I ran the tone-check training job locally with model-ready training data to determine memory usage. Neither 8Gi nor 16Gi was enough, as the job required over 18Gi. With a 20Gi memory limit, the job ran on CPU without OOM issues, although it took a very long time to complete.
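One way to obtain a peak-memory figure like the 18Gi above from a local run is to read the process's max RSS after the job finishes. A minimal sketch using only the standard library (Unix-only; `resource` is unavailable on Windows):

```python
import resource
import sys

def peak_rss_gib() -> float:
    """Peak resident set size of this process, in GiB.

    Note: on Linux ru_maxrss is reported in KiB, on macOS in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss /= 1024  # bytes -> KiB
    return rss / (1024 ** 2)  # KiB -> GiB

# Example: allocate ~100 MiB and observe the peak grow accordingly.
buf = bytearray(100 * 1024 * 1024)
print(f"peak RSS: {peak_rss_gib():.2f} GiB")
```

Calling this at the end of the training script (or wrapping the script and printing the value on exit) gives the number to size `container_resources.limits.memory` against.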

Oct 16 2025, 7:38 AM · Essential-Work, Machine-Learning-Team
kevinbazira created P83966 tone-check model training DAG task completed in 8 hrs running on MI210 GPU with 64G VRAM.
Oct 16 2025, 5:42 AM · Machine-Learning-Team

Oct 15 2025

kevinbazira added a comment to T373806: Investigate Label functionality of AMD GPU device plugin on k8s.

@elukey, thank you for working on the GPU node labellers. Following our IRC conversation, I tested the node selector functionality in Airflow using the WMFKubernetesPodOperator and here are the results:
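For reference, selecting a GPU node combines a nodeSelector on the labeller-provided node labels with an AMD GPU resource limit in the pod spec. A rough fragment (the label key and value below are assumptions for illustration, not the actual labels set by the WMF labellers):

```yaml
# Illustrative pod spec fragment, not the exact WMF configuration.
spec:
  nodeSelector:
    amd.com/gpu.family: AI        # hypothetical labeller-provided label
  containers:
    - name: train
      resources:
        limits:
          amd.com/gpu: 1          # request one AMD GPU via the device plugin
```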

Oct 15 2025, 10:05 AM · Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Machine-Learning-Team
kevinbazira added a comment to T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration.

To fix the OOM issue reported in T407212#11271755, I increased container_resources.limits.memory from 8Gi to 16Gi in the train_tone_check task. However, the task running on CPU still failed with the error below:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"train-tone-check-e7hk664\" is forbidden: [maximum memory usage per Container is 8Gi, but limit is 16Gi, maximum memory usage per Pod is 10Gi, but limit is 17179869184]","reason":"Forbidden","details":{"name":"train-tone-check-e7hk664","kind":"pods"},"code":403}

Full logs can be found here. This indicates that we requested 16Gi (17179869184 bytes) for the container, but the cluster only allows up to 8Gi per container and 10Gi per pod.
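The 403 above is enforced by a Kubernetes LimitRange in the namespace. A policy that would produce exactly this error looks roughly like the fragment below (the object name and namespace are hypothetical; the memory values come from the error message):

```yaml
# Illustrative LimitRange matching the limits quoted in the error message.
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-limits        # hypothetical name
  namespace: airflow-dev     # hypothetical namespace
spec:
  limits:
    - type: Container
      max:
        memory: 8Gi          # "maximum memory usage per Container is 8Gi"
    - type: Pod
      max:
        memory: 10Gi         # "maximum memory usage per Pod is 10Gi"
```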

Oct 15 2025, 4:52 AM · Essential-Work, Machine-Learning-Team
kevinbazira created P83880 train_tone_check task failed because it exceeded 8Gi memory limit.
Oct 15 2025, 4:49 AM · Machine-Learning-Team

Oct 14 2025

kevinbazira added a comment to T398970: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model.
  • Our DAGs were granted permission by DPE SRE to spin up pods in the airflow-ml instance (T406302#11250084)
  • Model training task fails because it requests a GPU, but all available GPUs are in use by other jobs (T406302#11252998)
  • Enabled deferrable execution for the training operator to wait for GPU resources gracefully and discovered that the Airflow triggerer process was not running (T406302#11258029)
  • Following SRE advice, we manually started the triggerer by execing into the scheduler pod, but the DAG task failed because of a kube-config issue (P83720)
  • Requested SRE to enable and properly configure the Airflow triggerer process to support deferrable operators (T406958)
    • Also had a follow-up discussion on Slack in case there were other solutions to handle this issue
    • A fix is in progress
  • Started merging the tone-check DAGs into a single tone_check_training_dag for simplified orchestration (T407212)
    • Tested the tone_check_training_dag in airflow-devenv and the data-generation and data-copy tasks ran successfully (T407212#11271755)
    • The train_tone_check task failed with an OOM issue (P83875)
Oct 14 2025, 2:06 PM · Goal, Machine-Learning-Team
kevinbazira added a comment to T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration.

I have run the tone_check_training_dag in staging, and the following tasks succeeded: generate_training_data, split_training_data, and copy_hdfs_to_pvc. The train_tone_check task running on CPU failed with an OOM issue shown in the logs below:

1tone-check-training-dag-train-tone-check-vyuozrsz
2 ▶ Log message source details
3[2025-10-14, 09:51:09 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
4[2025-10-14, 09:51:09 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
5[2025-10-14, 09:51:09 UTC] {pod.py:1276} INFO - Building pod train-tone-check-xv60x59 with labels: {'dag_id': 'tone_check_training_dag', 'task_id': 'train_tone_check', 'run_id': 'manual__2025-10-14T060736.9821150000-fbf3b9f8e', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
6[2025-10-14, 09:51:09 UTC] {pod.py:573} INFO - Found matching pod train-tone-check-xv60x59 with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'tone_check_training_dag', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-10-14T060736.9821150000-fbf3b9f8e', 'task_id': 'train_tone_check', 'try_number': '1'}
7[2025-10-14, 09:51:09 UTC] {pod.py:574} INFO - `try_number` of task_instance: 1
8[2025-10-14, 09:51:09 UTC] {pod.py:575} INFO - `try_number` of pod: 1
9[2025-10-14, 09:51:09 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-tone-check-xv60x59 to dse-k8s-worker1009.eqiad.wmnet from None
10[2025-10-14, 09:51:09 UTC] {pod_manager.py:410} ▶ Waiting until 120s to get the POD scheduled...
11[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] Ensuring output model directory exists
12[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] + echo 'Ensuring output model directory exists'
13[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] + mkdir -p /mnt/model-training/tone_check/20251014T060736/output_model
14[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] Verifying input data exists on PVC
15[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] + echo 'Verifying input data exists on PVC'
16[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] + ls -lR /mnt/model-training/tone_check/20251014T060736
17[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] /mnt/model-training/tone_check/20251014T060736:
18[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] total 20
19[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Oct 14 09:50 full_model_ready_data.parquet
20[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] drwxr-sr-x 2 somebody runuser 4096 Oct 14 09:51 output_model
21[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Oct 14 09:50 test_data.parquet
22[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Oct 14 09:50 train_data.parquet
23[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Oct 14 09:50 validation_data.parquet
24[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] /mnt/model-training/tone_check/20251014T060736/full_model_ready_data.parquet:
25[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] total 835360
26[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 0 Oct 14 09:50 _SUCCESS
27[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 50930491 Oct 14 09:50 part-00000-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
28[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 50389962 Oct 14 09:50 part-00001-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
29[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 52493037 Oct 14 09:50 part-00002-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
30[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 54446932 Oct 14 09:50 part-00003-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
31[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 56800742 Oct 14 09:50 part-00004-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
32[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 58944135 Oct 14 09:50 part-00005-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
33[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 57057906 Oct 14 09:50 part-00006-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
34[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 57271362 Oct 14 09:50 part-00007-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
35[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 56664915 Oct 14 09:50 part-00008-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
36[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 55237851 Oct 14 09:50 part-00009-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
37[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 53890308 Oct 14 09:50 part-00010-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
38[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 52640163 Oct 14 09:50 part-00011-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
39[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 51851038 Oct 14 09:50 part-00012-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
40[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 49537542 Oct 14 09:50 part-00013-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
41[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 48025610 Oct 14 09:50 part-00014-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
42[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 49192284 Oct 14 09:50 part-00015-5838adb4-2e7c-43dc-859a-7e404686a61c-c000.snappy.parquet
43[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] /mnt/model-training/tone_check/20251014T060736/output_model:
44[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] total 0
45[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] /mnt/model-training/tone_check/20251014T060736/test_data.parquet:
46[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] total 83804
47[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 0 Oct 14 09:50 _SUCCESS
48[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 42567160 Oct 14 09:50 part-00000-c4bbe68a-6ac4-4c4a-b1d7-bb94d8dd26f2-c000.snappy.parquet
49[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 43245337 Oct 14 09:50 part-00001-c4bbe68a-6ac4-4c4a-b1d7-bb94d8dd26f2-c000.snappy.parquet
50[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] /mnt/model-training/tone_check/20251014T060736/train_data.parquet:
51[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] total 668588
52[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 0 Oct 14 09:50 _SUCCESS
53[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 86594416 Oct 14 09:50 part-00000-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
54[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 85139077 Oct 14 09:50 part-00001-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
55[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 85994779 Oct 14 09:50 part-00002-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
56[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 84652071 Oct 14 09:50 part-00003-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
57[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 86368470 Oct 14 09:50 part-00004-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
58[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 85018120 Oct 14 09:50 part-00005-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
59[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 85316168 Oct 14 09:50 part-00006-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
60[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 85531993 Oct 14 09:50 part-00007-7574e267-50dd-4353-93e8-dd8705ce041b-c000.snappy.parquet
61[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] /mnt/model-training/tone_check/20251014T060736/validation_data.parquet:
62[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] total 83148
63[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 0 Oct 14 09:50 _SUCCESS
64[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 42156457 Oct 14 09:50 part-00000-9884768c-398a-47eb-9889-2d043f37edc1-c000.snappy.parquet
65[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] -rw-rw---- 1 runuser runuser 42979476 Oct 14 09:50 part-00001-9884768c-398a-47eb-9889-2d043f37edc1-c000.snappy.parquet
66[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] + echo 'Starting model training'
67[2025-10-14, 09:51:24 UTC] {pod_manager.py:536} INFO - [base] Starting model training
68[2025-10-14, 09:51:25 UTC] {pod_manager.py:536} INFO - [base] + python3 training/tone_check/retrain/retrain.py --train-data-path '/mnt/model-training/tone_check/20251014T060736/train_data.parquet/*.parquet' --validation-data-path '/mnt/model-training/tone_check/20251014T060736/validation_data.parquet/*.parquet' --output-model-path /mnt/model-training/tone_check/20251014T060736/output_model --base-model-path /mnt/model-training/training/tone_check/base_model
69[2025-10-14, 09:51:25 UTC] {pod_manager.py:536} INFO - [base] INFO:root:Tone-Check Retraining
70[2025-10-14, 09:51:25 UTC] {pod_manager.py:536} INFO - [base] INFO:root:DEVICE: cpu
71[2025-10-14, 09:51:25 UTC] {pod_manager.py:536} INFO - [base] INFO:root:Parsed arguments: {'train_data_path': '/mnt/model-training/tone_check/20251014T060736/train_data.parquet/*.parquet', 'validation_data_path': '/mnt/model-training/tone_check/20251014T060736/validation_data.parquet/*.parquet', 'output_model_path': '/mnt/model-training/tone_check/20251014T060736/output_model', 'base_model_path': '/mnt/model-training/training/tone_check/base_model', 'num_label': 2, 'max_len': 512, 'learning_rate': 2e-06, 'weight_decay': 0.01, 'batch_size': 16, 'num_epoch': 5, 'metric_for_best_model': 'eval_loss', 'eval_strategy': 'epoch', 'save_strategy': 'epoch'}
72[2025-10-14, 09:51:25 UTC] {pod_manager.py:536} INFO - [base] INFO:root:Loading training data from: /mnt/model-training/tone_check/20251014T060736/train_data.parquet/*.parquet
73[2025-10-14, 09:51:54 UTC] {pod_manager.py:536} INFO - [base] INFO:root:Loading validation data from: /mnt/model-training/tone_check/20251014T060736/validation_data.parquet/*.parquet
74[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]
75Generating train split: 0 examples [00:00, ? examples/s]
76Generating train split: 21206 examples [00:00, 41569.39 examples/s]
77Generating train split: 42412 examples [00:01, 42316.46 examples/s]
78Generating train split: 63618 examples [00:01, 42860.46 examples/s]
79Generating train split: 84823 examples [00:01, 43479.90 examples/s]
80Generating train split: 106027 examples [00:02, 43158.92 examples/s]
81Generating train split: 127230 examples [00:02, 43407.28 examples/s]
82Generating train split: 148438 examples [00:03, 43406.46 examples/s]
83Generating train split: 169645 examples [00:03, 42951.37 examples/s]
84Generating train split: 169645 examples [00:03, 43010.25 examples/s]
85[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]
86Generating validation split: 0 examples [00:00, ? examples/s]
87Generating validation split: 10589 examples [00:00, 42558.20 examples/s]
88Generating validation split: 21174 examples [00:00, 41008.84 examples/s]
89Generating validation split: 21174 examples [00:00, 41146.82 examples/s]
90[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base] INFO:root:Sample from training set:
91[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base] {'input': 'en[SEP]Keyence[SEP]() is a company which produces sensors, barcode readers, vision systems and measuring equipment.\n\n\n\nKeyence Corporation is a global company with a network of 16 international organizations that specializes in factory automation. Founded in Japan in 1974, Keyence Corporation now earns over 2 billion dollars in yearly sales and employs nearly 3000 employees worldwide. \n\n \n\n\nKeyence is a direct sales company; every Keyence salesperson enrolls in a 6 month training program to learn the products and practice solution-based sales. Salespeople do not manage distributors. Instead, they go directly to customers with a demonstration case specifically built to solve applications on-the-spot.\n\nKeyence\'s sensors, vision systems, and high definition microscopes are part of the manufacturing and research processes in a variety of industries, including the electronic, semiconductor, automotive, food and packaging, biotechnology, and pharmaceutical industries. Keyence\'s customers include companies ranging from the largest Fortune 500 manufacturers to niche suppliers whose goal is to improve the quality and efficiency of their automated manufacturing.\n\n\t\nKeyence was named one of BusinessWeeks “1000 Best Valued Companies.” \nKeyence Japan is consistently listed in the Nihon Keizai Shimbun\'s yearly ranking of the "Top Ten Most Excellent Companies in Japan." Keyence is known as one of the best "pay" companies in Japan. The average annual wage for all full-time employees (average age: 31.9 years old) in FY2006 was JPY13,860,000 (US$117,348 as of March 2007).\nA 350-million-year-old ammonite fossil is displayed at the entrance of the Japanese headquarters; other fossils of long-dead creatures align the corridors and meeting rooms. 
Relics are supposed to convey a tacit message to employees: keep aiming high or you\'ll become a fossil.\nTakemitsu Takizaki founded Keyence Corporation in 1974 under the original name of "Lead Electric." \nTakizaki is listed as the 428th richest person in the world in 2008 by Forbes with a net worth of US$2.7 billion.\nKeyence is fabless (fabrication-less): Although Keyence is a manufacturer; it specializes solely in product planning and development and does not manufacture the final products. Keyence products are manufactured at qualified contract manufacturing companies.\nStephen Way, Senior Vice-President and Global Portfolio Manager at AGF Funds Inc.: "Keyence has a proven ability to deliver innovative products that customers want and this is driving strong pricing and profitability." \n The Financial Times: “Keyence means little to most people; to engineers, however, they mean a great deal.”\n\n\n\nKeyence manufactures a broad range of products, from photoelectric and proximity sensors to measuring instruments for inspection lines to high precision microscopy devices used in research institutes. These products are used by more than 80,000 customers globally. \n\nProducts are shipped from Keyence\'s stocking network centers in Japan, U.S. (Chicago), the U.K., Germany, France, Thailand, Malaysia, Singapore and South Korea or from 148 agents in 31 countries on the same day of receipt of an order. All products in the catalog are normally in stock.\n\n \n\nKeyence\'s customers in a variety of industries and manufacturing environments use their sensor products to detect the presence or absence of an entire part or just a particular feature of that part. Measurement products are used to determine the size or magnitude of a particular part or feature with great accuracy. As more and more factories seek to remain competitive by automating their processes, the market for the sensors and measurement products is huge and growing. 
New product releases consistently account for 30% of Keyence\'s annual sales.\nFiber Optic Sensors (FS Series)\nPhotoelectric Sensors (PX Series)\nLaser Displacement Sensor (LK Series)\nSafety Light Curtain (SL Series)\n\n \n\nVision system products are camera systems used on production lines to differentiate and measure multiple product features. Keyence\'s customers use their camera systems to perform quality control inspections that are too complicated for ordinary sensors. Their laser marking instruments use a high intensity laser to permanently and accurately mark shapes or characters onto surfaces such as metals or plastics at high speeds.\n\nMachine Vision (CV Series)\nLaser Marker (ML Series)\nLaser Marker (MD Series)\n\n\n\nMicroscopes are the only products offered by Keyence America for use away from a production line. While many of the customers for their microscopes are manufacturers, these microscopes are more typically used for research and development or failure analysis applications. Keyence\'s digital microscopes are capable of displaying a 3D image of the target. The image can also be manipulated or used to make a measurement of the target feature being viewed. Their color laser scanning microscope offers high accuracy with the use of a violet laser. This laser microscope approaches the accuracy and resolution of an SEM microscope at a lower cost and without destroying the target.\n\nDigital Microscope (VHX Series)\nLaser Microscope (VK Series)\n\n\n\n\n\n Keyence Global Home \n Keyence Corporation \n\n\n\n\n\n\n\nde:Keyence\nfr:Keyence\nja:キーエンス', 'label': 1}
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base] INFO:root:tokenizer loaded
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base] DatasetDict({
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]     train: Dataset({
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]         features: ['input', 'label'],
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]         num_rows: 169645
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]     })
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]     test: Dataset({
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]         features: ['input', 'label'],
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]         num_rows: 21174
[2025-10-14, 09:52:18 UTC] {pod_manager.py:536} INFO - [base]     })
[2025-10-14, 09:54:36 UTC] {pod_manager.py:536} INFO - [base] })
[2025-10-14, 09:54:52 UTC] {pod_manager.py:536} INFO - [base]
Map: 0%| | 0/169645 [00:00<?, ? examples/s]
Map: 100%|██████████| 169645/169645 [02:17<00:00, 1237.48 examples/s]
[2025-10-14, 09:54:53 UTC] {pod_manager.py:536} INFO - [base]
Map: 0%| | 0/21174 [00:00<?, ? examples/s]
Map: 100%|██████████| 21174/21174 [00:16<00:00, 1252.57 examples/s]
[2025-10-14, 09:54:53 UTC] {pod_manager.py:536} INFO - [base] INFO:root:model is loaded
[2025-10-14, 09:55:16 UTC] {pod_manager.py:536} INFO - [base] INFO:root:Start training
[2025-10-14, 09:55:17 UTC] {pod_manager.py:555} INFO - [base]
 0%| | 0/53015 [00:00<?, ?it/s]bash: line 9: 9 Killed python3 training/tone_check/retrain/retrain.py --train-data-path "/mnt/model-training/tone_check/20251014T060736/train_data.parquet/*.parquet" --validation-data-path "/mnt/model-training/tone_check/20251014T060736/validation_data.parquet/*.parquet" --output-model-path "/mnt/model-training/tone_check/20251014T060736/output_model" --base-model-path "/mnt/model-training/training/tone_check/base_model"
[2025-10-14, 09:55:17 UTC] {pod_manager.py:714} INFO - Pod train-tone-check-xv60x59 has phase Running
[2025-10-14, 09:55:19 UTC] {pod_manager.py:714} INFO - Pod train-tone-check-xv60x59 has phase Running
[2025-10-14, 09:55:21 UTC] {pod.py:1122} INFO - Deleting pod: train-tone-check-xv60x59
[2025-10-14, 09:55:21 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
    return self.execute_sync(context)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
    self.cleanup(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod train-tone-check-xv60x59 returned a failure.
remote_pod: {'api_version': 'v1',
 'kind': 'Pod',
 'metadata': {'annotations': {'cni.projectcalico.org/containerID': 'e9dfd10428317ca7fe7a76778b5a48135364b4576cf402d5c43054bde7fbbc05',
   'cni.projectcalico.org/podIP': '',
   'cni.projectcalico.org/podIPs': '',
   'container.seccomp.security.alpha.kubernetes.io/base': 'runtime/default'},
  'creation_timestamp': datetime.datetime(2025, 10, 14, 9, 51, 9, tzinfo=tzlocal()),
  'deletion_grace_period_seconds': None,
  'deletion_timestamp': None,
  'finalizers': None,
  'generate_name': None,
  'generation': None,
  'labels': {'airflow_kpo_in_cluster': 'True',
   'airflow_version': '2.10.5',
   'app': 'airflow',
   'component': 'task-pod',
   'dag_id': 'tone_check_training_dag',
   'kubernetes_pod_operator': 'True',
   'release': 'dev-kevinbazira',
   'routed_via': 'dev-kevinbazira',
   'run_id': 'manual__2025-10-14T060736.9821150000-fbf3b9f8e',
   'task_id': 'train_tone_check',
   'try_number': '1'},
  'managed_fields': [{'api_version': 'v1',
    'fields_type': 'FieldsV1',
    'fields_v1': {'f:metadata': {'f:labels': {'.': {},
       'f:airflow_kpo_in_cluster': {},
       'f:airflow_version': {},
       'f:app': {},
       'f:component': {},
       'f:dag_id': {},
       'f:kubernetes_pod_operator': {},
       'f:release': {},
       'f:routed_via': {},
       'f:run_id': {},
       'f:task_id': {},
       'f:try_number': {}}},
     'f:spec': {'f:affinity': {'.': {},
       'f:nodeAffinity': {'.': {},
        'f:requiredDuringSchedulingIgnoredDuringExecution': {}}},
      'f:containers': {'k:{"name":"base"}': {'.': {},
        'f:args': {},
        'f:command': {},
        'f:env': {'.': {},
         'k:{"name":"AWS_REQUEST_CHECKSUM_CALCULATION"}': {'.': {},
          'f:name': {},
          'f:value': {}},
         'k:{"name":"AWS_RESPONSE_CHECKSUM_VALIDATION"}': {'.': {},
          'f:name': {},
          'f:value': {}},
         'k:{"name":"REQUESTS_CA_BUNDLE"}': {'.': {},
          'f:name': {},
          'f:value': {}}},
        'f:image': {},
        'f:imagePullPolicy': {},
        'f:name': {},
        'f:resources': {'.': {},
         'f:limits': {'.': {},
          'f:cpu': {},
          'f:memory': {}},
         'f:requests': {'.': {},
          'f:cpu': {},
          'f:memory': {}}},
        'f:securityContext': {'.': {},
         'f:allowPrivilegeEscalation': {},
         'f:capabilities': {'.': {},
          'f:drop': {}},
         'f:runAsNonRoot': {},
         'f:seccompProfile': {'.': {},
          'f:type': {}}},
        'f:terminationMessagePath': {},
        'f:terminationMessagePolicy': {},
        'f:volumeMounts': {'.': {},
         'k:{"mountPath":"/mnt/model-training"}': {'.': {},
          'f:mountPath': {},
          'f:name': {}}}}},
      'f:dnsPolicy': {},
      'f:enableServiceLinks': {},
      'f:priorityClassName': {},
      'f:restartPolicy': {},
      'f:schedulerName': {},
      'f:securityContext': {'.': {},
       'f:fsGroup': {}},
      'f:terminationGracePeriodSeconds': {},
      'f:volumes': {'.': {},
       'k:{"name":"airflow-ml-model-training-volume"}': {'.': {},
        'f:name': {},
        'f:persistentVolumeClaim': {'.': {},
         'f:claimName': {}}}}}},
    'manager': 'OpenAPI-Generator',
    'operation': 'Update',
    'subresource': None,
    'time': datetime.datetime(2025, 10, 14, 9, 51, 9, tzinfo=tzlocal())},
   {'api_version': 'v1',
    'fields_type': 'FieldsV1',
    'fields_v1': {'f:metadata': {'f:annotations': {'f:cni.projectcalico.org/containerID': {},
       'f:cni.projectcalico.org/podIP': {},
       'f:cni.projectcalico.org/podIPs': {}}}},
    'manager': 'Go-http-client',
    'operation': 'Update',
    'subresource': 'status',
    'time': datetime.datetime(2025, 10, 14, 9, 51, 18, tzinfo=tzlocal())},
   {'api_version': 'v1',
    'fields_type': 'FieldsV1',
    'fields_v1': {'f:status': {'f:conditions': {'k:{"type":"ContainersReady"}': {'.': {},
        'f:lastProbeTime': {},
        'f:lastTransitionTime': {},
        'f:reason': {},
        'f:status': {},
        'f:type': {}},
       'k:{"type":"Initialized"}': {'.': {},
        'f:lastProbeTime': {},
        'f:lastTransitionTime': {},
        'f:status': {},
        'f:type': {}},
       'k:{"type":"Ready"}': {'.': {},
        'f:lastProbeTime': {},
        'f:lastTransitionTime': {},
        'f:reason': {},
        'f:status': {},
        'f:type': {}}},
      'f:containerStatuses': {},
      'f:hostIP': {},
      'f:phase': {},
      'f:podIP': {},
      'f:podIPs': {'.': {},
       'k:{"ip":"10.67.27.174"}': {'.': {},
        'f:ip': {}},
       'k:{"ip":"2620:0:861:302:57bd:c9a8:2745:49ac"}': {'.': {},
        'f:ip': {}}},
      'f:startTime': {}}},
    'manager': 'kubelet',
    'operation': 'Update',
    'subresource': 'status',
    'time': datetime.datetime(2025, 10, 14, 9, 55, 17, tzinfo=tzlocal())}],
  'name': 'train-tone-check-xv60x59',
  'namespace': 'airflow-dev',
  'owner_references': None,
  'resource_version': '829662183',
  'self_link': None,
  'uid': 'bf72c0f3-0455-42b5-a088-8174d34397ca'},
 'spec': {'active_deadline_seconds': None,
  'affinity': {'node_affinity': {'preferred_during_scheduling_ignored_during_execution': None,
    'required_during_scheduling_ignored_during_execution': {'node_selector_terms': [{'match_expressions': [{'key': 'kubernetes.io/hostname',
        'operator': 'NotIn',
        'values': ['dse-k8s-worker1001.eqiad.wmnet']}],
       'match_fields': None}]}},
   'pod_affinity': None,
   'pod_anti_affinity': None},
  'automount_service_account_token': None,
  'containers': [{'args': ['\n'
      'set -e\n'
      'set -x\n'
      'echo "Ensuring output model directory exists"\n'
      'mkdir -p /mnt/model-training/tone_check/20251014T060736/output_model\n'
      'echo "Verifying input data exists on PVC"\n'
      'ls -lR /mnt/model-training/tone_check/20251014T060736\n'
      'echo "Starting model training"\n'
      'python3 training/tone_check/retrain/retrain.py '
      '--train-data-path "/mnt/model-training/tone_check/20251014T060736/train_data.parquet/*.parquet" '
      '--validation-data-path "/mnt/model-training/tone_check/20251014T060736/validation_data.parquet/*.parquet" '
      '--output-model-path "/mnt/model-training/tone_check/20251014T060736/output_model" '
      '--base-model-path "/mnt/model-training/training/tone_check/base_model"\n'
      'echo "Verifying model output"\n'
      'ls -l /mnt/model-training/tone_check/20251014T060736/output_model'],
    'command': ['bash', '-c'],
    'env': [{'name': 'REQUESTS_CA_BUNDLE',
      'value': '/etc/ssl/certs/ca-certificates.crt',
      'value_from': None},
     {'name': 'AWS_REQUEST_CHECKSUM_CALCULATION',
      'value': 'WHEN_REQUIRED',
      'value_from': None},
     {'name': 'AWS_RESPONSE_CHECKSUM_VALIDATION',
      'value': 'WHEN_REQUIRED',
      'value_from': None}],
    'env_from': None,
    'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-633556',
    'image_pull_policy': 'IfNotPresent',
    'lifecycle': None,
    'liveness_probe': None,
    'name': 'base',
    'ports': None,
    'readiness_probe': None,
    'resize_policy': None,
    'resources': {'claims': None,
     'limits': {'cpu': '4', 'memory': '8Gi'},
     'requests': {'cpu': '2', 'memory': '4Gi'}},
    'restart_policy': None,
    'security_context': {'allow_privilege_escalation': False,
     'app_armor_profile': None,
     'capabilities': {'add': None, 'drop': ['ALL']},
     'privileged': None,
     'proc_mount': None,
     'read_only_root_filesystem': None,
     'run_as_group': None,
     'run_as_non_root': True,
     'run_as_user': None,
     'se_linux_options': None,
     'seccomp_profile': {'localhost_profile': None, 'type': 'RuntimeDefault'},
     'windows_options': None},
    'startup_probe': None,
    'stdin': None,
    'stdin_once': None,
    'termination_message_path': '/dev/termination-log',
    'termination_message_policy': 'File',
    'tty': None,
    'volume_devices': None,
    'volume_mounts': [{'mount_path': '/mnt/model-training',
      'mount_propagation': None,
      'name': 'airflow-ml-model-training-volume',
      'read_only': None,
      'recursive_read_only': None,
      'sub_path': None,
      'sub_path_expr': None},
     {'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount',
      'mount_propagation': None,
      'name': 'kube-api-access-hs5h6',
      'read_only': True,
      'recursive_read_only': None,
      'sub_path': None,
      'sub_path_expr': None}],
    'working_dir': None}],
  'dns_config': None,
  'dns_policy': 'ClusterFirst',
  'enable_service_links': True,
  'ephemeral_containers': None,
  'host_aliases': None,
  'host_ipc': None,
  'host_network': None,
  'host_pid': None,
  'host_users': None,
  'hostname': None,
  'image_pull_secrets': None,
  'init_containers': None,
  'node_name': 'dse-k8s-worker1009.eqiad.wmnet',
  'node_selector': None,
  'os': None,
  'overhead': None,
  'preemption_policy': 'PreemptLowerPriority',
  'priority': -100,
  'priority_class_name': 'low-priority-pod',
  'readiness_gates': None,
  'resource_claims': None,
  'resources': None,
  'restart_policy': 'Never',
  'runtime_class_name': None,
  'scheduler_name': 'default-scheduler',
  'scheduling_gates': None,
  'security_context': {'app_armor_profile': None,
   'fs_group': 900,
   'fs_group_change_policy': None,
   'run_as_group': None,
   'run_as_non_root': None,
   'run_as_user': None,
   'se_linux_change_policy': None,
   'se_linux_options': None,
   'seccomp_profile': None,
   'supplemental_groups': None,
   'supplemental_groups_policy': None,
   'sysctls': None,
   'windows_options': None},
  'service_account': 'default',
  'service_account_name': 'default',
  'set_hostname_as_fqdn': None,
  'share_process_namespace': None,
  'subdomain': None,
  'termination_grace_period_seconds': 30,
  'tolerations': [{'effect': 'NoExecute',
    'key': 'node.kubernetes.io/not-ready',
    'operator': 'Exists',
    'toleration_seconds': 300,
    'value': None},
   {'effect': 'NoExecute',
    'key': 'node.kubernetes.io/unreachable',
    'operator': 'Exists',
    'toleration_seconds': 300,
    'value': None}],
  'topology_spread_constraints': None,
  'volumes': [{'aws_elastic_block_store': None,
    'azure_disk': None,
    'azure_file': None,
    'cephfs': None,
    'cinder': None,
    'config_map': None,
    'csi': None,
    'downward_api': None,
    'empty_dir': None,
    'ephemeral': None,
    'fc': None,
    'flex_volume': None,
    'flocker': None,
    'gce_persistent_disk': None,
    'git_repo': None,
    'glusterfs': None,
    'host_path': None,
    'image': None,
    'iscsi': None,
    'name': 'airflow-ml-model-training-volume',
    'nfs': None,
    'persistent_volume_claim': {'claim_name': 'airflow-ml-model-training',
     'read_only': None},
    'photon_persistent_disk': None,
    'portworx_volume': None,
    'projected': None,
    'quobyte': None,
    'rbd': None,
    'scale_io': None,
    'secret': None,
    'storageos': None,
    'vsphere_volume': None},
   {'aws_elastic_block_store': None,
    'azure_disk': None,
    'azure_file': None,
    'cephfs': None,
    'cinder': None,
    'config_map': None,
    'csi': None,
    'downward_api': None,
    'empty_dir': None,
    'ephemeral': None,
    'fc': None,
    'flex_volume': None,
    'flocker': None,
    'gce_persistent_disk': None,
    'git_repo': None,
    'glusterfs': None,
    'host_path': None,
    'image': None,
    'iscsi': None,
    'name': 'kube-api-access-hs5h6',
    'nfs': None,
    'persistent_volume_claim': None,
    'photon_persistent_disk': None,
    'portworx_volume': None,
    'projected': {'default_mode': 420,
     'sources': [{'cluster_trust_bundle': None,
       'config_map': None,
       'downward_api': None,
       'secret': None,
       'service_account_token': {'audience': None,
        'expiration_seconds': 3607,
        'path': 'token'}},
      {'cluster_trust_bundle': None,
       'config_map': {'items': [{'key': 'ca.crt',
          'mode': None,
          'path': 'ca.crt'}],
        'name': 'kube-root-ca.crt',
        'optional': None},
       'downward_api': None,
       'secret': None,
       'service_account_token': None},
      {'cluster_trust_bundle': None,
       'config_map': None,
       'downward_api': {'items': [{'field_ref': {'api_version': 'v1',
           'field_path': 'metadata.namespace'},
          'mode': None,
          'path': 'namespace',
          'resource_field_ref': None}]},
       'secret': None,
       'service_account_token': None}]},
    'quobyte': None,
    'rbd': None,
    'scale_io': None,
    'secret': None,
    'storageos': None,
    'vsphere_volume': None}]},
 'status': {'conditions': [{'last_probe_time': None,
    'last_transition_time': datetime.datetime(2025, 10, 14, 9, 51, 9, tzinfo=tzlocal()),
    'message': None,
    'reason': None,
    'status': 'True',
    'type': 'Initialized'},
   {'last_probe_time': None,
    'last_transition_time': datetime.datetime(2025, 10, 14, 9, 55, 17, tzinfo=tzlocal()),
    'message': None,
    'reason': 'PodFailed',
    'status': 'False',
    'type': 'Ready'},
   {'last_probe_time': None,
    'last_transition_time': datetime.datetime(2025, 10, 14, 9, 55, 17, tzinfo=tzlocal()),
    'message': None,
    'reason': 'PodFailed',
    'status': 'False',
    'type': 'ContainersReady'},
   {'last_probe_time': None,
    'last_transition_time': datetime.datetime(2025, 10, 14, 9, 51, 9, tzinfo=tzlocal()),
    'message': None,
    'reason': None,
    'status': 'True',
    'type': 'PodScheduled'}],
  'container_statuses': [{'allocated_resources': None,
    'allocated_resources_status': None,
    'container_id': 'containerd://8ea8e64f754ba8f01555fb4356718a5eff1a8d60890a675bb9161764e2575672',
    'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-633556',
    'image_id': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines@sha256:5038bf766451c291a541c7f35ccf9fe9924f44a861589fdb70b6717791532637',
    'last_state': {'running': None, 'terminated': None, 'waiting': None},
    'name': 'base',
    'ready': False,
    'resources': None,
    'restart_count': 0,
    'started': False,
    'state': {'running': None,
     'terminated': {'container_id': 'containerd://8ea8e64f754ba8f01555fb4356718a5eff1a8d60890a675bb9161764e2575672',
      'exit_code': 137,
      'finished_at': datetime.datetime(2025, 10, 14, 9, 55, 16, tzinfo=tzlocal()),
      'message': None,
      'reason': 'OOMKilled',
      'signal': None,
      'started_at': datetime.datetime(2025, 10, 14, 9, 51, 18, tzinfo=tzlocal())},
     'waiting': None},
    'user': None,
    'volume_mounts': None}],
  'ephemeral_container_statuses': None,
  'host_i_ps': None,
  'host_ip': '10.64.0.149',
  'init_container_statuses': None,
  'message': None,
  'nominated_node_name': None,
  'phase': 'Failed',
  'pod_i_ps': [{'ip': '10.67.27.174'},
   {'ip': '2620:0:861:302:57bd:c9a8:2745:49ac'}],
  'pod_ip': '10.67.27.174',
  'qos_class': 'Burstable',
  'reason': None,
  'resize': None,
  'resource_claim_statuses': None,
  'start_time': datetime.datetime(2025, 10, 14, 9, 51, 9, tzinfo=tzlocal())}}
[2025-10-14, 09:55:21 UTC] {taskinstance.py:1226} INFO - Marking task as UP_FOR_RETRY. dag_id=tone_check_training_dag, task_id=train_tone_check, run_id=manual__2025-10-14T06:07:36.982115+00:00, execution_date=20251014T060736, start_date=20251014T095109, end_date=20251014T095521
[2025-10-14, 09:55:21 UTC] {taskinstance.py:341} INFO - Post task execution logs

Oct 14 2025, 12:29 PM · Essential-Work, Machine-Learning-Team
kevinbazira created P83875 train_tone_check task failed because of OOM issue in airflow-devenv.
Oct 14 2025, 12:25 PM · Machine-Learning-Team
kevinbazira moved T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration from Unsorted to In Progress on the Machine-Learning-Team board.
Oct 14 2025, 12:09 PM · Essential-Work, Machine-Learning-Team
kevinbazira created T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration.
Oct 14 2025, 12:08 PM · Essential-Work, Machine-Learning-Team

Oct 10 2025

kevinbazira created T406958: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv.
Oct 10 2025, 5:01 AM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Machine-Learning-Team

Oct 9 2025

kevinbazira added a comment to P83720 Kube-Config error after activating triggerer process in airflow-ml.

I started the triggerer process using the commands below:

$ kube_env airflow-ml-deploy dse-k8s-eqiad
$ kubectl get pods
NAME                                               READY   STATUS      RESTARTS   AGE
airflow-envoy-6787cbd6df-pbw2k                     1/1     Running     0          2d1h
airflow-gitsync-6d644db84-5rwr6                    1/1     Running     0          2d1h
airflow-hadoop-shell-dd6db6fcf-xdfbc               1/1     Running     0          2d1h
airflow-kerberos-8b5978dd-2644c                    1/1     Running     0          2d1h
airflow-scheduler-76cc96d9d7-fbqj6                 1/1     Running     0          2d1h
airflow-statsd-79449b6c49-q2nvn                    1/1     Running     0          2d1h
airflow-task-shell-7957965bfb-zkxfj                1/1     Running     0          2d1h
airflow-webserver-5fb5d89d44-nbhl9                 2/2     Running     0          2d1h
postgresql-airflow-ml-1                            1/1     Running     0          3d4h
postgresql-airflow-ml-2                            1/1     Running     0          2d21h
postgresql-airflow-ml-pooler-rw-7d5d74cb69-7wj6r   1/1     Running     0          2d23h
postgresql-airflow-ml-pooler-rw-7d5d74cb69-chpq2   1/1     Running     0          2d23h
postgresql-airflow-ml-pooler-rw-7d5d74cb69-r8td9   1/1     Running     0          3d1h
retrain-tone-check-ba4r32n                         0/1     Completed   0          28h
Oct 9 2025, 3:13 PM · Machine-Learning-Team
kevinbazira created P83720 Kube-Config error after activating triggerer process in airflow-ml.
Oct 9 2025, 3:12 PM · Machine-Learning-Team
kevinbazira added a comment to T406302: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator.

After enabling deferrable execution of the training operator to handle GPU resource contention, the following warning appears in both the airflow-devenv and airflow-ml web UI:

The triggerer does not appear to be running.
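
For context, deferrable operators release their worker slot by deferring to a trigger, and the separate triggerer process is what fires the event that resumes them; without a running triggerer, deferred tasks never wake up, which is what the warning is flagging. A stdlib-only sketch of that handoff (all names here are illustrative, not Airflow's actual API):

```python
# Illustrative sketch of the deferral handoff; class and function names
# are hypothetical, not the real Airflow API.
class TaskDeferred(Exception):
    """Raised by execute() to park the task until a trigger fires."""
    def __init__(self, trigger, method_name):
        self.trigger = trigger
        self.method_name = method_name

class Task:
    def execute(self):
        # Hand the wait off instead of blocking a worker slot.
        raise TaskDeferred(trigger="time_delta", method_name="resume")

    def resume(self, event):
        return f"resumed after {event}"

def run_with_triggerer(task, triggerer_running=True):
    try:
        task.execute()
    except TaskDeferred as d:
        if not triggerer_running:
            # Mirrors the web UI warning: with no triggerer, deferred
            # tasks never receive their wake-up event.
            return "stuck: the triggerer does not appear to be running"
        event = f"trigger '{d.trigger}' fired"
        return getattr(task, d.method_name)(event)

print(run_with_triggerer(Task()))
print(run_with_triggerer(Task(), triggerer_running=False))
```

In a real deployment the fix is to run the triggerer as its own long-lived process alongside the scheduler and webserver.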
Oct 9 2025, 6:06 AM · Essential-Work, Machine-Learning-Team

Oct 8 2025

kevinbazira added a comment to T406302: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator.

Now the tone_check_retrain_dag is failing with the error below:

Oct 8 2025, 12:39 AM · Essential-Work, Machine-Learning-Team