User Details
- User Since
- Aug 3 2019, 6:58 AM (332 w, 2 d)
- Availability
- Available
- IRC Nick
- kevinbazira
- LDAP User
- Kevin Bazira
- MediaWiki User
- KBazira (WMF) [ Global Accounts ]
Yesterday
Yes, I saw it here when I was testing the Qwen3-Embedding-4B example that shows how to use HF transformers for embeddings inference.
The prototype above can be run on ml-lab using the steps below:
Thu, Dec 11
I resolved the issue in P72019#288792 by installing typing_extensions==4.15.0 in vllm_from_bdist_wheel_venv and adjusting the PYTHONPATH so that this venv's site-packages (with typing_extensions 4.15.0) appear before the ROCm PyTorch venv (with typing_extensions 4.9.0) in the search path:
export PYTHONPATH=/home/kevinbazira/test_aya/build_vllm/test_wheel/vllm_from_bdist_wheel_venv/lib/python3.11/site-packages:/srv/pytorch-rocm/venv/lib/python3.11/site-packages/
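The ordering trick above can be illustrated with a small sketch: Python scans sys.path (which PYTHONPATH prepends to) in order and imports the first match, so the wheel venv's typing_extensions 4.15.0 shadows the ROCm venv's 4.9.0. The `first_providing` helper and the `installed` mapping below are hypothetical illustrations, not code from the model-server:

```python
# Illustration only: which typing_extensions would be imported given the
# PYTHONPATH ordering above. Python resolves imports by scanning sys.path
# in order, so the first site-packages dir that provides the module wins.
pythonpath = ":".join([
    "/home/kevinbazira/test_aya/build_vllm/test_wheel/vllm_from_bdist_wheel_venv/lib/python3.11/site-packages",
    "/srv/pytorch-rocm/venv/lib/python3.11/site-packages/",
])
search_path = pythonpath.split(":")

def first_providing(paths, versions):
    """Return the version that would be imported, given a mapping of
    site-packages dir -> installed typing_extensions version."""
    for p in paths:
        if p in versions:
            return versions[p]
    return None

installed = {
    search_path[0]: "4.15.0",  # vllm_from_bdist_wheel_venv
    search_path[1]: "4.9.0",   # ROCm PyTorch venv
}
print(first_providing(search_path, installed))  # -> 4.15.0
```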
Wed, Dec 3
Tue, Dec 2
@prabhat, has the WME team had a chance to run scale and latency tests on the revertrisk-wikidata inference service? Does this service meet your performance requirements?
Mon, Dec 1
In T410906#11415517, we successfully tested the llm model-server on LiftWing with MI300X GPU.
Fri, Nov 28
Finally, as shown below, the llm model-server using MI300X GPU in LiftWing production is able to serve the aya-expanse-8B model:
$ kubectl get pods
NAME                                                  READY   STATUS    RESTARTS   AGE
aya-llm-predictor-00015-deployment-65b4577748-6wh2c   3/3     Running   0          2m11s
Thu, Nov 27
In T410906#11409323, we found that the flash-attention2 wheel built above didn't support MI300X GPUs. In P85813, we built a flash-attention2 wheel from source that supports both the gfx90a and gfx942 ROCm targets. With this wheel, the llm model-server inference no longer throws the error from T410906#11409323, but it returns only <PAD> tokens:
$ kubectl get pods
NAME                                                  READY   STATUS    RESTARTS   AGE
aya-llm-predictor-00014-deployment-65ccd57d6d-pf79b   3/3     Running   0          2m23s
Wed, Nov 26
In T410906#11408603, we found that the bitsandbytes wheel built above didn't support MI300X GPUs. In P85707, we built a bitsandbytes wheel from source that supports both the gfx90a and gfx942 ROCm targets. The llm model-server now starts without errors, but inference runs into a familiar invalid device function error caused by flash-attention2:
$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
aya-llm-predictor-00013-deployment-cb8c6b54f-52d4t           3/3     Running   0          95s
langid-predictor-default-00014-deployment-5894db899b-hhn2r   3/3     Running   0          11d
The llm model-server no longer throws the OOM issue after using BITSANDBYTES_DTYPE="int4" and packages built from source, as we did in P85433#343273. However, although the built package runs on the ML-Lab MI200 GPU (gfx90a) as shown in T410906#11405817, it's not compatible with MI300X GPUs (gfx942):
+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
CPU count detected from get_cpu_count: 6
++ export OMP_NUM_THREADS=6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
OMP_NUM_THREADS set to: 6
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Tue, Nov 25
Since we would like to reduce the VRAM usage that is causing the OOM issue in T410906#11404979, I am going to revert to using BITSANDBYTES_DTYPE="int4". This had caused the error in T410906#11404187, but we have a solution for it in P85433#343273. Below, I have tested the solution on ML-Lab, and I will try a similar approach on LiftWing.
$ MODEL_NAME=aya-expanse-8B LLM_CLASS=llm.Aya \
  MODEL_PATH="/home/kevinbazira/.cache/huggingface/hub/models--CohereForAI--aya-expanse-8b/snapshots/554c52e22d0f713bab9d3e360734d25cd15dda16/" \
  BITSANDBYTES_DTYPE="int4" DEVICE=auto ATTN_IMPLEMENTATION="flash_attention_2" DTYPE="float16" \
  python3 src/models/llm/model.py
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
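As a rough sanity check on why int4 quantization helps here, the sketch below estimates the VRAM needed for the weights alone (the ~8B parameter count is approximate, and activations/KV cache are extra):

```python
# Back-of-envelope VRAM estimate for the model weights only.
# The 8B parameter count is approximate (aya-expanse-8b).
N_PARAMS = 8e9

def weight_gib(bits_per_param: float) -> float:
    """GiB required to store N_PARAMS weights at the given precision."""
    return N_PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)  # ~14.9 GiB -- tight on a 24 GiB GPU allocation
int4 = weight_gib(4)   # ~3.7 GiB  -- leaves ample headroom
print(f"fp16 ~{fp16:.1f} GiB, int4 ~{int4:.1f} GiB")
```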
The revertrisk-wikidata inference service production endpoint uses scaling configs similar to those of the other revertrisk inference services: https://github.com/wikimedia/operations-deployment-charts/blob/8412fc655d3b1e10b38cf0c954d910b820e93a05/helmfile.d/ml-services/revertrisk/values.yaml#L145-L150
It looks like torch 2.5.1+rocm6.1, which the llm model-server image currently uses, doesn't support expandable_segments:
kevinbazira@deploy2002:~$ kubectl logs aya-llm-predictor-00009-deployment-54ccf6ddc6-b5r9w
+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
CPU count detected from get_cpu_count: 6
OMP_NUM_THREADS set to: 6
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
++ export OMP_NUM_THREADS=6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
/opt/lib/venv/lib/python3.11/site-packages/accelerate/utils/modeling.py:841: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
_ = torch.tensor([0], device=i)
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
File "/srv/app/src/models/llm/model.py", line 145, in <module>
model = llm_class(model_name)
^^^^^^^^^^^^^^^^^^^^^
File "/srv/app/src/models/llm/aya/aya.py", line 12, in __init__
super().__init__(model_name)
File "/srv/app/src/models/llm/model.py", line 32, in __init__
self.model, self.tokenizer = self.load()
^^^^^^^^^^^
File "/srv/app/src/models/llm/aya/aya.py", line 21, in load
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4400, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5799, in caching_allocator_warmup
_ = torch.empty(byte_count // factor, dtype=torch.float16, device=device, requires_grad=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.91 GiB. GPU 0 has a total capacity of 24.00 GiB of which 23.38 GiB is free. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
The above error was fixed by setting BITSANDBYTES_DTYPE to None. Now we are running into the OOM issue shown below:
kevinbazira@deploy2002:~$ kubectl logs aya-llm-predictor-00008-deployment-9759d96d5-jf5r8
+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
++ CPU_COUNT=6
++ echo 'CPU count detected from get_cpu_count: 6'
++ export OMP_NUM_THREADS=6
++ OMP_NUM_THREADS=6
++ echo 'OMP_NUM_THREADS set to: 6'
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
CPU count detected from get_cpu_count: 6
OMP_NUM_THREADS set to: 6
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]Traceback (most recent call last):
File "/srv/app/src/models/llm/model.py", line 145, in <module>
model = llm_class(model_name)
^^^^^^^^^^^^^^^^^^^^^
File "/srv/app/src/models/llm/aya/aya.py", line 12, in __init__
super().__init__(model_name)
File "/srv/app/src/models/llm/model.py", line 32, in __init__
self.model, self.tokenizer = self.load()
^^^^^^^^^^^
File "/srv/app/src/models/llm/aya/aya.py", line 21, in load
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4400, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 5799, in caching_allocator_warmup
_ = torch.empty(byte_count // factor, dtype=torch.float16, device=device, requires_grad=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 16.91 GiB. GPU 0 has a total capacity of 24.00 GiB of which 23.38 GiB is free. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
The first deployment shows the model-server in a CrashLoopBackOff:
kevinbazira@deploy2002:~$ kubectl get pods
NAME                                                  READY   STATUS             RESTARTS      AGE
aya-llm-predictor-00007-deployment-84dd44b649-lh9vb   1/3     CrashLoopBackOff   4 (26s ago)   4m2s
Mon, Nov 24
As we worked on T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing, we conducted locust load tests on the revertrisk-wikidata inference service staging endpoint. These tests ran for 120 seconds with 2 users, each sending requests at intervals between 1 and 5 seconds, using sample Wikidata revision IDs taken from the Research team's expert_sample.csv.
Fri, Nov 21
This error doesn't occur when I use the bitsandbytes wheel I built from source in P71986:
kevinbazira@ml-lab1002:~/test_aya$ source .venv/bin/activate
(.venv) kevinbazira@ml-lab1002:~/test_aya$ python3 -m bitsandbytes
g++ (Debian 12.2.0-14+deb12u1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Thu, Nov 20
The revertrisk-wikidata inference service is now live in LiftWing production. It can be accessed through:
1. External endpoint:
$ curl "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H "Content-Type: application/json" --http1.1
2. Internal endpoint:
$ curl "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H "Host: revertrisk-wikidata.revertrisk.wikimedia.org" -H "Content-Type: application/json" --http1.1
3. Documentation:
- Model card (Work in Progress: T406179#11390839)
- WMF API Gateway documentation
- Wikitech LiftWing documentation
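For callers scripting the internal endpoint from Python rather than curl, the request can be sketched with the standard library. The snippet below only builds the request object (it does not send it, since the endpoint is reachable only from within WMF infrastructure); the Host header mirrors the curl example:

```python
import json
import urllib.request

def build_predict_request(rev_id: int) -> urllib.request.Request:
    """Build (but do not send) the internal-endpoint prediction request.
    The Host header routes the request to the revertrisk-wikidata isvc."""
    return urllib.request.Request(
        url="https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-wikidata:predict",
        data=json.dumps({"rev_id": rev_id}).encode(),
        headers={
            "Host": "revertrisk-wikidata.revertrisk.wikimedia.org",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_predict_request(1945516043)
print(req.get_method(), req.full_url)
```

Sending it would then be a matter of `urllib.request.urlopen(req)` from a host inside the WMF network.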
@Trokhymovych, here are resources to help you create a comprehensive model card for the revertrisk-wikidata model:
- Section to create model card: https://meta.wikimedia.org/wiki/Machine_learning_models#Create_a_model_card
- FAQs to answer in the model card: https://docs.google.com/document/d/1Q5aJGGBJB4LN3dXS8_-IjZYi0a3T1MIWtDXDwZeEins/edit
- Model card template: https://meta.wikimedia.org/wiki/Machine_learning_models/Model_card_template
Wed, Nov 19
Tue, Nov 18
@Trokhymovych, thank you for reviewing the revertrisk-wikidata model-server and sharing detailed feedback (that's super useful).
Mon, Nov 17
Nov 13 2025
I have run locust load tests on the revertrisk-wikidata staging isvc for 120s with 2 users, each sending requests at intervals between 1s and 5s, using the sample Wikidata revision IDs shared in expert_sample.csv in T406179#11333762. Results show an average response time of 568ms with a 0% failure rate over 66 requests.
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
...
MODEL=revertrisk_wikidata my_locust_venv/bin/locust --headless --csv results/revertrisk_wikidata
[2025-11-13 04:53:43,557] stat1008/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-13 04:53:43,557] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-11-13 04:53:43,558] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-11-13 04:53:43,559] stat1008/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 2} (2 total users)
[2025-11-13 04:55:42,893] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-13 04:55:43,001] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                       # reqs   # fails  |  Avg  Min  Max  Med  |  req/s  failures/s
--------|-------------------------------------------|--------|---------|-----|-----|-----|-----|--------|-----------
POST     /v1/models/revertrisk-wikidata:predict         66   0(0.00%) |  568  375  886  550 |   0.56        0.00
--------|-------------------------------------------|--------|---------|-----|-----|-----|-----|--------|-----------
         Aggregated                                      66   0(0.00%) |  568  375  886  550 |   0.56        0.00
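Since the run also writes results/revertrisk_wikidata_* CSV files (via --csv), a threshold check like the one reported above could be scripted. The column names below follow Locust's stats-CSV convention and are an assumption to verify against the actual generated file:

```python
import csv
import io

# Hypothetical threshold check over a Locust stats CSV: 0% failures and a
# sub-second average response time. Column names assume Locust's standard
# *_stats.csv layout; verify against results/revertrisk_wikidata_stats.csv.
stats_csv = """Type,Name,Request Count,Failure Count,Average Response Time
POST,/v1/models/revertrisk-wikidata:predict,66,0,568
,Aggregated,66,0,568
"""

def within_threshold(csv_text: str, max_avg_ms: float = 1000.0) -> bool:
    """Return True if the Aggregated row has no failures and an average
    response time at or below max_avg_ms."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Name"] == "Aggregated":
            return (int(row["Failure Count"]) == 0
                    and float(row["Average Response Time"]) <= max_avg_ms)
    return False

print(within_threshold(stats_csv))  # -> True
```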
Nov 12 2025
As we prepare to run load tests, the revertrisk-wikidata isvc has been deployed in LiftWing staging:
# pod running in revision-models ns staging
$ kube_env revision-models ml-staging-codfw
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-wikidata-predictor-00001-deployment-6fff6dbcbf-mxgmg   3/3     Running   0          77s
@Trokhymovych, following up on T406179#11353371, the revertrisk-wikidata model-server is now live in LiftWing's experimental namespace. Please test it by adjusting the rev_id in the curl command below and let us know whether it's returning correct predictions:
# ssh into WMF stat machine
$ ssh stat1008.eqiad.wmnet
Nov 11 2025
I have added unit tests for critical components of the model-server to make sure future changes do not break functionality. Here is the output when I build the test image and run the tests:
$ docker buildx build --target test -f .pipeline/revertrisk_wikidata/blubber.yaml --platform=linux/amd64 . -t rrw_unit_test
$ docker run --rm rrw_unit_test
...
Initialized empty Git repository in /srv/revertrisk_wikidata/.git/
ci-lint: install_deps> python -I -m pip install pre-commit
ci-lint: commands[0]> pre-commit run --all-files --show-diff-on-failure
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
ruff (legacy alias)......................................................Passed
ruff format..............................................................Passed
ci-lint: OK ✔ in 11.8 seconds
ci-unit: install_deps> python -I -m pip install -r /srv/revertrisk_wikidata/requirements-test.txt
ci-unit: commands[0]> pytest test/unit
============================= test session starts ==============================
platform linux -- Python 3.11.2, pytest-9.0.0, pluggy-1.6.0
cachedir: .tox/ci-unit/.pytest_cache
rootdir: /srv/revertrisk_wikidata
configfile: tox.ini
plugins: anyio-4.11.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 5 items
Nov 7 2025
The revertrisk-wikidata model-server has been deployed in the LiftWing experimental namespace. It is currently available through an internal endpoint that can only be accessed by tools running within the WMF infrastructure (e.g. deploy2002, stat1008):
# pod running in experimental ns
$ kube_env experimental ml-staging-codfw
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-wikidata-predictor-default-00019-deployment-557bkfdk   3/3     Running   0          96s
The revertrisk-wikidata model-server has been containerized and integrated into the CI/CD pipeline, which successfully published it to the Wikimedia Docker registry:
docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-revertrisk-wikidata:2025-11-07-042629-publish
Nov 6 2025
As we prepare to publish the revertrisk-wikidata model-server image to the Wikimedia Docker registry, here is a summary of the image layers:
$ docker history b601e2d84c63
IMAGE          CREATED         CREATED BY                                      SIZE     COMMENT
b601e2d84c63   2 minutes ago   [production] 📂 [common_settings.sh] -> comm…   1.36kB   buildkit.exporter.image.v0
<missing>      2 minutes ago   [production] 📂 [model_server_entrypoint.sh]…   303B     buildkit.exporter.image.v0
<missing>      2 minutes ago   [production] 📦 {build}[/opt/lib/venv/lib/py…   1.81GB   buildkit.exporter.image.v0
<missing>      4 minutes ago   [production] 📂 [python] -> python/             33.1kB   buildkit.exporter.image.v0
<missing>      4 minutes ago   [production] 📂 [src/models/revertrisk_wikid…   30.7kB   buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c (getent group "…   9.07kB   buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c (getent group "…   8.88kB   buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c apt-get update …   59.2MB   buildkit.exporter.image.v0
<missing>      11 days ago     /bin/sh -c #(nop) CMD ["/bin/bash"]             0B
<missing>      11 days ago     /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8            0B
<missing>      11 days ago     /bin/sh -c #(nop) ADD file:abeaf73dbbde23882…   74.8MB
The largest layer is ~1.81GB, which is within the Wikimedia Docker registry's 4GB compressed layer size limit.
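A quick way to double-check the layer sizes against the registry limit is to parse the SIZE column; the `to_bytes` helper below is a hypothetical utility, using the decimal units docker history prints:

```python
# Hypothetical check: parse `docker history` SIZE values and verify every
# layer is under the registry's 4GB limit. docker prints decimal units.
UNITS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9}

def to_bytes(size: str) -> float:
    # Check multi-letter suffixes before the bare "B" suffix.
    for unit in ("kB", "MB", "GB", "B"):
        if size.endswith(unit):
            return float(size[: -len(unit)]) * UNITS[unit]
    raise ValueError(f"unrecognized size: {size}")

layer_sizes = ["1.36kB", "303B", "1.81GB", "33.1kB", "30.7kB",
               "9.07kB", "8.88kB", "59.2MB", "0B", "0B", "74.8MB"]
LIMIT = 4e9
assert all(to_bytes(s) < LIMIT for s in layer_sizes)
print(max(layer_sizes, key=to_bytes))  # -> 1.81GB
```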
Nov 4 2025
Thanks for the review @Trokhymovych. I have updated the model-server to remove the reliance on the static full_labels_2024-04_text_en.csv file.
@Trokhymovych thank you for sharing the full model and detailed instructions. The model has been uploaded to swift:
$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/revertrisk/wikidata/20251104121312/
                         DIR  s3://wmf-ml-models/revertrisk/wikidata/20251104121312/data/
2025-11-04 06:43        631M  s3://wmf-ml-models/revertrisk/wikidata/20251104121312/wikidata_revertrisk_graph2text_v2.pkl
and is also publicly accessible via the analytics portal: https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/wikidata/20251104121312/
Oct 30 2025
Hi @Miriam, the plan is to deploy the full model. When the Research Team provides the full model, we shall build a model-server and host it on LiftWing.
Oct 29 2025
The revertrisk-wikidata-metadata model-server prototype has been updated in P84312 based on our discussion regarding feature processing. We have fixed:
1. user_is_bot: expected as a string ("0" or "1") instead of an integer
2. event_user_groups: expected as a string ("0.0" or "1.0") instead of a float
3. user_age and page_age: should be expressed in years instead of seconds
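The three fixes above can be sketched as a small normalization step; the `normalize_features` helper and its field handling are a hypothetical illustration, not the model-server's actual code:

```python
# Average Gregorian year in seconds (365.25 days); an assumption for the
# seconds-to-years conversion.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def normalize_features(raw: dict) -> dict:
    """Hypothetical sketch of the three fixes: cast user_is_bot and
    event_user_groups to strings, and convert ages from seconds to years."""
    return {
        "user_is_bot": str(int(raw["user_is_bot"])),               # e.g. 1 -> "1"
        "event_user_groups": str(float(raw["event_user_groups"])), # e.g. 1 -> "1.0"
        "user_age": raw["user_age"] / SECONDS_PER_YEAR,
        "page_age": raw["page_age"] / SECONDS_PER_YEAR,
    }

print(normalize_features({"user_is_bot": 0, "event_user_groups": 1.0,
                          "user_age": 63_115_200, "page_age": 31_557_600}))
```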
Oct 28 2025
Thank you for sharing this information, @Trokhymovych.
Oct 24 2025
- Ran tone-check training job locally with model-ready training data to determine memory usage as there was an 8GB limit in wmf airflow that caused the job to fail. (T407212#11280133)
- Tested GPU node labels that were set up by SRE. (T373806#11275873)
- While using the GPU node selector, the tone-check model training task completed in 8hrs running in staging on MI210 GPU with 64G VRAM (P83966)
- Pushed an MR to prod that has the single DAG for the tone-check training pipeline.
- Copied tone-check base model from HDFS to PVC in prod. (MR)
- Followed up to confirm that the triggerer process enabled by DPE SRE works as expected in both airflow-ml and airflow-devenv (T406958#11288441)
- The tone-check training DAG ran and completed end-to-end in prod (T407212#11288359)
- Wrote first version of the airflow ML Pipelines docs: https://wikitech.wikimedia.org/wiki/Machine_Learning/Airflow_ML_Pipelines
Oct 23 2025
Hi @Trokhymovych, just following up on our discussion from yesterday. When you have a moment, please share the demo code that shows model inference (input/output) using the Wikidata revert-risk model and a link to the model binary. Thanks!
Fixed by DPE SRE here.
Fixed as shown here.
This OOM issue was resolved as shown here.
This was fixed as shown here.
Since using the TriggerDagRunOperator for cross-DAG orchestration requires always checking that all DAGs are unpaused, the ML team decided to merge the tone-check pipeline DAGs into a single DAG for simplified orchestration (T407212).
Oct 22 2025
In T407212#11288359, we finally have a tone-check model training pipeline that runs end-to-end in the airflow-ml production instance.
Oct 20 2025
Super! The DAG task with a deferrable operator has now succeeded in the airflow-devenv.
@brouberol, thank you for enabling the triggerer process in Airflow. I tested it using this DAG, and the WMFKubernetesPodOperator deferred the training task, launched a GPU-enabled pod, and the pod ran and completed successfully.
With the configurations described in T407212#11280133, the tone_check_training_dag ran end-to-end in production just like it did in staging:
Oct 16 2025
Following T407212#11275074, I ran the tone-check training job locally with model-ready training data to determine memory usage. Neither 8Gi nor 16Gi was enough, as the job required over 18Gi. With a 20Gi memory limit, the job ran on CPU without OOM issues, although it took a very long time to complete.
Oct 15 2025
@elukey, thank you for working on the GPU node labellers. Following our IRC conversation, I tested the node selector functionality in Airflow using the WMFKubernetesPodOperator and here are the results:
To fix the OOM issue reported in T407212#11271755, I increased container_resources.limits.memory from 8Gi to 16Gi in the train_tone_check task. However, the task running on CPU still failed with the error below:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"train-tone-check-e7hk664\" is forbidden: [maximum memory usage per Container is 8Gi, but limit is 16Gi, maximum memory usage per Pod is 10Gi, but limit is 17179869184]","reason":"Forbidden","details":{"name":"train-tone-check-e7hk664","kind":"pods"},"code":403}
Full logs can be found here. This indicates that we requested 16Gi for the container, but the cluster only allows up to 8Gi per container and 10Gi per pod.
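As a side note, the raw 17179869184 in the error message is simply our 16Gi container limit expressed in bytes, which a quick calculation confirms:

```python
# Kubernetes "Gi" is a binary unit: 1 Gi = 2**30 bytes. The error message
# prints our pod limit in raw bytes, which works out to exactly 16Gi.
GI = 2**30
limit_bytes = 17179869184
print(limit_bytes // GI)  # -> 16
```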
Oct 14 2025
- Our DAGs were granted permission by DPE SRE to spin up pods in the airflow-ml instance (T406302#11250084)
- Model training task fails because it requests a GPU, but all available GPUs are in use by other jobs (T406302#11252998)
- Enabled deferrable execution for the training operator to wait for GPU resources gracefully and discovered that the Airflow triggerer process was not running (T406302#11258029)
- Following SRE advice, we manually started the triggerer by execing into the scheduler pod, but the DAG task failed because of a kube-config issue (P83720)
- Requested SRE to enable and properly configure the Airflow triggerer process to support deferrable operators (T406958)
- Also had a follow-up discussion on Slack in case there were other solutions to handle this issue
- A fix is a WIP
- Started merging the tone-check DAGs into a single tone_check_training_dag for simplified orchestration (T407212)
- Tested the tone_check_training_dag in airflow-devenv and the data-generation and data-copy tasks ran successfully (T407212#11271755)
- The train_tone_check task failed with an OOM issue (P83875)
I have run the tone_check_training_dag in staging, and the following tasks succeeded: generate_training_data, split_training_data, and copy_hdfs_to_pvc. The train_tone_check task running on CPU failed with an OOM issue shown in the logs below:
Oct 10 2025
Oct 9 2025
I started the triggerer process using the commands below:
$ kube_env airflow-ml-deploy dse-k8s-eqiad
$ kubectl get pods
NAME                                               READY   STATUS      RESTARTS   AGE
airflow-envoy-6787cbd6df-pbw2k                     1/1     Running     0          2d1h
airflow-gitsync-6d644db84-5rwr6                    1/1     Running     0          2d1h
airflow-hadoop-shell-dd6db6fcf-xdfbc               1/1     Running     0          2d1h
airflow-kerberos-8b5978dd-2644c                    1/1     Running     0          2d1h
airflow-scheduler-76cc96d9d7-fbqj6                 1/1     Running     0          2d1h
airflow-statsd-79449b6c49-q2nvn                    1/1     Running     0          2d1h
airflow-task-shell-7957965bfb-zkxfj                1/1     Running     0          2d1h
airflow-webserver-5fb5d89d44-nbhl9                 2/2     Running     0          2d1h
postgresql-airflow-ml-1                            1/1     Running     0          3d4h
postgresql-airflow-ml-2                            1/1     Running     0          2d21h
postgresql-airflow-ml-pooler-rw-7d5d74cb69-7wj6r   1/1     Running     0          2d23h
postgresql-airflow-ml-pooler-rw-7d5d74cb69-chpq2   1/1     Running     0          2d23h
postgresql-airflow-ml-pooler-rw-7d5d74cb69-r8td9   1/1     Running     0          3d1h
retrain-tone-check-ba4r32n                         0/1     Completed   0          28h
After enabling deferrable execution of the training operator to handle GPU resource contention, the following warning appears in both the airflow-devenv and airflow-ml web UI:
The triggerer does not appear to be running.
Oct 8 2025
Now the tone_check_retrain_dag is failing with the error below:

