
Incorporate Tone-check Retraining Notebook in ml-pipelines
Closed, Resolved | Public

Description

This task is based on: T396495: Build model training pipeline for tone check using WMF ML Airflow instance.

Develop the code from exploratory-notebook and adjust it in order to be part of the ml-pipelines repo.

The decision is to keep the retraining code simple (for the current phase) without classes and abstractions, imitating the logic taken from the notebook.
The code needs to be adjusted in order to work with repo rules/pipelines and with both gitlab-ci/kokkuri.

The logic behind this attempt is to finalise the tone-check retraining code and develop it in such a way that it will be containerised via kokkuri.
The image needs to be slim and decoupled from the data and base_model.
The container will accept external volumes where the data and the base_model will exist.
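
A minimal sketch of how the retraining entry point might verify that the external volumes are mounted before it starts. The names `missing_mounts`, `DEFAULT_ROOT` and `REQUIRED_DIRS` are hypothetical and not taken from the ml-pipelines repo; the root path mirrors the mount targets used in the docker run command in this task.

```python
from pathlib import Path

# Assumed container layout, mirroring the mount targets used in this task.
# The constant and function names are illustrative only.
DEFAULT_ROOT = Path("/srv/edit_check/training/tone_check")
REQUIRED_DIRS = ("data", "base_model", "output")


def missing_mounts(root: Path = DEFAULT_ROOT) -> list[str]:
    """Return the names of required directories that do not exist under root."""
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
```

A script could call `missing_mounts()` first and exit with a clear message if the returned list is not empty, instead of failing later while loading the model or the dataset.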

The current status of the Tone-Check retraining pipeline is summarised in these comments: T396495#10970710 & T396495#11025158

This ticket continues from that status: it uses the logic from the exploratory-notebook and simulates the actual operation by decoupling the data/base_model from the retraining container.

Details

Other Assignee
achou
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
tone-check: refactor retraining image and script for DAG integration | repos/machine-learning/ml-pipelines!44 | kevinbazira | tone_check_training_job_logic | main

Related Objects

Status | Assigned
Open | ppelberg
Open | None
Open | None
Open | None
Open | None
Resolved | Esanders
Resolved | MNeisler
In Progress | Sucheta-Salgaonkar-WMF
Open | None
Open | None
Open | None
Open | None
Resolved | gkyziridis
Resolved | brouberol
Resolved | brouberol
Resolved | gkyziridis
Resolved | kevinbazira
Resolved | kevinbazira
Resolved | kevinbazira
Resolved | kevinbazira
Resolved | brouberol
Resolved | kevinbazira

Event Timeline

gkyziridis updated the task description.
gkyziridis updated the task description.

Update

  • Incorporated the code into ml-pipelines
  • Decoupled the model and data from the image
  • Got it running locally with these changes:
    • Use base: python:3.11-slim
    • Use the actual model (locally)
    • Build the image
    • Mount the volumes from the local machine
    • Run it successfully

Obstacles:

  • Testing gets harder once the code is pushed to the repo and runs through the CI/CD and kokkuri pipelines, because the volumes (data and base_model) are absent there.
  • Since the image will be slim and contain no data or models, we need to decide how the gitlab-ci pipelines will run with the required volumes attached.
  • I had issues building the image using the amd-pytorch23 base:
    • This image was used successfully by the previous example in ml-pipelines.
    • The previous example had the model and the data included in the container (using a small model and toy data).
  • Both fixes were made only to run it locally; they will not be part of the actual code:
    • Use the python:3.11-slim base image
    • Install torch on top
The current status of the code can be found in this branch.

Test it locally:

  1. Clone branch:
git clone -b develop_retraining_notebook_tonecheck https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines.git && cd ml-pipelines
  2. Build image:
docker build --target production -f .pipeline/training/tone_check/retrain/blubber.yaml -t retrain:slim .
  3. Download data and model:
  4. Place them in ml-pipelines/training/tone_check/data and ml-pipelines/training/tone_check/base_model respectively. Create an empty ml-pipelines/training/tone_check/output/ folder as well.
  5. Make sure you are in the correct folder when you run the image:
$ pwd
# ml-pipelines/training/tone_check

$ ls -l base_model
# Output
config.json
model.safetensors
optimizer.pt
rng_state.pth
scheduler.pt
special_tokens_map.json
tokenizer_config.json
tokenizer.json
trainer_state.json
training_args.bin
vocab.txt

$ ls -l data
# Output
peacock_detection_dataset.csv
  6. Run the container and mount the volumes:
docker run --rm \
  -v $(pwd)/data:/srv/edit_check/training/tone_check/data \
  -v $(pwd)/base_model:/srv/edit_check/training/tone_check/base_model \
  -v $(pwd)/output:/srv/edit_check/training/tone_check/output \
  retrain:slim
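
The same run command can be assembled from Python, which makes it easy to fail early when one of the three folders is missing (with `-v`, Docker would otherwise create a missing host path as an empty directory). This is a sketch under assumptions: `docker_run_args` and `run_retraining` are hypothetical helpers, and only the paths and the image tag are taken from the command above.

```python
import subprocess
from pathlib import Path

# Mount target and folder names as used in the docker run command above.
CONTAINER_ROOT = "/srv/edit_check/training/tone_check"
VOLUMES = ("data", "base_model", "output")


def docker_run_args(workdir: Path, image: str = "retrain:slim") -> list[str]:
    """Build the docker run argv that mounts each host folder into the container."""
    args = ["docker", "run", "--rm"]
    for name in VOLUMES:
        host_dir = (workdir / name).resolve()
        if not host_dir.is_dir():
            raise FileNotFoundError(f"expected folder is missing: {host_dir}")
        args += ["-v", f"{host_dir}:{CONTAINER_ROOT}/{name}"]
    return args + [image]


def run_retraining(workdir: Path) -> None:
    """Run the container; needs Docker and the retrain:slim image built above."""
    subprocess.run(docker_run_args(workdir), check=True)
```

Calling `run_retraining(Path.cwd())` from ml-pipelines/training/tone_check would be equivalent to the shell command, assuming the image has been built.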

Output:

$ docker build --target production -f .pipeline/training/tone_check/retrain/blubber.yaml -t retrain:slim .

$ pwd
# ml-pipelines/training/tone_check

$ docker run --rm \
  -v $(pwd)/data:/srv/edit_check/training/tone_check/data \
  -v $(pwd)/base_model:/srv/edit_check/training/tone_check/base_model \
  -v $(pwd)/output:/srv/edit_check/training/tone_check/output \
  retrain:slim

# Output
INFO:root: -- Welcome to Tonecheck Retraining Job --
INFO:root:DEVICE: cpu
>>>> STARTED!!!
Generating train split: 3000 examples [00:00, 78989.27 examples/s]

INFO:root:train_dataset['train'][10]:
{'input': 'en[SEP]Peacock_Detection[SEP]they operate in several international markets . the company was founded in 1980 and has grown steadily since . this organization focuses on delivering quality services . he worked in finance and operations for over a decade . they operate in several international markets . this organization focuses on delivering quality services . the company was founded in 1980 and has grown steadily since . employees undergo regular training programs . he worked in finance and operations for over a decade . employees undergo regular training programs . the report was submitted for review . he worked in finance and operations for over a decade . the company was founded in 1980 and has grown steadily since . they operate in several international markets . employees undergo regular training programs . they operate in several international markets . employees undergo regular training programs . the company was founded in 1980 and has grown steadily since . he worked in finance and operations for over a decade . the system was updated to meet new regulatory requirements . the report was submitted for review . this organization focuses on delivering quality services . the company was founded in 1980 and has grown steadily since . this organization focuses on delivering quality services . they operate in several international markets . the report was submitted for review . the system was updated to meet new regulatory requirements . they operate in several international markets . they operate in several international markets . they operate in several international markets .', 'label': 0}

INFO:root:tokenizer loaded
DatasetDict({
    train: Dataset({
        features: ['input', 'label'],
        num_rows: 2700
    })
    test: Dataset({
        features: ['input', 'label'],
        num_rows: 300
    })
})
Map: 100%|██████████| 2700/2700 [00:00<00:00, 7525.25 examples/s]
Map: 100%|██████████| 300/300 [00:00<00:00, 7653.61 examples/s]
Filter: 100%|██████████| 2700/2700 [00:00<00:00, 6217.11 examples/s]
Filter: 100%|██████████| 300/300 [00:00<00:00, 5845.64 examples/s]

INFO:root:model is loaded: BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

INFO:root:Start training
 0%| | 0/20 [00:00<?, ?it/s]/opt/lib/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:665: UserWarning: 'pin_memory' argument is set as true but no accelerator is found, then device pinned memory won't be used.
  warnings.warn(warn_msg)
 20%|██ | 4/20 [00:28<01:37, 6.12s/it/opt/lib/venv/lib/python3.11/site-packages/transformers/configuration_utils.py:393: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.
Non-default generation parameters: {'max_length': 512}
  warnings.warn(
/opt/lib/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:665: UserWarning: 'pin_memory' argument is set as true but no accelerator is found, then device pinned memory won't be used.
  warnings.warn(warn_msg)
 40%|████ | 8/20 [00:54<01:10, 5.86s/it]{'eval_loss': 0.38032618165016174, 'eval_acc': 1.0, 'eval_roc_auc': 1.0, 'eval_threshold': 0.8366155624389648, 'eval_pr_auc': 1.0, 'eval_recall': 1.0, 'eval_precision': 1.0, 'eval_f1': 1.0, 'eval_tn': 2, 'eval_fp': 0, 'eval_fn': 0, 'eval_tp': 1, 'eval_runtime': 0.6659, 'eval_samples_per_second': 4.505, 'eval_steps_per_second': 1.502, 'epoch': 1.0}
 40%|████ | 8/20 [00:54<01:10, 5.86s/it]/opt/lib/venv/lib/python3.11/site-packages/transformers/configuration_utils.py:393: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.
Non-default generation parameters: {'max_length': 512}
  warnings.warn(
/opt/lib/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:665: UserWarning: 'pin_memory' argument is set as true but no accelerator is found, then device pinned memory won't be used.
  warnings.warn(warn_msg)
 60%|██████ | 12/20 [01:20<00:47, {'eval_loss': 0.29208609461784363, 'eval_acc': 1.0, 'eval_roc_auc': 1.0, 'eval_threshold': 0.8407101035118103, 'eval_pr_auc': 1.0, 'eval_recall': 1.0, 'eval_precision': 1.0, 'eval_f1': 1.0, 'eval_tn': 2, 'eval_fp': 0, 'eval_fn': 0, 'eval_tp': 1, 'eval_runtime': 0.5785, 'eval_samples_per_second': 5.186, 'eval_steps_per_second': 1.729, 'epoch': 2.0}
 60%|██████ | 12/20 [01:21<00:47, 5.91s/it/opt/lib/venv/lib/python3.11/site-packages/transformers/configuration_utils.py:393: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.
Non-default generation parameters: {'max_length': 512}
  warnings.warn(
/opt/lib/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:665: UserWarning: 'pin_memory' argument is set as true but no accelerator is found, then device pinned memory won't be used.
  warnings.warn(warn_msg)
 80%|████████ | 16/20 [01:47<00:23, 5.87s/it]{'eval_loss': 0.23636728525161743, 'eval_acc': 1.0, 'eval_roc_auc': 1.0, 'eval_threshold': 0.8521502614021301, 'eval_pr_auc': 1.0, 'eval_recall': 1.0, 'eval_precision': 1.0, 'eval_f1': 1.0, 'eval_tn': 2, 'eval_fp': 0, 'eval_fn': 0, 'eval_tp': 1, 'eval_runtime': 0.5762, 'eval_samples_per_second': 5.207, 'eval_steps_per_second': 1.736, 'epoch': 3.0}
 80%|████████ | 16/20 [01:47<00:23, 5.87s/it/opt/lib/venv/lib/python3.11/site-packages/transformers/configuration_utils.py:393: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.
Non-default generation parameters: {'max_length': 512}
  warnings.warn(
/opt/lib/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:665: UserWarning: 'pin_memory' argument is set as true but no accelerator is found, then device pinned memory won't be used.
  warnings.warn(warn_msg)
100%|██████████| 20/20 [02:13<00:00, 5.84s/it]{'eval_loss': 0.20355510711669922, 'eval_acc': 1.0, 'eval_roc_auc': 1.0, 'eval_threshold': 0.8665211796760559, 'eval_pr_auc': 1.0, 'eval_recall': 1.0, 'eval_precision': 1.0, 'eval_f1': 1.0, 'eval_tn': 2, 'eval_fp': 0, 'eval_fn': 0, 'eval_tp': 1, 'eval_runtime': 0.6483, 'eval_samples_per_second': 4.628, 'eval_steps_per_second': 1.543, 'epoch': 4.0}
100%|██████████| 20/20 [02:14<00:00, 5.84s/it/opt/lib/venv/lib/python3.11/site-packages/transformers/configuration_utils.py:393: UserWarning: Some non-default generation parameters are set in the model config. These should go into either a) `model.generation_config` (as opposed to `model.config`); OR b) a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model).This warning will become an exception in the future.
Non-default generation parameters: {'max_length': 512}
  warnings.warn(
100%|██████████| 20/20 [02:16<00:00, 6.82s/it]
INFO:root:Trainning took: 136.44 secs
{'eval_loss': 0.19127391278743744, 'eval_acc': 1.0, 'eval_roc_auc': 1.0, 'eval_threshold': 0.8711633682250977, 'eval_pr_auc': 1.0, 'eval_recall': 1.0, 'eval_precision': 1.0, 'eval_f1': 1.0, 'eval_tn': 2, 'eval_fp': 0, 'eval_fn': 0, 'eval_tp': 1, 'eval_runtime': 0.5683, 'eval_samples_per_second': 5.279, 'eval_steps_per_second': 1.76, 'epoch': 5.0}
{'train_runtime': 136.3646, 'train_samples_per_second': 0.99, 'train_steps_per_second': 0.147, 'train_loss': 0.34897818565368655, 'epoch': 5.0}
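
As a sanity check on the evaluation lines in the log, the reported accuracy, precision, recall and F1 can be re-derived from the confusion-matrix counts (`eval_tn`, `eval_fp`, `eval_fn`, `eval_tp`). The function below is a sketch using the standard definitions; it is not the metric code from the retraining script, which may compute these differently.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "acc": (tn + tp) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts reported at every evaluation step in the log.
print(classification_metrics(tn=2, fp=0, fn=0, tp=1))
# {'acc': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
```

The counts add up to only 3 evaluated examples, so the perfect scores in this local run confirm that the pipeline executes end to end rather than say anything about model quality.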

Update

Managed to build with the base image docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-3-20250511 and publish the result to docker-registry.wikimedia.org/repos/machine-learning/ml-pipelines.

The problem I faced when publishing the image earlier was probably caused by using python:3.11-slim as the base image.

I could not test this configuration locally with the docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-3-20250511 base image because docker run failed with the following error:

Fatal Python error: Illegal instruction
Current thread 0x00007fffff4b9b80 (most recent call first):
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1233 in create_module
  File "<frozen importlib._bootstrap>", line 573 in module_from_spec
  File "<frozen importlib._bootstrap>", line 676 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/opt/lib/python/site-packages/torch/__init__.py", line 237 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/srv/edit_check/training/tone_check/retrain/retrain.py", line 4 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 940 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "/srv/edit_check/tests/training/tone_check/retrain/retrain_unit_test.py", line 9 in <module>
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/assertion/rewrite.py", line 186 in exec_module
  File "<frozen importlib._bootstrap>", line 690 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1149 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1178 in _find_and_load
  File "<frozen importlib._bootstrap>", line 1206 in _gcd_import
  File "/usr/lib/python3.11/importlib/__init__.py", line 126 in import_module
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/pathlib.py", line 587 in import_path
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/python.py", line 498 in importtestmodule
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/python.py", line 551 in _getobj
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/python.py", line 280 in obj
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/python.py", line 567 in _register_setup_module_fixture
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/python.py", line 554 in collect
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/runner.py", line 389 in collect
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/runner.py", line 344 in from_call
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/runner.py", line 391 in pytest_make_collect_report
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/runner.py", line 567 in collect_one_node
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 839 in _collect_one_node
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 974 in genitems
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 813 in perform_collect
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 353 in pytest_collection
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 342 in _main
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 289 in wrap_session
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/main.py", line 336 in pytest_cmdline_main
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/lib/venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/opt/lib/venv/lib/python3.11/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/opt/lib/venv/lib/python3.11/site-packages/pytest/__main__.py", line 9 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg (total: 2)
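
The trace shows the interpreter dying while `torch/__init__.py` loads an extension module, which takes the whole pytest run down with it. A minimal sketch for detecting this on a given machine without crashing the caller is to attempt the import in a child interpreter and inspect its exit status. The helper names are hypothetical, the check is POSIX-only, and it does not explain why the instruction is illegal on this host.

```python
import signal
import subprocess
import sys


def run_probe(code: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def import_crashes_with_sigill(module: str) -> bool:
    """True if importing the module kills the child process with SIGILL."""
    # On POSIX, subprocess reports death by signal N as return code -N.
    return run_probe(f"import {module}") == -signal.SIGILL
```

Running `import_crashes_with_sigill("torch")` inside the container would show whether the failure is tied to the local host, as opposed to the test code itself.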

I could build the image locally but not run it.

Nevertheless, both the test and production variants ran and published successfully via the kokkuri pipeline.
The results are in kokkuri_test_pipeline and kokkuri_publish_pipeline.

With this done, we can move on to attaching the PVC volume once it is available.

achou updated Other Assignee, added: achou; removed: AikoChou. Aug 22 2025, 7:39 AM

kevinbazira updated https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/44

tone-check: refactor retraining image and script for DAG integration