
Add Link training pipeline gets stuck for certain wikis
Open, Stalled, Needs Triage · Public

Description

I successfully retrained Add Link models for several wikis as part of T385780: Retrain Add Link models for Surfacing Structured Tasks pilot wikis. However, the run_pipeline.sh script gets stuck for certain wikis, such as fawiki. I ran WIKI_ID=fawiki ./run_pipeline.sh several times, and it always gets stuck with output that looks like this:

RUNNING wikipedia2vec on dump
[2025-02-23 18:45:38,866] [INFO] Starting to build a Dump DB... (train@cli.py:167)
[2025-02-23 18:46:42,522] [INFO] Processed: 100000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:47:24,712] [INFO] Processed: 200000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:47:49,421] [INFO] Processed: 300000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:48:15,206] [INFO] Processed: 400000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:48:40,221] [INFO] Processed: 500000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:01,624] [INFO] Processed: 600000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:17,222] [INFO] Processed: 700000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:27,672] [INFO] Processed: 800000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:52,850] [INFO] Processed: 900000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:50:12,734] [INFO] Processed: 1000000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:50:31,309] [INFO] Processed: 1100000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:50:45,251] [INFO] Processed: 1200000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:51:01,738] [INFO] Processed: 1300000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:51:24,854] [INFO] Processed: 1400000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:51:55,359] [INFO] Processed: 1500000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:52:11,238] [INFO] Processed: 1600000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:52:35,035] [INFO] Processed: 1700000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:53:06,955] [INFO] Processed: 1800000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:53:42,425] [INFO] Processed: 1900000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:54:13,262] [INFO] Processed: 2000000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:54:50,196] [INFO] Processed: 2100000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:55:25,172] [INFO] Processed: 2200000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:55:57,525] [INFO] Processed: 2300000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:56:31,249] [INFO] Processed: 2400000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:57:07,460] [INFO] Processed: 2500000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:57:37,293] [INFO] Processed: 2600000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:58:11,897] [INFO] Processed: 2700000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:59:00,373] [INFO] Processed: 2800000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:59:49,701] [INFO] Processed: 2900000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 19:00:17,285] [INFO] Starting to build a dictionary... (train@cli.py:172)
[2025-02-23 19:00:17,293] [INFO] Step 1/2: Processing Wikipedia pages... (build_dictionary@cli.py:234)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1029156/1029156 [02:17<00:00, 7488.18it/s]
[2025-02-23 19:02:34,767] [INFO] Step 2/2: Processing Wikipedia redirects... (build_dictionary@cli.py:234)
[2025-02-23 19:03:36,925] [INFO] 316436 words and 1335058 entities are indexed in the dictionary (build_dictionary@cli.py:234)
[2025-02-23 19:03:37,165] [INFO] Starting to build a link graph... (train@cli.py:177)
[2025-02-23 19:03:37,194] [INFO] Step 1/2: Processing Wikipedia pages... (build_link_graph@cli.py:247)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1029156/1029156 [00:18<00:00, 55828.12it/s]
[2025-02-23 19:03:56,025] [INFO] Step 2/2: Converting matrix... (build_link_graph@cli.py:247)
[2025-02-23 19:04:00,361] [INFO] Starting to build a mention DB... (train@cli.py:186)
[2025-02-23 19:04:00,457] [INFO] Step 1/3: Starting to iterate over Wikipedia pages... (build_mention_db@cli.py:272)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1029156/1029156 [00:24<00:00, 41960.52it/s]
[2025-02-23 19:04:25,428] [INFO] Step 2/3: Starting to count occurrences... (build_mention_db@cli.py:272)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1029156/1029156 [02:17<00:00, 7488.42it/s]
[2025-02-23 19:06:45,290] [INFO] Step 3/3: Building DB... (build_mention_db@cli.py:272)
[2025-02-23 19:06:52,435] [INFO] Starting to train embeddings... (train@cli.py:194)
[2025-02-23 19:06:52,500] [INFO] Building a table for sampling frequent words... (train_embedding@cli.py:326)
[2025-02-23 19:06:54,677] [INFO] Building tables for sampling negatives... (train_embedding@cli.py:326)
[2025-02-23 19:07:21,145] [INFO] Building a table for iterating links... (train_embedding@cli.py:326)
[2025-02-23 19:07:23,002] [INFO] Initializing weights... (train_embedding@cli.py:326)
Iteration 1/5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1029156/1029156 [07:55<00:00, 2166.27it/s]
Iteration 2/5:  69%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                        | 709900/1029156 [08:20<04:19, 1229.01it/s]

It stays at 69% essentially indefinitely (for several days). It looks like something fawiki-specific is preventing the retraining from finishing. I tried killing the script, deleting the data folder, and re-running, but to no avail.
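For what it's worth, one way to tell whether a run stuck like this is still computing or blocked on something is to dump the Python stack of the live process. This is only a hedged diagnostic sketch, not part of run_pipeline.sh: both the py-spy tool (installable via pip) and the process-name pattern below are assumptions and would need adjusting to the actual setup.

```shell
# Find the oldest process whose command line mentions wikipedia2vec
# (pattern is a guess; adjust to what `ps` actually shows).
PID=$(pgrep -fo 'wikipedia2vec' || true)
if [ -n "$PID" ] && command -v py-spy >/dev/null 2>&1; then
  # One-shot stack trace of every thread in the stuck trainer.
  py-spy dump --pid "$PID"
else
  echo "need a running wikipedia2vec process and py-spy installed"
fi
```

If the dump shows the same frame on repeated runs, the trainer is likely stuck in a loop or deadlock rather than just slow.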

Event Timeline

Restricted Application added subscribers: Huji, Aklapper.

@MGerlach If you have any insights on what might be going on here, they would be greatly appreciated!

@Urbanecm_WMF this is related to the wikipedia2vec package that generates the embeddings for the articles. I remember that their latest release (v2) introduced issues with some languages. We therefore tried to stick with version 1.0.5, which, however, I was only able to run with python3.7 (and not higher versions). See, for example, the comment to use python3.7 in run-pipeline.sh. I couldn't immediately verify this, as I was not able to build a virtual environment with python3.7 (I only saw python3.9 on stat1008).
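The known-good combination described above (wikipedia2vec 1.0.5 under python3.7) could be reproduced roughly as below. This is a hedged sketch: it assumes a python3.7 interpreter is somehow available on the host (e.g. via pyenv or conda), which was not the case on stat1008, and the environment name wp2vec-env is made up.

```shell
# Hedged sketch: pin the wikipedia2vec 1.0.5 + python3.7 combination.
# Assumes python3.7 exists on the host; the venv name is arbitrary.
if command -v python3.7 >/dev/null 2>&1; then
  python3.7 -m venv wp2vec-env
  . wp2vec-env/bin/activate
  pip install --upgrade pip
  # wikipedia2vec v2.x reportedly introduced issues with some languages,
  # so stay on the last known-good 1.x release.
  pip install 'wikipedia2vec==1.0.5'
else
  echo "python3.7 not found; provide it via pyenv/conda before retrying"
fi
```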

FYI: we removed the dependency on wikipedia2vec in the new airflow training pipeline T361926 (I know it's not a fix for this task, but it is one more reason to make the output of that pipeline compatible with your current setup). cc @fkaelin

Michael changed the task status from Open to Stalled. · Mar 18 2025, 2:43 PM
Michael moved this task from Inbox to Blocked on the Growth-Team board.
Michael subscribed.

So, this is expected to be resolved by implementing the new way of training these models with the new airflow training pipeline. The task to make that happen is: T388258: Make airflow-dag for addalink training pipeline output compatible with deployed model

Once T388258 is done, we can revisit this task and probably close it.