I successfully retrained Add Link models for several wikis as part of T385780: Retrain Add Link models for Surfacing Structured Tasks pilot wikis. However, the `run_pipeline.sh` script gets stuck for certain wikis, such as fawiki. I have run `WIKI_ID=fawiki ./run_pipeline.sh` several times, and it always gets stuck with output that looks like this:
```
RUNNING wikipedia2vec on dump
[2025-02-23 18:45:38,866] [INFO] Starting to build a Dump DB... (train@cli.py:167)
[2025-02-23 18:46:42,522] [INFO] Processed: 100000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:47:24,712] [INFO] Processed: 200000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:47:49,421] [INFO] Processed: 300000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:48:15,206] [INFO] Processed: 400000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:48:40,221] [INFO] Processed: 500000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:01,624] [INFO] Processed: 600000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:17,222] [INFO] Processed: 700000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:27,672] [INFO] Processed: 800000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:49:52,850] [INFO] Processed: 900000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:50:12,734] [INFO] Processed: 1000000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:50:31,309] [INFO] Processed: 1100000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:50:45,251] [INFO] Processed: 1200000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:51:01,738] [INFO] Processed: 1300000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:51:24,854] [INFO] Processed: 1400000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:51:55,359] [INFO] Processed: 1500000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:52:11,238] [INFO] Processed: 1600000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:52:35,035] [INFO] Processed: 1700000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:53:06,955] [INFO] Processed: 1800000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:53:42,425] [INFO] Processed: 1900000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:54:13,262] [INFO] Processed: 2000000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:54:50,196] [INFO] Processed: 2100000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:55:25,172] [INFO] Processed: 2200000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:55:57,525] [INFO] Processed: 2300000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:56:31,249] [INFO] Processed: 2400000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:57:07,460] [INFO] Processed: 2500000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:57:37,293] [INFO] Processed: 2600000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:58:11,897] [INFO] Processed: 2700000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:59:00,373] [INFO] Processed: 2800000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 18:59:49,701] [INFO] Processed: 2900000 pages (__iter__@wiki_dump_reader.py:37)
[2025-02-23 19:00:17,285] [INFO] Starting to build a dictionary... (train@cli.py:172)
[2025-02-23 19:00:17,293] [INFO] Step 1/2: Processing Wikipedia pages... (build_dictionary@cli.py:234)
100%|██████████| 1029156/1029156 [02:17<00:00, 7488.18it/s]
[2025-02-23 19:02:34,767] [INFO] Step 2/2: Processing Wikipedia redirects... (build_dictionary@cli.py:234)
[2025-02-23 19:03:36,925] [INFO] 316436 words and 1335058 entities are indexed in the dictionary (build_dictionary@cli.py:234)
[2025-02-23 19:03:37,165] [INFO] Starting to build a link graph... (train@cli.py:177)
[2025-02-23 19:03:37,194] [INFO] Step 1/2: Processing Wikipedia pages... (build_link_graph@cli.py:247)
100%|██████████| 1029156/1029156 [00:18<00:00, 55828.12it/s]
[2025-02-23 19:03:56,025] [INFO] Step 2/2: Converting matrix... (build_link_graph@cli.py:247)
[2025-02-23 19:04:00,361] [INFO] Starting to build a mention DB... (train@cli.py:186)
[2025-02-23 19:04:00,457] [INFO] Step 1/3: Starting to iterate over Wikipedia pages... (build_mention_db@cli.py:272)
100%|██████████| 1029156/1029156 [00:24<00:00, 41960.52it/s]
[2025-02-23 19:04:25,428] [INFO] Step 2/3: Starting to count occurrences... (build_mention_db@cli.py:272)
100%|██████████| 1029156/1029156 [02:17<00:00, 7488.42it/s]
[2025-02-23 19:06:45,290] [INFO] Step 3/3: Building DB... (build_mention_db@cli.py:272)
[2025-02-23 19:06:52,435] [INFO] Starting to train embeddings... (train@cli.py:194)
[2025-02-23 19:06:52,500] [INFO] Building a table for sampling frequent words... (train_embedding@cli.py:326)
[2025-02-23 19:06:54,677] [INFO] Building tables for sampling negatives... (train_embedding@cli.py:326)
[2025-02-23 19:07:21,145] [INFO] Building a table for iterating links... (train_embedding@cli.py:326)
[2025-02-23 19:07:23,002] [INFO] Initializing weights... (train_embedding@cli.py:326)
Iteration 1/5: 100%|██████████| 1029156/1029156 [07:55<00:00, 2166.27it/s]
Iteration 2/5:  69%|██████▉   | 709900/1029156 [08:20<04:19, 1229.01it/s]
```
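Since the process is still alive but making no progress, a stack dump of the hung Python process might show where it is spinning or blocked. A minimal sketch using py-spy (the tool choice and process-name pattern are my assumptions, not part of the pipeline):

```
# Locate the hung wikipedia2vec training process; the pattern is an
# assumption -- adjust it to whatever the process list actually shows.
pgrep -af wikipedia2vec

# Dump the Python stack of the running process without stopping it
# (requires py-spy, e.g. `pip install py-spy`).
py-spy dump --pid <PID>
```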
It stays at 69% indefinitely (I have left it running for several days). It looks like something fawiki-related is preventing the retraining from finishing. I tried killing the script, deleting the data folder, and re-running, but to no avail.
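One way to narrow this down would be to rerun just the embedding training outside the pipeline with a single worker, to separate a possible multiprocessing deadlock from a genuine data problem in the fawiki dump. A rough sketch; both paths are assumptions about where run_pipeline.sh keeps its files, and the flags should be checked against `wikipedia2vec train --help`:

```
# Re-run only the wikipedia2vec training step with one worker. If this
# completes (slowly) or fails with a real error, the hang above is likely
# a worker-pool deadlock rather than something in the fawiki data itself.
# Both paths below are assumptions; point them at the pipeline's actual files.
wikipedia2vec train \
    --pool-size 1 \
    data/fawiki-latest-pages-articles.xml.bz2 \
    data/fawiki.wikipedia2vec.bin
```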