While debugging a sqoop failure, we found that grouped_wikis.csv, the wiki list used by the sqoop jobs, is written by the Airflow job ingestion_wikis_monthly *after* some sqoop jobs in the list have already started. This means the first few sqoop jobs see one list of wikis, and the remaining sqoop jobs see a newer, possibly different, list of wikis.
ingestion_wikis_monthly: Ended 2026-05-01, 02:03:00 UTC
File was created at 2026-05-01 02:02
akhatun@an-launcher1003:~$ sudo -u analytics hdfs dfs -ls /wmf/data/wmf/mediawiki/database/grouped_wikis.csv
-rw-r--r-- 3 analytics analytics-privatedata-users 19486 2026-05-01 02:02 /wmf/data/wmf/mediawiki/database/grouped_wikis.csv
The first sqoop started at 2026-05-01T00:00:10, about two hours before the file was written:
2026-05-01T00:00:10 INFO ************ NOTE ************
2026-05-01T00:00:10 INFO When sqooping from cloud, resulting data will be shareable with the public but when sqooping from production, resulting data may need to be redacted or otherwise anonymized before sharing.
2026-05-01T00:00:10 INFO ^^^^^^^^^^^^ NOTE ^^^^^^^^^^^^
2026-05-01T00:00:12 INFO Checking HDFS paths
2026-05-01T00:01:01 INFO Generating ORM jar at /tmp/sqoop-jars/2026-05-01T00:01:01/mediawiki-tables-sqoop-orm.jar
2026-05-01T00:01:01 INFO STARTING: etwiki.archive (try 1)
2026-05-01T00:01:01 INFO STARTING: etwiki.change_tag (try 1)
2026-05-01T00:01:01 INFO STARTING: etwiki.change_tag_def (try 1)
....
Once the grouped_wikis.csv is updated, the remaining sqoop jobs would get a possibly modified version of the wiki list.
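To make the timeline concrete, here is a minimal Python sketch of the check that would flag this condition. It is an illustration only, not code from the sqoop script or the Airflow DAG: the function name `modified_during_run` is hypothetical, and it assumes the HDFS listing timestamp is in UTC (the listing only has minute precision).

```python
from datetime import datetime, timezone


def modified_during_run(file_mtime: datetime, run_start: datetime) -> bool:
    """Return True if the wiki list was written after the sqoop run began."""
    return file_mtime > run_start


# Timestamps copied from this ticket.
# Assumption: the `hdfs dfs -ls` time is UTC.
csv_mtime = datetime(2026, 5, 1, 2, 2, tzinfo=timezone.utc)
first_sqoop_start = datetime(2026, 5, 1, 0, 0, 10, tzinfo=timezone.utc)

# How long the run had been going when the file was written.
gap = csv_mtime - first_sqoop_start

print(modified_during_run(csv_mtime, first_sqoop_start))  # True
print(gap)  # 2:01:50
```

A guard like this could run before each sqoop job starts, but it only detects the problem. Reading the wiki list once at the beginning of the run, or making the sqoop jobs wait for ingestion_wikis_monthly to finish, would avoid it.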