Inconsistent wiki list: grouped_wikis.csv extended *after* some sqoop jobs have already started
Open, Medium, Public

Description

While debugging a sqoop failure, we found that the Airflow job ingestion_wikis_monthly, which extends the grouped_wikis.csv file used by the sqoop jobs, finishes *after* some of those sqoop jobs have already started. This means the first few sqoop jobs see one list of wikis, while the remaining sqoop jobs see a newer, possibly different, list of wikis.

ingestion_wikis_monthly: Ended 2026-05-01, 02:03:00 UTC

File was created at 2026-05-01 02:02

akhatun@an-launcher1003:~$ sudo -u analytics hdfs dfs -ls /wmf/data/wmf/mediawiki/database/grouped_wikis.csv
-rw-r--r--   3 analytics analytics-privatedata-users      19486 2026-05-01 02:02 /wmf/data/wmf/mediawiki/database/grouped_wikis.csv

The first sqoop job started at 2026-05-01T00:00:10, about two hours before the file was written:

2026-05-01T00:00:10 INFO   ************ NOTE ************
2026-05-01T00:00:10 INFO   When sqooping from cloud, resulting data will be shareable with the public but when sqooping from production, resulting data may need to be redacted or otherwise anonymized before sharing.
2026-05-01T00:00:10 INFO   ^^^^^^^^^^^^ NOTE ^^^^^^^^^^^^
2026-05-01T00:00:12 INFO   Checking HDFS paths
2026-05-01T00:01:01 INFO   Generating ORM jar at /tmp/sqoop-jars/2026-05-01T00:01:01/mediawiki-tables-sqoop-orm.jar
2026-05-01T00:01:01 INFO   STARTING: etwiki.archive (try 1)
2026-05-01T00:01:01 INFO   STARTING: etwiki.change_tag (try 1)
2026-05-01T00:01:01 INFO   STARTING: etwiki.change_tag_def (try 1)
....

Once the grouped_wikis.csv is updated, the remaining sqoop jobs would get a possibly modified version of the wiki list.
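
The timing comparison above can be sketched as a small check. This is only an illustration, not part of the sqoop or Airflow tooling: the helper names (parse_hdfs_ls_mtime, parse_log_timestamp, started_before_list_update) are hypothetical, and it assumes that both the `hdfs dfs -ls` timestamp and the sqoop log timestamps are in UTC and in the formats pasted above.

```python
from datetime import datetime, timezone


def parse_hdfs_ls_mtime(ls_line: str) -> datetime:
    """Extract the modification time from one line of `hdfs dfs -ls` output.

    Assumes the column layout shown in the ticket:
    permissions, replication, owner, group, size, date, time, path.
    Assumes the timestamp is UTC (an assumption, not confirmed by the ticket).
    """
    fields = ls_line.split()
    stamp = f"{fields[5]} {fields[6]}"
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def parse_log_timestamp(log_line: str) -> datetime:
    """Extract the leading timestamp from a sqoop log line (assumed UTC)."""
    stamp = log_line.split()[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def started_before_list_update(log_line: str, ls_line: str) -> bool:
    """Return True if the logged sqoop event predates the file's mtime."""
    return parse_log_timestamp(log_line) < parse_hdfs_ls_mtime(ls_line)


# Values copied from the output pasted in this task.
LS_LINE = (
    "-rw-r--r--   3 analytics analytics-privatedata-users      19486 "
    "2026-05-01 02:02 /wmf/data/wmf/mediawiki/database/grouped_wikis.csv"
)
LOG_LINE = "2026-05-01T00:01:01 INFO   STARTING: etwiki.archive (try 1)"

if __name__ == "__main__":
    print(started_before_list_update(LOG_LINE, LS_LINE))
```

Any sqoop job for which this check returns True started before the current grouped_wikis.csv was written, so it may have read the earlier version of the list.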