
Data pipeline weekly schedule is on hold
Closed, Resolved · Public

Description

The image suggestions data pipeline should follow the Wikidata snapshot release schedule, i.e., weekly on Mondays: this would let the pipeline leverage the freshest data.

However, a manual run of the Airflow job produced empty commonswiki_file.py output for the 2022-04-11 Wikidata snapshot.
The main suspects are the joins between the weekly Wikidata snapshots and the monthly snapshots of the other wikis: at the beginning of April, for instance, Wikidata already has a 2022-04-04 snapshot, while the other wiki tables may still be on the March one.
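To make the suspected failure mode concrete, here is a minimal PySpark sketch; the table names, snapshot values, and join key are illustrative assumptions, not the pipeline's actual schema. Filtering a monthly-partitioned table by a weekly snapshot date it does not have yields an empty DataFrame, and any inner join against it is empty as well.

```python
# Minimal sketch, assuming illustrative table/column names (not the pipeline's).
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Weekly-partitioned Wikidata table: a 2022-04-11 snapshot exists.
wikidata = spark.table("wmf.wikidata_item_page_link").where("snapshot = '2022-04-11'")

# Monthly-partitioned wiki table: there is no '2022-04-11' partition
# (only e.g. '2022-03'), so this filter returns zero rows.
pages = spark.table("wmf_raw.mediawiki_page").where("snapshot = '2022-04-11'")

# An inner join against an empty DataFrame is empty, which would explain
# the empty commonswiki_file output.
joined = wikidata.join(pages, on="page_id", how="inner")
print(joined.count())  # 0 when the snapshot partitions do not line up
```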
This requires further investigation to answer the following questions:

  • Is the Airflow Hive sensor that waits for the latest Wikidata snapshot still useful?
  • Should we agree on a different schedule that ensures workable data?

Update: solution

We agreed with @JAllemandou that the best solution is to let the Airflow Hive sensor wait for the latest available snapshot (read: partition) of every table we query in the pipeline. This ensures the freshest data.
Usually, wmf_raw tables need 2-3 days of processing before they become available, but there may be additional delays for unexpected reasons.
As a result, we set a 6-day timeout on the sensor, so that the next scheduled pipeline run can still be triggered weekly.
See https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/merge_requests/55
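For illustration, such a sensor could be wired up roughly as follows. This is a minimal sketch using Airflow's NamedHivePartitionSensor; the DAG id, schedule, and table/partition names are assumptions, and the actual change is the merge request above.

```python
# Minimal sketch, not the pipeline's actual DAG (see the merge request above).
# Assumes Airflow 2.x with the Apache Hive provider; table/partition names are
# illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.sensors.named_hive_partition import (
    NamedHivePartitionSensor,
)

with DAG(
    dag_id="image_suggestions_example",
    start_date=datetime(2022, 5, 2),
    schedule_interval="0 0 * * 1",  # weekly, on Mondays
    catchup=False,
) as dag:
    # Wait for the latest available snapshot (read: partition) of every
    # table the pipeline queries.
    wait_for_snapshots = NamedHivePartitionSensor(
        task_id="wait_for_input_snapshots",
        partition_names=[
            # weekly Wikidata snapshot, e.g. snapshot=2022-04-11
            "wmf.wikidata_item_page_link/snapshot={{ ds }}",
            # monthly wiki snapshot, e.g. snapshot=2022-04
            "wmf_raw.mediawiki_page/snapshot={{ macros.ds_format(ds, '%Y-%m-%d', '%Y-%m') }}",
        ],
        poke_interval=60 * 60,     # re-check once per hour
        timeout=6 * 24 * 60 * 60,  # give up after 6 days
        mode="reschedule",         # free the worker slot between checks
    )
```

With the 6-day timeout, a run whose input partitions never all appear fails before the next weekly run is scheduled, instead of blocking it.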

Event Timeline

mfossati changed the task status from Open to In Progress. May 5 2022, 8:51 AM