TLDR: Downstream Airflow jobs seem to consume webrequests partitions when such partitions may not be ready for downstream consumption. What can we do about this?
Longer version:
While debugging an Airflow failure on the projectview_hourly DAG, we determined that the root cause was that we were consuming empty Hive partitions. My strong suspicion is that the partition on webrequests was created before the data was actually available. The data eventually landed, but downstream jobs had already triggered and thus acted on the empty partition.
For details of the original investigation, please see https://lists.wikimedia.org/hyperkitty/list/data-engineering-alerts@lists.wikimedia.org/thread/AX27YJC4N4B7A5GFD6DUVR5R7IAEDVVN/.
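One possible mitigation would be to gate downstream sensors on data presence rather than on bare partition existence. A minimal sketch of that check, assuming a hypothetical readiness predicate (the function name, the `min_rows` threshold, and the fake metastore mapping are all illustrative, not our actual sensor code):

```python
def partition_is_ready(partition_exists: bool, row_count: int, min_rows: int = 1) -> bool:
    """A partition is consumable only if it exists AND contains data.

    Checking existence alone reproduces the bug described above: the
    partition can be registered in the metastore before data lands.
    """
    return partition_exists and row_count >= min_rows


# Illustrative usage with a fake metastore snapshot: the partition is
# registered but empty, so a data-aware sensor would keep waiting.
metastore = {"webrequests/hour=2023-01-01T00": {"exists": True, "row_count": 0}}
part = metastore["webrequests/hour=2023-01-01T00"]
ready = partition_is_ready(part["exists"], part["row_count"])
# ready -> False: downstream jobs would not trigger on the empty partition
```

In Airflow terms, this predicate could back a PythonSensor (or a custom sensor) so the DAG pokes until both conditions hold, instead of firing as soon as the partition appears.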
The long term issue here is that perhaps the majority of our Airflow jobs depend on webrequests, and if the data there is not good, then none of the downstream jobs are good either. Additionally, since we currently have no provenance mechanism, restating the bad downstream jobs is a manual and error-prone exercise (we need to figure out by hand which Airflow jobs are downstream of the affected parent).
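If we had a machine-readable dependency map between DAGs, the "which jobs are downstream" step could at least be automated. A sketch under that assumption (the `deps` mapping below is hypothetical; we do not currently maintain such a map):

```python
from collections import deque


def downstream_of(deps: dict[str, list[str]], root: str) -> set[str]:
    """BFS over a parent -> children DAG-dependency map, returning every
    job transitively downstream of `root` (candidates for restatement)."""
    seen: set[str] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in deps.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical dependency map for illustration only.
deps = {
    "webrequests": ["projectview_hourly", "pageview_hourly"],
    "pageview_hourly": ["pageview_daily"],
}
affected = downstream_of(deps, "webrequests")
# affected -> {"projectview_hourly", "pageview_hourly", "pageview_daily"}
```

With provenance metadata feeding a map like this, a bad webrequests hour could be translated mechanically into the full set of jobs to restate, rather than reconstructing the fan-out by hand each time.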