Plan how to handle image-suggestions data pipeline failure if no engineers are available
Closed, ResolvedPublic

Description

We have fairly regular upstream failures in our image-suggestions pipelines (some dataset we depend on does not get generated). Sometimes something also stops working because part of the infrastructure has changed (e.g. we recently had to switch our data-checking script to be kicked off via skein rather than via spark).

So far our alerting has worked OK: we have been notified when something is wrong, and we have some instructions for what to do in case of a failure.

If, however, no engineer is available to act on a failure quickly (i.e. before the next pipeline run), it's unclear what should happen. This ticket gives us a place to make and record decisions about that.

Event Timeline

Options:

  • automatically pause the DAG if there is an error
  • put in a sensor for the last piece of data we generate in the previous week
  • a "regenerate dataset snapshot X" script that clears out the hive partitions for snapshot X and rewrites all the data
  • T338013
  • change the order of the DAG execution, and put diff generation last so Search only imports data once everything else is done
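
To make a few of these options concrete, here is a minimal plain-Python sketch of the sensor idea combined with running diff generation last. It is not the actual DAG code: the step names, the weekly snapshot cadence, and the `run_step` callback are all assumptions made for illustration, and a real implementation would use the scheduler's own sensor and task-ordering features.

```python
from datetime import date, timedelta

# Hypothetical step names; the real DAG's tasks are not listed in this ticket.
# Diff generation is deliberately last, so nothing new is exposed for import
# until every earlier step has succeeded.
STEPS = ["generate_suggestions", "check_data", "generate_diff"]


def previous_snapshot(snapshot):
    """Return the snapshot one week earlier (weekly cadence is an assumption)."""
    return snapshot - timedelta(days=7)


def can_run(snapshot, completed):
    """Sensor-style gate: allow a run only if last week's final output exists."""
    return previous_snapshot(snapshot) in completed


def run_pipeline(snapshot, completed, run_step):
    """Run the steps in order and stop at the first failure.

    `run_step(step, snapshot)` is a hypothetical callback returning True on
    success. The snapshot is recorded in `completed` only when every step,
    including diff generation, has succeeded. Returns True on full success.
    """
    if not can_run(snapshot, completed):
        return False
    for step in STEPS:
        if not run_step(step, snapshot):
            return False
    completed.add(snapshot)
    return True
```

With this gating, a failed week blocks the following weeks until someone reruns or regenerates the missing snapshot, which has much the same effect as pausing the DAG without needing anyone to act before the next run.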
MarkTraceur claimed this task.
MarkTraceur subscribed.

Closing per discussion in estimation - the planning is done and Cormac will create tickets for the above mitigation strategies.