
Automated monthly bulk ingestion runs
Open, Medium, Public, 8 Estimated Story Points

Description

We want the bulk ingestion process to run automatically every month, so that any data that was lost to incidents, or that is affected by schema changes we make in our system, is automatically reflected in all of the APIs.

Acceptance criteria
Either the bulk ingestion DAG runs automatically every month, or implementation tickets are created for the work required.

ToDo

  • Set up a schedule for the bulk-ingestion DAG in the scheduler (Airflow).
  • Make sure we don't generate snapshots that are double in size because the bulk ingestion process is in progress (a solution still needs to be designed).
  • Figure out what to do with batches while ingestion is running (they will be flooded with data from the ingestion).
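
As a rough illustration of the first item: Airflow's "@monthly" preset is equivalent to the cron expression "0 0 1 * *", which fires at midnight on the first day of each month. The sketch below computes the run dates that cadence produces in plain Python. The helper name next_monthly_run is hypothetical, timestamps are assumed to be naive UTC, and the actual day and time for the ingestion run have not been decided in this task.

```python
from datetime import datetime


def next_monthly_run(after: datetime) -> datetime:
    """Return the first midnight on day 1 of a month strictly after `after`.

    Mirrors the cadence of the cron expression "0 0 1 * *".
    Assumption: `after` is a naive datetime in UTC.
    """
    if after.month == 12:
        year, month = after.year + 1, 1  # roll over into January
    else:
        year, month = after.year, after.month + 1
    return datetime(year, month, 1)


if __name__ == "__main__":
    # List the next three scheduled runs after an arbitrary example date.
    moment = datetime(2024, 11, 20, 9, 0)
    for _ in range(3):
        moment = next_monthly_run(moment)
        print(moment.isoformat())
```

Because a full run can take several days, whichever schedule is chosen should probably also prevent a new run from starting while the previous one is still in progress.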

Notes
For more context about the ingestion process and how it's executed, please refer to the Bulk Ingestion Runbook v2 in the Runbooks directory on the Product/Eng. drive.
Also, use the Runbook to go through the process on the dev environment to get a solid grasp of the issue before starting the actual work.

Things to consider

  • We'll have to figure out what to do with the batches in the monthly ingestion scenario. Because of the way batches are implemented right now, they will not work properly on the days when ingestion is running.
  • Snapshots will need a well-thought-out solution for tracking when ingestion has started for a particular project. If we do nothing, snapshots will be double in size for a couple of days after the ingestion.
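
One possible direction for both points, sketched under assumptions rather than as a decided design: record per project when bulk ingestion starts and finishes, and have the snapshot and batch jobs check that record before they run. All names below (IngestionTracker, mark_started, mark_finished, should_run_snapshot) are hypothetical, and the in-memory dictionary stands in for whatever shared store the real system would use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class IngestionTracker:
    """Tracks which projects currently have a bulk ingestion in progress.

    Assumption: an in-memory dict replaces a shared store such as a
    database table that all jobs could read.
    """

    _started_at: Dict[str, datetime] = field(default_factory=dict)

    def mark_started(self, project: str, now: Optional[datetime] = None) -> None:
        # Record the start time so downstream jobs can see the ingestion window.
        self._started_at[project] = now or datetime.now(timezone.utc)

    def mark_finished(self, project: str) -> None:
        # Clearing the entry re-enables snapshots and batches for the project.
        self._started_at.pop(project, None)

    def is_running(self, project: str) -> bool:
        return project in self._started_at


def should_run_snapshot(tracker: IngestionTracker, project: str) -> bool:
    """Skip (or defer) the snapshot while ingestion is rewriting the project."""
    return not tracker.is_running(project)


if __name__ == "__main__":
    tracker = IngestionTracker()
    tracker.mark_started("project-a")
    print(should_run_snapshot(tracker, "project-a"))  # ingestion in progress
    tracker.mark_finished("project-a")
    print(should_run_snapshot(tracker, "project-a"))  # ingestion finished
```

The same check could gate batch generation. Whether affected jobs should be skipped, deferred, or filtered is still an open question for this task.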

Event Timeline

Protsack.stephan created this task.