This task will contain a migration plan and be used to track the production deployment of Refine in Airflow. Subtasks will be created if needed.
The task description should be updated as we 'refine' ٩(^‿^)۶ the migration plan.
Implementation of the refactor is tracked in T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation
Done is
- analytics-test-hadoop event ingestion Refine jobs are scheduled via Airflow
- analytics-test-hadoop systemd event ingestion Refine jobs are removed, and corresponding puppet code is deleted.
- analytics-hadoop event ingestion Refine jobs are scheduled via Airflow
- analytics-hadoop systemd event ingestion Refine jobs are removed, and corresponding puppet code is deleted.
Note that this task does not include Airflow-ization of:
- the refine_netflow systemd job
- the RefineSanitize systemd jobs
- the data_purge systemd jobs
Production cutover ideas
'Production' here means that the Airflow job is configured to write data to the event Hive tables.
Before we deploy to production, we plan to configure Airflow Refine to write in parallel to a temporary database (event_airflow, perhaps?). Once we feel confident with that, we will need to cut over to writing into the existing production event database.
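To make that cutover cheap, the output Hive database could be a single configurable value in the Airflow job. A minimal sketch (names and table-naming rules here are assumptions for illustration, not actual Refine code):

```python
# Hypothetical sketch: if the output Hive database is one configurable
# value, the cutover from the parallel test database to production is a
# one-line config change. Names are assumptions.
REFINE_OUTPUT_DATABASE = "event_airflow"  # flip to "event" at cutover

def refined_table(stream: str, database: str = REFINE_OUTPUT_DATABASE) -> str:
    """Return the fully qualified Hive table Refine would write for a stream."""
    # Hive identifiers cannot contain '.' or '-', so replace them.
    return f"{database}.{stream.replace('.', '_').replace('-', '_')}"
```

Flipping REFINE_OUTPUT_DATABASE (or its Airflow Variable equivalent) would then be the entire production switch.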
There are several ways we could do the actual production migrations.
- All at once
- Change systemd jobs to write to event_systemd(?) database.
- Change Airflow Refine job to write to event database.
- Rerun the hour on which we cut over for all event tables
- After a time period (1 weekish?) of functional Airflow-refined event tables, we stop the systemd timers and remove the event_systemd database.
- Incremental
- Modify legacy Refine job to be configurable (via EventStreamConfig?) to set which Hive database a dataset should be written to.
- Use EventStreamConfig to manage the cutover of legacy Refine and Airflow Refine. E.g., make legacy Refine of the eventlogging_NavigationTiming stream write to event_systemd at the same time we configure Airflow Refine to write to event instead of event_airflow.
... Other ideas?
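The incremental idea could look something like this sketch, where a per-stream setting tells each refine system which Hive database to write to. The "refine_databases" field and the system names are assumptions, not anything that exists in EventStreamConfig today:

```python
# Hypothetical sketch of the incremental cutover idea. The
# "refine_databases" field and system names are assumptions.
DEFAULT_DATABASES = {"legacy": "event", "airflow": "event_airflow"}

def target_database(stream_config: dict, refine_system: str) -> str:
    """Hive database a given refine system should write this stream to."""
    return stream_config.get("refine_databases", {}).get(
        refine_system, DEFAULT_DATABASES[refine_system]
    )

# Example: eventlogging_NavigationTiming has been cut over, so legacy
# Refine is diverted to event_systemd while Airflow Refine writes to event.
nav_timing_config = {
    "refine_databases": {"legacy": "event_systemd", "airflow": "event"}
}
```

Streams with no override keep the pre-cutover defaults, so cutover could proceed one stream at a time by adding overrides.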
Migration plan
Migrate analytics-test-hadoop cluster
Doing this first will help us determine our production migration plan.
- Deploy the Refine job on the airflow analytics_test instance. This job should limit the tables it refines to the streams that Gobblin ingests in analytics-test-hadoop. At first, the Airflow job should refine in parallel into a different Hive database.
- Compare output of systemd Refine and Airflow Refine using the mechanism developed in T361502: [Refine Refactoring] Define and implement an automated testing / comparison tool for config store configured datasets.
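One way the table limiting could work is to derive a table allowlist regex from the set of Gobblin-ingested streams. A sketch only; the stream names and the regex approach are assumptions, not the real test-cluster list or the job's actual option:

```python
import re

# Hypothetical sketch: build a table allowlist regex from the set of
# streams Gobblin ingests in analytics-test-hadoop. Stream names below
# are illustrative.
GOBBLIN_INGESTED_STREAMS = {
    "eventlogging_NavigationTiming",
    "mediawiki.page-create",
}

def table_include_regex(streams):
    """Regex matching only the Hive table names derived from these streams."""
    tables = sorted(s.replace(".", "_").replace("-", "_") for s in streams)
    return re.compile("^(" + "|".join(map(re.escape, tables)) + ")$")
```

A pattern like this could then be passed to whatever include/exclude option the Airflow Refine job exposes.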
Once confident, do the production cutover:
- Do a manual EvolveHiveTable --dry_run=true for all event tables immediately before cutover to be sure no unexpected ALTERs will be executed.
All at once cutover method:
- Change systemd jobs to write to event_systemd(?) database in Puppet.
- Change Airflow Refine job to write to event database.
- Manually rerun the hour on which we do the cutover.
- After a time period (1 weekish?) of functional Airflow refined event tables:
- Ensure the refine systemd timers are absent
- Remove the relevant refine puppet code
- Drop the event_systemd Hive database, tables, and files
Migrate analytics-hadoop cluster
Same steps as above, but for production analytics-hadoop cluster. To be filled in when we are closer to being ready.