
[Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment
Closed, Resolved · Public · 13 Estimated Story Points

Assigned To
Authored By
Ahoelzl
Jul 11 2024, 4:25 PM

Description

This task will contain a migration plan and be used to track the production deployment of Refine in Airflow. Subtasks will be created if needed.

The task description should be updated as we 'refine' ٩(^‿^)۶ the migration plan.

Implementation of the refactor is tracked in T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation

Done is:

  • analytics-test-hadoop event ingestion Refine jobs are scheduled via Airflow
  • analytics-test-hadoop systemd event ingestion Refine jobs are removed, and corresponding puppet code is deleted.
  • analytics-hadoop event ingestion Refine jobs are scheduled via Airflow
  • analytics-hadoop systemd event ingestion Refine jobs are removed, and corresponding puppet code is deleted.

Note that this task does not include Airflow-ization of:

Production cutover ideas

'Production' here means that the Airflow job is configured to write data to the event Hive tables.

Before we deploy to production, we plan to configure Airflow Refine to write in parallel to a temp database (event_airflow, perhaps?). Once we feel confident with that, we will need to cut over to writing into the existing production event database.

There are several ways we could do the actual production migrations.

  1. All at once
    • Change systemd jobs to write to event_systemd(?) database.
    • Change Airflow Refine job to write to event database.
    • Rerun the hour on which we cut over for all event tables
    • After a time period (1 weekish?) of functional Airflow refined event tables, we stop the systemd timers and remove the event_systemd database.
  2. Incremental
    • Modify the legacy Refine job to be configurable (via EventStreamConfig?) so that the Hive database a dataset is written to can be set per stream.
    • Use EventStreamConfig to manage the cutover of legacy Refine and Airflow Refine. E.g. make legacy Refine of the eventlogging_NavigationTiming stream write to event_systemd at the same time we configure Airflow Refine to write to event instead of event_airflow.

... Other ideas?
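The incremental option boils down to a per-stream, per-engine routing decision. A minimal sketch of that logic (helper and setting names here are hypothetical, not the real EventStreamConfig API):

```python
# Hypothetical sketch of per-stream output routing during an incremental
# cutover. The set of migrated streams stands in for whatever flag
# EventStreamConfig would carry; none of these names are the real API.

def target_database(stream: str, migrated_streams: set, engine: str) -> str:
    """Return the Hive database a given Refine engine should write to."""
    migrated = stream in migrated_streams
    if engine == "airflow":
        # Airflow Refine writes to production only once the stream is migrated.
        return "event" if migrated else "event_airflow"
    # Legacy systemd Refine is diverted aside once the stream is migrated.
    return "event_systemd" if migrated else "event"

migrated = {"eventlogging_NavigationTiming"}
print(target_database("eventlogging_NavigationTiming", migrated, "airflow"))  # event
print(target_database("eventlogging_NavigationTiming", migrated, "systemd"))  # event_systemd
```

Flipping one stream in a config store like this would move it atomically for both engines, which is the appeal of the incremental route.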

NOTE: As of 2024-08-28, we will plan to proceed with the 'All at once' migration plan.

Migration plan

Migrate analytics-test-hadoop cluster

Doing this first will help us determine our production migration plan.

Once confident, do the production cutover:

  • Do a manual EvolveHiveTable --dry_run=true for all event tables immediately before cutover to be sure no unexpected ALTERs will be executed.
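One way to script that dry-run pass over all tables. Only --dry_run=true comes from the step above; the spark-submit wrapper, class name, and jar name are assumptions:

```python
# Sketch: build a dry-run EvolveHiveTable command for every event table so
# unexpected ALTERs surface before cutover. The wrapper, class, and jar names
# are assumptions; --dry_run=true is the flag named in the plan.

def dry_run_commands(tables, database="event"):
    base = ("spark3-submit "
            "--class org.wikimedia.analytics.refinery.job.refine.EvolveHiveTable "
            "refinery-job.jar")
    return [f"{base} --database={database} --table={t} --dry_run=true"
            for t in tables]

for cmd in dry_run_commands(["navigationtiming", "centralnoticebannerhistory"]):
    print(cmd)
```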

All at once cutover method:

  • Change systemd jobs to write to event_systemd(?) database in Puppet.
  • Change Airflow Refine job to write to event database.
  • Manually rerun the hour on which we do the cutover.
  • After a time period (1 weekish?) of functional Airflow refined event tables:

Migrate analytics-hadoop cluster

Same steps as above, but for production analytics-hadoop cluster. To be filled in when we are closer to being ready.

Details

Other Assignee
mforns
Related Changes in Gerrit:
Repo | Branch | Lines +/-
analytics/refinery/source | master | +19 -6
analytics/refinery/source | master | +184 -18
operations/mediawiki-config | master | +1 -1
operations/puppet | production | +0 -4
operations/dns | master | +1 -1
analytics/refinery/source | master | +54 -2
operations/puppet | production | +6 -2
operations/dns | master | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +9 -2
operations/puppet | production | +10 -1
operations/deployment-charts | master | +13 -8
operations/deployment-charts | master | +6 -0
operations/deployment-charts | master | +10 -7
operations/deployment-charts | master | +33 -18
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +7 -21
analytics/refinery/source | master | +197 -1
analytics/refinery/source | master | +28 -7
operations/deployment-charts | master | +23 -0
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +2 -0
operations/puppet | production | +2 -2
analytics/refinery/source | master | +4 -1
analytics/refinery/source | 0.2.49 | +4 -1
operations/mediawiki-config | master | +10 -0
analytics/refinery/source | master | +37 -18
operations/mediawiki-config | master | +201 -9
operations/puppet | production | +1 -1
analytics/refinery/source | master | +28 -19
analytics/refinery/source | 0.2.49 | +29 -23
analytics/refinery/source | master | +30 -23
analytics/refinery/source | master | +22 -16
analytics/refinery/source | master | +83 -15

Related Objects

Status | Subtype | Assigned
Resolved | | None
Duplicate | | None
Duplicate | | None
Resolved | | JAllemandou
Resolved | | Ottomata
Open | | None
Resolved | | None
Open | | Ahoelzl
Resolved | | Antoine_Quhen
Resolved | | gmodena
Resolved | Spike | gmodena
Resolved | | Ottomata
Resolved | | Antoine_Quhen
Resolved | | Stevemunene
Resolved | | Antoine_Quhen
Resolved | | Antoine_Quhen
Resolved | | tchin
Open | | None
Resolved | | Ottomata
Open | | None
Resolved | | tchin
Resolved | | Antoine_Quhen
Open | | None
Declined | | Antoine_Quhen
Resolved | | JAllemandou
Resolved | | Ottomata
Resolved | | Antoine_Quhen
Resolved | | Antoine_Quhen
Resolved | | Antoine_Quhen
Resolved | | Antoine_Quhen

Event Timeline


Change #1161047 merged by Brouberol:

[operations/deployment-charts@master] Airflow analytics-test: Optimization for LocalExecutors

https://gerrit.wikimedia.org/r/1161047

Change #1163040 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the geoip databases to the dse-k8s-worker nodes

https://gerrit.wikimedia.org/r/1163040

Change #1163040 merged by Btullis:

[operations/puppet@production] Add the geoip databases to the dse-k8s-worker nodes

https://gerrit.wikimedia.org/r/1163040

Change #1166132 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] refine-monitor: RefineTarget.shouldRefine fix

https://gerrit.wikimedia.org/r/1166132

Change #1167572 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] data-engineering: Refine switch over preparation

https://gerrit.wikimedia.org/r/1167572

Currently implementing the plan here: https://docs.google.com/document/d/1PXGlHfnZIwr54H4RVOaIT_sLbJY6mpQieVq4owUH8Ls
Posted below in its state as of July 10th, 2025.

Refine from systemd to Airflow - Technical migration plan

Before M-Day

Initialize the new event_systemd database with all event tables precreated from the existing event.* schemas (script needed). MEDIUM https://gitlab.wikimedia.org/-/snippets/232
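The idea of that script can be sketched as a DDL rewrite. This is an illustration only, not the linked snippet; the HDFS paths and the reliance on SHOW CREATE TABLE output are assumptions:

```python
# Sketch: turn a `SHOW CREATE TABLE event.<t>` DDL into one that precreates
# the same table under event_systemd, with LOCATION pointing at a parallel
# HDFS tree. Illustration only; not the linked snippet.
import re

def systemd_copy_ddl(show_create_output: str) -> str:
    # Rename the database in the CREATE TABLE statement.
    ddl = show_create_output.replace("CREATE TABLE `event`.",
                                     "CREATE TABLE `event_systemd`.")
    # Repoint the data location at the parallel event_systemd tree.
    return re.sub(r"(LOCATION\s+'[^']*)/wmf/data/event/",
                  r"\1/wmf/data/event_systemd/", ddl)

example = ("CREATE TABLE `event`.`navigationtiming` (`dt` string) "
           "LOCATION 'hdfs://analytics-hadoop/wmf/data/event/navigationtiming'")
print(systemd_copy_ddl(example))
```

Looping this over every `event.*` table and executing each rewritten DDL would precreate the whole database in one pass.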

  • Create a puppet patch (for M-Day) to update the configuration of the systemd refine and refine monitor. NOTE: No need to change refine_sanitized as it reads from the event table. MEDIUM https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167572
    • Systemd Refine output should be the new event_systemd db / folders.
    • Refine-monitor should monitor the new event_systemd db and folders
    • Modify from-until values (-8, -2)
  • Create an Airflow patch (for M-day) to update the variables in refine_to_hive dag. MEDIUM
    • change its output to the event DB
    • update the table it diffs with to event_systemd (+sensors)
    • Update start date → Should be M-DAY at hour 0. We wish to have a clear-cut day.
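The variable flip above can be pictured as follows. Key names are assumptions about how the refine_to_hive DAG reads its config, and the date is only a placeholder for M-Day:

```python
# Sketch of the M-Day configuration flip for the refine_to_hive DAG.
# Key names are assumptions; the date below is a placeholder for M-Day.
from datetime import datetime, timezone

M_DAY = datetime(2025, 7, 10, 0, 0, 0, tzinfo=timezone.utc)  # placeholder

refine_to_hive_overrides = {
    # Output moves from the temp database to production.
    "output_database": "event",
    # The diff task (and its sensors) now compares against the systemd output.
    "diff_database": "event_systemd",
    # Clear-cut day: the DAG starts at hour 0 of M-Day.
    "start_date": M_DAY,
}
print(refine_to_hive_overrides["output_database"])
```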

On M-Day (TBD)

  • Pause refine_to_hive DAG and systemd timers
  • Clear past dag runs of refine_to_hive (delete the DAG itself)
  • Apply the config patches and delete the Airflow variable. NOTE: verify start_date for Airflow and since/until for systemd refine. Systemd refine should process only from the beginning of the day, not the last 24h; same for refine_monitor. NOTE 2: the since/until values should ensure recomputation from the beginning of the day (testing on one small table for 1 hour first is a good idea, to avoid a mess).
  • Restart systemd timer to load data to new database/folders. Wait for the first job to be successful 🙂
  • Restart Airflow refine_to_hive DAG, monitor!
  • Drop event_alt db including hdfs data

On M-Day + 1 (TBD)

Update refine and refine_monitor since / until config back to -26 / -2

Later (TBD)

  • Drop systemd timers for refine and refine_monitor
  • Drop event_systemd db including hdfs data
  • Drop diff task in refine_to_hive_hourly (+hdfs dependencies)

Optimizations

  • Use LocalExecutor (with smaller pool) for pure Java canary events.
  • Using LocalExecutor for small refine tasks (non Spark) with smaller pool. (waiting for GeoIp DB)
  • Using K8S Operator without skein for small Spark tasks.
  • Optimize inter-task parameter passing in the DAG to lower metadata DB resource consumption (delegate as much as possible to execution time). This is about over-using XComs 🙂 (it also lowers parsing time).
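The "delegate to execution time" point can be pictured like this: keep parse-time objects as cheap templates and only compute values when the task runs on the worker, instead of shipping them between tasks via XCom. Names are illustrative, not from the real DAG:

```python
# Sketch of delegating work to execution time: the partition spec is a cheap
# template at DAG-parse time and is only rendered on the worker, so nothing
# travels through XCom or the metadata DB. Names are illustrative.
from datetime import datetime

# Parse-time: just a string. No XCom push, no DB writes, fast DAG parsing.
PARTITION_TEMPLATE = "year={y}/month={m}/day={d}/hour={h}"

def render_partition(execution_dt: datetime) -> str:
    """What the task does when it actually runs on the worker."""
    return PARTITION_TEMPLATE.format(
        y=execution_dt.year, m=execution_dt.month,
        d=execution_dt.day, h=execution_dt.hour)

print(render_partition(datetime(2025, 7, 16, 0)))
# year=2025/month=7/day=16/hour=0
```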

Change #1167572 merged by Btullis:

[operations/puppet@production] data-engineering: Refine switch-over preparation

https://gerrit.wikimedia.org/r/1167572

Change #1169675 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bump hive metastore heap to support the refine migration

https://gerrit.wikimedia.org/r/1169675

Change #1169675 merged by Btullis:

[operations/puppet@production] Bump hive metastore heap to support the refine migration

https://gerrit.wikimedia.org/r/1169675

Change #1169683 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Fail over hive services to an-coord1004

https://gerrit.wikimedia.org/r/1169683

Change #1169697 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Revert "Fail over hive services to an-coord1004"

https://gerrit.wikimedia.org/r/1169697

Change #1169697 merged by Btullis:

[operations/dns@master] Revert "Fail over hive services to an-coord1004"

https://gerrit.wikimedia.org/r/1169697

Change #1169720 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Refine: Force partition creation at end of refine

https://gerrit.wikimedia.org/r/1169720

Change #1170084 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Analytics: Refine post migration update

https://gerrit.wikimedia.org/r/1170084

Change #1170084 merged by Brouberol:

[operations/puppet@production] Analytics: Refine post migration update

https://gerrit.wikimedia.org/r/1170084

Change #1169720 merged by jenkins-bot:

[analytics/refinery/source@master] Refine: Force partition creation at end of refine

https://gerrit.wikimedia.org/r/1169720

Change #1170384 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Analyics: Refine restore monitor timerange

https://gerrit.wikimedia.org/r/1170384

Change #1170384 merged by Brouberol:

[operations/puppet@production] Analyics: Refine restore monitor timerange

https://gerrit.wikimedia.org/r/1170384

Change #1172318 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Refine: Fix location of refined data-empty partitions

https://gerrit.wikimedia.org/r/1172318

Change #1177334 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/mediawiki-config@master] Analytics - Refine eventlogging_MediaWikiPingback

https://gerrit.wikimedia.org/r/1177334

Change #1177334 merged by jenkins-bot:

[operations/mediawiki-config@master] Analytics - Refine eventlogging_MediaWikiPingback

https://gerrit.wikimedia.org/r/1177334

Mentioned in SAL (#wikimedia-operations) [2025-08-11T13:33:28Z] <phuedx@deploy1003> Started scap sync-world: Backport for [[gerrit:1177334|Analytics - Refine eventlogging_MediaWikiPingback (T369845)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-11T13:35:15Z] <phuedx@deploy1003> aqu, phuedx: Backport for [[gerrit:1177334|Analytics - Refine eventlogging_MediaWikiPingback (T369845)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-11T13:41:45Z] <phuedx@deploy1003> Finished scap sync-world: Backport for [[gerrit:1177334|Analytics - Refine eventlogging_MediaWikiPingback (T369845)]] (duration: 08m 16s)

Following the migration, we noticed that some partition paths had changed compared to what they used to be, in two ways:

1/ camel-cased table in hdfs path name in place of lower-cased. e.g.
/wmf/data/event/CentralNoticeBannerHistory/year=2025/month=7/day=16/hour=0/
in place of
/wmf/data/event/centralnoticebannerhistory/year=2025/month=7/day=16/hour=0/
This was because the new Refine now uses the Hive table location stored in the metastore metadata, and that location was not lowercased there.

2/ duplication of the table name in the path of some partitions
/wmf/data/event/test_instrumentation/test_instrumentation/datacenter=codfw/year=2025/month=7/day=15/hour=10
in place of
/wmf/data/event/test_instrumentation/datacenter=codfw/year=2025/month=7/day=15/hour=10
This was due to a bug in how parameters were passed between Airflow and the Scala job.

Here is a script used to harmonize the locations: https://gitlab.wikimedia.org/-/snippets/241
Another one to check for missing partitions: https://gitlab.wikimedia.org/-/snippets/242
Another one to remove ghost directories: https://gitlab.wikimedia.org/-/snippets/244
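As an illustration of what the harmonization has to do (a sketch, not the linked snippets):

```python
# Sketch (not the linked snippets) of the two path fixes: lowercase the table
# directory under .../event/ and drop a duplicated table directory.

def harmonize(path: str) -> str:
    parts = path.strip("/").split("/")
    i = parts.index("event") + 1      # table directory follows .../event/
    parts[i] = parts[i].lower()       # fix 1: camel-cased table directory
    if i + 1 < len(parts) and parts[i + 1] == parts[i]:
        del parts[i + 1]              # fix 2: duplicated table directory
    return "/" + "/".join(parts)

print(harmonize("/wmf/data/event/CentralNoticeBannerHistory/year=2025/month=7/day=16/hour=0"))
# /wmf/data/event/centralnoticebannerhistory/year=2025/month=7/day=16/hour=0
print(harmonize("/wmf/data/event/test_instrumentation/test_instrumentation/datacenter=codfw/year=2025/month=7/day=15/hour=10"))
# /wmf/data/event/test_instrumentation/datacenter=codfw/year=2025/month=7/day=15/hour=10
```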

Change #1166132 abandoned by Aqu:

[analytics/refinery/source@master] refine-monitor: RefineTarget.shouldRefine fix

Reason:

Refine has been migrated and RefineMonitor is becoming obsolete.

https://gerrit.wikimedia.org/r/1166132

Change #1172318 merged by jenkins-bot:

[analytics/refinery/source@master] Refine: Fix location of refined data-empty partitions

https://gerrit.wikimedia.org/r/1172318

aqu closed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1373

Draft: Main - Canary events: Replace Skein by Kubernetes executor with custom image