- Add required EventStreamConfig settings, e.g. consumers.analytics_hive_ingestion.enabled
- Possibly implement deep merging of configs in EventStreamConfig
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T307500 Airflow Hackathon (May 2022) |
| Open | | None | T307505 Refine jobs should be scheduled by Airflow |
| Resolved | | Antoine_Quhen | T356762 [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation |
| Resolved | | tchin | T367134 [Refine Refactoring] Changes to EventStreamConfig needed for scheduling Refine via airflow |
Event Timeline
Can we connect this to an (epic?) parent task(s) and subscribe some more folks from DE?
Thank you!
Thomas and I had a working session today where we really tried to understand how we can get Refine in Airflow without breaking the user experience of automated event Hive ingestion.
We don't want users to have to understand Gobblin HDFS input paths or how the schema is discovered.
To do this, something has to infer smart defaults based on conventions we define. Currently, this logic lives in the Refine and RefineTarget code.
@Antoine_Quhen's WIP refine_to_hive_to_hive hourly dag already does some of this. For example, it determines the partitions a table should have, and extracts their values, from the configured Kafka topics setting (this is available in ESC). It also determines the final Gobblin input paths from the topics setting.
We could put this kind of logic inside of EventStreamConfig, but it would be quite strange there. It wouldn't just be defaults merging; it would be WMF Refine-job-specific convention logic, e.g. inferring the output table's HDFS location from the stream name.
Instead, we suggest putting this defaults convention logic in the refine_to_hive_to_hive airflow dag.
We started taking notes on changes that would be needed. This list is probably not comprehensive, but you get the idea:
- Airflow job needs to discover streams to ingest using ESC, just like we do for canary_event.
- We can use the consumers.analytics_hadoop_ingestion.enabled (or perhaps a new) setting in ESC to enable or disable Refinement.
- Airflow job applies intelligent defaults via convention for a stream when the config setting is not set in ESC:
  - Relative input_schema_uri determined from the schema_title setting in ESC.
  - Use multiple input_schema_base_uris. This should be a high-level job setting. EventSchemaLoader in code knows how to use a relative schema_uri with multiple schema_base_paths to find the schema.
  - fully_qualified_table_name inferred from the stream name and a default database name (needs a new parameter).
  - input_paths determined from the ESC topics setting. This is mostly happening now in Antoine's patch. We just need to make it smarter about defaulting the full input paths using topics too.
  - Partitions defaulted from the ESC topics setting + hourly. This is happening already in Antoine's patch.
- Overrides for Refine applied via ESC, e.g. hourly_computing_scale.
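To make the convention-based defaulting concrete, here is a hedged Python sketch of what the Airflow job's logic could look like. The function name, the `/wmf/data/raw/event/` base path, the `/latest` schema version suffix, and the `event` default database are illustrative assumptions, not settled decisions; only the conventions listed above come from this discussion.

```python
def refine_defaults(stream_config: dict, default_database: str = "event") -> dict:
    """Infer Refine job settings for a stream from its EventStreamConfig entry,
    applying conventions when a setting is not set explicitly in ESC.
    Hypothetical sketch; names and paths are assumptions."""
    stream = stream_config["stream"]
    # Relative input_schema_uri determined from the schema_title setting.
    schema_uri = stream_config["schema_title"] + "/latest"
    # fully_qualified_table_name inferred from the stream name and a default database.
    table = f"{default_database}.{stream.replace('.', '_').replace('-', '_')}"
    # Partitions defaulted from the topics setting (datacenter prefix) + hourly.
    partitions = ["datacenter", "year", "month", "day", "hour"]
    # input_paths determined from the ESC topics setting (base path is an assumption).
    input_paths = [f"/wmf/data/raw/event/{t}" for t in stream_config.get("topics", [])]
    return {
        "input_schema_uri": schema_uri,
        "fully_qualified_table_name": table,
        "input_paths": input_paths,
        "partitions": partitions,
    }
```

Per-stream overrides from ESC (e.g. hourly_computing_scale) would then be merged on top of the returned defaults.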
This would allow us to use ESC to schedule Refine via airflow now, with no changes to the expected user interface.
As discussed in the meeting, I'm OK with minimal changes to ESC: it could be nice to add the hourly_computing_scale and have a boolean ~needs_to_be_refined.
To define those booleans, we should reuse the regexes in the Refine systemd jobs:

```
event_logging_legacy_table_exclude_regex = r"^(edit|inputdevicedynamics|pageissues|mobilewebmainmenuclicktracking|kaiosappconsent|mobilewebuiclicktracking|citationusagepageload|citationusage|readingdepth|editconflict|resourcetiming|rumspeedindex|layoutshift|featurepolicyviolation|specialmutesubmit|suggestedtagsaction)$"

event_logging_legacy_table_include_regex = r"^(contenttranslationabusefilter|desktopwebuiactionstracking|mobilewebuiactionstracking|prefupdate|quicksurveyinitiation|quicksurveysresponses|searchsatisfaction|specialinvestigate|templatewizard|test|universallanguageselector|wikidatacompletionsearchclicks|editattemptstep|visualeditorfeatureuse|helppanel|homepagemodule|newcomertask|homepagevisit|serversideaccountcreation|cpubenchmark|navigationtiming|painttiming|savetiming|codemirrorusage|referencepreviewsbaseline|referencepreviewscite|referencepreviewspopups|templatedataapi|templatedataeditor|twocolconflictconflict|twocolconflictexit|virtualpageview|visualeditortemplatedialoguse|wikibasetermboxinteraction|wmdebannerevents|wmdebannerinteractions|wmdebannersizeissue|landingpageimpression|centralnoticebannerhistory|centralnoticeimpression|translationrecommendationuseraction|translationrecommendationuirequests|translationrecommendationapirequests|wikipediaportal)$"

event_table_exclude_regex = r"^(mediawiki_page_properties_change|mediawiki_recentchange|resource_purge)$"
```
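For illustration, a filter built from such regexes could look like the sketch below. The regexes here are abbreviated stand-ins for the full ones above, and "include and not exclude" is one plausible reading of how the settings combine; the actual Refine jobs apply these settings separately per job.

```python
import re

# Abbreviated stand-ins for the full regexes above, for illustration only.
EXCLUDE = re.compile(r"^(edit|pageissues|readingdepth)$")
INCLUDE = re.compile(r"^(navigationtiming|prefupdate|searchsatisfaction)$")

def should_refine_legacy(table: str) -> bool:
    """A legacy eventlogging table is refined when it matches the include
    regex and does not match the exclude regex (assumed semantics)."""
    return bool(INCLUDE.match(table)) and not EXCLUDE.match(table)
```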
The new process does not refine a stream without canary events, since we rely on _IMPORTED flags, and those flags can be skipped when there are no canary events (T365223). This should be enforced in ESC, or at least documented there.
As discussed, streams defined as regexes should not be refined.
> have a boolean ~needs_to_be_refined.
This would be a consumer-specific configuration in ESC. Gobblin currently uses the consumer name 'analytics_hadoop_ingestion', e.g.

```php
'consumers' => [
    'analytics_hadoop_ingestion' => [
        'job_name' => 'eventlogging_legacy',
        'enabled' => true,
    ],
]
```
We could have a new consumer config block, e.g. analytics_hive_ingestion or something, but I wonder if it would be better to reuse the nicely named analytics_hadoop_ingestion for Refine as well. Gobblin and Refine are both implementation details for hadoop ingestion in general.
@Antoine_Quhen I think the intention is to have a single Airflow job for Refine, is that correct?
Would it be easier to do as we do now and have 2 (event default and eventlogging legacy)? If we had two, we could auto-set defaults for eventlogging_legacy in the airflow job accordingly, instead of hardcoding the legacy ones on every stream. (Am fine with either, just thinking about it.)
We will probably need a separate airflow job for Refine netflow anyway, right? (Unless we somehow make netflow a regular event platform stream?)
> we should reuse the regex in Refine systemd jobs
This info is _mostly_ already in EventStreamConfig, as analytics_hadoop_ingestion.job_name. We can ignore the eventlogging_legacy_table_include/exclude stuff; that only exists for the eventlogging -> event platform migration and will not be needed soon.
As for event_table_exclude_regex, I think we can just disable analytics_hadoop_ingestion for these. We don't do anything with the data, since it is not refined.
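Putting the discovery rules from this thread together (consumer enabled in ESC, regex-defined streams skipped), stream discovery for the Airflow job could be sketched as below. The function name and the dict shape are hypothetical; the consumer name and the regex-stream convention come from the discussion above.

```python
def discover_refine_streams(
    stream_configs: dict, consumer: str = "analytics_hadoop_ingestion"
) -> list:
    """Select streams eligible for Refine: the given consumer must be enabled,
    and streams defined as regexes (e.g. /^mediawiki\\.job\\..+/) are skipped.
    Hypothetical sketch of the discovery logic discussed in this task."""
    eligible = []
    for name, cfg in stream_configs.items():
        if name.startswith("/"):  # regex-defined stream entry: do not refine
            continue
        consumer_cfg = cfg.get("consumers", {}).get(consumer, {})
        if consumer_cfg.get("enabled"):
            eligible.append(name)
    return eligible
```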
We discussed a bunch in a meeting today. We have two main ideas on how to structure ESC for Airflow Refine.
First, some summaries from the convo:
- In the future, when eventlogging is fully migrated to event platform (SOON! T353817), we can use almost exactly the same logic to Refine eventlogging legacy streams and event platform streams.
- We discussed being able to remove analytics_hadoop_ingestion.job_name. This is used by Gobblin currently, but once the eventlogging migration is done, we can use the same Gobblin job (event_default) for legacy eventlogging streams too.
- We do want to use Refine to refine webrequest and netflow...one day.
Okay, ESC ideas:
Re-use analytics_hadoop_ingestion consumer config
```yaml
# idea: re-use analytics_hadoop_ingestion for refine
mediawiki.page_change.v1:
  # ...
  consumers:
    analytics_hadoop_ingestion:
      enabled: true
      dataset_size: medium  # use this to infer refine and/or gobblin job settings?
      # any other overrides needed?
```
Pro: Gobblin and Refine are really just implementation details of Stream -> Hive ingestion. Conceptually keeps this together. If we refactor ingestion (getting rid of Gobblin), we can re-use this block again.
Con: It may be annoying when Gobblin and Refine need different tool-specific settings. What if Gobblin needs small resources but Refine needs large? Do we add gobblin- and refine-specific settings blocks inside the analytics_hadoop_ingestion block? Hm.
New analytics_hive_ingestion consumer config
```yaml
# idea: new analytics_hive_ingestion consumer block for refine
mediawiki.page_change.v1:
  # ...
  consumers:
    analytics_hadoop_ingestion:
      enabled: true
      job_size: small
    analytics_hive_ingestion:  # refine
      enabled: true
      job_size: medium
```
Pro: Works now for Refine stream discovery, even for legacy eventlogging streams (we don't have to wait for the migration to finish).
Con: More concepts in config.
I have a slight preference for reusing analytics_hadoop_ingestion, but I think Antoine slightly prefers the new analytics_hive_ingestion. Antoine is implementing this now, so let's go with analytics_hive_ingestion for now and see how it feels.
Change #1050596 had a related patch set uploaded (by TChin; author: TChin):
[operations/mediawiki-config@master] EventStreamConfig: Add hive ingestion defaults
Change #1050596 merged by jenkins-bot:
[operations/mediawiki-config@master] EventStreamConfig: Add hive ingestion defaults
Mentioned in SAL (#wikimedia-operations) [2024-07-08T13:17:10Z] <urbanecm@deploy1002> Started scap sync-world: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]]
Mentioned in SAL (#wikimedia-operations) [2024-07-08T13:31:58Z] <urbanecm@deploy1002> tchin, jforrester, urbanecm: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-07-08T13:47:48Z] <urbanecm@deploy1002> Finished scap: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] (duration: 30m 38s)
Change #1052761 had a related patch set uploaded (by TChin; author: TChin):
[mediawiki/extensions/EventStreamConfig@master] Deeply merge stream defaults
Change #1052762 had a related patch set uploaded (by TChin; author: TChin):
[operations/mediawiki-config@master] EventStreamConfig: Enable hive ingestion for mediawiki.page-delete
Change #1052762 merged by jenkins-bot:
[operations/mediawiki-config@master] EventStreamConfig: Enable hive ingestion for mediawiki.page-delete
Mentioned in SAL (#wikimedia-operations) [2024-07-16T13:05:04Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1052762|EventStreamConfig: Enable hive ingestion for mediawiki.page-delete (T367134)]]
Mentioned in SAL (#wikimedia-operations) [2024-07-16T13:09:11Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 tchin, lucaswerkmeister-wmde: Backport for [[gerrit:1052762|EventStreamConfig: Enable hive ingestion for mediawiki.page-delete (T367134)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-07-16T13:15:20Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1052762|EventStreamConfig: Enable hive ingestion for mediawiki.page-delete (T367134)]] (duration: 10m 15s)
Diffing the output of deeply merging stream defaults, all of the changes are either just adding job_name to a disabled hadoop ingestion config, or adding analytics_hive_ingestion defaults to streams that have hadoop ingestion enabled:
"/^mediawiki\\.job\..+/": {
"canary_events_enabled": false,
"consumers": {
"analytics_hadoop_ingestion": {
"enabled": false,
+ "job_name": "event_default"
},
"analytics_hive_ingestion": {
"enabled": false
}
},
"destination_event_service": "eventgate-main",
"schema_title": "mediawiki/job",
"stream": "/^mediawiki\\.job\..+/",
"topic_prefixes": [
"eqiad.",
"codfw."
],
"topics": [
"/^(eqiad\\.|codfw\.)mediawiki\.job\..+/"
]
},
"eventlogging_CentralNoticeBannerHistory": {
"canary_events_enabled": true,
"consumers": {
"analytics_hadoop_ingestion": {
"enabled": true,
"job_name": "eventlogging_legacy"
},
+ "analytics_hive_ingestion": {
+ "enabled": false
+ }
},
"destination_event_service": "eventgate-analytics-external",
"schema_title": "analytics/legacy/centralnoticebannerhistory",
"stream": "eventlogging_CentralNoticeBannerHistory",
"topic_prefixes": null,
"topics": [
"eventlogging_CentralNoticeBannerHistory"
]
},Change #1052761 merged by jenkins-bot:
[mediawiki/extensions/EventStreamConfig@master] Deeply merge stream defaults
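For context, "deeply merge stream defaults" implies recursively merging nested maps (e.g. the per-consumer blocks in the diff above) rather than replacing them wholesale. The actual change is in the PHP EventStreamConfig extension; this is only a minimal Python sketch of the merge behavior the diff suggests:

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into defaults: nested dicts are merged
    key by key, while scalars and lists in overrides replace the default.
    Illustrative sketch; not the extension's actual implementation."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With a shallow merge, a stream that only sets analytics_hadoop_ingestion would drop the default analytics_hive_ingestion block; with a deep merge, both consumer blocks survive, which matches the diff output above.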