Automate event stream ingestion into HDFS for streams that don't use EventGate
Closed, Resolved, Public

Description

Currently, we configure Camus jobs to import events based on which eventgate cluster instance they come through. We do this because the streams within each eventgate cluster instance generally have similar throughput, which lets us tune the ingestion jobs better.

T269619: Create pipelines for late/spurious/failed events will create one of the first sets of Event Platform-based streams that do not flow through eventgate. We need to add a Camus job that knows how to import these kinds of events, probably by adding another stream config setting to indicate that they should be ingested by Camus. This may cause us to rethink which stream config setting we use to configure ingestion altogether.

Event Timeline

Restricted Application added a subscriber: Aklapper.
fdans triaged this task as Medium priority. Mar 1 2021, 5:04 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

@Ottomata this is going to block the deployment of the WDQS Flink-based Streaming Updater. Any chance you could raise the priority?

For additional context, we are planning to deploy the new Streaming Updater on March 15.

Change 668119 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set canary_events_enabled: true for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/668119

Change 668119 merged by jenkins-bot:
[operations/mediawiki-config@master] Set canary_events_enabled: true for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/668119

Mentioned in SAL (#wikimedia-operations) [2021-03-03T16:28:07Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: canary_events_enabled: true for rdf-streaming-updater streams - T273901 (duration: 01m 49s)

Change 668124 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS

https://gerrit.wikimedia.org/r/668124

Ok, here's the idea:

I'd add a new EventStreamConfig settings block called consumers, where we can add consumers by name and put in the relevant settings for each. Each consumer would be responsible for reading and acting on its own settings.

We could do the same for producers (or maybe just producer, following the single-writer principle). This would be relevant for things being discussed in T273235: [Metrics Platform] Define stream configuration syntax relevant to v1 release cc @Mholloway

Example for consumers:
Declare settings: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/668124
Use them: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668125
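To make the idea concrete, here is a rough Python sketch of how a consumer might pick out the streams it should ingest from a consumers block. It is not the code in either patch above: the stream names, the `enabled` key, and the `streams_for_consumer` helper are assumptions made up for illustration; only the `consumers` block and the `analytics-hadoop` consumer name come from this task.

```python
from typing import Any

# Hypothetical stream config entries. Stream names and the "enabled" key are
# illustrative assumptions, not the settings declared in the real patch.
STREAM_CONFIGS: dict[str, dict[str, Any]] = {
    "example.stream_a": {
        "consumers": {"analytics-hadoop": {"enabled": True}},
    },
    "example.stream_b": {
        # Declares the consumer but opts out of ingestion.
        "consumers": {"analytics-hadoop": {"enabled": False}},
    },
    "example.stream_c": {
        "consumers": {
            "analytics-hadoop": {"enabled": True},
            "some-other-consumer": {"enabled": True},
        },
    },
    "example.stream_d": {
        # No consumers block at all, so nothing ingests this stream.
    },
}


def streams_for_consumer(
    configs: dict[str, dict[str, Any]], consumer_name: str
) -> list[str]:
    """Return the sorted names of streams that enable the given consumer."""
    selected = []
    for stream_name, settings in configs.items():
        consumer_settings = settings.get("consumers", {}).get(consumer_name, {})
        if consumer_settings.get("enabled", False):
            selected.append(stream_name)
    return sorted(selected)


if __name__ == "__main__":
    # The ingestion job only needs to know its own consumer name.
    print(streams_for_consumer(STREAM_CONFIGS, "analytics-hadoop"))
```

The point of keying on the consumer name rather than on the eventgate instance is that a stream that never passes through eventgate can still opt in to ingestion.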

Change 668131 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/668131

Change 668131 merged by Ottomata:
[operations/mediawiki-config@master] Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/668131

Mentioned in SAL (#wikimedia-operations) [2021-03-03T17:13:55Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Set destination_event_serivce: eventgate-main for rdf-streaming-updater streams - T273901 (duration: 01m 08s)

Change 668135 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Bump schemas/event/secondary sha to get rdf_streaming_updater schemas

https://gerrit.wikimedia.org/r/668135

Change 668135 merged by Ottomata:
[eventgate-wikimedia@master] Bump schemas/event/secondary sha to get rdf_streaming_updater schemas

https://gerrit.wikimedia.org/r/668135

Change 668139 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Bump to 2021-03-03-172637-production to get rdf_streaming_updater schemas

https://gerrit.wikimedia.org/r/668139

Change 668139 merged by Ottomata:
[operations/deployment-charts@master] Bump to 2021-03-03-172637-production to get rdf_streaming_updater schemas

https://gerrit.wikimedia.org/r/668139

Change 668144 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-main should also use schemas/event/secondary repo

https://gerrit.wikimedia.org/r/668144

Change 668144 merged by Ottomata:
[operations/deployment-charts@master] eventgate-main should also use schemas/event/secondary repo

https://gerrit.wikimedia.org/r/668144

Change 701177 had a related patch set uploaded (by Ottomata; author: Ottomata):

[wikimedia-event-utilities@master] Support getting EventStreamConfig settings by JsonPointer path

https://gerrit.wikimedia.org/r/701177

Change 701177 merged by Ottomata:

[wikimedia-event-utilities@master] Support getting EventStreamConfig settings by JsonPointer path

https://gerrit.wikimedia.org/r/701177
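
For readers unfamiliar with JSON Pointer (RFC 6901), the sketch below shows what looking up a nested stream config setting by pointer path amounts to. It is a generic Python illustration, not the wikimedia-event-utilities API; the `resolve_pointer` helper and the example settings are hypothetical.

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against nested dicts and lists."""
    if pointer == "":
        # The empty pointer refers to the whole document.
        return document
    if not pointer.startswith("/"):
        raise ValueError(f"JSON Pointer must be empty or start with '/': {pointer!r}")

    current = document
    for raw_token in pointer[1:].split("/"):
        # Unescape "~1" before "~0", as RFC 6901 requires.
        token = raw_token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not token.isdigit():
                raise KeyError(f"Invalid array index: {token!r}")
            current = current[int(token)]
        elif isinstance(current, dict):
            current = current[token]
        else:
            raise KeyError(f"Cannot descend into scalar at token {token!r}")
    return current


# Hypothetical settings for a single stream; the "enabled" key is an assumption.
EXAMPLE_SETTINGS = {
    "consumers": {"analytics-hadoop": {"enabled": True}},
    "tags": ["first", "second"],
    "a/b": "escaped key",
}

if __name__ == "__main__":
    print(resolve_pointer(EXAMPLE_SETTINGS, "/consumers/analytics-hadoop/enabled"))
```

A pointer such as `/consumers/analytics-hadoop/enabled` lets a consumer filter streams on a nested setting without hard-coding the traversal.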

Change 668124 merged by Ottomata:

[operations/mediawiki-config@master] Add a consumers.analytics-hadoop setting to automate ingestion of streams into HDFS

https://gerrit.wikimedia.org/r/668124

Mentioned in SAL (#wikimedia-operations) [2021-07-08T14:52:29Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Add consumers.analytics_hadoop-ingestion stream config settings for automated gobblin imports - T271232 T273901 (duration: 01m 09s)

Yes! We've done this now that we are using Gobblin instead of Camus. Moving this to our Kanban so we can ACK and close it this week.