
Create pipelines for late/spurious/failed events
Closed, ResolvedPublic

Description

As a production service, the flink streaming updater job needs somewhere to send events that are late/failed/spurious. Currently these go to HDFS, but apps in the Kubernetes cluster can't talk to the Analytics cluster. There needs to be a Kafka bridge that the rdf-streaming-updater can send events to, so that those events eventually end up in HDFS.

Acceptance Criteria:
Kafka topics are set up for late/failed/spurious events
Those same events end up in HDFS
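For illustration, a minimal sketch of how a Flink job can tag such events using side outputs, so that each category can later be shipped to its own Kafka topic. The tag names, routing conditions, and class names below are placeholders, not the actual rdf-streaming-updater code:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputSketch {
    // OutputTag must be created as an anonymous subclass so Flink can capture the generic type.
    static final OutputTag<String> LATE = new OutputTag<String>("late-events") {};
    static final OutputTag<String> FAILED = new OutputTag<String>("failed-events") {};

    static SingleOutputStreamOperator<String> split(DataStream<String> events) {
        return events.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String event, Context ctx, Collector<String> out) {
                if (event.contains("\"late\"")) {          // placeholder condition
                    ctx.output(LATE, event);
                } else if (event.contains("\"failed\"")) { // placeholder condition
                    ctx.output(FAILED, event);
                } else {
                    out.collect(event);                    // normal output
                }
            }
        });
    }
    // later: split(events).getSideOutput(LATE).addSink(...)  // e.g. a Kafka sink per topic
}
```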

Event Timeline

@Ottomata also suggested via IRC that we consider using the event platform instead of plain Kafka.

Change 647723 had a related patch set uploaded (by DCausse; owner: DCausse):
[schemas/event/secondary@master] Add rdf-streaming-updater schemas for side outputs

https://gerrit.wikimedia.org/r/647723

Change 649715 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] [WIP] Add json encoders for side output events

https://gerrit.wikimedia.org/r/649715

Thanks!
It worked like a charm, but I still had to pull com.github.java-json-tools:json-schema-validator:2.2.14 to do schema validation. Would it make sense to add some helper functions to wikimedia-event-utilities for validating a JSON event against its schema? The use case is a unit test to make sure that the JSON produced is compliant with the schema it references.
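As an illustration, a minimal sketch of that kind of unit test with json-schema-validator 2.2.14. The schema resource path, schema title, and event payload are made up for the example:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.github.fge.jackson.JsonLoader;
import com.github.fge.jsonschema.core.report.ProcessingReport;
import com.github.fge.jsonschema.main.JsonSchema;
import com.github.fge.jsonschema.main.JsonSchemaFactory;
import org.junit.Test;

import static org.junit.Assert.assertTrue;

public class SideOutputEventSchemaTest {
    @Test
    public void producedEventIsValidAgainstItsSchema() throws Exception {
        // Hypothetical path: the event schema bundled as a test resource.
        JsonNode schemaNode = JsonLoader.fromResource("/schemas/lapsed_action/1.0.0.json");
        JsonSchema schema = JsonSchemaFactory.byDefault().getJsonSchema(schemaNode);

        // Hypothetical event produced by the JSON encoder under test.
        JsonNode event = JsonLoader.fromString(
            "{\"$schema\": \"/rdf_streaming_updater/lapsed_action/1.0.0\","
            + " \"meta\": {\"stream\": \"rdf-streaming-updater.lapsed-action\"}}");

        ProcessingReport report = schema.validate(event);
        assertTrue(report.toString(), report.isSuccess());
    }
}
```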

would it make sense to add some helper functions to wikimedia-event-utilities for validating a JSON event against its schema

Sure that could be useful :)

@dcausse, will these be POSTed to an EventGate, or produced directly to Kafka?

I plan to POST them to EventGate using a very naive SinkFunction and see how it behaves, but I can push directly to Kafka if that's preferable.
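Roughly what such a naive sink could look like (assumed EventGate URL, plain HttpURLConnection, no retries or batching; not the actual implementation):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/** Very naive sink that POSTs each JSON event to EventGate (sketch, not production code). */
public class EventGateHttpSink extends RichSinkFunction<String> {
    // Assumed endpoint; the real service URL / discovery name may differ.
    private static final String EVENTGATE_URI = "https://eventgate-main.example.org/v1/events";

    @Override
    public void invoke(String jsonEvent, Context context) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(EVENTGATE_URI).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(jsonEvent.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        if (status >= 300) {
            // No retries or batching here, which is what makes this sink "naive".
            throw new RuntimeException("EventGate returned HTTP " + status);
        }
    }
}
```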

It depends on what you want to do :) EventGate will handle multi-DC routing, filling in some default values, and topic prefixes for you, but it is an extra hop to Kafka. As a prod system in a language with a good Kafka client, producing to Kafka directly is totally allowed. You'd be the first main user of the event platform not going through EventGate, but it is definitely something we want to support.

Perhaps we'll want to build some of the logic EventGate handles (as a proxy) into wikimedia-event-utilities (including validation, as you suggested).

I think the most important thing for me is that these events get stored in HDFS properly, and I'm not sure what the requirements are for that. I went with EventGate because I know it works :)

But using the Flink Kafka producers makes more sense: it's robust and easy to use, but I was worried about missing important parts of the process. If this kind of use case is something we'd like to support then I think I should go with this approach. I created T270371 to discuss the tools that wikimedia-event-utilities should provide to help with this use case.

T270371 is great thank you!

If this kind of use case is something we'd like to support

We do. EventGate is a poor substitute for a full-fledged Kafka client. Kafka has so many more features and knobs to dial, and for prod apps those are going to be important.
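As a rough sketch, producing the side-output events straight to Kafka from Flink could look something like this (Flink 1.12-era FlinkKafkaProducer API; the topic name and broker list are placeholders):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class SideOutputKafkaSinkSketch {
    static void attachKafkaSink(DataStream<String> lateEvents) {
        Properties props = new Properties();
        // Placeholder brokers; the real job would point at kafka main-eqiad / main-codfw.
        props.setProperty("bootstrap.servers", "kafka-main1001.example.org:9092");

        FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
            "eqiad.rdf-streaming-updater.lapsed-action",   // placeholder topic name
            new SimpleStringSchema(),                       // events are already JSON strings here
            props);                                         // this constructor gives at-least-once delivery

        lateEvents.addSink(producer);
    }
}
```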

Change 647723 merged by Ottomata:
[schemas/event/secondary@master] Add rdf-streaming-updater schemas for side outputs

https://gerrit.wikimedia.org/r/647723

Change 661727 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [wdqs] Add flink sideoutput stream definitions

https://gerrit.wikimedia.org/r/661727

Mentioned in SAL (#wikimedia-operations) [2021-02-08T15:11:47Z] <ottomata> set kafka topic retention to 31 days for (eqiad|codfw.rdf-streaming-updater.mutation) in kafka main-eqiad and main-codfw - T269619

Mentioned in SAL (#wikimedia-analytics) [2021-02-08T15:11:53Z] <ottomata> set kafka topic retention to 31 days for (eqiad|codfw.rdf-streaming-updater.mutation) in kafka main-eqiad and main-codfw - T269619
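For reference, 31 days corresponds to retention.ms = 2678400000. Programmatically, an equivalent change via the Kafka AdminClient would look roughly like this (broker and topic names are illustrative; the actual change was made operationally):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-main1001.example.org:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(
                ConfigResource.Type.TOPIC, "eqiad.rdf-streaming-updater.mutation");
            // 31 days in milliseconds
            AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", "2678400000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                Collections.singletonMap(topic, Collections.singletonList(setRetention)))
                 .all().get();
        }
    }
}
```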

Change 649715 merged by jenkins-bot:
[wikidata/query/rdf@master] Add json encoders for side output events

https://gerrit.wikimedia.org/r/649715

Change 661727 merged by jenkins-bot:
[operations/mediawiki-config@master] [wdqs] Add flink sideoutput stream definitions

https://gerrit.wikimedia.org/r/661727

Mentioned in SAL (#wikimedia-operations) [2021-02-10T12:26:44Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T269619: [wdqs] Add flink sideoutput stream definitions (duration: 01m 06s)

Change 663219 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Do not produce canary events for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/663219

Change 663219 merged by jenkins-bot:
[operations/mediawiki-config@master] Do not produce canary events for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/663219

Mentioned in SAL (#wikimedia-operations) [2021-02-10T15:26:25Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Do not produce canary events for rdf-streaming-updater streams - T269619 (duration: 01m 13s)

Change 668119 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set canary_events_enabled: true for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/668119

Change 668119 merged by jenkins-bot:
[operations/mediawiki-config@master] Set canary_events_enabled: true for rdf-streaming-updater streams

https://gerrit.wikimedia.org/r/668119

@Gehel @dcausse these events are now in HDFS. There aren't any Hive tables yet because no non-canary events have yet been ingested, and we filter out canary events from the Hive tables.