Flink pipelines will need a way to output JSON events that are guaranteed to be compatible with Event Platform conventions. We need Flink support for doing what EventGate does when it receives events, namely:
- ensuring the schema is allowed in the stream
- setting default values for required fields that are not already set, e.g. meta.dt and dt
- validating the event against its schema
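To illustrate the field-defaulting step, the sketch below fills in dt and meta.dt with an ISO-8601 timestamp when the producer did not set them. It is a stand-in only: the real code operates on a Jackson ObjectNode inside JsonEventGenerator, while this sketch uses a plain Map, and the class and method names here are hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Map;

public class EventDefaults {
    /**
     * Fill in required Event Platform timestamp fields if the producer did
     * not set them: dt (event time) and meta.dt. The Map stands in for the
     * Jackson ObjectNode the real JsonEventGenerator code works with.
     */
    @SuppressWarnings("unchecked")
    public static Map<String, Object> withDefaults(Map<String, Object> event, Instant now) {
        String iso = now.truncatedTo(ChronoUnit.MILLIS).toString();
        event.putIfAbsent("dt", iso);
        Map<String, Object> meta = (Map<String, Object>)
                event.computeIfAbsent("meta", k -> new HashMap<String, Object>());
        meta.putIfAbsent("dt", iso);
        return event;
    }
}
```

Note that an already-present dt is left untouched, so producers can still supply their own event time.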
Code for this already exists in wikimedia-event-utilities, in JsonEventGenerator. It is used in the wdqs-rdf-updater job, but its usage there is quite manual.
We should be able to plug JsonEventGenerator somewhere at the end of a streaming pipeline before producing the events to Kafka. There are a few ways to do this, and I'm not yet sure which is best:
- Pipeline step: an explicit map step in the pipeline, converting from a case class or a Row -> ObjectNode -> JsonEventGenerator.generateEvent -> Kafka
- Custom Flink serialization format: this could wrap JsonRowDataSerializationSchema somehow and call JsonEventGenerator.generateEvent on the ObjectNode converted from the Row, before serializing it to a byte[]
We chose the second option, the custom Flink serialization format.
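The chosen option can be sketched as a serialization schema that composes three steps: convert the Row to an event object, run it through the generator (field defaulting + validation), then encode to bytes. Everything below is a stand-in: the interface mimics Flink's SerializationSchema, the three functions stand in for the Row-to-ObjectNode conversion, JsonEventGenerator.generateEvent, and the JsonRowDataSerializationSchema-style byte encoding, and the class name is hypothetical.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.UnaryOperator;

public final class EventStreamSerializationSchema<RowT> {
    // Local stand-in for org.apache.flink.api.common.serialization.SerializationSchema.
    public interface SerializationSchema<E> {
        byte[] serialize(E element);
    }

    private final Function<RowT, Map<String, Object>> rowToEvent;     // Row -> ObjectNode stand-in
    private final UnaryOperator<Map<String, Object>> generateEvent;   // JsonEventGenerator.generateEvent stand-in
    private final Function<Map<String, Object>, byte[]> toBytes;      // JSON byte-encoding stand-in

    public EventStreamSerializationSchema(
            Function<RowT, Map<String, Object>> rowToEvent,
            UnaryOperator<Map<String, Object>> generateEvent,
            Function<Map<String, Object>, byte[]> toBytes) {
        this.rowToEvent = rowToEvent;
        this.generateEvent = generateEvent;
        this.toBytes = toBytes;
    }

    /** Convert, let the generator default fields and validate, then encode. */
    public byte[] serialize(RowT row) {
        return toBytes.apply(generateEvent.apply(rowToEvent.apply(row)));
    }
}
```

Because the preparation happens inside the serialization format itself, any pipeline that produces to Kafka through it gets Event Platform compliance for free, with no explicit map step to forget.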