Spark producers and consumers can currently access Event Platform Kafka topics via EventGate and EventStreams.
The goal of this task is to standardize the producer/consumer APIs and access patterns.
== Use cases
- Dumps 2.0 reconciliation
- Search Update Pipeline and weighted tags for image recommendations: {T372912}
- WDQS Updater reconcile mechanism (currently uses a simple [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/rdf-spark-tools/src/main/scala/org/wikidata/query/rdf/updater/reconcile/ReconciliationSender.scala|EventGate client]] from the spark driver)
== Done is
[x] Spark + Event Platform Kafka sink support is implemented in the wikimedia-event-utilities repository and meets [[ https://wikitech.wikimedia.org/wiki/Event_Platform/Producer_Requirements | Event Platform producer requirements ]].
[x] A new wikimedia-event-utilities version is released.
== Usage example based on the implemented Spark sink
```lang=java
dataFrame
.write()
.format("wmf-event-stream")
.option("event-stream-name", "mediawiki.cirrussearch.page_weighted_tags_change.rc0")
.option("event-schema-base-uris", "https://schema.wikimedia.org/repositories/primary/jsonschema")
.option("event-stream-config-uri", "https://meta.wikimedia.org/w/api.php?action=streamconfigs")
.option("event-schema-version", "1.0.0")
.option("event-stream-topic-prefix", "eqiad")
.option("kafka.bootstrap.servers", "kafka.local:9092")
// this is the default topic used by the kafka sink if a row does not specify one
.option("topic", "eqiad.mediawiki.cirrussearch.page_weighted_tags_change.rc0")
.save();
```
The above should also work in PySpark with near-identical syntax.
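For reference, a sketch of the equivalent PySpark call, assuming the same option names as the Java example and a `SparkSession` with the wikimedia-event-utilities sink jar on its classpath (the `data_frame` variable is hypothetical). The main syntactic differences from Java are that `write` is a property rather than a method and that a whole options dict can be applied at once via `options(**...)`:

```lang=python
# Sketch only: requires a running SparkSession plus the
# wikimedia-event-utilities Spark sink on the classpath.
# Option names mirror the Java example above.
sink_options = {
    "event-stream-name": "mediawiki.cirrussearch.page_weighted_tags_change.rc0",
    "event-schema-base-uris": "https://schema.wikimedia.org/repositories/primary/jsonschema",
    "event-stream-config-uri": "https://meta.wikimedia.org/w/api.php?action=streamconfigs",
    "event-schema-version": "1.0.0",
    "event-stream-topic-prefix": "eqiad",
    "kafka.bootstrap.servers": "kafka.local:9092",
    # default topic used by the Kafka sink if a row does not specify one
    "topic": "eqiad.mediawiki.cirrussearch.page_weighted_tags_change.rc0",
}

(data_frame.write
    .format("wmf-event-stream")
    .options(**sink_options)
    .save())
```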
== Related
- {T214430}
- {T372912}
- {T358373}