Spark producers and consumers can currently access Event Platform's Kafka topics only via EventGate and EventStreams.
The goal of this task is to standardize producer/consumer APIs and access patterns.
Use cases
- Dumps 2.0 reconciliation
- Search Update Pipeline and weighted tags for image recommendations: T372912
- WDQS Updater reconcile mechanism (currently uses a simple EventGate client from the spark driver)
Done is
- Spark + Event Platform Kafka sink support that meets Event Platform producer requirements is implemented in the wikimedia-event-utilities repository.
- A new wikimedia-event-utilities version is released.
Usage example based on the implemented Spark sink:
```java
dataFrame
    .write()
    .format("wmf-event-stream")
    .option("event-stream-name", "mediawiki.cirrussearch.page_weighted_tags_change.rc0")
    .option("event-schema-base-uris", "https://schema.wikimedia.org/repositories/primary/jsonschema")
    .option("event-stream-config-uri", "https://meta.wikimedia.org/w/api.php?action=streamconfigs")
    .option("event-schema-version", "1.0.0")
    .option("event-stream-topic-prefix", "eqiad")
    .option("kafka.bootstrap.servers", "kafka.local:9092")
    // this is the default topic used by the kafka sink if a row does not specify one
    .option("topic", "eqiad.mediawiki.cirrussearch.page_weighted_tags_change.rc0")
    .save();
```
The above should also work in pyspark with nearly identical syntax (in pyspark, `write` is a property rather than a method call).
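As a sketch of what that would look like, here is a pyspark version of the same write. This is an assumption-laden illustration, not a verified API: the `"wmf-event-stream"` format name, option keys, and the `kafka.local:9092` broker are all taken directly from the Java example above, and `data_frame` is a hypothetical placeholder for an existing DataFrame.

```python
# pyspark sketch of the Java example above.
# Assumptions: the "wmf-event-stream" source is on the classpath, and
# `data_frame` is a pre-existing pyspark DataFrame matching the event schema.
(
    data_frame.write  # note: `write` is a property in pyspark, not a method
    .format("wmf-event-stream")
    .option("event-stream-name", "mediawiki.cirrussearch.page_weighted_tags_change.rc0")
    .option("event-schema-base-uris", "https://schema.wikimedia.org/repositories/primary/jsonschema")
    .option("event-stream-config-uri", "https://meta.wikimedia.org/w/api.php?action=streamconfigs")
    .option("event-schema-version", "1.0.0")
    .option("event-stream-topic-prefix", "eqiad")
    .option("kafka.bootstrap.servers", "kafka.local:9092")
    # default topic used by the kafka sink if a row does not specify one
    .option("topic", "eqiad.mediawiki.cirrussearch.page_weighted_tags_change.rc0")
    .save()
)
```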