In T368782: MediaWiki Reconciliation API, we will be emitting a new kind of 'reconciliation' event. Ideally, the schema of this event will be exactly the same as the page change event schema.
In this task, we should create a Flink job that will:
- Consume this new event stream
- Create an enriched version that includes the content slots, ideally identical to mediawiki_page_content_change_v1 (could not find schema?).
- A separate Gobblin process should make this stream available as a Hive table under the event Hive database.
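To make the intended shape of the job concrete, here is a minimal PyFlink sketch of the topology. The broker address, topic names, and the enrichment step are placeholder assumptions, not final values; the real job would likely live in mediawiki-event-enrichment and reuse its utilities.

```python
# Minimal PyFlink sketch: consume reconciliation events, enrich them with
# content slots, and produce the enriched stream back to Kafka.
# All broker/topic names below are placeholders.
import json

from pyflink.common import Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaRecordSerializationSchema,
    KafkaSink,
    KafkaSource,
)

BROKERS = "kafka-jumbo1001.eqiad.wmnet:9092"  # placeholder broker


def enrich(raw: str) -> str:
    """Attach content slots to a reconciliation event (stubbed here;
    see the Action API sketch under Requirements)."""
    event = json.loads(raw)
    event["revision"] = {"content_slots": {}}  # real enrichment goes here
    return json.dumps(event)


env = StreamExecutionEnvironment.get_execution_environment()

source = (
    KafkaSource.builder()
    .set_bootstrap_servers(BROKERS)
    .set_topics("eqiad.mediawiki.page_reconciliation")  # placeholder topic
    .set_group_id("mw-reconciliation-enrich")
    .set_starting_offsets(KafkaOffsetsInitializer.latest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers(BROKERS)
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("eqiad.mediawiki.page_reconciliation_enriched")  # placeholder
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .build()
)

(
    env.from_source(source, WatermarkStrategy.no_watermarks(), "reconciliation-source")
    .map(enrich, output_type=Types.STRING())
    .sink_to(sink)
)
env.execute("mw-reconciliation-enrich")
```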
Requirements
This Flink enrichment job should target the DSE cluster. This new k8s service will:
- not be publicly accessible via a <appname>.wikimedia.org subdomain.
- not require users logging in.
- consume from and produce to kafka-jumbo.
- _may_ need to consume from kafka main (@gmodena to clarify this requirement).
- need to reach MediaWiki Action API endpoints via internal routes (see the sketch below).
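On the last point, the internal Action API call could look roughly like the sketch below. The mw-api-int-ro.discovery.wmnet endpoint and port are assumptions to be confirmed with Data SRE; the Host header selects the target wiki.

```python
import requests

# Assumed internal discovery endpoint for read-only Action API traffic;
# the exact hostname/port must be confirmed with Data SRE.
MW_API_INT = "https://mw-api-int-ro.discovery.wmnet:4446/w/api.php"


def fetch_content_slots(domain: str, rev_id: int) -> dict:
    """Fetch all content slots of a revision via the internal route.

    The discovery endpoint fronts all wikis, so the Host header
    selects the target wiki.
    """
    resp = requests.get(
        MW_API_INT,
        params={
            "action": "query",
            "prop": "revisions",
            "revids": rev_id,
            "rvprop": "content|contentmodel",
            "rvslots": "*",
            "format": "json",
            "formatversion": "2",
        },
        headers={"Host": domain},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```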
Resources and expected load
Some initial estimates; they might need refinement as we go.
- Flink topology: one job manager and two task managers, managed by the Flink k8s operator. Each will be allocated a dedicated k8s pod.
- 1G of memory allocated to the job manager and 1.5G allocated to each task manager should be safe defaults.
- Expected load: significantly lower than mw-page-content-change-enrich. This app will consume events from a topic that is produced to by a daily (hopefully hourly) batch process. The worst-case scenario estimated so far is 100k requests/hour.
- Flink HA: TBC. In its first iteration, we probably won't need HA for this job. We might still want an object store to snapshot Kafka offsets. We could experiment with snapshotting to Ceph if available (see the sketch below).
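If we do checkpoint to an object store, the PyFlink configuration would be along these lines. The interval and bucket path are placeholders; an S3-compatible Ceph gateway would additionally need a Flink s3 filesystem plugin and credentials on the image.

```python
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 30s (placeholder interval, to be tuned).
env.enable_checkpointing(30_000, CheckpointingMode.EXACTLY_ONCE)

# Hypothetical bucket on an S3-compatible object store (e.g. a Ceph RGW);
# requires a Flink s3 filesystem plugin plus endpoint/credential config.
env.get_checkpoint_config().set_checkpoint_storage_dir(
    "s3://mw-reconciliation-enrich/checkpoints"
)
```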
Actions
The Dumps team will:
- Provide Data SREs a namespace name.
- Set up a new job in the mediawiki-event-enrichment repo, for integration with the Deployment Pipeline.
- Add a helmfile and values to deployment-charts, based on the flink-app Helm chart.
- Add the new input/output streams to EventStreamConfig (the Gobblin consumer will be enabled by default); the entries can be sanity-checked as sketched below.
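Once merged, the new stream entries could be sanity-checked against the EventStreamConfig API, e.g. as follows (the stream names are placeholders for whatever we register):

```python
import requests

# Sanity-check the new stream entries via the EventStreamConfig API;
# the stream names are placeholders for whatever we register.
resp = requests.get(
    "https://meta.wikimedia.org/w/api.php",
    params={
        "action": "streamconfigs",
        "format": "json",
        "streams": "mediawiki.page_reconciliation|mediawiki.page_reconciliation_enriched",
        "all_settings": "true",
    },
    timeout=10,
)
print(resp.json()["streams"])
```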
As discussed in Slack, the following steps will require Data SRE actions:
- Create the namespace.
- Create the read/deploy credentials.