The existing webrequest streams are not technically 'Event Platform' streams. Making them Event Platform streams would allow us to consume webrequest from Kafka using Flink via the tooling we are developing as part of T308356: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate, or via any other Event Platform tooling. This would be nice for {T310997}.
It would also make the webrequest Kafka topics automatically documented in DataHub.
We are planning to do this as part of T351117: Move analytics log from Varnish to HAProxy, and in doing so create a new Event Platform webrequest stream, allowing us to eventually decommission the old one.
Suggested name for the new stream: webrequest.frontend, explicitly composed of the topics webrequest.frontend.text and webrequest.frontend.upload.
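For context, once the stream exists, any Kafka client can consume its composite topics directly. A minimal sketch using kafka-python (the broker address is a placeholder; the topic names are the ones proposed above):

```python
import json

from kafka import KafkaConsumer

# Subscribe to the stream's composite topics directly. The broker
# address is a placeholder; the topic names are the ones proposed above.
consumer = KafkaConsumer(
    'webrequest.frontend.text',
    'webrequest.frontend.upload',
    bootstrap_servers='localhost:9092',  # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    event = message.value
    print(message.topic, event.get('uri_host'))
```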
Tasks
- An event schema declared that matches the webrequest fields: patch
- The following fields added to the webrequest output format emitted by the HAProxy producer (see the sample event sketch after this list):
- $schema
- meta.stream (this can just be set to 'webrequest')
- Possibly also: meta.dt, meta.request_id, meta.id, etc. To be discussed.
- webrequest.frontend stream declared in event stream config, with its composite topics explicitly listed and canary events enabled: patch
- Gobblin Hadoop ingestion changed to use EventStreamConfig to look up the topics to ingest, instead of a topic wildcard: patch (see the stream config lookup sketch after this list)
- Airflow webrequest job to ingest the new raw data into a new wmf_raw.webrequest_frontend table.
- Note that this table will need partitions added for each of the composite topics / ingested raw HDFS directory paths.
- Analysis to ensure the data in the new raw wmf_raw.webrequest_frontend table matches the data in the old wmf_raw.webrequest table (see the comparison sketch after this list).
- Alter the wmf.webrequest table to match the new schema (should just be adding fields).
- Airflow webrequest refine job logic changed (see the refine sketch after this list):
- to ingest wmf_raw.webrequest_frontend into the existing wmf.webrequest table, once we are ready to finalize the migration.
- To be backwards compatible, this will need to correctly populate the wmf.webrequest webrequest_source Hive partition, either from the directory path (which matches the Kafka topic), as is done now, or from some new event data field we might add (cache_cluster?).
- to filter out canary events (this probably also needs to happen in the raw sequence stats checks?)
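To make the producer-side field additions concrete, here is a hypothetical webrequest.frontend event with the Event Platform envelope fields added. All values are invented for illustration; the real field set is whatever the declared schema and the HAProxy producer end up agreeing on:

```python
# Hypothetical event; every value below is invented for illustration.
sample_event = {
    '$schema': '/webrequest/1.0.0',    # URI of the declared event schema (hypothetical version)
    'meta': {
        'stream': 'webrequest',        # per the note above, possibly just 'webrequest'
        'dt': '2024-06-01T00:00:00Z',  # event time, if we decide to add it
        'request_id': '00000000-0000-0000-0000-000000000000',  # if we decide to add it
    },
    # ...existing webrequest fields, for example:
    'hostname': 'cp0000.eqiad.wmnet',
    'uri_host': 'en.wikipedia.org',
    'uri_path': '/wiki/Main_Page',
    'http_status': '200',
}
```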
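A sketch of the EventStreamConfig lookup that Gobblin (or anything else) could use to resolve the stream to its composite topics. The endpoint is the MediaWiki streamconfigs API; the exact parameters and response shape here are an assumption, so check the EventStreamConfig extension docs before relying on them:

```python
import requests

# Resolve the stream name to its Kafka topics via the EventStreamConfig
# API, instead of subscribing with a topic wildcard. Parameter names and
# response shape are assumptions; verify against the extension docs.
resp = requests.get(
    'https://meta.wikimedia.org/w/api.php',
    params={
        'action': 'streamconfigs',
        'format': 'json',
        'streams': 'webrequest.frontend',
        'all_settings': 1,
    },
    timeout=10,
)
resp.raise_for_status()
stream_config = resp.json()['streams']['webrequest.frontend']
topics = stream_config['topics']
# Expected (per this task): ['webrequest.frontend.text', 'webrequest.frontend.upload']
```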
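For the raw data comparison, a simple starting point is diffing hourly row counts between the two raw tables with Spark, then drilling into field-level differences. A sketch, assuming the new table reuses the partition layout of the existing raw table (note the new table will also contain canary events, which would need to be excluded or accounted for):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compare row counts for one example hour; a real analysis would loop
# over hours and also diff field values. Partition columns follow the
# existing wmf_raw.webrequest layout (an assumption for the new table).
where = "webrequest_source='text' AND year=2024 AND month=6 AND day=1 AND hour=0"

old_count = spark.sql(
    f"SELECT COUNT(*) AS c FROM wmf_raw.webrequest WHERE {where}"
).first()['c']
new_count = spark.sql(
    f"SELECT COUNT(*) AS c FROM wmf_raw.webrequest_frontend WHERE {where}"
).first()['c']

print(f'old={old_count} new={new_count} diff={new_count - old_count}')
```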
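And a sketch of the two refine-side changes in PySpark. The 'topic' column and the canary convention (Event Platform canary events setting meta.domain to 'canary') are assumptions to verify; webrequest_source is derived from the composite topic name, mirroring the current directory-path approach:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_df = spark.table('wmf_raw.webrequest_frontend')

refined = (
    raw_df
    # Derive the Hive partition value from the composite topic name,
    # e.g. 'webrequest.frontend.text' -> 'text'. Assumes a 'topic'
    # column is available (or derive it from the input directory path).
    .withColumn(
        'webrequest_source',
        F.regexp_extract(F.col('topic'), r'^webrequest\.frontend\.(\w+)$', 1),
    )
    # Drop canary events; assumes the usual Event Platform convention of
    # meta.domain == 'canary'. A real job would also handle null domains.
    .where(F.col('meta.domain') != 'canary')
)
```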
Once these are done, we should be able to treat webrequest.frontend like any other event stream. The stream will be ingested into Hive as wmf_raw.webrequest_frontend and refined into the existing Hive wmf.webrequest table.