The existing webrequest streams are not technically 'Event Platform' streams. Making them Event Platform streams would allow us to use the tooling we are developing as part of {T308356} to consume webrequest from Kafka using Flink, or any other event platform tooling. This would be nice for {T310997}.
It would also make the webrequest Kafka topics automatically documented in DataHub.
We are planning to do this as part of {T351117}, and in doing so create a new event platform webrequest stream, allowing us to eventually decommission the old one.
Suggested name of new stream: `webrequest.frontend`, explicitly composed of topics `webrequest.frontend.text` and `webrequest.frontend.upload`.
==== Tasks
[] An event schema declared that matches the webrequest fields: [[ https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/983898 | patch ]]
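For illustration only (the linked patch is authoritative; the title, `$id`, and field names below are assumptions based on the existing webrequest Hive table), the schema might look roughly like:

```yaml
# Hypothetical fragment of the webrequest event schema (JSONSchema in YAML,
# as used in schemas/event/primary). Names here are guesses, not the patch.
title: webrequest
$id: /webrequest/1.0.0
type: object
allOf:
  - $ref: /fragment/common/1.0.0   # assumed: brings in $schema, meta.stream, meta.dt, ...
properties:
  hostname:
    type: string
    description: Cache host that served the request
  sequence:
    type: integer
    description: Per-host request sequence number
  http_status:
    type: string
  uri_host:
    type: string
  uri_path:
    type: string
```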
[] The following fields added to webrequest's output format from the haproxy producer:
- `$schema`
- `meta.stream` (this can be just set to 'webrequest')
-- Possibly also: `meta.dt`, `meta.request_id`, `meta.id`, etc. To be discussed.
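A minimal sketch of what adding these envelope fields to a produced record could look like. This is not the haproxy implementation; the `$schema` URI, the `to_event` helper, and the choice of `meta` subfields are all assumptions for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical schema URI; the real value is defined by the schema patch.
SCHEMA_URI = "/webrequest/1.0.0"

def to_event(record: dict, stream: str = "webrequest") -> dict:
    """Wrap a raw webrequest log record with Event Platform envelope fields."""
    event = dict(record)
    event["$schema"] = SCHEMA_URI
    event["meta"] = {
        "stream": stream,
        # Candidate optional fields, still to be discussed:
        "dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "id": str(uuid.uuid4()),
    }
    return event

event = to_event({"uri_host": "en.wikipedia.org", "http_status": "200"})
print(json.dumps(event, sort_keys=True))
```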
[] `webrequest.frontend` stream declared in event stream config, with its composite topics explicitly declared: [[ https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/983905 | patch ]], with canary events enabled.
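As a rough sketch (the linked mediawiki-config patch is authoritative; key names follow the general EventStreamConfig shape and may differ from the actual change):

```yaml
# Hypothetical wgEventStreams entry for the new stream.
webrequest.frontend:
  schema_title: webrequest          # assumed schema title
  topics:                           # composite topics declared explicitly
    - webrequest.frontend.text
    - webrequest.frontend.upload
  canary_events_enabled: true
```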
[] Gobblin hadoop ingestion uses eventstreamconfig to ingest the topics, instead of topic wildcard: [[ https://gerrit.wikimedia.org/r/c/analytics/refinery/+/983926 | patch ]]
[] Airflow webrequest ingestion job logic changed:
-- [] to ingest the new raw data to a new `wmf_raw.webrequest_frontend` table
[] Analysis to ensure the data in the new raw `wmf_raw.webrequest_frontend` table matches the old `wmf_raw.webrequest` table.
[] Alter the `wmf.webrequest` table to match the new schema (should just be adding fields).
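The alter step might look something like the following HiveQL. The column names and types are placeholders; the actual new fields depend on the final event schema:

```sql
-- Hypothetical; actual columns come from the event schema.
ALTER TABLE wmf.webrequest ADD COLUMNS (
  `meta` STRUCT<stream: STRING, dt: STRING, id: STRING>
    COMMENT 'Event Platform metadata envelope'
);
```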
[] Airflow webrequest refine job logic changed:
-- [] to refine `wmf_raw.webrequest_frontend` into the existing `wmf.webrequest` table, once we are ready to finalize the migration.
-- [] to filter out canary events (probably also needs to be done in raw sequence stats too?)
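If canary events for this stream follow the usual Event Platform convention of being marked with `meta.domain = 'canary'` (an assumption to verify), the refine-side filter could be sketched as:

```sql
-- Hypothetical filter in the refine query; assumes canary events
-- are identifiable via meta.domain = 'canary'.
SELECT *
FROM wmf_raw.webrequest_frontend
WHERE meta.domain IS NULL OR meta.domain != 'canary';
```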
Once these are done, we should be able to treat `webrequest.frontend` like any other event stream. The stream will be ingested into Hive as `wmf_raw.webrequest_frontend` and refined into the existing `wmf.webrequest` table.