
[Event Platform] Declare webrequest as an Event Platform stream
Open, Low, Public

Description

The existing webrequest streams are not technically 'Event Platform' streams. Making them Event Platform streams would allow us to use the tooling we are developing as part of T308356: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate to consume webrequest from Kafka using Flink, or any other Event Platform tooling. This would be nice for {T310997}.

It would also make the webrequest Kafka topics automatically documented in DataHub.

We are planning to do this as part of T351117: Move analytics log from Varnish to HAProxy, creating a new Event Platform webrequest stream in the process and allowing us to eventually decommission the old one.

Suggested name of new stream: webrequest.frontend, explicitly composed of topics webrequest.frontend.text and webrequest.frontend.upload.
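
Once the stream is declared, Event Platform tooling can discover its composite topics from stream config instead of hardcoding them. A minimal sketch, assuming the stream name proposed above and that the explicit topic list is exposed as a 'topics' setting:

```python
import requests

# Resolve the composite topics of the proposed stream via the
# EventStreamConfig MediaWiki API. The stream name and the 'topics'
# setting shown here are assumptions based on the proposal above.
resp = requests.get(
    "https://meta.wikimedia.org/w/api.php",
    params={
        "action": "streamconfigs",
        "format": "json",
        "streams": "webrequest.frontend",
    },
)
resp.raise_for_status()
config = resp.json()["streams"]["webrequest.frontend"]

# Expected: ["webrequest.frontend.text", "webrequest.frontend.upload"]
print(config.get("topics"))
```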

Tasks
  • An event schema declared that matches the webrequest fields: patch
  • The following fields added to webrequest's output format from the HAProxy producer (see the sketch after this list):
    • $schema
    • meta.stream (this can just be set to 'webrequest')
    • Possibly also: meta.dt, meta.request_id, meta.id, etc. To be discussed.
  • webrequest.frontend stream declared in event stream config, with its composite topics explicitly declared: patch, with canary events enabled.
  • Gobblin Hadoop ingestion uses EventStreamConfig to ingest the topics, instead of a topic wildcard: patch
  • Airflow webrequest job to ingest the new raw data into a new wmf_raw.webrequest_frontend table.
    • Note that this table will need partitions added for each of the composite topics/ingested HDFS raw directory paths.
  • Analysis to ensure the data in the new raw wmf_raw.webrequest_frontend table matches the old wmf_raw.webrequest table.
  • Alter the wmf.webrequest table to match the new schema (should just be adding fields).
  • Airflow webrequest refine job logic changed:
    • to ingest wmf_raw.webrequest_frontend into the existing wmf.webrequest table, once we are ready to finalize the migration.
    • to be backwards compatible, this will need to correctly populate the wmf.webrequest webrequest_source Hive partition, either from the directory path (which matches the Kafka topic) as is done now, or from some new event data field we might add (cache_cluster?).
    • to filter out canary events (this probably also needs to happen in the raw sequence stats job?); a pyspark filter sketch follows below.
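
As a concrete illustration of the producer-side changes in the list above, here is a minimal sketch of what a single webrequest event could look like after the Event Platform fields are added. All values are hypothetical, and the schema URI and meta field set are assumptions pending the discussion above:

```python
# Hypothetical webrequest event with the added Event Platform fields.
example_event = {
    # Added Event Platform fields:
    "$schema": "/development/webrequest/1.0.0",  # assumed schema URI
    "meta": {
        "stream": "webrequest.frontend",
        "dt": "2024-03-27T13:30:45Z",  # event time, if we add meta.dt
        "request_id": "00000000-0000-0000-0000-000000000000",  # from X-Request-Id?
    },
    # A few of the existing webrequest fields, for context:
    "hostname": "cp0000.example.wmnet",  # placeholder hostname
    "dt": "2024-03-27T13:30:45Z",
    "uri_host": "en.wikipedia.org",
    "uri_path": "/wiki/Main_Page",
    "http_status": "200",
}
```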

Once these are done, we should be able to treat webrequest.frontend like any other event stream. The stream will be ingested into Hive as wmf_raw.webrequest_frontend and refined into the existing Hive wmf.webrequest.
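
For the canary filtering item in the task list, a minimal pyspark sketch, assuming the usual Event Platform convention of marking canary events with meta.domain = 'canary' and the raw table name proposed above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

raw = spark.table("wmf_raw.webrequest_frontend")

# Drop canary events; the null check keeps events that do not set
# meta.domain at all.
real_events = raw.filter(
    col("meta.domain").isNull() | (col("meta.domain") != "canary")
)
```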

Event Timeline

Ahoelzl renamed this task from Declare webrequest as an Event Platform stream to [Event Platform] Declare webrequest as an Event Platform stream. Oct 20 2023, 5:29 PM

We are going to do this, but as a new stream. This new stream will be used for HAProxy logging as described in https://phabricator.wikimedia.org/T351117#9413691.

Change 983898 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] WIP - Add webrequest schema

https://gerrit.wikimedia.org/r/983898

Change 983905 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] WIP - add webrequest.frontend stream

https://gerrit.wikimedia.org/r/983905

Change 983926 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] WIP - Add gobblin job webrequest_frontend to pull new webrequest stream

https://gerrit.wikimedia.org/r/983926

How should we lay out and name the new stream(s)?

Currently, we have webrequest_text and webrequest_upload topics. Which topic is produced to is determined by which cache_cluster the server is in. IIUC, HAProxy will be colocated in the same way as Varnish, so we can continue to use this layout if we want to. (cc @Fabfur to confirm).

We could consider changing this. I had thought that perhaps we should prefix topics by datacenter, like we do for other event streams, e.g. esams.webrequest.

However, the more I look at this, the more I think we should keep the topic layout the same, and keep the Hive partitioning of the raw and refined webrequest tables the same too. We can and should populate common and useful fields where we can (e.g. meta.request_id from the X-Request-Id header), but using the HAProxy migration as an excuse to refactor the webrequest schema and stream layout would make this task grow a lot.

So. Let's make one Event Platform stream with two topics.

How about:

  • New stream declared as webrequest.frontend, composed of topics webrequest.frontend.text and webrequest.frontend.upload.
  • Hive ingestion logic changed to extract the webrequest_source 'text' vs 'upload' partitions via a regex on the topic name, just like it does now. We just have to change the regex to match the new topic names (a sketch follows).
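
A minimal sketch of that regex change, using the topic names proposed above:

```python
import re

# Extract the webrequest_source partition value ('text' or 'upload')
# from the new topic names, mirroring what ingestion does today for
# webrequest_text / webrequest_upload.
TOPIC_PATTERN = re.compile(r"^webrequest\.frontend\.(text|upload)$")

def webrequest_source(topic: str) -> str:
    match = TOPIC_PATTERN.match(topic)
    if match is None:
        raise ValueError(f"Unexpected webrequest topic: {topic}")
    return match.group(1)

assert webrequest_source("webrequest.frontend.text") == "text"
assert webrequest_source("webrequest.frontend.upload") == "upload"
```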

@Milimetric @gmodena @aqu, whatcha think?

@Antoine_Quhen asked if we should consider making the new webrequest Hive table an Iceberg table. @JAllemandou @xcollazo can/should we do this?

How should we lay out and name the new stream(s)?

Currently, we have webrequest_text and webrequest_upload topics. Which topic is produced to is determined by which cache_cluster the server is in. IIUC, HAProxy will be colocated in the same way as Varnish, so we can continue to use this layout if we want to. (cc @Fabfur to confirm).

Yep, that's correct!

As for Hive tables: I'm trying to decide how best to do the migration. Perhaps it would be easiest to keep the existing wmf.webrequest refined Hive table as is. The raw table would change to webrequest_frontend, as imported from the new stream, but the webrequest refine Airflow job would switch to refining from the webrequest_frontend raw table once we are ready to do the migration cutover.

This would make the code changes on our side much smaller, and would also reduce cognitive overhead for users of the wmf.webrequest dataset. We wouldn't have to change any terminology and start saying e.g. 'the webrequest frontend table'. We wouldn't have to update any existing documentation, e.g. on DataHub and Wikitech.

This means that we would have to do our analysis and comparison on the wmf_raw.webrequest_frontend table, or we could do a few manual refinements into a temporary refined table for analysis purposes before we fully migrate.

Ottomata removed subscribers: ayounsi, CDanis, ssingh and 4 others.

Hm, alternatively, we could just give the raw and refined tables brand-new names, with new ingestion jobs, during the migration, and then do the final cutover with a RENAME TABLE (see the sketch below).
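
A sketch of what that cutover could look like in Spark SQL; the table names here are placeholders, not decided in this task:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical final cutover: swap the newly built refined table into
# place under the existing name.
spark.sql("ALTER TABLE wmf.webrequest RENAME TO wmf.webrequest_legacy")
spark.sql("ALTER TABLE wmf.webrequest_new RENAME TO wmf.webrequest")
```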

Thoughts?

@Antoine_Quhen asked if we should consider making the new webrequest Hive table an Iceberg table. @JAllemandou @xcollazo can/should we do this?

A couple of questions back at you: is webrequest append-only? If not, how do we do rewrites today? Append-only would be significantly easier to tune, while dealing with MERGEs on big Iceberg tables requires deep tuning given our limited cluster resources (see T340863 for all the fun details of making Dumps 2.0 behave reasonably well).

A couple of questions back at you: is webrequest append-only?

yes

If not, how do we do rewrites today?

If we do rewrites, they are per hour: we re-refine the entire hour (see the sketch below).
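
For context, a sketch of the shape of that per-hour rewrite under the current Hive layout. refined_webrequest_hour is a placeholder for the output of the actual refinement query, and the partition spec follows the existing wmf.webrequest partitioning (webrequest_source/year/month/day/hour):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Idempotent per-hour rewrite: overwrite exactly one Hive partition.
# refined_webrequest_hour is a placeholder view for that hour's
# refinement output (its select list must exclude the partition
# columns, which are fixed by the PARTITION clause).
spark.sql("""
    INSERT OVERWRITE TABLE wmf.webrequest
    PARTITION (webrequest_source = 'text', year = 2024, month = 3, day = 27, hour = 13)
    SELECT * FROM refined_webrequest_hour
""")
```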

We just had a discussion in DE standup about T335306: [SPIKE] Evaluation on iceberg sensor for airflow. I'm sure there are many existing Hive sensors on the webrequest table. I'd rather not block this migration on that task. I suggest we keep this as a regular Hive table.

We just had a discussion in DE standup about T335306: [SPIKE] Evaluation on iceberg sensor for airflow. I'm sure there are many existing Hive sensors on the webrequest table. I'd rather not block this migration on that task. I suggest we keep this as a regular Hive table.

You'd need that and T338065: [Iceberg Migration] Implement mechanism for automatic Iceberg data deletion and optimization.

It would be super cool if webrequest was Iceberg, but I agree all this work would scope creep your original intention for this ticket.

@Fabfur and I would like to start some integration tests in the short term. I moved the webrequest schema from GA to development in the primary repo. This follows the same process we adopted with page_change, and should allow for faster iteration without messing around with schema versions.

Similarly, I tagged the stream declaration as webrequest.frontend.rc0 in mediawiki-config.

Change 1012656 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] Add webrequest_frontent raw schema.

https://gerrit.wikimedia.org/r/1012656

Change #983905 merged by jenkins-bot:

[operations/mediawiki-config@master] Add webrequest.frontend.rc0 stream

https://gerrit.wikimedia.org/r/983905

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:16:47Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:20:33Z] <hashar@deploy1002> otto and hashar: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:37:47Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] (duration: 20m 59s)

Hello. FYI, we are receiving some alerts about failed produce_canary_events jobs, caused by being unable to find the webrequest schema, e.g.:

24/03/27 13:30:45 ERROR ResourceLoader: Caught exception when trying to load resource.
org.wikimedia.eventutilities.core.util.ResourceLoadingException: Failed loading resource. (resource: https://schema.discovery.wmnet/repositories/primary/jsonschema/development/webrequest/latest)
	at org.wikimedia.eventutilities.core.util.ResourceLoader.loadFirst(ResourceLoader.java:122)

Let me know if there's anything I can do to help, but maybe this is all fine.

Change #983898 merged by jenkins-bot:

[schemas/event/primary@master] development: add webrequest schema

https://gerrit.wikimedia.org/r/983898

Change #1015260 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Change #1015260 merged by jenkins-bot:

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:12:44Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:28:20Z] <hashar@deploy1002> gmodena and hashar: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:46:48Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] (duration: 34m 03s)

Change #1017041 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

Change #983926 merged by Gmodena:

[analytics/refinery@master] Add gobblin job webrequest_frontend_rc0

https://gerrit.wikimedia.org/r/983926

Change #1017041 merged by Btullis:

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

Change #1026498 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] primary: add webrequest schema

https://gerrit.wikimedia.org/r/1026498

Change #1026506 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.v1.

https://gerrit.wikimedia.org/r/1026506