The existing webrequest streams are not technically 'Event Platform' streams. Making them Event Platform streams would allow us to consume webrequest from Kafka using Flink via the tooling we are developing as part of T308356: [Shared Event Platform] Ability to use Event Platform streams in Flink without boilerplate, or via any other Event Platform tooling. This would be nice for {T310997}.
It would also make the webrequest Kafka topics automatically documented in DataHub.
We are planning to do this as part of T351117: Move analytics log from Varnish to HAProxy, and in doing so create a new Event Platform webrequest stream, allowing us to eventually decommission the old one.
Suggested name for the new stream: webrequest.frontend, explicitly composed of the topics webrequest.frontend.text and webrequest.frontend.upload.
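For context, once the stream exists, any Kafka client can consume its composite topics directly. A minimal sketch using kafka-python (the broker address is a placeholder; the topic names are the ones proposed above):

```python
import json

from kafka import KafkaConsumer

# Subscribe to the stream's composite topics directly. The broker
# address is a placeholder; the topic names are the ones proposed above.
consumer = KafkaConsumer(
    'webrequest.frontend.text',
    'webrequest.frontend.upload',
    bootstrap_servers='localhost:9092',  # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    event = message.value
    print(message.topic, event.get('uri_host'))
```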
Tasks
- An event schema declared that matches the webrequest fields: patch
- The following fields added to the webrequest output format emitted by the HAProxy producer (see the sample event sketch after this list):
- $schema
- meta.stream (this can just be set to 'webrequest')
- Possibly also: meta.dt, meta.request_id, meta.id, etc. To be discussed.
- webrequest.frontend stream declared in event stream config, with its composite topics explicitly listed and canary events enabled: patch
- Gobblin Hadoop ingestion changed to use EventStreamConfig to look up the topics to ingest, instead of a topic wildcard: patch (see the stream config lookup sketch after this list)
- Airflow webrequest job to ingest the new raw data into a new wmf_raw.webrequest_frontend table.
- Note that this table will need partitions added for each of the composite topics / ingested raw HDFS directory paths.
- Analysis to ensure the data in the new raw wmf_raw.webrequest_frontend table matches the data in the old wmf_raw.webrequest table (see the comparison sketch after this list).
- Alter the wmf.webrequest table to match the new schema (should just be adding fields).
- Airflow webrequest refine job logic changed (see the refine sketch after this list):
- to ingest wmf_raw.webrequest_frontend into the existing wmf.webrequest table, once we are ready to finalize the migration.
- To be backwards compatible, this will need to correctly populate the wmf.webrequest webrequest_source Hive partition, either from the directory path (which matches the Kafka topic), as is done now, or from some new event data field we might add (cache_cluster?).
- to filter out canary events (this probably also needs to happen in the raw sequence stats checks?)
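To make the producer-side field additions concrete, here is a hypothetical webrequest.frontend event with the Event Platform envelope fields added. All values are invented for illustration; the real field set is whatever the declared schema and the HAProxy producer end up agreeing on:

```python
# Hypothetical event; every value below is invented for illustration.
sample_event = {
    '$schema': '/webrequest/1.0.0',    # URI of the declared event schema (hypothetical version)
    'meta': {
        'stream': 'webrequest',        # per the note above, possibly just 'webrequest'
        'dt': '2024-06-01T00:00:00Z',  # event time, if we decide to add it
        'request_id': '00000000-0000-0000-0000-000000000000',  # if we decide to add it
    },
    # ...existing webrequest fields, for example:
    'hostname': 'cp0000.eqiad.wmnet',
    'uri_host': 'en.wikipedia.org',
    'uri_path': '/wiki/Main_Page',
    'http_status': '200',
}
```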
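A sketch of the EventStreamConfig lookup that Gobblin (or anything else) could use to resolve the stream to its composite topics. The endpoint is the MediaWiki streamconfigs API; the exact parameters and response shape here are an assumption, so check the EventStreamConfig extension docs before relying on them:

```python
import requests

# Resolve the stream name to its Kafka topics via the EventStreamConfig
# API, instead of subscribing with a topic wildcard. Parameter names and
# response shape are assumptions; verify against the extension docs.
resp = requests.get(
    'https://meta.wikimedia.org/w/api.php',
    params={
        'action': 'streamconfigs',
        'format': 'json',
        'streams': 'webrequest.frontend',
        'all_settings': 1,
    },
    timeout=10,
)
resp.raise_for_status()
stream_config = resp.json()['streams']['webrequest.frontend']
topics = stream_config['topics']
# Expected (per this task): ['webrequest.frontend.text', 'webrequest.frontend.upload']
```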
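For the raw data comparison, a simple starting point is diffing hourly row counts between the two raw tables with Spark, then drilling into field-level differences. A sketch, assuming the new table reuses the partition layout of the existing raw table (note the new table will also contain canary events, which would need to be excluded or accounted for):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compare row counts for one example hour; a real analysis would loop
# over hours and also diff field values. Partition columns follow the
# existing wmf_raw.webrequest layout (an assumption for the new table).
where = "webrequest_source='text' AND year=2024 AND month=6 AND day=1 AND hour=0"

old_count = spark.sql(
    f"SELECT COUNT(*) AS c FROM wmf_raw.webrequest WHERE {where}"
).first()['c']
new_count = spark.sql(
    f"SELECT COUNT(*) AS c FROM wmf_raw.webrequest_frontend WHERE {where}"
).first()['c']

print(f'old={old_count} new={new_count} diff={new_count - old_count}')
```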
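And a sketch of the two refine-side changes in PySpark. The 'topic' column and the canary convention (Event Platform canary events setting meta.domain to 'canary') are assumptions to verify; webrequest_source is derived from the composite topic name, mirroring the current directory-path approach:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_df = spark.table('wmf_raw.webrequest_frontend')

refined = (
    raw_df
    # Derive the Hive partition value from the composite topic name,
    # e.g. 'webrequest.frontend.text' -> 'text'. Assumes a 'topic'
    # column is available (or derive it from the input directory path).
    .withColumn(
        'webrequest_source',
        F.regexp_extract(F.col('topic'), r'^webrequest\.frontend\.(\w+)$', 1),
    )
    # Drop canary events; assumes the usual Event Platform convention of
    # meta.domain == 'canary'. A real job would also handle null domains.
    .where(F.col('meta.domain') != 'canary')
)
```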
Once these are done, we should be able to treat webrequest.frontend like any other event stream. The stream will be ingested into Hive as wmf_raw.webrequest_frontend and refined into the existing Hive wmf.webrequest table.