Relative Trending - Flink app for page_view
Open, Needs Triage, Public

Description

This task might be a parent task for many others related to the Flink application.

As defined in the Relative Trending design document, the first step of the process is to create a Flink application that converts the webrequest_frontend_text Kafka topic into the webrequest.page_view topic.

This conversion is already defined and done in batch, and this process should be as close as possible to it.
We don't want to repeat code, so, whenever possible, this application will use the same UDFs and libraries used in batch. Some investigation into how to share those functions may be required.

The input throughput is around 105k messages/second. Due to the high traffic and the initial requirements, the application will be stateless and won't implement any complex bot detection rules.

We also need to analyze the throughput of the destination topic to decide the number of partitions.

Task is done if:

  • A new stream for webrequest.page_view is defined
  • A new schema for webrequest.page_view is defined
  • The Kafka topic is using the right number of partitions
  • A Flink application is deployed in K8s, reading from webrequest_frontend_text and writing into webrequest.page_view

Tasks that are required but outside of the MVP:

  • Monitoring and alerting of the Flink application
  • Schema and application are productionized and running in v1

  • TBD: Should we convert webrequest_frontend_text into a regular stream with its own schema?

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Page_view transformations using refinery code | repos/data-engineering/mediawiki-event-enrichment!149 | javiermonton | feature/page-view-4-transformations | feature/page-view-3-docker-image
page_view: Docker build & publish | repos/data-engineering/mediawiki-event-enrichment!148 | javiermonton | feature/page-view-3-docker-image | feature/page-view-2
webrequest -> Pageview simple transformation | repos/data-engineering/mediawiki-event-enrichment!147 | javiermonton | feature/page-view-2 | feature/page-view-1
Basic empty Flink application for PageView | repos/data-engineering/mediawiki-event-enrichment!146 | javiermonton | feature/page-view-1 | main
[WIP] Java base project for Flink 2 | repos/data-engineering/mediawiki-event-enrichment!105 | javiermonton | feature/java-base-2 | main

Event Timeline

Change #1294318 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1294318

Change #1294318 merged by jenkins-bot:

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1294318

Change #1294947 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: webrequest_page_view

https://gerrit.wikimedia.org/r/1294947

Change #1294947 merged by jenkins-bot:

[operations/deployment-charts@master] stream: webrequest_page_view

https://gerrit.wikimedia.org/r/1294947

Change #1295036 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1295036

Change #1295036 merged by jenkins-bot:

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1295036

Change #1296628 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1296628

Change #1296628 merged by jenkins-bot:

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1296628

Change #1297078 had a related patch set uploaded (by JavierMonton; author: JavierMonton):

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1297078

Change #1297078 merged by jenkins-bot:

[operations/deployment-charts@master] stream: webrequest-page-view

https://gerrit.wikimedia.org/r/1297078

A test with real data (Kafka Jumbo), with 10 replicas, is working without any issues with the current throughput:

image.png (2,100×745 px, 59 KB)

Matching other analyses we've done in batch, around 8% of the webrequest requests are considered a "page_view".
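
As a rough check on what the destination topic will carry, the 8% ratio can be combined with the input rate of about 105k messages/second from the task description. The sketch below only does that arithmetic. The per-partition target rate is a hypothetical placeholder, not a measured value, and should be replaced by whatever the Kafka analysis produces.

```python
import math

INPUT_RATE = 105_000     # messages/second, from the task description
PAGE_VIEW_RATIO = 0.08   # ~8% of webrequest messages, from the test above


def output_rate(input_rate: float, ratio: float) -> float:
    """Expected messages/second written to the destination topic."""
    return input_rate * ratio


def partitions_needed(rate: float, per_partition_target: float) -> int:
    """Smallest partition count that keeps each partition at or below the target rate."""
    if per_partition_target <= 0:
        raise ValueError("per_partition_target must be positive")
    return max(1, math.ceil(rate / per_partition_target))


if __name__ == "__main__":
    rate = output_rate(INPUT_RATE, PAGE_VIEW_RATIO)
    print(f"expected page_view rate: {rate:.0f} msg/s")
    # ASSUMPTION: 1,000 msg/s per partition is a made-up target for illustration only.
    print("partitions:", partitions_needed(rate, per_partition_target=1_000))
```

The partition count also depends on message size, consumer parallelism, and retention, none of which are modelled here.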

Starting from "earliest", similar to a situation where we'd need to backfill data, with 10 Task Managers and 2 slots, it processes around 400k messages per second.

image.png (2,290×900 px, 91 KB)

Using 20 Task Managers:

image.png (2,309×786 px, 88 KB)

Each Task Manager (K8s pod) is using 2 CPUs and 3GB Memory.
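
To put those numbers together, here is a small sketch of the total resources per deployment size and of how long a backfill might take to catch up. It assumes that input keeps arriving at about 105k messages/second during the backfill and that the 400k messages/second rate seen with 10 Task Managers holds for the whole run. The 24-hour backlog is an arbitrary example, not a figure from the tests.

```python
CPUS_PER_TASK_MANAGER = 2        # from the pod sizing above
MEMORY_GB_PER_TASK_MANAGER = 3   # from the pod sizing above


def total_resources(task_managers: int) -> tuple[int, int]:
    """Return (total CPUs, total memory in GB) for a number of Task Manager pods."""
    return (
        task_managers * CPUS_PER_TASK_MANAGER,
        task_managers * MEMORY_GB_PER_TASK_MANAGER,
    )


def catch_up_hours(backlog_hours: float, input_rate: float, processing_rate: float) -> float:
    """Hours needed to drain a backlog while new messages keep arriving.

    The backlog shrinks at (processing_rate - input_rate) messages/second.
    """
    if processing_rate <= input_rate:
        raise ValueError("processing rate must exceed input rate to catch up")
    return backlog_hours * input_rate / (processing_rate - input_rate)


if __name__ == "__main__":
    for pods in (10, 20):
        cpus, memory_gb = total_resources(pods)
        print(f"{pods} Task Managers: {cpus} CPUs, {memory_gb} GB")
    # ASSUMPTION: a 24-hour backlog, chosen only as an example.
    hours = catch_up_hours(24, input_rate=105_000, processing_rate=400_000)
    print(f"catch-up time for a 24 h backlog: {hours:.1f} h")
```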

None of these tests include any HTTP calls to retrieve wiki_id or similar data, which we may need to implement.