Create data pipeline for the hashing algorithm
Closed, ResolvedPublic

Description

Currently, the hashing algorithm runs on static sample data downloaded from the webrequest table. We would like to

  1. Turn it into a Spark job that can run over any given time interval.
  2. Measure run time as a function of the number of records.

Later, the resulting hashes will be loaded into an OpenSearch instance for search and exploration purposes.

Sample code: https://gitlab.wikimedia.org/xiaoxiao/web_scraping/-/tree/experimental?ref_type=heads
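As a rough illustration of the per-record hashing step the Spark job would apply, here is a minimal stdlib sketch. The field names (`uri_path`, `uri_query`) and function names are assumptions for this example, not taken from the linked repo; in the actual job the partition function would be handed to something like `rdd.mapPartitions` (or wrapped in a UDF) over the webrequest rows for the chosen interval.

```python
import hashlib

def hash_record(uri_path: str, uri_query: str) -> str:
    """Stable hash for one webrequest row; the two fields are illustrative."""
    payload = f"{uri_path}?{uri_query}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def hash_partition(rows):
    """Hash an iterable of (uri_path, uri_query) pairs.

    In Spark this generator would run once per partition, so each
    executor hashes its own slice of the selected time interval.
    """
    for uri_path, uri_query in rows:
        yield hash_record(uri_path, uri_query)
```

Timing this function over samples of increasing size would give the run-time-per-record numbers the task asks for.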

Details

Due Date
Apr 11 2025, 4:00 AM

Event Timeline

XiaoXiao-WMF set Due Date to Feb 28 2025, 5:00 AM.
XiaoXiao-WMF added a subscriber: MunizaA.
XiaoXiao-WMF changed the task status from Open to In Progress.Feb 19 2025, 3:57 PM
XiaoXiao-WMF reassigned this task from XiaoXiao-WMF to fkaelin.
XiaoXiao-WMF triaged this task as High priority.

Updates

  • added transformations to research-datasets for processing webrequests at scale to
    • detect patterns in URI path & query parameters based on an entropy heuristic
    • generate a "semantic uri" based on the detected patterns
  • notebooks to
    • experiment with configuration params and explore entropy distributions
    • aggregate metrics
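The entropy heuristic above can be sketched as follows: compute the character-level Shannon entropy of a path or query segment, and flag high-entropy segments (hex hashes, session IDs, and the like) as pattern candidates. This is a hedged stand-in, not the research-datasets implementation, and the threshold value is an assumption.

```python
import math
from collections import Counter

def shannon_entropy(segment: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty or uniform string."""
    if not segment:
        return 0.0
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in Counter(segment).values())

def is_high_entropy(segment: str, threshold: float = 3.5) -> bool:
    """Heuristic: id-like tokens tend to have much higher character
    entropy than natural-language path words. The threshold here is
    illustrative; the real value comes from the explored distributions."""
    return shannon_entropy(segment) >= threshold
```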

Updates

  • Validated the datasets and aligned with the approach used in the web_scraping repo. Created separate thresholds for the query and path components when generating the semantic uri.
  • Added basic unit tests for entropy calculation and thresholds. Opened a MR for research-datasets.
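To make the separate path/query thresholds concrete, here is a hedged sketch of how a semantic uri might be produced: high-entropy path segments and query values are replaced with a placeholder so structurally identical URIs collapse to one key. The threshold values and the `<VAR>` placeholder are assumptions for illustration only.

```python
import math
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def _entropy(s: str) -> float:
    """Character-level Shannon entropy in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def semantic_uri(uri: str,
                 path_threshold: float = 3.0,
                 query_threshold: float = 2.5) -> str:
    """Collapse id-like parts of a URI into a placeholder.

    Separate thresholds for path segments and query values, as in the
    update above; the numeric values here are illustrative.
    """
    parts = urlsplit(uri)
    path = "/".join(
        "<VAR>" if seg and _entropy(seg) >= path_threshold else seg
        for seg in parts.path.split("/")
    )
    query = "&".join(
        f"{k}=<VAR>" if _entropy(v) >= query_threshold else f"{k}={v}"
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
    )
    return f"{path}?{query}" if query else path
```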

Updates

  • Datasets created for one week of webrequest data with various cost aggregations, stored at /user/fab/traffic_patterns/time_series/

Moving from the quarterly lane to in-progress as I'm closing the quarterly lane. Please set/update the deadline for the task.

fkaelin changed Due Date from Feb 28 2025, 5:00 AM to Apr 11 2025, 4:00 AM.Apr 7 2025, 4:29 PM

This work was merged with MR