Deduplicate page data that got logged multiple times during one run
Closed, DuplicatePublic

Description

Context

We already know of some situations where the logged page data can contain duplicates. These duplicates can come either from different revisions of the same page or from repeated runs on the same source (e.g. when a job needs to be restarted).

Goal

To get accurate data, we want to deduplicate these events based on:

  • wiki
  • pageId
  • runId (TBD)

Of the duplicate rows, we always want to keep the latest one, ordered by revision ID.

Implementation

It's best to implement this as the first step in the aggregation flow, since the scraper cannot guarantee any ordering of the logged rows.
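As a minimal sketch of the deduplication logic described above (the field names `wiki`, `pageId`, `runId`, and `revisionId` are assumptions for illustration, not the confirmed schema of the logged events):

```python
def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep only the latest row (highest revisionId) per (wiki, pageId, runId).

    Assumes nothing about the input order, matching the note that the
    scraper cannot guarantee any ordering.
    """
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = (row["wiki"], row["pageId"], row["runId"])
        current = latest.get(key)
        if current is None or row["revisionId"] > current["revisionId"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical logged events: the first two are duplicates of the same
# page within one run, differing only in revision.
events = [
    {"wiki": "dewiki", "pageId": 1, "runId": "r1", "revisionId": 10},
    {"wiki": "dewiki", "pageId": 1, "runId": "r1", "revisionId": 12},
    {"wiki": "enwiki", "pageId": 1, "runId": "r1", "revisionId": 5},
]

result = deduplicate(events)
# Two rows remain; for dewiki page 1 only revision 12 is kept.
```

In a SQL-based aggregation flow, the same idea is typically expressed with a window function such as `ROW_NUMBER() OVER (PARTITION BY wiki, pageId, runId ORDER BY revisionId DESC)`, filtering to the first row per partition.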

Event Timeline

WMDE-Fisch renamed this task from Filter duplicate log entries from scraper runs to Deduplicat pages that got scraped twice during one run.Jan 8 2026, 8:43 AM
WMDE-Fisch renamed this task from Deduplicat pages that got scraped twice during one run to Deduplicate pages that got scraped twice during one run.
WMDE-Fisch renamed this task from Deduplicate pages that got scraped twice during one run to Deduplicate page data that got logged multiple during one run.Jan 9 2026, 7:42 AM
WMDE-Fisch claimed this task.
WMDE-Fisch updated the task description. (Show Details)
WMDE-Fisch renamed this task from Deduplicate page data that got logged multiple during one run to Deduplicate page data that got logged multiple times during one run.Jan 9 2026, 8:02 AM