Context
We already know of situations where the logged page data can contain duplicates. These duplicates can come either from different revisions of the same page or from repeated runs on the same source (e.g. when a job needs to be restarted).
Goal
To get accurate data, we want to deduplicate these events based on the following keys:
- wiki
- pageId
- runId (TBD)
Of these duplicates, we always want to keep the latest row, ordered by revision ID.
Implementation
It's best to implement this as the first step in the aggregation flow, since the scraper cannot guarantee any ordering of the rows it emits.
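As a minimal sketch, assuming the aggregation flow runs on PySpark and that the columns are named wiki, page_id, run_id, and revision_id (all names and the input path below are illustrative), the deduplication could rank rows within each key group and keep only the newest revision:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-page-events").getOrCreate()

# Hypothetical input location for the logged page data.
events = spark.read.parquet("/path/to/logged_page_data")

# Rank rows within each (wiki, page_id, run_id) group, newest revision first.
# run_id is still TBD as part of the key; drop it here if it is not adopted.
dedup_window = Window.partitionBy("wiki", "page_id", "run_id").orderBy(
    F.col("revision_id").desc()
)

# Keep only the top-ranked row per group, then discard the helper column.
deduplicated = (
    events.withColumn("row_rank", F.row_number().over(dedup_window))
    .filter(F.col("row_rank") == 1)
    .drop("row_rank")
)
```

Using row_number rather than rank guarantees exactly one surviving row per key even if two rows happen to share the same revision ID.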