Data quality: HTML dumps contain unexplainably outdated revisions of some pages
Open, Needs Triage, Public

Description

It's easiest to demonstrate this bug with a concrete example:

tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O \
    | jq -r 'select(.name == "Jibril") | .name,.identifier,.version.identifier'

Jibril
2399174
213695747

This revision was made in 2021: https://de.wikipedia.org/w/index.php?oldid=213695747

But the page was later edited in 2023 to become a redirect (see https://de.wikipedia.org/w/index.php?title=Jibril&action=history), so why was such an old version included in the dump? In fact, on 2024-02-01 the page would no longer have been included in the GetAllPages listing, because that listing uses apfilterredir=nonredirects. So how is it present in the dump at all?
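
To check other dump entries the same way, something like the following could compare a revision ID from the dump with the live wiki. This is only a rough sketch: the function names info_url and is_stale are made up for illustration, and it assumes that revision IDs on a single wiki only ever increase, so a lower ID means an older revision. The only parts taken from the MediaWiki Action API are action=query, prop=info and the lastrevid field in its response.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any existing tool.

# Print the Action API URL that reports the current state of one page.
# $1 = wiki host (e.g. de.wikipedia.org)
# $2 = page title, already URL-safe (no spaces or special characters)
info_url() {
    printf 'https://%s/w/api.php?action=query&prop=info&format=json&formatversion=2&titles=%s\n' "$1" "$2"
}

# Succeed (exit status 0) when the dumped revision is older than the live one.
# Assumes revision IDs increase over time on a given wiki.
# $1 = .version.identifier from the dump
# $2 = lastrevid reported by the API
is_stale() {
    [ "$1" -lt "$2" ]
}

# Possible usage (needs network access, curl and jq; not run here):
#   curl -s "$(info_url de.wikipedia.org Jibril)" | jq '.query.pages[0]'
#   is_stale 213695747 "$lastrevid" && echo "dump revision is outdated"
```

The network call is left in a comment on purpose, so the helpers can be sourced and reused in a loop over many dump entries without hitting the API by accident.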