
Request: changelog for Enterprise API HTML dumps
Open, Needs Triage, Public

Description

As a consumer of the enterprise_html dumps, I came across a large and surprising drop in page counts between the 2023-06-01 and 2023-07-20 dewiki-NS0 dumps. The most obvious difference is that the 2023-06-01 dump included a total of 2,988,697 pages, while the 2023-07-20 dump included only 2,137,879 pages. For comparison, w:de:Special:Statistics reports 2,840,668 content pages, and mw.config.get('wgContentNamespaces') confirms that only NS0 is counted as content, so this number should match the dump size. Debugging my own code revealed that my 2023-06-01 page count was likely inflated by 142,425 duplicate rows. Removing them brings that dump to about 2.85 million pages, close to the Special:Statistics figure, but it does not come close to explaining the drop: the 2023-07-20 dump is still roughly 700,000 pages short.
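To make the comparison above easier to check, here is a minimal sketch of the arithmetic, together with a helper for counting duplicate rows. The figures are the ones quoted in this description. The helper `count_rows` is hypothetical, and it assumes the dump has already been unpacked into newline-delimited JSON and that each row carries an `identifier` field holding the page ID; adjust the key if your rows differ.

```python
import json
from typing import Iterable, Tuple


def count_rows(ndjson_lines: Iterable[str], key: str = "identifier") -> Tuple[int, int]:
    """Return (total_rows, unique_pages) for an iterable of NDJSON lines.

    Assumption: every row is a JSON object with a page ID under `key`.
    """
    total = 0
    seen = set()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        seen.add(json.loads(line)[key])
    return total, len(seen)


# Figures quoted in the task description.
JUNE_ROWS = 2_988_697            # 2023-06-01 dump, as counted
JUNE_DUPLICATES = 142_425        # likely duplicate rows in that count
JULY_ROWS = 2_137_879            # 2023-07-20 dump, as counted
STATS_CONTENT_PAGES = 2_840_668  # w:de:Special:Statistics

june_unique = JUNE_ROWS - JUNE_DUPLICATES          # 2,846,272
june_vs_stats = june_unique - STATS_CONTENT_PAGES  # 5,604 above Special:Statistics
july_vs_stats = STATS_CONTENT_PAGES - JULY_ROWS    # 702,789 below Special:Statistics
july_vs_june = june_unique - JULY_ROWS             # 708,393 fewer than deduplicated June

if __name__ == "__main__":
    print(f"June, deduplicated:       {june_unique:,}")
    print(f"June vs. statistics:      +{june_vs_stats:,}")
    print(f"July vs. statistics:      -{july_vs_stats:,}")
    print(f"July vs. June (deduped):  -{july_vs_june:,}")
```

The 2023-07-20 count has not been checked for duplicates here, so the true shortfall could be larger than these numbers suggest.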

I would like to read a comprehensive changelog of anything that may have affected these dumps, so that I can be confident about the data quality. It should be discoverable from the dump indexes, such as https://dumps.wikimedia.org/other/enterprise_html/.