T313955#8115296 is the long-winded version of this bug report, but basically, this is what happens:
- We sqoop the revision and archive tables from mediawiki
- For revision records, we use page_is_redirect
- For archive records (i.e. revisions on deleted pages), we set page_is_redirect to NULL
- We then process these records together and build wmf.mediawiki_history
- When we then create mediawiki_history_reduced, we try to filter out redirect pages, but we do this using NOT page_is_redirect
That last point is where the issue is. Because page_is_redirect is NULL for archive records, NOT page_is_redirect evaluates to NULL rather than true for them, so the filter accidentally excludes all the deleted pages as if they were redirects. The fix is to improve this line so it includes the deleted pages, add deleted to the other_tags of these records, and test the rest of the pipelines and queries that look at the data. An even better fix might be to measure deleted pages as a separate metric. The ultimate goal should be as mentioned here and in the other task: we should be able to see the number of articles over time, and sometimes deletions, and exactly when they happened on the timeline, are very important.
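The NULL behaviour described above can be reproduced in a few lines. This is only a minimal sketch: it uses an in-memory SQLite database as a stand-in for the real pipeline, the table name `history`, the `source` column, and the three rows are made up for illustration, and the NULL-safe predicate is one possible rewrite, not necessarily the one that should be deployed.

```python
import sqlite3

# Hypothetical miniature of the combined revision + archive data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (page_title TEXT, source TEXT, page_is_redirect INTEGER)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("Live article", "revision", 0),
        ("Live redirect", "revision", 1),
        ("Deleted article", "archive", None),  # archive rows carry NULL
    ],
)


def titles(where_clause):
    """Return the page titles that survive the given WHERE clause."""
    rows = conn.execute(
        f"SELECT page_title FROM history WHERE {where_clause} ORDER BY page_title"
    )
    return [title for (title,) in rows]


# Current filter: NOT NULL is NULL, and WHERE drops rows that are not true.
current_filter = titles("NOT page_is_redirect")

# One possible NULL-safe variant (an assumption, not the deployed fix).
null_safe_filter = titles("page_is_redirect IS NULL OR NOT page_is_redirect")

print(current_filter)    # ['Live article']
print(null_safe_filter)  # ['Deleted article', 'Live article']
```

In this sketch the deleted page disappears under the current filter and is kept by the NULL-safe one, while the redirect is excluded in both cases.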
Original Report about numbers on svwiki
Wikistats showing incorrect data for Swedish Wikipedia
Swedish Wikipedia has seen a very large number of article creations and deletions. A while back it passed 3 million articles on the way down, and is now at 2.5 million articles. This is not reflected in Wikistats:
https://stats.wikimedia.org/#/sv.wikipedia.org/content/pages-to-date/normal%7Cline%7Call%7C~total%7Cmonthly
Looking at the statistics here, June 2022 seems correct for the number of pages but the numbers should be far higher in the previous years (when the wiki had more than 3 million articles in the main namespace), not slightly lower than today.
The documentation says we "measure this by adding up the creations and restores of old pages, and subtracting the page deletions for each given month", but the page deletions don't seem to be subtracted.
Querying the source dataset indicates only a few hundred deletions every year. This is fewer than we would see even if we hadn't had a large project deleting hundreds of thousands of articles every year over the last few years. While I only realized this now, the data seems to have been incorrect for as far back as I can see.
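To illustrate why missing deletions produce the curve described in this report, the calculation quoted from the documentation can be sketched as a running total. This is a rough sketch, not the Wikistats implementation: the function name `pages_to_date` and the monthly figures are invented for the example.

```python
from itertools import accumulate


def pages_to_date(monthly_events):
    """Running total of creations + restores - deletions, month by month."""
    net_changes = [
        month["creations"] + month["restores"] - month["deletions"]
        for month in monthly_events
    ]
    return list(accumulate(net_changes))


# Made-up monthly counts, purely illustrative.
months = [
    {"creations": 100, "restores": 0, "deletions": 10},
    {"creations": 50, "restores": 5, "deletions": 80},
    {"creations": 20, "restores": 0, "deletions": 60},
]

# What the metric should show when deletions are subtracted.
with_deletions = pages_to_date(months)

# What it shows if deletions never reach the metric.
without_deletions = pages_to_date([{**month, "deletions": 0} for month in months])

print(with_deletions)     # [90, 65, 25]
print(without_deletions)  # [100, 155, 175]
```

With deletions included the total can fall over time; with deletions dropped it can only grow, which matches a chart where earlier years never exceed the present.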