T313955#8115296 is the long-winded version of this bug report, but in short, this is what happens:
* We sqoop the `revision` and `archive` tables from MediaWiki
* For `revision` records, we [[ https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/sql/PageViewRegistrar.scala#59 | use ]] `page_is_redirect`
* For `archive` records (aka revisions on deleted pages) we [[ https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/sql/DeletedPageViewRegistrar.scala#55 | use ]] `NULL as page_is_redirect`.
* We then process these records together and build `wmf.mediawiki_history`
* When we then create `mediawiki_history_reduced`, we try to filter out redirect pages, but we do this [[ https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/oozie/mediawiki/history/reduced/generate_mediawiki_history_reduced.hql#123 | using ]] `NOT page_is_redirect`
That last point is where the issue is. We accidentally exclude all the deleted pages: their `page_is_redirect` is `NULL`, so `NOT page_is_redirect` evaluates to `NULL` rather than true, and the filter drops them as if they were redirects. The fix is to improve this line so it includes the deleted pages, add `deleted` to the `other_tags` of these records, and test the rest of the pipelines and queries that look at the data. An even better fix might be to measure deleted pages as a separate metric. The ultimate goal, as mentioned here and in the other task, is to be able to see the number of articles over time; deletions, and exactly when they happened on the timeline, are sometimes very important.
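The `NULL` behaviour behind this filter can be reproduced in a minimal sketch. It uses an in-memory SQLite database as a stand-in for Hive, and the `history` table, its rows, and the `COALESCE` rewrite are illustrative assumptions, not the actual HQL or the agreed fix:

```python
import sqlite3

# In-memory SQLite as a stand-in for Hive; table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (page_title TEXT, page_is_redirect BOOLEAN, source TEXT)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("Live_article", 0, "revision"),
        ("Live_redirect", 1, "revision"),
        # Archive records carry NULL, mirroring `NULL as page_is_redirect`.
        ("Deleted_article", None, "archive"),
    ],
)


def titles(where_clause):
    """Return the page titles that survive the given WHERE clause."""
    rows = conn.execute(
        f"SELECT page_title FROM history WHERE {where_clause} ORDER BY page_title"
    )
    return [row[0] for row in rows]


# NOT NULL evaluates to NULL, so the archive row is silently dropped.
current_filter = titles("NOT page_is_redirect")

# Treating NULL as "not a redirect" keeps the archive row.
null_safe_filter = titles("NOT COALESCE(page_is_redirect, 0)")

print(current_filter)
print(null_safe_filter)
```

With the first filter only `Live_article` survives; with the null-safe variant `Deleted_article` is kept as well. Whether a `COALESCE` or an explicit `IS NULL` check is the better fit for the real query would need to be decided against the actual HQL.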
====Original Report about numbers on svwiki====
**Wikistats showing incorrect data for Swedish Wikipedia**
Swedish Wikipedia has seen a very large number of article creations and deletions. A while back it passed 3 million articles on the way down, and is now at 2.5 million articles. This is not reflected in Wikistats:
https://stats.wikimedia.org/#/sv.wikipedia.org/content/pages-to-date/normal%7Cline%7Call%7C~total%7Cmonthly
Looking at the statistics here, June 2022 seems correct for the number of pages, but the numbers should be far higher in the previous years (when the wiki had more than 3 million articles in the main namespace), not slightly lower than today.
The [[ https://meta.wikimedia.org/wiki/Research:Wikistats_metrics/Pages_to_date | documentation ]] says we "measure this by adding up the creations and restores of old pages, and subtracting the page deletions for each given month", but the page deletions don't seem to be subtracted.
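As a rough illustration of the documented definition, the sketch below computes a running total of creations plus restores minus deletions per month. The monthly counts and the `pages_to_date` helper are invented for illustration and are not taken from the dataset or from the Wikistats code:

```python
from itertools import accumulate

# Hypothetical monthly event counts, invented for illustration only.
monthly_events = [
    {"month": "2022-01", "creations": 1000, "restores": 10, "deletions": 50},
    {"month": "2022-02", "creations": 100, "restores": 0, "deletions": 600},
    {"month": "2022-03", "creations": 200, "restores": 20, "deletions": 20},
]


def pages_to_date(events, count_deletions=True):
    """Cumulative page count: creations + restores - deletions per month."""
    net_changes = [
        e["creations"]
        + e["restores"]
        - (e["deletions"] if count_deletions else 0)
        for e in events
    ]
    return list(accumulate(net_changes))


with_deletions = pages_to_date(monthly_events)
without_deletions = pages_to_date(monthly_events, count_deletions=False)

print(with_deletions)
print(without_deletions)
```

When deletions are left out, the series can only grow, which resembles the symptom described in this report: a month with mass deletions shows no drop.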
Querying the source dataset indicated a few hundred deletions every year. This is fewer than we would see even if we hadn't had a large project deleting hundreds of thousands of articles every year over the last few years. While I only noticed this now, the data seems to have been incorrect for as far back as I can see.