
Bug: Deleted pages are accidentally excluded from mediawiki_history_reduced
Open, Needs Triage, Public, 3 Estimated Story Points

Description

T313955#8115296 is the long-winded version of this bug report, but basically, this is what happens:

  • We sqoop the revision and archive tables from mediawiki
  • For revision records, we use page_is_redirect
  • For archive records (that is, revisions on deleted pages), we use NULL as page_is_redirect
  • We then process these records together and build wmf.mediawiki_history
  • When we then create mediawiki_history_reduced, we try to filter out redirect pages, but we do this using NOT page_is_redirect

That last point is where the issue is. We accidentally exclude all the deleted pages, treating them as if they were redirects. The fix is to change this filter so it includes the deleted pages, add deleted to the other_tags of these records, and test the rest of the pipelines and queries that look at the data. An even better fix might be to measure deleted pages as a separate metric. The ultimate goal, as mentioned here and in the other task, is to be able to see the number of articles over time; deletions, and exactly when they happened on the timeline, are sometimes very important.
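A minimal sketch of the first two parts of that fix, under loud assumptions: SQLite stands in for the production pipeline, the table `history` and the column `from_archive` are hypothetical, and `other_tags` is simplified to a plain string. It only illustrates a NULL-safe redirect filter that keeps archive rows and tags them as deleted; it is not the actual job code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the history table.
conn.execute(
    "CREATE TABLE history (page_title TEXT, from_archive INTEGER, page_is_redirect INTEGER)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("Live article", 0, 0),        # from revision, not a redirect
        ("Live redirect", 0, 1),       # from revision, a redirect
        ("Deleted article", 1, None),  # from archive, redirect status unknown
    ],
)

# NULL-safe filter: treat an unknown redirect status as "not a redirect",
# and tag rows that came from the archive table as deleted.
fixed_rows = conn.execute(
    """
    SELECT page_title,
           CASE WHEN from_archive = 1 THEN 'deleted' ELSE '' END AS other_tags
    FROM history
    WHERE COALESCE(page_is_redirect, 0) = 0
    ORDER BY page_title
    """
).fetchall()

print(fixed_rows)
```

With this filter the deleted article survives and carries the deleted tag, while the known redirect is still dropped. Deleted redirects would also survive, because their redirect status is unknown.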

Original Report about numbers on svwiki

Wikistats showing incorrect data for Swedish Wikipedia

Swedish Wikipedia has seen a very large number of article creations and deletions. A while back it passed 3 million articles on the way down, and it is now at 2.5 million articles. This is not reflected in Wikistats:
https://stats.wikimedia.org/#/sv.wikipedia.org/content/pages-to-date/normal%7Cline%7Call%7C~total%7Cmonthly

Looking at the statistics here, June 2022 seems correct for the number of pages but the numbers should be far higher in the previous years (when the wiki had more than 3 million articles in the main namespace), not slightly lower than today.

The documentation says we "measure this by adding up the creations and restores of old pages, and subtracting the page deletions for each given month", but the page deletions don't seem to be subtracted.
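To make the quoted definition concrete, here is a small sketch of the arithmetic, assuming the metric is a running total of monthly creations plus restores minus deletions. The monthly figures are invented for illustration and are not svwiki data.

```python
from itertools import accumulate

# Hypothetical monthly counts: (month, creations, restores, deletions).
monthly = [
    ("2022-01", 1000, 0, 100),
    ("2022-02", 50, 10, 400),
    ("2022-03", 80, 0, 20),
]

# Documented definition: creations + restores - deletions, accumulated over time.
net_change = [created + restored - deleted for _, created, restored, deleted in monthly]
pages_to_date = list(accumulate(net_change))

# What the curve looks like if deletions are never subtracted.
without_deletions = list(accumulate(created + restored for _, created, restored, _ in monthly))

print(pages_to_date)
print(without_deletions)
```

When deletions are subtracted, the total can fall from one month to the next. When they are missing, the curve can only grow, which is the shape reported for svwiki.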

Querying the source dataset indicated a few hundred deletions every year. This is fewer than we would see even if we hadn't had a large project deleting hundreds of thousands of articles every year over the last few years. While I only realized this now, the data seems to have been incorrect for as far back as I can see.

Event Timeline

EChetty set the point value for this task to 3.
EChetty moved this task from Ready to In progress on the Data-Engineering-Planning (Sprint 02) board.
Milimetric added a subscriber: Milimetric.

We took a look at this. It seems that the confusion here is that this metric excludes redirects. So, we ran a query and found that if we include redirects, for all time, on sv.wikipedia, we have:

8469790 creations, 2531709 deletions

And if we exclude redirects, we have:

4175681 creations, 2197 deletions

So most of the deletions you're talking about seem to be deletions of redirect pages. And the last set of numbers seems to match closely what's available on Wikistats.

There is the issue of filtering out deletions. The description of the metric we have on Wikistats says we are NOT filtering out deletions, but the description on Wikitech says we ARE filtering them out. We'll have to create a separate issue to look at that, but in this case the number of deletions of non-redirect pages is so small as to not matter.

@Milimetric I wonder if there's a mistake somewhere in the data around what is a redirect. Even if we just look at the last seven years, that would average less than one deletion of a non-redirect per day. This couldn't possibly be true even ignoring the clean-up project; we delete far more than this.

I've also looked at some samples from the clean-up project and none of the deleted pages I looked at were redirects.

@Johan thank you, indeed, I believe you just found a bug.

Ok, so when we load deleted pages, we query the archive table. This doesn't have redirect information, so we set page_is_redirect to NULL. And later, when we generate the dataset that Wikistats queries, we check for NOT page_is_redirect. But NOT NULL evaluates to NULL, which the filter treats as not true. That means we exclude all of the deleted pages, accidentally treating them as redirects!
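The NULL behaviour can be reproduced in a few lines. This is only a sketch: SQLite stands in for the production engine, and the table `pages` is hypothetical. In standard SQL three-valued logic, NOT NULL evaluates to NULL, and a WHERE clause keeps only rows whose predicate is true, so rows with an unknown redirect status are dropped along with the real redirects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NOT applied to NULL yields NULL (None in Python), not TRUE.
not_null = conn.execute("SELECT NOT NULL").fetchone()[0]

# Hypothetical miniature of the combined revision + archive records.
conn.execute("CREATE TABLE pages (title TEXT, page_is_redirect INTEGER)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("Live article", 0),
        ("Live redirect", 1),
        ("Deleted article", None),  # archive rows carry no redirect flag
    ],
)

# The filter from the pipeline: only rows where the predicate is TRUE survive.
kept = [
    title
    for (title,) in conn.execute(
        "SELECT title FROM pages WHERE NOT page_is_redirect ORDER BY title"
    )
]

print(not_null)
print(kept)
```

Only the live article survives the filter; the deleted article disappears even though nothing ever marked it as a redirect.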

This is an interesting bug, because we want to exclude deleted pages from our metrics, and so this kind of does that. The number on Wikistats ends up still being correct, as the bug does what the metric wants, just via a mistake.

I'll talk to the team and see what they want to do. But what does this mean for you? Does the data make sense now or is there something else that feels wrong nested inside here?

Answering my question a bit, the problem is that, in labeling these as redirects, we erase them from history completely. So we don't show the historical page count properly. This is tricky because the redirect information is really only stored in the wikitext itself, or in the page_is_redirect field on the Page table. The latter is lost when the page is deleted. So to better rebuild the history of the metric we'd have to look at the content or get more information in the delete event in the logging table. This will take some doing.
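As a rough illustration of what "looking at the content" could involve, the sketch below checks whether a revision's wikitext starts with the redirect keyword. It rests on several assumptions: the function name `redirect_target` is hypothetical, the pattern is a simplification of MediaWiki's own parsing, and only the English `#REDIRECT` keyword is handled, although wikis also accept localized aliases.

```python
import re

# Simplified pattern: optional leading whitespace, the English keyword
# (case-insensitive), an optional colon, then the link target up to "]" or "|".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target if the text looks like a redirect, else None."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


print(redirect_target("#REDIRECT [[Stockholm]]"))
print(redirect_target("Stockholm is the capital of Sweden."))
```

Applied to the last revision text of each archived page, a check like this could fill in the redirect status that is lost when the page row is deleted. It would depend on having that revision content available to the pipeline.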

@Milimetric Yeah, my problem here is that I don't get the development of Swedish Wikipedia. We're missing a significant portion of the now deleted articles in the Wikistats graph, where the number of articles should have peaked years ago at a far higher count than today.

Makes perfect sense. I'm going to wrestle this task and the related T313985 to reflect the bug.

Milimetric renamed this task from Wikistats showing incorrect data for Swedish Wikipedia to Bug: Deleted pages are accidentally excluded from mediawiki_history_reduced. Aug 2 2022, 8:32 PM
Milimetric removed Milimetric as the assignee of this task.
Milimetric updated the task description.