
Wikistats - incorrect number of content articles for Latvian Wikipedia
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  1. Open Wikistats -> "Pages to date" for Latvian Wikipedia -> Look at the number of content pages: https://stats.wikimedia.org/#/lv.wikipedia.org/content/pages-to-date/normal|line|2-year|page_type~content|monthly
  2. Open Latvian Wikipedia statistics page -> Look at the number of current Content pages (Satura lapas): https://lv.wikipedia.org/wiki/Special:Statistics

What happens?:
Wikistats displays an incorrect number of content/non-content pages in stats such as "Pages to date" and "New pages".
E.g. the "Pages to date" graph shows that in November the Latvian wiki had 276 thousand content pages.

What should have happened instead?:
The Latvian wiki currently has just under 125 thousand content pages

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Wikistats shows the correct number of "Total" pages, so to me it looks like it's currently including some non-content pages as content pages.
This issue appeared in October (2023)

Event Timeline

Restricted Application added a subscriber: Aklapper.

TL;DR: the data pipeline up to AQS seems fine. My guess is that we're not filtering properly to exclude redirects in AQS 2; the timeline corresponds with the reported problem. Sorry for the inconvenience, working on a fix.

Indeed, the numbers are as @Spnq reports here:

mysql:research@dbstore1007.eqiad.wmnet [lvwiki]> select page_namespace, page_is_redirect, count(*) count from page group by page_namespace, page_is_redirect order by count desc;

page_namespace  page_is_redirect  count
0               1                 153480
0               0                 124798
3               0                 110777
14              0                 37720
6               0                 26255
1               0                 14904
10              0                 10841
2               0                 5002
10              1                 4164
1               1                 3704
3               1                 3054
14              1                 1299
4               0                 1169
2600            0                 1101
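
To make the comparison with the report explicit, here is a minimal Python sketch that tallies the replica output above for the main namespace. The rows are as I read them from the query result; treating namespace 0 as the only content namespace is an assumption made for illustration (a wiki can configure additional content namespaces), and `content_counts` is a hypothetical helper, not part of any Wikimedia tooling.

```python
# Rows as read from the replica query output:
# (page_namespace, page_is_redirect, count).
rows = [
    (0, 1, 153480), (0, 0, 124798), (3, 0, 110777), (14, 0, 37720),
    (6, 0, 26255), (1, 0, 14904), (10, 0, 10841), (2, 0, 5002),
    (10, 1, 4164), (1, 1, 3704), (3, 1, 3054), (14, 1, 1299),
    (4, 0, 1169), (2600, 0, 1101),
]

# Assumption: namespace 0 stands in for "content" pages.
CONTENT_NAMESPACES = {0}


def content_counts(rows, content_namespaces=CONTENT_NAMESPACES):
    """Return (non_redirect, redirect) page counts for the content namespaces."""
    non_redirect = sum(
        count for ns, is_redirect, count in rows
        if ns in content_namespaces and is_redirect == 0
    )
    redirect = sum(
        count for ns, is_redirect, count in rows
        if ns in content_namespaces and is_redirect == 1
    )
    return non_redirect, redirect


non_redirect, redirect = content_counts(rows)
print(non_redirect)             # 124798, close to the "just under 125 thousand" reported
print(non_redirect + redirect)  # 278278, the order of magnitude Wikistats was showing
```

Under that assumption, the inflated Wikistats figure is roughly what you get when redirects in the main namespace are counted as content pages.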

So the problem must be coming from further down the pipeline. Checking mediawiki_history_reduced, the source dataset:

Unfiltered, mediawiki_history_reduced seems to agree with the inflated Wikistats figures:

select snapshot,
       page_type,
       count(*) as count
  from mediawiki_history_reduced
 where snapshot in ('2023-06', '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12')
   and project = 'lv.wikipedia'
   and event_entity = 'page'
   and event_type = 'create'
 group by snapshot, page_type
 order by snapshot, page_type;

snapshot  page_type    count
2023-06   content      271074
2023-06   non_content  219379
2023-07   content      272376
2023-07   non_content  220538
2023-08   content      273473
2023-08   non_content  221576
2023-09   content      274388
2023-09   non_content  222475
2023-10   content      275571
2023-10   non_content  223373
2023-11   content      276782
2023-11   non_content  224616
2023-12   content      278110
2023-12   non_content  225639

But the other_tags field is supposed to help with this by letting us filter out redirects. I remember we ran into this during testing, but I thought we had fixed it. Indeed, splitting by is_redirect brings the count of content pages much closer to what we would expect (121052 for the 2023-06 snapshot, 124714 for 2023-12):

select snapshot,
       contains(other_tags, 'redirect') as is_redirect,
       count(*) as count
  from mediawiki_history_reduced
 where snapshot in ('2023-06', '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12')
   and project = 'lv.wikipedia'
   and event_entity = 'page'
   and event_type = 'create'
   and page_type = 'content'
 group by snapshot, contains(other_tags, 'redirect')
 order by snapshot, is_redirect;

snapshot  is_redirect  count
2023-06   false        121052
2023-06   true         150022
2023-07   false        121723
2023-07   true         150653
2023-08   false        122268
2023-08   true         151205
2023-09   false        122757
2023-09   true         151631
2023-10   false        123415
2023-10   true         152156
2023-11   false        124053
2023-11   true         152729
2023-12   false        124714
2023-12   true         153396

So the problem then is how we're querying Druid. I'll look in AQS 2.0 code next.
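
As a quick consistency check on the two result sets above, the sketch below verifies that, for every snapshot, the redirect and non-redirect content counts add up to the unfiltered content count. The numbers are copied from the query results in this thread; `mismatched_snapshots` is a hypothetical helper name used only for this illustration.

```python
# Unfiltered content-page creates per snapshot (first query).
unfiltered_content = {
    "2023-06": 271074, "2023-07": 272376, "2023-08": 273473,
    "2023-09": 274388, "2023-10": 275571, "2023-11": 276782,
    "2023-12": 278110,
}

# Content-page creates split as (non_redirect, redirect) per snapshot (second query).
split_by_redirect = {
    "2023-06": (121052, 150022), "2023-07": (121723, 150653),
    "2023-08": (122268, 151205), "2023-09": (122757, 151631),
    "2023-10": (123415, 152156), "2023-11": (124053, 152729),
    "2023-12": (124714, 153396),
}


def mismatched_snapshots(unfiltered, split):
    """Return the snapshots where the split counts do not sum to the unfiltered count."""
    return [
        snapshot for snapshot, total in unfiltered.items()
        if sum(split[snapshot]) != total
    ]


print(mismatched_snapshots(unfiltered_content, split_by_redirect))  # []
```

Every snapshot matches exactly, which supports the reading that the source dataset is fine and the extra content pages are all redirects.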

Looking at the code, this issue seems related to something similar we found in editor-analytics some months ago (https://gerrit.wikimedia.org/r/c/generated-data-platform/aqs/editor-analytics/+/973139). We had to change how we handled other_tags in order to filter properly, and we didn't make the same change in edit-analytics at the time.
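
The actual change lives in the Gerrit patches linked in this thread and is not reproduced here. Purely to illustrate the intended semantics, here is a minimal Python sketch of counting page-creation events while excluding those tagged as redirects. The field names (event_entity, event_type, page_type, other_tags) come from the queries above; the `PageEvent` class, the `count_new_pages` function, and the sample events are hypothetical, and the real service queries Druid rather than filtering in memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageEvent:
    """Hypothetical in-memory stand-in for a mediawiki_history_reduced row."""
    event_entity: str
    event_type: str
    page_type: str
    other_tags: List[str] = field(default_factory=list)


def count_new_pages(events, page_type="content", include_redirects=False):
    """Count page-creation events of one page type, skipping redirects by default."""
    total = 0
    for event in events:
        if event.event_entity != "page" or event.event_type != "create":
            continue
        if event.page_type != page_type:
            continue
        # A redirect is identified by the 'redirect' value in other_tags.
        if not include_redirects and "redirect" in event.other_tags:
            continue
        total += 1
    return total


# Made-up sample events, for illustration only.
sample = [
    PageEvent("page", "create", "content"),
    PageEvent("page", "create", "content", ["redirect"]),
    PageEvent("page", "create", "non_content"),
    PageEvent("revision", "create", "content"),
]

print(count_new_pages(sample))                          # 1
print(count_new_pages(sample, include_redirects=True))  # 2
```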

We have already made this change locally in edit-analytics, and it seems to work fine in our local test environment. We're going to push a patch to the repo so that we can deploy to the staging environment and test it against production data. If it works there, we can be confident about deploying the change to production.

Change 987965 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/edit-analytics@main] Incorrect values when requesting edited-pages/new endpoint

https://gerrit.wikimedia.org/r/987965

Change 987965 merged by jenkins-bot:

[generated-data-platform/aqs/edit-analytics@main] Incorrect values when requesting edited-pages/new endpoint

https://gerrit.wikimedia.org/r/987965

Change 987967 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] deploying a new edit-analytics version to staging environment

https://gerrit.wikimedia.org/r/987967

Change 987967 merged by jenkins-bot:

[operations/deployment-charts@master] deploying a new edit-analytics version to staging environment

https://gerrit.wikimedia.org/r/987967

After merging the patch and deploying the service to the staging environment (where we can make requests against production data), we were able to confirm that the fix works. We got the same values Dan got by querying the source dataset directly.

Next Monday we will deploy the service to production.

Change 988486 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] Deploying edit-analytics to production

https://gerrit.wikimedia.org/r/988486

Change 988486 merged by jenkins-bot:

[operations/deployment-charts@master] Deploying edit-analytics to production

https://gerrit.wikimedia.org/r/988486

The fix is now deployed to production and it seems to work fine.

The "Content page" count now looks to be correct 👍

But now there's another issue with the page count - stats shows that the total number of pages to date is 337k (https://stats.wikimedia.org/#/lv.wikipedia.org/content/pages-to-date/normal|line|2-year|~total|monthly), but in actuality it's a little over 500k -> the issue appears to be with the "Non-content" page count.

Maybe the same pages that previously were incorrectly counted as "Content" haven't been added to the "Non-content" page count?

P.S. Should I report this as a new bug or will this be resolved within this ticket?

Let us take a look. I think you're right: we are now always discarding "redirect" pages from the results. I think that's why some pages are missing from the results when requesting 'non-content' or 'all-page-types'.

I think we can keep using this ticket to work on the issue, since it's directly related to the previous one. We should have caught this earlier. We'll keep you posted here.

Change 989094 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/edit-analytics@main] Incorrect number of content pages on stats.wikimedia.org

https://gerrit.wikimedia.org/r/989094

Change 989094 merged by jenkins-bot:

[generated-data-platform/aqs/edit-analytics@main] Incorrect number of content pages on stats.wikimedia.org

https://gerrit.wikimedia.org/r/989094

Change 989098 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] Deploying to staging to test the fix with production data

https://gerrit.wikimedia.org/r/989098

Change 989098 merged by jenkins-bot:

[operations/deployment-charts@master] Deploying to staging to test the fix with production data

https://gerrit.wikimedia.org/r/989098

After pushing a new fix and testing it in the staging environment (where we have access to the production data), we are getting the following results:

  • 124714 for 'content' pages, not counting redirects (the Statistics page shows 124 918 for 'Satura lapas'). This is what we fixed with the previous patch, and the value is still correct.
  • 503747 for 'all-page-types', which I think is what you mean by just 'pages' (the Statistics page shows 504 410 for 'Lapas'). With the latest patch, redirects are now included in this count. This is what we are trying to fix right now, and it seems to work fine.
  • 225637 for 'non-content' pages (which matches the results that Dan got from Druid directly). Redirects are also included here. This number is still correct.

So the results seem correct at this point. Any concerns about the current results? Just in case we are missing something.
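
One way to cross-check the three staging figures against each other is sketched below in Python. It assumes that the 153396 redirect count for content pages from the 2023-12 snapshot query earlier in this thread is the right number to add back; `reconstruct_total` is a hypothetical helper used only for this check.

```python
# Figures reported from the staging environment.
content_without_redirects = 124714
non_content_with_redirects = 225637
all_page_types_with_redirects = 503747

# Assumption: content-page redirects for the 2023-12 snapshot,
# taken from the is_redirect split shown earlier in this thread.
content_redirects = 153396


def reconstruct_total(content, content_redirects, non_content):
    """Rebuild the all-page-types total from its three parts."""
    return content + content_redirects + non_content


reconstructed = reconstruct_total(
    content_without_redirects, content_redirects, non_content_with_redirects
)
print(reconstructed)                                   # 503747
print(reconstructed == all_page_types_with_redirects)  # True
```

Under that assumption the three numbers are consistent with each other: the 'all-page-types' total is the non-redirect content pages, plus the content redirects, plus the non-content pages.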

I reviewed the data available to me and realised that I had previously misread some results. The data on Wiki Stats is indeed correct. Thank you, Sfaci!

Do you mean we don't need the last patch?

What should the right value for the total number of pages be? I'm not sure I understand. You said that Wikistats was wrong because it shows 337k and it should show around 500k (the same value that lv.wikipedia shows as "Lapas"). Could you confirm that?

Yeah, I thought that wikistats was wrong, because it said that there were around 300k pages instead of around 500k, but then I noticed that the description of that statistic states: "The running count of all pages created, excluding pages being redirects". So I assume that that's the reason for the difference.

So I, as a user, don't see the need for any additional fixes.

You are right! The Statistics page on every wiki includes redirects in its "Lapas" figure, but Wikistats shouldn't include them in any case. OK! Since the second patch was not deployed to production, everything is already done here and we can consider the initial bug solved.

Thank you very much!

After discarding the last issue, I have redeployed the previous version to the staging environment. The original bug is solved and the service is working fine in both environments (staging and production).

Change 990595 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[generated-data-platform/aqs/edit-analytics@main] Changing an integration test case according to the last issue we fixed

https://gerrit.wikimedia.org/r/990595

Change 990595 merged by jenkins-bot:

[generated-data-platform/aqs/edit-analytics@main] Changing an integration test case according to the last issue we fixed

https://gerrit.wikimedia.org/r/990595