
napwikisource reports more content pages than all pages in its main namespace
Closed, Invalid (Public)

Description

Our newest content wiki, the Neapolitan Wikisource, just (presumably) had its stats recounted on September 1st, for the first time since it was created and its content was imported. The "content pages" count dropped from 10,167 the previous day to 413. The reason I am opening this task is not the drop itself (which is large but not unprecedented for new wikis) but the count that it dropped to.

413 content pages is 230 pages higher than the total number of non-redirects in the main namespace (183).

(The main namespace is the wiki's only content namespace and the wiki uses the default 'link' counting method.)

How is this possible? How could it be fixed? Am I missing something here that makes this actually not an error?

Event Timeline

Dcljr created this task. Sep 2 2019, 5:54 AM
Restricted Application added a subscriber: Aklapper. Sep 2 2019, 5:54 AM
Dcljr updated the task description. Sep 2 2019, 5:55 AM
Dcljr updated the task description.

Not sure which exact project tags to set here, but for some general information:

  • updateArticleCount.php runs on the 1st and 15th of each month. Hence the numbers were updated.
  • updateArticleCount.php might have bugs, see for example T212706.
  • Note that the default for wgArticleCountMethod is set to link in InitialiseSettings.php.txt, so a page will only be counted if it contains a wiki link.

Urbanecm claimed this task. Sep 3 2019, 4:40 AM
Urbanecm added a subscriber: Urbanecm.

I'd treat the script tag as a de facto subproject of the site-requests one. Either is fine IMO; people willing to look into this should watch both.

Claiming, will have a look soon.

Restricted Application added a project: User-Urbanecm. Sep 3 2019, 4:40 AM
  • Note that the default for wgArticleCountMethod is set to link in InitialiseSettings.php.txt, so a page will only be counted if it contains a wiki link.

We had this problem on srwikisource, and it was resolved by doing this; @Aklapper is right. But @Urbanecm claimed this task, so I want him to say what is better.

Dcljr added a comment. Sep 6 2019, 1:08 AM

We had this problem on srwikisource, and it was resolved by doing this; @Aklapper is right.

@Zoranzoki21 : Umm… doing what? Aklapper's comment didn't contain any suggestion of something to "do".

Dcljr added a comment. Sep 6 2019, 1:17 AM

BTW, I already acknowledged the 'link' counting method was being used in the task description. My point about "non-redirects in the main namespace" is that the count given by 'link' should not be higher than the highest possible count 'any' could give, which in this case is (or was) 183.

(Hey, that would be an interesting test: change to 'any', recount, note the number, change back to 'link', recount again. But no one's actually going to do that… [grin])
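The bound described above can be illustrated with a toy model. This is a minimal sketch, not MediaWiki's implementation: the `Page` class, the `count_content_pages` helper, and the sample pages are all hypothetical, and the two methods are simplified to "non-redirect page in a content namespace" ('any') and "the same, but also containing at least one wikilink" ('link'). Under those assumptions, 'link' can never exceed 'any' as long as both counts range over the same set of content namespaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical, simplified stand-in for a wiki page."""
    title: str
    namespace: int
    is_redirect: bool
    has_wikilink: bool


def count_content_pages(pages, content_namespaces, method):
    """Count pages under a simplified 'any' or 'link' rule (assumed semantics)."""
    if method not in ("any", "link"):
        raise ValueError(f"unsupported method: {method!r}")
    total = 0
    for page in pages:
        # Only non-redirects in a content namespace are candidates.
        if page.namespace not in content_namespaces or page.is_redirect:
            continue
        # 'link' additionally requires at least one wikilink on the page.
        if method == "link" and not page.has_wikilink:
            continue
        total += 1
    return total


# Made-up sample data, purely for illustration.
pages = [
    Page("A", 0, is_redirect=False, has_wikilink=True),
    Page("B", 0, is_redirect=False, has_wikilink=False),
    Page("C", 0, is_redirect=True, has_wikilink=True),
]

any_count = count_content_pages(pages, {0}, "any")
link_count = count_content_pages(pages, {0}, "link")
print(any_count, link_count)  # 2 1
```

Because 'link' only adds a filter on top of 'any', the comparison is meaningful only when both counts use the same namespace set.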

Dcljr added a comment. Sep 6 2019, 1:30 AM

OBTW (again), I guess I should have also pointed out that the total number of pages in the main namespace (including redirects and the Main Page) is now 206 (would have been 205 on Sep 2).

Not that this number would be given by any counting method (it wouldn't), but just for the sake of completeness…

Urbanecm removed Urbanecm as the assignee of this task. Sep 7 2019, 4:56 PM
  • Note that the default for wgArticleCountMethod is set to link in InitialiseSettings.php.txt, so a page will only be counted if it contains a wiki link.

We had this problem on srwikisource, and it was resolved by doing this; @Aklapper is right. But @Urbanecm claimed this task, so I want him to say what is better.

i just looked if script is a solution, and forgot to update

Ok, so we should set wgArticleCountMethod to any and I will claim this task.

Restricted Application added a project: User-Zoranzoki21. Sep 7 2019, 7:25 PM
Dcljr added a comment. Sep 8 2019, 12:20 AM

Ok, so we should set wgArticleCountMethod to any and I will claim this task.

Whoa, hang on… is this what the napwikisource community wants?

Urbanecm changed the task status from Open to Stalled. Sep 8 2019, 1:16 AM
Urbanecm moved this task from Backlog to Config - to process on the Wikimedia-Site-requests board.

Good question. Zoranzoki, can you ask at their village pump to confirm this request?

Good question. Zoranzoki, can you ask at their village pump to confirm this request?

Zoran or Kizule, whichever you prefer to call me... :)

I asked at their village pump; the question is here.

Dcljr added a comment. Sep 9 2019, 3:20 AM

OK, it's fine to seek consensus about this change, but why is this the proposed solution to the reported problem? This is not trying to fix the underlying issue (whatever it may be), just trying to avoid it — and doing so in a way that presumably will not be acceptable to at least some wikis that may encounter the same problem in the future (so this cannot be used as a general workaround).

BTW, even if this does gain consensus at napwikisource, I suggest not acting on it until the wiki is recounted again. @Urbanecm, did you actually do this? Your comment above isn't completely clear about this. (Would be nice to see the results, too.)

Is there any other maintenance script that might affect article counting that could (also) be tried? Like rebuilding/repopulating/whatever the pagelinks database for the wiki? (I don't know…)

Dcljr added a comment. Sep 9 2019, 11:34 AM

…Or am I just misunderstanding and this (@Zoranzoki21's suggestion) is not being proposed as a "permanent fix" to the problem?

…Or am I just misunderstanding and this (@Zoranzoki21's suggestion) is not being proposed as a "permanent fix" to the problem?

@Dcljr My suggestion is meant as a permanent fix for the problem, if possible.

Dcljr added a comment. Sep 9 2019, 11:54 AM

@Zoranzoki21 OK, then I have to ask (again): Why?

This seems to be trying to fix a completely different problem: namely, an otherwise correct content-page count that "seems too low" to the wiki community. That is not the issue we have here. (Granted, it may become that now that the community is discussing the proposed change… [grin])

I would like to see some attempts to actually diagnose and fix the problem (of having an impossibly high reported content-page count) before we just "paper over" the problem (by changing counting methods) and ignore it.

For example, does anyone recall (or can anyone find) a similar report for another Wikimedia wiki (not "too low" of a count, which is reported "all the time", but one that's "impossibly high")? I can't recall ever seeing such a report that wasn't based on (1) a misunderstanding of how MediaWiki works or how the wiki in question was configured, or (2) a temporary condition for a new wiki that was fixed the next time it was recounted.

The number should be updated twice a month. If the number doesn't meet community expectations, we should ask what is needed, so we can change the config to do what is expected.

Dcljr added a comment. Sep 10 2019, 1:00 AM

Are people actually reading this thread, or are they just skimming and trying to get the gist of what is being said?

This is not a "what configuration do you want" issue; this is a counting "bug". (Maybe it should be marked as such?)

The count is not simply "not meeting community expectations" (to paraphrase Urbanecm). This is an error in the content-page count: the current count cannot possibly be correct, regardless of what the community thinks or wants. (Thus, IMO, the community should not have been consulted until sysadmins/developers tried to figure out what was actually causing the problem. As I suggested above, they might request a change that would "hide" the problem, rendering it moot, but they cannot fix the problem with a configuration change request.)

If "we don't care" what is causing the problem, well, I guess I have nothing more to say about this. But I kinda figure it would be nice to know why this has happened, so it could be avoided or fixed legitimately in the future, if it happens again on a new wiki (or any wiki).

Finally, again I ask @Urbanecm: when you say "i just looked if script is a solution, and forgot to update", does this mean you actually ran a maintenance script? If so, which one (updateArticleCount, initSiteStats.php, or something else) and what was the result?

Dcljr added a comment. Sep 16 2019, 1:53 AM

FYI, the wiki was just recounted again (on September 15th), and it again gave an impossible number. So, the situation has not improved. In fact, the count seems to be diverging even farther away from the correct count.

The current content-page count is 465 (using the 'link' counting method), which is 279 higher than the total number of non-redirects in the main namespace, 186 (the number 'any' would give, which is an upper bound on what 'link' could give). For reference, it's also 258 higher than the total number of pages (including redirects) in the main namespace, 207. No counting method would give that number, but it is, again, an upper bound (though not a least upper bound) on the number any counting method could possibly give, since the main namespace is the wiki's only content namespace.

Dcljr added a comment. Sep 16 2019, 3:47 AM

OBTW, FWIW, on Sep 9 (I think it was), I went through and null-edited every page in napwikisource's main namespace. It didn't seem to have any immediate effect on the content-page count, but I mention it here in case its effect was only seen when the wiki was recounted just now. (I don't know how likely that is, but it doesn't look to me like the increase in count from Sep 1 to today can be explained solely based on on-wiki editing activity. I could be wrong, though.)

Dcljr added a comment. Sep 16 2019, 4:12 AM

Another FWIW: I checked the last wiki to be created before this one (Western Armenian Wikipedia, which also had problems during its creation—see, e.g., T212597#5065429), and the content-page count over there (7,478) is entirely reasonable given the count of non-redirects in the main namespace (7,626) and the apparent level of linking seen in its articles (30 out of 30 randomly chosen main-namespace pages contained wikilinks).

So, whatever is causing this problem on napwikisource doesn't seem to be affecting hywwiki. FYI.

Dcljr updated the task description. Sep 16 2019, 4:14 AM
Reedy added a subscriber: Reedy. Edited Sep 16 2019, 12:38 PM

413 content pages is 230 pages higher than the total number of non-redirects in the main namespace (183).

So there's a bit of a misunderstanding here. There isn't only NS 0 considered a "content" namespace on this wiki, 250 and 252 are too

reedy@deploy1001:~$ mwscript eval.php napwikisource
> $services = MediaWiki\MediaWikiServices::getInstance();

> var_dump( $services->getNamespaceInfo()->getContentNamespaces() );
array(3) {
  [0]=>
  int(0)
  [1]=>
  int(250)
  [2]=>
  int(252)
}

So while you can quote figures from the main namespace, if you ignore the two other namespaces (which MW considers content namespaces, and which apparently have a *lot* of pages), you are of course going to get a figure that doesn't make sense in your maths.

For reference, the first three queries show the results that would be given if wgArticleCountMethod wasn't link; the last three show the results if it is link. Each set also breaks out the extra namespaces to show the difference they make.

MariaDB [napwikisource]> select count(distinct page_id) from page where page_namespace IN ( 0, 250, 252 );
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                   10000 |
+-------------------------+
1 row in set (0.01 sec)

MariaDB [napwikisource]> select count(distinct page_id) from page where page_namespace IN ( 0 );
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                     207 |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [napwikisource]> select count(distinct page_id) from page where page_namespace IN ( 250, 252 );
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                    9793 |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [napwikisource]> select count(distinct page_id) from page INNER JOIN pagelinks ON (pl_from=page_id) where page_namespace IN ( 0, 250, 252 );
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                     488 |
+-------------------------+
1 row in set (0.07 sec)

MariaDB [napwikisource]> select count(distinct page_id) from page INNER JOIN pagelinks ON (pl_from=page_id) where page_namespace IN ( 0 );
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                     195 |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [napwikisource]> select count(distinct page_id) from page INNER JOIN pagelinks ON (pl_from=page_id) where page_namespace IN ( 250, 252 );
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                     293 |
+-------------------------+
1 row in set (0.02 sec)
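
The shape of the last three queries can be reproduced on a toy database. This is a rough sketch using Python's built-in sqlite3 rather than MariaDB; the two-table schema is a heavily simplified stand-in for MediaWiki's `page` and `pagelinks` tables, the rows are invented, and `count_linked_pages` is a hypothetical helper. It only shows why the count for namespaces 0, 250 and 252 together is the sum of the per-namespace counts, as in the figures above (195 + 293 = 488).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns the queries above rely on.
conn.executescript(
    """
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER NOT NULL);
    CREATE TABLE pagelinks (pl_from INTEGER NOT NULL, pl_title TEXT NOT NULL);
    """
)
# Invented rows: two main-namespace pages, three in 250, one in 252.
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(1, 0), (2, 0), (3, 250), (4, 250), (5, 252), (6, 250)],
)
# Pages 1, 3 and 5 contain links; page 1 contains two.
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, ?)",
    [(1, "A"), (1, "B"), (3, "A"), (5, "C")],
)


def count_linked_pages(connection, namespaces):
    """Count distinct pages in the given namespaces that have at least one link."""
    placeholders = ", ".join("?" for _ in namespaces)
    sql = (
        "SELECT COUNT(DISTINCT page_id) FROM page "
        "INNER JOIN pagelinks ON (pl_from = page_id) "
        f"WHERE page_namespace IN ({placeholders})"
    )
    return connection.execute(sql, list(namespaces)).fetchone()[0]


main_only = count_linked_pages(conn, [0])
proofread = count_linked_pages(conn, [250, 252])
combined = count_linked_pages(conn, [0, 250, 252])
print(main_only, proofread, combined)  # 1 2 3
```

Since each page belongs to exactly one namespace, the combined count always equals the sum of the disjoint per-namespace counts.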

Dcljr added a comment. Sep 17 2019, 3:47 AM

There isn't only NS 0 considered a "content" namespace on this wiki, 250 and 252 are too

@Reedy Where is this defined? It wasn't in InitialiseSettings.php — which, frustratingly, just got "emptied out" (T233069). The settings are now found in VariantSettings.php, and that file still shows no setting for napwikisource under 'wgContentNamespaces' (so it "should" be using the default, NS_MAIN).

Is this "misunderstanding" because settings are in the process of being migrated to other locations? (Fun.)

Dcljr added a comment. Sep 17 2019, 3:56 AM

Let me ask this explicitly: How do I find the settings for a specific wiki as a "normal user" (not a sysadmin), now that InitialiseSettings.php is devoid of per-wiki settings and VariantSettings.php apparently is not to be trusted?

Let me ask this explicitly: How do I find the settings for a specific wiki as a "normal user" (not a sysadmin), now that InitialiseSettings.php is devoid of per-wiki settings and VariantSettings.php apparently is not to be trusted?

Well, InitialiseSettings.php wouldn't help you anyway; see its old version. All configuration, regardless of where it is loaded from, can be modified by extensions. In this very case, the ProofreadPage extension adds its namespaces (IDs 250 and 252) to ContentNamespaces. As such, the main namespace (ID 0), page (250) and index (252) are all considered content, and because of that, the number of content pages is higher than the number of pages in NS_MAIN.

Finally, again I ask @Urbanecm: when you say "i just looked if script is a solution, and forgot to update", does this mean you actually ran a maintenance script? If so, which one (updateArticleCount, initSiteStats.php, or something else) and what was the result?

Sorry for not replying. No, I didn't run a script; I just looked at whether it would change anything, and since it's already run by cron, no, it wouldn't fix this issue :).

Urbanecm closed this task as Invalid. Sep 17 2019, 6:18 AM
Urbanecm removed Zoranzoki21 as the assignee of this task.

Closing as invalid per Reedy, the numbers are correct, so there's no bug.

Dcljr added a comment. Sep 18 2019, 2:15 AM

Let me ask this explicitly: How do I find the settings for a specific wiki as a "normal user" (not a sysadmin), now that InitialiseSettings.php is devoid of per-wiki settings and VariantSettings.php apparently is not to be trusted?

For the benefit of future readers, to answer my own question, an API query such as the following would list the namespaces on napwikisource and indicate which ones count as content:

https://nap.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&formatversion=2

As for other information, one must wade through (/search) the API help (in this case, mw:API:Siteinfo) to find which API parameters return info from which configuration variables (sometimes these are mentioned in the API documentation, sometimes not).
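
As a rough sketch of how the reply to such a query might be processed, the Python below pulls the content-namespace IDs out of a siteinfo response. Several things here are assumptions rather than anything stated in the thread: the helper names are made up, `format=json` is added to the query so the reply is machine-readable, the User-Agent string is a placeholder, and `SAMPLE_REPLY` is a hand-written, trimmed stand-in for a real response (the namespace names in it are illustrative).

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://nap.wikisource.org/w/api.php"  # endpoint quoted in the thread


def content_namespace_ids(siteinfo):
    """Return the sorted IDs of namespaces flagged as content in a siteinfo reply."""
    namespaces = siteinfo["query"]["namespaces"]
    # Accept either an object keyed by namespace ID or a plain list of entries.
    entries = namespaces.values() if isinstance(namespaces, dict) else namespaces
    # formatversion=2 uses real booleans; the older format marks a content
    # namespace by the mere presence of a "content" key (with an empty string).
    return sorted(
        ns["id"] for ns in entries
        if "content" in ns and ns["content"] is not False
    )


def fetch_siteinfo(api_url=API_URL):
    """Fetch namespace info as JSON (performs a network request)."""
    query = urllib.parse.urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "formatversion": "2",
        "format": "json",  # assumption: added so the reply is plain JSON
    })
    request = urllib.request.Request(
        f"{api_url}?{query}",
        headers={"User-Agent": "content-namespace-check/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Hand-written sample shaped like a trimmed reply; not real API output.
SAMPLE_REPLY = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "name": "", "content": True},
            "1": {"id": 1, "name": "Talk", "content": False},
            "250": {"id": 250, "name": "Page", "content": True},
            "252": {"id": 252, "name": "Index", "content": True},
        }
    }
}

print(content_namespace_ids(SAMPLE_REPLY))  # [0, 250, 252]
```

Against the live wiki, `content_namespace_ids(fetch_siteinfo())` would be the equivalent call; the parsing is kept separate from the fetch so it can be checked offline.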