
Give up on 'comma' article-count method
Closed, Resolved (Public)

Description

As far as I can tell (not being a MediaWiki coder), the 'comma' method of article counting is a complete fiction, and has been so for many years now. I believe it may be time to give up on this feature entirely, and remove the option from the software (or, alternatively, implement the feature in a way that actually does what is advertised).

It appears that on wikis that use the 'comma' method (among Wikimedia wikis, currently only enwikibooks and ptwikibooks: T29256), instead of a non-redirect in a content-namespace being counted as an article if it contains a comma, such pages are counted as articles if they are merely nonempty. (If I am wrong about this, someone please say so!)
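To make the distinction concrete, here is a minimal Python sketch of the counting criteria as described in this task. It is an illustration only, not MediaWiki's actual code: the function name `is_countable`, its parameters, and the plain substring checks (including treating `[[` as a stand-in for the 'link' criterion) are all assumptions.

```python
def is_countable(text: str, is_redirect: bool, in_content_namespace: bool,
                 method: str) -> bool:
    """Decide whether a page counts as an article (hypothetical helper)."""
    # Redirects and pages outside content namespaces never count.
    if is_redirect or not in_content_namespace:
        return False
    if method == "any":
        # Every non-redirect content page counts.
        return True
    if method == "comma":
        # The advertised behaviour: the page text must contain a comma.
        return "," in text
    if method == "link":
        # Assumption: an internal link is approximated by the "[[" marker.
        return "[[" in text
    raise ValueError(f"unknown article count method: {method!r}")


# A page without a comma is an article under 'any' but not under 'comma'.
page = "A short stub without punctuation"
print(is_countable(page, False, True, "any"))    # True
print(is_countable(page, False, True, "comma"))  # False
```

A recount that applies the 'any' rule to a 'comma' wiki would count the stub above, which is the discrepancy suspected here.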

Presumably because some Wikibooks were counted differently, they were not included in the monthly article recounting that has been happening since 2015 (T68867). Now those wikis have been recounted, along with all other Wikimedia wikis (T187845), and that "complete" recounting is about to become a regular occurrence (T59788).

For this reason, I suggest that the enwikibooks and ptwikibooks communities be notified that 'comma' based counting is not actually working the way they think, ask them to choose 'link' or 'any' as a replacement method, remove mentions of the 'comma' method as a viable option in our documentation (e.g., the mw: page linked to in the first sentence above), and remove all traces of it from the MediaWiki code.

Event Timeline

OK, looks like I was a bit pessimistic in my assessment of 'comma' based counting. According to T59788#4008838, it is correctly implemented when a page is saved.

But given that all wikis have been recounted using a script that doesn't respect 'comma' based counting, the article counts on those wikis are effectively no longer based on the 'comma' criterion.

Yes, the counts right now will be wrong, slowly drift towards being more right as pages are edited, and then be reset to being wrong, every fortnight. Let's just kill the feature.

Change 415199 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/core@master] Drop 'comma' value for wgArticleCountMethod

https://gerrit.wikimedia.org/r/415199

Whoa, I didn't think anyone would act on this suggestion so quickly!

For further context, consider that, as reported at m:Wikimedia News#February 2018, when the two 'comma' based wikis were recounted by 'initSiteStats.php' on Feb 14/15th (T186947#3974067), the article count of enwikibooks changed from 57,843 to 79,075 (a 37% increase) and that of ptwikibooks changed from 8,742 to 11,493 (a 31% increase). (These counts are from c. 18 hours before and up to c. 6 hours after the script run.)
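The percentages quoted above can be checked with a few lines of Python. The counts are the ones given in this comment; the helper name `percent_increase` is made up for illustration.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Article counts before and after the Feb 14/15th recount, as quoted above.
counts = {
    "enwikibooks": (57843, 79075),
    "ptwikibooks": (8742, 11493),
}

for wiki, (before, after) in counts.items():
    change = percent_increase(before, after)
    print(f"{wiki}: {before} -> {after} ({change:.1f}% increase)")
```

Rounded to whole numbers, the results agree with the 37% and 31% figures.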

Unfortunately, there is no practical way to tell how much of the change in each case was due to the different counting method (since the script was counting non-empty non-redirect content pages, rather than comma-containing non-redirect content pages) and how much was due to the counts just plain being wrong because of previously existing counting bugs (since neither wiki, I think, was ever recounted after any major article-counting bugs were detected and fixed). The only way we could determine this for sure would be to recount each wiki using a script that properly implements the 'comma' method. (A statistical approach could be used to suggest how different the "true" counts could be under the two counting methods, but that would likely take more time than anyone is willing to spend.)

FWIW, when the wikis were recounted again on Feb 21st (T187845#3988080), enwikibooks changed from 79,145 to 79,156 articles (well within the range of normal wiki activity, although I did not check its RecentChanges at the time) and ptwikibooks didn't change at all (a constant 11,500 articles).

Ultimately, it may come down to whether these two wiki communities care enough about comma-based counting to make it worth trying to salvage. Obviously JDF doesn't think so! [grin]

So, I've decided to ask the two communities for their opinions. I'll post links to the discussions once I've created them.

Would it be possible to use elasticsearch/cirrussearch to make comma counting a reality?

As @Dcljr requested (T59788#4008979), here is a summary of what I wrote at T59788#4008838:

Running the recount script frequently on all wikis seems problematic given the wgArticleCountMethod feature in MediaWiki core. The recount scripts support the any and link methods of counting, but not the comma method.

The monthly recount script we currently run explicitly skips the two wikis using that feature (wmf-config/InitialiseSettings#wgArticleCountMethod=comma).

If we were to run recounts on these wikis, it would essentially mean that at any given point the main number on Special:Statistics is based on the any method of counting, not the comma method. The exception is the days in between recounts, when new pages will not increase the count if they don't have a comma. But at the end of each 2-week cycle, those pages will be included regardless.
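The cycle described above can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: a constant number of new pages per day, no edits or deletions, and a recount every 14 days. The function `simulate` and its parameters are hypothetical and do not correspond to any MediaWiki code.

```python
def simulate(total_pages: int, new_with_comma: int, new_without_comma: int,
             days: int, recount_every: int = 14) -> list:
    """Return the displayed article count at the end of each day."""
    # Assume the wiki has just been recounted with the 'any' rule.
    displayed = total_pages
    history = []
    for day in range(1, days + 1):
        total_pages += new_with_comma + new_without_comma
        # Save-time logic follows 'comma': only pages with a comma count.
        displayed += new_with_comma
        if day % recount_every == 0:
            # The recount script ignores 'comma' and counts every page.
            displayed = total_pages
        history.append(displayed)
    return history


history = simulate(total_pages=100, new_with_comma=2, new_without_comma=1, days=14)
print(history[12], history[13])  # 126 142
```

In this toy run the displayed count grows by 2 per day and then jumps by 16 on the recount day, when the 14 comma-less pages created during the cycle are folded in.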

I see two options here:

  • Drop the feature (logic in WikiPage/WikitextContent, gitiles).
  • Or: add support to the recount scripts for the comma method of counting. This would require programmatically going through all pages of the entire wiki, fetching their raw wikitext, and looking for a comma. While this is possible in theory, it would require two things:
    1. Dedicated time (and likely WMF funds) to evaluate how this could be implemented in a way that is not going to be very expensive or stressful on foundation infrastructure, and to evaluate whether it can actually complete a full recount within a reasonable timeframe.
    2. Dedicated time and resourcing to implement, review and deploy the feature.
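For a sense of what the second option involves, here is a minimal Python sketch of a comma-aware recount. It is not based on the real recount scripts: the `Page` type and the `recount_comma` function are hypothetical, and how the pages and their wikitext would actually be fetched (the expensive part) is left out entirely.

```python
from typing import Iterable, NamedTuple


class Page(NamedTuple):
    """Hypothetical record for one content-namespace page."""
    title: str
    text: str          # raw wikitext, which would have to be fetched per page
    is_redirect: bool


def recount_comma(pages: Iterable[Page]) -> int:
    """Count non-redirect pages whose wikitext contains a comma."""
    count = 0
    # Accepts any iterable, so pages could be streamed rather than held in memory.
    for page in pages:
        if not page.is_redirect and "," in page.text:
            count += 1
    return count


sample = [
    Page("Cookbook", "Recipes, tips, and more", False),
    Page("Stub", "No punctuation here", False),
    Page("Old title", "#REDIRECT [[Cookbook]], moved", True),
]
print(recount_comma(sample))  # 1
```

The check itself is trivial; the cost lies in reading the full text of every page, which is why the two points above are about infrastructure load and runtime rather than logic.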

See also: https://meta.wikimedia.org/wiki/Article_counts_revisited#What_is_to_be_done.3F, specifically how it mentions the two Wikibooks projects which still use the comma count method.

Change 415199 merged by jenkins-bot:
[mediawiki/core@master] Drop 'comma' value for wgArticleCountMethod

https://gerrit.wikimedia.org/r/415199

Change 416330 had a related patch set uploaded (by EddieGP; owner: EddieGP):
[operations/mediawiki-config@master] Article counts: Change 'comma' method to 'any'

https://gerrit.wikimedia.org/r/416330

Change 416330 merged by jenkins-bot:
[operations/mediawiki-config@master] Article counts: Change 'comma' method to 'any'

https://gerrit.wikimedia.org/r/416330

Mentioned in SAL (#wikimedia-operations) [2018-03-06T15:02:17Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Article counts: Change 'comma' method to 'any' - T188472 (duration: 01m 00s)