
Use transclusions to count articles as well
Closed, Resolved | Public

Description

Author: mcdevitd

Description:
Currently, the article count used to generate {{NUMBEROFARTICLES}} and Special:Statistics only counts a page as an article if it includes a [[wikilink]]. Instead, this should be expanded to include {{transclusions}} as well as wikilinks. The issue here is that non-Wikipedia projects, like Wiktionary, often have valid articles without wikilinks, because the wikilinks are contained in the templates that generate the article. As many as one fifth of valid Wiktionary articles may be inflection articles (plurals, verb forms, etc.) that consist mostly of a single template. We've had to put wikilinks in the template parameter, like {{plural of|[[word]]}}, but this is inefficient and also prevents us from passing that parameter to any other part of the template, such as a category.

There may be a downside, but I can't think of one, especially now that preventing page creations is done with cascade protection instead of protected templates.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=24754

Details

Reference
bz11868

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 9:59 PM)
bzimport set Reference to bz11868.
bzimport added a subscriber: Unknown Object (MLST).

What about pages that transclude {{stub}} then?

robchur wrote:

(In reply to comment #1)

What about pages that transclude {{stub}} then?

As far as we're concerned, those are still articles - such pages would usually contain at least one link anyway, which would put them on the counter.

This seems to have an efficiency barrier. We'd need another site stat to track this. That, or change the current one (with a retroactive batch query run too).

ayg wrote:

The request is to change the existing metric. Would be easy enough to do. The batch query would only have to examine the "bad" articles, too, which is probably a good deal fewer than all of them.

fearow00 wrote:

Patch to article.php to fix problem (using diff command)

I have updated the isCountable method of Article to take templates into account if $wgCountTemplateOnlyPages is set to true. As well as getting this committed, I would like that setting turned on for the English Wiktionary (and hopefully the rest of the Wiktionaries). Lastly, someone with shell access needs to run "maintenance/recount.sql" (or something like that) on the versions of Wiktionary it's enabled on.

Patch to DefaultSettings coming in a second.

Attached:

fearow00 wrote:

DefaultSettings.php patch, to coincide with Article.php patch above.

Attached:

ayg wrote:

  1. There's no reason to have a config option for this. If a page contains a template link, it should logically be counted, especially given that the template may itself include links. The behavior should be on automatically for all wikis.
  2. Please use the command "svn diff" to generate diff files. If you really don't want to check out the SVN repository, at *least* use diff -u, and concatenate the two diffs into one text file for easier reading (indicating within the single diff file which part of the diff corresponds to which file).
  3. I'm not willing to check this in without comment from Brion or Tim about how to proceed with recounting. Given (1) above, a recount needs to be done on update, but of course not every single time an update is done; once suffices. Maybe we should have a database flag for schema versions, as a general thing? This kind of issue has come up before, and we have no good answer for it.

If (3) is satisfied I'm willing to check in a one-line patch based on the given attachment. Have you tested it?

I think it should have the config option, to avoid an unnecessary step change in the non-Wiktionary counters. I've generally been against recounts on the large wikis, where years of counter drift have taken their toll, because of the psychological importance of the article count and its continuous growth.

The suggested patch is fine, except that it also needs a patch to maintenance/updateArticleCount.inc.php.

(In reply to comment #0)

Currently, the article count used to generate {{NUMBEROFARTICLES}} and
Special:Statistics only counts a page as an article if it includes a
[[wikilink]].

Not exactly. Per related bug 10834 comment #5 and live tests, the current good-article counter uses the following method:

  1. page is in ns 0
  2. page is not redirect
  3. page contains "[[" string

Step 3 means that pages with no wikilink but with an Image or Category inserted, or even just <!-- [[ -->, are counted.
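
For illustration, the three-step rule above can be sketched as follows. This is a minimal sketch in Python rather than MediaWiki's PHP, and `is_countable`, its parameters, and the sample texts are hypothetical names chosen for the example, not the actual Article.php code.

```python
def is_countable(namespace: int, is_redirect: bool, text: str) -> bool:
    """Hypothetical stand-in for the three-step check listed above."""
    in_main_namespace = namespace == 0   # step 1: page is in ns 0
    not_a_redirect = not is_redirect     # step 2: page is not a redirect
    has_link_marker = "[[" in text       # step 3: raw substring test, no parsing
    return in_main_namespace and not_a_redirect and has_link_marker


# Illustrative page texts (made up for this sketch).
samples = {
    "wikilink": "See [[word]].",
    "category only": "[[Category:English plurals]]",
    "commented-out marker": "<!-- [[ -->",
    "template only": "{{plural of|word}}",
}
results = {name: is_countable(0, False, text) for name, text in samples.items()}
for name, counted in results.items():
    print(f"{name}: {counted}")
```

Because step 3 is a plain substring test, the category-only page and the commented-out marker both count, while the template-only page does not.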

ayg wrote:

Tim points out that the recount script already counts template inclusions. It would probably make the most sense to make Article.php use the updateArticleCount method (parsing and checking the resulting links) rather than adding the extra check for '{{'.

fearow00 wrote:

In that case, why was one updated and not the other? I also believe there should be a config option, as sites like Wikipedia don't want to count pages that only include {{deletedpage}} or such.

Also, I don't have any command line svn utility, as I use a graphical system for SVN (hooked into right-click menu). So the diff command was my only choice. I'll use -u next time.

mcdevitd wrote:

Note that the {{deletedpage}} practice is obsolete ever since cascade protection, and, at least on enwp, has been completely converted to the new method (and {{deletedpage}} was itself deprecated following a deletion nomination).

(In reply to comment #11)

Also, I don't have any command line svn utility, as I use a graphical system
for SVN (hooked into right-click menu). So the diff command was my only choice.
I'll use -u next time.

TortoiseSVN I presume? Right-click on a file or folder, then click "Create patch".

fearow00 wrote:

No, some random program for Mac that I can't remember the name of.

ayg wrote:

It should still support patch creation somewhere; check the docs. If not, you can just use the command line for this one thing. But that's off-topic.

initStats sets the good article count for all pages in defined content namespaces which are not redirects and are greater than 0 bytes in length.

The length check is an approximation due to the difficulty of checking for text contents in a single query like this (text would have to be loaded, uncompressed, and encoding-converted individually for every revision checked).

Then there seems to be *yet another* script which got shoved in there somehow, updateArticleCount, which does the above checks, plus a join against the pagelinks table to list those which have outgoing wiki links.

So we currently have *three different methods* of counting, all different:

  1. on every page update: check for text containing '[['

This is the canonical version; updates to the count on edit assume that the existing count was based on this -- the total count is incremented or decremented based on changes in state of this check between previous and new versions.

  2. on bulk initStats.php: check for non-empty text

This will overcount, including pages which have text but no links.

  3. on bulk updateArticleCount.php: check for non-empty text and outgoing links

This will overcount but not as much, including pages which transclude templates which themselves have links as well as extensions which record links but don't contain '[[' in the actual text.
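
To make the disagreement concrete, here is a minimal sketch of the three checks side by side, in Python for illustration only. The `Page` class, the function names, and the `outgoing_links` field (standing in for rows in the pagelinks table) are assumptions made for this example, and namespace eligibility is collapsed into a single flag; none of this is the actual MediaWiki code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    text: str
    is_redirect: bool = False
    in_content_namespace: bool = True
    outgoing_links: int = 0  # assumed: number of pagelinks rows recorded for the page


def _eligible(page: Page) -> bool:
    # Shared precondition: content namespace and not a redirect.
    return page.in_content_namespace and not page.is_redirect


def counted_on_save(page: Page) -> bool:
    # Method 1: raw text contains '[['.
    return _eligible(page) and "[[" in page.text


def counted_by_init_stats(page: Page) -> bool:
    # Method 2: non-empty text (length check only).
    return _eligible(page) and len(page.text) > 0


def counted_by_update_article_count(page: Page) -> bool:
    # Method 3: non-empty text and at least one recorded outgoing link.
    return _eligible(page) and len(page.text) > 0 and page.outgoing_links > 0


def adjust_total(total: int, old: Optional[Page], new: Page) -> int:
    # On edit, the total moves only when the method 1 state changes.
    was_counted = old is not None and counted_on_save(old)
    return total + int(counted_on_save(new)) - int(was_counted)


# A template-only page whose template contributes one link after parsing.
template_only = Page("{{plural of|word}}", outgoing_links=1)
verdicts = (
    counted_on_save(template_only),
    counted_by_init_stats(template_only),
    counted_by_update_article_count(template_only),
)
print(verdicts)

# Adding a literal wikilink to the text flips the on-save state.
edited = Page("{{plural of|[[word]]}}", outgoing_links=1)
print(adjust_total(100, template_only, edited))
```

The same template-only page is rejected by the on-save check but accepted by both bulk scripts, so a bulk recount followed by incremental updates starts from a total that the on-save logic would never have produced.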

The sanest thing to do might actually be to add a page_is_counted field on the page table and update it at save time. Then bulk updates can be done a lot more sanely, and changes in the counter method won't cause as much weird drift. :P

But a good start would be to harmonize them:

  • Junk updateArticleCount and merge its check into initStats

This seems like a no-brainer... any reason not to?

  • Change the article count updates to be based on link count in parse state rather than text contents

Besides causing extra parses on save (slow), one obvious problem here is that link count from transcluded templates can change over time. A template might contain links at time T and no links at time T+1. Thus refreshes of links could change the state.

So... a check for transcludes as well, maybe?

Bleah.

ayg wrote:

(In reply to comment #16)

Besides causing extra parses on save (slow), one obvious problem here is that
link count from transcluded templates can change over time. A template might
contain links at time T and no links at time T+1. Thus refreshes of links could
change the state.

So... a check for transcludes as well, maybe?

The easy answer to that is yes, check for transcludes as well, as this bug suggests. ;) It makes sense anyway.
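
The drift problem and the transclusion check can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names and the `template_links` mapping (template name to the number of links it contributes) are hypothetical, not part of MediaWiki.

```python
from typing import Dict, List


def links_after_parse(own_links: int, transcluded: List[str],
                      template_links: Dict[str, int]) -> int:
    # Links in the page's own text plus those contributed by each template.
    return own_links + sum(template_links.get(name, 0) for name in transcluded)


def counted_by_links(own_links: int, transcluded: List[str],
                     template_links: Dict[str, int]) -> bool:
    # Link-count-only rule: depends on the current content of the templates.
    return links_after_parse(own_links, transcluded, template_links) > 0


def counted_with_transclusions(own_links: int, transcluded: List[str],
                               template_links: Dict[str, int]) -> bool:
    # Rule with the extra check: any transclusion is enough on its own.
    return bool(transcluded) or links_after_parse(own_links, transcluded, template_links) > 0


# A page with no links of its own that transcludes one template.
page = {"own_links": 0, "transcluded": ["plural of"]}
at_t = {"plural of": 1}         # template contains a link at time T
at_t_plus_1 = {"plural of": 0}  # link removed from the template at time T+1

print(counted_by_links(**page, template_links=at_t),
      counted_by_links(**page, template_links=at_t_plus_1))
print(counted_with_transclusions(**page, template_links=at_t),
      counted_with_transclusions(**page, template_links=at_t_plus_1))
```

Under the link-only rule the page's state flips when the template is edited, even though the page itself never changed; with the transclusion check the state stays the same across the template edit.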

ayg wrote:

*** Bug 12566 has been marked as a duplicate of this bug. ***

wikt.3.connelm wrote:

Just a comment from en.wiktionary.org: It does not represent community consensus to suggest that anyone beyond a small minority wants transclusions "counted." Perhaps a separate statistic that shows that, but not the count of "good" entries. For example, many assumptions have been made based on the so-called "incorrect" behavior. Entries marked with {{misspelling of}} do not contain any wikilinks, specifically so that they are not counted.

(In reply to comment #19)

Just a comment from en.wiktionary.org: It does not represent community
consensus to suggest that anyone beyond a small minority wants transclusions
"counted." Perhaps a separate statistic that shows that, but not the count of
"good" entries. For example, many assumptions have been made based on the
so-called "incorrect" behavior. Entries marked with {{misspelling of}} do not
contain any wikilinks, specifically so that they are not counted.

Note: This is no longer correct. All mainspace pages on enwiktionary are counted due to a bot adding invisible links to all pages that otherwise don't have links.

(In reply to comment #16)

So we currently have *three different methods* of counting, all different:

  1. on every page update: check for text containing '[['

This is the canonical version; updates to the count on edit assume that the
existing count was based on this -- the total count is incremented or
decremented based on changes in state of this check between previous and new
versions.

  2. on bulk initStats.php: check for non-empty text

This will overcount, including pages which have text but no links.

  3. on bulk updateArticleCount.php: check for non-empty text and outgoing links

This will overcount but not as much, including pages which transclude templates
which themselves have links as well as extensions which record links but don't
contain '[[' in the actual text.

What's the status of this as of 2010? Still three methods?
I guess 3) makes the most sense. Then, when saving an article, it counts outgoing links again (which it needs to do to update pagelinks anyway).

When templates are changed, the job queue that updates caches for transclusions (whatlinkshere) etc. within X minutes could be made to update this count as well.

Anyway, keeping three different methods that are inconsistent with each other seems like a bad thing no matter how we look at it.

I'm marking this as FIXED, because as of r88113 the check is also based on the presence of links in templates.