On paged categories, article totals and subcat totals are incorrect
Closed, ResolvedPublic

Description

On large categories, both the articles and subcats are paged. This is fine, but
the totals given for both articles and subcats are incorrect.

Test case: Category:Business

Expected result: "There are [correct total] subcategories to this category."

Actual result: "There are 17 subcategories to this category." - it listed only
the ones on the page.

Quick fix: change text to say "on this page" - then it's true, but not so useful
for the reader.
Better fix: show the correct total for the entire category.


Version: unspecified
Severity: normal
URL: http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla

bzimport added a project: MediaWiki-Categories.Via ConduitNov 21 2014, 8:05 PM
bzimport set Reference to bz1212.
David_Gerard created this task.Via LegacyDec 27 2004, 5:11 PM
brion added a comment.Via ConduitJan 29 2005, 8:12 AM
  • Bug 1336 has been marked as a duplicate of this bug. ***
brion added a comment.Via ConduitFeb 25 2005, 8:01 AM
  • Bug 1588 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitJun 14 2005, 11:04 PM

gangleri wrote:

Hallo!

There are some differences now between the test wikis FiverAlpha and Nukawiki:

See:
http://jadesukka.homelinux.org:8180/betawiki/Betawiki:Categories/counters#bugzilla

http://test.leuksman.com/index.php?title=Category:Todo&from=Z
looks pretty good compared to
http://jadesukka.homelinux.org:8180/mediawiki15c/index.php?title=Luokka:Todo&from=Z

Regards Reinhardt [[user:gangleri]]

David_Gerard added a comment.Via ConduitSep 9 2005, 5:16 AM

*** Bug 3095 has been marked as a duplicate of this bug. ***

bzimport added a comment.Via ConduitOct 12 2005, 5:39 AM

webmaster wrote:

It appears this error has made its way to 1.5.0 as seen on my site here:
http://www.marveldatabase.com/wiki/index.php/Category:Characters

I definitely think that David Gerard's 'better fix' should be the course of
action. Category pages should show the FULL count for all articles and sub-
categories of the category in question.

Also, the category page only displays the first 200 total links, both sub-
categories and articles combined. (See my referenced link) Shouldn't it show the
first 200 articles and ALL (limited to x, for SQL query-time reasons) sub-
categories?

bzimport added a comment.Via ConduitOct 15 2005, 1:11 PM

gangleri wrote:

*** Bug 3714 has been marked as a duplicate of this bug. ***

bzimport added a comment.Via ConduitOct 15 2005, 2:51 PM

arenalpha wrote:

Hope this gets fixed..... eventually...

bzimport added a comment.Via ConduitDec 19 2005, 7:50 PM

webmaster wrote:

Any word on this bug getting some attention before version 1.6?

bzimport added a comment.Via ConduitDec 26 2005, 6:20 PM

robchur wrote:

*** Bug 4390 has been marked as a duplicate of this bug. ***

bzimport added a comment.Via ConduitDec 27 2005, 12:10 AM

stanley wrote:

Five similar bugs!. This is a very urgent request Mr.Developer!

bzimport added a comment.Via ConduitDec 27 2005, 2:30 PM

zigger wrote:

(changed URL to category currently with paging)

tstarling added a comment.Via ConduitJan 6 2006, 1:24 PM

Is SELECT COUNT(*) good enough here? Or do we need something more efficient?

bzimport added a comment.Via ConduitJan 6 2006, 2:36 PM

webmaster wrote:

First, it should display a TOTAL of all pages within the category (and a total of
all categories in the category). Second, it should give the user the option to
display the first 20, 50, 100 or 200 and have a 'previous' and 'next' feature.
Something closer to the 'recent changes page' setup...

Currently, the way it works is fine, but it lacks a TOTAL sum, the only way to
find out is to flip through page by page and count yourself. I also like the idea
of adding the choice of limiting the count (20, 50, 100 or 200) like all the
other special pages in MediaWiki.

Just my thoughts...

bzimport added a comment.Via ConduitJan 6 2006, 3:50 PM

robchur wrote:

(In reply to comment #12)

Is SELECT COUNT(*) good enough here? Or do we need something more efficient?

I don't see why not.

brion added a comment.Via ConduitJan 6 2006, 8:37 PM

The rumor was that SELECT COUNT(*) was too expensive for large categories (eg
Category:GFDL on Commons, with tens of thousands of entries). I'm not sure how real
that problem is.

bzimport added a comment.Via ConduitJan 6 2006, 11:38 PM

webmaster wrote:

I know how this will probably require the creation of another unessecary table,
but why don't we have a table that keeps the tallies of articles in each
category... This data could be cached server-side for larger wikis and cached
client-side for smaller ones. Everyone is happy and we still get our count!

When the cache expires, the query is called. Life is beautiful. :)

tstarling added a comment.Via ConduitJan 7 2006, 12:14 AM

We could just remove the count from the UI. That's what we do for
[[Special:Whatlinkshere]], which has precisely the same schema problem.

bzimport added a comment.Via ConduitJan 7 2006, 5:52 AM

stanley wrote:

(In reply to comment #17)

We could just remove the count from the UI. That's what we do for
[[Special:Whatlinkshere]], which has precisely the same schema problem.

Which is now being requested at bug 4394 (Provide a count in Special:Whatlinkshere)

bzimport added a comment.Via ConduitJan 8 2006, 1:00 AM

beland wrote:

Editors often count up how many articles are in maintenance categories.
Removing the automatic count would make that a lot more work. And it's useful
to readers to know what is below the virtual fold before they invest the time it
takes to get there.

Perhaps there is a clever query that would give an exact count for
reasonably-sized categories and an "overflow" indication for anything larger
than a certain threshold (possibly with a "click here to get an exact count"
link). Though it might be easiest just to implement the simply COUNT solution
at an off-peak time and check to see what kind of a performance hit there is, if
any. That would be useful information in general, and would also help set the
threshold optimally.

bzimport added a comment.Via ConduitFeb 4 2006, 1:43 AM

webmaster wrote:

Has anyone been able to do a successful test on what kind of performance issues
this may create? Could we not extrapolate an estimate? Grab a page-render time for
a <200 article COUNT and multiply by 50 and 500 to figure for a maximum threshold
of 10,000 or 100,000? (Accounting for other page-rendering factors, of course...)

bzimport added a comment.Via ConduitMay 30 2006, 9:27 PM

robchur wrote:

*** Bug 6146 has been marked as a duplicate of this bug. ***

Purodha added a comment.Via ConduitMay 31 2006, 5:31 AM

Design Suggestions:

  1. MediaWiki:Categoryarticlecount and MediaWiki:subcategorycount shoud have params $1 to $7 at least making

messages possible like: "showing 200 articles from # 401 'cat ears' to # 600 'dog pounds' of a total of 782
articles in category 'cats and dogs'."

  1. Count of articles/subcats per page (currently 200) to be installable in "user preferences" and selectable

via paging option links (display previous/next 10|50|100|200|500|1000|....) as elsewhere.

  1. Count of columns of articles/subcats per page (currently 3) to be set in "user preferences" (as 1...10, e.g

.) and possibly via paging option links.

  1. Show a page directoy, if there is more than one page, like with Special:Allpages, where pages are listed and

linked to in a "$first_to_$last" fashion. If there are more page diretory entries than fit on a page, make a
directory of page directories. (Iterate til page size is matched)

  1. With the page directory, MediaWiki:Categoryarticlepagingcount and MediaWiki:subcategorypagingcount should

be introduced having params $1 to $12 at least, allowing messages like: "showing 678 of 12.345 page directory
entries in category 'cats & dogs', from # 679 to # 1356 at level 2, corresponding to 134.500 subcategories of
12.345.678.987, covering subcategory # 135.600 (lions) to subcategory # 1.234.200 (eliphants)"

brion added a comment.Via ConduitNov 16 2006, 3:58 PM
  • Bug 7950 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitNov 24 2006, 2:18 AM

webmaster wrote:

There has been no developer commentary on this in nearly a year. Did we ever
determine if it was indeed too expensive?

What about the new table field to keep track as discussed.

Any thoughts Brion / Rob / Tim?

brion added a comment.Via ConduitNov 29 2006, 10:32 PM

Offhand, I might recommend changing the indexing for the categorylinks table to split up by namespace, _then_ sort key.

This would allow counting, paging, extracting etc the various namespace groups separately.
Thus category pages (subcats) and image pages (media) could be treated separately rather easily.
Other namespaces would also end up being paged separately, though, so it might take some UI
restructuring.

Raymond added a comment.Via ConduitDec 12 2006, 3:59 PM
  • Bug 8235 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitDec 31 2006, 5:34 AM

webmaster wrote:

How is the 'members' data generated on [[Special:Categories]]? (Bug 146)

It seems that each category has had a SELECT COUNT(*) to build the info (comment
#12) or is this data held somewhere else in a table?

If this is indeed a COUNT(*), wouldn't this be several times more expensive
(comment #15) than just counting the number of members when looking at just ONE
category?

Example:
http://www.marveldatabase.com/index.php?title=Special:Categories&limit=500&offset=500

  • Look at the 'Characters' category with over 6,400 'members'.

http://www.marveldatabase.com/Category:Characters - Shows 'There are 200
articles in this category', when we know this is only a count of how many appear
on this page.

Could we somehow use this 'member' count to resolve this bug?
(Don't forget this 'total count' should have a separate count for each
'articles' and 'images' in a category.)

If we can show: "Character Galleries ‎(1,095 members)", then we should be able
to accurately state: "There are 1,095 articles in this category"...

bzimport added a comment.Via ConduitDec 31 2006, 5:42 PM

ayg wrote:

There is no COUNT(*). If there were, there would be an actual count. ;) As it
is, we just iterate through the returned rows in PHP until there aren't any
more, and then keep the number handy as the count of how many categories there
are. Since we have a LIMIT clause, of course, this means the count only goes up
to the limit.

The question is whether a COUNT(*) would, in fact, be too slow. Just nobody has
bothered to determine whether it is, apparently. Don't look at me, I'm not
qualified. Maybe bug Domas or something.

bzimport added a comment.Via ConduitDec 31 2006, 7:17 PM

webmaster wrote:

So what is the LIMIT? (10,000/100,000?) And upon what circumstance do we run
through this 'iteration'?

Furthermore, why don't we just create a field that holds this 'pseudo-count'
with each category, which gets updated with each iteration. It will behave like
a 'cache' such that the figure isn't up-to-the-second, but depending on how
often the figures are updated, it might be worthwhile...

So far, Brions idea of first splitting up by namespace, then sort key sounds
best, long-term, I think. It would substantially lighten the load for
search/counting/paging etc.

Would this take much re-working of the categorylinks table?

bzimport added a comment.Via ConduitDec 31 2006, 7:27 PM

robchur wrote:

(In reply to comment #29)

So what is the LIMIT?

The paging limit is 200.

bzimport added a comment.Via ConduitFeb 25 2007, 9:03 PM

gunter.schmidt wrote:

If somebody would like to try the often diskussed question: "Is select count(*) too slow?", he might want to change this code in
CategoryPage.php. The field is indexed, so it should be rather fast.

Replace the following function (retrieved from MW 1.9.2):

function getPagesSection() {

$dbr =& wfGetDB( DB_SLAVE );
$catCount = $dbr->selectField( 'categorylinks', 'count(*)', array('cl_to' => $this->title->getDbKey() ), 'getPagesSection');
$ti = htmlspecialchars( $this->title->getText() );
$r = "<div id=\"mw-pages\">\n";
$r .= '<h2>' . wfMsg( 'category_header', $ti ) . "</h2>\n";
$r .= wfMsgExt( 'categoryarticlecount', array( 'parse' ), $catCount );#count( $this->articles) );
$r .= $this->formatList( $this->articles, $this->articles_start_char );
$r .= "\n</div>";
return $r;

}

bzimport added a comment.Via ConduitFeb 25 2007, 9:10 PM

gunter.schmidt wrote:

Given this small change, it might be wise to add a switch to MediaWiki, like $wgFullCategoryArticleCount = true / false. This way,
the best configuration can be set by the admin.

bzimport added a comment.Via ConduitMar 17 2007, 3:37 AM

webmaster wrote:

This might be exactly what we are looking for...
Could some one please implement this?

I very much also like the idea of a switch (off by default for Wikipedia's sake, of course ;)

bzimport added a comment.Via ConduitApr 30 2007, 3:43 AM

webmaster wrote:

Rob/Brion/et al,

Does Gunter's solution look viable?

bzimport added a comment.Via ConduitMay 10 2007, 5:24 PM

ayg wrote:

It looks viable. See URL:

http://en.wikipedia.org/wiki/Special:Categories?offset=Living_peopla

It takes just a few seconds to load, and that's the worst possible case at
present. Capping it at 250,000 should be fine if we're okay with a few category
pages taking an extra second or two to load. This is assuming that the only
issue is latency.

Adding "shell" to flag this for sysadmins to look at for performance impact.

bzimport added a comment.Via ConduitSep 18 2007, 10:57 PM

webmaster wrote:

That page seems to load quite quickly for me and it seems to be not cached. It is taking as long as it does for all 50 categories to be looked up. One can guesstimate the performance hit to check 1 category should be fractional, comparatively.

Does the information present in this bug, above, possibly provide a feasible solution?

bzimport added a comment.Via ConduitSep 18 2007, 10:59 PM

webmaster wrote:

Is there any way to accurately time how long this takes from a raw query standpoint?

http://en.wikipedia.org/w/index.php?title=Special:Categories&offset=Living_peopla&limit=1

bzimport added a comment.Via ConduitSep 18 2007, 11:03 PM

robchur wrote:

Somewhere, probably not on this laptop, I've got an uncommitted working copy of MediaWiki which effectively tracks and updates simple tallies of the number of pages in each category, and the number of subcategories, in a relatively efficient manner, by hooking into the link update process. This would make such a count theoretically trivially inexpensive.

I'll dig it out this weekend and whack a patch somewhere for further review and criticism.

bzimport added a comment.Via ConduitSep 18 2007, 11:11 PM

ayg wrote:

(In reply to comment #37)

Is there any way to accurately time how long this takes from a raw query
standpoint?

http://en.wikipedia.org/w/index.php?title=Special:Categories&offset=Living_peopla&limit=1

It takes 4.42 seconds on the toolserver to run the COUNT(*) query on Category:Living people, with 221446 rows. A proper solution would be better.

Turnstep added a comment.Via ConduitOct 18 2007, 2:31 PM

Created attachment 4268
Show true category counts

The count(*) solution seems to work fine for me - subsecond responses as it hits the cl_sortkey index. I made the table have 1.1 million rows, and a category with 221,000 entries in it still ran in under a second. (This is for Postgres - MySQL also seemed to use the index, but was much slower about it - maybe this is better on a newer version of MySQL? I only have an early 5.0 available at the moment). Still, no reason not to make this an option for sites that don't have large categorylinks tables, but want a true count. Patch attached: although probably deprecated by Rob's future one, maybe someone else can find use of it until then.

attachment fullcatcount.patch ignored as obsolete

bzimport added a comment.Via ConduitMar 9 2008, 4:31 AM

ayg wrote:

First pass at a patch (not yet ready for commit)

This patch adds a category table, as well as a Category class. The first application is this bug. I've gotten the basic functionality working, in that I can successfully view a category page and get the correct count. It needs considerable polishing, which I hope to put in tomorrow. Ideally I'll commit it to trunk in non-scary form tomorrow (i.e., a form that allows it to require no expensive operations on Wikimedia), and persuade Brion or someone to populate it properly at some point thereafter. However, I don't exactly trust my timetables. :)

I've marked Greg's patch obsolete. It would be neat if we could tolerate such large scans, because the normalized solution is of course a lot nicer, but we can't, unfortunately. If Postgres can handle the scanning, it would be neat if it implemented the category table as a view, with the count columns running the queries transparently and ignoring manual updates.

attachment categories.diff ignored as obsolete

bzimport added a comment.Via ConduitMar 10 2008, 1:49 AM

ayg wrote:

Almost-complete patch

The patch is nearly complete. Basic features:

  • Adds Category and CategoryList classes to represent categories themselves.
  • Adds a category table, giving each category a name, ID, and counts of all members, subcats only, and files.
  • Adds an updatelog table to track DB updates that can't be inferred from the schema, and uses it to track whether the category table has been populated.
  • Adds a maintenance script to populate the category table efficiently. This script is careful to wait for slaves and should be safe to run on a live database. The maintenance script's includes file is called by update.php.
  • Until the category table is populated, the patch handles weird category table rows gracefully. It detects whether they're obviously impossible, and if so, it outputs appropriate messages. Sample messages:
    • "The following 200 pages are in this category, out of 556 total." (category table has sane entry)
    • "The following 158 pages are in this category, out of 158 total." (category table is ignored, since this is the full set anyway)
    • "The following 200 pages are in this category." (category table entry is impossible, like less than 200: no commitments as to total number)

I would like review on this before I commit it, which I would like to do by the end of next weekend at the latest. The one major glaring blocker omission is that it totally ignores link updates: the only way to populate the category table is by running the maintenance script. Obviously it's useless until this is fixed. The other flaws that occurred to me I've written down as FIXMEs, some of which I just need to code a fix for and others of which I'd like comments on. Comments from anyone are appreciated, but especially Brion, if he has the time.

attachment categories2.diff ignored as obsolete

bzimport added a comment.Via ConduitMar 10 2008, 6:07 AM

sylvain_machefert wrote:

Hello,

I'll probably test your patch, but don't know when.

Can the maintenance be run from the web, or do I need console access ?

(In reply to comment #42)

  • Adds a maintenance script to populate the category table efficiently. This script is careful to wait for slaves and should be safe to run on a live database. The maintenance script's includes file is called by update.php.
bzimport added a comment.Via ConduitMar 10 2008, 10:17 AM

robert wrote:

Console access is required for all maintenace scripts, but a PHP or Java web-based SSH\telnet client provided by your host will usually suffice.

bzimport added a comment.Via ConduitMar 10 2008, 11:03 PM

ayg wrote:

Same as previous attachment, with all files actually included

I forgot to svn add a few things, whoops.

attachment categories3.diff ignored as obsolete

bzimport added a comment.Via ConduitMar 16 2008, 7:26 PM

ayg wrote:

Patch with all functionality, needs review

Okay, it now updates correctly on LinksUpdate and page deletion. I think this is it, but I really really want review from Brion or Tim before I check it in, since otherwise it will probably get reverted.

Attached: categories4.diff

werdna added a comment.Via ConduitMar 17 2008, 5:08 AM

Looks good (I spent about 10 minutes reading through the code, not a substitute for thorough testing). The only immediate issue I can see is that it's not obvious whether you've considered this scenario:

The schema change is applied, and a category which does not have an already-populated row in the category table is deleted from a page. If you haven't considered that, you'd get a PHP error for trying to subtract 1 from a signed field with value 0.

Maybe I didn't pick up on that logic in your code.

Well done, by the way. It's great to see this finally fixed up.

bzimport added a comment.Via ConduitMar 17 2008, 1:20 PM

ayg wrote:

(In reply to comment #47)

The schema change is applied, and a category which does not have an
already-populated row in the category table is deleted from a page. If you
haven't considered that, you'd get a PHP error for trying to subtract 1 from a
signed field with value 0.

Maybe I didn't pick up on that logic in your code.

I explicitly noted that in a comment:

+ # ... If [a row for] one of the "deleted" [categories] didn't
+ # exist, we might legitimately not create it, but it's simpler to just
+ # create it and then give it a negative value, since the value is bogus
+ # anyway.

You may have missed that the fields are actually signed, so you'll just get -1 with no errors, since that's a perfectly valid value for a signed int. I made it signed so that underflow errors are more obvious (as the comment in tables.sql says).

I actually thought of a better way to do the update, in a way that doesn't require explicit LOCK IN SHARE MODE, since it's all one query (but no subqueries needed either). I'll change it to use that and then just check it in. Hopefully any errors that are spotted are minor enough to just be fixed on the spot instead of having to back it out.

bzimport added a comment.Via ConduitMar 18 2008, 12:20 AM

ayg wrote:

Patch committed in r32085.

Add Comment