Divide wikis into database lists by approximate size for performance engineering
Closed, Resolved · Public

Description

There are a number of bugs in which small wikis are unfairly impacted by the performance constraints of large wikis. For example, many Special pages have been disabled across all Wikimedia wikis (cf. bug 15434). A small wiki such as ch.wikipedia.org, with 151 content pages, is treated the same as a wiki with over four million content pages. This doesn't make any sense.

This situation is unacceptable. A small wiki should not see a reduced user experience because of the existence of (almost entirely unrelated) wikis that have millions of content pages. We know the approximate sizes involved, so we should be able to safely and sanely tier these wikis (and then periodically check those tiers for accuracy and appropriateness). While we all wish that every wiki could be treated equally, it doesn't make any sense to punish small wikis indefinitely due to circumstances over which they have no control or involvement (i.e., an explosion in growth on a sibling project).

Some stats are available at https://wiki.toolserver.org/view/Wiki_server_assignments. There are other lists at Meta-Wiki, I believe. And I can query the *links tables for size if that's deemed necessary.

As far as I understand this, step one would be to make a set of groupings and then create individual wiki lists. Or perhaps just have a small.dblist or a large.dblist and add conditional statements based on that?

It looks like a small.dblist may already exist, even? Is that a list of small wikis (https://noc.wikimedia.org/conf/small.dblist doesn't load for me)?


Version: unspecified
Severity: enhancement

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz39667.
MZMcBride created this task. Via Legacy · Aug 26 2012, 3:18 PM
Krenair added a comment. Via Conduit · Aug 26 2012, 4:13 PM

This looks useful: http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size

Where should the line be between a large and a small wiki?

MZMcBride added a comment. Via Conduit · Aug 26 2012, 4:16 PM

(In reply to comment #1)

Where should the line be between a large and a small wiki?

Any number is going to be arbitrary. Maybe the actual first step is to write a maintenance script that can evaluate the size of each wiki in the cluster and then output a file based on those sizes (with a --size flag or something). So it'd be something like "php measureWikis.php --size=10000 > large.dblist"?

Measuring the number of content pages is probably easiest, as it's a stored value (in site_stats) and it gives a decent comparison between wikis (or it should in theory, at least).
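
As a rough sketch of that idea (the script name is the hypothetical one suggested above; connection details, option names, and the all.dblist path are assumptions, not an existing tool):

<?php
// Hypothetical sketch only: print the wikis whose content page count
// (ss_good_articles in site_stats) is at or above a --size threshold, so the
// output can be redirected into a dblist file. Connection details and the
// all.dblist path are assumptions for illustration.
$options = getopt( '', array( 'size:', 'dblist:' ) );
$threshold = isset( $options['size'] ) ? (int)$options['size'] : 10000;
$dblistPath = isset( $options['dblist'] ) ? $options['dblist'] : 'all.dblist';

$wikis = array_filter( array_map( 'trim', file( $dblistPath ) ) );

$db = new mysqli( 'localhost', 'wikiadmin', 'secret' ); // assumed credentials
if ( $db->connect_error ) {
    fwrite( STDERR, "Connection failed: {$db->connect_error}\n" );
    exit( 1 );
}

foreach ( $wikis as $dbname ) {
    if ( !preg_match( '/^[a-z0-9_]+$/', $dbname ) ) {
        continue; // skip anything that does not look like a database name
    }
    // site_stats holds a single row of cached counters per wiki.
    $result = $db->query( "SELECT ss_good_articles FROM `$dbname`.site_stats" );
    if ( !$result ) {
        continue; // skip wikis we cannot read
    }
    $row = $result->fetch_assoc();
    if ( $row && (int)$row['ss_good_articles'] >= $threshold ) {
        echo "$dbname\n";
    }
}

It would then be run roughly as suggested above, e.g. "php measureWikis.php --size=10000 > large.dblist".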

Krinkle added a comment. Via Conduit · Aug 26 2012, 5:03 PM

(In reply to comment #1)

This looks useful:
http://meta.wikimedia.org/wiki/List_of_Wikimedia_projects_by_size

Where should the line be between a large and a small wiki?

That Meta page is auto-generated based on Special:Statistics, which in turn just reads the site_stats database table. So (not to be nitpicky), just to be clear: if and when we use a server-side script to create dblists[1] grouped by page count, it can simply query the database directly; there's no need to go through that wiki page.

[1] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=tree

Krinkle added a comment. Via Conduit · Aug 26 2012, 5:05 PM

By the way, for technical aspects we should probably use total page count rather than article count, so that file pages, categories, and user pages are also taken into account. As far as the database is concerned, pages and revisions are all the same whether they are articles or not.

Fortunately both total page count and article count are tracked in site_stats.
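
For reference, both counters live in the single row of site_stats and can be read in one query. A minimal sketch, with connection details and the example database name assumed for illustration:

<?php
// Minimal sketch: read both cached counters from site_stats. Host, user,
// password, and the example database name are assumptions for illustration.
$db = new mysqli( 'localhost', 'wikiadmin', 'secret', 'chwiki' );
if ( $db->connect_error ) {
    exit( "Connection failed: {$db->connect_error}\n" );
}
$row = $db->query( 'SELECT ss_total_pages, ss_good_articles FROM site_stats' )->fetch_assoc();
printf( "total pages: %d, content pages: %d\n", $row['ss_total_pages'], $row['ss_good_articles'] );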

MZMcBride added a comment. Via Conduit · Aug 26 2012, 7:02 PM

Marking this as easy. Writing a maintenance script to query the cluster and output the dblist(s) should be trivial.

MZMcBride added a comment. Via Conduit · Sep 25 2012, 3:11 AM

# Disable all the query pages that take more than about 15 minutes to update
# wgDisableQueryPageUpdate @{
'wgDisableQueryPageUpdate' => array(
    'enwiki' => array(
        'Ancientpages',
        // 'CrossNamespaceLinks', # disabled by hashar - bug 16878
        'Deadendpages',
        'Lonelypages',
        'Mostcategories',
        'Mostlinked',
        'Mostlinkedcategories',
        'Mostlinkedtemplates',
        'Mostrevisions',
        'Fewestrevisions',
        'Uncategorizedcategories',
        'Wantedtemplates',
        'Wantedpages',
    ),
    'default' => array(
        'Ancientpages',
        'Deadendpages',
        'Mostlinked',
        'Mostrevisions',
        'Wantedpages',
        'Fewestrevisions',
        // 'CrossNamespaceLinks', # disabled by hashar - bug 16878
    ),
),
# @} end of wgDisableQueryPageUpdate

Source: http://noc.wikimedia.org/conf/InitialiseSettings.php.txt. Just pasting this here so I don't lose it.
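
For context on how a per-wiki array like the one above is applied: an exact database-name key such as 'enwiki' overrides the 'default' entry. A simplified sketch of that lookup, purely for illustration (the function and variable names are placeholders, not the actual configuration code):

<?php
// Simplified illustration of per-wiki settings resolution: an exact database
// name key wins, otherwise the 'default' entry applies. This is a sketch of
// the idea, not the real wmf-config implementation.
function resolveSetting( array $setting, $dbname ) {
    if ( array_key_exists( $dbname, $setting ) ) {
        return $setting[$dbname];
    }
    return array_key_exists( 'default', $setting ) ? $setting['default'] : null;
}

// With the array pasted above, enwiki gets the longer list and every other
// wiki gets the 'default' list.
$disableQueryPageUpdateConfig = array(
    'enwiki'  => array( 'Ancientpages', 'Deadendpages', 'Lonelypages' /* ... */ ),
    'default' => array( 'Ancientpages', 'Deadendpages' /* ... */ ),
);
$disabledHere = resolveSetting( $disableQueryPageUpdateConfig, 'chwiki' ); // the 'default' list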

Reedy added a comment. Via Conduit · Nov 15 2012, 11:37 PM

(In reply to comment #5)

Marking this as easy. Writing a maintenance script to query the cluster and
output the dblist(s) should be trivial.

I've actually just restored small.dblist from the history books.

It's VERY out of date.

https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=blob;f=small.dblist;h=5b0a78abf7fe1018576518382cae7a4f5342e422;hb=HEAD

MZMcBride added a comment. Via Conduit · Nov 16 2012, 12:20 AM

(In reply to comment #7)

(In reply to comment #5)

Marking this as easy. Writing a maintenance script to query the cluster and
output the dblist(s) should be trivial.

I've actually just restored small.dblist from the history books.

I'm not sure what value that provides other than nostalgia. It's a very out-of-date list that needs a maintenance script of some kind to re-generate (update) it. If you want to use "small.dblist" as the name of the small-databases list for nostalgia's sake (and continuity's sake as well, I suppose), that's fine, I guess. But we're really no closer to resolving this bug.

Reedy added a comment. Via Conduit · Nov 16 2012, 2:06 AM

Created attachment 11366
Sizes!

attachment sizes.txt ignored as obsolete

Reedy added a comment. Via Conduit · Nov 16 2012, 2:06 AM

(In reply to comment #9)

Created attachment 11366 [details]
Sizes!

That's using the value of ss_good_articles from the site_stats table.

attachment sizes.txt ignored as obsolete

Reedy added a comment. Via Conduit · Nov 16 2012, 2:19 AM

Basic script (work in progress!) to dump all the wikis sorted by ss_good_articles in https://gerrit.wikimedia.org/r/#/c/33694

Reedy added a comment. Via Conduit · Nov 16 2012, 4:54 PM

Created attachment 11379
ss_total_pages

Attached: ss_total_pages.txt

Reedy added a comment. Via Conduit · Dec 7 2012, 10:34 PM

Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists to noc conf etc

MZMcBride added a comment. Via Conduit · Dec 26 2012, 3:30 AM

(In reply to comment #13)

Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists
to noc conf etc

This change has now been merged.

I wonder what more is needed to resolve this bug.

Aklapper added a comment. Via Conduit · Jan 4 2013, 3:08 PM

(In reply to comment #13 by Reedy)

Updated https://gerrit.wikimedia.org/r/#/c/33694 moar and added the dblists
to noc conf etc

Reedy: Any idea what else is needed to resolve this request completely?

Reedy added a comment. Via Conduit · Jan 4 2013, 3:15 PM

Personally (let Max chime in), I would've thought that this was enough.

We've now got a script to make size-related dblists (parameters might want changing at a later date, but that's trivial). Those dblists have been created and are exposed via noc.

The next task is to potentially do something for bug 15434 using those new lists.
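
As a rough illustration of what that could look like, here is a minimal sketch of the "conditional statements based on small.dblist" idea from the task description; the file path, variable names, and page list are assumptions, not the actual wmf-config change:

<?php
// Hypothetical sketch only: read the generated dblist and disable expensive
// query page updates on wikis that are not in it. The file path and the page
// list are illustrative, not the real configuration.
$smallWikis = array_filter( array_map( 'trim', file( __DIR__ . '/small.dblist' ) ) );

$expensiveQueryPages = array(
    'Ancientpages',
    'Deadendpages',
    'Mostlinked',
    'Mostrevisions',
    'Wantedpages',
    'Fewestrevisions',
);

// $wgDBname is the database name of the wiki currently being served.
if ( in_array( $wgDBname, $smallWikis, true ) ) {
    $wgDisableQueryPageUpdate = array();              // small wiki: keep query pages updating
} else {
    $wgDisableQueryPageUpdate = $expensiveQueryPages; // large wiki: skip the expensive updates
}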

MZMcBride added a comment. Via Conduit · Jan 6 2013, 2:31 AM

Marking this bug resolved/fixed now that bug 43668 ("Re-enable disabled Special pages on small wikis (wikis in small.dblist)") exists. Thanks again, Reedy!
