
/* wfSpecialStatistics */ SELECT COUNT(cur_id) AS total FROM `cur`
Closed, Resolved · Public



This causes a full table scan over several gigabytes of data,
much of which isn't cached, and regularly takes more than a
minute for de and hundreds of seconds for en. Please change
it to use the query below, which uses an index that is
normally very well cached because it's used for many common
article lookups.

mysql> select count(cur_namespace) from cur;
1 row in set (6.29 sec)

mysql> explain select count(cur_namespace) from cur;
| key_len | ref  | rows    | Extra       |
| 270     | NULL | 1444561 | Using index |
1 row in set (0.00 sec)

Next run takes 0.62 seconds. Current code causes results
like this on bacon:

1548197 wikiuser igoeje:53107 enwiki 402
Query /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS
total FROM cur

or this on holbach:

23994523 wikiuser smellie:33404 dewiki 112
Query /* wfSpecialStatistics */ SELECT COUNT(cur_id) AS
total FROM cur

Because of the long run time and high disk activity these
queries are also a regular cause of replication lag. It's
also fairly common to see several of these running at the
same time for wikis like en and de.

The smaller cur records for 1.5 will help these queries but
it'll still be better to have this index used.

Version: 1.4.x
Severity: normal



Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:11 PM
bzimport set Reference to bz1474.
bzimport added a subscriber: Unknown Object (MLST).

zigger wrote:

I have a patch for SpecialStatistics (that I had forgotten to upload) which
uses higher-level methods of the database class, removes some unnecessary
code, and uses COUNT(*), based on some MySQL documentation claiming that is
the optimised form for MyISAM and ISAM tables and that it removes the need
to check for NULL values. Do you have a timing for COUNT(*)?

Count(*) is the optimised form for MyISAM and ISAM table types. In
the case of those non-transactional table types, the value is stored
and immediately available if you use count(*). That is, it's
available in a fraction of a second.

Wikimedia uses the InnoDB table type. So should other MediaWiki
users, when possible, because of the far better crash recovery and
the transaction support, which isn't available in MyISAM.

explain select count(*) from cur;
| Extra       |
| Using index |
1 row in set (0.00 sec)

As you can see, count(*) will use the primary key, which is the
clustered key in InnoDB and so forces the same full table scan.

Optimal form for InnoDB is a count using an index which is already in
cache or is mostly in cache. That's the namespace, title and
timestamp index, which is routinely used for many common queries.
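
To make the optimiser walk that secondary index instead of the clustered
primary key, count a NOT NULL column that the index covers. A sketch of both
forms; the index name in the hint is an assumption, so check SHOW INDEX FROM
cur for the real one:

```sql
-- Counting a NOT NULL column covered by the secondary index lets InnoDB
-- scan the much smaller (namespace, title, timestamp) index instead of
-- the clustered primary key. COUNT(col) skips NULLs, so the column must
-- be NOT NULL for the totals to match COUNT(*).
SELECT COUNT(cur_namespace) AS total FROM cur;

-- Equivalent, with the index choice made explicit via a hint; the index
-- name here is assumed, not taken from the actual schema:
SELECT COUNT(*) AS total FROM cur USE INDEX (name_title_timestamp);
```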

Caution on the MySQL manual: it is often written assuming MyISAM and
often does not say what is different for InnoDB. For InnoDB:

'InnoDB does not keep an internal count of rows in a table. (This
would actually be somewhat complicated because of multi-versioning.)
To process a SELECT COUNT(*) FROM T statement, InnoDB must scan an
index of the table, which will take some time if the table is not
entirely in the buffer pool. To get a fast count, you have to use a
counter table you create yourself and let your application update it
according to the inserts and deletes it does. If your table does not
change often, using the MySQL query cache is a good solution. SHOW
TABLE STATUS also can be used if an approximate row count is
sufficient. See Section 15.12, “InnoDB Performance Tuning Tips”.'

The "the table is not entirely in the buffer pool" part is in error:
it's the index which must be in the buffer pool, not the table, when
scanning an index. I've submitted the correction.

There's already a table for statistics and any statistics reporting
code should ideally just ask an object for the count, with the
object using and occasionally updating the statistics table. Perhaps
once every hour or every ten minutes. But for now, just changing the
variable will be a major improvement.
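
The counter-table pattern the manual describes could look something like
this; the table and column names are illustrative, not MediaWiki's actual
site statistics schema:

```sql
-- Illustrative counter table: a single row the application keeps current.
CREATE TABLE stats_counter (
  sc_articles BIGINT UNSIGNED NOT NULL DEFAULT 0
) ENGINE=InnoDB;

-- Maintained alongside the writes that change the count:
UPDATE stats_counter SET sc_articles = sc_articles + 1;  -- on page creation
UPDATE stats_counter SET sc_articles = sc_articles - 1;  -- on page deletion

-- The statistics page then reads one hot row instead of scanning cur:
SELECT sc_articles FROM stats_counter;
```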


zigger wrote:

REL1_4 patch for SpecialStatistics.php

Optimises string quotes, uses higher-level database class methods, removes
misplaced call to getSkin(), and optimises implicit db index usage by doing
the count on columns that are not the primary index where possible. Simple
test was done locally.

zigger wrote:

HEAD patch for SpecialStatistics.php

Uses higher-level database class methods, removes misplaced call to getSkin(),
optimises implicit db index usage by doing count on columns that are not the
primary index where possible. Simple test was done locally.


Those look good. Thanks.

One extra note on HEAD, though it also applies to 1.4: LIKE '%
anything' always causes a full table scan. On the "when time is
available" to-do list should be something like different rows in
user_rights for each capability, with ur_user, ur_right as the primary
key. Then the leading % can be dropped and it'll be a much faster
indexed query. Shorter term goodness would be forcing sysop to be the
first user right if present, then scanning for 'sysop%'. The request
for lists of sysops is frequent enough that it's worth some work to
make that particular form use an index.
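
A sketch of both suggestions, with column, index, and table names assumed
for illustration (the real user_rights schema may differ):

```sql
-- Longer term: one row per (user, right). A secondary index on ur_right
-- turns "list all sysops" into an indexed lookup rather than a scan.
CREATE TABLE user_rights_normalised (
  ur_user  INT UNSIGNED NOT NULL,
  ur_right VARCHAR(32) NOT NULL,
  PRIMARY KEY (ur_user, ur_right),
  KEY ur_right (ur_right)
) ENGINE=InnoDB;

SELECT ur_user FROM user_rights_normalised WHERE ur_right = 'sysop';

-- Shorter term: keep 'sysop' first in the existing packed rights string,
-- so the leading wildcard can be dropped and an index on the column can
-- serve a prefix match:
SELECT ur_user FROM user_rights WHERE ur_rights LIKE 'sysop%';
```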

Changed in REL1_4 and HEAD to use COUNT(cur_name)/COUNT(page_name)
on James's suggestion.

I'm not applying the patch, as it makes unnecessary changes to use the new
functions without fitting the existing coding style for use of those functions.