Page MenuHomePhabricator

page2xml trouble
Closed, ResolvedPublic

Description

Author: bugzilla_wikipedia_org.to.jamesd

Description:
At Wikimedia sites the following page2xml query is now being automatically
killed on sight, after use of it caused a problem on one of the database servers
handling de:

/* page2xml */ SELECT old_id as id,old_timestamp as timestamp,old_user as
user,old_user_text as user_text,old_comment as comment,old_text as
text,old_flags as flags FROM old WHERE old_namespace='4' AND
old_title='whatever' ORDER BY old_timestamp

This query asks for all versions of the article, at about one disk seek per
revision for the current schema, unless the revision is in cache. Worse, it ends
up using the old_title index, so it actually retrieves from disk all revisions
in all namespaces. At the time the server overload caused me to take a look
there were three of these running with run times of 891, 583 and 339 seconds.
The last was killed by the changed querybane rules after 805 seconds.

Adding use_index(name_title_timestamp) improves it significantly and should be
done but it still needs a limit added to it.

The planned schema change, if it stores old articles with article ID as the
first part of the primary key, would make this far more efficient because
adjsacent revisions would be in the same database pages. It would still
potentially retrieve hundreds of thousands of revisions for a popular article so
some limit is still going to be needed even with the new schama. If page2xml is
intended to be a general retrieve an article call, that limit needs to be very
low - tens not hundreds of revisions.

For Wikimedia sites, limit 10 is a good choice at present, assuming the primary
purpose isn't retrieving the article history.


Version: 1.4.x
Severity: normal

Details

Reference
bz1324

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 8:08 PM
bzimport set Reference to bz1324.
bzimport added a subscriber: Unknown Object (MLST).

jeluf wrote:

enwiki doesn't have the old_title index, nor has a fresh REL1_4 install.
dewiki has this index.

Should old_title index be dropped? I'm not aware of any queries that use
old_title but not old_namespace.

bugzilla_wikipedia_org.to.jamesd wrote:

The old_title index was in some of the wikis - de, fr, pl, probably others, but
not in the new ones I looked at. I'll remove it when practical, since it's not
present in the newer wikis. The query does select the correct index when that
one is removed. Not going to be enough to make it practical to return all old
metadata and text, though.

bugzilla_wikipedia_org.to.jamesd wrote:

Obsolete - old is gone now.