Author: bugzilla_wikipedia_org.to.jamesd
Description:
At Wikimedia sites the following page2xml query is now being automatically
killed on sight, after use of it caused a problem on one of the database servers
handling de:
/* page2xml */ SELECT old_id AS id, old_timestamp AS timestamp,
old_user AS user, old_user_text AS user_text,
old_comment AS comment, old_text AS text, old_flags AS flags
FROM old
WHERE old_namespace='4' AND old_title='whatever'
ORDER BY old_timestamp
This query asks for all versions of the article, at roughly one disk seek per
revision under the current schema, unless the revision is already in cache.
Worse, it ends up using the old_title index, so it actually retrieves from disk
all revisions of that title in all namespaces. When the server overload prompted
me to take a look, three of these were running, with run times of 891, 583 and
339 seconds. The last was killed by the changed querybane rules after 805
seconds.
Adding USE INDEX (name_title_timestamp) improves it significantly and should be
done, but the query still needs a LIMIT added to it as well.
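A minimal sketch of the amended query, assuming MySQL's USE INDEX hint syntax
and using the LIMIT 10 value suggested for Wikimedia sites at the end of this
report:

```sql
/* page2xml */ SELECT old_id AS id, old_timestamp AS timestamp,
       old_user AS user, old_user_text AS user_text,
       old_comment AS comment, old_text AS text, old_flags AS flags
FROM old USE INDEX (name_title_timestamp)  -- avoid the old_title index scan
WHERE old_namespace='4' AND old_title='whatever'
ORDER BY old_timestamp
LIMIT 10;                                  -- cap the revisions fetched per call
```

The index hint keeps the optimizer off the namespace-spanning old_title index,
and the LIMIT bounds the worst case for articles with very long histories.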
The planned schema change, if it stores old revisions with the article ID as the
first part of the primary key, would make this far more efficient because
adjacent revisions would sit in the same database pages. It could still retrieve
hundreds of thousands of revisions for a popular article, so some limit will
still be needed even with the new schema. If page2xml is intended as a general
"retrieve an article" call, that limit needs to be very low: tens, not hundreds,
of revisions.
For Wikimedia sites, LIMIT 10 is a good choice at present, assuming the primary
purpose isn't retrieving the full article history.
Version: 1.4.x
Severity: normal