
sql dump schemata - seven tables should have their columns reordered
Closed, Declined (Public)

Description

1) Problem

The SQL dumps show the following fields defined in the wrong position within their tables:
o categorylinks.cl_sortkey_prefix;
o imagelinks.il_from_namespace;
o interwiki.*;
o page.page_links_updated;
o page_restrictions.pr_id;
o pagelinks.pl_from_namespace; and
o templatelinks.tl_from_namespace.

This means that the database schema in the sql dumps (as of 2015-06) does not agree with maintenance/tables.sql (as of 2015-06).
See https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/maintenance%2Ftables.sql.
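
To make the impact concrete, here is a simplified, hypothetical illustration (not the actual schemas): when an INSERT carries no column list, values are matched to columns purely by position, so the same dump line loads differently under the two CREATE TABLE variants.

  -- Dump's layout (hypothetical): the newer column sits mid-table.
  CREATE TABLE t (a INT, x INT, b INT);
  -- tables.sql layout (hypothetical): the same column sits at the end.
  --   CREATE TABLE t (a INT, b INT, x INT);
  -- With no column list, this row lands in the right columns only under
  -- whichever layout the dump was written against:
  INSERT INTO t VALUES (1, 2, 3);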

2) Requested action

Please rewrite the CREATE TABLE commands for these seven tables for future SQL dumps.

Event Timeline

wpmirrordev assigned this task to ArielGlenn.
wpmirrordev raised the priority of this task from to Needs Triage.
wpmirrordev updated the task description.
wpmirrordev subscribed.

FYI: New columns often get added at the end of a table, but on a relational DBMS this does not cause any problems.

Adding @jcrespo to comment on the CREATE TABLE and mysqldump field-ordering issue.

@wpmirrordev Sorry, I do not fully understand "fields defined in the wrong position" (probably it is just me).

Do you mean that the CREATE TABLEs do not match the order in which columns are dumped in the XML exports? The production/replica order? Or something else?

You mention "SQL dumps", but as far as I know we do not offer such a thing, only XML dumps and static pages.

Can you point me to exactly what you are trying to do that doesn't work for you, with a specific example?

In general, those CREATE TABLEs will not be modified, as they represent our internal state. But maybe we can help you in another way?

For @jcrespo:

0) Context

I am the author of https://www.mediawiki.org/wiki/Wp-mirror, which is a utility for building a mirror farm of wikis served locally by MySQL, MediaWiki, and Apache2.
I am working on the next version of WP-MIRROR, which is more broadly conceived as 'WMF in a microcosm', meaning that it will be able to:

o build local mirror of dumps posted by WMF;
o build local mirror farm of wikis; and
o generate and post local dumps (XML, SQL, media, thumb, HTML/ZIM) from the mirror farm.

Hence my interest in dumps and dump architecture.

1) SQL Dumps

The WMF does indeed post SQL dump files: see https://dumps.wikimedia.org/simplewiki/20150901/, and note the files with names like *links.sql.gz. These SQL dump files are needed for reasons of speed. Importing XML dump files with maintenance/importDump.php is two or three orders of magnitude slower than the best alternative, namely: a) convert the XML dumps into SQL dumps (of the page, revision, and text tables) using tools such as mwxml2sql (by @ArielGlenn); b) import those three SQL dumps into MySQL; and c) import the *.sql.gz SQL dumps to fill in the other tables.
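
For concreteness, steps b) and c) might look roughly like this from the mysql client (database and file names are hypothetical; a real run would use the files produced by mwxml2sql and the decompressed *links.sql.gz dumps for the wiki and date in question):

  -- Assumes mwxml2sql has already produced page.sql, revision.sql, text.sql.
  CREATE DATABASE IF NOT EXISTS simplewiki;
  USE simplewiki;
  SOURCE page.sql;        -- b) the three core tables
  SOURCE revision.sql;
  SOURCE text.sql;
  SOURCE simplewiki-20150901-pagelinks.sql;   -- c) one of the *links dumps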

2) Database schema

If you compare the CREATE TABLE command given at the top of each SQL dump file with that of https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/maintenance%2Ftables.sql, you will at once see that they do not always agree. Moreover, each of these database schemata changes over time.

3) The problem of updating

Currently, the simplest way to update a mirror is to DROP TABLE and then CREATE TABLE. For the largest wikis, this means the mirror will be down for several days each month. So the question is: how do we update a mirror without loss of service?

4) Rewriting the SQL dumps

I use a GAWK script to rewrite all the SQL dumps. This script: a) strikes the DROP TABLE and CREATE TABLE commands; b) substitutes each INSERT with REPLACE; and c) substitutes VALUES with (col1, col2, ...) VALUES.

Here 4a) and 4b) mean that the rewritten SQL dumps can be imported without loss of existing records. Moreover, 4c) means that misordered columns and changes to the column order, as mentioned in 2) above, can be handled.
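
For example, a pagelinks line might be rewritten roughly as follows (the column list shown is only an assumption about the table layout at the time; the script derives it from the CREATE TABLE at the top of the dump):

  -- Original dump line: columns implied by position only.
  INSERT INTO `pagelinks` VALUES (10,0,'Earth',0),(11,0,'Moon',0);
  -- Rewritten line: REPLACE overwrites rows that already exist instead of
  -- failing, and the explicit column list lets the server match values by
  -- name rather than by position.
  REPLACE INTO `pagelinks` (pl_from,pl_namespace,pl_title,pl_from_namespace)
      VALUES (10,0,'Earth',0),(11,0,'Moon',0);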

5) Request

What I would really like is: a) for the database schemata to match; and b) for the SQL dump files to use (col1, col2, ...) VALUES rather than bare VALUES.

You cannot assume that the order of columns in a relational database is always the same as on another instance. The relational model gives you column names precisely so that you do not have to rely on internal details like this.

Just parse the column order from the CREATE statement and use that order in the REPLACE you build. Your database will then handle the reordering, which avoids problems with a wrong column order.
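
As an aside, the column order of the target table can also be read directly from the server rather than parsed from text, for example (schema name hypothetical):

  -- Physical column order of pagelinks on this server.
  SELECT COLUMN_NAME
    FROM information_schema.COLUMNS
   WHERE TABLE_SCHEMA = 'simplewiki'
     AND TABLE_NAME = 'pagelinks'
   ORDER BY ORDINAL_POSITION;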

The dump is built to be reloaded *as is* into a database, and in that case you can omit the column list, because the order is the order of the CREATE statement. The dump format is a mysql feature, not something a sysadmin can change.

To avoid downtime, try loading into another schema and then doing a REPLACE over into production (which then handles the reordering); though that may be a bad idea if the wiki is big.
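
A minimal sketch of that approach, assuming a staging schema named staging and the same hypothetical pagelinks columns as above; naming the columns on both sides is what lets the server handle the reordering:

  -- Load the unmodified dump into staging.pagelinks first, then copy across.
  REPLACE INTO simplewiki.pagelinks
          (pl_from, pl_namespace, pl_title, pl_from_namespace)
  SELECT pl_from, pl_namespace, pl_title, pl_from_namespace
    FROM staging.pagelinks;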

I'm going to decline this ticket; realistically, reordering the databases in production so that their CREATE TABLE statements are identical to those in MediaWiki's tables.sql is not going to happen. We can talk about whether the xmlfileutils suite can be expanded to deal with this sort of problem better; I'm open to tickets and/or suggestions about that.