Page MenuHomePhabricator

PostgreSQL schema change for consistency with MySQL
Open, LowPublic

Description

The inconsistencies between the MySQL and PostgreSQL schemas cause a constant maintenance burden. PostgreSQL support is frequently broken in master. The PostgreSQL schema is certainly better than the MySQL one, if analysed in isolation, but the advantages to PostgreSQL users of having an improved schema are more than outweighed by the disadvantages of inconsistency with the remainder of the installed MediaWiki user base.

We should aim to have a schema which can be automatically generated from maintenance/tables.sql, which is already shared between MySQL and SQLite.

Proposal:

  • Require PostgreSQL 9.2 or later. Remove the many special cases from DatabasePostgres supporting earlier versions. It was released in 2012 and is the oldest version which still receives bugfixes. PostgreSQL 9.1 was EOL in September 2016.
  • Rename tables mwuser to user, and pagecontent to text. Table names are always quoted, so it doesn't matter if they are keywords. A table prefix can be used if desired to avoid having table names that are keywords.
  • Remove foreign key constraints. These were a nice idea, but have flow-on effects, causing inconsistencies and run-time errors.
    • Remove the dummy anonymous user
    • Remove the page_deleted trigger
  • Change old_text to BYTEA, for T26607 etc.
  • Change rc_ip to TEXT
  • Remove search update triggers, use SearchPostgres::update() instead.
  • Remove the profiling table.
  • Map AUTO_INCREMENT in the shared schema to PostgreSQL's SERIAL. This is essentially an abbreviated syntax for what the PostgreSQL schema does already. Rename any existing sequences which do not fit the resulting naming scheme.
  • Re-implement DatabasePostgres::insertId() using SELECT lastval(), so that it does not require nextSequenceValue() to be called first.
  • Change page_random to FLOAT(53). Also change it in the shared schema. This is an SQL-standard synonym for the existing column type on MySQL, and is also understood by PostgreSQL to mean the same thing. It is accepted on SQLite (which stores numbers as strings).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 10 2017, 4:17 AM
tstarling triaged this task as Low priority.May 10 2017, 4:18 AM
Anomie added a subscriber: Anomie.May 10 2017, 12:52 PM

I like this. I might help if I have free time, or if this becomes a project that gets MediaWiki-Platform-Team resources.

We should aim to have a schema which can be automatically generated from maintenance/tables.sql

We should have CI require that the generator doesn't error out to make sure people don't accidentally break it, even if we don't test anything else about the output.

Or we could go one better and turn tables.sql into a "MediaWiki DDL" rather than MySQL/SQLite DDL statements (is there already a Phab task about that?). That could be anything from something that looks very much like it does now (but maybe with column types using keywords that are less MySQL-specific) to a PHP array structure or JSON.

Change the type of all timestamp columns, from TIMESTAMPTZ to the nearest equivalent of whatever the MySQL schema uses for the field in question.

I'm not opposing this change, but I like that MediaWiki mostly doesn't assume any particular format for storing timestamp values in the database. So this one makes me sad.

Change rc_ip to TEXT

I note tables.sql has it as varbinary, wouldn't bytea be the match for that?

We should aim to have a schema which can be automatically generated from maintenance/tables.sql

We should have CI require that the generator doesn't error out to make sure people don't accidentally break it, even if we don't test anything else about the output.
Or we could go one better and turn tables.sql into a "MediaWiki DDL" rather than MySQL/SQLite DDL statements (is there already a Phab task about that?). That could be anything from something that looks very much like it does now (but maybe with column types using keywords that are less MySQL-specific) to a PHP array structure or JSON.

Something like what proposes https://www.mediawiki.org/wiki/Requests_for_comment/Abstract_table_definitions ? (conceptually, not that specific implementation)

Change rc_ip to TEXT

I note tables.sql has it as varbinary, wouldn't bytea be the match for that?

bytea requires the caller to call Database::encodeBlob(), which encodes the value using octal escape sequences. So in the resulting SQL, bytea values are double-escaped, with double backslashes followed by octal numbers. And when you SELECT such values, you get them back with escape sequences still in and need to call Database::decodeBlob(). That makes it a not especially near equivalent for varbinary(). In MySQL, varbinary() is just a varchar() with a binary character set, which I chose for storing printable ASCII characters because it is more efficient in index space than varchar().

PG does not have per-column character sets, it only has a per-database character set, and "binary" is not an option. We use UTF-8.

We should probably change all the TEXT columns to the nearest equivalent to whatever MySQL uses. So VARCHAR(40) for rc_ip instead of TEXT. That will avoid a lot of special cases, assuming automatic schema translation.

PG does not have per-column character sets, it only has a per-database character set, and "binary" is not an option.

Actually it does have an 8-bit clean option, it is called (confusingly) SQL_ASCII. Anyway, we don't use it.

I was asked about why the MySQL schema uses CHAR(14) instead of DATETIME. This was a decision made by Lee Daniel Crocker. In Magnus's "phase 2" Wikipedia script, the schema was very similar to MediaWiki's, except that the timestamp fields were TIMESTAMP(14). In MySQL 3.23.x, which was used at the time, this had two major problems:

  1. Timestamps were stored in UTC and converted on storage or retrieval to the server's current timezone. So if clients inserted timestamps into the database assuming one timezone, and then the server timezone changed, the interpretation of all existing data would change. There was no SET TIME_ZONE, this was introduced in MySQL 4.1.3. The output of mysqldump was in the server's local timezone, and so moving the database either with mysqldump or by copying the data directory could corrupt all timestamps unless care was taken to match the timezone.
  2. If an UPDATE query was performed on a table with a TIMESTAMP column, and the column was not explicitly updated, MySQL would automatically update the timestamp to match the current time. Unlike in modern versions of MySQL, this behaviour was not configurable. So constant developer vigilance was required to avoid corrupting existing timestamps. For example, see Phase 2's special_protectpage.php.

Apparently neither of these problems affected the DATETIME type. However, the MySQL 3.23 manual contained no discussion of the timezone conversion issue, so it's possible that Lee assumed both TIMESTAMP and DATETIME were affected. I recall Lee citing "backup problems" as the main reason for using CHAR(14) in later discussions.

In any case, Lee made the decision to migrate from TIMESTAMP(14) to CHAR(14).

Change the type of all timestamp columns, from TIMESTAMPTZ to the nearest equivalent of whatever the MySQL schema uses for the field in question.

I'm not opposing this change, but I like that MediaWiki mostly doesn't assume any particular format for storing timestamp values in the database. So this one makes me sad.

I wonder if, given there is a future refactoring of some key tables, if it would be easy to migrate back to timestamps and keep postgres like that. Normally not worth it, but for tables where most rows are going to be rewritten, if that could be plausible. I see many problems (joins between formats, code assuming year or month manipulation, etc.), but I wonder if it has been considered at some point? The 14 -> 4 bytes saving may be interesting for some very tall tables.

saper added a subscriber: saper.Jun 2 2017, 12:49 AM
saper added a comment.Jun 2 2017, 1:36 AM

Apart from moving to 9.2 - which is a no brainer, seems like a tough choice. I've had impression the authors of the original PostgreSQL port tried to fix the deficiencies of the MySQL schema to some extent, resulting in well-known incompatibilities.

This change basically dumbs down PostgreSQL to the MySQL level, and leaves it there :)

  1. This will solve some most obvious compatibility problems, but will leave more nuanced one unaffected:
  • I was surprised to see how many dirty hack the API code might potentially have - judging by a quick git grep https://discourse.wmflabs.org/t/mysql-specific-hacks-in-the-api-code/169
  • I also remember having problems with different approaches to transaction handling
  • Unit tests suffer due to differences in schema handling (how to duplicate the schema quickly).
  1. It would be good to check if proposed changes will make core unit tests work on PostgreSQL (and easy to maintain for both)
  1. Trying some actual examples of extensions that might start working. Removing a foreign key requirement on some tables might ease porting (issues like https://phabricator.wikimedia.org/T59747). But problems like https://phabricator.wikimedia.org/T42757 are rooted in the bad design of the extensions themselves.
  1. Good to check if the migration will work, also on older databases (it was considerable effort to keep the updater doing its job).
  1. I find PostgresUpdater superior to how MySQL updates the schema, will we integrate those two, too?
  1. The most important: it would be check how that affects the query planner and performance of PostgreSQL on a non-trivial wiki.
  1. I would love to be able to use prepared statements (maybe https://phabricator.wikimedia.org/T37043 could be fixed by this nicely?)

Some MySQL schema design decisions go back to the ancient versions; maybe instead of dumbing down PostgreSQL we could make MySQL smarter a bit?

This change basically dumbs down PostgreSQL to the MySQL level, and leaves it there :)

That is how I'm describing it, yes. This is the best way to handle DBMS differences, considering that we're not going to drop MySQL support.

  1. This will solve some most obvious compatibility problems, but will leave more nuanced one unaffected:

The point of this change is to reduce the ongoing cost to developers of supporting PostgreSQL. I'm not trying to fix all the bugs.

  1. It would be good to check if proposed changes will make core unit tests work on PostgreSQL (and easy to maintain for both)

Maybe, if it is necessary to test the changes. Otherwise, it is a separate project.

  1. I find PostgresUpdater superior to how MySQL updates the schema, will we integrate those two, too?

Ultimately they should be integrated, one way or another. I can see the advantages of having a more structured way to represent schema updates, as opposed to SQL files. The point of the SQL files is to give DBAs some flexibility in how they execute them. As far as I know, update.php or similar has never been run on a wiki in the main cluster at WMF. But we could answer that use case by having MW generate an SQL file.

  1. I would love to be able to use prepared statements (maybe https://phabricator.wikimedia.org/T37043 could be fixed by this nicely?)

T37043 can be fixed with about 5 minutes' work. Just cherry-pick the SearchPostgres.php change into master so that it's not tied to that broken attempt at unit testing. It's already been tested manually and is a clear improvement on the existing code. I'd be happy to give it +2.

Using prepared statements would not really be an advantage for that bug, since you would still have to manually escape the search query string before binding it. It ends up double-escaped in the SQL.

Prepared statements in MySQL require an additional network round trip, so I don't think they're an ideal way to do quoting in ordinary circumstances.

Some MySQL schema design decisions go back to the ancient versions; maybe instead of dumbing down PostgreSQL we could make MySQL smarter a bit?

Changing the MySQL schema is extremely complicated, as we are finding with T161671. We can change the PostgreSQL schema fairly easily because the wikis that use PostgreSQL are relatively small.

daniel moved this task from Inbox to Under discussion on the TechCom-RFC board.Jun 7 2017, 7:54 PM
Catrope moved this task from Under discussion to Last Call on the TechCom-RFC board.EditedAug 16 2017, 8:40 PM
Catrope added a subscriber: Catrope.

As per the TechCom (FKA ArchCom) meeting on August 16th, this RFC is entering the Last Call period. Should no pertinent objections remain unaddressed by August 30th, the RFC will be approved for implementation.

binbot added a subscriber: binbot.Aug 17 2017, 5:30 AM

Not sure how I missed this task until now. I'm mostly okay with this, but the timestamp change in particular seems a step in the wrong direction.

Rename tables mwuser to user, and pagecontent to text. Table names are always quoted, so
it doesn't matter if they are keywords. A table prefix can be used if desired to avoid having
table names that are keywords.

It does matter as far as every other thing that is not direct MW code (e.g. SQL files, accessing via psql or other interface). So +1 for avoiding keywords even if that means another table prefix (can someone expand on what exactly that would mean and where it would be set?)

demon added a comment.Aug 17 2017, 4:45 PM

Rename tables mwuser to user, and pagecontent to text. Table names are always quoted, so
it doesn't matter if they are keywords. A table prefix can be used if desired to avoid having
table names that are keywords.

It does matter as far as every other thing that is not direct MW code (e.g. SQL files, accessing via psql or other interface). So +1 for avoiding keywords even if that means another table prefix (can someone expand on what exactly that would mean and where it would be set?)

You mean $wgDBprefix that can be set in the installer and LocalSettings.php?

Personally, I can't say I'd worry too much about keyword table names. In hand-crafted SQL files and psql it's not hard to put quotes around something. If you're using some other interface that does the sql for you, if it's not quoting things in the SQL it generates it's probably not that great an application.

Or you can set $wgDBprefix for your install to prefix all table names. I don't know if there's an easy way to change prefixes for an existing database though.

demon added a comment.Aug 17 2017, 5:03 PM

T37043 can be fixed with about 5 minutes' work. Just cherry-pick the SearchPostgres.php change into master so that it's not tied to that broken attempt at unit testing. It's already been tested manually and is a clear improvement on the existing code. I'd be happy to give it +2.

Done by dropping the test part of the change. It took all of like 10 seconds to do that.

Not sure how I missed this task until now. I'm mostly okay with this, but the timestamp change in particular seems a step in the wrong direction.

We've had several negative comments about the timestamp change. I don't think it's central to this RFC, so I'm happy to split it out and leave it for discussion at a later date. I proposed in the Technical Committee meeting today that we approve this RFC as so amended, and that motion was passed.

tstarling updated the task description. (Show Details)Aug 30 2017, 8:39 PM

Change 421972 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Bump required Postgres version to 9.2

https://gerrit.wikimedia.org/r/421972

Change 421972 merged by jenkins-bot:
[mediawiki/core@master] Bump required Postgres version to 9.2

https://gerrit.wikimedia.org/r/421972

Anomie added a comment.Apr 2 2018, 6:21 PM

Or we could go one better and turn tables.sql into a "MediaWiki DDL" rather than MySQL/SQLite DDL statements (is there already a Phab task about that?). That could be anything from something that looks very much like it does now (but maybe with column types using keywords that are less MySQL-specific) to a PHP array structure or JSON.

I wrote a task for that: T191231: RFC: Abstract schemas and schema changes

lilydjwg removed a subscriber: lilydjwg.Oct 2 2018, 5:38 AM