
RFC: Abstract schemas and schema changes
Open, Medium, Public

Description

At some point this should be a TechCom-RFC, but at the moment it's still in the drafting stage.

Problem

MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle Database. For normal runtime queries, we have abstractions that make all of these work reasonably well.

But at the DDL level it's a completely different story. One major piece (and source) of technical debt is the fact that MediaWiki does not have a database schema; it has four. And most schema changes have to be written five times, once for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or, when not omitted, are often merged without being tested.

We can improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements.

Approved solution

Create a PHP interface (or base class) for schemas and schema changes. We implement this on top of Doctrine DBAL. Schemas and schema changes will be defined in JSON files conforming to https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema, which will be read into Schema or SchemaDiff objects. DBAL will then take care of generating SQL for any supported RDBMS.
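
To illustrate, a table definition in one of these JSON files would look roughly like this (a sketch based on the on-wiki proposal; the exact field names are whatever the final format specifies):

```
{
	"name": "actor",
	"columns": [
		{ "name": "actor_id", "type": "bigint", "options": { "autoincrement": true, "unsigned": true } },
		{ "name": "actor_user", "type": "integer", "options": { "unsigned": true, "notnull": false } },
		{ "name": "actor_name", "type": "binary", "options": { "length": 255 } }
	],
	"indexes": [
		{ "name": "actor_user", "columns": [ "actor_user" ], "unique": true },
		{ "name": "actor_name", "columns": [ "actor_name" ], "unique": true }
	],
	"pk": [ "actor_id" ]
}
```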

In addition, database support (in the form of a subclass of the Database base class) should be made pluggable. Extensions that want to support a database backend would supply a Database subclass as well as a suitable implementation of the schema and schema change interface. This would be trivial for any database that DBAL supports.
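
As a rough sketch of the DBAL side (the schema object is hand-written here for brevity; in MediaWiki it would be populated from the JSON definition above, and the platform class names vary slightly between DBAL releases):

```
use Doctrine\DBAL\Platforms\MySQLPlatform;
use Doctrine\DBAL\Platforms\PostgreSQLPlatform;
use Doctrine\DBAL\Platforms\SqlitePlatform;
use Doctrine\DBAL\Schema\Schema;

// One abstract schema object...
$schema = new Schema();
$table = $schema->createTable( 'actor' );
$table->addColumn( 'actor_id', 'bigint', [ 'autoincrement' => true, 'unsigned' => true ] );
$table->addColumn( 'actor_name', 'binary', [ 'length' => 255 ] );
$table->setPrimaryKey( [ 'actor_id' ] );
$table->addUniqueIndex( [ 'actor_name' ], 'actor_name' );

// ...rendered by DBAL as DDL for each supported RDBMS.
foreach ( [ new MySQLPlatform(), new SqlitePlatform(), new PostgreSQLPlatform() ] as $platform ) {
	foreach ( $schema->toSql( $platform ) as $statement ) {
		echo $statement, ";\n";
	}
}
```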

Notes:

  • This means we drop support for the MSSQL and Oracle backends from MediaWiki core, since DBAL support for them is insufficient and/or their schemas have diverged from the mainline schema. WMF will not continue to support these database backends. Volunteers have shown interest in bringing back support for these backends in the form of extensions.
  • For schema definitions, we go with JSON for now, but we may want to switch to YAML later for easier editing. JSON and YAML can easily be converted into one another.
  • If someone wants to introduce a schema change, there should be a new deployable file, which can contain several schema changes; existing schema change (JSON) files should not be modified to perform additional changes. One possible shape for such a file is sketched after this list.
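
A sketch of such a deployable change file, expressed as a before/after pair of table definitions that could be diffed into a SchemaDiff (the added column and the "mwtimestamp" type here are illustrative, not the actual format):

```
{
	"comment": "Add actor_registered to the actor table",
	"before": {
		"name": "actor",
		"columns": [
			{ "name": "actor_id", "type": "bigint", "options": { "autoincrement": true, "unsigned": true } }
		]
	},
	"after": {
		"name": "actor",
		"columns": [
			{ "name": "actor_id", "type": "bigint", "options": { "autoincrement": true, "unsigned": true } },
			{ "name": "actor_registered", "type": "mwtimestamp", "options": { "notnull": false } }
		]
	}
}
```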

Old proposals

Proposal #1

We should write a schema and schema change abstraction layer to integrate with MediaWiki's existing runtime database abstraction. Details are on-wiki at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements, but in short:

  • We would have one schema, expressed as a structure in a JSON file. We would have one definition of each schema change, expressed as a structure in a JSON file.
  • Database-specific classes would exist to turn the schema or schema-change into SQL statements, much as we have database-specific subclasses of Wikimedia\Rdbms\Database.
  • We'd also tighten up some of the other database-level things: limited identifier lengths, index name uniqueness, data type consistency, charset consistency, etc.

The reason we didn't go with this:

  • It's a lot of work to write a schema and schema change abstraction from scratch.

Proposal #2

Try to integrate Doctrine Migrations for schema creation and updates.

Pros (compared to Proposal #1):

  • We wouldn't have to implement all the database-specific logic ourselves.
  • Probably a larger community fixing any bugs that exist.
  • Familiar system for (some subset of) PHP developers, simplifying onboarding of devs.

The reasons we didn't go with this:

  • We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.
  • We may have to custom-implement a "mwtimestamp" type (see the sketch after this list), or else standardize all DBs on using 14-byte strings.
  • We may still have to work around issues like MSSQL's different treatment of NULLs in unique indexes.
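
For what it's worth, custom-implementing such a type on top of DBAL is not much code. A minimal sketch, assuming Doctrine's custom-type API (the BINARY(14) mapping is an arbitrary MySQL-flavoured choice; a real implementation would branch per platform):

```
use Doctrine\DBAL\Platforms\AbstractPlatform;
use Doctrine\DBAL\Types\Type;

/**
 * Sketch of a custom DBAL type for MediaWiki's 14-character
 * YYYYMMDDHHMMSS timestamps.
 */
class MWTimestampType extends Type {
	public const NAME = 'mwtimestamp';

	public function getSQLDeclaration( array $column, AbstractPlatform $platform ): string {
		// A real implementation would pick a per-platform column type
		// (e.g. TIMESTAMPTZ on PostgreSQL) instead of hardcoding this.
		return 'BINARY(14)';
	}

	public function getName(): string {
		return self::NAME;
	}
}

// Registered once during setup:
Type::addType( MWTimestampType::NAME, MWTimestampType::class );
```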


Event Timeline

There are a very large number of changes, so older changes are hidden.

Note that when closing T194125 (the charset issue), this new task still does not address the backend conformance or emulation level we'd need to check (possibly with additional extensions to load in the backend SQL server). Each backend must still be able to pass a conformance test or, failing that, enable an emulation layer that allows compatibility or background migration of schemas (without a server upgrade immediately shutting the wiki down for hours... or days).

When upgrading the schema, an upward-compatibility test should be run, in order to cancel the upgrade and avoid breaking the existing server, until a compatibility layer is implemented or a background migration tool is developed that allows the wiki to be restarted (even if it temporarily performs more slowly while the migration is ongoing; this may mean renaming some existing tables while populating a newer table and rebuilding the necessary indexes).

Triggers and constraints must be adapted as well. But I don't really see why MySQL/MariaDB is so different from Sybase/MSSQL; after all, Sybase has done the work to allow integrating MySQL into a farm of SQL servers, with MySQL (and MSSQL or ESE) working through adapters and MySQL/MariaDB being seen as a lower-end SQL engine. Oracle has likewise integrated a connector for MySQL into its enterprise-level Oracle SQL engines. PostgreSQL is the most versatile engine. Many things can be adapted to Oracle or Sybase/MSSQL using a set of triggers and constraints.

Years ago I worked on a large enterprise project that started on Sybase and was then adapted to Informix, IBM DB2, MS SQL, MySQL, and Oracle without having to rewrite the application, which depended on a very large set of stored procedures. Even that large set of stored procedures was automatically converted to equivalents in Oracle, Informix, DB2, and MSSQL. The DDL was also converted, and the application contained a self-written adapter that dynamically rewrote the queries (including the most complex ones using outer joins, custom subselections, temporary tables, and multilevel transactions synchronized between local databases and a remote central database). Despite the emulation used on Oracle and Informix, they performed faster than their equivalents on Sybase. The key was using stored procedures/functions, allowing specific tuning so that each engine could run them natively without transferring much data between the server and a client not necessarily running on the same host: joins had to be resolved locally, possibly using the local procedural language (Transact-SQL converted to PL/SQL, initially automatically by a build tool, then tuned manually). The applications then just called procedures with simple parameters to get a result set, and the adaptation layer used a transactional system controlled by the application but built on each engine's specific transaction support.

All data types were converted (including support for textual and binary blobs), and all this allowed the client to work over slow remote connections. All the applications were interactive, and the central server managed hundreds of simultaneous users across a large region or a whole country, from dozens of branches. At that time it was not a web application; the constraints were much more severe than for MediaWiki, which just has to work over local networks with everything behind a front webserver.

But it's very surprising that you can't handle the special case of the behavior of unique indexes on nullable columns: this is probably a symptom of bad conceptual design in the schema (you may have used NULL as a unique value when it should never be one; NULL is better reserved for 0-to-1 or 0-to-N outer joins or optional attributes, which normally carry no "unique" constraints; if you use 0-to-N, you need computed columns or aggregates, or an additional intermediate relation table in your schema).

Change 595240 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Start wiring automatically generated sql schemas to installation

https://gerrit.wikimedia.org/r/595240

I found a PHP-based formatter of SQL code; it may come in handy to format automatically generated SQL code: https://github.com/jdorn/sql-formatter (haven't tested it).

Apparently doctrine includes its own one (but wasn't used), so that will be used instead.

Well, they are one and the same, as per https://github.com/doctrine/sql-formatter#history :)

This package is a fork from https://github.com/jdorn/sql-formatter
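
For reference, a minimal usage sketch of the doctrine/sql-formatter fork for plain-text output (untested here; the NullHighlighter is what skips ANSI/HTML colouring):

```
use Doctrine\SqlFormatter\NullHighlighter;
use Doctrine\SqlFormatter\SqlFormatter;

$formatter = new SqlFormatter( new NullHighlighter() );
echo $formatter->format( 'CREATE TABLE actor (actor_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL)' );
```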

Change 595269 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Add doctrine/sql-formatter to pretty print generated SQL files

https://gerrit.wikimedia.org/r/595269

Change 595269 merged by jenkins-bot:
[mediawiki/core@master] Add doctrine/sql-formatter to pretty print generated SQL files

https://gerrit.wikimedia.org/r/595269

Change 595240 merged by jenkins-bot:
[mediawiki/core@master] Wire empty abstract schema into installer

https://gerrit.wikimedia.org/r/595240

@Ladsgroup I'm unsure if this was already discussed or implemented, but perhaps we could add a PHPUnit test to ensure that the contents of the tables-generated.sql files match the output generated by generateSchemaSql.php. Something along the lines of AutoLoaderStructureTest. What do you think?

I have been thinking about it, but I keep forgetting to file a ticket or implement it (my unconscious is pretty lazy). I'll do it now.
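
A minimal sketch of the idea, in the spirit of AutoLoaderStructureTest (all paths are hypothetical, and the inline rebuild stands in for whatever generateSchemaSql.php actually does internally):

```
use Doctrine\DBAL\Platforms\MySQLPlatform;
use Doctrine\DBAL\Schema\Schema;
use PHPUnit\Framework\TestCase;

class GeneratedSqlStructureTest extends TestCase {
	public function testGeneratedSqlMatchesAbstractSchema(): void {
		// Hypothetical locations of the committed output and the abstract schema.
		$committed = file_get_contents( __DIR__ . '/../../sql/tables-generated.sql' );
		$tables = json_decode( file_get_contents( __DIR__ . '/../../sql/tables.json' ), true );

		// Rebuild the SQL the same way the generator does (sketched inline).
		$schema = new Schema();
		foreach ( $tables as $tableDef ) {
			$table = $schema->createTable( $tableDef['name'] );
			foreach ( $tableDef['columns'] as $col ) {
				$table->addColumn( $col['name'], $col['type'], $col['options'] ?? [] );
			}
		}
		$fresh = implode( ";\n", $schema->toSql( new MySQLPlatform() ) ) . ";\n";

		$this->assertSame( $committed, $fresh,
			'tables-generated.sql is stale; re-run maintenance/generateSchemaSql.php' );
	}
}
```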

Change 634679 had a related patch set uploaded (by Ammarpad; owner: Ammarpad):
[mediawiki/core@master] Vary timestamp default value in abstract schema

https://gerrit.wikimedia.org/r/634679

Change 634748 had a related patch set uploaded (by Ammarpad; owner: Ammarpad):
[mediawiki/core@master] Expand DoctrineSchemaBuilderTest

https://gerrit.wikimedia.org/r/634748

Change 634679 merged by jenkins-bot:
[mediawiki/core@master] Vary timestamp default value per platform in abstract schema

https://gerrit.wikimedia.org/r/634679

Change 634748 merged by jenkins-bot:
[mediawiki/core@master] Expand DoctrineSchemaBuilderTest

https://gerrit.wikimedia.org/r/634748

Change 636235 had a related patch set uploaded (by Ammarpad; owner: Ammarpad):
[mediawiki/core@master] Fix ip_changes.ipc_rev_id column default value

https://gerrit.wikimedia.org/r/636235

Change 636235 merged by jenkins-bot:
[mediawiki/core@master] Fix ip_changes.ipc_rev_id column default value

https://gerrit.wikimedia.org/r/636235