
RFC: Abstract schemas and schema changes
Open, Needs Triage, Public

Description

At some point this should be a TechCom-RFC, but at the moment it's still in the drafting stage.

Problem

MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle Database. For normal runtime-type queries, we have abstractions that make these all mostly work pretty well.

But at the DDL level it's a completely different story. One major piece of (and source of) technical debt is the fact that MediaWiki does not have a database schema; it has four. And most schema changes have to be written five times, one for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or when not omitted are often merged without being tested.

We can improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements.

Proposal #1

We should write a schema and schema change abstraction layer to integrate with MediaWiki's existing runtime database abstraction. Details are on-wiki at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements, but in short:

  • We would have one schema, expressed as a structure in a JSON file. We would have one definition of each schema change, expressed as a structure in a JSON file.
  • Database-specific classes would exist to turn the schema or schema-change into SQL statements, much as we have database-specific subclasses of Wikimedia\Rdbms\Database.
  • We'd also tighten up some of the other database-level things: limited identifier lengths, index name uniqueness, data type consistency, charset consistency, etc.
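
To give a feel for it (a rough sketch only; the actual format is specified on the wiki pages above, and every key name here is hypothetical), the JSON for a single table might decode into a PHP structure like this:

$schema = [
	'tables' => [
		[
			'name' => 'user_groups',
			'columns' => [
				// 'type' names would be mapped to database-specific types per backend
				[ 'name' => 'ug_user', 'type' => 'integer', 'options' => [ 'unsigned' => true, 'notnull' => true ] ],
				[ 'name' => 'ug_group', 'type' => 'binary', 'options' => [ 'length' => 255, 'notnull' => true ] ],
				[ 'name' => 'ug_expiry', 'type' => 'mwtimestamp', 'options' => [ 'notnull' => false ] ],
			],
			'indexes' => [
				[ 'name' => 'ug_expiry', 'columns' => [ 'ug_expiry' ] ],
			],
			'pk' => [ 'ug_user', 'ug_group' ],
		],
	],
];

A database-specific class would then walk this structure and emit the corresponding CREATE TABLE and CREATE INDEX statements for its backend.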

Proposal #2

Try to integrate Doctrine Migrations for schema creation and updates.

Pros (compared to Proposal #1):

  • We wouldn't have to implement all the database-specific logic ourselves.
  • There would probably be a larger community fixing any bugs that exist.
  • Familiar system for (some subset of) PHP developers, simplifying onboarding of devs.

Cons:

  • We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.
  • We may have to custom-implement a "mwtimestamp" type, or else standardize all DBs on using 14-byte strings.
  • We may still have to work around issues like MSSQL's different treatment of NULLs in unique indexes.

Event Timeline

Krinkle edited projects, added TechCom-RFC; removed Proposal. Jul 30 2018, 9:38 PM
@Anomie wrote

At some point this should be a TechCom-RFC, but at the moment it's still in the drafting stage.

I've made it an "RFC" now for discovery purposes, but remaining in the backlog per your statement of it being in a drafting stage. Once ready for wider input, move it to either "Inbox" or "Request IRC meeting".

Krinkle moved this task from Inbox to Backlog on the TechCom-RFC board. Aug 1 2018, 8:41 PM
Ladsgroup added a subscriber: Ladsgroup.

There is also this RFC related to abstract schema definitions. Let's call it Proposal #3 (even though the page has been around since 2012): https://www.mediawiki.org/wiki/Requests_for_comment/Abstract_table_definitions

We would have one schema, expressed as a structure in a JSON file.

FWIW I think using JSON for this is a terrible idea. It is not a good format for this kind of data and will add a lot of unnecessary cruft into something that could otherwise be done fairly simply.

saper added a subscriber: saper. Aug 5 2018, 7:56 PM

Are we really so nicely abstract on the DML level? My impression was that this is a MySQL app and other databases apply workarounds to get rid of MySQL-specific things sprinkled around the code.

Are we really so nicely abstract on the DML level? My impression was that this is a MySQL app and other databases apply workarounds to get rid of MySQL-specific things sprinkled around the code.

The question is really: do we want to be so nicely abstract on the DML level? If so, then everything else is a problem to be overcome. However, you're right - there might be some big problems lurking under the surface and they should probably be looked at and considered early on.

There are therefore probably a couple of prerequisites for this ticket:

  1. Perform a gap analysis to work out how far away the current code-base is from being reasonably abstractable.
  2. Assuming this doesn't blow the idea out of the water, close the gap by updating existing code where necessary. This doesn't require any kind of abstraction to be implemented (though the design should probably happen in parallel, so that the migration travels in the correct direction); it just involves migrating the current code to replace any MySQL-specific elements with something that can be generalised.

In my view, there's no point worrying about how to implement the abstraction without performing those first two steps, as before those have been done we won't know if an abstraction is even possible!

TheDJ added a subscriber: TheDJ. Aug 24 2018, 12:09 PM
TheDJ updated the task description. Aug 24 2018, 12:15 PM
Anomie updated the task description. Aug 24 2018, 2:05 PM

There is also this RFC related to abstract schema definitions. Let's call it Proposal #3 (even though the page has been around since 2012): https://www.mediawiki.org/wiki/Requests_for_comment/Abstract_table_definitions

Let's not have to write a parser for a custom format. Other than that, there doesn't seem to be anything there that isn't already in Proposal #1.

FWIW I think using JSON for this is a terrible idea. It is not a good format for this kind of data and will add a lot of unnecessary cruft into something that could otherwise be done fairly simply.

What do you think would be a good format? As I see it, no matter what, we're going to have to have a data structure in memory, and it seems reasonable to represent it with the same structure on disk. MediaWiki already contains code to load JSON.

Are we really so nicely abstract on the DML level?

For the most part, yes we are, at least in MediaWiki core. We run unit tests against SQLite and PostgreSQL on merge, at least for core. Licensing issues prevent us from testing MSSQL or Oracle.

Our DML classes do have a lot of ways to specify things to make MySQL happier that are ignored by the other database backends, such as index forcing.
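
For instance, a caller can hint an index through the existing runtime wrapper; this is only a sketch (the table and index here are illustrative), and the 'USE INDEX' option is simply ignored by backends that don't understand it:

$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
	'revision',
	[ 'rev_id', 'rev_timestamp' ],
	[ 'rev_page' => $pageId ],
	__METHOD__,
	// MySQL-specific index hint; other Database subclasses ignore it
	[ 'USE INDEX' => 'page_timestamp', 'ORDER BY' => 'rev_timestamp DESC' ]
);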

There is also this RFC related to abstract schema definitions. Let's call it Proposal #3 (even though the page has been around since 2012): https://www.mediawiki.org/wiki/Requests_for_comment/Abstract_table_definitions

Let's not have to write a parser for a custom format. Other than that, there doesn't seem to be anything there that isn't already in Proposal #1.

I was primarily including it for completeness. I haven't compared the two propositions in detail.

That said, a discussion of functionality is probably a better place to start than by bikeshedding the syntax. What 'commands' do we need the abstraction to support? The complexity of the answer to that question should be a good guide when deciding how best to represent it - some formats are terrible for complex data, whereas others are overkill for simple data. I wouldn't dismiss a custom format outright at this stage.

FWIW I think using JSON for this is a terrible idea. It is not a good format for this kind of data and will add a lot of unnecessary cruft into something that could otherwise be done fairly simply.

What do you think would be a good format?

Something that allows inline comments.

Licensing issues prevent us from testing MSSQL or Oracle.

I don't know where you get your info, but for Oracle you could:

  • use XE, which is totally free and has all the features; it's only limited by the hardware it will use and the maximum amount of stored data
  • use any SE or EE edition, which is also free for development purposes on a single machine/instance/user installation, so you could use this to run repo changes against an Oracle DB

I'm not sure, but I think there are similar options for MSSQL.

While there are "gratis"[1] developer versions of both Oracle and MSSQL that could likely be legally used in a CI context, software run in Wikimedia's CI infrastructure must be under an Open Source license.[2][3]

I don't know who might be able to make an exception to that policy or what would be required to convince them to do so. Following up on that would be outside the scope of this RFC.

[1]: a.k.a. free-as-in-beer, versus free-as-in-speech.
[2]: https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use#What_uses_of_Cloud_Services_do_we_not_like?
[3]: The edit summary of https://www.mediawiki.org/w/index.php?diff=2750812&oldid=2750666.

I don't know who might be able to make an exception to that policy or what would be required to convince them to do so. Following up on that would be outside the scope of this RFC.

https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use#What_uses_of_Cloud_Services_do_we_not_like?
(...)
3 - Proprietary software: Do not use or install any software unless the software is licensed under an Open Source license.
(...)
Exceptions to this list of prohibited uses can be requested for limited circumstances, such as testing compatibility with proprietary software. To request an exception, please submit a proposal to the Cloud Services administrators.

After reading this, it looks like an email to "the Cloud Services administrators" is a good starting point.

While there are "gratis"[1] developer versions of both Oracle and MSSQL that could likely be legally used in a CI context, software run in Wikimedia's CI infrastructure must be under an Open Source license.[2][3]

If your sole obstacle is that you are unable to install Oracle due to your internal guidelines regarding the openness of the software used, I could have a talk with the other co-owners of my company, and if they agree I could open a path to a DB on one of our testing rigs that you could use remotely.

True, it won't be fast, it might be broken at times, and even then I can't guarantee 99.9% uptime, as we tend to bounce weird hacky ideas against those rigs... but I think it would do for validating code changes.

Well, it's just an idea.

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network. @Skizzerz had some ideas about doing something for MSSQL in travis-ci post-merge, I'm not sure how much progress he made.

I do agree with Anomie that it's out of scope for this RfC, maybe we need another ticket for it?

Regarding the cons of the second proposal:

We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.

Is there a way to use Doctrine only for producing the schema change and schema installer SQL strings and leave it to MediaWiki to handle connections and stuff?

Ladsgroup added a comment. Edited Jan 21 2019, 9:37 AM

I also want to mention this method: https://github.com/wikimedia/mediawiki/blob/c477bcf2c5c482d3189ec3579c5dee444eb06f7d/includes/libs/rdbms/database/DatabaseSqlite.php#L898
That's not the way we should handle things...

Plus, the current situation is very prone to errors. For example, this patch that went unnoticed for four releases and more than a year was adding the right column to the wrong table:
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/328377/22/maintenance/mssql/archives/patch-user_groups-ug_expiry.sql#4

Regarding the cons of the second proposal:

We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.

Is there a way to use Doctrine only for producing the schema change and schema installer SQL strings and leave it to MediaWiki to handle connections and stuff?

For creating an initial schema, that could probably work, assuming Doctrine supports it. But a schema change is two-way: you usually have to check the current state before knowing what changes need to be made. That's particularly true for SQLite, where most changes require creating a whole new table, copying the data from the old table, and then dropping the old one and renaming the new one into place.
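
For illustration, a simple column change in SQLite usually boils down to a sequence like this (a sketch only; the table, columns, and new definition are made up):

// Build a new table with the desired definition, move the data, then swap it in
$dbw->query( 'CREATE TABLE foo_tmp ( foo_id INTEGER NOT NULL PRIMARY KEY, foo_name BLOB NOT NULL )', __METHOD__ );
$dbw->query( 'INSERT INTO foo_tmp ( foo_id, foo_name ) SELECT foo_id, foo_name FROM foo', __METHOD__ );
$dbw->query( 'DROP TABLE foo', __METHOD__ );
$dbw->query( 'ALTER TABLE foo_tmp RENAME TO foo', __METHOD__ );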

Then, too, we might still be left with datatype differences to deal with, e.g. where Doctrine wants to use varchar while MediaWiki wants varbinary or something like that.

Agreed. That's a bunch of string-replacement hacks so we don't need to have maintenance/sqlite/tables.sql in addition to the other four tables.sql files.

mobrovac added a subscriber: mobrovac. Edited Feb 5 2019, 6:50 PM

Personally, option (1) seems like a clear winner here. If done correctly, it would provide the correct abstraction layer around relational data, which would allow a way to provide support for arbitrary RDBMS systems. Moreover, in the context of making cleaner and clearer interfaces in mw-core, going with option (1) would do us a lot of good in terms of reassessing internal interfaces and those exposed to extensions and to external entities.

(/me puts his TechCom hat on.) @Anomie do you think this needs more discussion here or would you like to request an IRC RFC meeting? Most of the discussion here seems to have been focused on proprietary software and our CI and licences.

Yeah, I think it's about time to have an actual RFC meeting. Although, assuming it's approved, it'll probably be a quarter or so before I have time to actually start working on it.

kchapman added a subscriber: kchapman.

TechCom is hosting an IRC meeting next week, March 6, 2pm PST (22:00 UTC, 23:00 CET), in #wikimedia-office.

daniel added a subscriber: daniel. Mar 4 2019, 1:56 PM

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

HappyDog added a comment. Edited Mar 4 2019, 2:53 PM

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

I agree - seems a bit of a double standard!

Also, it's a bit of an odd approach to say we have the code and we have the tests, but we're not going to run the tests.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

Correct me if I'm wrong, but it seems the reason for this is that our current system is not extensible for other types of databases, so either we support a database in core or we can't support it at all (even through an extension). This is one of the things this RFC can look into: moving MSSQL and other proprietary databases out to extensions and leaving them to the people who want to maintain them.

Anomie added a comment. Mar 4 2019, 4:50 PM

Correct me if I'm wrong, but it seems the reason for this is that our current system is not extensible for other types of databases, so either we support a database in core or we can't support it at all (even through an extension). This is one of the things this RFC can look into: moving MSSQL and other proprietary databases out to extensions and leaving them to the people who want to maintain them.

You're right, at least from a practical standpoint.

Technically you could probably have an extension that adds support for some other database now: the user would load your extension in LocalSettings.php and configure $wgDBtype, $wgLBFactoryConf, and so on to use your database classes rather than one of the built-in ones. But it would be a lot of work to actually get it started and even more to keep up to date:

  • You'd have to have your own version of tables.sql in your extension, or else you'd need to write code to parse core's maintenance/tables.sql (or maintenance/postgres/tables.sql, maintenance/mssql/tables.sql, or maintenance/oracle/tables.sql) and turn it into something that works for your new DB.
  • Then you'd need to keep up with every core schema change (or somehow get the changes from the DatabaseUpdater subclass, including parsing the patch files in maintenance/archives/).
  • Also, to work with extensions, you'd need to duplicate or similarly parse the DB changes they inject with the LoadExtensionSchemaUpdates hooks.

What would probably actually happen is that you'd keep up with it for a short time, then lose interest (or stop having that as part of your job at a third party) and abandon the extension. And likely no one else would bother to pick it up.

If this RFC is implemented, it would be a lot more straightforward: you'd implement an interface with methods like "tableExists()", "createTable()", "addColumn()", and so on for specific schema-change operations and be done. The rest of the schema creation and schema change handling would be done for you by core code calling that interface's methods.
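
As a very rough sketch of what that could look like (the method names follow the ones above; the interface name, signatures, and parameter shapes are all hypothetical):

interface SchemaOperations {
	/** Check whether a table already exists on this wiki's database. */
	public function tableExists( $table );

	/** Create a table from its abstract (JSON-derived) definition. */
	public function createTable( $table, array $definition );

	/** Add a column to an existing table. */
	public function addColumn( $table, $column, array $definition );

	/** Drop an index, e.g. before it is recreated with a new definition. */
	public function dropIndex( $table, $index );
}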

I agree with @Anomie ...

Moving a proprietary DB type to an extension and making the extension maintainer keep up with the changes makes it essentially unusable. Even when I was actively working on Oracle support I had trouble keeping up with the changes, and in many cases with trying to convince the core DB devs that some changes were not even applicable to my back-end.

Honestly, unless you standardize through abstraction not only the DDL, but also ALL of the DQL and DML, through an abstraction layer without ANY hard-coded SQL, it's a losing battle for anyone who picks it up. I doubt that anyone will be able to pull such an abstraction off anyway. There were many attempts, and it always came down to either sacrificing WMF-branch performance to maintain the abstraction or a heap of exceptions which never played well for others. I'd say it's a waste of time to even consider it.

So while it was fun working on this project, I lost track of the changes more than a year ago, and I think it's in the best interest of everyone, if you do not plan to keep it in core and offer active support and maintenance from the core dev side, to just draw a line and drop the proprietary DB code from the repos.

daniel added a comment. Mar 5 2019, 8:16 AM

So while it was fun working on this project, I lost track of the changes more than a year ago, and I think it's in the best interest of everyone, if you do not plan to keep it in core and offer active support and maintenance from the core dev side, to just draw a line and drop the proprietary DB code from the repos.

Indeed. If we don't have quality assurance for it, we should not ship it. If we don't do proprietary DBs in CI, we should not have them in core. I'm in favor of dropping support for them; it was never complete anyway.

The abstraction layer would still be useful to remove the need of manually maintaining patches for postgres and sqlite.

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

Are you proposing to drop support for all calls to wfIsWindows() for Windows compatibility or any macOS workarounds? What about the few hacks that are there for proprietary webservers? Or the compatibility JavaScript we have for proprietary web browsers? None of those run through our CI.

I'm always one of the first ones to advocate for free software, but taking the philosophy of dropping support for all non-free software in core seems not well thought out.

daniel added a comment. Mar 5 2019, 9:16 AM

Are you proposing to drop support for all calls to wfIsWindows() for Windows compatibility or any macOS workarounds? What about the few hacks that are there for proprietary webservers? Or the compatibility JavaScript we have for proprietary web browsers? None of those run through our CI.

If we ship them, they should run through CI. "We write code for it but we don't test it" seems a bad idea to me. But discussion of this argument with regards to things other than databases is off topic here.

The abstraction layer would still be useful to remove the need of manually maintaining patches for postgres and sqlite.

Don't get me wrong .... the abstraction layer for DB is necessary, because the current state is a PITA to maintain even without the proprietary DBs.

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

Honestly, unless you standardize through abstraction not only the DDL, but also ALL of the DQL and DML, through an abstraction layer without ANY hard-coded SQL, it's a losing battle for anyone who picks it up. I doubt that anyone will be able to pull such an abstraction off anyway. There were many attempts, and it always came down to either sacrificing WMF-branch performance to maintain the abstraction or a heap of exceptions which never played well for others. I'd say it's a waste of time to even consider it.

Abstracting DDL is a good first step towards the above. We have semi-decent abstractions already for the other stuff, although they aren't quite robust enough in some areas. In any case, as long as the wrappers are being used for queries, the db could (worst case) re-write bits of SQL in a way that works for that db. Which is what a number of them currently do. I'd prefer if we could just deprecate and remove DatabaseBase::query() entirely though and shore up the abstraction layers to be able to specify the things that query() is currently being used for. In any case, not relevant to this particular RFC but it's something that I'd love to see long-term.

Anomie added a comment. Edited Mar 5 2019, 11:24 PM

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

While dropping or extension-izing MSSQL and Oracle support is something to discuss, and the extension-izing would be made easier by this RFC, it's mostly not in scope for this RFC. General support for non-free software is even less so. Please file a separate task if you want to continue discussing that, or take it to some other communications channel.

In scope are the compatibility issues mentioned in Proposal #1, largely with respect to Oracle, which would likely mean dropping Oracle support unless someone steps up to implement it. MSSQL seems more likely to be able to have something put together mostly-blindly (or without too much demand on a volunteer's time) to maintain nominal support.

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

I pointed out a rabbit hole, and suggested we avoid going down it right here and now. I'm happy to discuss it at some other place and time, I didn't mean to shut you down.

My concern here is that, if we don't have CI for the proprietary databases, we have no way to make sure that we don't break support for them (even more) when implementing this proposal.

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

I pointed out a rabbit hole, and suggested we avoid going down it right here and now. I'm happy to discuss it at some other place and time, I didn't mean to shut you down.

My concern here is that, if we don't have CI for the proprietary databases, we have no way to make sure that we don't break support for them (even more) when implementing this proposal.

Well, I guess Proposal #2 fits all: if you implement Doctrine Migrations for the supported databases, why worry about breaking unsupported DBs if they are "not supported"...?

There would probably be a larger community fixing any bugs that exist... and extending the work to other DBs will be easier than with Proposal #1.

On the other hand, people using Oracle are probably stuck at MW v1.22 or so, which is also not supported anymore.
How could this be broken...?

JMTC.

Reading the minutes of the meeting (sorry I couldn't join), I volunteer to do some basic research on Doctrine and how feasible it is to use. Is that okay with TechCom and @Anomie?

daniel added a comment. Mar 7 2019, 3:21 PM

Reading the minutes of the meeting (sorry I couldn't join), I volunteer to do some basic research on Doctrine and how feasible it is to use. Is that okay with TechCom and @Anomie?

Sounds great, thank you!

Anomie added a comment. Edited Mar 7 2019, 3:25 PM

Reading the minutes of the meeting (sorry I couldn't join), I volunteer to do some basic research on Doctrine and how feasible it is to use. Is that okay with TechCom and @Anomie?

Please do!

You might look over https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements to get an idea of some of the challenges that might be encountered.

saper added a comment. Mar 7 2019, 5:35 PM

Recently I got into https://github.com/djrobstep/migra, which does an amazing job for PostgreSQL of creating what some people call "migrations" between database versions. This makes Proposal 1 quite attractive - it could be used to create a target empty schema (from JSON) and then just use such a tool (or similar ones) to migrate. There are limitations (migra itself is PostgreSQL-only, version 10+; other tools would have to be used for other DBs), but I wouldn't give up on Proposal 1.

https://djrobstep.com/talks/your-migrations-are-bad-and-you-should-feel-bad

Anomie added a comment. Mar 7 2019, 6:01 PM

While that's an interesting slide deck, it seems to be written for a proprietary development shop where you never have to worry about anyone having to upgrade from a version older than X-1. That's not a model we can aim for. In particular, just diffing schemas doesn't tell you anything about what data might need to be migrated from old tables/columns to new tables/columns, or how to populate the new non-null no-default column you're adding to a table that's already full of data.

With Option 1 I do hope to be able to include a maintenance script to show the diff between two schema-files and print differences between the current DB state and the schema-file. We will still have the directory[1] full of schema-change files, but I also hope to include unit tests for "base schema-file + all updates = current schema-file" and "DB loaded with base schema file + all updates = state consistent with the current schema-file".

[1]: But just one directory instead of five like we do now!
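
The "base schema-file + all updates = current schema-file" check might conceptually look like this (a sketch only; the AbstractSchema/SchemaChange classes, their methods, and the file locations are all hypothetical):

public function testBaseSchemaPlusUpdatesEqualsCurrentSchema() {
	$schema = AbstractSchema::loadFromFile( 'maintenance/tables-base.json' );
	// Apply every abstract schema-change file, in order, to the old base schema
	foreach ( glob( 'maintenance/abstractSchemaChanges/*.json' ) as $patchFile ) {
		$schema = $schema->applyChange( SchemaChange::loadFromFile( $patchFile ) );
	}
	$current = AbstractSchema::loadFromFile( 'maintenance/tables.json' );
	$this->assertEquals( $current->toArray(), $schema->toArray() );
}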

I looked at the code and the Doctrine projects. Doctrine as a whole is too much for us; it hides so many things (like database connections) from the outside that it is basically unusable for us. The same goes for Doctrine Migrations. But there is something called Doctrine DBAL that seems a good fit for us.
For schema creation, we'd basically check in a YAML/JSON/XML config file (I'll call it the "database schema config file" from now on) from which DBAL can create the schema object. Imagine a SchemaBuilder class that does this based on a config file.
(Example from the docs)

$schema = new \Doctrine\DBAL\Schema\Schema();
$myTable = $schema->createTable("my_table");
$myTable->addColumn("id", "integer", array("unsigned" => true));
$myTable->addColumn("username", "string", array("length" => 32));
$myTable->setPrimaryKey(array("id"));
$myTable->addUniqueIndex(array("username"));
$schema->createSequence("my_table_seq");

Turning this into SQL commands is pretty easy:

$queries = $schema->toSql($myPlatform);

$myPlatform can be extracted from a Doctrine connection object (which we don't want to have) OR we can instantiate it directly. For example, for the MySQL57Platform class (list of all platforms), it'll be: $platform = new MySQL57Platform();
The nice thing here is that it has different platforms for different versions of a DBMS, something that we didn't support before.

Implementation idea: we can make a class called DoctrineDatabaseUpdater that extends DatabaseUpdater, is parallel to MySqlUpdater, and gets the platform as a constructor argument.

I think we can also handle the table and column prefix automatically in the SchemaBuilder class by just making it prefix the table/column names (I'm not sure whether that's doable without confusing DBAL; it needs more investigation).

For schema changes, there are two ways:

  • We check all "database schema config files" into VCS; the system goes through each one of them, builds the schema object using the SchemaBuilder class, and finds the difference using code like:
$comparator = new \Doctrine\DBAL\Schema\Comparator();
$schemaDiff = $comparator->compare($fromSchema, $toSchema);

$queries = $schemaDiff->toSql($myPlatform);

I don't like this option because it involves a lot of duplication.

  • The second option here is to put in all "database schema config files" for stable releases, go through each line of DoctrineDatabaseUpdater, apply them in code, take the SQL out of it, and then send it to DatabaseUpdater. It should not be super hard to implement.

I like the second option.

It should not be hard to move forward and implement a POC based on what I wrote here, but I'm a little bit sick; I'll do it once I get better.

@Ladsgroup that seems quite promising! @Anomie, do you think this is ready for a proof of concept?

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this), and also to have ignored data type requirements. But if @Ladsgroup wants to try writing an implementation, I won't say he can't.

Change 496421 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] [POC] Use of Doctrine DBAL for oracle schema and schema changes

https://gerrit.wikimedia.org/r/496421

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this).

Good point, but I think we should focus the scope of this ticket on unifying and abstracting away schema changes, and not on fixing all of the issues with our current schema change system.

and also to have ignored data type requirements.

This can be fixed rather easily. Look at the POC.

But if @Ladsgroup wants to try writing an implementation, I won't say he can't.

I just made something concrete so we can discuss it.

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this).

Good point, but I think we should focus the scope of this ticket on unifying and abstracting away schema changes, and not on fixing all of the issues with our current schema change system.

I'd argue that this is in scope: in order to perform schema changes, you need to know what changes to perform. I mean, we can simply use the old system for that, at least for now, if it's sufficiently generic and working for all database types. But it would have to be clear how the old and the new system relate to each other, whether they use the same underlying connection, etc.

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this).

Good point, but I think we should focus the scope of this ticket on unifying and abstracting away schema changes, and not on fixing all of the issues with our current schema change system.

Checking the existing schema is part of the existing system. Part of the focus of this ticket[1] is already to improve that so it's easier for developers to do the right thing.

[1]: I should know, as the author of the RFC.

I'm sorry for the misunderstanding; as you wish. The POC is basically what I could get out of Doctrine DBAL. If you think it's not good enough, we can go with the first proposal. I leave the decision to the others here.

I would like to understand the problem better. If DBAL can do the schema changes, why can't we use it for the schema changes, even if it doesn't help us with the other parts of the issue?

I would like to understand the problem better. If DBAL can do the schema changes, why can't we use it for the schema changes, even if it doesn't help us with the other parts of the issue?

Yes, that's my suggestion here: we can use DBAL as much as possible and write the other parts we need.

Yes, that's my suggestion here: we can use DBAL as much as possible and write the other parts we need.

@Anomie do you see a problem with that? Or do you think that writing the other parts will render using DBAL for the actual schema changes redundant? It seems to me like we mostly have the other parts in place (doing selects, checking whether a table, field or index exists, etc). So why not just keep that as it is, and use DBAL instead of plain SQL code for the schema changes?

I'm worried that we'll wind up doing more work to address mismatches between DBAL and MediaWiki than we would to just directly implement code that works as part of Wikimedia\Rdbms without mismatches.

That said, though, I've been busy the past week and haven't had a chance to look at the code yet. I may find that my skepticism is unfounded. I hope to get a chance to review it this week.

It also sounds like using DBAL, if it works out, would be only one piece of the whole task here? So some amount of additional development would probably be needed in any case to achieve the goals here.

It also sounds like using DBAL, if it works out, would be only one piece of the whole task here? So some amount of additional development would probably be needed in any case to achieve the goals here.

Yes, seems like it. DBAL may be useful, but won't be sufficient.

I've said it before, but for work I'm really loving Laravel migrations (also DBAL, but with its own layers/subclasses around it). The class design can be found at Schema and Migrations if anyone wants some inspiration for our own solution. See the Blueprint class for where they implement their own mwtimestamp-like stuff, for instance. I also like how they defined Grammars.

However, I will point out that when it comes to migrations, especially with regard to SQLite (I usually run unit tests against SQLite), it's still not fully transparent. Some migrations are simply problematic (column type changes), though it is pretty easy to conditionally modify your migration to do connector-specific things.

Lastly, I'd like to highlight some of the tooling features of Laravel migrations (artisan migrate) that I like (especially for development, test, and staging) and that we should maybe consider adding to update.php long term (maybe in a separate ticket?):

  1. Tracking of which migration has run (and when) in a migrations table
  2. There is a --pretend option to dump the SQL queries that will be run (bit like our --schema option)
  3. There is a --step option to allow you to do migrations 1 by 1, instead of bulk
  4. All migrations have a rollback function that you can implement allowing you to undo your change.
  5. Easily bootstrap/make a new migration class
  6. Takes into account maintenance mode

Now, I think that running migrations on Wikimedia is a lot harder than what this still relatively simple tooling allows for, but in my opinion the pre-WMF-production workflow improvements would in themselves be worth having these features for.

Just my brain dump.

So I talked to @daniel yesterday and we went through the POC together. This is my proposed solution based on Doctrine DBAL:

  • Nothing in MediaWiki changes except that we replace the *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)
  • We need to write some classes that turn this JSON data into Doctrine DBAL "Schema" (replacing tables.sql) or "SchemaDiff" (replacing the other SQL files) abstract objects.
  • DatabaseUpdater and the installers basically define a Doctrine DBAL platform (for example "SqlitePlatform"). The platform is responsible for turning the abstract objects into SQL commands.

Everything else stays the same, including the way MediaWiki checks the current schema to determine whether it should apply a patch or not. This narrows the focus of the RFC a bit.
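
A rough sketch of the middle step (SchemaBuilder and the JSON layout are hypothetical; Schema, Table, and the platform classes are real DBAL classes):

use Doctrine\DBAL\Schema\Schema;

class SchemaBuilder {
	/** Build a DBAL Schema object from a DBMS-agnostic JSON file. */
	public function buildFromJsonFile( $path ) {
		$data = json_decode( file_get_contents( $path ), true );
		$schema = new Schema();
		foreach ( $data['tables'] as $tableDef ) {
			$table = $schema->createTable( $tableDef['name'] );
			foreach ( $tableDef['columns'] as $col ) {
				$table->addColumn( $col['name'], $col['type'], $col['options'] ?? [] );
			}
			if ( isset( $tableDef['pk'] ) ) {
				$table->setPrimaryKey( $tableDef['pk'] );
			}
		}
		return $schema;
	}
}

// The updater/installer then only needs to pick a platform and run the generated SQL:
$schema = ( new SchemaBuilder() )->buildFromJsonFile( 'maintenance/tables.json' );
$queries = $schema->toSql( new \Doctrine\DBAL\Platforms\SqlitePlatform() );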

daniel added a comment. Edited Mar 26 2019, 3:27 PM

One thing worth pointing out, which I learned talking to @Ladsgroup yesterday: DBAL turns an abstract schema or schema update (written in JSON) into SQL - nothing more. It doesn't need a database connection, it doesn't want to talk to any database at all, it just generates SQL statements. So it seems like a good tool for specifying our schema and schema changes in a database-agnostic way, without having to change a lot of scary and brittle code.

  • Nothing in MediaWiki changes except that we replace the *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)

Can I reiterate my previous concern about using JSON as the data format, given that JSON doesn't support comments?

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

So, what is the proposed method of doing that?

Can I reiterate my previous concern about using JSON as the data format, given that JSON doesn't support comments?

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

So, what is the proposed method of doing that?

That's a very valid concern. The thing is, the part that turns the JSON config into PHP abstract schema objects needs to be written in MediaWiki, so we have the freedom to choose whatever we want. We could even go with YAML (the reasons I went with JSON were: 1) PHP has straightforward support for JSON; 2) extension.json and i18n JSON files are pretty common in MediaWiki, while YAML is not used much). I'm open to ideas for this part. The way the extension registry deals with comments can also be used here.

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

I'm open to ideas for this part.

The current schema has no table or column comments. And the documentation on mediawiki.org (e.g. http://www.mediawiki.org/wiki/Manual:Logging_table) is frequently out of date (like log_type/log_action). Luckily both the current system's problem and the problem you mention here have the same solution: put comments on the columns and tables themselves. There should be a place in the JSON for these, as they're part of standard CREATE TABLE statements. This should be the canonical documentation of the schema, and it would then be usable from any replicas and documentation-generation scripts.
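
For what it's worth, DBAL can carry column comments into the generated DDL on platforms that support them, so the abstract format could expose that directly (a sketch; 'comment' is a DBAL column option, while the table and text here are only an illustration):

$table = $schema->createTable( 'logging' );
$table->addColumn( 'log_type', 'binary', [
	'length' => 32,
	'notnull' => true,
	// Ends up as a column comment in the generated CREATE TABLE where supported
	'comment' => 'Symbolic action type, e.g. "block", "delete", "move"',
] );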

@Ladsgroup Did you check how DBAL holds up against the heap of nasty edge cases that @Anomie compiled? https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements

Also, there is the issue of long-term ownership of the code. Once it's written, who maintains it? It seems to me like this would fall to the Core Platform Team. @Anomie, if @Ladsgroup implements this as proposed by him in T191231#5058412, would you be willing to do code review on that implementation, and maintain the code in the long term? If not, I may be able to take this on myself, but I have less experience with the schema updater code and all the little things that can go wrong there.

@Ladsgroup Did you check how DBAL holds up against the heap of nasty edge cases that @Anomie compiled? https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements

This is a really long list ;)
As far as I checked (which might be incomplete), DBAL either handles them or errors out. For example, in the case of identifier validity, DBAL has a method for that which each platform (DBMS) implements; this is the identifier validator in Oracle, which strangely doesn't have any size limit. For the rest of the requirements, the checks can be done in the SchemaBuilder object that we're supposed to write; for example, the check that a foreign key actually points to a primary or unique column in another table can be done in SchemaBuilder.

The nice thing we can now have with DBAL is that in our CI we can have our abstract schema tested against all DBMSes, by trying to turn the abstract PHP objects into SQL commands for all of the supported platforms (and we can now even add more DBMSes like SQLAnywhere, SQLAzure, etc. to the list) and checking whether they error out, because DBAL can catch most of the edge cases, as shown in the example above.
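
Such a CI check could be as simple as this (a sketch; SchemaBuilder is the hypothetical class described above, and the platform classes are DBAL's):

public function testSchemaGeneratesSqlOnAllPlatforms() {
	$schema = ( new SchemaBuilder() )->buildFromJsonFile( 'maintenance/tables.json' );
	$platforms = [
		new \Doctrine\DBAL\Platforms\MySQL57Platform(),
		new \Doctrine\DBAL\Platforms\PostgreSqlPlatform(),
		new \Doctrine\DBAL\Platforms\SqlitePlatform(),
	];
	foreach ( $platforms as $platform ) {
		// toSql() throws if the schema can't be expressed for the given platform
		$this->assertNotEmpty( $schema->toSql( $platform ) );
	}
}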

Datatypes are mapped from Doctrine datatypes to the DBMS ones. An example. By searching for "length" in the SQLite platform, I found the sophisticated way it handles length.