
RFC: Abstract schemas and schema changes
Open, Needs Triage, Public

Description

At some point this should be a TechCom-RFC, but at the moment it's still in the drafting stage.

Problem

MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle Database. For normal runtime-type queries, we have abstractions that make these all mostly work pretty well.

But at the DDL level it's a completely different story. One major piece of (and source of) technical debt is the fact that MediaWiki does not have a database schema, it has four. And most schema changes have to be written five times, one for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or when not omitted are often merged without being tested.

We can improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements.

Proposal #1

We should write a schema and schema change abstraction layer to integrate with MediaWiki's existing runtime database abstraction. Details are on-wiki at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements, but in short:

  • We would have one schema, expressed as a structure in a JSON file (see the sketch after this list). We would have one definition of each schema change, also expressed as a structure in a JSON file.
  • Database-specific classes would exist to turn the schema or schema-change into SQL statements, much as we have database-specific subclasses of Wikimedia\Rdbms\Database.
  • We'd also tighten up some of the other database-level things: limited identifier lengths, index name uniqueness, data type consistency, charset consistency, etc.
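
As a rough illustration, and purely hypothetical in its key names and layout (nothing here is a settled format), one table's entry in such a JSON file might decode to a PHP structure along these lines:

// Hypothetical sketch only: the actual key names and layout would be decided by the RFC.
// This is the kind of structure one table's entry in an abstract tables.json might decode to.
$userGroupsTable = [
	'name' => 'user_groups',
	'columns' => [
		[ 'name' => 'ug_user',   'type' => 'integer',     'options' => [ 'unsigned' => true, 'notnull' => true, 'default' => 0 ] ],
		[ 'name' => 'ug_group',  'type' => 'binary',      'options' => [ 'length' => 255, 'notnull' => true ] ],
		[ 'name' => 'ug_expiry', 'type' => 'mwtimestamp', 'options' => [ 'notnull' => false ] ],
	],
	'indexes' => [
		[ 'name' => 'ug_group',  'columns' => [ 'ug_group' ],  'unique' => false ],
		[ 'name' => 'ug_expiry', 'columns' => [ 'ug_expiry' ], 'unique' => false ],
	],
	'pk' => [ 'ug_user', 'ug_group' ],
];

The point is only that the structure carries names, types, options, indexes and keys in a database-agnostic way; the exact format is what the RFC would have to nail down.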

Proposal #2

Try to integrate Doctrine Migrations for schema creation and updates.

Pros (compared to Proposal #1):

  • We wouldn't have to implement all the database-specific logic ourselves.
  • Probably a larger community fixing any bugs that exist.
  • Familiar system for (some subset of) PHP developers, simplifying onboarding of devs.

Cons:

  • We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.
  • We may have to custom-implement a "mwtimestamp" type, or else standardize all DBs on using 14-byte strings.
  • We may still have to work around issues like MSSQL's different treatment of NULLs in unique indexes.

Event Timeline


Regarding the cons of the second proposal:

We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.

Is there a way to use Doctrine only for producing the schema change and schema installer SQL strings and leave it to MediaWiki to handle connections and stuff?

Ladsgroup added a comment. Edited Jan 21 2019, 9:37 AM

I also want to mention this method: https://github.com/wikimedia/mediawiki/blob/c477bcf2c5c482d3189ec3579c5dee444eb06f7d/includes/libs/rdbms/database/DatabaseSqlite.php#L898
That's not the way we should handle things...

Plus the current situation is very error-prone. Like this patch, which went unnoticed for four releases and more than a year, adding the right column to the wrong table:
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/328377/22/maintenance/mssql/archives/patch-user_groups-ug_expiry.sql#4

Regarding the cons of the second proposal:

We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.

Is there a way to use Doctrine only for producing the schema change and schema installer SQL strings and leave it to MediaWiki to handle connections and stuff?

For creating an initial schema, that could probably work assuming Doctrine supports it. But a schema change is two-way: you usually have to check the current state before knowing what changes need to be made. Particularly for SQLite, where most changes require creating a whole new table, copying the data from the old table, and then dropping the old and renaming the new into place.
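
To make the SQLite point concrete, here is a hedged sketch of that table-rebuild dance (illustrative table and column names, using MediaWiki's Wikimedia\Rdbms\IDatabase::query(); a real updater would also need to recreate indexes):

// Illustrative sketch only: the typical SQLite workaround for a change its ALTER TABLE cannot express.
function rebuildUserGroupsForSqlite( \Wikimedia\Rdbms\IDatabase $dbw ) {
	$old = $dbw->tableName( 'user_groups' );
	$tmp = $dbw->tableName( 'user_groups_tmp' );
	// 1. Create a new table with the desired definition.
	$dbw->query( "CREATE TABLE $tmp (ug_user INTEGER NOT NULL, ug_group BLOB NOT NULL, ug_expiry BLOB DEFAULT NULL)", __METHOD__ );
	// 2. Copy the existing rows across.
	$dbw->query( "INSERT INTO $tmp (ug_user, ug_group) SELECT ug_user, ug_group FROM $old", __METHOD__ );
	// 3. Drop the old table and rename the new one into place.
	$dbw->query( "DROP TABLE $old", __METHOD__ );
	$dbw->query( "ALTER TABLE $tmp RENAME TO $old", __METHOD__ );
}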

Then, too, we might still be left with datatype differences to deal with, e.g. where Doctrine wants to use varchar while MediaWiki wants varbinary or something like that.

Agreed. That's a bunch of string-replacement hacks so we don't need to have maintenance/sqlite/tables.sql in addition to the other four tables.sql files.

mobrovac added a subscriber: mobrovac. Edited Feb 5 2019, 6:50 PM

Personally, option (1) seems like a clear winner here. If done correctly, it would provide the correct abstraction layer around relational data, which would allow a way to provide support for arbitrary RDBMS systems. Moreover, in the context of making cleaner and clearer interfaces in mw-core, going with option (1) would do us a lot of good in terms of reassessing internal interfaces and those exposed to extensions and to external entities.

(/me puts his TechCom hat on.) @Anomie do you think this needs more discussion here or would you like to request an IRC RFC meeting? Most of the discussion here seems to have been focused on proprietary software and our CI and licences.

Yeah, I think it's about time to have an actual RFC meeting. Although, assuming it's approved, it'll probably be a quarter or so before I have time to actually start working on it.

kchapman added a subscriber: kchapman.

TechCom is hosting an IRC meeting next week, March 6, at 2pm PST (22:00 UTC, 23:00 CET) in #wikimedia-office

daniel added a subscriber: daniel. Mar 4 2019, 1:56 PM

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

HappyDog added a comment. Edited Mar 4 2019, 2:53 PM

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

I agree - seems a bit of a double standard!

Also, it's a bit of an odd approach to say we have the code and we have the tests, but we're not going to run the tests.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

Correct me if I'm wrong, but it seems the reason for this is that our current system is not extensible for other types of databases, so either we support it or we can't support it (even through an extension). This is one of the things this RFC can look into: move MSSQL and other proprietary databases out to extensions and leave them to the people who want to maintain them.

Anomie added a comment. Mar 4 2019, 4:50 PM

Correct me if I'm wrong, but it seems the reason for this is that our current system is not extensible for other types of databases, so either we support it or we can't support it (even through an extension). This is one of the things this RFC can look into: move MSSQL and other proprietary databases out to extensions and leave them to the people who want to maintain them.

You're right, at least from a practical standpoint.

Technically you could probably have an extension that adds support for some other database now: the user would load your extension in LocalSettings.php and configure $wgDBtype, $wgLBFactoryConf, and so on to use your database classes rather than one of the built-in ones. But it would be a lot of work to actually get it started and even more to keep up to date:

  • You'd have to have your own version of tables.sql in your extension, or else you'd need to write code to parse core's maintenance/tables.sql (or maintenance/postgres/tables.sql, maintenance/mssql/tables.sql, or maintenance/oracle/tables.sql) and turn it into something that works for your new DB.
  • Then you'd need to keep up with every core schema change (or somehow get the changes from the DatabaseUpdater subclass, including parsing the patch files in maintenance/archives/).
  • Also, to work with extensions, you'd need to duplicate or similarly parse the DB changes they inject with the LoadExtensionSchemaUpdates hooks.

What would probably actually happen is that you'd keep up with it for a short time, then lose interest (or stop having that as part of your job at a third party) and abandon the extension. And likely no one else would bother to pick it up.

If this RFC is implemented, it would be a lot more straightforward: you'd implement an interface with methods like "tableExists()", "createTable()", "addColumn()", and so on for specific schema-change operations and be done. The rest of the details of handling schema creation and schema change are handled for you by core code calling that interface's methods.
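
To make that concrete, here is a rough sketch of the shape such an interface might take (only tableExists(), createTable() and addColumn() are named above; the other method names and the signatures are invented for illustration):

// Hypothetical sketch of the per-database interface described above.
interface AbstractSchemaOperations {
	public function tableExists( string $table ): bool;

	/** Create a table from a database-agnostic definition (e.g. decoded from JSON). */
	public function createTable( string $table, array $abstractDefinition );

	public function addColumn( string $table, string $column, array $abstractType );

	public function dropColumn( string $table, string $column );

	public function addIndex( string $table, string $index, array $columns, bool $unique );
}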

I agree with @Anomie ...

Moving a proprietary DB type to an extension and making the extension maintainer keep up with the changes makes it essentially unusable. Even when I was actively working on Oracle support I had trouble keeping up with the changes, and in many cases trying to convince the core DB devs that some changes were not even applicable for my back-end.

Honestly, unless you standardize not only the DDL through abstraction, but also ALL of the DQL and DML through an abstraction layer without ANY hard-coded SQL, it's a losing battle for anyone who picks it up. I doubt that anyone will be able to pull such an abstraction off anyway. There were many attempts and it always came down to sacrificing WMF branch performance to maintain the abstraction, or a heap of exceptions which never played well for others. I'd say it's a waste of time to even consider it.

So while it was fun working on this project, I have lost track of the changes more than a year ago, and I think it's in the best interest of everyone, if you do not plan to keep it in core and offer active support and maintenance from the core dev side, to just draw a line and drop the proprietary DB code from the repos.

daniel added a comment. Mar 5 2019, 8:16 AM

So while it was fun working on this project, I have lost track of the changes more than a year ago, and I think it's in the best interest of everyone, if you do not plan to keep it in core and offer active support and maintenance from the core dev side, to just draw a line and drop the proprietary DB code from the repos.

Indeed. If we don't have quality assurance for it, we should not ship it. If we don't do proprietary DBs in CI, we should not have them in core. I'm in favor of dropping support for them, it was never complete anyway.

The abstraction layer would still be useful to remove the need of manually maintaining patches for postgres and sqlite.

Wikimedia CI as a matter of principle (backed up by the terms of use) isn't going to support any non-free software regardless of whether it's installed locally or connecting over the network.

Not supporting CI for any non-free software, but having support for that non-free software in core, seems contradictory. We should have both, or neither.

Are you proposing to drop support for all calls to wfIsWindows() for Windows compatibility or any macOS workarounds? What about the few hacks that are there for proprietary webservers? Or the compatibility JavaScript we have for proprietary web browsers? None of those run through our CI.

I'm always one of the first ones to advocate for free software, but taking the philosophy of dropping support for all non-free software in core seems not well thought out.

daniel added a comment. Mar 5 2019, 9:16 AM

Are you proposing to drop support for all calls to wfIsWindows() for Windows compatibility or any macOS workarounds? What about the few hacks that are there for proprietary webservers? Or the compatibility JavaScript we have for proprietary web browsers? None of those run through our CI.

If we ship them, they should run through CI. "We write code for it but we don't test it" seems a bad idea to me. But discussion of this argument with regards to things other than databases is off topic here.

The abstraction layer would still be useful to remove the need of manually maintaining patches for postgres and sqlite.

Don't get me wrong .... the abstraction layer for DB is necessary, because the current state is a PITA to maintain even without the proprietary DBs.

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

Honestly, unless you standardize not only the DDL through abstraction, but also ALL of the DQL and DML through an abstraction layer without ANY hard-coded SQL, it's a losing battle for anyone who picks it up. I doubt that anyone will be able to pull such an abstraction off anyway. There were many attempts and it always came down to sacrificing WMF branch performance to maintain the abstraction, or a heap of exceptions which never played well for others. I'd say it's a waste of time to even consider it.

Abstracting DDL is a good first step towards the above. We have semi-decent abstractions already for the other stuff, although they aren't quite robust enough in some areas. In any case, as long as the wrappers are being used for queries, the db could (worst case) re-write bits of SQL in a way that works for that db. Which is what a number of them currently do. I'd prefer if we could just deprecate and remove DatabaseBase::query() entirely though and shore up the abstraction layers to be able to specify the things that query() is currently being used for. In any case, not relevant to this particular RFC but it's something that I'd love to see long-term.

Anomie added a comment. Edited Mar 5 2019, 11:24 PM

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

While dropping or extension-izing MSSQL and Oracle support is something to discuss, and the extension-izing would be made easier by this RFC, it's mostly not in scope for this RFC. General support for non-free software is even less so. Please file a separate task if you want to continue discussing that, or take it to some other communications channel.

In scope are the compatibility issues mentioned in Proposal #1, largely with respect to Oracle, which would likely mean dropping Oracle support unless someone steps up to implement it. MSSQL seems more likely to be able to have something put together mostly-blindly (or without too much demand on a volunteer's time) to maintain nominal support.

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

I pointed out a rabbit hole, and suggested we avoid going down it right here and now. I'm happy to discuss it at some other place and time, I didn't mean to shut you down.

My concern here is that, if we don't have CI for the proprietary databases, we have no way to make sure that we don't break support for them (even more) when implementing this proposal.

But discussion of this argument with regards to things other than databases is off topic here.

That makes no sense, why is it off topic? I don't think it's very fair for you to make a broad claim saying that we should remove support for non-free software in core that CI doesn't test and then respond to a counter-argument with "that's off topic".

I pointed out a rabbit hole, and suggested we avoid going down it right here and now. I'm happy to discuss it at some other place and time, I didn't mean to shut you down.
My concern here is that, if we don't have CI for the proprietary databases, we have no way to make sure that we don't break support for them (even more) when implementing this proposal.

Well, I guess Proposal #2 fits all: if you implement Doctrine Migrations for the supported databases, why worry about breaking unsupported DBs if they are "not supported"?

Probably a larger community fixing any bugs that exist... and extending the work to other DBs will be easier than with Proposal #1.

On the other hand, people using Oracle are probably stuck at MW v1.22 or so, which is also not supported anymore.
How could this be broken...?

JMTC.

Reading the minutes of the meeting (sorry I couldn't join), I volunteer to do some basic research on Doctrine and how feasible it is to use. Is that okay with TechCom and @Anomie?

daniel added a comment. Mar 7 2019, 3:21 PM

Reading the minutes of the meeting (sorry I couldn't join), I volunteer to do some basic research on Doctrine and how feasible it is to use. Is that okay with TechCom and @Anomie?

Sounds great, thank you!

Anomie added a comment. Edited Mar 7 2019, 3:25 PM

Reading the minutes of the meeting (sorry I couldn't join), I volunteer to do some basic research on Doctrine and how feasible it is to use. Is that okay with TechCom and @Anomie?

Please do!

You might look over https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements to get an idea of some of the challenges that might be encountered.

saper added a comment. Mar 7 2019, 5:35 PM

Recently I came across https://github.com/djrobstep/migra, which does an amazing job for PostgreSQL of creating what some people call "migrations" between database versions. This makes Proposal 1 quite attractive - it could be used to create a target empty schema (from JSON) and then just use such a tool (or a similar one) to migrate. There are limitations (migra itself is PostgreSQL-only, version 10+, and other tools would have to be used for other DBs), but I wouldn't give up on Proposal 1.

https://djrobstep.com/talks/your-migrations-are-bad-and-you-should-feel-bad

Anomie added a comment. Mar 7 2019, 6:01 PM

While that's an interesting slide deck, it seems to be written for a proprietary development shop where you never have to worry about anyone having to upgrade from a version older than X-1. That's not a model we can aim for. In particular, just diffing schemas doesn't tell you anything about what data might need to be migrated from old tables/columns to new tables/columns, or how to populate the new non-null no-default column you're adding to a table that's already full of data.

With Option 1 I do hope to be able to include a maintenance script to show the diff between two schema-files and print differences between the current DB state and the schema-file. We will still have the directory[1] full of schema-change files, but I also hope to include unit tests for "base schema-file + all updates = current schema-file" and "DB loaded with base schema file + all updates = state consistent with the current schema-file".

[1]: But just one directory instead of five like we do now!
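
A rough sketch of what the "base schema-file + all updates = current schema-file" test could assert (all of the helper methods and file paths here are hypothetical; only the PHPUnit scaffolding is real):

// Hypothetical sketch: loadAbstractSchema(), loadAbstractSchemaChanges() and
// applyAbstractSchemaChange() do not exist; they only illustrate the consistency check.
class AbstractSchemaConsistencyTest extends \PHPUnit\Framework\TestCase {
	public function testBaseSchemaPlusUpdatesMatchesCurrentSchema() {
		$base    = $this->loadAbstractSchema( 'maintenance/tables-base.json' );   // hypothetical path
		$current = $this->loadAbstractSchema( 'maintenance/tables.json' );        // hypothetical path

		foreach ( $this->loadAbstractSchemaChanges( 'maintenance/abstract-archives/' ) as $change ) {
			$base = $this->applyAbstractSchemaChange( $base, $change );
		}

		$this->assertEquals( $current, $base );
	}
}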

I looked at the code and the Doctrine projects. Doctrine as a whole is too much for us; it hides so many things (like database connections) from the outside that it's basically unusable. The same goes for Doctrine Migrations. But there is something called Doctrine DBAL that seems a good fit for us.
For schema creation, we would basically check in a YAML/JSON/XML config file (I'll call it the "database schema config file" from now on) from which DBAL can create the schema object. Imagine a SchemaBuilder class that does this based on a config file.
(Example from the docs)

$schema = new \Doctrine\DBAL\Schema\Schema();
$myTable = $schema->createTable("my_table");
$myTable->addColumn("id", "integer", array("unsigned" => true));
$myTable->addColumn("username", "string", array("length" => 32));
$myTable->setPrimaryKey(array("id"));
$myTable->addUniqueIndex(array("username"));
$schema->createSequence("my_table_seq");

Turning this into SQL commands is pretty easy:

$queries = $schema->toSql($myPlatform);

$myPlatform can be extracted from the Doctrine connection object (which we don't want to have), OR we can instantiate it directly. For example, for the MySQL57Platform class (list of all platforms), it would be: $platform = new MySQL57Platform();
The nice thing here is that it has different platforms for different versions of a DBMS, something that we didn't support before.

Implementation idea: we can make a class called DoctrineDatabaseUpdater that extends DatabaseUpdater, is parallel to MySqlUpdater, and gets the platform as a constructor argument.

I think we can also handle the table and column prefix automatically in the SchemaBuilder class by just making it automatically prefix the table/column names (I'm not sure if that's doable without confusing DBAL; needs more investigation).
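
To sketch that idea (hedged: the SchemaBuilder class and the config layout are hypothetical; the Schema/Table calls are real Doctrine DBAL API, as in the example above):

// Hypothetical sketch of the SchemaBuilder idea: turn a decoded "database schema
// config file" into a Doctrine DBAL Schema object, applying an optional table prefix.
class SchemaBuilder {
	/** @var string */
	private $prefix;

	public function __construct( $prefix = '' ) {
		$this->prefix = $prefix;
	}

	public function buildSchema( array $config ) {
		$schema = new \Doctrine\DBAL\Schema\Schema();
		foreach ( $config['tables'] as $tableDef ) {
			$table = $schema->createTable( $this->prefix . $tableDef['name'] );
			foreach ( $tableDef['columns'] as $col ) {
				$table->addColumn( $col['name'], $col['type'], $col['options'] ?? [] );
			}
			foreach ( $tableDef['indexes'] ?? [] as $idx ) {
				if ( !empty( $idx['unique'] ) ) {
					$table->addUniqueIndex( $idx['columns'], $idx['name'] );
				} else {
					$table->addIndex( $idx['columns'], $idx['name'] );
				}
			}
			if ( isset( $tableDef['pk'] ) ) {
				$table->setPrimaryKey( $tableDef['pk'] );
			}
		}
		return $schema;
	}

	// Generate the CREATE statements for a platform; no DB connection needed.
	public function toSql( array $config, \Doctrine\DBAL\Platforms\AbstractPlatform $platform ) {
		return $this->buildSchema( $config )->toSql( $platform );
	}
}

Usage would then amount to decoding the JSON file, constructing the builder with $wgDBprefix, and calling toSql() with the platform for the wiki's database type.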

For schema changes, there are two ways:

  • We check all "database schema config file"s into VCS; the system goes through each one of them, builds the schema object using the SchemaBuilder class, and finds the difference using code like:
$comparator = new \Doctrine\DBAL\Schema\Comparator();
$schemaDiff = $comparator->compare($fromSchema, $toSchema);

$queries = $schemaDiff->toSql($myPlatform);

I don't like this option because it's lots of duplication.

  • The second option here is to put all "database schema config file"s for stable releases in the repository, go through each line of DoctrineDatabaseUpdater, apply them in code, take the SQL out of it, and then send it to DatabaseUpdater. It should not be super hard to implement.

I like the second option.

It should not be hard to move forward and implement a POC based on what I wrote here but I'm a little bit sick, I'll do it once I get better.

@Ladsgroup that seems quite promising! @Anomie, do you think this is ready for a proof-of-concept?

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this), and also to have ignored data type requirements. But if @Ladsgroup wants to try writing an implementation, I won't say he can't.

Change 496421 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] [POC] USe of Doctrine DBAL for oracle schema and schema changes

https://gerrit.wikimedia.org/r/496421

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this).

Good point, but I think we should focus the scope of this ticket on unifying and abstracting away schema changes, not on fixing all of the issues with our current schema change system.

and also to have ignored data type requirements.

This can be fixed rather easily. Look at the POC.

But if @Ladsgroup wants to try writing an implementation, I won't say he can't.

I just made something concrete so we can discuss it.

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this).

Good point, but I think we should focus the scope of this ticket on unifying and abstracting away schema changes, not on fixing all of the issues with our current schema change system.

I'd argue that this is in scope: in order to perform schema changes, you need to know what changes to perform. I mean, we can simply use the old system for that, at least for now, if it's sufficiently generic and working for all database types. But it would have to be clear how the old and the new system relate to each other, whether they use the same underlying connection, etc.

I note the review so far seems to have ignored that part of schema changes is checking the existing schema to determine whether the schema needs changing (MW currently does a somewhat crappy job of this).

Good point, but I think we should focus the scope of this ticket on unifying and abstracting away schema changes, not on fixing all of the issues with our current schema change system.

Checking the existing schema is part of the existing system. Part of the focus of this ticket[1] is already to improve that to be easier for developers to do the right thing.

[1]: I should know, as the author of the RFC.

I'm sorry for the misunderstanding; as you wish. The POC is basically what I could get out of Doctrine DBAL. If you think it's not good enough, we can go with the first proposal. I leave the decision to others here.

I would like to understand the problem better. If DBAL can do the schema changes, why can't we use it for the schema changes, even if it doesn't help us with the other parts of the issue?

I would like to understand the problem better. If DBAL can do the schema changes, why can't we use it for the schema changes, even if it doesn't help us with the other parts of the issue?

Yes, that's my suggestion here: we can use DBAL as much as possible and write the other parts we need.

Yes, that's my suggestion here: we can use DBAL as much as possible and write the other parts we need.

@Anomie do you see a problem with that? Or do you think that writing the other parts will render using DBAL for the actual schema change redundant? It seems to me like we mostly have the other parts in place (doing selects, checking whether a table, field or index exists, etc). So why not just keep that as it is, and use DBAL instead of plain SQL code for the schema changes?

I'm worried that we'll wind up doing more work to address mismatches between DBAL and MediaWiki than we would to just directly implement code that works as part of Wikimedia\Rdbms without mismatches.

That said, though, I've been busy the past week and haven't had a chance to look at the code yet. I may find that my skepticism is unfounded. I hope to get a chance to review it this week.

It also sounds like using DBAL, if it works out, would be only one piece of the whole task here? So some amount of additional development would probably be needed in any case to achieve the goals here.

It also sounds like using DBAL, if it works out, would be only one piece of the whole task here? So some amount of additional development would probably be needed in any case to achieve the goals here.

Yes, seems like it. DBAL may be useful, but won't be sufficient.

I've said it before, but for work I'm really loving Laravel migrations (also DBAL, but with its own layers/subclasses around it). The class design can be found at Schema and Migrations if anyone wants some inspiration for our own solution. See the Blueprint class for where they implement their own mwtimestamp-like stuff, for instance. I also like how they defined Grammars.

However I will point out that when it comes to migrations, indeed especially wrt SQLite (I usually run unit tests against SQLite), it's still not fully transparent. Some migrations are simply problematic (column type changes), though it is pretty easy to conditionally modify your migration to do connector-specific things.

Lastly I'd like to highlight some of the tooling features of Laravel migrations [artisan migrate] that I like (especially for development, test and staging) and that we should maybe consider adding to update.php long term (maybe a separate ticket?):

  1. Tracking of which migration has run (and when) in a migrations table
  2. There is a --pretend option to dump the SQL queries that will be run (bit like our --schema option)
  3. There is a --step option to allow you to do migrations 1 by 1, instead of bulk
  4. All migrations have a rollback function that you can implement allowing you to undo your change.
  5. Easily bootstrap/make a new migration class
  6. Takes into account maintenance mode

Now I think that running migrations on Wikimedia is a lot harder than what this still relatively simple tooling allows for, but in my opinion the pre-WMF-production workflow improvements in itself would be worth having these features.

Just my brain dump.

So I talked to @daniel yesterday and we went through the POC together. This is my proposed solution based on Doctrine DBAL:

  • Nothing in MediaWiki changes except that we replace the *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)
  • We need to write some classes that turn this JSON data into Doctrine DBAL "Schema" (replacing tables.sql) or "SchemaDiff" (replacing the other SQL files) abstract objects.
  • DatabaseUpdater and the installers basically define a Doctrine DBAL platform (for example "SqlitePlatform"). The platform is responsible for turning the abstract objects into SQL commands.

Everything else stays the same, including the way MediaWiki checks the current schema to determine whether to apply a patch or not. This narrows the focus of the RFC a bit.
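
A hedged sketch of how that flow might look end to end (the file names, the getPlatformForDbType() helper and the SchemaBuilder class are hypothetical; the platform and Comparator classes are real Doctrine DBAL ones):

// Illustrative only: pick a DBAL platform from $wgDBtype and generate SQL
// from the abstract JSON files, with no database connection involved.
function getPlatformForDbType( $dbType ) {
	switch ( $dbType ) {
		case 'mysql':
			return new \Doctrine\DBAL\Platforms\MySQL57Platform();
		case 'sqlite':
			return new \Doctrine\DBAL\Platforms\SqlitePlatform();
		default:
			throw new \InvalidArgumentException( "No platform mapped for '$dbType'" );
	}
}

$platform = getPlatformForDbType( $wgDBtype );
$builder = new SchemaBuilder();   // hypothetical class, sketched earlier in this thread

// Fresh install: CREATE statements from the central tables.json.
$createSql = $builder->toSql(
	json_decode( file_get_contents( 'maintenance/tables.json' ), true ),
	$platform
);

// Schema change: compare a "before" and "after" abstract schema and emit only
// the statements this particular platform needs.
$comparator = new \Doctrine\DBAL\Schema\Comparator();
$diff = $comparator->compare(
	$builder->buildSchema( $before ),   // $before/$after: decoded schema config arrays
	$builder->buildSchema( $after )
);
$alterSql = $diff->toSql( $platform );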

daniel added a comment. Edited Mar 26 2019, 3:27 PM

One thing worth pointing out, which I learned talking to @Ladsgroup yesterday: DBAL turns an abstract schema or schema update (written in JSON) into SQL - nothing more. It doesn't need a database connection; it doesn't want to talk to any database at all, it just generates SQL statements. So it seems a good tool for specifying our schema and schema changes in a database-agnostic way, without having to change a lot of scary and brittle code.

  • Nothing in MediaWiki changes except that we replace the *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

So, what is the proposed method of doing that?

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.
Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.
So, what is the proposed method of doing that?

That's a very valid concern. The thing is, the part that turns the JSON config into PHP abstract schema objects needs to be written in MediaWiki, so we have the freedom to choose whatever we want. We could even go with YAML (the reasons I went with JSON were 1) PHP has straightforward support for JSON, and 2) extension.json and i18n JSON files are pretty common in MediaWiki while YAML is not used much). I'm open to ideas for this part. The way the extension registry handles comments could also be used here.

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

I'm open to ideas for this part.

The current schema has no table or column comments. And the documentation on mediawiki.org (e.g. http://www.mediawiki.org/wiki/Manual:Logging_table) is frequently out of date (like log_type/log_action). Luckily both the current system's problem and the problem you mention here have the same solution: put comments in the columns and tables themselves. There should be a place in the JSON for these, as they're part of standard CREATE TABLE statements. This should be the canonical documentation of the schema, and it would then be usable from any replicas and documentation-generation scripts.
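
For what it's worth, Doctrine DBAL already has a slot for exactly this: a column definition can carry a 'comment' option, which platforms that support it emit as part of the DDL (a small sketch; the column and the wording of the comment are just an example):

$schema = new \Doctrine\DBAL\Schema\Schema();
$logging = $schema->createTable( 'logging' );
// The 'comment' option becomes a column comment in the generated DDL on
// platforms that support it (e.g. MySQL's COMMENT clause), so the abstract
// schema file could be the canonical place for this documentation.
$logging->addColumn( 'log_type', 'binary', [
	'length' => 32,
	'notnull' => true,
	'comment' => 'Type of log entry, e.g. "block" or "move"; see Manual:Logging_table.',
] );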

@Ladsgroup Did you check how DBAL holds up against the heap of nasty edge cases that @Anomie compiled? https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements

Also, there is the issue of long-term ownership of the code. Once it's written, who maintains it? It seems to me like this would fall to the Core Platform Team. @Anomie, if @Ladsgroup implements this as proposed by him in T191231#5058412, would you be willing to do code review on that implementation, and maintain the code in the long term? If not, I may be able to take this on myself, but I have less experience with the schema updater code and all the little things that can go wrong there.

@Ladsgroup Did you check how DBAL holds up against the heap of nasty edge cases that @Anomie compiled? https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements

This is a really long list ;)
As far as I checked (which might be incomplete), DBAL either handles them or errors out. For example, in the case of identifier validity, DBAL has a method for that which each platform (DBMS) implements; this is the identifier validator in Oracle, which strangely doesn't have any size limit. For the rest of the requirements, the checks can be done in the SchemaBuilder object that we're supposed to write; for example, the check that a foreign key actually points to a primary or unique column in another table can be done in SchemaBuilder.

The nice thing we can now have with DBAL is that in our CI we can have our abstract schema tested against all DBMSes, by trying to turn the abstract PHP objects into SQL commands on all of the supported platforms (and now we can add even more DBMSes like SQLAnywhere, SQLAzure, etc. to the list) and checking whether they error out, because DBAL can catch most of the edge cases, as I showed above.
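
A sketch of what such a CI check could look like (hedged: SchemaBuilder is the hypothetical class from earlier, the tables.json path is illustrative, and the platform list is just a starting point):

// Illustrative sketch: try to render the abstract schema on every platform we
// care about and fail the build if Doctrine DBAL errors out for any of them.
class AbstractSchemaAllPlatformsTest extends \PHPUnit\Framework\TestCase {
	public function providePlatforms() {
		return [
			'mysql'  => [ new \Doctrine\DBAL\Platforms\MySQL57Platform() ],
			'sqlite' => [ new \Doctrine\DBAL\Platforms\SqlitePlatform() ],
			// ...Oracle, SQL Server, Postgres, etc. platform classes could be listed here too.
		];
	}

	/** @dataProvider providePlatforms */
	public function testAbstractSchemaRendersOnPlatform( $platform ) {
		$config = json_decode( file_get_contents( 'maintenance/tables.json' ), true );
		$sql = ( new SchemaBuilder() )->toSql( $config, $platform );  // hypothetical class, sketched earlier
		$this->assertNotEmpty( $sql );
	}
}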

Datatypes are mapped from Doctrine datatypes to the DBMS ones. An example: by searching for "length" in the SQLite platform, you can find the sophisticated way it handles length.

In the discussion with @daniel and @Anomie it became apparent that it's better to drop support for Oracle and MSSQL before trying to abstract them, moving the support to extensions; if someone wants to maintain the support, they should do it in the extension.

To recap. I suggest these two:

  • Drop official support for Oracle and MSSQL
  • Use Doctrine DBAL to handle SQL generation for the schema and schema changes (practically turning .sql files into .json files)

If there's no objection in the next week, I will request a last call on this.

daniel added a comment. Jun 3 2019, 8:22 AM

To recap. I suggest these two:

  • Drop official support for Oracle and MSSQL
  • Use Doctrine DBAL to handle SQL generation for the schema and schema changes (practically turning .sql files into .json files)

If there's no objection in the next week, I will request a last call on this.

For a meaningful last call, it would have to be made clear what it would take to implement Oracle and MSSQL support in an extension. From what I understand, the problem is that the schemas we currently maintain in core for Oracle and MSSQL have diverged significantly from the MySQL schema in some places. What would an extension that provides support for Oracle (or MSSQL) have to do to handle these differences?

Knowing how hard it will be to move Oracle and MSSQL to extensions will allow us to assess the implications of dropping support from core: does it just mean that someone has to write a few hundred lines of compat code in an extension, or is the task much more complex, meaning that we'd be dropping support for these databases entirely?

If we end up dropping Oracle and MSSQL entirely, we'd be using DBAL mostly to convert from MySQL to PostGres. Is that worth the effort?

These seem to be the trade-offs we should be considering.

Thanks for the comment :)

For a meaningful last call, it would have to be made clear what it would take to implement Oracle and MSSQL support in an extension. From what I understand, the problem is that the schemas we currently maintain in core for Oracle and MSSQL have diverged significantly from the MySQL schema in some places. What would an extension that provides support for Oracle (or MSSQL) have to do to handle these differences?

I would say there is no need, or at least it's not worth the effort. The Oracle and MSSQL schemas were not just diverged but actually broken for a very long time (and no one noticed).

This is my favorite example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/480554
For more than four full release cycles, or in other words two years, our schema for MSSQL added the right column to the wrong table. And no one noticed. It means MSSQL is not used and is not going to be used.

Knowing how hard it will be to move Oracle and MSSQL to extensions will allow us to assess the implications of dropping support from core: does it just mean that someone has to write a few hundred lines of compat code in an extension, or is the task much more complex, meaning that we'd be dropping support for these databases entirely?

It will be a few hundred lines of code, but there is no guarantee it will work; it depends on how well Doctrine DBAL supports those databases. If it's good enough, it should not be an issue.

If we end up dropping Oracle and MSSQL entirely, we'd be using DBAL mostly to convert from MySQL to PostGres. Is that worth the effort?

I would say it's definitely worth the effort for three reasons:

  • We still need to support SQLite as well, and doing schema changes on SQLite is pretty complex (you need to make a temporary table, put things there, and then rename the table to the main one).
  • We need to fix and remove this: https://github.com/wikimedia/mediawiki/blob/c477bcf2c5c482d3189ec3579c5dee444eb06f7d/includes/libs/rdbms/database/DatabaseSqlite.php#L898 This is technical debt that's going to bite us.
  • Having plain strings as SQL is quite dangerous: it opens the possibility of several types of errors like the one I mentioned above, and there are other things that are more subtle (like forgetting /*$wgDBprefix*/, /*_*/ or /*i*/, or mixing them up). All of that can be avoided by abstracting it away into the SchemaBuilder and SchemaDiffBuilder objects. Mistakes in schemas can cause data corruption and should be handled with more care.

To recap. I suggest these two:

  • Drop official support for Oracle and MSSQL

a) It looks like it collides with https://phabricator.wikimedia.org/T113831
b) It doesn't look to be part of either Proposal #1 or #2, which is what the discussion here is about.

I'm confused about it. Cheers.

a) It looks like it collides with https://phabricator.wikimedia.org/T113831

Thanks for pointing out this ticket. It gave me lots of information.

b) It doesn't look to be part of either Proposal #1 or #2, which is what the discussion here is about.
I'm confused about it. Cheers.

Yes, sorry for not being more verbose.

https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements lists the requirements if we want to abstract with proposal #1, #2, or #3 (proposal three is adding support using Doctrine DBAL, practically a hybrid of proposals #1 and #2). And as you can see, both Oracle and MSSQL have very weird requirements that we need to do gymnastics to support, and some are not even supported in Doctrine DBAL (most importantly unique indexes, which are not supported for Oracle but are supported for MSSQL). So I asked for dropping support for both.

Looking back and reading T113831: Remove MSSQL support from MediaWiki core, and given that Doctrine DBAL supports the unique-indexes requirement on MSSQL, I think it would be okay to just drop official support for Oracle (there seems to be universal support for dropping Oracle) and leave MSSQL for now. I mean Oracle won't be "officially" supported, but the code can easily be moved to an extension and the support kept by people who want to maintain it. Another note: MSSQL has been broken for a while now (Look at my above comment)

Keep in mind that abstracting makes it easier to add support, or keep current support, in extensions or core. Adding support will be only about a hundred lines of code for most DBMSes, but for some (like Oracle) you would need to patch and add support to Doctrine DBAL first before moving forward. (All of this paragraph assumes this RFC gets accepted with proposal #3.)

HTH

daniel added a comment. Jun 3 2019, 3:26 PM

a) It looks like it collides with https://phabricator.wikimedia.org/T113831

IIRC, one of the main reasons that was declined was that we didn't have a way to abstract schema migrations.

Another note: MSSQL has been broken for a while now (Look at my above comment)

As per the linked ticket, the primary maintainer of the MSSQL code appears to only target LTS releases, for a couple of reasons. Therefore this is not necessarily indicative of the maintenance status (though it's clearly a problem).

Use Doctrine DBAL to handle SQL generation for the schema and schema changes (practically turning .sql files into .json files)

I raised some objections to using JSON as the format (see T191231#5063939) which have not yet been adequately addressed.

There was a suggestion of using table/column comments, but I'm not sure this really resolves the issue. I would hope and expect to see an average of about 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table. I'm aware that the documentation is not currently at this level, but I would hope that improving that is something we all aspire to. Limiting it to the amount that can be represented in a simple string variable (which would obviously require proper escaping, too) is likely to make things worse, rather than better.

A data format that allows proper comments, of any length, please! YAML may be a good choice, or a custom JSON + comments format if YAML is too heavy.

Abstract schemas are an argument for keeping MSSQL, Oracle, and other lightly-supported DBMSes in Core, not splitting them out. Once we have an abstract schema, the supporting code to make use of it need only be written once, and that same schema can then apply everywhere.

There will likely be some bugs, but it will greatly reduce the maintenance burden as well as the burden on those submitting changes (they need only provide the one abstract patch instead of 5 different sql files).

In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing. Once that happens and is well-supported rather than an afterthought, I’d be happy to re-evaluate moving less-supported DBMS into extensions or whatever system is devised for that.

Abstract schemas are an argument for keeping MSSQL, Oracle, and other lightly-supported DBMSes in Core, not splitting them out.

You are completely right, but that's true only when there's an existing abstraction in place. With the current system, and especially in the case of Oracle, it's not possible to use Doctrine DBAL to build the abstraction, or it requires a great amount of developer work to write a DBAL from scratch. (See this.)

There was a suggestion of using table/column comments, but I'm not sure this really resolves the issue. I would hope and expect to see an average of about 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table. I'm aware that the documentation is not currently at this level, but I would hope that improving that is something we all aspire to. Limiting it to the amount that can be represented in a simple string variable (which would obviously require proper escaping, too) is likely to make things worse, rather than better.

I would argue that 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table, makes any sort of DDL more of a documentation piece than a DDL. What is needed is up-to-date documentation on mediawiki.org and clear links to it in the DDL.

HappyDog added a comment. Edited Jun 8 2019, 9:31 PM

I would argue that 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table, makes any sort of DDL more of a documentation piece than a DDL. What is needed is up-to-date documentation on mediawiki.org and clear links to it in the DDL.

That's an argument against code commenting in general.

I strongly disagree with the suggestion that we should limit the amount of comments a developer is allowed to provide, either explicitly or implicitly (by choice of format). I can guarantee that any recommendation that developers should add their comments somewhere other than alongside the code they are writing will simply result in developers not writing comments at all.

To give some context to my argument, here is a table definition from one of my own projects. It is written in a custom abstract schema format that works well for us (but is probably not sophisticated enough for MediaWiki's use, which is why I haven't suggested it for this thread), but regardless of the format, this is the level of commenting I would hope to see for any DB schema I'm expected to work with:

-- --------------------------------------------------------
-- Table structure for tblSurveyLinks
-- --------------------------------------------------------
-- Each project can have any number of external surveys attached to it.
-- Draft projects may have zero survey links defined, and there may even be a
-- use-case for a live project with no survey links (though this is not currently
-- possible via the interface).  Most projects will have one or more links.
-- The LinkTypeID column indicates the function of the survey within the
-- context of the project.
-- If the same survey is used in multiple projects, it will appear multiple
-- times in this list.
-- --------------------------------------------------------

TABLE tblSurveyLinks

	FIELD	SurveyLinkID            AUTONUMBER
	FIELD	ProjectID               FOREIGNKEY --> tblProjects

-- This is the Survey ID for the external survey, as provided by their API.
-- The format of the ID should be considered a 'black box' as it is generated by
-- the external system and the format may change at any time.
	FIELD	ExternalSurveyID        SHORTTEXT

-- This will contain one of the SURVEY_TYPE_* constants, which define whether this
-- is a baseline survey, the control arm or one of potentially multiple intervention
-- arms.  See the definitions in Constants.php for more details.
	FIELD	LinkTypeID              INTEGER

-- The SortOrder field contains the order that the surveys are listed within the
-- interface.  It is always populated, and numbering is sequential, starting from 1.
-- This ensures the order doesn't change between page views.
	FIELD	SortOrder               INTEGER

-- The RandomisationOrder is only applicable in situations where randomisation is
-- being used, and only for the surveys that are being randomised.
-- It is populated automatically when the survey is published, to ensure that there
-- is no connection between how these items are shown on the randomisation monitoring
-- page (which just displays "Survey A" and "Survey B") and what the user sees in
-- the editing interface.  This ensures the trial remains 'blind'.
-- The field contains an integer, with numbering starting at 1 and proceeding
-- sequentially, or it will be NULL if not relevant for this link.
	FIELD	RandomisationOrder      INTEGER

	PRIMARYKEY	SurveyLinkID

END TABLE

That's an argument against code commenting in general.
I strongly disagree with the suggestion that we should limit the amount of comments a developer is allowed to provide, either explicitly or implicitly (by choice of format). I can guarantee that any recommendation that developers should add their comments somewhere other than alongside the code they are writing will simply result in developers not writing comments at all.

Well, developers should avoid writing comments as much as possible, too. I know it seems counter-intuitive, but there's a whole chapter of "Clean Code" by Uncle Bob dedicated to this. There are lots of reasons for that:

  • The code should be self-explanatory through variable/class/method names and structure (readability of the code is an important factor in its maintainability); then comments become redundant and a cognitive load without adding anything new.
  • When you have code and comments, they might get out of sync, they become misleading, they decay, and now you need to maintain two things instead of one. (If you're maintaining documentation, like what we do for schema structures at mediawiki.org, you have three things to maintain: the DDL, the comments, and the mediawiki.org documentation.)

As I said, for more information, please read the book.

This is getting somewhat off-topic, but:

The code should be self-explanatory through variable/class/method names and structure (readability of the code is an important factor in its maintainability); then comments become redundant and a cognitive load without adding anything new.

This is the common argument I hear against commenting code, but it is absolutely incorrect. The code may tell you what it does, but it doesn't tell you why it does it, or what assumptions were made when writing it. If this isn't captured in comments, then it only exists in the developer's brain... and normally only for a limited period of time.

If you rephrase it as "developers should avoid writing bad comments for the sake of it", then I'm 100% in agreement, but it's not an argument against commenting in general.

When you have code and comments, they might get out of sync, they become misleading,

When you rename variables, code might get out of sync, but you would never use that as an argument against refactoring. I read this reason as "if developers don't do their job properly, we end up with bad code". That's correct, but not a helpful observation.

I will read the book, but from your examples I think I'm going to find more that I disagree with than I agree with.

HappyDog added a comment. Edited Jun 9 2019, 11:25 AM

The code may tell you what it does, ...

Actually, responding to my own point, even that is not an argument against commenting the 'what'. The code will tell you what it does, but it won't tell you what it is supposed to do, and therefore there is nothing to cross-check the code against. Is it working as intended, or not? Good unit-test coverage can mitigate this, provided the tests are being run and that they were written by someone other than the original developer (who can very easily write tests that encapsulate the same logic/assumption errors that are in the code). However, there is still some benefit to having this cross-check in the codebase, where developers can see it.

For example:

function getDaysSinceCreation() {
    $age = $this->getSourceObject()->getAge();
    $age = $age * 2;
    return $age;
}

Contrast this with the following, which not only clearly shows where the error is, but also how to fix it:

function getDaysSinceCreation() {
    $age = $this->getSourceObject()->getAge();
    // Convert age in hours to age in days.
    $age = $age * 2;
    return $age;
}

That example is basic and a bit contrived, but it demonstrates the point. You can't rely on the code being its own documentation.

kchapman added a comment. Edited Fri, Jun 28, 4:51 AM

TechCom is hosting a meeting on this next Wednesday, July 3 2019, at 14:00 UTC/15:00 CEST/06:00 PDT (note new time)

Meeting will be rescheduled

This is one of the things this RFC can look into: move MSSQL and other proprietary databases out to extensions and leave them to the people who want to maintain them.

I've been working on making Percona available as a DB driver and doing it as an extension.

While this is a MySQL-based DB, there are some important considerations that mean just using the MySQL driver isn't possible. Some of the choices I've made are a bit hackish, but it shows that this can currently be done.

TechCom is hosting a meeting on this next Wednesday, July 3 2019, at 14:00 UTC/15:00 CEST/06:00 PDT (note new time)

14:00 UTC should translate to 16:00 CEST AFAIK.

The TechCom IRC meeting this week has been cancelled.

Thanks @Ladsgroup we are rescheduling this for another week so that @Anomie is available.

TechCom has rescheduled the IRC meeting in #wikimedia-office to July 17 2019 14:00 UTC/16:00 CEST/06:00 PDT

Lahwaacz removed a subscriber: Lahwaacz. Sat, Jul 6, 8:51 PM
saper added a comment. Edited Sun, Jul 7, 9:14 PM

Excerpt from an announcement from June 2019 about GitLab dropping MySQL support:

By providing support for both database backends (PostgreSQL and MySQL) we were unable to truly take advantage of either. Where we wanted to utilize specific performance and reliability capabilities unique to a backend, we had to instead choose the lowest common denominator. As an example (there are more), we wanted to use PostgreSQL's LATERAL JOIN for optimizing dashboards events, but couldn't because it was not available in MySQL.

I actually wish we had more possibilities to diverge, for example to use native Internet Protocol address data types for ipb_address, etc.
So abstractions would have been done up in the application layer, not just on some single storage layer.

saper added a comment. Edited Mon, Jul 8, 8:45 AM

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.

Otherwise we will end up one day with everything being varbinary or blob.

In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing. Once that happens and is well-supported rather than an afterthought, I’d be happy to re-evaluate moving less-supported DBMS into extensions or whatever system is devised for that.

I was able to get support for Percona as a selection in the installer by adding one function: Installer::addDBType().

Comments and +2 on above welcome! :)

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.
Otherwise we will end up one day with everything being varbinary or blob.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

If the application level has a concrete need for a data type that cannot be met directly by one of the available types, then we may want to look into defining such higher-level types. But as soon as the representation of such types changes the schema structurally, e.g. by representing geo-coordinates in one field on some DBs, and in two or more fields on other DBs, then we get dangerously close to writing an ORM system, with all the complexity and performance issues that this implies. I'd be very, very careful about going in that direction.

Also, mixing application-level concepts into the DB abstraction layer (e.g. ARTICLE_TEXT) sounds like a bad idea. In this example in particular, since article content may not be text - it may in fact be JSON, and should be using a completely different type internally. And the knowledge of what type to use, and how, needs to reside in the code that defines the content model.

So, in summary - it would probably be nice to support a JSON type, but not an ARTICLE_TEXT type. Things like EMAIL or GEO are tempting to split into multiple fields or even tables - but structural choices about the schema should be left to the application logic and should not be implicit in the DB abstraction. This is tempting because it looks nice and clean (see ORM), but tends to lead to scalability problems. The opportunity for optimization based on knowledge of the data type and the capabilities of the DB tends to be smaller than the opportunity to optimize on application-specific use cases and access patterns. E.g. EMAIL could be split into domain and user, with the domains normalized, and a way to query all email addresses per domain. Whether this is an advantage or just gets in the way and causes problems depends on the application - so it should be a choice of the application.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

In that context, ARTICLE_TEXT is a sensible type, as it forces you to define what that looks like and how you need to interact with it in a more targeted manner. A general JSON type wouldn't make a distinction between the quite different usage profiles you might need for ARTICLE_TEXT, CACHED_OBJECT, USER_SETTINGS, etc. even though they could all be represented as JSON (which, itself, could be represented as LONGTEXT or equivalent).

If this were for a general abstraction layer, I would build it so that the mappings are defined (or overridable at least) at the application layer, so whilst there might be a default representation of an EMAIL field as the most appropriate string type, the application could choose to use an alternative representation. Therefore I'm not even sure that this is a problem in the general case.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

I believe that stratification is a good thing. A lot of problems in the MediaWiki code base arise from the fact that we don't have any distinction between storage layer, application logic, and presentation/interaction. This makes the software extremely hard to maintain.

I'd rather manage the representation of domain entities by defining a schema on the level of a storage service than as data types in the database layer.

If this were for a general abstraction layer, I would build it so that the mappings are defined (or overridable at least) at the application layer

That's exactly what a storage service does. The database abstraction layer shouldn't know about it. Keeping the DB abstraction oblivious of the domain model makes it much easier to implement and maintain. When introducing new concepts into core, having to maintain compatibility with DB abstractions that would now all need to support the new concept would be problematic.

Ottomata added a subscriber: Ottomata. Edited Mon, Jul 15, 1:45 PM

Just came across this ticket after reading the TechCom radar email.

This sounds pretty awesome. I like the idea of JSON to specify the schemas. I wonder if this could be done using JSONSchema. If so, Analytics (and others) would be able to more easily integrate mediawiki schemas and event schemas. (We chose JSONSchema as part of an RFC last year.)

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

JSON is just a subset of YAML. We use YAML for event schemas, and put what would be table comments in the description field of the JSONSchema element, but still use code comments for developer readability.

In case useful: we are building tooling around auto versioning and dereferencing JSONSchemas. This would allow you to do things like declare a common field or subschema in a separate definitions file, and reuse it in the real schema documents. We're also planning to build CI for ensuring that the schemas conform to any conventions we choose.