
RFC: Abstract schemas and schema changes
Open, Normal, Public

Description

At some point this should be a TechCom-RFC, but at the moment it's still in the drafting stage.

Problem

MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle Database. For normal runtime queries, we have abstractions that make all of these work reasonably well.

But at the DDL level it's a completely different story. One major piece of (and source of) technical debt is the fact that MediaWiki does not have a database schema, it has four. And most schema changes have to be written five times, once for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or when not omitted are often merged without being tested.

We can improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements.

Approved solution

Create a PHP interface (or base class) for schemas and schema changes. We implement this on top of Doctrine DBAL. Schemas and schema changes will be defined in JSON files conforming to https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema, which will be read into Schema or SchemaDiff objects. DBAL will then take care of generating SQL for any supported RDBMS.
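For illustration, here is a minimal sketch of that pipeline, assuming a hypothetical JSON table definition and the Doctrine DBAL 2.x API (platform class names differ slightly between DBAL versions):

use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Platforms\MySqlPlatform;
use Doctrine\DBAL\Platforms\SqlitePlatform;

// Hypothetical abstract table definition, as it might look after json_decode().
$def = [
    'name' => 'site_stats',
    'columns' => [
        [ 'name' => 'ss_row_id', 'type' => 'integer', 'options' => [ 'unsigned' => true, 'notnull' => true ] ],
        [ 'name' => 'ss_total_edits', 'type' => 'bigint', 'options' => [ 'notnull' => false ] ],
    ],
    'primary' => [ 'ss_row_id' ],
];

// Build the abstract Schema object; no database connection is involved.
$schema = new Schema();
$table = $schema->createTable( $def['name'] );
foreach ( $def['columns'] as $col ) {
    $table->addColumn( $col['name'], $col['type'], $col['options'] );
}
$table->setPrimaryKey( $def['primary'] );

// DBAL generates the DDL per platform.
$mysqlSql = $schema->toSql( new MySqlPlatform() );   // array of CREATE TABLE statements, MySQL syntax
$sqliteSql = $schema->toSql( new SqlitePlatform() ); // same schema, SQLite syntax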

In addition, database support (in the form of a subclass of the Database base class) should be made pluggable. Extensions that want to provide support for a database backend would provide a Database subclass as well as a suitable implementation of the schema and schema change interface. This would be trivial for any database that is supported by DBAL.

Notes:

  • This means we drop support for MSSQL and Oracle from MediaWiki core, since DBAL support for them is insufficient and/or the schema for these databases has diverged from the main line schema. WMF will not continue support for these database backends. Volunteers have shown interest in bringing back support for these backends in the form of extensions.
  • For schema definitions, we go with JSON for now, but we may want to switch to YAML later for easier editing. JSON and YAML can easily be converted into one another.
  • If someone wants to introduce a schema change, they should add a new deployable file, which can contain several schema changes. Existing schema change (JSON) files should not be changed to perform additional changes.

Old proposals

Proposal #1

We should write a schema and schema change abstraction layer to integrate with MediaWiki's existing runtime database abstraction. Details are on-wiki at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements, but in short:

  • We would have one schema, expressed as a structure in a JSON file. We would have one definition of each schema change, expressed as a structure in a JSON file.
  • Database-specific classes would exist to turn the schema or schema-change into SQL statements, much as we have database-specific subclasses of Wikimedia\Rdbms\Database.
  • We'd also tighten up some of the other database-level things: limited identifier lengths, index name uniqueness, data type consistency, charset consistency, etc.

The reason we didn't go with this:

  • It's a lot of work to write a schema and schema change abstraction from scratch.

Proposal #2

Try to integrate Doctrine Migrations for schema creation and updates.

Pros (compared to Proposal #1):

  • We wouldn't have to implement all the database-specific logic ourselves.
  • Probably a larger community fixing any bugs that exist.
  • Familiar system for (some subset of) PHP developers, simplifying onboarding of devs.

The reasons we didn't go with this:

  • We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.
  • We may have to custom-implement a "mwtimestamp" type, or else standardize all DBs on using 14-byte strings.
  • We may still have to work around issues like MSSQL's different treatment of NULLs in unique indexes.


Event Timeline


I've said it before, but for work I'm really loving Laravel migrations (also DBAL, but with its own layers/subclasses around it). The class design can be found at Schema and Migrations if anyone wants some inspiration for our own solution. See the Blueprint class for where they implement their own mwtimestamp-style handling, for instance. I also like how they defined their Grammars.
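For readers unfamiliar with it, a minimal Laravel migration looks roughly like this (Laravel code for illustration only, not proposed MediaWiki code; the table and column names are made up):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class CreateExampleLinksTable extends Migration
{
    // The Blueprint closure describes the table abstractly; Laravel's grammar
    // classes emit the DBMS-specific DDL for MySQL, SQLite, Postgres, etc.
    public function up()
    {
        Schema::create( 'example_links', function ( Blueprint $table ) {
            $table->bigIncrements( 'el_id' );
            $table->unsignedBigInteger( 'el_page' )->index();
            $table->string( 'el_target', 255 );
            $table->timestamps();
        } );
    }

    // Each migration also carries its inverse, used by the rollback tooling.
    public function down()
    {
        Schema::dropIfExists( 'example_links' );
    }
}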

However, I will point out that when it comes to migrations, especially with regard to SQLite (I usually run unit tests against SQLite), it's still not fully transparent. Some migrations are simply problematic (column type changes), though it is pretty easy to conditionally modify a migration to do connector-specific things.

Lastly, I'd like to highlight some of the tooling features of Laravel migrations [artisan migrate] that I like (especially for development, test and staging) and that we should maybe consider adding to update.php long term (maybe in a separate ticket?):

  1. Tracking of which migration has run (and when) in a migrations table
  2. There is a --pretend option to dump the SQL queries that will be run (bit like our --schema option)
  3. There is a --step option to allow you to do migrations 1 by 1, instead of bulk
  4. All migrations have a rollback function that you can implement allowing you to undo your change.
  5. Easily bootstrap/make a new migration class
  6. Takes into account maintenance mode

Now I think that running migrations on Wikimedia is a lot harder than what this still relatively simple tooling allows for, but in my opinion the pre-WMF-production workflow improvements would in themselves make these features worth having.

Just my brain dump.

So I talked to @daniel yesterday and we went through the POC together. This is my proposed solution based on Doctrine DBAL:

  • Nothing in MediaWiki changes except that we replace *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)
  • We need to write some classes that turn this JSON data into Doctrine DBAL "Schema" (replacing tables.sql) or "SchemaDiff" (replacing the other SQL files) abstract objects.
  • DatabaseUpdater and the installers basically select a Doctrine DBAL platform (for example "SqlitePlatform"). The platform is responsible for turning the abstract objects into SQL commands.

Everything else stays the same, including the way MediaWiki checks the current schema to determine whether to apply a patch or not. This narrows the focus of the RFC a bit.
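For concreteness, the "SchemaDiff" half might look roughly like this with the DBAL 2.x API (a rough sketch; the code that builds the Schema objects from the JSON files is exactly the part we would still have to write):

use Doctrine\DBAL\Schema\Comparator;
use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Platforms\PostgreSqlPlatform;

// The "before" and "after" states, which would normally be built from JSON.
$before = new Schema();
$t = $before->createTable( 'actor' );
$t->addColumn( 'actor_id', 'bigint', [ 'unsigned' => true ] );

$after = new Schema();
$t = $after->createTable( 'actor' );
$t->addColumn( 'actor_id', 'bigint', [ 'unsigned' => true ] );
$t->addColumn( 'actor_name', 'binary', [ 'length' => 255 ] );

// DBAL computes the difference and emits ALTER statements for the chosen platform.
$diff = ( new Comparator() )->compare( $before, $after );
$alterStatements = $diff->toSql( new PostgreSqlPlatform() ); // e.g. [ 'ALTER TABLE actor ADD actor_name ...' ]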

daniel added a comment. Edited Mar 26 2019, 3:27 PM

One thing worth pointing out, which I learned talking to @Ladsgroup yesterday: DBAL turns an abstract schema or schema update (written in JSON) into SQL - nothing more. It doesn't need a database connection, it doesn't want to talk to any database at all, it just generates SQL statements. So it seems a good tool for specifying our schema and schema changes in a database-agnostic way, without having to change a lot of scary and brittle code.

  • Nothing in MediaWiki changes except that we replace *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

So, what is the proposed method of doing that?

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.
Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.
So, what is the proposed method of doing that?

That's a very valid concern. The thing is, the part that turns the JSON config into PHP abstract schema objects needs to be written in MediaWiki, so we have the freedom to choose whatever format we want. We could even go with YAML (the reasons I went with JSON were: 1) PHP has straightforward support for JSON, and 2) extension.json and i18n JSON files are pretty common in MediaWiki, while YAML is not used much). I'm open to ideas for this part. The way the extension registry handles comments could also be used here.

Over 50% of the lines in tables.sql are comments, and these comments are crucial documentation in understanding the schema and the basic functioning of MediaWiki. It is therefore vital that any solution preserves this inline documentation.

I'm open to ideas for this part.

The current schema has no table or column comments. And the documentation on mediawiki.org (e.g. http://www.mediawiki.org/wiki/Manual:Logging_table) is frequently out of date (like log_type/log_action). Luckily both the current system's problem and the problem you mention here have the same solution: put comments in the columns and tables themselves. There should be a place in the JSON for these, as they're part of standard CREATE TABLE statements. This should be the canonical documentation of the schema, and it would then be usable from any replicas and documentation generation scripts.
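As a hedged sketch of how that could carry through: Doctrine DBAL column definitions accept a 'comment' option (and tables a 'comment' table option), which platforms that support comments render as COMMENT clauses. The comment text below is purely illustrative.

use Doctrine\DBAL\Schema\Schema;

$schema = new Schema();
$table = $schema->createTable( 'logging' );
$table->addColumn( 'log_type', 'string', [
    'length' => 32,
    'notnull' => true,
    // Rendered as a COMMENT clause on platforms that support column comments (e.g. MySQL).
    'comment' => 'The kind of log entry, e.g. "block" or "delete".',
] );
$table->addOption( 'comment', 'Log of actions performed on the wiki.' );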

@Ladsgroup Did you check how DBAL holds up against the heap of nasty edge cases that @Anomie compiled? https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements

Also, there is the issue of long term ownership of the code. Once it's written, who maintains it? It seems to me like this would fall to the Core Platform Team. @Anomie, if @Ladsgroup implements this as proposed by him in T191231#5058412, would you be willing to do code review on that implementation, and maintain the code in the long term? If not, I may be able to take this on myself, but I have less experience with the schema updater code and all the little things that can go wrong there.

@Ladsgroup Did you check how DBAL holds up against the heap of nasty edge cases that @Anomie compiled? https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements

This is a really long list ;)
As far as I checked (which might be incomplete), DBAL either handles them or errors out. For example, in the case of identifier validity, DBAL has a method for that which each platform (DBMS) implements. For example, this is the identifier validator in Oracle, which strangely doesn't have any size limit. For the rest of the requirements, the checks can be done in the SchemaBuilder object that we are supposed to write; for example, the check that a foreign key actually points to a primary or unique column in another table can be done in SchemaBuilder.

The nice thing we can now have with DBAL is that in our CI we can test our abstract schema against all DBMSes, by trying to turn the abstract PHP objects into SQL commands on all of the supported platforms (and we could even add more DBMSes like SQLAnywhere, SQLAzure, etc. to the list) and checking whether they error out, because DBAL can catch most of the edge cases, as illustrated above.
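A rough sketch of what such a CI check could look like (the trivial example schema here stands in for the real one built from the JSON files):

use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Platforms\MySqlPlatform;
use Doctrine\DBAL\Platforms\SqlitePlatform;
use Doctrine\DBAL\Platforms\PostgreSqlPlatform;

// Stand-in for the Schema object built from the abstract JSON definition.
$schema = new Schema();
$schema->createTable( 'example' )->addColumn( 'ex_id', 'integer', [ 'unsigned' => true ] );

$platforms = [
    'mysql' => new MySqlPlatform(),
    'sqlite' => new SqlitePlatform(),
    'postgres' => new PostgreSqlPlatform(),
];

foreach ( $platforms as $name => $platform ) {
    try {
        // Throws if the definition cannot be expressed on this platform.
        $schema->toSql( $platform );
    } catch ( \Exception $e ) {
        echo "Schema is not valid for $name: " . $e->getMessage() . "\n";
        exit( 1 );
    }
}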

Data types are mapped from Doctrine data types to the DBMS-specific ones. Here is an example. By searching for "length" in the SQLite platform, you can find the rather sophisticated way length is handled.

In the discussion with @daniel and @Anomie it became apparent that it's better to drop support for Oracle and MSSQL before trying to abstract the schema, and to move that support to extensions: if someone wants to maintain the support, they should do it in an extension.

To recap. I suggest these two:

  • Drop official support for Oracle and MSSQL
  • Use Doctrine DBAL to handle SQL generation for the schema and schema changes (practically turning .sql files into .json files)

If there's no objection in the next week, I will request a last call on this.

daniel added a comment. Jun 3 2019, 8:22 AM

To recap. I suggest these two:

  • Drop official support for Oracle and MSSQL
  • Use Doctrine DBAL to handle SQL generation for the schema and schema changes (practically turning .sql files into .json files)

If there's no objection in the next week, I will request a last call on this.

For a meaningful last call, it would have to be made clear what it would take to implement Oracle and MSSQL support in an extension. From what I understand, the problem is that the schemas we currently maintain in core for Oracle and MSSQL have diverged significantly from the MySQL schema in some places. What would an extension that provides support for Oracle (or MSSQL) have to do to handle these differences?

Knowing how hard it will be to move Oracle and MSSQL to extensions will allow us to assess the implications of dropping support from core: does it just mean that someone has to write a few hundred lines of compat code in an extension, or is the task much more complex, meaning that we'd be dropping support for these databases entirely?

If we end up dropping Oracle and MSSQL entirely, we'd be using DBAL mostly to convert from MySQL to Postgres. Is that worth the effort?

These seem to be the trade-offs we should be considering.

Thanks for the comment :)

For a meaningful last call, it would have to be made clear what it would take to implement Oracle and MSSQL support in an extension. From what I understand, the problem is that the schemas we currently maintain in core for Oracle and MSSQL have diverged significantly from the MySQL schema in some places. What would an extension that provides support for Oracle (or MSSQL) have to do to handle these differences?

I would say there is no need, or at least it's not worth the effort. The Oracle and MSSQL schemas were not just divergent but actually broken for a very long time (and no one noticed).

This is my favorite example: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/480554
For more than four full release cycles, in other words two years, our schema for MSSQL added the right column to the wrong table. And no one noticed. That tells me MSSQL is not being used and is not going to be used.

Knowing how hard it will be to move Oracle and MSSQL to extensions will allow us to assess the implications of dropping support from core: does it just mean that someone has to write a few hundred lines of compat code in an extension, or is the task much more complex, meaning that we'd be dropping support for these databases entirely?

It will be a few hundred lines of code, but there is no guarantee it will work; it depends on how well Doctrine DBAL supports those databases. If the support is good enough, it should not be an issue.

If we end up dropping Oracle and MSSQL entirely, we'd be using DBAL mostly to convert from MySQL to PostGres. Is that worth the effort?

I would say it's definitely worth the effort for three reasons:

  • We still need to support SQLite as well, and doing schema changes on SQLite is pretty complex (you need to make a temporary table, put things there, and then rename that table to the main one).
  • We need to fix and remove this: https://github.com/wikimedia/mediawiki/blob/c477bcf2c5c482d3189ec3579c5dee444eb06f7d/includes/libs/rdbms/database/DatabaseSqlite.php#L898 This is technical debt that's going to bite us.
  • Having plain strings as SQL is quite dangerous. It opens the door to several types of errors, like the one I mentioned above, and to many other, more subtle ones (like forgetting /*$wgDBprefix*/, /*_*/ or /*i*/, or mixing them up). All of that can be avoided by abstracting it away into the SchemaBuilder and SchemaDiffBuilder objects. Mistakes in schemas can cause data corruption and should be handled with more care.

To recap. I suggest these two:

  • Drop official support for Oracle and MSSQL

a) It looks like it collides with https://phabricator.wikimedia.org/T113831
b) It doesn't look like it is part of Proposal #1 or #2, which is what is under discussion here.

I'm confused about this. Cheers.

a) It looks like it collides with https://phabricator.wikimedia.org/T113831

Thanks for pointing out this ticket. It gave me lots of information.

b) It doesn't look like it is part of Proposal #1 or #2, which is what is under discussion here.
I'm confused about this. Cheers.

Yes, sorry for not being more verbose.

https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements lists the requirements if we want to abstract with proposal #1, #2, or #3 (proposal three is adding support using Doctrine DBAL, practically a hybrid of proposals #1 and #2). And as you can see, both Oracle and MSSQL have very weird requirements that we would need to do gymnastics to support, and some are not even supported in Doctrine DBAL (most importantly unique indexes, which are not supported for Oracle but are supported for MSSQL). So I asked to drop support for both.

Looking back and reading T113831: Remove MSSQL support from MediaWiki core, and given that Doctrine DBAL supports the unique index requirement on MSSQL, I think it would be okay to just drop official support for Oracle (there seems to be universal support for dropping Oracle) and leave MSSQL for now. I mean Oracle won't be "officially" supported, but the code can easily be moved to an extension and the support kept up by people who want to maintain it. Another note: MSSQL has been broken for a while now (see my comment above).

Keep in mind that abstracting makes it easier to add support, or keep current support, in extensions or in core. Adding support would only be around one hundred lines of code for most DBMSes, but for some (like Oracle) you would need to patch Doctrine DBAL to add support there first before moving forward. (All of this paragraph assumes this RFC gets accepted with proposal #3.)

HTH

daniel added a comment. Jun 3 2019, 3:26 PM

a) It looks like it collides with https://phabricator.wikimedia.org/T113831

IIRC, one of the main reasons that was declined was that we didn't have a way to abstract schema migrations.

Another note: MSSQL has been broken for a while now (Look at my above comment)

As per the linked ticket, the primary maintainer of the MSSQL code appears to only target LTS releases, for a couple of reasons. Therefore this is not necessarily indicative of the maintenance status (though it's clearly a problem).

Use Doctrine DBAL to handle SQL generation for the schema and schema changes (practically turning .sql files into .json files)

I raised some objections to using JSON as the format (see T191231#5063939) which have not yet been adequately addressed.

There was a suggestion of using table/column comments, but I'm not sure this really resolves the issue. I would hope and expect to see an average of about 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table. I'm aware that the documentation is not currently at this level, but I would hope that improving that is something we all aspire to. Limiting it to the amount that can be represented in a simple string variable (which would obviously require proper escaping, too) is likely to make things worse, rather than better.

A data format that allows proper comments, of any length, please! YAML may be a good choice, or a custom JSON + comments format if YAML is too heavy.

Abstract schemas are an argument for keeping MSSQL, Oracle, and other lightly-supported DBMSes in Core, not splitting them out. Once we have an abstract schema, the supporting code to make use of it need only be written once, and that same schema can then apply everywhere.

There will likely be some bugs, but it will greatly reduce the maintenance burden as well as the burden on those submitting changes (they need only provide the one abstract patch instead of 5 different sql files).

In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing. Once that happens and is well-supported rather than an afterthought, I’d be happy to re-evaluate moving less-supported DBMS into extensions or whatever system is devised for that.

Abstract schemas are an argument for keeping MSSQL, Oracle, and other lightly-supported DBMSes in Core, not splitting them out.

You are completely right, but that's only true once an abstraction is in place. With the current system, and especially in the case of Oracle, it's either not possible to use Doctrine DBAL to build the abstraction, or it requires a great amount of developer work to write a DBAL equivalent from scratch. (See this.)

There was a suggestion of using table/column comments, but I'm not sure this really resolves the issue. I would hope and expect to see an average of about 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table. I'm aware that the documentation is not currently at this level, but I would hope that improving that is something we all aspire to. Limiting it to the amount that can be represented in a simple string variable (which would obviously require proper escaping, too) is likely to make things worse, rather than better.

I would argue that 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table, makes any sort of DDL more of a documentation piece than a DDL. What is needed is up-to-date documentation on mediawiki.org and clear links to it from the DDL.

HappyDog added a comment. Edited Jun 8 2019, 9:31 PM

I would argue that 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table, makes any sort of DDL more of a documentation piece than a DDL. What is needed is up-to-date documentation on mediawiki.org and clear links to it from the DDL.

That's an argument against code commenting in general.

I strongly disagree with the suggestion that we should limit the amount of comments a developer is allowed to provide, either explicitly or implicitly (by choice of format). I can guarantee that any recommendation that developers should add their comments somewhere other than alongside the code they are writing will simply result in developers not writing comments at all.

To give some context to my argument, here is a table definition from one of my own projects. It is written in a custom abstract schema format that works well for us (but is probably not sophisticated enough for MediaWiki's use, which is why I haven't suggested it for this thread). Regardless of the format, this is the level of commenting I would hope to see for any DB schema I'm expected to work with:

-- --------------------------------------------------------
-- Table structure for tblSurveyLinks
-- --------------------------------------------------------
-- Each project can have any number of external surveys attached to it.
-- Draft projects may have zero survey links defined, and there may even be a
-- use-case for a live project with no survey links (though this is not currently
-- possible via the interface).  Most projects will have one or more links.
-- The LinkTypeID column indicates the function of the survey within the
-- context of the project.
-- If the same survey is used in multiple projects, it will appear multiple
-- times in this list.
-- --------------------------------------------------------

TABLE tblSurveyLinks

	FIELD	SurveyLinkID            AUTONUMBER
	FIELD	ProjectID               FOREIGNKEY --> tblProjects

-- This is the Survey ID for the external survey, as provided by their API.
-- The format of the ID should be considered a 'black box' as it is generated by
-- the external system and the format may change at any time.
	FIELD	ExternalSurveyID        SHORTTEXT

-- This will contain one of the SURVEY_TYPE_* constants, which define whether this
-- is a baseline survey, the control arm or one of potentially multiple intervention
-- arms.  See the definitions in Constants.php for more details.
	FIELD	LinkTypeID              INTEGER

-- The SortOrder field contains the order that the surveys are listed within the
-- interface.  It is always populated, and numbering is sequential, starting from 1.
-- This ensures the order doesn't change between page views.
	FIELD	SortOrder               INTEGER

-- The RandomisationOrder is only applicable in situations where randomisation is
-- being used, and only for the surveys that are being randomised.
-- It is populated automatically when the survey is published, to ensure that there
-- is no connection between how these items are shown on the randomisation monitoring
-- page (which just displays "Survey A" and "Survey B") and what the user sees in
-- the editing interface.  This ensures the trial remains 'blind'.
-- The field contains an integer, with numbering starting at 1 and proceeding
-- sequentially, or it will be NULL if not relevant for this link.
	FIELD	RandomisationOrder      INTEGER

	PRIMARYKEY	SurveyLinkID

END TABLE

That's an argument against code commenting in general.
I strongly disagree with the suggestion that we should limit the amount of comments a developer is allowed to provide, either explicitly or implicitly (by choice of format). I can guarantee that any recommendation that developers should add their comments somewhere other than alongside the code they are writing will simply result in developers not writing comments at all.

Well, developers should avoid writing comments as much as possible too. I know it seems counter-intuitive, but there's a whole chapter of "Clean Code" by Uncle Bob dedicated to this. There are lots of reasons for that:

  • The code should be self-explanatory through variable/class/method names and structure (readability of the code is an important factor in its maintainability); comments then become redundant and a cognitive load without adding anything new.
  • When you have code and comments, they might get out of sync; they become misleading, they decay, and now you need to maintain two things instead of one. (If you also maintain documentation, like what we do for the schema structure on mediawiki.org, you have three things to maintain: the DDL, the comments, and the mediawiki.org documentation.)

As I said, for more information, please read the book.

This is getting somewhat off-topic, but:

The code should be self-explanatory through variable/class/method names and structure (readability of the code is an important factor in its maintainability); comments then become redundant and a cognitive load without adding anything new.

This is the common argument I hear against commenting code, but it is absolutely incorrect. The code may tell you what it does, but it doesn't tell you why it does it, or what assumptions were made when writing it. If this isn't captured in comments, then it only exists in the developer's brain... and normally only for a limited period of time.

If you rephrase it as "Developers should avoid writing bad comments for the sake of it", then I'm 100% in agreement, but that's not an argument against commenting in general.

When you have code and comments, they might get out of sync, they become misleading,

When you rename variables, code might get out of sync too, but you would never use that as an argument against refactoring. I read this reason as "if developers don't do their job properly, we end up with bad code". That's correct, but not a helpful observation.

I will read the book, but from your examples I think I'm going to find more that I disagree with than I agree with.

HappyDog added a comment. Edited Jun 9 2019, 11:25 AM

The code may tell you what it does, ...

Actually, responding to my own point, even that is not an argument against commenting the 'what'. The code will tell you what it does, but it won't tell you what it is supposed to do, and therefore there is nothing to cross-check the code against. Is it working as intended, or not? Good unit-test coverage can mitigate this, provided the tests are being run and that they were written by someone other than the original developer (who can very easily write tests that encapsulate the same logic/assumption errors that are in the code). However, there is still some benefit to having this cross-check in the codebase, where developers can see it.

For example:

function getDaysSinceCreation() {
    $age = $this->getSourceObject()->getAge();
    $age = $age * 2;
    return $age;
}

Contrast this with the following, which not only clearly shows where the error is, but also how to fix it:

function getDaysSinceCreation() {
    $age = $this->getSourceObject()->getAge();
    // Convert age in hours to age in days.
    $age = $age * 2;
    return $age;
}

That example is basic and a bit contrived, but it demonstrates the point. You can't rely on the code being its own documentation.

kchapman added a comment. Edited Jun 28 2019, 4:51 AM

TechCom is hosting a meeting on this next Wednesday, July 3 2019, at 14:00 UTC/15:00 CEST/06:00 PDT (note new time)

Meeting will be rescheduled

This is one of the things this RFC can look into: moving MSSQL and other proprietary databases out to extensions and leaving them to the people who want to maintain them.

I've been working on making Percona available as a DB driver and doing it as an extension.

While this is a MySQL-based DB, there are some important considerations that mean just using the MySQL driver isn't possible. Some of the choices I've made are a bit hackish, but it shows that this can currently be done.

TechCom is hosting a meeting on this next Wednesday, July 3 2019, at 14:00 UTC/15:00 CEST/06:00 PDT (note new time)

14:00 UTC should translate to 16:00 CEST AFAIK.

The TechCom IRC meeting this week has been cancelled.

Thanks @Ladsgroup we are rescheduling this for another week so that @Anomie is available.

TechCom has rescheduled the IRC meeting in #wikimedia-office to July 17 2019 14:00 UTC/16:00 CEST/06:00 PDT

Lahwaacz removed a subscriber: Lahwaacz. Jul 6 2019, 8:51 PM
saper added a comment. Edited Jul 7 2019, 9:14 PM

Excerpt from an announcement from June 2019 about Gitlab dropping MySQL support:

By providing support for both database backends (PostgreSQL and MySQL) we were unable to truly take advantage of either. Where we wanted to utilize specific performance and reliability capabilities unique to a backend, we had to instead choose the lowest common denominator. As an example (there are more), we wanted to use PostgreSQL's LATERAL JOIN for optimizing dashboards events, but couldn't because it was not available in MySQL.

I actually wish we had more possibilities to diverge, for example to use native Internet Protocol address data types for ipb_address etc...
So abstractions would have been done up in the application layer, not just on some single storage layer.

saper added a comment. Edited Jul 8 2019, 8:45 AM

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.

Otherwise we will end up one day with everything being varbinary or blob.

In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing. Once that happens and is well-supported rather than an afterthought, I’d be happy to re-evaluate moving less-supported DBMS into extensions or whatever system is devised for that.

I was able to get support for Percona as a selection in the installer by adding one function: Installer::addDBType().

Comments and +2 on above welcome! :)

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.
Otherwise we will end up one day with everything being varbinary or blob.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

If the application level has a concrete need for a data type that cannot be met directly by one of the available types, then we may want to look into defining such higher-level types. But as soon as the representation of such types changes the schema structurally, e.g. by representing geo-coordinates in one field on some DBs, and in two or more fields on other DBs, then we get dangerously close to writing an ORM system, with all the complexity and performance issues that this implies. I'd be very, very careful about going in that direction.

Also, mixing application-level concepts into the DB abstraction layer (e.g. ARTICLE_TEXT) sounds like a bad idea. In this example in particular, article content may not be text - it may in fact be JSON, and should be using a completely different type internally. And the knowledge of what type to use, and how, needs to reside in the code that defines the content model.

So, in summary - it would probably be nice to support a JSON type, but not an ARTICLE_TEXT type. Things like EMAIL or GEO are tempting to split into multiple fields or even tables - but structural choices about the schema should be left to the application logic and should not be implicit in the DB abstraction. This is tempting because it looks nice and clean (see ORM), but tends to lead to scalability problems. The opportunity for optimization based on knowledge of the data type and the capabilities of the DB tends to be smaller than the opportunity to optimize on application-specific use cases and access patterns. E.g. EMAIL could be split into domain and user, with the domains normalized, and a way to query all email addresses per domain. Whether this is an advantage or just gets in the way and causes problems depends on the application - so it should be a choice of the application.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

In that context, ARTICLE_TEXT is a sensible type, as it forces you to define what that looks like and how you need to interact with it in a more targeted manner. A general JSON type wouldn't make a distinction between the quite different usage profiles you might need for ARTICLE_TEXT, CACHED_OBJECT, USER_SETTINGS, etc. even though they could all be represented as JSON (which, itself, could be represented as LONGTEXT or equivalent).

If this were for a general abstraction layer, I would build it so that the mappings are defined (or overridable at least) at the application layer, so whilst there might be a default representation of an EMAIL field as the most appropriate string type, the application could choose to use an alternative representation. Therefore I'm not even sure that this is a problem in the general case.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

I believe that stratification is a good thing. A lot of problems in the MediaWiki code base arise from the fact that we don't have any distinction between storage layer, application logic, and presentation/interaction. This makes the software extremely hard to maintain.

I'd rather manage the representation of domain entities by defining a schema on the level of a storage service than as data types in the database layer.

If this were for a general abstraction layer, I would build it so that the mappings are defined (or overridable at least) at the application layer

That's exactly what a storage service does. The database abstraction layer shouldn't know about it. Keeping the DB abstraction oblivious of the domain model makes it much easier to implement and maintain. When introducing new concepts into core, having to maintain compatibility with DB abstractions that would now all need to support the new concept would be problematic.

Ottomata added a subscriber: Ottomata. Edited Jul 15 2019, 1:45 PM

Just came across this ticket after reading the TechCom radar email.

This sounds pretty awesome. I like the idea of JSON to specify the schemas. I wonder if this could be done using JSONSchema. If so, Analytics (and others) would be able to more easily integrate mediawiki schemas and event schemas. (We chose JSONSchema as part of an RFC last year.)

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

JSON is just a subset of YAML. We use YAML for event schemas, and put what would be table comments in the description field of the JSONSchema element, but still use code comments for developer readability.

In case useful: we are building tooling around auto versioning and dereferencing JSONSchemas. This would allow you to do things like declare a common field or subschema in a separate definitions file, and reuse it in the real schema documents. We're also planning to build CI for ensuring that the schemas conform to any conventions we choose.

  1. Tracking of which migration has run (and when) in a migrations table

In general, the model MediaWiki has followed in the past is to detect whether a particular change is needed (e.g. column is missing => add it) rather than trying to track whether things have been run. But we do use the "has been run" model for many maintenance scripts. We've run into issues both ways.

  1. There is a --step option to allow you to do migrations 1 by 1, instead of bulk

I'm not sure just how much benefit there would be in this one, particularly since the changes aren't likely to be strictly ordered when it comes to extensions.

  1. All migrations have a rollback function that you can implement allowing you to undo your change.

This likely means writing each change twice (forward and backward),[1] and generally only using one of them. Personally I'd not bother.

[1]: There may be some really simple cases where that could be done automatically, but usually you'd want to repopulate data in one of the two directions.

  • Nothing in MediaWiki changes except that we replace *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)

I don't think this will be sufficient. This sounds like only the "updates" key from what I listed at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema#Schema_change_format when drafting this RFC.

Here's a use case to consider: Because making schema changes in the past has historically been such a pain, gerrit:357892 added just one patch file for MySQL, which has since caused problems like T227662. But not for PostgreSQL, since the syntax of that file makes per-field updates much easier to write. Ideally we'll have that property for all changes in the future, and one of the ways I proposed doing that was to define the format such that we can easily bundle multiple check+update sets into one "patch" file.

Everything else stays the same, including the way MediaWiki checks the current schema to determine whether to apply a patch or not. This narrows the focus of the RFC a bit.

It would narrow the scope to the point where the RFC doesn't actually accomplish the things that it is intended to accomplish, unfortunately. We don't want to have to write the logic to check whether an update is needed multiple times, just as we don't want to write each update's SQL multiple times.

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

We could easily enough use MediaWiki's FormatJson::stripComments() or something very like it to work around that problem.
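A minimal sketch of that workaround, assuming the schema lives in a file named tables.json (name hypothetical); FormatJson is MediaWiki's existing JSON helper class:

// Strip /* ... */ and // comments, then decode as an associative array.
$raw = file_get_contents( __DIR__ . '/tables.json' );
$schemaData = FormatJson::decode( FormatJson::stripComments( $raw ), true );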

The current schema has no table or column comments. [...] Luckily both the current system's problem and the problem you mention here have the same solution: put comments in the columns and tables themselves. There should be a place in the JSON for these, as they're part of standard create table statements.

I note that https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema#Schema_format does include that. I'm not entirely sure it's a good idea, though, since it would mean that updates to the documentation would also require schema changes.

The nice thing we can now have with DBAL is that in our CI we can test our abstract schema against all DBMSes, by trying to turn the abstract PHP objects into SQL commands on all of the supported platforms (and we could even add more DBMSes like SQLAnywhere, SQLAzure, etc. to the list) and checking whether they error out, because DBAL can catch most of the edge cases, as illustrated above.

Note that we want to enforce limits in the abstract, not just for the specific database. Personally I'd prefer that limits are defined and tested for explicitly, rather than being a union of requirements applied by whichever set of DBAL backends happens to be in use.

Speaking of CI, I'd also like to have a way to load a schema file from an old version of MediaWiki, apply the updates to it, and then extract the schema and test whether it's equivalent to the data in the current schema file. And ideally do that entirely on the abstract level, rather than translating it into some DB's SQL and back.

I would hope and expect to see an average of about 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table. I'm aware that the documentation is not currently at this level, but I would hope that improving that is something we all aspire to.

That seems excessive to me. While in some cases a field may need 5 lines of documentation, most likely need much less. Same for tables.

A data format that allows proper comments, of any length, please! YAML may be a good choice, or a custom JSON + comments format if YAML is too heavy.

Parsing JSON in PHP is a fairly standard thing to do, while PHP's YAML extension is not bundled with PHP by default.

I actually wish we had more possibilities to diverge, for example to use native Internet Protocol address data types for ipb_address etc...
So abstractions would have been done up in the application layer, not just on some single storage layer.

In general, you can either support multiple databases or you can require just one. MediaWiki has decided to support multiple, and changing that to support only MySQL is outside the scope of this RFC.

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.

The problem in T203850 is that the current MySQL schema doesn't differentiate between storage of UTF-8 text and storage of binary data, while others do and therefore cause problems when you try to shove binary data into a Unicode text field.

While some JSON abstraction at the IDatabase level probably would have avoided that specific issue, it doesn't touch the underlying problem and provides no other real benefit.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

I disagree with both of your alternatives. We should define types with useful properties while leaving it up to the database abstraction layer to choose a specific type that satisfies our requirements. The FoobarDB implementation might have to use something other than "INT" to represent our integer type, and that's ok. But on the other hand, there's likely no benefit at the database abstraction level to trying to separate "email" and other short strings, or "article text" and other long strings. Even if some database has an "email" type (or an "ip" type), it seems unlikely that we'd be able to really make use of any added features as long as we do want to support more than just one database.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

Note that MediaWiki's database abstraction code is in includes/libs/, which implies that we hope to someday publish it as a separate library that other projects can use as well.

I wonder if this could be done using JSONSchema.

Not very likely. JSONSchema is about defining the structure of a JSON-based data structure. Here we're talking about an SQL DDL, and only storing the data structure as JSON.

We might create a JSONSchema to define the structure of our JSON files that in turn define the SQL DDL, but that transitive relationship seems unlikely to be useful for the use cases you're thinking of.

In a private email, @Ladsgroup wrote:

Also it's possible to make the schema change abstraction pluggable, so in core we use Doctrine DBAL and in an Oracle extension they would need to write something like that from scratch.

I like that idea. If we wrap Doctrine DBAL in our own interface so the implementation could be easily changed if we find shortcomings at some point, I'd have much less concern about it. That makes the question of whether we use DBAL or not somewhat irrelevant since MediaWiki won't directly depend on it. We can use it for the initial implementation, and if it turns out to be too much trouble to work around we can change the implementation without having to redo a bunch of the rest of the work.
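A possible shape for that MediaWiki-owned wrapper, purely as a sketch (the interface name and methods are hypothetical, not an agreed API):

/**
 * Hypothetical interface that core would code against. The default
 * implementation would delegate to Doctrine DBAL; an extension (e.g. for
 * Oracle) could ship its own implementation instead.
 */
interface SchemaBuilder {
    /**
     * @param array $abstractSchema Decoded contents of a schema JSON file.
     * @return string[] CREATE TABLE statements for the active database type.
     */
    public function getCreateTableSql( array $abstractSchema );

    /**
     * @param array $abstractChange Decoded contents of a schema change JSON file.
     * @return string[] ALTER/CREATE/DROP statements for the active database type.
     */
    public function getSchemaChangeSql( array $abstractChange );
}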

I wonder if this could be done using JSONSchema.

Not very likely. JSONSchema is about defining the structure of a JSON-based data structure. Here we're talking about an SQL DDL, and only storing the data structure as JSON.

Yeah, I can't quite imagine how an ALTER part of a migration would look in JSONSchema. Maybe what I'm envisioning isn't so much about migrations, but about abstract table schemas. We use JSONSchema to create (and auto migrate) Hive tables. Right now we do this with some custom Spark schema glue, but might one day do a similar thing with Kafka Connect, which itself integrates with many source and sink databases.

Having MediaWiki tables represented as JSONSchemas could potentially ease integration of MediaWiki data into OLAP (and other) systems. Everyone would love it if we could convert the regular XML dumps into JSON. If there were JSONSchemas describing MW tables, the records in a JSON MW dump would conform to them.

On the other hand, supporting abstract migrations between RDBMSes is hard enough; adding OLAP constraints and whatnot in there is crazy :p Just some food for thought I guess, carry on! :)

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - […]
Otherwise we will end up one day with everything being varbinary or blob.

For cases where the application has no shared use case at all other than to store and retrieve binary data, that seems appropriate. Specifying it any more narrowly only costs maintenance time to specify, maintain and migrate as its incidental format changes over time, plus the run-time overhead of constraining it - at no benefit.

No rule is perfect for everything of course, and that includes this one.

Having Mediawiki tables represented as JSONSchemas could potentially ease integration of Mediawiki data into OLAP (and other) systems.

Defining fields and their types is only one aspect of defining the database schema. Another critical aspect is defining indexes, collations, etc. As far as I understand, such things have no place in JSONSchemas.

Perhaps the inverse could be possible - generate JSONSchema from the "DB schema JSON".

Everyone would love if we could convert the regular XML dumps into JSON. If there were JSONSchemas describing MW tables, the records in a JSON MW dump would conform to them.

I doubt this would be possible, and if it was, I would strongly recommend against it. We would be exposing a lot of internal detail, making the dumps hard to use and unstable. XML and JSON are document-style formats that make use of nesting for context. Relational databases do not support that; they use relation tables instead. The two approaches are different enough to make it extremely annoying to represent one using the other (it's technically possible, but really annoying). The idea of the dump format is that it follows our conceptual data model and is independent of the database schema.

Dumps that basically represent the table structure are only nice when importing into the *same* table structure. When trying to import into a different structure (say, a different version of MW) or when doing stream processing, it's a real pain.

With the current format, you get a page as one "thing", with all revisions nested into that "thing", with the meta-data and content nested into that. That makes it really easy to process the dump as a stream of pages or of revisions, with all the meta-data and content immediately available.

With a table-oriented dump structure, you'd get a stream of pages, a stream of revision metadata, then a stream of slot associations, a stream of content metadata, and then a stream of blobs, a stream of content model names, etc. You'd have to write each stream into a database so you can then associate the different entities with each other. That would be impractical for most use cases.

<snip>

With a table-oriented dump structure, you'd get a stream of pages, a stream of revision metadata, then a stream of slot associations, a stream of content metadata, and then a stream of blobs, a stream of content model names, etc. You'd have to write each stream into a database so you can then associate the different entities with each other. That would be impractical for most use cases.

I have to agree with Daniel here. If you want something easy to import into a database, you use a tool to generate SQL (or whatever you need) from the XML. Similarly, if you want JSON-formatted dumps with the same sort of structure as the existing ones, then a conversion tool ought to be developed for that. That structure is very useful, so I do want to keep it; per-page processing of the dumps is quite common in bot land. But maybe we could talk about JSON format needs and use cases on another ticket.

Another critical aspect is defining indexes, collations, etc. As far as I understand, such things have no place in JSONSchemas.

They could! JSONSchema doesn't just define structure, it can also define other metadata, like allowed formats, etc. We plan to implement a custom JSONSchema extension to allow for annotating fields as either dimension (low cardinality, like Prometheus labels) or measure (values, e.g. page load timings) for ingestion into Druid. The same could be done for indexes, etc.

XML and JSON are document style formats that make use of nesting for context.

XML does this, but JSON doesn't necessarily. This is actually a big difficulty in parsing the XML dump; it is impossible to know the full context of a revision without parsing an entire XML page document. The data is large enough now that parsing the full dump can be pretty slow or difficult in a non-distributed context, and XML doesn't distribute well. We wouldn't want a JSON dump as an array of revision objects for every page; each DB record would be its own JSON object, i.e. JSON Lines.

Anyway, yeah, I get your point, you are right. A JSON version of a MySQL dump isn't more useful than a plain ol' SQL dump. Analytics plans to release (possibly) JSON dumps of the MediaWiki History data set, which will be much more useful.

I mostly started commenting about JSONSchema at all because we have tools that can use JSONSchemas to load from and into various data stores. Having a JSONSchema for Mediawiki databases might make some future integration tasks simpler.

Ladsgroup updated the task description. Jul 17 2019, 2:27 PM
Ladsgroup moved this task from Request IRC meeting to Inbox on the TechCom-RFC board.

I updated the description per the discussion in the TechCom IRC meeting and we agreed to move this to last call.

Having a JSONSchema for Mediawiki databases might make some future integration tasks simpler.

I can well imagine that being useful, and it would probably be easy enough to generate JSONSchema from the DB specs, just like we generate SQL. Maybe there could even be a JSONschema backend for DBAL. Not sure that's the best way to do this, though.

Ladsgroup updated the task description. Jul 18 2019, 12:57 PM
daniel updated the task description. Jul 18 2019, 1:53 PM

I went over the proposal to clarify it a bit more.

This RFC was discussed on IRC on July 18. Several points have been discussed and clarified, which are now reflected by the task description. There was consensus that this RFC was ready to go on last call with the updated description. It will be reviewed at the next TechCom meeting.

IRC meeting minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2019/wikimedia-office.2019-07-17-13.05.html
Full Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2019/wikimedia-office.2019-07-17-13.05.log.html

I'm a bit confused here, because the updated task description seems to collide (again) with https://phabricator.wikimedia.org/T113831

Regarding what Skizzerz said: "In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing." - it still looks like a good point to consider.

I can't access IRC logs behind my company's firewall, so please ignore this comment if I'm missing something.

daniel added a subscriber: Hexmode. Jul 18 2019, 4:30 PM

I'm a bit confused here, because the updated task description seems to collide (again) with https://phabricator.wikimedia.org/T113831

That is correct. The main reason that was declined was the fact that it would require an abstraction layer for schema changes, which would be provided per this RFC.

Regarding what Skizzerz said: "In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing." it still looks like a good point to consider.
I can't access IRC logs behind my company's firewall, so please ignore this comment if I'm missing something.

This was briefly discussed during the meeting. I haven't looked at the code, but @Hexmode apparently made a proof of concept for that, and was able to make it work without too much trouble.

You are right that there should be a ticket about making DB support pluggable. The creation of the abstraction layer isn't blocked on that, but the ability to provide DB support in extensions is, so it should at least be mentioned here.

I'm a bit confused here, because the updated task description seems to collide (again) with https://phabricator.wikimedia.org/T113831

Effectively we revisited T113831 in light of the abstract schema changes proposal here and decided that it's now feasible and desirable to do so.

Regarding what Skizzerz said: "In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing." it still looks like a good point to consider.

We're planning on doing the installer bit, and from private emails it sounds like Skizzerz is on board with that plan. As a rough straw proposal, we could have the installer detect database-providing extensions (via a flag in extension.json) and load them early for selection at the database setup stage of the installer. The one selected would continue to be loaded from then on, even before LocalSettings.php is generated.

This was briefly discussed during the meeting. I haven't looked at the code, but @Hexmode apparently made a proof of concept for that, and was able to make it work without too much trouble.

Note I gave that patch a -1 because it seems to depend on what we decided against in T467: RfC: Extension management with Composer.

Regarding what Skizzerz said: "In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing." it still looks like a good point to consider.

We're planning on doing the installer bit, and from private emails it sounds like Skizzerz is on board with that plan. As a rough straw proposal, we could have the installer detect database-providing extensions (via a flag in extension.json) and load them early for selection at the database setup stage of the installer. The one selected would continue to be loaded from then on, even before LocalSettings.php is generated.

Yep. Cindy and I were discussing this via email, and installer support is planned as part of these changes. As such, I see no downsides to pulling support out of core and into an extension. In case it's relevant, Cindy has my full permission to quote any part of my private email reply in here or other public spaces.

Looking into extension.json and exposing DB-related extensions in the installer seems workable to me. I presume the interface would be some sort of abstract class that the extension extends (and provides a reference to via AutoloadClasses), filling in protected methods for the functionality it needs, like setting up or upgrading tables and exposing UI options in the installer. It seems like such an interface would be more foolproof than having hooks for all of those components.
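Purely as a hypothetical sketch of what such a pluggable database-type base class might look like (names and method list are illustrative, not an agreed API):

/**
 * Hypothetical base class a database-providing extension would subclass and
 * reference from its extension.json.
 */
abstract class DatabaseTypeProvider {

    /** Database type key as used by $wgDBtype, e.g. 'percona'. */
    abstract public function getName();

    /** Name of the Wikimedia\Rdbms\Database subclass implementing the runtime connection. */
    abstract public function getDatabaseClass();

    /** Generate DDL for an abstract schema or schema change, e.g. via Doctrine DBAL. */
    abstract protected function getSql( array $abstractDefinition );

    /** Extra installer form fields (host, port, credentials, etc.), if any. */
    protected function getInstallerFields() {
        return [];
    }
}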

daniel moved this task from Inbox to Last Call on the TechCom-RFC board. Jul 26 2019, 11:06 AM

Per the TechCom meeting on July 24, this RFC goes on Last Call for being approved. If no objections remain unaddressed by August 7, the RFC will be approved as proposed and amended.

If you care about this RFC, please comment - in support, or raising concerns. The Last Call period is not just for raising objections, but also for confirming consensus.

Tgr awarded a token. Aug 1 2019, 11:39 AM
Tgr added a subscriber: Tgr. Aug 1 2019, 11:44 AM
  • Existing schema change (json) files should not be changed to perform additional changes. - there will still be a main file representing the current schema (like tables.sql in core and extension_name.sql in most extensions) which is updated continuously, and used in the installer, right?
  • Btw. tables.sql and co. are valuable as easy-to-understand documentation of the current schema; will that be preserved in some way?
  • Not all schema changes can be represented in an abstract language (e.g. because you need to do two ALTERs and some DML to move the data in between); does the new system allow using manual SQL? (I guess the answer is yes, because worst case, you could write a custom update script?)
  • Existing schema change (json) files should not be changed to perform additional changes. - there will still be a main file representing the current schema (like tables.sql in core and extension_name.sql in most extensions) which is updated continuously, and used in the installer, right?

Right - JSON files that represent a schema change should not be grown to represent more changes. JSON files that represent a schema should be updated to reflect the current desired state.

  • Btw. tables.sql and co. are valuable as easy-to-understand documentation of the current schema; will that be preserved in some way?

Good point - I'd say we could generate them as documentation. Not sure if they should be checked in, though. If yes, they should go into docs/

  • Not all schema changes can be represented in an abstract language (e.g. because you need to do two ALTERs and some DML to move the data in between); does the new system allow using manual SQL? (I guess the answer is yes, because worst case, you could write a custom update script?)

It seems to me like allowing SQL in the JSON files would defeat the purpose. The solution for complex migrations would indeed be to write a maintenance script that does the data shoveling. This is probably better anyway, since it allows the shoveling to be done in batches; a rough sketch of that pattern is below.
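Roughly what such a batched data-shoveling maintenance script looks like in MediaWiki (a sketch in the style of LoggedUpdateMaintenance; the table and column names are made up):

<?php
require_once __DIR__ . '/Maintenance.php';

class MigrateFooData extends LoggedUpdateMaintenance {
    public function __construct() {
        parent::__construct();
        $this->addDescription( 'Copy data from foo_old into foo_new in batches' );
        $this->setBatchSize( 500 );
    }

    protected function getUpdateKey() {
        // Recorded in the updatelog table so the script only runs once.
        return __CLASS__;
    }

    protected function doDBUpdates() {
        $dbw = $this->getDB( DB_MASTER );
        $lastId = 0;
        do {
            // Fetch the next batch of rows, ordered by primary key.
            $res = $dbw->select(
                'foo',
                [ 'foo_id', 'foo_old' ],
                [ 'foo_id > ' . $dbw->addQuotes( $lastId ) ],
                __METHOD__,
                [ 'ORDER BY' => 'foo_id', 'LIMIT' => $this->getBatchSize() ]
            );
            foreach ( $res as $row ) {
                $dbw->update(
                    'foo',
                    [ 'foo_new' => $row->foo_old ],
                    [ 'foo_id' => $row->foo_id ],
                    __METHOD__
                );
                $lastId = (int)$row->foo_id;
            }
            // Let replication catch up between batches.
            wfWaitForSlaves();
        } while ( $res->numRows() === $this->getBatchSize() );

        return true;
    }
}

$maintClass = MigrateFooData::class;
require_once RUN_MAINTENANCE_IF_MAIN;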

Tgr added a comment. Aug 1 2019, 12:18 PM

It seems to me like allowing SQL in the JSON files would defeat the purpose.

It can't be done in the JSON files because the SQL would have to be different per engine. There could be some hook system, maybe. I agree that using maintenance scripts is probably better, though.

It seems to me like allowing SQL in the JSON files would defeat the purpose.

It can't be done in the JSON files because the SQL would have to be different per engine. There could be some hook system, maybe. I agree that using maintenance scripts is probably better, though.

I would be very much against using hooks or anything except maintenance scripts in this case, because, for example, there is a plan to move support for Oracle and MSSQL to extensions and let volunteers maintain them, and those maintainers should not be responsible for adding support for every individual schema change or any data migration it would need.

This RFC has been approved as proposed per the TechCom meeting on 2019-08-07. @Tgr's comment appears to have been addressed.