
RFC: Abstract schemas and schema changes
Open, Medium, Public

Description

At some point this should be a TechCom-RFC, but at the moment it's still in the drafting stage.

Problem

MediaWiki claims to support five databases: MySQL/MariaDB, SQLite, PostgreSQL ("PG"), Microsoft SQL Server ("MSSQL"), and Oracle Database. For normal runtime-type queries, we have abstractions that make these all mostly work pretty well.

But at the DDL level it's a completely different story. One major piece of (and source of) technical debt is the fact that MediaWiki does not have a database schema, it has four. And most schema changes have to be written five times, once for each supported database. In practice, this means schema changes for the less-supported databases are often omitted, or when not omitted are often merged without being tested.

We can improve the situation by abstracting the schema and schema change definitions, with code per database to translate that into the actual DDL statements.

Approved solution

Create a PHP interface (or base class) for schemas and schema changes. We implement this on top of Doctrine DBAL. Schemas and schema changes will be defined in JSON files conforming to https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema, which will be read into Schema or SchemaDiff objects. DBAL will then take care of generating SQL for any supported RDBMS.
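For illustration only, here is a minimal sketch (not the actual MediaWiki implementation) of how Doctrine DBAL can turn one abstract table definition into per-engine DDL. The table, columns and index are made up, the platform class names are DBAL 2.x, and DBAL is assumed to be installed via Composer:

```php
<?php
// Sketch: one abstract table definition rendered as DDL for several engines.
use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Platforms\MySqlPlatform;
use Doctrine\DBAL\Platforms\PostgreSqlPlatform;
use Doctrine\DBAL\Platforms\SqlitePlatform;

require_once __DIR__ . '/vendor/autoload.php';

$schema = new Schema();
$table = $schema->createTable( 'example_page' );
$table->addColumn( 'ep_id', 'integer', [ 'unsigned' => true, 'autoincrement' => true ] );
$table->addColumn( 'ep_title', 'binary', [ 'length' => 255, 'notnull' => true ] );
$table->setPrimaryKey( [ 'ep_id' ] );
$table->addUniqueIndex( [ 'ep_title' ], 'ep_title' );

// The same abstract definition yields engine-specific CREATE TABLE statements.
foreach ( [ new MySqlPlatform(), new PostgreSqlPlatform(), new SqlitePlatform() ] as $platform ) {
	foreach ( $schema->toSql( $platform ) as $statement ) {
		echo $statement, ";\n";
	}
}
```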

In addition, database support (in the form of a subclass of the Database base class) should be made pluggable. Extensions that want to provide support for a database backend would provide a Database subclass as well as a suitable implementation of the schema and schema change interface. This would be trivial for any database that is supported by DBAL.

Notes:

  • This means we drop support for MSSQL and Oracle RDBMS from MediaWiki core, since DBAL support for them is insufficient and/or the schema for these databases has diverged from the main line schema. WMF will not continue support for these database backends. Volunteers have shown interest in bringing back support for these backends in the form of extensions.
  • For schema definitions, we go with JSON for now, but we may want to switch to YAML later, for easier editing. JSON and YAML can easily be converted into one another.
  • If someone wants to introduce a schema change, a new file should be added, which can contain several schema changes. Existing schema change (JSON) files should not be modified to perform additional changes.

Old proposals

Proposal #1

We should write a schema and schema change abstraction layer to integrate with MediaWiki's existing runtime database abstraction. Details are on-wiki at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema and https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements, but in short:

  • We would have one schema, expressed as a structure in a JSON file. We would have one definition of each schema change, expressed as a structure in a JSON file.
  • Database-specific classes would exist to turn the schema or schema-change into SQL statements, much as we have database-specific subclasses of Wikimedia\Rdbms\Database.
  • We'd also tighten up some of the other database-level things: limited identifier lengths, index name uniqueness, data type consistency, charset consistency, etc.

The reason we didn't go with this:

  • It's lots of work to write a schema and schema change abstraction from scratch.

Proposal #2

Try to integrate Doctrine Migrations for schema creation and updates.

Pros (compared to Proposal #1):

  • We wouldn't have to implement all the database-specific logic ourselves.
  • Probably a larger community fixing any bugs that exist.
  • Familiar system for (some subset of) PHP developers, simplifying onboarding of devs.

The reasons we didn't go with this:

  • We'd have to have code to translate MediaWiki's DB connection info to Doctrine's format, and otherwise translate between Doctrine conventions and MediaWiki conventions.
  • We may have to custom-implement a "mwtimestamp" type, or else standardize all DBs on using 14-byte strings.
  • We may still have to work around issues like MSSQL's different treatment of NULLs in unique indexes.

Event Timeline

There are a very large number of changes, so older changes are hidden.
saper added a comment. Edited Jul 8 2019, 8:45 AM

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.

Otherwise we will end up one day with everything being varbinary or blob.

In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing. Once that happens and is well-supported rather than an afterthought, I’d be happy to re-evaluate moving less-supported DBMS into extensions or whatever system is devised for that.

I was able to get support for Percona as a selection in the installer by adding one function: Installer::addDBType().

Comments and +2 on above welcome! :)

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.

Otherwise we will end up one day with everything being varbinary or blob.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

If the application level has a concrete need for a data type that cannot be met directly by one of the available types, then we may want to look into defining such higher-level types. But as soon as the representation of such types changes the schema structurally, e.g. by representing geo-coordinates in one field on some DBs and in two or more fields on other DBs, then we get dangerously close to writing an ORM system, with all the complexity and performance issues that this implies. I'd be very, very careful about going in that direction.

Also, mixing application-level concepts into the DB abstraction layer (e.g. ARTICLE_TEXT) sounds like a bad idea. In this example in particular, since article content may not be text - it may in fact be JSON, and should be using a completely different type internally. And the knowledge of what type to use, and how, needs to reside in the code that defines the content model.

So, in summary - it would probably be nice to support a JSON type, but not an ARTICLE_TEXT type. Things like EMAIL or GEO are tempting to split into multiple fields or even tables - but structural choices about the schema should be left to the application logic and should not be implicit in the DB abstraction. This is tempting because it looks nice and clean (see ORM), but tends to lead to scalability problems. The opportunity for optimization based on knowledge of the data type and the capabilities of the DB tends to be less than the opportunity to optimize on application-specific use cases and access patterns. E.g. EMAIL could be split into domain and user, with the domains normalized, and a way to query all email addresses per domain. Whether this is an advantage or just gets in the way and causes problems depends on the application - so it should be a choice of the application.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

In that context, ARTICLE_TEXT is a sensible type, as it forces you to define what that looks like and how you need to interact with it in a more targeted manner. A general JSON type wouldn't make a distinction between the quite different usage profiles you might need for ARTICLE_TEXT, CACHED_OBJECT, USER_SETTINGS, etc., even though they could all be represented as JSON (which, itself, could be represented as LONGTEXT or equivalent).

If this were for a general abstraction layer, I would build it so that the mappings are defined (or overridable at least) at the application layer, so whilst there might be a default representation of an EMAIL field as the most appropriate string type, the application could choose to use an alternative representation. Therefore I'm not even sure that this is a problem in the general case.

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

I believe that stratification is a good thing. A lot of problems in the MediaWiki code base arise from the fact that we don't have any distinction between storage layer, application logic, and presentation/interaction. This makes the software extremely hard to maintain.

I'd rather manage the representation of domain entities by defining a schema on the level of a storage service, than as data types in the database layer.

If this were for a general abstraction layer, I would build it so that the mappings are defined (or overridable at least) at the application layer

That's exactly what a storage service does. The database abstraction layer shouldn't know about it. Keeping the DB abstraction oblivious of the domain model makes it much easier to implement and maintain. When introducing new concepts into core, having to maintain compatibility with DB abstractions that would now all need to support the new concept would be problematic.

Ottomata added a subscriber: Ottomata. Edited Jul 15 2019, 1:45 PM

Just came across this ticket after reading the TechCom radar email.

This sounds pretty awesome. I like the idea of JSON to specify the schemas. I wonder if this could be done using JSONSchema. If so, Analytics (and others) would be able to more easily integrate mediawiki schemas and event schemas. (We chose JSONSchema as part of an RFC last year.)

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

JSON is just a subset of YAML. We use YAML for event schemas, and put what would be table comments in the description field of the JSONSchema element, but still use code comments for developer readability.

In case useful: we are building tooling around auto versioning and dereferencing JSONSchemas. This would allow you to do things like declare a common field or subschema in a separate definitions file, and reuse it in the real schema documents. We're also planning to build CI for ensuring that the schemas conform to any conventions we choose.

  1. Tracking of which migration has run (and when) in a migrations table

In general, the model MediaWiki has followed in the past is to detect whether a particular change is needed (e.g. column is missing => add it) rather than trying to track whether things have been run. But we do use the "has been run" model for many maintenance scripts. We've run into issues both ways.

  2. There is a --step option to allow you to do migrations 1 by 1, instead of bulk

I'm not sure just how much benefit there would be in this one, particularly since the changes aren't likely to be strictly ordered when it comes to extensions.

  3. All migrations have a rollback function that you can implement allowing you to undo your change.

This likely means writing each change twice (forward and backward),[1] and generally only using one of them. Personally I'd not bother.

[1]: There may be some really simple cases where that could be done automatically, but usually you'd want to repopulate data in one of the two directions.
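As an aside, the "detect whether a particular change is needed" model mentioned above typically looks something like the following in an extension's LoadExtensionSchemaUpdates handler; the table, column and patch file names here are hypothetical:

```php
<?php
// Illustrative sketch of the "detect whether the change is needed" model:
// addExtensionField() checks whether the column already exists before applying
// the patch file, so re-running update.php stays idempotent.
class ExampleSchemaHooks {
	public static function onLoadExtensionSchemaUpdates( DatabaseUpdater $updater ) {
		$updater->addExtensionField(
			'example_table',   // hypothetical table
			'ex_new_column',   // hypothetical column
			__DIR__ . '/sql/patch-example_table-ex_new_column.sql'
		);
	}
}
```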

  • Nothing in mediawiki changes except we replace *.sql files with *.json ones that are DBMS-agnostic (so one per schema change, and one central tables.json for fresh installations)

I don't think this will be sufficient. This sounds like only the "updates" key from what I listed at https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema#Schema_change_format when drafting this RFC.

Here's a use case to consider: Because making schema changes in the past has historically been such a pain, gerrit:357892 added just one patch file for MySQL, which has since caused problems like T227662. But not for PostgreSQL, since the syntax of that file makes per-field updates much easier to write. Ideally we'll have that property for all changes in the future, and one of the ways I proposed doing that was to define the format such that we can easily bundle multiple check+update sets into one "patch" file.

Everything else stays the same, including the way MediaWiki checks the current schema to determine whether it would apply the patch or not. This narrows the focus of the RFC a bit.

It would narrow the scope to the point where the RFC doesn't actually accomplish the things that it is intended to accomplish, unfortunately. We don't want to have to write the logic to check whether an update is needed multiple times, just as we don't want to write each update's SQL multiple times.

Can I re-iterate my previous concern about using JSON as the data format, given that JSON doesn't support comments.

We could easily enough use MediaWiki's FormatJson::stripComments() or something very like it to work around that problem.
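For example (assuming a hypothetical schema file path), stripping comments before decoding would look roughly like this:

```php
<?php
// A minimal sketch: strip JS-style comments with MediaWiki's FormatJson before
// decoding the schema JSON. The file path is a made-up example.
$raw = file_get_contents( __DIR__ . '/sql/tables.json' ); // hypothetical path
$schema = FormatJson::decode( FormatJson::stripComments( $raw ), true );
```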

The current schema has no table or column comments. [...] Luckily both the current system's problem and the problem you mention here have the same solution: put comments in the columns and tables themselves. There should be a place in the JSON for these, as they're part of standard create table statements.

I note that https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema#Schema_format does include that. I'm not entirely sure it's a good idea, though, since it would mean that updates to the documentation would also require schema changes.

The nice thing we can now have with DBAL is that in our CI we can have our abstract schema tested against all DBMSes, by trying to turn the abstract PHP objects into SQL commands for all supported platforms (and we can now add even more DBMSes like SQL Anywhere, SQL Azure, etc. to the list) and checking whether they error out, because DBAL can catch most of the edge cases as I illustrated above.

Note that we want to enforce limits in the abstract, not just for the specific database. Personally I'd prefer that limits are defined and tested for explicitly rather than being a union of requirements applied by whichever set of DBAL backends happens to be in use.

Speaking of CI, I'd also like to have a way to load a schema file from an old version of MediaWiki, apply the updates to it, and then extract the schema and test whether it's equivalent to the data in the current schema file. And ideally do that entirely on the abstract level, rather than translating it into some DB's SQL and back.
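A hedged sketch of that CI check using DBAL's schema comparator; the JSON-loading helpers and file paths here are hypothetical, not existing code:

```php
<?php
// Sketch: take the schema as of an old release, apply the accumulated abstract
// changes, and assert the result matches the current abstract schema - all
// without generating engine-specific SQL. loadAbstractSchema() and
// applyAbstractChange() are hypothetical helpers building/modifying
// Doctrine\DBAL\Schema\Schema objects from the JSON files.
use Doctrine\DBAL\Schema\Comparator;

$oldSchema = loadAbstractSchema( 'history/tables-1.34.json' );        // hypothetical
$currentSchema = loadAbstractSchema( 'sql/tables.json' );             // hypothetical

foreach ( glob( 'sql/abstractSchemaChanges/*.json' ) as $changeFile ) {
	$oldSchema = applyAbstractChange( $oldSchema, $changeFile );      // hypothetical
}

$diff = ( new Comparator() )->compare( $oldSchema, $currentSchema );
// An empty diff means the change files bring the old schema fully up to date.
$upToDate = !$diff->newTables && !$diff->changedTables && !$diff->removedTables;
```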

I would hope and expect to see an average of about 5 lines worth of documentation comments per field, and probably an average of about 10 lines per table. I'm aware that the documentation is not currently at this level, but I would hope that improving that is something we all aspire to.

That seems excessive to me. While in some cases a field may need 5 lines of documentation, most likely need much less. Same for tables.

A data format that allows proper comments, of any length, please! YAML may be a good choice, or a custom JSON + comments format if YAML is too heavy.

Parsing JSON in PHP is a fairly standard thing to do, while PHP's YAML extension is not bundled with PHP by default.

I actually wish we had more possibilities to diverge, for example to use native Internet Protocol address data types for ipb_address etc...
So abstractions would have been done up in the application layer, not just on some single storage layer.

In general, you can either support multiple databases or you can require just one. MediaWiki has decided to support multiple, and changing that to support only MySQL is outside the scope of this RFC.

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - had we had an abstraction to handle JSON values in IDatabase, we wouldn't need fixes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TemplateData/+/521193 - JSON could end up as JSONB in Postgres and a BLOB in MySQL. Sometimes the abstractions would need to go much higher than this, providing a DB-specific implementation of it.

The problem in T203850 is that the current MySQL schema doesn't differentiate between storage of UTF-8 text and storage of binary data, while others do and therefore cause problems when you try to shove binary data into a Unicode text field.

While some JSON abstraction at the IDatabase level probably would have avoided that specific issue, it doesn't touch the underlying problem and provides no other real benefit.

This. A proper DB abstraction should not be about trying to define the lowest common denominator of, for example, an INT field (can't be signed, limited to 16 bits because we need to support FoobarDB, etc.). It should be about defining high-level data types (e.g. EMAIL, ARTICLE_TEXT, etc.) and allowing the individual abstraction layers to map them to the most appropriate type in the back-end system.

I disagree with both of your alternatives. We should define types with useful properties while leaving it up to the database abstraction layer to choose a specific type that satisfies our requirements. The FoobarDB implementation might have to use something other than "INT" to represent our integer type, and that's ok. But on the other hand, there's likely no benefit at the database abstraction level to trying to separate "email" and other short strings, or "article text" and other long strings. Even if some database has an "email" type (or an "ip" type), it seems unlikely that we'd be able to really make use of any added features as long as we do want to support more than just one database.
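Purely as an illustration of "types with useful properties" (every name and mapping below is hypothetical, not part of any proposal), the abstraction could describe what it needs and let each backend pick a concrete type:

```php
<?php
// Hypothetical sketch: abstract types described by the properties the schema
// cares about, with each backend free to choose a concrete SQL type.
$abstractTypes = [
	'mwtimestamp' => [ 'kind' => 'string', 'length' => 14, 'binary' => true ],
	'blob'        => [ 'kind' => 'binary', 'fixedLength' => false ],
];

// One possible, purely illustrative per-backend mapping satisfying those properties.
$backendTypes = [
	'mysql'    => [ 'mwtimestamp' => 'BINARY(14)',  'blob' => 'MEDIUMBLOB' ],
	'postgres' => [ 'mwtimestamp' => 'TIMESTAMPTZ', 'blob' => 'BYTEA' ],
	'sqlite'   => [ 'mwtimestamp' => 'BLOB',        'blob' => 'BLOB' ],
];

echo $backendTypes['postgres']['mwtimestamp'], "\n"; // -> TIMESTAMPTZ
```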

That may be true of a general abstraction layer, but we're talking about a specific abstraction layer for MediaWiki. Therefore, by definition, it is the choice of the application.

Note that MediaWiki's database abstraction code is in includes/libs/, which implies that we hope to someday publish it as a separate library that other projects can use as well.

I wonder if this could be done using JSONSchema.

Not very likely. JSONSchema is about defining the structure of a JSON-based data structure. Here we're talking about an SQL DDL, and only storing the data structure as JSON.

We might create a JSONSchema to define the structure of our JSON files that in turn define the SQL DDL, but that transitive relationship seems unlikely to be useful for the use cases you're thinking of.

In a private email, @Ladsgroup wrote:

Also it's possible to make the schema change abstraction pluggable so in core we use Doctrine DBAL and in Oracle extension, they need to write something like that from scratch.

I like that idea. If we wrap Doctrine DBAL in our own interface so the implementation could be easily changed if we find shortcomings at some point, I'd have much less concern about it. That makes the question of whether we use DBAL or not somewhat irrelevant since MediaWiki won't directly depend on it. We can use it for the initial implementation, and if it turns out to be too much trouble to work around we can change the implementation without having to redo a bunch of the rest of the work.
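As a rough sketch of what such a wrapper could look like (the interface and method names are hypothetical, not what was eventually merged):

```php
<?php
// Sketch: MediaWiki-owned interface, with Doctrine DBAL as one implementation.
interface SchemaSqlBuilder {
	/**
	 * @param array $tableSpec One table definition decoded from the abstract schema JSON
	 * @return string[] DDL statements for the target database type
	 */
	public function getCreateTableSql( array $tableSpec ): array;
}

class DbalSchemaSqlBuilder implements SchemaSqlBuilder {
	/** @var \Doctrine\DBAL\Platforms\AbstractPlatform */
	private $platform;

	public function __construct( \Doctrine\DBAL\Platforms\AbstractPlatform $platform ) {
		$this->platform = $platform;
	}

	public function getCreateTableSql( array $tableSpec ): array {
		$schema = new \Doctrine\DBAL\Schema\Schema();
		$table = $schema->createTable( $tableSpec['name'] );
		foreach ( $tableSpec['columns'] as $column ) {
			$table->addColumn( $column['name'], $column['type'], $column['options'] ?? [] );
		}
		foreach ( $tableSpec['indexes'] ?? [] as $index ) {
			$table->addIndex( $index['columns'], $index['name'] );
		}
		return $schema->toSql( $this->platform );
	}
}
```

An Oracle or MSSQL extension could then ship its own SchemaSqlBuilder implementation, without core depending on DBAL directly.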

I wonder if this could be done using JSONSchema.

Not very likely. JSONSchema is about defining the structure of a JSON-based data structure. Here we're talking about an SQL DDL, and only storing the data structure as JSON.

Yeah, I can't quite imagine how an ALTER part of a migration would look in JSONSchema. Maybe what I'm envisioning isn't so much about migrations, but about abstract table schemas. We use JSONSchema to create (and auto migrate) Hive tables. Right now we do this with some custom Spark schema glue, but might one day do a similar thing with Kafka Connect, which itself integrates with many source and sink databases.

Having MediaWiki tables represented as JSONSchemas could potentially ease integration of MediaWiki data into OLAP (and other) systems. Everyone would love it if we could convert the regular XML dumps into JSON. If there were JSONSchemas describing MW tables, the records in a JSON MW dump would conform to them.

On the other hand, supporting abstract migrations between RDBMSes is hard enough; adding constraints for OLAP stuff in there is crazy :p Just some food for thought I guess, carry on! :)

A tiny example of what I mean is https://phabricator.wikimedia.org/T203850 - […]

Otherwise we will end up one day with everything being varbinary or blob.

For cases where the application has no shared use case at all other than to store and retrieve binary data, that seems appropriate. Specifying it as anything more than that only costs maintenance time to specify, maintain, and migrate as its incidental format changes over time, plus the run-time overhead of constraining it, at no benefit.

No rule is perfect for everything of course, and that includes this one.

Having Mediawiki tables represented as JSONSchemas could potentially ease integration of Mediawiki data into OLAP (and other) systems.

Defining fields and their types is only one aspect of defining the database schema. Another critical aspect is defining indexes, collations, etc. As far as I understand, such things have no place in JSONSchemas.

Perhaps the inverse could be possible - generate JSONSchema from the "DB schema JSON".

Everyone would love it if we could convert the regular XML dumps into JSON. If there were JSONSchemas describing MW tables, the records in a JSON MW dump would conform to them.

I doubt this would be possible, and if it was, I would strongly recommend against it. We would be exposing a lot of internal detail, making the dumps hard to use and unstable. XML and JSON are document-style formats that make use of nesting for context. Relational databases do not support that; they use relation tables instead. The two approaches are different enough to make it extremely annoying to represent one using the other (it's technically possible, but really annoying). The idea of the dump format is that it follows our conceptual data model and is independent of the database schema.

Dumps that basically represent the table structure are only nice when importing into the *same* table structure. When trying to import into a different structure (say, a different version of MW) or when doing stream processing, it's a real pain.

With the current format, you get a page as one "thing", with all revisions nested into that "thing", with the meta-data and content nested into that. That makes it really easy to process the dump as a stream of pages or of revisions, with all the meta-data and content immediately available.

With a table-oriented dump structure, you'd get a stream of pages, and a stream of revision meta data, then a stream of slot associations, a stream of content meta data, and then a stream of blobs, a stream of content model names, etc. You'd have to write each stream into a database so you can then associate the different entities with each other. That would be impractical for most use cases.

<snip>

With a table-oriented dump structure, you'd get a stream of pages, and a stream of revision meta data, then a stream of slot associations, a stream of content meta data, and then a stream of blobs, a stream of content model names, etc. You'd have to write each stream into a database so you can then associate the different entities with each other. That would be impractical for most use cases.

I have to agree with Daniel here. If you want something easy to import into a database, you use a tool to generate SQL (or whatever you need) from the XML. Similarly, if you want JSON-formatted dumps with the same sort of structure as the existing ones, then a conversion tool ought to be developed for that. That structure is very useful so I do want to keep it; per-page processing of the dumps is quite common in bot land. But maybe we could talk about JSON format needs and use cases on another ticket.

Another critical aspect is defining indexes, collations, etc. As far as I understand, such things have no place in JSONSchemas.

They could! JSONSchema doesn't just define structure, it can also define other metadata, like allowed formats, etc. We plan to implement a custom JSONSchema extension to allow for annotating fields as either dimension (low cardinality, like Prometheus labels) or measure (values, e.g. page load timings) for ingestion into Druid. The same could be done for indexes, etc.

XML and JSON are document style formats that make use of nesting for context.

XML does this, but JSON doesn't necessarily. This is actually a big difficulty in parsing the XML dump; it is impossible to know the full context of a revision without parsing an entire XML page document. The data is large enough now that parsing the full dump can be pretty slow or difficult in a non-distributed context, and XML doesn't distribute well. We wouldn't want a JSON dump as an array of revision objects for every page; each DB record would be its own JSON object, i.e. JSON Lines.

Anyway, yeah I get your point, you are right. A JSON version of a MySQL dump isn't more useful than a plain ol' SQL dump. Analytics plans to release (possibly) JSON dumps of the MediaWiki History data set, which will be much more useful.

I mostly started commenting about JSONSchema at all because we have tools that can use JSONSchemas to load from and into various data stores. Having a JSONSchema for Mediawiki databases might make some future integration tasks simpler.

Ladsgroup updated the task description. Jul 17 2019, 2:27 PM
Ladsgroup moved this task from Request IRC meeting to P1: Define on the TechCom-RFC board.

I updated the description per the discussion in the TechCom IRC meeting and we agreed to move this to last call.

Having a JSONSchema for Mediawiki databases might make some future integration tasks simpler.

I can well imagine that being useful, and it would probably be easy enough to generate JSONSchema from the DB specs, just like we generate SQL. Maybe there could even be a JSONschema backend for DBAL. Not sure that's the best way to do this, though.

Ladsgroup updated the task description. Jul 18 2019, 12:57 PM
daniel updated the task description. Jul 18 2019, 1:53 PM

I went over the proposal to clarify it a bit more.

This RFC was discussed on IRC on July 18. Several points have been discussed and clarified, which are now reflected by the task description. There was consensus that this RFC was ready to go on last call with the updated description. It will be reviewed at the next TechCom meeting.

IRC meeting minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2019/wikimedia-office.2019-07-17-13.05.html
Full Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2019/wikimedia-office.2019-07-17-13.05.log.html

I'm a bit confused here, cause updated task description seems to collide (again) with https://phabricator.wikimedia.org/T113831

Regarding what Skizzerz said ("In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing."), it still looks like a good point to consider.

I can't access IRC logs behind my company's firewall, so please ignore this comment if I'm missing something.

daniel added a subscriber: Hexmode. Jul 18 2019, 4:30 PM

I'm a bit confused here, cause updated task description seems to collide (again) with https://phabricator.wikimedia.org/T113831

That is correct. The main reason that was declined was the fact that it would require an abstraction layer for schema changes, which would be provided per this RFC.

Regarding what Skizzerz said ("In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing."), it still looks like a good point to consider.

I can't access IRC logs behind my company's firewall, so please ignore this comment if I'm missing something.

This was briefly discussed during the meeting. I haven't looked at the code, but @Hexmode apparently made a proof of concept for that, and was able to make it work without too much trouble.

You are right that there should be a ticket about making DB support pluggable. The creation of the abstraction layer isn't blocked on that, but the ability to provide DB support in extensions is, so it should at least be mentioned here.

I'm a bit confused here, cause updated task description seems to collide (again) with https://phabricator.wikimedia.org/T113831

Effectively we revisited T113831 in light of the abstract schema changes proposal here and decided that it's now feasible and desirable to do so.

Regarding what Skizzerz said ("In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing."), it still looks like a good point to consider.

We're planning on doing the installer bit, and from private emails it sounds like Skizzerz is on board with that plan. As a rough straw proposal, we could have the installer detect database-providing extensions (via a flag in extension.json) and load them early for selection at the database setup stage of the installer. The one selected would continue to be loaded from then on, even before LocalSettings.php is generated.

This was briefly discussed during the meeting. I haven't looked at the code, but @Hexmode apparently made a proof of concept for that, and was able to make it work without too much trouble.

Note I gave that patch a -1 because it seems to depend on what we decided against in T467: RfC: Extension management with Composer.

Regarding what Skizzerz said ("In any case, splitting DBMS out of core is an absolute non-starter until the installer can make use of them somehow without LocalSettings.php existing."), it still looks like a good point to consider.

We're planning on doing the installer bit, and from private emails it sounds like Skizzerz is on board with that plan. As a rough straw proposal, we could have the installer detect database-providing extensions (via a flag in extension.json) and load them early for selection at the database setup stage of the installer. The one selected would continue to be loaded from then on, even before LocalSettings.php is generated.

Yep. Cindy and I were discussing this via email, and installer support is planned as part of these changes. As such, I see no downsides to pulling support out of core and into an extension. In case it's relevant, Cindy has my full permission to quote any part of my private email reply in here or other public spaces.

Looking into extension.json and exposing db-related extensions in the installer seems workable to me. I presume the interface would be some sort of abstract class that the extension extends (and provides a reference to via AutoloadClasses), and fills in protected methods for the functionality it needs to do like setting up or upgrading tables and exposing UI options in the installer. It seems like such an interface would be more foolproof than having hooks for all of those components.
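For illustration, a straw sketch of the kind of abstract class such a database-providing extension might implement and reference from its extension.json; all names here are hypothetical, not an agreed design:

```php
<?php
// Hypothetical sketch: the contract a database-providing extension could
// implement so the installer can offer its backend for selection.
abstract class DatabaseInstallerProvider {
	/** Name shown in the installer's database type selection. */
	abstract public function getDisplayName(): string;

	/** @return string Name of the Wikimedia\Rdbms\Database subclass used for runtime queries */
	abstract public function getDatabaseClass(): string;

	/** Installer UI fields (host, credentials, ...) specific to this backend. */
	abstract public function getConnectForm(): array;

	/** Create the initial tables for this backend from the abstract schema. */
	abstract public function createTables(): void;

	/** Apply pending schema changes for this backend. */
	abstract public function upgradeTables(): void;
}
```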

daniel moved this task from P1: Define to P5: Last Call on the TechCom-RFC board. Jul 26 2019, 11:06 AM

Per the TechCom meeting on July 24, this RFC goes on Last Call for being approved. If no objections remain unaddressed by August 7, the RFC will be approved as proposed and amended.

If you care about this RFC, please comment - in support, or raising concerns. The Last Call period is not just for raising objections, but also for confirming consensus.

Tgr awarded a token. Aug 1 2019, 11:39 AM
Tgr added a subscriber: Tgr. Aug 1 2019, 11:44 AM
  • Existing schema change (json) files should not be changed to perform additional changes. - there will still be a main file representing the current schema (like tables.sql in core and extension_name.sql in most extensions) which is updated continuously, and used in the installer, right?
  • Btw. tables.sql and co. are valuable as easy-to-understand documentation of the current schema; will that be preserved in some way?
  • Not all schema changes can be represented in an abstract language (e.g. because you need to do two ALTERs and some DML to move the data in between); does the new system allow using manual SQL? (I guess the answer is yes, because worst case, you could write a custom update script?)
  • Existing schema change (json) files should not be changed to perform additional changes. - there will still be a main file representing the current schema (like tables.sql in core and extension_name.sql in most extensions) which is updated continuously, and used in the installer, right?

Right - JSON files that represent a schema change should not be grown to represent more changes. JSON files that represent a schema should be updated to reflect the current desired state.

  • Btw. tables.sql and co. are valuable as easy-to-understand documentation of the current schema; will that be preserved in some way?

Good point - I'd say we could generate them as documentation. Not sure if they should be checked in, though. If yes, they should go into docs/

  • Not all schema changes can be represented in an abstract language (e.g. because you need to do two ALTERs and some DML to move the data in between); does the new system allow using manual SQL? (I guess the answer is yes, because worst case, you could write a custom update script?)

It seems to me like allowing SQL in the JSON files would defeat the purpose. The solution for complex migrations would indeed be to write a maintenance script that does the data shoveling. This is probably better anyway, since it allows the shoveling to be done in batches.

Tgr added a comment. Aug 1 2019, 12:18 PM

It seems to me like allowing SQL in the JSON files would defeat the purpose.

It can't be done in the JSON files because the SQL would have to be different per engine. There could be some hook system, maybe. I agree that using maintenance scripts is probably better, though.

It seems to me like allowing SQL in the JSON files would defeat the purpose.

It can't be done in the JSON files because the SQL would have to be different per engine. There could be some hook system, maybe. I agree that using maintenance scripts is probably better, though.

I would be very much against using hooks or anything other than maintenance scripts in this case, because, for example, there is a plan to move support for Oracle and MSSQL to extensions and let volunteers maintain them, and those maintainers should not be responsible for adding support for individual schema changes or any data migration they would need.
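For illustration, a hedged sketch of the batched, maintenance-script-based data shoveling discussed above, using MediaWiki's LoggedUpdateMaintenance so the migration runs only once per wiki. Table, column and update-key names are hypothetical:

```php
<?php
// Sketch of a batched data migration script; not tied to any real schema change.
require_once __DIR__ . '/Maintenance.php'; // assumes this lives in maintenance/

class MigrateExampleData extends LoggedUpdateMaintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Copy ex_old_column into ex_new_column in batches' );
		$this->setBatchSize( 500 );
	}

	protected function getUpdateKey() {
		return __CLASS__;
	}

	protected function doDBUpdates() {
		$dbw = $this->getDB( DB_MASTER );
		$batchSize = $this->getBatchSize();
		$lastId = 0;
		do {
			// Fetch a batch of rows that still need migrating.
			$ids = $dbw->selectFieldValues(
				'example_table',
				'ex_id',
				[ 'ex_new_column IS NULL', 'ex_id > ' . $dbw->addQuotes( $lastId ) ],
				__METHOD__,
				[ 'ORDER BY' => 'ex_id', 'LIMIT' => $batchSize ]
			);
			if ( $ids ) {
				$dbw->update(
					'example_table',
					[ 'ex_new_column = ex_old_column' ], // raw SET fragment
					[ 'ex_id' => $ids ],
					__METHOD__
				);
				$lastId = (int)end( $ids );
				wfWaitForSlaves(); // let replication catch up between batches
			}
		} while ( count( $ids ) === $batchSize );
		return true;
	}
}

$maintClass = MigrateExampleData::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```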

This RFC has been approved as proposed per the TechCom meeting on 2019-08-07. @Tgr's comment appears to have been addressed.

Verdy_p added a comment. Edited Nov 6 2019, 8:21 PM

Note that when closing T194125 (charset issue), this new task still does not address the backend conformance or emulation level we'd need to check (possibly with additional extensions to load in the backend SQL server). Each backend must still be able to pass a conformance test or, failing that, enable an emulation layer to allow compatibility or background migration of schemas (without causing a server upgrade to be immediately shut down for hours... or days).

When upgrading the schema, an upward compatibility test should be done, in order to cancel the upgrade and avoid breaking the existing server, until a compatibility layer is implemented or a background migration tool is developed that allows the wiki to be restarted (even if it will temporarily perform more slowly while the migration is ongoing, meaning that there may be a need to rename some existing tables, while populating another newer table and rebuilding the necessary indexes).
As well, the triggers and constraints must be adapted. But I don't really see why MySQL/MariaDB is so different from Sybase/MSSQL; after all, Sybase has done the work to allow integrating MySQL into a farm of SQL servers, with MySQL (and MSSQL or ESE) working through adapters, and MySQL/MariaDB being seen as a lower-end SQL engine.
Oracle as well has integrated a connector for MySQL in its enterprise-level Oracle SQL engines. PostgreSQL is the most versatile engine. Many things can be adapted into Oracle or Sybase/MSSQL using a set of triggers and constraints. Years ago I worked on a large enterprise project that started in Sybase, then was adapted to Informix, IBM DB2, MSSQL, MySQL, and Oracle, without having to rewrite the application, which depended on a very large set of stored procedures. Even that large set of stored procedures was automatically converted to equivalents in Oracle, Informix, DB2, and MSSQL. The DDL was also converted, and the application contained a self-written adapter that dynamically rewrote the queries (including the most complex ones using outer joins and custom subselections, temporary tables, and multilevel transactions synchronized between local databases and a remote central database). Despite the emulation used in Oracle and Informix, they performed faster than their equivalents in Sybase. The key was in using stored procedures/functions, allowing specific tuning for specific engines that could run natively in each engine without transferring much data between the server and a client not necessarily running on the same host: joins had to be resolved locally, possibly using the local procedural language (Transact-SQL converted to PL/SQL, initially automatically by a build tool, then tuned manually). The applications then just called procedures with simple parameters to return a result set, and the adaptation layer used a transactional system controlled by the application but relying on each engine's specific transactional system.
All data types were converted (including support for textual or binary blobs). And all this allowed the client to work over slow remote connections. All applications were interactive, and the central server was managing hundreds of users simultaneously across a large region or a whole country from dozens of branches. At that time it was not a web application; the constraints were much more severe than for MediaWiki, which just has to work through local networks and sits behind a front web server.

But it's very surprising that you can't handle the special case of the behavior of unique indexes on NULL columns: this is probably a symptom of bad conceptual design of the schema (where you may have used NULL as a unique value when it should never be, NULL being better reserved for 0-to-1 or 0-to-N outer joins or optional attributes, and normally having NO "unique" constraints in these joins; if you use 0-to-N, you need computed columns or aggregates, or an additional intermediate relation table in your schema).

This comment was removed by bd808.
bd808 added a subscriber: bd808. Jan 16 2020, 11:58 PM

Change 595240 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Start wiring automatically generated sql schemas to installation

https://gerrit.wikimedia.org/r/595240

I found a PHP-based formatter of SQL code; it may come in handy to format automatically generated SQL code: https://github.com/jdorn/sql-formatter (haven't tested it).

Apparently Doctrine includes its own one (it just wasn't being used), so that will be used instead.

Reedy added a subscriber: Reedy. May 9 2020, 2:50 PM

Apparently Doctrine includes its own one (it just wasn't being used), so that will be used instead.

Well, they are one and the same, as per https://github.com/doctrine/sql-formatter#history :)

This package is a fork from https://github.com/jdorn/sql-formatter
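For reference, usage of doctrine/sql-formatter is roughly as follows (based on its README; the exact API may vary between releases, and the SQL string here is just a made-up example):

```php
<?php
// Sketch: pretty-print a generated SQL statement with doctrine/sql-formatter.
use Doctrine\SqlFormatter\SqlFormatter;
use Doctrine\SqlFormatter\NullHighlighter;

$sql = 'CREATE TABLE example_page (ep_id INT UNSIGNED AUTO_INCREMENT NOT NULL, '
	. 'ep_title VARBINARY(255) NOT NULL, PRIMARY KEY(ep_id))';

// NullHighlighter keeps the output as plain text rather than CLI/HTML highlighted.
echo ( new SqlFormatter( new NullHighlighter() ) )->format( $sql );
```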

Change 595269 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Add doctrine/sql-formatter to pretty print generated SQL files

https://gerrit.wikimedia.org/r/595269

Change 595269 merged by jenkins-bot:
[mediawiki/core@master] Add doctrine/sql-formatter to pretty print generated SQL files

https://gerrit.wikimedia.org/r/595269

Change 595240 merged by jenkins-bot:
[mediawiki/core@master] Wire empty abstract schema into installer

https://gerrit.wikimedia.org/r/595240

Krinkle updated the task description. May 9 2020, 9:47 PM
Krinkle removed a subscriber: Krinkle. May 11 2020, 3:46 PM

@Ladsgroup I'm unsure if this was already discussed or implemented, but perhaps we could add a PHPUnit test to ensure that the contents of the tables-generated.sql files match the output generated by generateSchemaSql.php. Something along the lines of AutoLoaderStructureTest. What do you think?

@Ladsgroup I'm unsure if this was already discussed or implemented, but perhaps we could add a PHPUnit test to ensure that the contents of the tables-generated.sql files match the output generated by generateSchemaSql.php. Something along the lines of AutoLoaderStructureTest. What do you think?

I have been thinking about it, but I keep forgetting to start a ticket or possibly implement it (my unconscious is pretty lazy). I'll do it now.
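A rough sketch of what such a structure test might look like (in the spirit of AutoLoaderStructureTest); the regeneration helper is a placeholder that would need to be wired to the same DBAL-based code generateSchemaSql.php uses, and the file paths are assumptions:

```php
<?php
// Sketch: the checked-in tables-generated.sql must match what regenerating
// from the abstract schema would produce.
use PHPUnit\Framework\TestCase;

class GeneratedSchemaSqlTest extends TestCase {
	public function testGeneratedSqlIsUpToDate() {
		$checkedIn = file_get_contents( __DIR__ . '/../../maintenance/tables-generated.sql' );
		$regenerated = $this->regenerateSql( __DIR__ . '/../../maintenance/tables.json' );

		$this->assertSame(
			$regenerated,
			$checkedIn,
			'tables-generated.sql is stale; re-run maintenance/generateSchemaSql.php'
		);
	}

	private function regenerateSql( string $abstractSchemaPath ): string {
		// Placeholder: decode the abstract schema JSON and run it through the
		// same builder that generateSchemaSql.php uses, returning the SQL text.
		return '';
	}
}
```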