Page MenuHomePhabricator

Replace LoadExtensionSchemaUpdates hook with static data in extension.json
Open, Needs TriagePublic

Description

As I've written elsewhere today, the LoadExtensionSchemaUpdates hook is really fragile and the way its executed in the web installer is counter to the way most hooks work. For most extensions, we should be able to replace it with static data. See https://www.mediawiki.org/wiki/User:Legoktm/LoadExtensionSchemaUpdates for the current proposals.

Each row would correspond to a line in addExtensionUpdate syntax, and ExtensionProcessor will do path expansion. We'll need to add some syntax in for addPostDatabaseUpdateMaintenance as well. Once the abstract schema system is in place, the JSON changes would go under the abstract key rather than an individual rdbms.

Event Timeline

Quick survey of the most popular extensions (I might need to look broader as this didn't contain as many extensions with database tables as I thought it would):

Bundled extensions:

  • OATHAuth: Runs addExtensionUpdate for mysql & sqlite (which really should be loggedupdatemaintenance scripts). Otherwise appears compatible

ExtensionDistributor top #15:

  • Math: varies table creation on MathRenderer::getValidModes(). Otherwise appears compatible. (throws exception on non mysql/sqlite/postgres installs)
  • LDAPProvider: Appears compatible
  • Echo: (hook implementation) is incredibly complex.
    • Does nothing if $wgEchoCluster is enabled.
    • An index was renamed twice, so it checks if the middle name exists before applying patch-alter-type_page-index.sql. Couldn't that be done in the patch file?
    • Runs addExtensionUpdate and a post update maintenance script
    • SQLite and MySQL patches differ

So, looks pretty doable. I think this will likely be a 95% solution, and Echo might be on the border. Though we do eventually want to bundle Echo so we will need to adapt it to make it work...

Probably a good time to disentangle installation from upgrade a bit? Installation only requires a predefined SQL schema 99% of the time; it would be easy to require extensions which want to be part of a tarball bundle, so they can be enabled at install time, to provide a static schema. The very few extensions that can't would still be possible to enable via the updater.

In terms of database schema, there isn't actually any concept of "installation", it's always treated as updates. I do think as part of T258852: Replace LoadExtensionSchemaUpdates hook with static data in extension.json we could implement that split - we have one declared schema definition, on install we run that, and then on updates we iterate through all the updates.

Maybe then something like:

{
	"SchemaUpdates": {
		"mysql": {
			"schema": "sql/linter.sql",
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "sql/patch-linter_foo.sql" ]
			]
		},
		"sqlite": {
			"schema": "sql/linter.sql",
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "sql/patch-linter_foo.sql" ]
			]
		},
		"postgres": {
			"schema": "sql/linter.postgres.sql",
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "sql/patch-linter_foo.postgres.sql" ]
			]
		},
	}
}

An in a future abstract schema world (so we're not stuck nesting under an "abstract" key forever):

{
	"AbstractSchemaUpdates": {
		"schema": "schema/linter.json",
		"updates": [
			[ "addExtensionField", "linter", "linter_foo", "schema/patch-linter_foo.json" ]
		]
	}
}

I moved my proposals to https://www.mediawiki.org/wiki/User:Legoktm/LoadExtensionSchemaUpdates so I can keep tweaking the JSON without posting long comments here.

@Ladsgroup currently some extensions have their tables all in one SQL file (similar to core), while others have it in one SQL file per table, which works better for $databaseUpdater->addExtensionTable I think. What's the recommendation in an abstract schema world? One JSON file per extension or one per table?

I moved my proposals to https://www.mediawiki.org/wiki/User:Legoktm/LoadExtensionSchemaUpdates so I can keep tweaking the JSON without posting long comments here.

@Ladsgroup currently some extensions have their tables all in one SQL file (similar to core), while others have it in one SQL file per table, which works better for $databaseUpdater->addExtensionTable I think. What's the recommendation in an abstract schema world? One JSON file per extension or one per table?

I have looked at it and there's no preference on abstract side of things in matter of technical implementation, I don't know what you would like better, We can start recommending one way or the other if discussion here reaches consensus.

I'd mildly prefer that we make the abstract schema system the default under the SchemaUpdates key, and use ManualSchemaUpdates for the legacy system, but otherwise this sounds great.

I like proposal two in https://www.mediawiki.org/wiki/User:Legoktm/LoadExtensionSchemaUpdates with the twist James suggested. We can also merge all of them together. Something like

{
	"SchemaUpdates": {
		"mysql": {
			"tables": {
			    "linter": "sql/linter.sql"
			},
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "sql/patch-linter_foo.sql" ]
			]
		},
		"sqlite": {
			"tables": {
			    "linter": "sql/linter.sql"
			},
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "sql/patch-linter_foo.sql" ]
			]
		},
		"postgres": {
			"tables": {
			    "linter": "sql/linter.postgres.sql"
			},
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "sql/patch-linter_foo.postgres.sql" ]
			]
		},
		"abstract": {
			"tables": {
			    "linter": "schema/linter.json"
			},
			"updates": [
				[ "addExtensionField", "linter", "linter_foo", "schema/patch-linter_foo.json" ]
			]
		},
	}
}

cc: @Umherirrender

For big extension the schema updates could be many text, which needs to be read (but not cached) during the extension registration process. Not sure if such big arrays, only needed by installer and updater (+tests), should be there or just mention another file to contain the schema updates.

Some schema updates does not apply to all rdbms or only to one (like changing the length of a varying-length char is not needed in sqlite or postgres), needs some kind of "only" key in the concept

For complex hooks there could be a "callback" key to execute php like already exists for the whole extension.json.

+1 for breaking it out into a separate file; extension.json is already pretty unwieldy. For most things there would be a performance impact in splitting it into multiple files, but for DB updates that's not an issue.

I disagree a bit on sql updates taking a lot of space. On two grounds:

Callback can be good but for the first step, let's go with the simple design and any extension that's complex, it can fallback to the terrible hook we currently have.

I'm considering whether this is a subtask of T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core which I'm working on now.

I don't like the callback array syntax. Parameters should have names.

I'm sure we all know that addField really means "apply patch if field does not exist". If we have two fields to add at the same time, then there will just be a single addField, with the field name in the declaration chosen at random. It's clunky but we're accustomed to it.

There's no such thing as abstract schema changes at runtime -- doctrine/dbal is a dev dependency. I need to refresh my memory on why it was done like that. If we can have Doctrine at runtime then we don't even need extensions to tell us what fields they are adding, that information is available in TableDiff::getAddedColumns(). If the conditions are not obvious from the JSON, then the change can just be applied once and logged. A lot of changes are already logged in this way via modifyExtensionTable and modifyExtensionField. Or we can extend the JSON schema change file format, adding the relevant data.

With abstract schemas there are two kinds of JSON file, which you might call schemas and schema changes. I think they could be separate top-level attributes in extension.json. This makes sense from the perspective of separating installation from update.

Consider object|string shorthands like we have in ResourceModules:

{
	"Schemas": [ "sql/tables.json" ]
}

as a shorthand for

{
	"Schemas": [ { "file": "sql/tables.json" } ]
}

With named parameters, it's no big deal to mix abstract schemas with manual schemas. A manual schema:

{
	"Schemas": [ { "sqlFile": "sql/$type/tables-generated.sql" } ]
}

An abstract schema change:

{
	"SchemaChanges": [ "sql/abstractSchemaChanges/patch-add_gels_anchor.json" ]
}

A manual schema change:

{
	"SchemaChanges": [
		{
			"op": "renameIndex",
			"table": "spoofuser",
			"oldIndex": "su_normalized",
			"newIndex": "su_normname_idx",
			"sqlFile": "sql/$type/patch-spoofuser-index-su_normname_idx.sql"
		},
	]
}

I'm considering whether this is a subtask of T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core which I'm working on now.

Probably not, because T352113 will need backwards compatibility with LoadExtensionSchemaUpdates. If backwards compatibility is possible and implemented then this task does not block T352113.