Page MenuHomePhabricator

Wire abstract schema and schema changes for extension
Open, LowPublic

Description

Extensions should be able to define their schema and schema changes in the abstract way that we are building for core so extensions easily would be able to support different DBMS-es

Event Timeline

Extensions can define its schema in an abstract way and T259374 is there to migrate all of them.

Extensions can define its schema in an abstract way and T259374 is there to migrate all of them.

I know but the point of this ticket is a bit further. It's to be able to add paths to abstract schemas in extension.json. That would allow us to wire those abstract schemas to core's tests (that we check if it has a valid json, if prefixes are followed, etc.) and that would also allows drift tracker to discover those schemas automatically to check in production.

Change 816262 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] abstractSchema: Support 'AbstractSchema' attribute in extension.json

https://gerrit.wikimedia.org/r/816262

@Ladsgroup I was going to create a task about having a structure test that ensures that the autogenerated schema is up-to-date with the JSON definition. I assume core already has something like that, but extensions don't seem to. Is that something that will be done as part of this ticket, or should I create a separate one?

Not that it is done as part of the ticket, it literally is implemented already and the patch is up (see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/816262/10/tests/phpunit/structure/AbstractSchemaValidationTest.php)

Ohhhh nice! Sorry, you can tell that I didn't check the patch before asking :)

Just to store some ideas on how the json could looks like when only the files are provided to validate the abstract json file and to check if the generated sql file matched the json file. This still requires php code in the hook (T258852)

No updates, one sql file
	"SchemaUpdates": {
		"platforms": {
			"*": {
				"tables": "sql/*/tables-generated.sql"
			}
		},
		"abstract": {
			"tables": "sql/tables.json"
		}
	},
No updates, one sql file, but legacy path for mysql
	"SchemaUpdates": {
		"platforms": {
			"*": {
				"tables": "sql/*/tables-generated.sql"
			},
			"mysql": {
				"tables": "sql/tables-generated.sql"
			}
		},
		"abstract": {
			"tables": "sql/tables.json"
		}
	},
Simply case with one full schema file and an update
	"SchemaUpdates": {
		"platforms": {
			"*": {
				"tables": "sql/*/tables-generated.sql",
				"updates": {
					"fix-timestamps": "sql/*/patch-timestamp.sql"
				}
			}
		},
		"abstract": {
			"tables": "sql/tables.json",
			"updates": {
				"fix-timestamps": "sql/abstractSchemaChanges/patch-timestamp.json"
			}
		}
	},
Complex, mapping between platforms and abstract with named keys
	"SchemaUpdates": {
		"platforms": {
			"*": {
				"tables": {
					"globalblocks": "sql/*/tables-globalblocks-generated.sql",
					"whitelist": "sql/*/tables-global_tables-global_block_whitelist-generated.sql"
				},
				"updates": {
					"add-gb_by_central_id": "sql/*/patch-add-gb_by_central_id.sql",
					"global_block_whitelist-timestamps": "sql/*/patch-global_block_whitelist-timestamps.sql",
					"globalblocks-gb_anon_only": "sql/*/patch-globalblocks-gb_anon_only.sql",
					"globalblocks-timestamps": "sql/*/patch-globalblocks-timestamps.sql"
				}
			}
		},
		"abstract": {
			"tables": {
				"globalblocks": "sql/tables-globalblocks.json",
				"whitelist": "sql/tables-global_block_whitelist.json"
			},
			"updates": {
				"add-gb_by_central_id": "sql/abstractSchemaChanges/patch-add-gb_by_central_id.json",
				"global_block_whitelist-timestamps": "sql/abstractSchemaChanges/patch-global_block_whitelist-timestamps.json",
				"globalblocks-gb_anon_only": "sql/abstractSchemaChanges/patch-globalblocks-gb_anon_only.json",
				"globalblocks-timestamps": "sql/abstractSchemaChanges/patch-globalblocks-timestamps.json"
			}
		}
	},
Complex, mapping between platforms and abstract with unnamed keys (order of numeric index)
	"SchemaUpdates": {
		"platforms": {
			"*": {
				"tables": [
					"repo/sql/*/term_store.sql",
					"repo/sql/*/wb_changes_subscription.sql",
					"repo/sql/*/wb_changes.sql",
					"repo/sql/*/wb_id_counters.sql",
					"repo/sql/*/wb_items_per_site.sql",
					"repo/sql/*/wb_property_info.sql"
				],
				"updates": [
					"repo/sql/*/archives/patch-wb_changes-change_object_id-index.json",
					"repo/sql/*/archives/patch-wb_changes-change_timestamp.json"
				]
			}
		},
		"abstract": {
			"tables": [
				"repo/sql/abstract/term_store.json",
				"repo/sql/abstract/wb_changes_subscription.json",
				"repo/sql/abstract/wb_changes.json",
				"repo/sql/abstract/wb_id_counters.json",
				"repo/sql/abstract/wb_items_per_site.json",
				"repo/sql/abstract/wb_property_info.json"
			],
			"updates": [
				"repo/sql/abstractSchemaChanges/patch-wb_changes-change_object_id-index.json",
				"repo/sql/abstractSchemaChanges/patch-wb_changes-change_timestamp.json"
			]
		}
	},

But that is all complex to validate (needs to ensure all files are there and the key used under tables/updates between platforms and abstract are the same)
This seems also to make it complex to integrate the php-less solution which needs more information to avoid the sql file applies on every update.

I went back and forth between different options a lot and finally settled on discovery based on standardization. That would also handle T258852: Replace LoadExtensionSchemaUpdates hook with static data in extension.json as well.

Something along the lines of:

{
	"SchemaUpdates": {
        "baseDir": "sql/",
        "platforms": {
            "*": {
                "schema": [ "linter" ],
                "updates": [
                    [ "addExtensionField", "linter", "linter_foo", "patch-linter_foo" ]
                ]
            },
            "mysql": {
                "schema": [ "linter" ],
                "updates": [
                    [ "addExtensionField", "linter", "linter_foo", "patch-linter_foo" ],
                    [ "modifyExtensionField", "linter", "linter_foo", "patch-linter_foo-varbinary" ]
                ]
            },
            "sqlite": {},
            "postgres": {},
            "abstract": {}
	    }
    }
}

What does this mean? You'd have default * that other platforms can override. And "patch-linter_foo" automatically gets translated to "sql/mysql/updates/patch-linter_foo.sql" before passing to the updater.

The directory structure would be something like this:

sql
├── abstract
│   ├── linter.json
│   └── updates
│       ├── patch-linter_foo.json
│       └── patch-linter-add-template-tag-fields.json
├── mysql
│   ├── linter.sql
│   └── updates
│       ├── patch-linter_foo.sql
│       └── patch-linter_foo-varbinary.sql
├── postgres
│   ├──linter.sql
│   └── updates
│       └── patch-linter_foo.sql
└── sqlite
    ├── linter.sql
    └── updates
        └── patch-linter_foo.sql

If any extension doesn't follow this standard, they need to move the files which is not a big deal but it also makes everything tidier and easier to understand to everyone.

The whole concept of more rdbms support from extensions is quite complex. Most complicating factor is how we can add support for let's say oracle to linter extension. My current thought is that if linter supports it, they can add "oracle":{} to the schema updates platforms but core completely ignores it in updater or schema tests and it would be responsibility of the oracle support extension to write tests making sure the sql and json is the same or pass it properly to the OracleUpdater class, etc.

How does that sound?

I’m not sure I understand the platform part correctly, but it looks like it might require a lot of duplication? Since there’s no explicit mapping between the individual updates in each platform, it looks like any platform-specific mapping would have to completely replace the "*" one – so as soon as I need one MySQL-specific update, I have to copy the whole rest of the "*" updates to MySQL, and always remember to add changes to both "*" and "mysql"

The alternative I have in mind would be something like:

{
	"*": {
		"schema": [ "linter" ],
		"updates": {
			"linter_foo": [ "addExtensionField", "linter", "linter_foo", "patch-linter_foo" ],
			"linter_bar": [ "addExtensionField", "linter", "linter_bar", "patch-linter_bar" ],
			"linter_qux": [ "addExtensionField", "linter", "linter_qux", "patch-linter_qux" ]
		}
	},
	"mysql": {
		// schema inherited from *
		"updates": {
			// linter_foo inherited from *
			"linter_foo_varbinary": [ "modifyExtensionField", "linter", "linter_foo", "patch-linter_foo-varbinary" ], // extra, MySQL-specific
			"linter_bar": null, // remove, does not apply to MySQL
			"linter_qux": [ "addExtensionField", "linter", "linter_qux", "patch-linter_qux-mysql" ], // override
		}
	}
}

That would work for me as well. I think for the first iteration we should go with duplication bit, adding support for de-duplication would be pretty easy (and feel free to implement it). Also note that there won't be much duplication if old schema changes get cleaned up after LTS releases.

+1 for the associative array approach, it allows some level of self-documentation (which is always a pain point with JSON structures), and not having to add new items in multiple places is less error-prone.
Also, it would allow a structure test to see which tables / updates have been customized, and it could then check that for non-customized tables / updates the abstract schema matches the SQL files.

Nit but maybe more comfortable (for human browsing of the files) to store the abstract files at the top level of the SQL directory?

Maybe this would be a good time to look at T314908: MediaWiki database schemas should have a way to indicate the DB/cluster? Shared table handling is a huge pile of hacks currently, it would be nice to be able to fix that.

Nit but maybe more comfortable (for human browsing of the files) to store the abstract files at the top level of the SQL directory?

It could become confusing, e.g. abstract updates will be in a directory called updates/ next to mysql/

Maybe this would be a good time to look at T314908: MediaWiki database schemas should have a way to indicate the DB/cluster? Shared table handling is a huge pile of hacks currently, it would be nice to be able to fix that.

It's not a very common case and I feel adding this to the db structure here could make the array overly complicated. Having a separate config/extension value for this makes more sense to me.

It's not a very common case and I feel adding this to the db structure here could make the array overly complicated.

I don't think adding one extra line to a small subset of the tables complicates things that much.

Having a separate config/extension value for this makes more sense to me.

The key thing here is that you'd want the extension author to indicate in some way which tables belong together. For something like CentralAuth where lots of tables go to the shared DB, you don't want to force the wiki owner to configure them one by one. Also if the next version of CentralAuth adds a new table, the wiki owner should not be required to update the configuration to make the table go to the right place; it should simply go automatically to where the other tables are. I guess this is a bit of a bikeshed but I think that's more elegantly achievable with an extra table field than purely via extension config. (Also extension config is already super complicated and hard to parse, while the table config isn't.)

Change 816262 abandoned by Umherirrender:

[mediawiki/core@master] abstractSchema: Support 'SchemaUpdates' attribute in extension.json

Reason:

Was designed as POC only

https://gerrit.wikimedia.org/r/816262

Umherirrender removed a project: Patch-For-Review.
Umherirrender unsubscribed.

The initiative is over and these are improvements that are welcome but not really any blocker to call the project done.