
[RFC] Community configuration 2.0: Validating site configuration
Closed, Resolved · Public · Nov 8 2023

Description

Project introduction

The Growth-Team is currently working on the Community configuration 2.0 project, which allows wiki administrators to configure MediaWiki on-wiki, without having to make a Phabricator request. It is a continuation and improvement of the Community configuration 1.0 project, which is implemented within the GrowthExperiments extension as one of its integral parts.

Detailed information about the project, the issues identified in the CC1.0 implementation, and a high-level overview of its architecture can be found in the Community configuration 2.0 Product Requirements Document. Feel free to review that document as well (feedback can be submitted as comments within the linked PRD document).

If possible, all feedback should be provided by the end of November 08, 2023.

Introduction of the problem

When allowing on-wiki administrators to configure MediaWiki, one needs to ensure the configuration is in the expected format. Otherwise, it would be possible to take down the site via an on-wiki change, which is not desirable.

The first (Growth-specific) version, Community configuration 1.0, has basic data-type validation, which was engineered by the Growth team. It does not scale well as more and more data is stored, and it is only capable of catching the most basic errors.

Task scope

The scope of this task is to determine the validation approach that should be used within the Community configuration 2.0 project. Namely, the following questions should be answered:

  1. What should be the interface of a validator?
  2. How should extensions/skins register their configuration providers?
  3. Which JSON Schema validation library should be used to enforce the validation rules?
  4. How to ensure backwards compatibility when the format of a configuration value changes?
  5. What other considerations should be kept in mind when implementing validation?

Proposed solution

(1) What should be the interface of a validator?

The Community configuration 2.0 backend extension introduces the concept of a configuration provider (a class with the ability to read/write on-wiki config). Each provider has an assigned validator, which is responsible for (a) validating the configuration and (b) determining where the underlying schema lies.

Validators need to be implemented within the Community configuration extension, and conform to the following interface (source link):

interface IValidator {

	/**
	 * Validate passed config
	 *
	 * This is executed by WikiPageConfigWriter _before_ writing a config (for edits made
	 * via the GrowthExperiments-provided interface), by ConfigHooks for manual edits and
	 * by WikiPageConfigLoader before returning the config (this is to ensure invalid config
	 * is never used).
	 *
	 * @param array $config Associative array representing config that's going to be validated
	 * @return StatusValue
	 */
	public function validate( array $config ): StatusValue;

	/**
	 * Return the list of supported top-level keys
	 *
	 * This is useful for IConfigurationProvider implementations; this information can be used
	 * to decide whether a certain configuration request asks for information that may be
	 * present (but is missing from the store at the moment), or whether the information
	 * cannot exist at all (and thus the request is invalid). An example is deciding whether
	 * Config::get should throw, or return a default value.
	 *
	 * @return array List of top-level key names
	 */
	public function getSupportedTopLevelKeys(): array;
}
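For illustration only, here is a minimal sketch of what a jsonschema-backed implementation of this interface might look like. The class name, the message key and the exact library calls are assumptions for the example, not part of the proposal:

class JsonSchemaValidator implements IValidator {

	private string $schemaPath;

	public function __construct( string $schemaPath ) {
		$this->schemaPath = $schemaPath;
	}

	public function validate( array $config ): StatusValue {
		// Opis validates decoded JSON values, not PHP associative arrays
		$data = json_decode( json_encode( $config ) );
		$schema = json_decode( file_get_contents( $this->schemaPath ) );

		$result = ( new \Opis\JsonSchema\Validator() )->validate( $data, $schema );
		if ( $result->isValid() ) {
			return StatusValue::newGood();
		}
		// Hypothetical message key; mapping validation errors to localized
		// messages is discussed under question (3) below
		return StatusValue::newFatal( 'communityconfiguration-invalid-config' );
	}

	public function getSupportedTopLevelKeys(): array {
		$schema = json_decode( file_get_contents( $this->schemaPath ), true );
		return array_keys( $schema['properties'] ?? [] );
	}
}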
(2) How should extensions/skins register their configuration providers?

Community configuration 2.0 introduces the concept of a “configuration provider”, which represents an entity providing configuration for a certain feature or feature set (for example, the variables needed for Growth’s Suggested edits might be one provider). Each provider is a pair of a storage backend and a validation backend, which define where the configuration is stored and how it is validated. All supported storage/validation backends need to be implemented within Community configuration 2.0 itself.

In the initial version, jsonschema would be the only provided validator; it takes a path to the JSON Schema describing the configuration file. This schema describes each of the supported configuration variables. Community configuration 2.0 ensures that the configuration it returns is valid against this schema, and thus in the format the extension expects.

In the initial version of Community configuration 2.0, we focus on configuration providers that are outside of MediaWiki Core. Extensions or skins can register their own providers via extension.json or skin.json respectively:

{
    "CommunityConfigurationProviders": {
        "example": {
            "storage": {
                "type": "wikipage",
                "args": [
                    "MediaWiki:Example.json"
                ]
            },
           "validator": {
                "type": "jsonschema",
                "args": [
                    "relative/path/to/json/schema.json"
                ]
            }
        }
    }
}
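The referenced schema file could then look roughly as follows. The variable names are invented for the example; a real schema would describe the provider's actual variables:

{
    "$schema": "https://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "ExampleFeatureEnabled": {
            "type": "boolean",
            "default": false
        },
        "ExampleLinkTarget": {
            "type": "string"
        }
    },
    "additionalProperties": false
}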
(3) Which JSON Schema validation library should be used to enforce the validation rules?

There is already a fair number of MW components doing JSON validation for different use cases. The goal is to select a library that can cover this project's use cases and can also be reused by existing components. In that direction, three libraries have been evaluated, and the current proposal is to use Opis/json-schema, because of its repository code health, its schema version support, and because it is already in use in MW. A summary of the libraries' trade-offs can be found in T332847#9228639.

A remaining challenge is to provide localization support for the validation error messages. None of the researched libraries has an i18n feature built in. However, it is possible to build a localization layer around the chosen library, or to collaborate with the maintainers to make i18n part of the library itself.
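To sketch what such a layer could look like, a wrapper would translate the failed schema keyword and the path of the offending value into a localizable message. The helper name and the message key naming scheme below are purely hypothetical:

/**
 * Hypothetical helper: turn a failed JSON Schema keyword and the path of
 * the offending value into a localizable StatusValue.
 */
function validationErrorToStatus( string $failedKeyword, string $dataPath ): StatusValue {
	return StatusValue::newFatal(
		'communityconfiguration-validation-' . $failedKeyword, // e.g. ...-validation-type
		$dataPath // e.g. "/ExampleLinkTarget"
	);
}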

(4) How to ensure backwards compatibility when the format of a configuration value changes?

Occasionally, there is a need to make a breaking change to the configuration schema. Such situations should be resolvable via a migration interface, which allows developers to convert their data from an older format to a newer one. This is done by introducing migration classes, each registered as a step from one schema version to the directly following one (see the sketch below this list). Backwards compatibility is accomplished by:

  1. Each wiki page validated by a jsonschema validator has a $schema special variable, which holds the name of the schema (including a version identifier) that the page adheres to.
  2. The extension that registered the configuration page also registers a list of migration classes that are capable of converting a configuration from a directly preceding schema to a new one.
  3. Upon invocation of a maintenance script, Community configuration 2.0 converts the content of the wiki page.
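To make the mechanism more concrete, here is a sketch of what a migration step could look like. The interface name, method names and schema versions are hypothetical; the actual shape of the migration interface is an implementation detail:

/**
 * Hypothetical migration interface: one implementation per schema version bump.
 */
interface ISchemaMigration {
	/** Schema version this migration upgrades from */
	public function getFromVersion(): string;

	/** Convert $config from the preceding schema version to the next one */
	public function migrate( array $config ): array;
}

class ExampleMigration_1_0_0_To_1_1_0 implements ISchemaMigration {
	public function getFromVersion(): string {
		return '1.0.0';
	}

	public function migrate( array $config ): array {
		// Example breaking change: a single page name became a list of page names
		$config['ExampleLinkTarget'] = [ $config['ExampleLinkTarget'] ];
		return $config;
	}
}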
(5) What other considerations should be kept in mind when implementing validation?

If there is anything we didn’t consider and we should know about, please tell us in the comments in this task. If you have any questions that are not related to the validation, feel free to ask them on the MediaWiki.org project talk page.

Proof of concept implementation

To better understand the proposed solution, we have prepared a proof of concept implementation, which is hosted on Wikimedia GitLab. Currently, it consists of three different parts: the CommunityConfiguration backend extension, an example extension demonstrating how a provider registers its configuration, and required changes in MediaWiki Core (sandbox/urbanecm/community-configuration in Gerrit).

Instructions on how to set up the PoC version of Community configuration 2.0 locally are available in the backend repository's README.


Details

Due Date
Nov 8 2023, 10:59 PM


Event Timeline

Urbanecm_WMF changed Due Date from Nov 28 2023, 10:59 PM to Nov 8 2023, 10:59 PM.
Urbanecm_WMF changed the subtype of this task from "Task" to "Deadline".

In that direction three libraries have been evaluated and the current proposal is to use Opis/json-schema

Core already depends on justinrainbow/json-schema. I'm not opposed to changing that, but I'd suggest to not use different validators in different parts of the code.

I see that WikiLambda is using the Opis validator. It would be good if we could agree on one that we use everywhere.

Upon invocation of a maintenance script, Community configuration 2.0 converts the content of the wiki page.

Will that script be invoked automatically by the updater? Or can conversion be automatic on the next edit?

If there is anything we didn’t consider and we should know about, please tell us in the comments in this task.

When using the justinrainbow validator with core config, we came across a couple of quirks, e.g. for numeric keys like namespace IDs in objects (maps). See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/Settings/Config/ConfigSchemaAggregator.php#454
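The underlying quirk is easy to reproduce in plain PHP, independently of the validation library:

// JSON objects keyed by namespace IDs do not round-trip cleanly through
// PHP associative arrays: integer-like string keys become integers...
$map = json_decode( '{"0": "Main", "2": "User"}', true );
var_dump( array_keys( $map ) ); // int(0), int(2)

// ...and re-encoding a sequential integer-keyed array yields a JSON *array*,
// not the object the schema expects:
echo json_encode( [ 0 => 'Main', 1 => 'Talk' ] ); // ["Main","Talk"]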

Experience has also shown that it is beneficial to maintain the schema as constants in a PHP class, from which we can then generate a JSON Schema. This is nice because we can use constants for things like namespace IDs, and can make use of conveniences such as nullable types like "?string", or "list" and "map" rather than "array" and "object". Have a look at how ReflectionSchemaSource (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/Settings/Source/ReflectionSchemaSource.php) is used by GenerateConfigSchema to generate a JSON schema from the MainConfigSchema class.
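A rough sketch of that pattern, loosely modeled on how MainConfigSchema declares its settings (the class and setting names here are invented):

class ExampleConfigSchema {
	/**
	 * Keys are namespace IDs; "map" rather than "object", and a real
	 * NS_* constant instead of a hardcoded numeric key.
	 */
	public const ExampleNamespaces = [
		'type' => 'map',
		'default' => [ NS_MAIN => true ],
	];

	public const ExampleLinkTarget = [
		'type' => '?string', // nullable shorthand
		'default' => null,
	];
}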

Even if you don't want to maintain the schema as a PHP class, I suggest you look into using core's ConfigSchemaAggregator and ConfigBuilder classes.

Required changes in MediaWiki Core: sandbox/urbanecm/community-configuration in Gerrit

Core shouldn't have to know about CommunityConfiguration in this way, and I don't think it needs to. You can just declare "attributes" that can be set by other extensions: https://www.mediawiki.org/wiki/Manual:Extension.json/Schema#attributes

(The "attribute" mechanism should itself make use of ConfigSchemaAggregator and use JSON schema at some point...)

From a technical point of view: where should the semantic validation happen? Say the community defines a wiki page as the value for some config; then we may want to check whether that page exists, is in the right namespace, or similar things. I understand that this would not be part of the JSON Schema validation step that the RFC outlines so far.

While I'm just a mid-level engineer, I could imagine Product/UX wanting to communicate somehow: "There are problems with this configuration."
Such problems could arise both from a mistake (e.g., a typo) in the configuration being saved and from something in the wiki changing, like the configured page being deleted. I wonder if it would make sense to think about a list of "suggested patterns" (both in the UX/Product sense and on the technical side) for dealing with semantic validation steps?
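For instance, such a semantic check might look roughly like this (the function and message key are made up; where exactly it would hook in is the open question):

// Hypothetical check run after JSON Schema validation has passed:
// verify that a configured page actually exists on the wiki.
function checkConfiguredPageExists( TitleFactory $titleFactory, string $pageName ): StatusValue {
	$title = $titleFactory->newFromText( $pageName );
	if ( !$title || !$title->exists() ) {
		return StatusValue::newFatal( 'communityconfiguration-page-missing', $pageName );
	}
	return StatusValue::newGood();
}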

A few random ideas/comments I had while reading this proposal:

Core already depends on justinrainbow/json-schema. I'm not opposed to changing that, but I'd suggest to not use different validators in different parts of the code. [...] I see that WikiLambda is using the Opis validator. It would be good if we could agree on one that we use everywhere.

Agreed, consistency would be nice.

Experience has also shown that it is beneficial to maintain the schema as constants in a PHP class, from which we can then generate a JSON Schema. This is nice because we can use constants for things like namespace IDs, and can make use of conveniences such as nullable types like "?string", or "list" and "map" rather than "array" and "object".

While I'm unsure about the implementation details, I strongly agree with the idea of using PHP things like constants and types. This is one of the main disadvantages of extension.json files.

None of the researched libraries has an i18n feature built in. However, it is possible to build a localization layer around the chosen library, or to collaborate with the maintainers to make i18n part of the library itself.

In my opinion, this should be a strong requirement, i.e., part of the initial implementation. Maybe that's already the plan, I just wanted to point this out explicitly.

How to ensure backwards compatibility when the format of a configuration value changes?

The proposal with the $schema version seems fine; however, I'm wondering if we have any data on the frequency and scope of such breaking changes in the existing config. In particular, I'm imagining a scenario where we have a hypothetical single config provider which defines lots of variables. Bumping the version of the whole schema every time any individual setting undergoes a breaking change might potentially increase the burden on the community to keep things up-to-date. You may argue that this might be avoided by preferring multiple providers, each defining just a small number of variables. But this would shift the problem to having to keep track of multiple providers.

I guess my (vague) idea was to have some sort of per-setting versioning, so that it's easier for users to understand what needs to be changed. OTOH, I guess this would have its own issues, because you would have to either require a version identifier for each setting, or find a way to automatically detect what version is used, or automagically add a version identifier only when the breaking change happens; none of which is ideal.

At any rate, I think this would be solved by providing an interface where users can see what needs to be updated and how, i.e., what @Michael said above. If we have some interface saying something like "the schema is outdated: setting X now needs to be Y", then any backend versioning would work, because it would be much easier to manage on the user side.

The latest iteration of the proposal LGTM. I have a few suggestions that might make it better, for you to consider as you wish:

  • Relative schema file reference.

The current draft patches specify the schema relative to the extensions directory, thus requiring hardcoding of the extension subdirectory itself. I'm guessing this is unintentional, but in case it is not, I would suggest making this relative to extension.json. The attribute is not a plain array, since it involves file paths: the question is not if, but when to process (before or after), and how to process (expand to the extensions dir, or expand to the extension subdir). We usually pick "before" and "subdir", which seems more ergonomic, and has the benefit of working with all supported MediaWiki setups.

  • Non-global name.

As of https://gitlab.wikimedia.org/repos/growth/community-configuration-example/-/blob/82c99a9810c3efe8830b87e82bf26a60daabca94/src/ServiceWiring.php

The logic currently assigns a name for each config source (e.g. MediaWiki:ExampleConfig.json is registered as provider FooBar), but the example currently involves a CommunityConfigurationExample\Config\WikiPageConfig class and CommunityConfigurationExample.WikiPageConfig service. Are these required? Do we want to encourage each user of this to create its own class and service for this?

It seems like no fundamental decisions or knowledge are currently encoded in this. It feels like boilerplate. But perhaps there's a use case for it?

If feasible, I would love to see a simpler approach where all a provider needs to do is define an extension attribute (CommunityConfigurationProviders) and then immediately be able to access it from the centrally-defined factory service (defined in the CommunityConfig extension). In the SpecialPage example, you would then be able to directly set $this->wikiConfig = $communityConfigFactory->getProvider( 'FooBar' );.

Upon invocation of a maintenance script, Community configuration 2.0 converts the content of the wiki page.

Will that script be invoked automatically by the updater? Or can conversion be automatic on the next edit?

Good question! I think it should be in both places: in the updater to facilitate removal of handling for the older format, and in the next edit to simplify saving. Would that make sense?

In that direction three libraries have been evaluated and the current proposal is to use Opis/json-schema

Core already depends on justinrainbow/json-schema. I'm not opposed to changing that, but I'd suggest to not use different validators in different parts of the code.

I see that WikiLambda is using the Opis validator. It would be good if we could agree on one that we use everywhere.

Long-term, certainly. Since we depend on both in production already, I don't think using either makes the problem significantly worse.

Required changes in MediaWiki Core: sandbox/urbanecm/community-configuration in Gerrit

Core shouldn't have to know about CommunityConfiguration in this way, and I don't think it needs to. You can just declare "attributes" that can be set by other extensions: https://www.mediawiki.org/wiki/Manual:Extension.json/Schema#attributes

(The "attribute" mechanism should itself make use of ConfigSchemaAggregator and use JSON schema at some point...)

Oh, wonderful! Thanks for the info. I filed T351227: Community configuration 2.0: Rewrite the CommunityConfiguration PoC extension so it does not require extension.json schema changes to incorporate the change.

A few random ideas/comments I had while reading this proposal:

Thank you!

Core already depends on justinrainbow/json-schema. I'm not opposed to changing that, but I'd suggest to not use different validators in different parts of the code. [...] I see that WikiLambda is using the Opis validator. It would be good if we could agree on one that we use everywhere.

Agreed, consistency would be nice.

Agreed, but I don't think this proposal introduces the inconsistency. We can certainly consider changing one library with the other though!

Experience has also shown that it is beneficial to maintain the schema as constants in a PHP class, from which we can then generate a JSON Schema. This is nice because we can use constants for things like namespace IDs, and can make use of conveniences such as nullable types like "?string", or "list" and "map" rather than "array" and "object".

While I'm unsure about the implementation details, I strongly agree with the idea of using PHP things like constants and types. This is one of the main disadvantages of extension.json files.

Good point! I don't think we've considered this aspect. I filed T351232: Community configuration 2.0: Consider generating JSONSchema from PHP classes rather than committing them directly for us to consider.

None of the researched libraries has an i18n feature built in. However, it is possible to build a localization layer around the chosen library, or to collaborate with the maintainers to make i18n part of the library itself.

In my opinion, this should be a strong requirement, i.e., part of the initial implementation. Maybe that's already the plan, I just wanted to point this out explicitly.

Thank you for doing that. While I agree an i18n layer should exist, I'm not sure whether it's 100% required, as during normal operation, you'd only see the error message when doing direct edits to the JSON file. However, we're trying to include the i18n layer if at all possible, just wanted to note this aspect as well, in case you have any thoughts on this.

How to ensure backwards compatibility when the format of a configuration value changes?

The proposal with the $schema version seems fine; however, I'm wondering if we have any data on the frequency and scope of such breaking changes in the existing config. In particular, I'm imagining a scenario where we have a hypothetical single config provider which defines lots of variables. Bumping the version of the whole schema every time any individual setting undergoes a breaking change might potentially increase the burden on the community to keep things up-to-date. You may argue that this might be avoided by preferring multiple providers, each defining just a small number of variables. But this would shift the problem to having to keep track of multiple providers.

I guess my (vague) idea was to have some sort of per-setting versioning, so that it's easier for users to understand what needs to be changed. OTOH, I guess this would have its own issues, because you would have to either require a version identifier for each setting, or find a way to automatically detect what version is used, or automagically add a version identifier only when the breaking change happens; none of which is ideal.

At any rate, I think this would be solved by providing an interface where users can see what needs to be updated and how, i.e., what @Michael said above. If we have some interface saying something like "the schema is outdated: setting X now needs to be Y", then any backend versioning would work, because it would be much easier to manage on the user side.

Thanks for sharing those concerns. I don't believe we have any hard data on how often backwards-incompatible changes were made, but I can say that within Growth, we've run into trouble with this more frequently than desirable (and more often than I would have expected beforehand).

You've mentioned that bumping the schema version each time there is any backwards incompatible change "might potentially increase the burden on the community to keep things up-to-date". I'm not sure I understand how that would be. Would you mind clarifying? In theory, all changes should be migrated by the responsible team – Community configuration 2.0 should merely make it possible for them.

Thank you for doing that. While I agree an i18n layer should exist, I'm not sure whether it's 100% required, as during normal operation, you'd only see the error message when doing direct edits to the JSON file. However, we're trying to include the i18n layer if at all possible, just wanted to note this aspect as well, in case you have any thoughts on this.

That's a good point. I still think it would be fairly important given how much we care about i18n, but clearly this is just a personal opinion :)

You've mentioned that bumping the schema version each time there is any backwards incompatible change "might potentially increase the burden on the community to keep things up-to-date". I'm not sure I understand how that would be. Would you mind clarifying? In theory, all changes should be migrated by the responsible team – Community configuration 2.0 should merely make it possible for them.

I think I just missed the part about migrations being done by developers. I was just assuming that users would be required to keep track of what's been deprecated and change things themselves.

Thank you for doing that. While I agree an i18n layer should exist, I'm not sure whether it's 100% required, as during normal operation, you'd only see the error message when doing direct edits to the JSON file. However, we're trying to include the i18n layer if at all possible, just wanted to note this aspect as well, in case you have any thoughts on this.

That's a good point. I still think it would be fairly important given how much we care about i18n, but clearly this is just a personal opinion :)

I agree with that general sentiment; I just happen to challenge it at the same time.

You've mentioned that bumping the schema version each time there is any backwards incompatible change "might potentially increase the burden on the community to keep things up-to-date". I'm not sure I understand how that would be. Would you mind clarifying? In theory, all changes should be migrated by the responsible team – Community configuration 2.0 should merely make it possible for them.

I think I just missed the part about migrations being done by developers. I was just assuming that users would be required to keep track of what's been deprecated and change things themselves.

Ah, I see. Definitely no user migrations would be required, that'd be terrifying :).

Thank you for doing that. While I agree an i18n layer should exist, I'm not sure whether it's 100% required, as during normal operation, you'd only see the error message when doing direct edits to the JSON file. However, we're trying to include the i18n layer if at all possible, just wanted to note this aspect as well, in case you have any thoughts on this.

That's a good point. I still think it would be fairly important given how much we care about i18n, but clearly this is just a personal opinion :)

I agree with that general sentiment; I just happen to challenge it at the same time.

FYI: This was filed as T351879: Support i18n in validation errors.

In that direction three libraries have been evaluated and the current proposal is to use Opis/json-schema

Core already depends on justinrainbow/json-schema. I'm not opposed to changing that, but I'd suggest to not use different validators in different parts of the code.

Right, I acknowledge and agree with the suggestion, but I find it out of the scope of this project to try to consolidate the library (and possibly the different schema versions). However, our proposal does not make core depend on a different library; it only uses Opis in the CommunityConfiguration context, as WikiLambda does.

I see that WikiLambda is using the Opis validator. It would be good if we could agree on one that we use everywhere.

To that end, we're trying to align WikiLambda's and CommunityConfiguration's Opis versions (T319054), and to fit the restriction of a single deployed version per PHP package.

Thank you everyone for the comments here. As far as I can see, all of the suggestions gathered in this discussion have been logged as Phabricator tasks. With that in mind, I'm resolving the task. Any future feedback is welcome at the project's MediaWiki.org talk page.