Page MenuHomePhabricator

Define variant Wikimedia production config in compiled, static files
Open, NormalPublic

Description

Arising from James's previous musing, and discussions at the 2019 Hackathon.

What

  • InitialiseSettings.php (and much of CommonSettings.php) is replaced with per-wiki inheritable YAML files (to allow comments).
  • Actually-variant config goes into a much slimmer CommonSettings.php (or re-worked to not vary).
  • On merge, the YAML files are converted into one JSON file per wiki, for the currently-deployed version(s) of MW, which are stored in git.
  • This replaces the opportunistic cache in /tmp that we current have.

Inheritance tree:

allwikis.yaml
| Default values for all wikis (e.g. wgNamespacesWithSubpages which is over-ridden, or wgEnableCanonicalServerLink which isn't)
|
+- wikipedias.yaml
   | Standard values for Wikipedias, where they differ from defaults (e.g. wgSitename or the fallback logo) and special inheritances
   |
   +- dewiki.yaml 
        Bespoke values for the German Wikipedia (e.g. the logo, or FlaggedRevisions configuration) and other special inheritances

Comparison:

TaskCurrent situationFuture state
Config authored inInitialiseSettings.phpwikipedias.yaml etc.
Config build stepRuntime cache, in /tmp/Build time static file, in /srv/mediawiki/
mw-config mergeTrivial rebaseFull production build of on JSON static file per wiki
Config read stepFrom cache or computed liveAlways read from built static file

Pros

  • Variant configuration will be static, making it more plausible to inject into docker images.
  • It will be much clearer exactly which wikis' config is changing, so deployers have more confidence.
  • YAML configuration files explicitly set the inheritance pattern.
  • Easier to compare one wiki's config with another's (e.g. "how different is dewiki from frwiki?").
  • Clear when the rump of CommonSettings refers to undefined variables; variant config forced to be merged first.

Cons

  • Merging is harder (and slower?).
  • Harder to audit all wikis' config for settings that "shouldn't" be over-ridden.
  • Production branch pruning, currently just a disc operation and a sync, now needs a commit to mw-config as well as a deploy to delete.
  • First time we're reading YAML files in PHP prod.

Open questions

  • Deterministic sort of output files to avoid noise. Is alphasort sufficient?
  • How do we do splicing in private settings at run time?
  • How does the CI work for this?
  • Need to check on build time that a vanilla MediaWiki install (i.e., DefaultSettings) doesn't set any config that isn't represented in allwikis.yaml.
  • Syntax for specifying config, and that a document inherits from another.
  • Syntax for specifying that descendent config can't over-ride (e.g. wgContentHandlerUseDB)
  • Do we need to vary on the PHP run-time still? (once HHVM is undeployed can this go away, or are there reasons beyond PHP serialisation format that we think this might vary?)

Planned steps

Event Timeline

Production branch pruning needs a commit to delete.

Can you elaborate? The link to branch pruning isn't clear to me.

First time we're reading YAML files in PHP prod.

We could potentially use a very strict subset of YAML (like I did for ForeignResourceManager). Which might be easier to reason about in terms of security. (enforced with a composer-test script).

Alternatively, we could standardise on JSON, and use YAML only in Git for wmf-config (letting Scap convert it to JSON). I'm curious how my home-brewed mini-YAML parser compares to the native C component in PHP for parsing JSON.

Production branch pruning needs a commit to delete.

Can you elaborate? The link to branch pruning isn't clear to me.

Done; is this clear enough?

First time we're reading YAML files in PHP prod.

We could potentially use a very strict subset of YAML (like I did for ForeignResourceManager). Which might be easier to reason about in terms of security. (enforced with a composer-test script).
Alternatively, we could standardise on JSON, and use YAML only in Git for wmf-config (letting Scap convert it to JSON). I'm curious how my home-brewed mini-YAML parser compares to the native C component in PHP for parsing JSON.

The idea was that the human-written files are in YAML, with comments, and the compiled files are in JSON; as well as allowing comments, this way it's trivially obvious if a user is touching the wrong files.

Production branch pruning needs a commit to delete.

Can you elaborate? The link to branch pruning isn't clear to me.

Done; is this clear enough?

Kind of :) I understood what you meant by "branch prune" and "commit", but not why it would now require a commit to wmf-config to remove an unused wiki version from prod.

Looking at the top paragraph. Do you envision that the "on merge, convert from YAML to JSON" would happen within Git? Like a Jenkins job amending the commit from CI run-time and then self-merging? Or, do you mean from Scap in production? I assume the latter given the point about "YAML in production", but then - what is it in the Git repo that would need to be removed with a commit when pruning branches?

The idea was that the human-written files are in YAML, with comments, and the compiled files are in JSON; as well as allowing comments, this way it's trivially obvious if a user is touching the wrong files.

Cool. We're saying the same thing. I missed that you already said this in the top paragraph ("on merge, convert from YAML to JSON"). That means we won't need to repeatedly parse YAML during production requests. Note, however, that we should not be parsing JSON repeatedly in production, either. Parsing much of anything in every request without a cache is going to form a bottleneck. We currently store the result of the InitialiseSetting/wgConf merge operation in a static PHP array file in /tmp, which is then parsed once by require'ing in and the naturally cached in RAM by opcache.

We could apply that here as well, as secondary compilation layer. Or, alternatively, we could use APC instead, which is what we did for EtcdConfig (where we had a similar conversation already and choose not to spread the /tmp/.php + opcache pattern further.

Joe added a subscriber: Joe.May 29 2019, 8:57 AM

I don't think we can really read a yaml file at runtime. As php has no way to share information between requests, we'd need to parse the file at every request, or to do something similar to EtcdConfig, where we store the data in APC and then check every N seconds if they've changed.

This would have the attractive advantage we're not actually changing code every time we change a setting, which would greatly help reduce the risk and trouble connected to our deployment methods on php7, see T224491.

There is one relatively scary aspect of this, which is php does APC keys eviction; in a situation with a lot of evictions, we might end up having a performance hit from this. I don't think it's really a risk we're running into right now.

Btw, the hierarchical structure for the override of values you've described above resembles what hiera does for puppet. I'm not 100% sure that's a model I'd recommend in general but it seems to be pretty similar to what we actually do in production now.

Joe added a comment.May 29 2019, 8:58 AM

we could even think of just keeping these files served statically by http from a centralized service, instead of having to distribute them to all instances of mediawiki.

I don't think we can really read a yaml file at runtime.

I agree. However, I'm not proposing that. This would use JSON cached files instead of serialised PHP cached files; to pre-generate them at deploy time rather than opportunistically at run time; and to move them to be stored in git rather than in /tmp.

This would have the attractive advantage we're not actually changing code every time we change a setting, which would greatly help reduce the risk and trouble connected to our deployment methods on php7, see T224491.

Yes, that's the main objective for this work – reducing the amount of "code" changes that we would want to inject into production.

There is one relatively scary aspect of this, which is php does APC keys eviction; in a situation with a lot of evictions, we might end up having a performance hit from this. I don't think it's really a risk we're running into right now.

The current serialised PHP files are ~60 KiB, and nominally read on every execution, so I'd be surprised if it'd get evicted a lot, unless I'm missing something.

Btw, the hierarchical structure for the override of values you've described above resembles what hiera does for puppet. I'm not 100% sure that's a model I'd recommend in general but it seems to be pretty similar to what we actually do in production now.

Oh, interesting.

we could even think of just keeping these files served statically by http from a centralized service, instead of having to distribute them to all instances of mediawiki.

Yes, that could also work; the deploy server would be the config-server too?

Jdforrester-WMF triaged this task as Normal priority.Jun 4 2019, 8:56 PM

Here's my two cents:

  • I have done a similar thing with ores, in couple of months we made a huge Makefile that contains instructions for extracting and building the models for ores into per-wiki config files that turned into a generated Makefile at the end. I have some lessons learned from that action.
  • We need to have a migration path. There's always lots of edge cases that can't be migrated in one go. Migrating model instructions took us months to finish and it was way simpler than InitialiseSettings.php. The way that we did it was we introduced the generated Makefile as "Makefile.automated" and started to include that empty file into the main one and per-wiki we moved the manual instructions to the new one (we also did an audit of each wiki and we found one error per wiki on average, because automation helped us check for inconsistencies). And after we were almost done, we renamed the Makefile to Makefile.manual and Makefile.automated to Makefile. The Makefile.manual still exists for weird edge cases (like wikidata)
  • A similar course of action can be done here. Before getting things fully moved, we can define "InitialiseSettings-generated.php" and load it before InitialiseSettings.php (so values in InitialiseSettings.php override this one) and then in rather large batches move content and horrors of InitialiseSettings.php to per-wiki config files.
  • The "InitialiseSettings-generated.php" can be generated before each deployment, we can put to .gitignore but the script that generates the file needs to show the diff and asks the user if the changes made make sense.
  • Once everything is moved to InitialiseSettings-generated.php, then we can decide on how to properly do config management in mediawiki in production. The script can get all of those config files and put them somewhere else (even etcd directly) if needed. My idea basically narrows down the focus of this solution and make it very local to add space for breathing for future changes.
  • I'm personally in favor of using pretty json for two reasons: 1- There's no native/built-in support for yaml in python or php. I've been dealing with yaml and it's a big headache even with python. 2- json is sorta standard in mediawiki these days, we have extension.json (and similar things like composer.json), etc. 3- (It's not a real reason): We can add support for comments the way we added support for config variables in extension.json. We can ignore any key value that starts with _ or @ in generating the result. (Which also brings the question of should we output them as comment in the result or we should omit comments).

Sorry for long comment and I can have lots of wrong assumptions. Take it with lots of grains of salt.

@Ladsgroup what sort of issues have you had with yaml in python? How complicated are the structures you needed to represent?

@Ladsgroup what sort of issues have you had with yaml in python? How complicated are the structures you needed to represent?

1- There's no built-in support for yaml in python, you need to install pyyaml and sometimes security issues would have been exposed, like T214560: [CVE-2017-18342] pyyaml vulnerability in netbox deploy repository 2- indentation is tricky and problematic sometimes. You know, like this.

I'm not strongly against yaml TBH, I'm just too lazy :)

2- indentation is tricky and problematic sometimes. You know, like this.

Let me drill down on this a bit. The most annoying part is when you make a mistake in json, it explodes, it gives you error "Unable to parse json" but when you make a mistake in yaml, it works but the output is changed in a sneaky way and it explodes when you deploy it :)

Instead of splitting this per-wiki, I'd rather see similar config settings grouped together in the same file (potentially using the same groupings that DefaultSettings.php already uses), with their per-wiki overrides. I think that's more similar to the direction that config in MediaWiki core is eventually (hopefully) going to go.

I would be overwhelmed by a directory with 84 little config files in it. If we're talking some sort of hierarchical 'here's the defaults file, here's the few wikis with exceptions' and it really is a few, not hundreds, then it would make sense to split them out. Just my .02€

hashar added a subscriber: hashar.EditedJun 12 2019, 9:25 PM

@Ladsgroup what sort of issues have you had with yaml in python? How complicated are the structures you needed to represent?

1- There's no built-in support for yaml in python, you need to install pyyaml and sometimes security issues would have been exposed, like T214560: [CVE-2017-18342] pyyaml vulnerability in netbox deploy repository 2- indentation is tricky and problematic sometimes. You know, like this.

PyYAML has a load function which is would unserialize YAML to an object and would happily execute arbitrary code, which is not safe. The proper way to load the YAML file is to instead use the safe_load function. It is not an issue with YAML itself, more about the default proposed by the PyYAML implementation. load is unsafe. Luckily in latest versions that now emits a warning and folks are instructed to use safe_load instead.

The CVE got filled to expose the problem and its being addressed to have defaults "sane" to humans.

Regardless, I would rather not introduce python as a required software for operations/mediawiki-config.git. It sounds nicer to use PHP and we already use some library to add YAML support (in Translate via a composer dependency iirc). So that YAML support is a solved problem already.

https://github.com/krakjoe/uopz/commit/e94fe6cace79edc700e520045b485f97605b24d2 is an indentation trouble. Surely Json does not have those troubles since it does field delimitation based on comma. That has its own trouble as well:

{ wgFoo: [
  'default' => 1,
 'enwiki' => 2,
]
}

That is not valid due to the trailing comma on the third line. I also argue that json is nice for software consumption but is far from being human friendly, primarily due to lack of support for comments.

I'm not strongly against yaml TBH, I'm just too lazy :)

+1 I am lazy as well :-] Then json has its own problem as well!

EDIT: or we could just use some plain good old PHP files which have the benefit to be in the bytecode cache!

Change 525698 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/mediawiki-config@master] Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator"

https://gerrit.wikimedia.org/r/525698

hashar removed a subscriber: hashar.Tue, Aug 20, 9:21 AM
Jdforrester-WMF updated the task description. (Show Details)

Change 507729 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/mediawiki-config@master] [WIP] Variant configuration: Pre-calculate config for each wiki and store it in config.git

https://gerrit.wikimedia.org/r/507729

Change 533592 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/mediawiki-config@master] Variant configuration: Write to static (JSON) as well as serialised cache

https://gerrit.wikimedia.org/r/533592

Change 533593 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/mediawiki-config@master] Variant configuration: Read from JSON, not serialised PHP

https://gerrit.wikimedia.org/r/533593

Change 533594 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[operations/mediawiki-config@master] Variant configuration: Never write to serialised PHP, drop support

https://gerrit.wikimedia.org/r/533594

Yeah, that's presumably also a blocker for pre-calculate locally. Unless we remove a lot of features and complexity, that's going to be fragile and heavy at best.
A few ideas to make it work:

  • InitialiseSettings.php to return an array (easy, no longer interact with wgConf directly in that file, pass in CS.php instead)
  • Remove use of classes, functions and constant in its value computation (easy-ish, not something we do much. Main one is NS constants, which we could maybe turn into class constants within wmf-config e.g. Wikimedia\MWConfig\Namespaces::FILE = 4). Hardcoding that is ugly, but necessary to make it independent. Should be safe anyway, given we don't allow changing these. We also kind of do it already, given we have a copy of defines.php here for testing. We'd be able to rid that.
  • Remove any reliance on the '+xyz' array-merge feature of SiteConfiguration. This is the main issue and reason why we rely on having access to MW-core's DefaultSettings and all the registered extensions' settings - it's to be able to merge complex values.

After that we'd be left with a standalone computable config.

It would compute for each wiki the globals that need to be set over top of the core/ext defaults (same as we do today).

Change 525698 merged by jenkins-bot:
[operations/mediawiki-config@master] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator

https://gerrit.wikimedia.org/r/525698

Tagging scap as this implies changes to scap's code.

@Jdforrester-WMF Do you want to try incorporating some of those ideas and requirements necessary to be able to build the expanded config ahead of time? (for the use cases of local/CI/Git etc).

There's not an urgency on the intermediary progression from serialised PHP to JSON as far as I know, and I'd rather get that in place at the same time, possibly rebased ahead of it even so as to have full test coverage before going into these other changes as well (such as the multi-file config and e.g. YAML).

Change 533592 merged by jenkins-bot:
[operations/mediawiki-config@master] Variant configuration: Write to static (JSON) as well as serialised cache for testwiki

https://gerrit.wikimedia.org/r/533592

Mentioned in SAL (#wikimedia-operations) [2019-09-10T17:31:12Z] <jforrester@deploy1001> Synchronized wmf-config/CommonSettings.php: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki T223602 (duration: 01m 02s)

@Jdforrester-WMF Do you want to try incorporating some of those ideas and requirements necessary to be able to build the expanded config ahead of time? (for the use cases of local/CI/Git etc).
There's not an urgency on the intermediary progression from serialised PHP to JSON as far as I know, and I'd rather get that in place at the same time, possibly rebased ahead of it even so as to have full test coverage before going into these other changes as well (such as the multi-file config and e.g. YAML).

Yeah, I've been thinking about that. Will update this task.

Change 536321 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] beta: Remove unused wgConf feature of '+beta' tag

https://gerrit.wikimedia.org/r/536321

Change 536321 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: Remove unused wgConf feature of '+beta' tag

https://gerrit.wikimedia.org/r/536321

Change 536323 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] beta: Move the only dynamic config from IS-labs to CS-labs

https://gerrit.wikimedia.org/r/536323

Change 536323 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: Move the only dynamic config from IS-labs to CS-labs

https://gerrit.wikimedia.org/r/536323

Change 536345 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] Remove dependency on wgConfig from wmf-config/InitialiseSettings-labs.php

https://gerrit.wikimedia.org/r/536345

Change 533593 merged by jenkins-bot:
[operations/mediawiki-config@master] Variant configuration: Read from JSON, not serialised PHP, for testwiki

https://gerrit.wikimedia.org/r/533593

Mentioned in SAL (#wikimedia-operations) [2019-09-12T23:18:00Z] <jforrester@deploy1001> Synchronized multiversion/MWConfigCacheGenerator.php: T223602 Add ability to read config from JSON, not serialised PHP (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-09-12T23:19:32Z] <jforrester@deploy1001> Synchronized wmf-config/CommonSettings.php: T223602 Read config from JSON, not serialised PHP on testwiki (duration: 01m 03s)

jijiki added a subscriber: jijiki.Fri, Sep 13, 5:44 AM