Page MenuHomePhabricator

Feedback request: SettingsLoader: move DefaultSettings.php to YAML
Closed, ResolvedPublic

Description

Goal of this task is to solicit opinions on storing MW config in YAML instead of JSON

TLDR: we want to store DefaultSettings.php to YAML. Depending on the feedback, we might consider storing more things in YAML.

The overall goal of the SettingsLoader project is to be able to load the majority of our configuration from completely static sources (JSON files, YAML files, PHP arrays, etc) instead of executing PHP code currently residing in CommonSettings.php

In the current prototype we define settings files which can contain config key definitions (in the form of json schema for each key) and the config values as a simple map.

Example:

config-schema:
  GroupPermissions:
    type: object
    additionalProperties:
      type: object
      additionalProperties:
        type: boolean
    mergeStrategy: array_plus_2d
    default:
      sysop:
        edit: true
    description: |-
      This is the description of the config variable.

In this approach wiki configuration is split into layers of settings files, like 'default', 'common', 'group1', 'enwiki' with every further layer overriding (or merging according to the merge strategy) the more specific values into already existing configs. The system is very similar to extension loading. The result is then cached in an injectable cache (APCu at this point of thinking), so if the config files are only re-read again if they change. Loading of the settings files is controlled using a SettingsBuilder class and can be freely interleaved with current PHP variable setting.

We think that the primary format for the settings files should be YAML.

Benefits of YAML:

  • Better readability. Compare JSON with YAML
  • Support for comments. In a static config system we need some place to store code comments, like a ticket number for why the setting was changed, etc.
  • Support for multiline string blocks - for config definitions, currently residing in DefaultSettings.php we need a place to write the documentation for every config key. Currently the documentation is in PHPDoc blocks, in a new system the natural place for that is 'description' key in the schema, but the docs have to be multiline to be readable.
  • Better integration with helm. YAML is the preferred format in Helm charts, and at least some of our settings files will probably be generated using Helm from values.yaml

Disadvantages of YAML

  • Slower parsing? This should not matter to us since the in configs the parse result is always cached, and only reloaded when config was changed.
  • ...???

Please tell us your opinions on the YAML vs JSON question. We will be holding additional consultations on the project overall, but we would like to know your opinions on this specific question earlier cause we think it will likely be a point of contention and want to address it before we have the complete prototype to present.

Event Timeline

JSON definitely is not an ideal format for this kind of stuff, but it's well standardized has plenty of implementations, including being bundled with PHP. YAML has different spec versions, different parser quality, and many parsers allow for object (de)serialization by default. The spec itself has weird oddities that are designed to be nicer for humans but IMO end up being more confusing, like a plain no actually being boolean false rather than the string "no".

My preference these days is https://toml.io/en/ which I think meets all the shortcomings of JSON and YAML. It's in broad use with both Python (pyproject.toml) and Rust (Cargo.toml) adopting it. If TOML were around 5+ years ago and had such adoption back then, I think I would've pushed for extension.toml and not extension.json :)

I have not looked at the PHP TOML implementation's code quality but I don't imagine it would be too difficult to update/fix as needed.

@Legoktm I've tried out converting our existing DefaultSettings into TOML. With introduction of config schemas the general structure for each key would be something like this in YAML:

config-schema:
  MyConfigVariable:
    // Json-schema goes here
    type: [ array, null, boolean ]
    mergeStrategy: array_plus_2d
    default:
      Test: test
      bla: 1
      other_stuff: null

in TOML it would look somewhat like this:

[config-schema]

[config-schema.MyConfigVariable]
"config-schema.MyConfigVariable.type" = ["array", "null", "boolean"]
"config-schema.MyConfigVariable.mergeStrategy" = "array_plus_2d"

[config-schema.MyConfigVariable.default]
"config-schema.MyConfigVariable.default.Test" = "test"
"config-schema.MyConfigVariable.default.bla" = 1

Some of the problems I've encountered right from the start:

  1. TOML does not allow nulls and recommends just omitting the value. However in PHP there is a big difference between absent array key and null. For actual configuration variables there shouldn't a a difference, but currently some of our configs are very complex multi-level arrays. And simply omitting the values deep in the nested structures will be dangerous.
  2. TOML doesn't allow mixed types in arrays. Again, this is a good thing, but we use mixed arrays in our configs right now.
  3. TOML doesn't allow key to be '0'. Not sure if that's the issue with the only existing PHP package, that empty( '0' ) === true or an issue with the spec. We use that extensively.

All of these problems are actually indicative of our problems, our configs are just overly complicated and not well-enough defined. However, moving them to TOML doesn't seem feasible in this particular project. WE could consider doing it later - when all the configs are static files and have a schema hopefully the task of cleaning them up will be much more manageable.

@Legoktm I've tried out converting our existing DefaultSettings into TOML. With introduction of config schemas the general structure for each key would be something like this in YAML:
<snip>
Some of the problems I've encountered right from the start:

  1. TOML does not allow nulls and recommends just omitting the value. However in PHP there is a big difference between absent array key and null. For actual configuration variables there shouldn't a a difference, but currently some of our configs are very complex multi-level arrays. And simply omitting the values deep in the nested structures will be dangerous.

I...did not realize that and on this alone TOML probably won't work.

  1. TOML doesn't allow mixed types in arrays. Again, this is a good thing, but we use mixed arrays in our configs right now.

From https://toml.io/en/v1.0.0#array "Values of different types may be mixed."

  1. TOML doesn't allow key to be '0'. Not sure if that's the issue with the only existing PHP package, that empty( '0' ) === true or an issue with the spec. We use that extensively.

This does seem like a limitation in the PHP implementation, https://toml.io/en/v1.0.0#keys even says that empty strings are allowed as keys.

All of these problems are actually indicative of our problems, our configs are just overly complicated and not well-enough defined. However, moving them to TOML doesn't seem feasible in this particular project. WE could consider doing it later - when all the configs are static files and have a schema hopefully the task of cleaning them up will be much more manageable.

Agreed. I appreciate you taking the time to evaluate TOML and am hopeful in the future we can re-consider this.

Pchelolo claimed this task.

We've consulted with stakeholders identified in the Tech forum proposal and haven't seen too much opposition to the idea of YAML. It probably will not be the final destination, but it seems like an ok intermediate step in cleaning up the configs. I'm resolving this ticket.

Notes from Perf-PET meeting on this, for transparency and future reference:

Thinking of read mtime from disk (OS fscache), use as apcu key, store array-parsed JSON.

wmf-config currently caches the expanded wgConf set in /tmp as JSON file, with mtime based on a single file only. Scap touches it to ensure atomicity and to reduce number of file checks.

We are thinking of removing this as the file read and JSON parse are basically as expensive or more expensive than calling wgConf to do the merge at run-time. And would reduce amount of data read and parsed into memory overall (all wikis combined). T169821

Apart from file read/parse cost, there is also significant php-apcu overhead with recursively copying large array structures into the current process because php-apcu does not use shared memory for access (it only uses it as a copy source, copy on fetch, not copy-on-write). This is because of the garbage collector. PHP runtime can't let apcu_fetch callers hold a reference to the shared memory since the GC may delete or overwrite things there at any time. Opcache on the other hand does use shared memory directly, this is why it's so fast, but also why it doesn't have a GC, keeps around every file-mtime pair indefinitely until it runs out of space and then performs a risky reset/restart on a live server which corrupts stuff. T187154

HHVM was better in this regard in that it used direct shared memory for apcu. It did this by choosing not to have a garbage collector, and it didn't even have a risky self-reset logic, it just kept growing until it ran out of memory (OOM), and so we managed it by having regular semi-automated restarts. Kind of like we have for opcache now. It works as long as your application only uses APCU to store stuff that you know will all fit into memory, instead of using it like an LRU best effort cache, which is how we use php-apcu. We put more in it that can fit, like memcached, and rely on apcu's GC to manage space by removing old/unused stuff.

There is an idea floating around between Perf and SRE-ServiceOps to expose the local memcached we have on each appserver (yes, we have that now! it's used as light 10-second localhost cache for big ParserCache entries before hitting the parsercache-sql backend and main memc cluster). If we do that, we could potentially re-wire things such that anyhthing that can grow arbitrarily uses local-memc and for things that have reasonable upper bounds use apcu.

ExtensionRegistry supports an mtime mode as well ($wgExtensionInfoMTime), e.g. by point to the same file as the one that scap already touches. (Docs)

Status quo, with incremental change idea:

  • current wikiversion DefaultSettings (now PHP, later cached YAML?)
  • + current wiki Config class (GlobalVarConfig/LocalSettings, YAML settings loader?, etc)
  • = current wiki Config service
  • + wmf-config (now PHP, later cached YAML?)
  • = initialise wgConf
  • --> extract and apply from wgConf to Config service (currently implicilty, by proxy of globals to GlobalVarConfig, maybe explicit via a services-getMainConfig-set method of sorts?)

Private doc:
https://docs.google.com/document/d/1XjypA5wxwWJq46_2cFjuJxNLRxF1P3kL01KFypLGaBE/edit