Schema validation for EventStreamConfig
Open, Medium, Public

Description

Background

The incident report "2028-09-17 MinT for Readers AA test missing subject IDs" includes a recommendation that would have prevented the cause of the issue (a subtly misconfigured stream):

To safeguard against future incidents like this, we highly recommend for wgEventStreams to have a schema and for there to be a test that validates the stream configuration against that schema, preventing patches from being merged if the resulting stream configuration does not pass schema validation.

Another example of a subtle stream misconfiguration is:

The schema_title does not have a '/' at the beginning. schema_title must match exactly the title of the JSONSchema.

in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1187413/4/wmf-config/ext-EventStreamConfig.php#2558

In that case, since we know that schema titles in the primary and secondary repositories do not have a '/' at the beginning, we could use a regular expression to catch certain mistakes (e.g. the title starts with '/' or includes a version number).
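A minimal sketch of such a check, in Python. The function name, the patterns, and the sample titles are hypothetical illustrations; neither EventStreamConfig nor mediawiki-config is known to ship this code, and the version pattern assumes versions look like `1.0.0` at the end of the value.

```python
import re

# Hypothetical patterns, not taken from EventStreamConfig or mediawiki-config.
LEADING_SLASH = re.compile(r"^/")
# Assumption: a trailing "/<major>.<minor>.<patch>" means a versioned schema
# path was pasted where a schema title was expected.
TRAILING_VERSION = re.compile(r"/\d+\.\d+\.\d+$")


def schema_title_problems(schema_title: str) -> list[str]:
    """Return a list of human-readable problems found in a schema_title."""
    problems = []
    if LEADING_SLASH.search(schema_title):
        problems.append("starts with '/'")
    if TRAILING_VERSION.search(schema_title):
        problems.append("includes a version number")
    return problems
```

An empty list means the title passed both checks. A regular expression like this only catches the listed mistakes; it does not confirm that the title matches an existing schema.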

Additional notes

Acceptance criteria

Event Timeline

Is there a location in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master where the schema can live?

Since the EventStreamConfig extension owns the $wgEventStreams config variable, perhaps the schema could live there? Hm, but...

There are only a few hardcoded EventStreamConfig settings, but most are pretty dynamic and only relevant to certain clients.

I'd be reluctant to add e.g. ExP settings to a schema in the generic EventStreamConfig extension. Hm.

Hm. I suppose...if the schema did indeed live in mediawiki-config and was validated by logic in mediawiki-config, then it would be fine to add client-specific settings to the schema.

Okay! Sounds good! :)

I suppose EventStreamConfig could have the logic for validating a provided schema, and the schema could live in mediawiki-config? EventStreamConfig could have a bare bones default schema that allowed additionalProperties, and validated its few hardcoded fields? And then additional schemas with which to validate could be provided? Hm!
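The layering described here can be sketched roughly as follows, in Python. This is only an illustration: the checker handles a tiny hand-rolled subset of JSON Schema (`required`, a per-property type, `additionalProperties`), and `BASE_SCHEMA`, `EXTRA_SCHEMA`, `check`, and `validate_stream` are hypothetical names. Which fields the real base schema would require is also an assumption.

```python
# Hypothetical base schema: only a couple of generic fields are checked, and
# additionalProperties stays open so client-specific settings pass through.
BASE_SCHEMA = {
    "required": ["schema_title"],
    "properties": {
        "schema_title": {"type": str},
        "canary_events_enabled": {"type": bool},
    },
    "additionalProperties": True,
}

# Hypothetical extra schema that a config repository could supply.
EXTRA_SCHEMA = {
    "required": [],
    "properties": {"topic_prefixes": {"type": list}},
    "additionalProperties": True,
}


def check(settings: dict, schema: dict) -> list[str]:
    """Check one stream's settings against a tiny subset of JSON Schema."""
    errors = []
    for name in schema.get("required", []):
        if name not in settings:
            errors.append(f"missing required field: {name}")
    properties = schema.get("properties", {})
    for name, value in settings.items():
        if name in properties:
            expected = properties[name]["type"]
            if not isinstance(value, expected):
                errors.append(f"{name}: expected {expected.__name__}")
        elif not schema.get("additionalProperties", True):
            errors.append(f"unexpected field: {name}")
    return errors


def validate_stream(settings: dict, extra_schemas=()) -> list[str]:
    """Apply the base schema first, then every additionally provided schema."""
    errors = check(settings, BASE_SCHEMA)
    for schema in extra_schemas:
        errors.extend(check(settings, schema))
    return errors
```

With only the base schema, unknown settings are accepted; passing `[EXTRA_SCHEMA]` adds stricter checks without touching the base. A real implementation would presumably use a full JSON Schema validator rather than this subset.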

This sounds great, I think that's the right approach.

Ottomata triaged this task as Medium priority. Oct 1 2025, 4:20 PM

I drafted a schema: https://gitlab.wikimedia.org/bearloga/eventstreamconfig-schema-prototype/-/blob/main/eventstreamconfig/1.0.0.yml?ref_type=heads

and have tested it with a copy of the config returned by https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json

Feel free to download the repo and try it out yourself. If you want to check the live stream config (rather than the api-result.json copy in the repo), you need to comment out pytest.skip in test_remote_api_result in test_schema.py

However!

Learning 1: null is not a valid value for topic_prefixes and EventStreamConfig should be changed to render empty topic_prefixes as [] in the API result, e.g.

"eventlogging_PrefUpdate": {
  "topic_prefixes": [],
  "canary_events_enabled": true,
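The proposed rendering change could look roughly like this. It is a Python sketch of the behavior only; the function name is hypothetical and this is not the extension's actual code. It assumes the API result is a mapping from stream name to settings.

```python
def normalize_topic_prefixes(stream_configs: dict) -> dict:
    """Return a copy in which a null topic_prefixes is rendered as []."""
    normalized = {}
    for stream_name, settings in stream_configs.items():
        fixed = dict(settings)  # shallow copy so the input is left untouched
        # Only rewrite an explicit null; a missing key stays missing.
        if "topic_prefixes" in fixed and fixed["topic_prefixes"] is None:
            fixed["topic_prefixes"] = []
        normalized[stream_name] = fixed
    return normalized
```

Normalizing at render time would let a schema declare topic_prefixes as an array without also having to allow null.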

Learning 2: mediawiki.product_metrics.checkuser_ip_auto_reveal_interaction stream is currently subtly misconfigured

Patch to fix it: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1194733

COOL!

topic_prefixes

I wonder if we could just make this a 'private' config and never expose it in the API response. It is used to auto-generate topics if topics is not explicitly set.

How would EventGate or Kafka or whatever obtain it (in cases where topics is not explicitly set) if it's omitted from the API response? Do they not query the API?

topics is automatically set in the API response if topic_prefixes is set (and topics is not explicitly set).
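
A rough Python sketch of that fallback. The function name is hypothetical, and the rule for building a topic (prefix followed directly by the stream name) is an assumption made for illustration, not something confirmed in this task.

```python
def effective_topics(stream_name: str, settings: dict) -> list[str]:
    """Return the explicit topics if set, else one topic per prefix."""
    explicit = settings.get("topics")
    if explicit:
        return list(explicit)
    prefixes = settings.get("topic_prefixes") or []
    # Assumption: each topic is the prefix followed directly by the stream name.
    return [prefix + stream_name for prefix in prefixes]
```

Under this reading, clients that consume the API response only need topics, which is why topic_prefixes could plausibly be kept private.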