Page MenuHomePhabricator

wgEventStreams (EventStreamConfig) should support per wiki overrides
Closed, ResolvedPublic

Description

When first designing EventStreamConfig and its MW config var wgEventStreams, we decided to make wgEventStreams a numerically indexed rather than associative array, event though the EventStreamConfig API will return string keyed objects.

The main motivation for doing this was to have deterministic results when querying for stream configs that use 'regex stream names'. This feature allows for declaring a pattern of stream names that all have the same configs. This was needed to support the existent job queue use case, where all the mediawiki.job.* stream names use the same schema. We wanted to be able to use the numerically indexed array to guarantee that if a stream name that matches a query is encountered early in the array, it will take precedence over a later one. In this way, perhaps some mediawiki.job.specific_stream could be declared earlier in the array, and if queried for, would be used rather than a dynamically generated stream name that matches the mediawiki.job.* pattern.

This decision to use numerically a numerically indexed array now has two pretty strong downsides.

  1. We cannot discover streams that use a regex stream name. We have no way of using stream config to determine what streams exist that match a pattern, and thus cannot use that information to ingest them into Hadoop.
  1. We cannot use the StaticSiteConfig array merging feature to set per wiki overridden settings, as numerical arrays are not recursively merged, and comparison for any merging is done by key...and in this case a key of e.g. 0 in the 'testwiki' override settings will not match with key 0 in the 'default' settings.

This task is about solving 2.

While we are at it, we should also try to make it easy to declare streams in beta (in InitialiseSettings-labs.php) without messing with the defaults declared in InitaliseSettings.php too.

See also: https://docs.google.com/document/d/1dpCo33RpZAbQG15nM_GcZ_zqA3sj0S4h_0CQ4Fsahkg/edit?ts=5d2d44c2#

Event Timeline

Solutions?

A. Restructure wgEventStreams to be keyed by stream name. I think doing this would not be so difficult, but we'd lose the deterministic query results for pattern matched streams. Perhaps we can just search the entire stream config for exact matches first, before searching potential regex pattern stream names.

B. Move stream config to a different file or repository, and dynamically generate wgEventStreams with our own code, instead of a statically declared array in wmf-config/InitializeSettings.php

Other ideas?

Could we key the config by stream name, and have an extra key "regexp_streams" (better name to be found) that contains an integer-indexed list of all regex stream configs?
No idea if that fulfills the requirements for stream config discovery, but I imagined that whoever is trying to discover a stream config, could first try by key, and then try accessing the "regexp_streams" section and traverse that?

No idea if that fulfills the requirements for stream config discovery

I dunno, maybe we should just get rid of the regex stream name thing altogether. There are a lot of jobs, but we could declare them all.

17:12:19 [@kafka-main1001:/home/otto] $ kafka topics --list | grep -E '^eqiad.mediawiki.job' | wc -l
111

Oof, 111 is a lot.

If we don't do this then we can't change the setting for different wikis.

Ottomata triaged this task as Medium priority.Jun 11 2021, 6:19 PM

@Ottomata I think it would be interesting to do this in the next quarter or two, because of mediawiki_client_session_tick.
If we are able to change the sampling rate per wiki, then we can drastically reduce the volume of incoming events.

Solutions?

A. Restructure wgEventStreams to be keyed by stream name. I think doing this would not be so difficult, but we'd lose the deterministic query results for pattern matched streams. Perhaps we can just search the entire stream config for exact matches first, before searching potential regex pattern stream names.

Strong +1 from me to keying by stream name. Actually, It's not obvious to me that it should be the API's responsibility to decide which config the client wants in the event that there are multiple pattern matches. But assuming we do want to preserve the current behavior of preferring stream-specific configs to wildcard matches, IMO that should just be implemented directly, e.g. in the manner you describe, rather than relying on its ordinal position within the list of configs.

I agree that it's tempting to get rid of the regex matching stuff altogether, but I'm guessing that jobs are added and removed frequently enough that keeping the list used for generating configs in sync would quickly become burdensome, unless there's some canonical source of truth about active job names that we could easily get at from a PHP script.

Change 717580 had a related patch set uploaded (by Mholloway; author: Michael Holloway):

[mediawiki/extensions/EventStreamConfig@master] Convert wgEventStreams test fixtures to be associative arrays

https://gerrit.wikimedia.org/r/717580

Change 717581 had a related patch set uploaded (by Mholloway; author: Michael Holloway):

[mediawiki/extensions/EventStreamConfig@master] Expect wgEventStreams to be an associative array

https://gerrit.wikimedia.org/r/717581

Here's a pleasant PHP surprise: associative (i.e., string-keyed) arrays in PHP are ordered, so we don't have to choose between keying by stream name/pattern and having stream configs defined in a preferred ordering. In fact, as this patch demonstrates, it would be totally safe to update $wgEventStreams today to be keyed by stream name/pattern with no adverse effect on its functioning.

I still don't like the reliance on ordering, myself, but this at least makes it an easy decision to switch to an associative array.

Change 717589 had a related patch set uploaded (by Mholloway; author: Michael Holloway):

[operations/mediawiki-config@master] Convert $wgEventStreams to be an associative array

https://gerrit.wikimedia.org/r/717589

To expand on my dislike of the reliance on ordering: Before I realized that PHP associative arrays were ordered and would support the behavior currently relied upon, I started out by splitting out the pattern-based stream configs out into their own dedicated config setting, and I still think that would be a good idea for performance in its own right. The implementation could first see if a key matching the requested name exists in the (now-string-keyed) $wgEventStreams, and if not found, iterate over the elements of $wgEventStreamsByPattern to see if any match by regex pattern. Then searches for pattern matches would only have to iterate over a small number of elements (currently only 2: /^swift\.(.+\.)?upload-complete$/ and /^mediawiki\\.job\\..+/) rather than a substantial chunk of the large and growing full list in $wgEventStreams.

mholloway-shell@deployment-mwmaint01:~$ mwscript shell.php --wiki testwiki
Psy Shell v0.10.5 (PHP 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63 — cli) by Justin Hileman
>>> count( $wgEventStreams )
=> 113

Change 717580 merged by Ottomata:

[mediawiki/extensions/EventStreamConfig@master] Convert wgEventStreams test fixtures to be associative arrays

https://gerrit.wikimedia.org/r/717580

Change 717581 merged by jenkins-bot:

[mediawiki/extensions/EventStreamConfig@master] Expect wgEventStreams to be an associative array

https://gerrit.wikimedia.org/r/717581

DAbad moved this task from Sign-off to Done on the Metrics-Platform-Planning board.

Looks like this is deployed and can now be closed. Closing

Not quite! We've deployed the code, but we now need to restructure wgEventStreams config in mediawiki-config, and then test and make sure we can actually support the per-wiki overrides now.

Change 717589 merged by Ottomata:

[operations/mediawiki-config@master] Convert $wgEventStreams to be an associative array

https://gerrit.wikimedia.org/r/717589

Mentioned in SAL (#wikimedia-operations) [2021-10-18T18:45:31Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Convert $wgEventStreams to be an associative array - T277193 (duration: 00m 57s)

Change 731797 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] Test wgEventStream config merging in beta

https://gerrit.wikimedia.org/r/731797

Change 731797 merged by Ottomata:

[operations/mediawiki-config@master] Test wgEventStream config merging in beta

https://gerrit.wikimedia.org/r/731797

Change 731804 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] DO NOT MERGE - Test wgEventStreams overrides for testwiki

https://gerrit.wikimedia.org/r/731804

Change 731804 merged by Ottomata:

[operations/mediawiki-config@master] Comment changes for wgEventStreams about config merging

https://gerrit.wikimedia.org/r/731804

Change 731833 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] wgEventStreams - remove redundant stream setting

https://gerrit.wikimedia.org/r/731833

The merging in beta only sort of works. 'default' is not merged, so you can only override settings in beta for specific wikis, or groups (tags) of wikis. This is sufficient for our use cases, where we might want to test declaring streams and settings in beta before deploying them to prod, but we just have to specific which wikis they are overridden for. I guess setting '+metawiki' is the way to go? Actually, that will not work in all cases, since eventgate (in beta and elsewhere) always uses metawiki to get stream configs, but EventLogging JS uses the current wiki loaded in the browser.

Related: T132274: Allow to add items to an array of array in InitialiseSettings-labs.php

But, in prod it works. per-wiki overrides are now supported. I have another config patch to merge (tomorrow) and i'll update docs accordingly before closing this.

Change 731833 merged by Ottomata:

[operations/mediawiki-config@master] wgEventStreams - remove redundant stream setting

https://gerrit.wikimedia.org/r/731833

Change 731999 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] Re-add stream setting in wgEventRelayerConfig

https://gerrit.wikimedia.org/r/731999

Change 731999 merged by Ottomata:

[operations/mediawiki-config@master] Re-add stream setting in wgEventRelayerConfig

https://gerrit.wikimedia.org/r/731999

Mentioned in SAL (#wikimedia-operations) [2021-10-19T15:40:04Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - remove now redundant stream setting - T277193 (duration: 01m 04s)