
Modern Event Platform: Stream Configuration: Implementation
Closed, ResolvedPublic

Description

(NOTE: This task description was taken and modified from the Event Platform Product Usage design document.)

Wikimedia’s Mediawiki deployments already have a config distribution solution: mediawiki-config and ResourceLoader. Changes to mediawiki-config can be pushed at least twice daily via the usual SWAT schedule.
Stream config will be added as a ResourceLoader module via package modules. This will expose mediawiki-config settings as JSON. For Mediawiki JavaScript clients, the stream config will be loaded along with any other registered ResourceLoader modules. However, ResourceLoader can only return full module JavaScript snippets for use in Mediawiki JavaScript. For remote clients like mobile apps, we will develop a new MW API endpoint that will serve the stream config as JSON directly. Stream config will then be available via URIs like

GET /api.php?action=streamconfigs
GET /api.php?action=streamconfigs&streams=analytics.virtual_pageview|analytics.link_click

We already have an ‘EventServiceStreamConfig’ defined for the EventBus extension. This config specifies the name of the ‘Event Service’ to which the EventBus extension should send a particular event stream. The EventServiceName is used to look up a URL endpoint in (Production|Labs)Services.php. This should also be used by the EventLogging extension JavaScript to look up the URL to which it should send events. eventgate-analytics will be configured to also get the EventServiceStreamConfig from the stream_config API endpoint. This will allow it to dynamically configure the stream -> schema mapping it uses to restrict events of certain types to specific streams.
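For remote clients, fetching the config could be sketched like this. The action name and parameters are taken from the URIs above, which are not a finalized API; the base URL and helper name are assumptions for illustration:

```javascript
// Build the (hypothetical) streamconfigs API URL for a set of streams.
// An empty list requests the full stream config.
function streamConfigUrl(baseUrl, streams) {
    let url = baseUrl + '/api.php?action=streamconfigs&format=json';
    if (streams && streams.length) {
        // Multiple values are pipe-separated, per MediaWiki API convention.
        url += '&streams=' + streams.map(encodeURIComponent).join('|');
    }
    return url;
}

// A remote client (e.g. a mobile app) could then fetch and parse it:
// fetch(streamConfigUrl('https://en.wikipedia.org/w', ['analytics.link_click']))
//     .then(r => r.json())
//     .then(config => { /* configure event sending */ });
```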

By using mediawiki-config and ResourceLoader we get a number of benefits:

  • Automated cache invalidation (via last modified timestamps)
  • Config changes go through usual code review and CI process
  • No new service or config repositories to maintain
  • Ability to configure different settings for different wikis

Example config:

$wgEventStreams = [
    # virtual_page_view events
    [
         'stream' => 'analytics.virtual_pageview',
         // Must validate with a schema that has this schema title
         'schema_title' => 'mediawiki/page/virtual-view',
         // Client side should only produce this event 50% of the time
         'sample_rate' => '0.5',
         // If sample_token is not set, then just sample randomly.  Otherwise,
         // use the session_key value as the 'token' for sampling.
         // See:
         // https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/503f5922c9ff0b6fc2c5279c95f0101126ad99de/modules/ext.eventLogging/core.js#L179-L200
         // (populationSize == 1/sample_rate)
         // We need this to support 'cross-schema stitching':
         // https://phabricator.wikimedia.org/T205569
         'sample_token' => 'session_id',
         // Used to get EventServiceUrl from ProductionServices.php
         'EventServiceName' => 'eventgate-analytics-public'
    ],

    // link click experiment, should only run for 1 month exactly.
    [
         'stream' => 'analytics.link-click',
         'schema_title' => 'link-click',
         // Client producers should only send events if the current
         // dt is in this range
         'time_range' => [
            'begin' => '2019-10-01T00:00:00Z',
            'end' => '2019-10-31T11:59:59Z'
         ],
         // Used to get EventServiceUrl from ProductionServices.php
         'EventServiceName' => 'eventgate-analytics-public'
    ],

    // All mediawiki job streams use the same schema; use a pattern to match their config
    [
         'stream' => '/^mediawiki\.job\..+/',
         'schema_title' => 'mediawiki/job',
         'EventServiceName' => 'eventgate-main'
    ]
];
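A client-side sketch of how these settings might be interpreted. The helper names are hypothetical, and the token hashing below is a simple stand-in for whatever hashing EventLogging's core.js actually uses; the point is that a token makes the sampling decision deterministic per session (populationSize == 1/sample_rate), which is what enables cross-schema stitching:

```javascript
// Decide whether to produce an event for a stream, given its config entry.
function inSample(config, token) {
    const rate = parseFloat(config.sample_rate !== undefined ? config.sample_rate : 1);
    if (rate >= 1) { return true; }
    if (rate <= 0) { return false; }
    if (config.sample_token && token) {
        // Deterministic: the same token always gets the same decision.
        let hash = 0;
        for (let i = 0; i < token.length; i++) {
            hash = (hash * 31 + token.charCodeAt(i)) >>> 0;
        }
        const populationSize = Math.floor(1 / rate);
        return hash % populationSize === 0;
    }
    // No token-based sampling configured: sample randomly.
    return Math.random() < rate;
}

// time_range check, for time-boxed streams like analytics.link-click above.
function inTimeRange(config, now) {
    if (!config.time_range) { return true; }
    const t = (now || new Date()).toISOString();
    return t >= config.time_range.begin && t <= config.time_range.end;
}
```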

(NOTE: a new REST route handler is in the works...should we somehow use it for this?)

Event Timeline


Current idea:

Create a new 'ConfigExporter' MW extension that can be configured to expose whitelisted MediaWiki Configs in several ways:

  • as a ResourceLoader package module config file
  • as a ResourceLoader module that has JS that sets mw.config, like this example.
  • as a MW API endpoint to serve a config, e.g. GET /api.php?action=getconfig&configkeys=streamconfig[analytics.virtual_pageview]

I'd like to do this as a generic ConfigExporter extension, rather than something specific to a stream config so we can be flexible in the future with what we expose for configuration. I think a generic ConfigExporter extension could be useful elsewhere, and would be mostly the same as building it for a single config key. We just need to make sure that only certain whitelisted configs are exposed with a wgConfigExportsWhitelist configuration for the ConfigExporter extension.

I'm pretty new in working with both ResourceLoader and creating MW Extensions, so I'd appreciate advice on this one! Am trying to make this work in mw-vagrant now...ResourceLoader docs are good but I haven't gotten too far yet. :)

Making this generic will be cool. What if we wanted to serve a configuration file that was not MediaWiki config, but some other file? Supposing in the future that MW config stops being a JSON blob, or we want to start doing CR on a separate cadence or something for the stream config? Or, e.g., will there be stream config for the platform event side of the house, that they want to keep separate from analytics? Will they all live in the MW config?

  • as a MW API endpoint to serve a config, e.g. GET /api.php?action=streamconfig&streams=analytics.virtual_pageview,analytics.link_click

Should the client always have to know what streams it is interested in? While this will prevent clients grabbing absolutely every stream configuration, it will add another deployment step, and also mean that deploying a new stream will always require alteration of the application code. There are situations where that would be undesirable. I'm not sure about the best way to work on it, just worth keeping in mind as we feel out how to do this.

Should the client always have to know what streams it is interested in?

No, I don't think so; in the description, the other GET example has it grabbing the full config.

What if we wanted to serve a configuration file that was not MediaWiki config,

I think with a generic extension like this, anything is possible. ResourceLoader provides a way to generate the config file from a PHP callback, so the code could potentially grab the config from anywhere.

I've got a few things working, but I'm not so sure I can use ResourceLoader and vary the config keys that are requested. I mean, I can if I hit the load.php URL directly and pass in my own query parameters, but I as far as I can tell ResourceLoader doesn't have the ability for the client/page to pass parameters to the ResourceLoader module request.

To get ResourceLoader to load my module code on a page, I had to register it somewhere (I did it as a module dependency of the ext.eventLogging module). I can't find a way to make a page dynamically vary the modules it requests on load. Perhaps there is a way to make the client side do an additional ResourceLoader request after page load, but I'm not sure that is allowed for performance reasons.

@Krinkle would love some input here. Basically I'm investigating if a particular event(logging) instrumentation on a page could specify which configs it wants via ResourceLoader.

If this isn't possible, perhaps we don't need the ability to do this? I expect the full block of EventStreamConfig to be not small. https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1 shows 330 active schemas, and each of these will likely need some configuration. Assuming around a 200-char average config size per stream, that's on the order of 50-100K of config, which...is not nothing, right? If we don't have the ability to select which configs a page wants, we'd have to send the whole block on most page loads.

[…] https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1 shows 330 active schemas, and each of these will likely need some configuration. Assuming around a 200-char average config size per stream, that's on the order of 50-100K of config, which...is not nothing, right?

There are currently 71 active schemas on enwiki (counted from RL's schemaRevision object client-side); this is based on an extension setting EventLoggingSchemas in its extension.json file and the extension in question being enabled on a wiki. This is currently about 2K of data, bundled with the ext.eventLogging module. This seems reasonable, but I would indeed recommend it not grow much beyond this. For how things work currently, that means we effectively are on a budget with regard to how many active schemas there can be. And I wouldn't be surprised if some of these have been forgotten about and can be turned off (e.g. the code is no longer using it, or the relevant team has concluded the analysis the schema was for). That could be a way to balance out the addition of new schemas.

As for configuration data like sample rates or end-dates, these are not currently part of EventLogging. Rather, this is currently the responsibility of the instrumenting code that calls out to EventLogging. Typically as inline JavaScript code, or sometimes with ad-hoc mw.config variables. I would recommend against centralising this in the delivery pipeline as indeed this would be expensive. Especially since most of it is not relevant at the time of loading a page. I think this kind of data should remain bundled with the feature and the instrumentation it is for. For example, the VisualEditor payload contains the code that says "emit event X" and should also bundle with that code the relevant schema configuration.

Of course, the configuration data can be centralised before delivery. It just means that these various bundles fetch it from a central place, which I believe is the intent.

I imagine it like this (made up example):

Today:

# extension.json
EventLoggingSchemas:
   VESaveEdit: 1234 (rev ID)

ResourceModules:
   ext.visualEditor.ui: {
      "packageFiles": [
         "index.js",
         …,
         "data.json": { VESaveEditSampleRate: conf->get( "VESaveEditSampleRate" ) }
      ]
   }

# saveEdit.js
func doStuff() {
   if EventLogging.inSample(data.VESaveEditSampleRate)
      EventLogging.log( VESaveEdit, … )
}

# EventLogging.js
func log(schema, event) {
  send(event + EventLoggingSchemas[schema])
}

Future?:

# extension.json
ResourceModules:
   ext.visualEditor.ui: {
      "packageFiles": [
         "index.js",
         …,
         "data.json": { VESaveEdit: EventLogging::getConfFor( "VESaveEdit" ) }
      ]
   }

# saveEdit.js
func doStuff() {
   EventLogging.log( data.VESaveEdit, … )
}

# EventLogging.js
func log(conf, event) {
  if … something… conf.sampleRate:
     send(conf.schema + event)
}
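The "Future?" sketch above, translated into runnable JavaScript for concreteness (all names are hypothetical, and the sampling check is reduced to its simplest form):

```javascript
// EventLogging.js — the generic library interprets the config it is handed.
const EventLogging = {
    sent: [],
    inSample(rate) {
        return rate >= 1 || Math.random() < rate;
    },
    log(conf, event) {
        if (this.inSample(conf.sampleRate)) {
            this.sent.push({ schema: conf.schema, event });
        }
    }
};

// saveEdit.js — instrumentation code just forwards its bundled config,
// which ResourceLoader delivered via the extension's data.json packageFile.
const data = { VESaveEdit: { schema: 'VESaveEdit', sampleRate: 1.0 } };
function doStuff() {
    EventLogging.log(data.VESaveEdit, { action: 'saveSuccess' });
}
```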

To get ResourceLoader to load my module code on a page, I had to register it somewhere (I did it as a module dependency of the ext.eventLogging module). I can't find a way to make a page dynamically vary the modules it requests on load. Perhaps there is a way to make the client side do an additional ResourceLoader request after page load, but I'm not sure that is allowed for performance reasons.

Yeah, balancing payload size can be counter-intuitive at times. Especially when the individual chunks of data are small. There are definitely ways this can work, but I won't bother detailing that here as the bandwidth spent on registering all available chunks, and then specifying which to load, and the bundling-overhead of actually loading those pieces would outweigh any gain of loading less data. (See also T187207)

The above is just a rough sketch based on what I know today, but there's definitely lots of other (good) ways this could go, so let me know if you'd prefer a different way, so I can help think in that direction instead.

There are currently 71 active schemas on enwiki (counted from RL's schemaRevision object client-side),

Hm! Indeed, the grafana dash was wrong; I just changed it to count the number of eventlogging_* topics in kafka that have had any data in the last hour. It is now showing 123. Not 71, but at least closer.

This is currently about 2K of data, bundled with the ext.eventLogging module.

Am a little confused...we don't send schemas to client side anymore since T187207...right? What is this 2K of data, just the EL code? Why would more active schemas cause the EL code to grow?

That could be a way to balance out addition of new schemas.

FYI: In Event Platform, new 'schemas' will be irrelevant. It is new streams that matter. A schema is like a datatype for a datastream. Anyway, the config we are talking about shipping does things like specify what schema a particular stream name has, but more relevantly for the client side, it will allow dynamic configuration of client side instrumentation without code deployments. The intention is to use the stream config to allow for quick changes in how client side code behaves. There are ideas about potentially using it to turn on and off certain experiments; e.g. perhaps Growth wants to sample 100% of edit events for newly registered users for 1 week. The configuration in this case might look something like

StreamConfig:
  edit.all:
    sample_rate: 0.01
    schema_title: analytics/instrumentation/edit
  edit.new_users:
    sample_rate: 1.0
    schema_title: analytics/instrumentation/edit
    condition: 
      user_account_age: 1 week

The EventLogging codebase wouldn't be smart enough to know what to do with the condition, but supposedly the instrumentation code would. It would loop through configs for all edit.* streams, check if the condition is true, and call mw.log('edit.new_users', edit_event);. EventLogging would examine the edit.new_users config for common (non instrumentation) stream configs like sample_rate and then emit the event.

(Note this use case isn't 100% vetted, but it gives an idea of the types of things folks want to do.)
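The loop described above might look roughly like this. Per the note, the use case isn't vetted: the condition key and its day-based format are made up for illustration, and the log callback stands in for mw.log:

```javascript
// Hypothetical instrumentation sketch: loop through the configs for all
// edit.* streams, evaluate any 'condition' the instrumentation understands,
// and log to every matching stream. EventLogging itself would still apply
// generic settings like sample_rate afterwards.
function logEditEvent(streamConfig, userAccountAgeDays, editEvent, log) {
    Object.keys(streamConfig)
        .filter(name => name.startsWith('edit.'))
        .forEach(name => {
            const cond = streamConfig[name].condition;
            // The only condition understood here: account age in days.
            if (cond && cond.user_account_age_days !== undefined &&
                    userAccountAgeDays > cond.user_account_age_days) {
                return; // condition not met, skip this stream
            }
            log(name, editEvent);
        });
}

// Config mirroring the StreamConfig sketch above.
const streamConfig = {
    'edit.all': { sample_rate: 0.01, schema_title: 'analytics/instrumentation/edit' },
    'edit.new_users': {
        sample_rate: 1.0,
        schema_title: 'analytics/instrumentation/edit',
        condition: { user_account_age_days: 7 }
    }
};
```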

As for configuration data like sample rates or end-dates, these are not currently part of EventLogging.

Using the generic Event(Logging) client library to understand things like sample rates and end dates are what we are trying to do! Fancier things (like this straw man condition config) belong in instrumentation, but everyone wants the ability to do quick experimentation for instrumentation that already exists.

Of course, the configuration data can be centralised before delivery. It just means that these various bundles fetch it from a central place, which I believe is the intent.

Hmmm! I think I understand this idea. So, the configuration settings can still be centralized in e.g. mediawiki-config, but which configs are loaded would be determined by individual extension/instrumentation usages, as specified in extension.json ResourceModules?

"data.json": { schemaConf: EventLogging::getConfFor( "VESaveEdit" ) }

How does this work? Would this result in a secondary HTTP request to get the config for VESaveEdit?

@Ottomata I think loading config for all streams at once could never happen on 1st pageload (which is when we want to load the eventlogging code). It is likely to be bloated, and if we had a "perf budget" per feature, we would make users of "lean feature x" pay for weight that is not their responsibility. I agree with @Krinkle in that every feature should be responsible for loading its own configs. For instrumented features such as "banners" that appear in all pages, we would want to load no config at all if possible.

And just a note that Eventlogging code does not need to read the config to act on generic sampling settings. You can do that by passing them along on the log call; in your example:

mw.log('edit.new_users', edit_event, edit_sampling_ratio)

or

mw.log('edit.new_users', edit_event, edit_config_obj)

(Tagging to remember next week)

@Ottomata I'll elaborate next week, but the 2K is not the EL client, it actually is per-schema data (revision ID).

OK I think I got something working. How's this?

ConfigExports extension will be configured using two keys: ConfigExportsKeysWhitelist which is an explicit whitelist of config keys that are allowed to be exposed, and an extension attribute ConfigExportsKeys, a list of config keys that an extension wants exported into mw.config. Then, an [[ https://www.mediawiki.org/wiki/Manual:Hooks/MakeGlobalVariablesScript | onMakeGlobalVariablesScript ]] hook will do:

public static function onMakeGlobalVariablesScript( array &$vars, \OutputPage $out ) {
    $config = $out->getConfig();
    $extRegistry = ExtensionRegistry::getInstance();

    // Get the list of desired keys to export by global config and by
    // all extensions that have the attribute ConfigExportsKeys set.
    $desiredKeys = array_merge(
        $config->get( 'ConfigExportsKeys' ),
        $extRegistry->getAttribute( 'ConfigExportsKeys' )
    );
    // We only export desired keys that are allowed by the ConfigExportsKeysWhitelist.
    // TODO: warn or error if a desired key is not allowed.
    $keysToExport = array_intersect( $desiredKeys, $config->get( 'ConfigExportsKeysWhitelist' ) );

    foreach ( $keysToExport as $key ) {
        $vars[$key] = $config->get( $key );
    }
}

In this way, extensions can explicitly declare what configs they need in their extension.json files, but the configs themselves are still set using regular Mediawiki config. An extension.json might do:

"attributes": {
    "ConfigExports": {
        "Keys": ["StreamConfig", "OtherConfig"]
    }
}

I think we could even build in smarter config selection using subkeys in a config like "Keys": ["StreamConfig.navigation-timing"].

@Krinkle, q for you:

Is it better to use MakeGlobalVariablesScript or ResourceLoaderGetConfigVars? MakeGlobalVariablesScript is easier because I have an OutputPage with which I can do $out->getConfig(), but I believe I could accomplish the same thing by hacking around with [[ https://www.php.net/manual/en/reserved.variables.globals.php | $GLOBALS['wg' . $keyToExport] ]].

@Ottomata I'll elaborate next week, but the 2K is not the EL client, it actually is per-schema data (revision ID).

Curious to understand this better too -- I thought that the client payload is no longer connected to schema since client-side validation was cut out.

@Ottomata I think loading at once config for all streams could never happen on 1st pageload (which is when we want to log eventlogging code).

For me at least, this is just a base case to understand/measure how fancy we'll need to be.

I agree with @Krinkle in that every feature should be responsible of loading their own configs.

This approach is better arranged for benign neglect and/or lack of information sharing between agents deploying software, which are all good properties. But at the same time, part of this effort from the product perspective is to get more disciplined about the instrumentation lifecycle. In that sense, a centralized source of configuration along with a well-understood notion of "budget" that can be allocated and policed through CI and code review, would help stimulate upkeep/grooming and could push teams to actually retire instrumentation once it's no longer needed, rather than letting it run endlessly.

Like @Krinkle has alluded, these two approaches don't have to be mutually exclusive. Dumping the entire stream config early in every page load is (probably) asking too much. But in the prototype clients we have (same as with the EL extension, pretty much), stream configuration data is not required to begin recording events. It can be delayed for an arbitrary amount of time; events (with a timestamp and all their data) will just sit in the input buffer until the proper configuration can be obtained. Once it's obtained, the events will be processed and placed in the output buffer. So -- with the exception perhaps of some high-priority stream configuration (e.g. client-side errors) that we would want to hardcode into the payload so those events could be processed ASAP -- we can be lazy about loading the stream config.
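The buffering behavior described above could be sketched like this (names are hypothetical; the prototype clients' actual structure may differ):

```javascript
// Events are timestamped and held in an input buffer; once the stream
// config arrives, buffered events are processed into the output buffer.
class EventBuffer {
    constructor() {
        this.streamConfig = null;
        this.inputBuffer = [];
        this.outputBuffer = [];
    }
    event(stream, data) {
        const e = { stream, data, dt: new Date().toISOString() };
        if (this.streamConfig === null) {
            this.inputBuffer.push(e); // config not here yet: hold the event
        } else {
            this.process(e);
        }
    }
    onConfigLoaded(config) {
        this.streamConfig = config;
        // Drain the input buffer now that events can be processed.
        this.inputBuffer.splice(0).forEach(e => this.process(e));
    }
    process(e) {
        // Real code would apply sample_rate etc. from this.streamConfig[e.stream].
        if (this.streamConfig[e.stream]) {
            this.outputBuffer.push(e);
        }
    }
}
```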

As @Nuria points out, you don't strictly need to have stream configuration be fetched from a remote for everything -- but in those cases I think the convention of having the stream configuration somehow be "hardcoded" locally (in the same format as the remote stream configs) is better, since the API can stay consistent and so that developers won't be tempted to "shortcut" around using the stream config.

I'm curious what precisely we mean by "each feature loading its own config." Are we talking about secondary HTTP requests, like @Ottomata mentioned? Would those be compiled somehow and fetched all at once? If requests are so cheap that we don't need to batch them that way (they can't be that cheap, right?) then we could just lazy-load the config for every stream when an event for that stream gets logged:

event("mystream", { key: "foo", val: "bar" });

function event(stream_name, data)
{
    if (!(stream_name in stream_config_cache)) {
        get_stream_config(stream_name);
    }
    /* Process event */
}

I need to spend more time getting to know ResourceLoader, which I'm doing this week, so I will be able to contribute more soon.

It can be delayed for an arbitrary amount of time; events (with a timestamp and all their data) will just sit in the input buffer until the proper configuration can be obtained.

We should probably not ask mobile clients to download a bunch of kilobytes that they probably will not use, even lazily. Now, @Krinkle can ultimately decide if that is acceptable; from a perf perspective it certainly seems inadvisable. Also, from a pure development point of view, my experience at WMF is that instrumentation and its config gets left without owners about once a year or so (to be fair, this would happen anywhere, really), and thus the chances of a file with all configs "bloating" are pretty large. We just removed a bunch of code and extra loading of files from EL so it could be included in all pages; to have a dependency on an external config file to be able to "work" (send events) seems like going backwards. I think that (ideally) eventlogging should be able to work w/o config, and features that use the config (say VE) can load that when the javascript for that feature is loaded.

I'm curious what precisely we mean by "each feature loading its own config."

A feature that needs to load its own javascript files to work (like the VE example above) can load the config as part of that request; there must be some mediawiki terminology for this.

We should probably not ask mobile clients to download a bunch of kilobytes that they probably will not use,

The approach we are going with now is that each extension will be able to explicitly ask for the configs it needs. ResourceLoader will bundle up all the configs requested by extensions and make them available client side in mw.config.

I think that (ideally) eventlogging should be able to work w/o config

Seems fine to me, as long as instrumenters know that if they do this (by not specifying that their extension needs some particular stream config), they won't be able to dynamically modify anything, and the situation will be as it is today: changing anything about the instrumentation will require a code deployment.

Ok, https://github.com/ottomata/mediawiki-extensions-ConfigExports now supports filtering on config (top level) subkeys with glob patterns e.g.

GET /w/api.php?action=config_exports&format=json&configs=StreamConfig.page_*
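The subkey glob filtering shown in that URL could work roughly like this (a sketch of the matching logic only; the extension is PHP, and the helper name is hypothetical):

```javascript
// Select top-level subkeys of a config whose names match a '*' glob
// pattern, e.g. filterSubkeys(streamConfig, 'page_*').
function filterSubkeys(config, glob) {
    // Escape regex metacharacters in the literal parts, then treat each
    // '*' as a wildcard matching any sequence of characters.
    const pattern = '^' + glob.split('*').map(
        part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    ).join('.*') + '$';
    const re = new RegExp(pattern);
    const result = {};
    Object.keys(config).filter(k => re.test(k)).forEach(k => {
        result[k] = config[k];
    });
    return result;
}
```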

Ok @Krinkle we talked a bit more after the meeting today. I think we understand how EventLogging is using packageFiles to ship the EventLoggingSchemas extension attribute. However, I think we don't understand how we can do the same thing for this use case. AFAICT, $extRegistry->getAttribute( 'EventLoggingSchemas' ) will get the attributes for all extensions, not just the ones that any given page might use/need.

You said that previously, EventLogging was dynamically generating ResourceLoader modules with validator functions for each registered schema, right? So, something like ext.EventLogging.Schema.NavigationTiming? Then each extension would declare its ResourceLoader module dependencies explicitly, so e.g. in NavigationTiming/extension.json: "dependencies": [..., "ext.EventLogging.Schema.NavigationTiming", ...] ?

From what I can tell, we'd have to do something similar if we wanted instrumentation code to declare which configs needed to be shipped to the client, right? But, even if we did this, would we still run into serious config caching problems? I think we can live with a config change taking an hour or less to apply, but taking days doesn't suit the use case. You also said that with old EventLogging, the list of dynamic ResourceLoader modules got so large that just the ResourceLoader start up code with the list of available modules began to get bloated.

Is there another way I'm missing here, or should we go back down this route of dynamic ResourceLoader modules?

I'll schedule a meeting where we can talk more.

Conclusions from today's meeting:

  • TODO: Make RL package file callbacks support args from extension.json (Andrew will try a patch)
  • Each extension exports its own desired stream config using the callback with stream name args provided by EventLogging.
  • EventLogging JS API will get stream config from caller directly (either shared object with stream name, or config passed in)
  • Default behavior: if no config for stream, just return empty config. EL will always send event (no sampling).
    • could use ext attributes to specify defaults in extension.json if this turns out to be needed.

Q I just thought of for myself later. Production MW Config will always have some stream config defined, but the required configs there are irrelevant to EventLogging client side (EventServiceName, possibly allowed schema_title). Should I strip these keys from config before returning to client side? Probably.

Default behavior: if no config for stream, just return empty config. EL will always send event (no sampling)

Note that sampling is still possible in this scenario, it is just driven by instrumentation code not config that can be changed w/o a deployment

Default behavior: if no config for stream, just return empty config. EL will always send event (no sampling)

Note that sampling is still possible in this scenario, it is just driven by instrumentation code not config that can be changed w/o a deployment

I'm not against this, but just to be clear, if possible, I would prefer, at least for analytics events, to standardize product teams on using stream configurations for this sort of thing rather than allowing three different options (config, no/default config, or instrumentation-defined behavior) with various precedence and so on. The reason for this is that we've experienced all manner of different sampling strategies in various pieces of instrumentation which were often arbitrary and sometimes downright strange. The issue isn't so much that platform A's engineers aren't sharp enough to do sampling like platform B's engineers, it's more that platform A's engineers never read platform B's instrumentation code, they roll things a little differently, and it's fine inside their vertical, but not when analysts want to integrate together multiple verticals.

So I see stream configuration as a way we might put a stop to this kind of harmful fragmentation. Product will always have a problem with standardizing since product teams work in different verticals. We need to be modular enough that different teams can work autonomously, but still maintain enough shared resources that they don't end up silo'ed, which is where we are now. I see stream configuration as a potential way out because it could (maybe, we'll see) serve as a hub where engineers see what the other engineers are doing, even if they never read each other's instrumentation code. But in order to get that win, we need people to actually use stream config.

So while letting instrumentation code pass these things in is fine on one level, I think that frankly, if we make it too easy for folks to keep doing what they are doing, they will end up doing that. And there are problems with the way they're doing things now. Also, we'll make it too easy on ourselves, because if things get hairy, we may think "well, at least there is this other way they can use...". But if we treat stream configuration as mandatory for all analytics events, we'll end up with a stronger and more widely adopted system in the end, even if ultimately we do still support instrumentation passing these parameters to cover edge cases.

@jlinehan I think we will need to live with a use case in which no config is required, if loading the config detracts from performance, because banners and navigation timing instrumentation need to be able to work as "performantly" as possible, as they might be deployed to every page (note these are instrumented by FR and the perf team, not product teams).

We might be able to satisfy both of you!

EventLogging JS API will get stream config from caller directly (either shared object with stream name, or config passed in)

If we do this as config passed into the mw.logEvent (or whatever) call, then the instrumentation code could pass in their own defined sampling config from code rather than getting it from centralized mw configuration. In this way, the code could still sample, but they'd have to specify the sampling in a way that EventLogging understands. The actual sampling algorithm would be within EventLogging, but the sampling parameters could come from centralized config or from instrumentation.

We might be able to satisfy both of you!

EventLogging JS API will get stream config from caller directly (either shared object with stream name, or config passed in)

If we do this as config passed into the mw.logEvent (or whatever) call, then the instrumentation code could pass in their own defined sampling config from code rather than getting it from centralized mw configuration. In this way, the code could still sample, but they'd have to specify the sampling in a way that EventLogging understands. The actual sampling algorithm would be within EventLogging, but the sampling parameters could come from centralized config or from instrumentation.

Yeah I'd love this. That way the API is consistent and people will have to be familiar with driving EL using the stream configuration options no matter how they get those options on deck. It would make life soo much easier for documenting and training. This is the way we have the API structured in our prototypes.
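The "satisfy both" idea could be sketched like this (the function name and signature are hypothetical; mw.logEvent is not yet defined): the caller may pass sampling config inline, otherwise the centrally distributed stream config is consulted, and the sampling algorithm itself always lives in EventLogging, so it is uniform across instrumentations either way:

```javascript
// Log an event, taking sampling parameters from inline config if given,
// else from central stream config, else defaulting to "always send".
function logEvent(centralStreamConfig, stream, event, inlineConfig, send) {
    const conf = inlineConfig || centralStreamConfig[stream] || {};
    const rate = conf.sample_rate !== undefined ? conf.sample_rate : 1;
    if (rate >= 1 || Math.random() < rate) {
        send(stream, event);
    }
}
```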

Change 544240 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/core@master] resourceloader: Support passing extra arguments to packageFiles callback

https://gerrit.wikimedia.org/r/544240

Change 544240 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Support passing extra arguments to packageFiles callback

https://gerrit.wikimedia.org/r/544240

Change 545400 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] [WIP] Stream Config loading

https://gerrit.wikimedia.org/r/545400

Change 545654 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventStreamConfig@master] [WIP] Add EventStreamConfig and ApiEventStreamConfig

https://gerrit.wikimedia.org/r/545654

Current status:

  • $wgEventStreams (name TBD) is a list of stream config entries, e.g.
$wgEventStreams = [
	[
		'stream' => 'button-click1',
		'schema_title' => 'mediawiki/button/click',
		'sample_rate' => 0.15,
		'EventServiceName' => 'eventgate-analytics',
	],
	[
		'stream' => 'mediawiki.virtual_page_view',
		'schema_title' => 'mediawiki/page/virtual-view',
		'sample_rate' => 0.1,
		'EventServiceName' => 'eventgate-main',
	],
	[
		'stream' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'sample_rate' => 0.8,
		'EventServiceName' => 'eventgate-main',
	],
];
  • EventStreamConfig extension has:
    • EventStreamConfig::get - looks in $wgEventStreams (name TBD) for $targetStreams names and returns an assoc array keyed by stream name:
EventStreamConfig::get($config, ["button-click1", "mediawiki.job.sendMassMessage"], true);
=> [
    'button-click1' => [
		'stream' => 'button-click1',
		'schema_title' => 'mediawiki/button/click',
		'sample_rate' => 0.15,
		'EventServiceName' => 'eventgate-analytics',
	],
    'mediawiki.job.sendMassMessage' => [
		'stream' => 'mediawiki.job.sendMassMessage',
		'schema_title' => 'mediawiki/job',
		'sample_rate' => 0.8,
		'EventServiceName' => 'eventgate-main',
	],
]

(The third boolean argument to EventStreamConfig::get tells it to include all settings (includeAllSettings), not just the ones relevant for client-side functionality. If this is false (the default), only the settings in EventStreamConfig::SETTINGS_TO_EXPORT will be returned.
I think this needs some thought as to what the proper default behavior is. Perhaps I'll move the SETTINGS filtering to the EventLogging ResourceLoader callback.)

  • ApiEventStreamConfig:
GET /api.php?action=stream_config&format=json&streams=button-click1|mediawiki.job.sendMassMessage
  => Stream config for the listed streams.
GET /api.php?action=stream_config&format=json
  => Stream config for all configured streams.

I want the ability to get the full $wgEventStreams config block, but I'm not yet sure what format it should be returned in. When requesting specific stream names, the code currently returns an object keyed by stream name. This allows for regex stream name matching in the case where we want to more dynamically configure a group of streams by name pattern. What should I key by if no specific streams are requested? Hm. Perhaps this should be a different API endpoint and function? EventStreamConfig::getConfigs and action=stream_configs (instead of action=stream_config)?

  • EventLoggingHooks::registerEventStreamConfig

This uses a ResourceLoader packageFiles callback to allow extensions to register which streams they will use. The config for those streams will be set client side in mw.eventLog.eventStreamConfig. mw.eventLog can then use this config to vary its behavior when sending events.
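As a sketch of how the client side might consult that exported config (the keyed-object shape follows the examples above, but the lookup helper itself is hypothetical):

```javascript
// Hypothetical client-side sketch: stream configs exported by the
// ResourceLoader packageFiles callback, keyed by stream name.
const eventStreamConfig = {
    'button-click1': {
        stream: 'button-click1',
        schema_title: 'mediawiki/button/click',
        sample_rate: 0.15
    }
};

// Look up a stream's config; undefined means the stream was not registered.
function getStreamConfig( streamName ) {
    return eventStreamConfig[ streamName ];
}
```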

BTW, once this settles, we should use the finalized wgEventStreams to complete T229863: Refactor EventBus mediawiki configuration

	[
		'stream' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],

@Pchelolo and @jlinehan, I'm trying to decide if we should re-use the stream setting here for both explicit stream names and for stream name regexes. Clients need to consume objects keyed by their stream names, so when e.g. a client requests a stream named 'mediawiki.job.foo' they will get:

[
	'mediawiki.job.foo' => [
		'stream' => 'mediawiki.job.foo',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],
]

Here, I'm mutating the original stream config object, changing the stream setting to the requested stream name that matched the original regex. This seems a little funky, especially for EventGate's use case where it wants the full un-altered and un-expanded stream config and settings. In that case, I just group by stream as is and return it:

[
	'/^mediawiki\.job\..+/' => [
		'stream' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],
]
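The matching step being discussed — resolving a requested stream name against entries whose stream setting may be a literal name or a `/.../` regex string — might look roughly like this (a sketch, not the extension's actual code):

```javascript
// Sketch: resolve a requested stream name against config entries whose
// 'stream' setting is either a literal name or a '/.../' regex string.
function isRegexSetting( stream ) {
    return stream.startsWith( '/' ) && stream.endsWith( '/' );
}

function matchStream( requested, entries ) {
    for ( const entry of entries ) {
        if ( isRegexSetting( entry.stream ) ) {
            // Strip the surrounding slashes and test the pattern.
            const re = new RegExp( entry.stream.slice( 1, -1 ) );
            if ( re.test( requested ) ) {
                return entry;
            }
        } else if ( entry.stream === requested ) {
            return entry;
        }
    }
    return null;
}
```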

Perhaps it would be less confusing if the stream names and stream patterns were stored in different setting fields? Something like:

	[
		'stream_name' => 'mediawiki.virtual_page_view',
		'schema_title' => 'mediawiki/page/virtual-view',
		'sample_rate' => 0.3,
		'EventServiceName' => 'eventgate-main',
	],
	[
		'stream_pattern' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],

Using both stream_name and stream_pattern in the same config would be disallowed. When requesting specific streams, I'd then not mutate the settings. A request for mediawiki.job.foo:

[
	'mediawiki.job.foo' => [
		'stream_pattern' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],
]

A request for all stream configs would just be grouped by either stream_name or stream_pattern.

[
	'mediawiki.virtual_page_view' => [
		'stream_name' => 'mediawiki.virtual_page_view',
		'schema_title' => 'mediawiki/page/virtual-view',
		'sample_rate' => 0.3,
		'EventServiceName' => 'eventgate-main',
	],
	'/^mediawiki\.job\..+/' => [
		'stream_pattern' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],
]

I'm not opposed to having stream_name and stream_pattern, but

Here, I'm mutating the original stream config object, changing the stream setting to the requested stream name that matched the original regex.

Why are you doing that? If the client depends on having the original stream name in stream property inside the stream config, then having a stream_pattern wouldn't really work either right? So, if you treat stream property as a part of the config and not mutate it at all, how would it be different from stream_name and stream_pattern?

If the client depends on having the original stream name in stream property inside the stream config

It doesn't. The client just needs to be able to map from stream name to config using the result.

Hm. Right now eventgate handles the regex in the stream config keys to map from incoming stream name to a config. I had planned to make eventgate request the full stream config (without knowing the stream names it will need configs for) at startup. It'd then do as it does now and look for stream config keys that match incoming meta.stream, either via exact match or regex.

Hm, perhaps the Stream Config API could have 2 modes: 1. you request a list of specific stream names, and it gives you back an object keyed by stream name. 2. you request the full config, and it returns the full config list as it has it, instead of an object with keys.

In this way, the above example request would return:

[
	'mediawiki.virtual_page_view' => [
		'stream' => 'mediawiki.virtual_page_view',
		'schema_title' => 'mediawiki/page/virtual-view',
		'sample_rate' => 0.3,
		'EventServiceName' => 'eventgate-main',
	],
	'mediawiki.job.foo' => [
		'stream_pattern' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],
]

And a request for all stream configs would return

	[
		'stream' => 'mediawiki.virtual_page_view',
		'schema_title' => 'mediawiki/page/virtual-view',
		'sample_rate' => 0.3,
		'EventServiceName' => 'eventgate-main',
	],
	[
		'stream' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],

EventGate would have to transform the returned array into what it needs: an object keyed by either exact stream name or the regex pattern.

Or, I could offload the stream name matching to MW stream config itself. I already implement both exact and regex matching for client side stuff. I don't love the idea of having to make an HTTP request to the MW API for every new stream that eventgate sees, but I suppose I could do like I'm doing with schema_precache_uris: provide eventgate a list of stream names to pre-cache on startup. Unknown stream names at runtime would require an HTTP lookup.

In either case it seems like eventgate-analytics is probably going to need an expiring cache of stream config to be able to pick up config changes to things it already knows about.
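A minimal sketch of the kind of expiring cache this implies (the class name and TTL handling are illustrative only, not eventgate's actual code):

```javascript
// Sketch: a tiny TTL cache so stream config changes for streams the
// service already knows about are eventually picked up without a restart.
class TtlCache {
    constructor( ttlMs ) {
        this.ttlMs = ttlMs;
        this.entries = new Map();
    }

    set( key, value, now = Date.now() ) {
        this.entries.set( key, { value, expiresAt: now + this.ttlMs } );
    }

    get( key, now = Date.now() ) {
        const entry = this.entries.get( key );
        if ( !entry || entry.expiresAt <= now ) {
            this.entries.delete( key );
            return undefined; // miss or expired: caller re-fetches config
        }
        return entry.value;
    }
}
```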

Or also, I guess there's no need to mutate the stream value at all, even if I always return an object grouped by stream. So a request for all config could return

[
	'mediawiki.virtual_page_view' => [
		'stream' => 'mediawiki.virtual_page_view',
		'schema_title' => 'mediawiki/page/virtual-view',
		'sample_rate' => 0.3,
		'EventServiceName' => 'eventgate-main',
	],
	'/^mediawiki\.job\..+/' => [
		'stream' => '/^mediawiki\.job\..+/',
		'schema_title' => 'mediawiki/job',
		'EventServiceName' => 'eventgate-main',
	],
]

This seemed a little weird to me at first, but the more I think about it, the more it seems like the most consistent thing to do: always return the same format, don't mutate the config object, and always key by the specific stream name when given; otherwise just key by whatever stream is exactly, whether it is a string or a regex.

I think this solves all my use cases.

mmm... if we convert the array into a key-value map, we lose the ordering of the array. Thus, we can't preserve the behavior where stream matching uses the first match in the array.

Hm, yeah. I guess eventgate should then be modified to work with an array and iterate through to find the first match every time? It'd have to do this for every event (right now it's using an object key lookup... which I assume is more efficient!). I could keep a cache map of matched stream configs in eventgate?
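That combination — keep the ordered array so first-match semantics survive, but memoize resolved lookups so the per-event cost stays a key lookup — could be sketched like this (hypothetical helper, not eventgate's code):

```javascript
// Sketch: iterate the ordered config array (first match wins) and
// memoize resolved lookups so each subsequent event is a Map lookup.
const resolved = new Map();

function firstMatch( stream, configArray ) {
    if ( resolved.has( stream ) ) {
        return resolved.get( stream );
    }
    let found = null;
    for ( const entry of configArray ) {
        const isRe = entry.stream.startsWith( '/' ) && entry.stream.endsWith( '/' );
        const matches = isRe ?
            new RegExp( entry.stream.slice( 1, -1 ) ).test( stream ) :
            entry.stream === stream;
        if ( matches ) {
            found = entry;
            break; // first match in array order wins
        }
    }
    resolved.set( stream, found );
    return found;
}
```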

BTW, notice I'm talking about the return result of the EventStreamConfig lookup, not the array in MediaWiki global config. That will be an array, and EventStreamConfig itself will match against the stream for requested streams. It's just that I don't want eventgate to have to make that HTTP API request for every stream it works with; I'd rather do that once on startup.

yeah, I understand. I mean, once in eventgate you have the map, you'd need pretty much the same code as you'd need in the stream config extension to map the actual stream name to the config. Thus, losing the array ordering will possibly make it inconsistent

I think some of the issues I'm having are caused by the combination of 3 users of stream config and their different use cases.

  • EventGate: needs schema_title to ensure a stream only has events of a specific schema lineage.
  • EventBus: needs EventServiceName to figure out which EventGate endpoint a stream should be sent to.
  • EventLogging: needs various settings like sample_rate to configure its behavior when choosing to send events. Each page load will need configs for only a few streams, and only a limited set of settings, to keep the data size shipped to remote clients as small as possible.

For some EventGate services like eventgate-main, the stream name -> schema_title mapping changes rarely enough that we can provide this in a static config file, and redeploy if we need to add new streams. For EventLogging (and EventBus too) this needs to be more dynamic. Folks need to be able to add new streams without restarting eventgate-analytics. Remote clients need to configure their sample rate without a code deploy.

I'd like it if EventGate code worked the same in all deployment instances (analytics, main, logging, whatever), independent of whether the stream config comes from a config file or from an HTTP API endpoint. eventgate-analytics will need to dynamically ask the API for stream config for streams it hasn't seen yet. eventgate-main will not have any new dynamic streams, so it can either get the existing config from the API on startup, or it can be given a static config file. If we don't mind a small coupling between eventgate-main and the MW API, we could have it always request configs for stream names directly from the MW API (and dynamically), but pre-cache a list of stream names on eventgate startup. Doing the eventgate + MW stream config API + precache approach would allow all eventgate deployments to work exactly the same way. This does make the MW API a dependency of eventgate, which I don't love, but I'm not sure how else to best resolve this.

@Milimetric @Pchelolo @jlinehan let's get in a hangout and brain bounce this problem soon eh?

Change 549652 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] [WIP] Stream config service support

https://gerrit.wikimedia.org/r/549652

Ok, I think I'm making some progress after a short chat with Petr on IRC today.

I'm going to make eventgate-wikimedia configurable to either use a stream_config_uri which has a map of stream configs (like it works today), or a stream_config_uri_template which will dynamically request (and cache) stream config for a given stream at runtime. We can then use a local static config file for eventgate-main, and use the remote stream config API endpoint for eventgate-analytics.

Still WIP, so we will see!

eventgate-wikimedia dynamic stream config usage patch ready for review: https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/549652.

It uses either a static stream config map object (as it does already), or a remote dynamic stream config URI. Which it uses is decided by the stream_config_is_dynamic config.

Setting stream_config_is_dynamic: true allows us to set e.g. stream_config_uri: https://meta.wikimedia.org/w/api.php?action=stream_config&all_settings=true&streams=<%= stream %>. eventgate-wikimedia will request (and cache) the stream config for streams at runtime. The (WIP) EventStreamConfig MW extension provides an action API endpoint which returns configs for requested streams.
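Very roughly, the dynamic mode could work like this (the `<%= stream %>` substitution mirrors the template shown above; the `fetchJson` parameter is a stand-in for whatever HTTP client eventgate actually uses, and the helper names are hypothetical):

```javascript
// Sketch of the dynamic mode: substitute the stream name into the
// configured URI template, fetch once, and cache the result per stream.
const configCache = new Map();

function buildUri( uriTemplate, stream ) {
    return uriTemplate.replace( '<%= stream %>', encodeURIComponent( stream ) );
}

async function getDynamicStreamConfig( uriTemplate, stream, fetchJson ) {
    if ( !configCache.has( stream ) ) {
        const uri = buildUri( uriTemplate, stream );
        configCache.set( stream, await fetchJson( uri ) );
    }
    return configCache.get( stream );
}
```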

@jlinehan we should start reviewing and testing this stuff together in MW vagrant with your client side code (in EventLogging extension?).

@jlinehan we should start reviewing and testing this stuff together in MW vagrant with your client side code (in EventLogging extension?).

The client code on prototype needs quite a few changes to be added to EL.
Maybe it is better to retrofit into EL the part that would deal with stream configs, making sure that we do not couple EventLogging's operation to the config existing? That would make for a smaller code change that we can CR and send.

The client code on prototype needs quite a few changes to be added to EL.

It will be a patch for review, not going to be merged without extensive review. I don't know how it will be integrated in EL yet, but we'll work it out.

Hi @Ottomata, @phuedx is unavailable at the moment. What were the parts of the patchset where you were hoping for most attention? I ask so we can try to find an appropriate substitute to review.

It is a new extension and MW API endpoint. I've never created a new extension or made an API endpoint before, so I'm looking for best practices review. Am I missing something? Should I do things differently? Are the API endpoint and parameters names good? Etc.

Thanks @Ottomata. I see the note from @Krinkle on the patch about the other gating, where you're sure to get some additional good feedback.

But I gather you're looking more for feedback on conformance to good practices in MediaWiki PHP independent of the other gating.

@Jhernandez @LGoto looping you in here for your routing within PI or via the Product-Platform sync as appropriate.

People are in and out this week, so there may be a delay in workflowing.

Hi team, I found one reviewer somewhat serendipitously. @pmiazga will provide a review of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventStreamConfig/+/545654/ and this has been okay'd with @ovasileva provided it flows through the Web team's workboard per usual process. @pmiazga see T233634#5664331 for the context - much of the architectural specification has already been worked out over the past months, but @Ottomata is looking for conformance to good MediaWiki PHP conventions and sensible API endpoint and field names.

It may be several business days before @pmiazga can perform this review due to some team activities.

@Jhernandez I think up to you in case you want to pull in another reviewer. My sense is an additional one would be good given the broad forthcoming use of this tech, but I also know there are many contending priorities. CC @LGoto.

I would anticipate the code review is about 1-3 hours of work per reviewer.

Our goal this quarter is to get the general idea approved so we can make it work with all the other moving pieces in MediaWiki Vagrant. We plan to actually deploy (and go through the official extension deployment review queue process) early next quarter, so having some MW devs think it is ok before then would help speed all those things along.

(Thanks so much for helping here Adam!)

Change 554512 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[integration/config@master] Add CI for (WIP) EventStreamConfig MW extension

https://gerrit.wikimedia.org/r/554512

Change 554512 merged by jenkins-bot:
[integration/config@master] Add CI for (WIP) EventStreamConfig MW extension

https://gerrit.wikimedia.org/r/554512

Mentioned in SAL (#wikimedia-releng) [2019-12-04T15:57:33Z] <James_F> Zuul: Add CI for EventStreamConfig MW extension T233634

Change 545400 abandoned by Ottomata:
[WIP] Event Stream Config loading

Reason:
Squashed into https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/554893

https://gerrit.wikimedia.org/r/545400

Change 549652 merged by Ottomata:
[eventgate-wikimedia@master] Stream config service support

https://gerrit.wikimedia.org/r/549652

Change 545654 merged by jenkins-bot:
[mediawiki/extensions/EventStreamConfig@master] Add StreamConfigs MW Service and ApiStreamConfigs

https://gerrit.wikimedia.org/r/545654

Change 574816 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Add stream_config_uri_options config

https://gerrit.wikimedia.org/r/574816

Change 574816 merged by Ottomata:
[eventgate-wikimedia@master] Add stream_config_uri_options config

https://gerrit.wikimedia.org/r/574816

Ottomata set Final Story Points to 21.