Page MenuHomePhabricator

MEP Client MediaWiki JS (MVP)
Closed, ResolvedPublic

Description

The new event client features will be folded into the existing EventLogging extension in phases. Hopefully this will streamline code review and make it easier for us to manage the work. Each phase has its own task, since each phase will be its own patch. Things may change. For now, here is an outline of the phases:

  • Phase 1 (Submit)
    • Events are sent using HTTP POST to an EventGate instance
    • Stream configuration data is loaded by the extension and used to format events
  • Phase 2 (Sampling)
    • Per-stream uniform sampling rate can be set by a stream configuration field.

A rough outline in JavaScript code is provided at: https://github.com/linehan/el-epc-patch-phases.

NOTE: Non-essential patch phases (phase 3: stream CC-ing, phase 4: association controller) have been moved off the critical path and will be tracked as part of T259704: BUOD: Upgrade MEP clients to full release status.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jlinehan renamed this task from Incorporate revised client into EventLogging extension to EventLogging MEP Upgrade.Feb 6 2020, 6:59 PM
jlinehan removed a project: Patch-For-Review.
jlinehan updated the task description. (Show Details)
jlinehan added a subscriber: mforns.

Updated task to reflect new strategy for getting this update done. Progress will continue on the relevant sub-tasks.

@Nuria @jlinehan @mforns Let's move discussion from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/573677 about what EL's behavior should be if no stream config is registered here.

Nuria added a subscriber: Krinkle.Feb 25 2020, 7:51 PM

pinging @Krinkle as well

On my opinion EL client should be as lean as possible, not requiring any additional dependencies to work. This keeps client simple but also aids with performance as page load is smaller and thus we avoid perf issues derived from page weight when putting EL in every single page. I think is useful to work with the "convention rather than configuration" paradigm. We choose our defaults for the most common case scenario and that one does not require any additional config to work so no additional code needs to be loaded. We are just making the common case simpler to reduce the page weight for as many pageviews as possible.

jlinehan added a comment.EditedFeb 25 2020, 8:43 PM

pinging @Krinkle as well

On my opinion EL client should be as lean as possible, not requiring any additional dependencies to work.

I agree with this, but in the new client, it does require at least Extension:EventStreamConfig. There is just no way around it, the client is tightly coupled to EventGate/MEP/etc. It requires mw.user as well, although that is probably already being loaded.

It does not make the client simpler if we allow clients to send data to any stream name they want. Since an event can be sent with or without configuration, the number of cases for troubleshooting or misconfiguration actually go up.

If by "simpler" you mean "smaller", then yes, sure, but how much smaller? So far it seems like it would prevent somewhere on the order of 5 small strings from being in the page. For scale, the /*! */ comment blocks, which are not removed by our minification pipeline, would save you just as much -- just remove the '!'.

It's about intentionality. After all this effort to be deliberately more intentional with our analytics -- schemas, streams, etc., why would we stop when it comes to actually sending the event?

I think is useful to work with the "convention rather than configuration" paradigm.

"Convention rather than configuration" makes software inexplicable. "Convention" does not live in code, it does not live in behavior. It can only be appreciated by either reading comments or examining the system in its totality and considering all of its possible cases. Configuration is explicit, it lives in the code, it can be manipulated by the code, it allows us to write tests or fail or warn in a consistent way.

We choose our defaults for the most common case scenario and that one does not require any additional config to work so no additional code needs to be loaded. We are just making the common case simpler to reduce the page weight for as many pageviews as possible.

There don't need to *be* defaults. I want to emphasize this because it seems like I keep having defaults added/suggested. It can just fail gracefully. If it shouldn't fail, it will be caught and fixed. If it isn't caught, it won't appear downstream in Kafka/Hadoop/etc. where it can pollute the properly configured data.

"Every stream has a stream config. If you want to produce events to a stream, that stream must be registered at some point in the RL pipeline. If it is not, the event will not be sent."

This is simple, easy to understand, forces misconfiguration errors to be caught early, and sacrifices no flexibility at all. The pageload cost introduced is an epsilon. I welcome evidence (actual numbers) that say it isn't.

This is premature optimization.

jlinehan added a comment.EditedFeb 25 2020, 8:58 PM

Contrast all of this stuff I said with, for example, error logging. You could argue that it does everything that a piece of instrumentation does, except it needs no stream configuration. It also literally has no dependency on EventLogging. It knows what it is producing and it knows how it wants to produce it. There is no middleware layer that needs to manipulate configuration on its behalf -- it does all of it "manually".

So let's imagine that "virtualpageview" etc. operate in a similar way. Perhaps there *is* a case to be made for a super thin module bit that just takes a stream name and a data blob and pops it onto the queue to be sent to EventGate. Then I would argue that this is a distinct piece of software from this client. The client (which would be coupled to EventStreamConfig and all that jazz) would use the config to both annotate the event data (with meta, etc.) and also to determine if it should be sampled or not, all of that stuff.

In that scenario, the client we are writing would, rather than call mw.eventLog.enqueue( ... ), call mw.eventLog.produceMEP( streamName, eventData ).

And then if you were a non-analytics event or something that really knew it didn't need any stream config, for example error logging, you too could also just call mw.eventLog.produceMEP( "error", errorLoggingData ).

@Nuria is that what you're seeing here? I'm not against that. Maybe that super thin part remains in EventLogging, and the client as we're thinking about it today, gets moved down to WikimediaEvents or something. Am I on the right track?

Ottomata added a comment.EditedFeb 25 2020, 8:59 PM

I agree with this, but in the new client, it does require at least Extension:EventStreamConfig.

It does right now, but we can make this an optional dependency. The PHP hook can just not do anything if it this extension is not enabled.

There is just no way around it, the client is tightly coupled to EventGate/MEP/etc.

Not really? It is coupled to an HTTP endpoint that accepts events in POST bodies, but that is pretty much it.

For scale, the /*! */ comment blocks, which are not removed by our minification pipeline

ò_ô

"Convention" does not live in code

In this case it does. The default configuration should live in code as a defaultStreamConfig object and be used and maybe even merged with any loaded stream config. The stream names are set in the code in the produce() call

"Every stream has a stream config. If you want to produce events to a stream, that stream must be registered at some point in the RL pipeline. If it is not, the event will not be sent."

I agree that every stream must have a stream config. The minimal config is that a stream must specify its event schema_title. EventLogging doesn't do anything with the stream's schema_title, so it isn't given it via RL (via EventStreamConfig). Without a stream config, if EL sends the event anyway, we'll get logged errors, the event will not make it into Kafka.

EL works as is now without stream config in both mediawiki-vagrant and in beta. If we force a dependency on EventStreamConfig, and also registering configs, it won't anymore.

jlinehan added a comment.EditedFeb 25 2020, 9:12 PM

There is just no way around it, the client is tightly coupled to EventGate/MEP/etc.

Not really? It is coupled to an HTTP endpoint that accepts events in POST bodies, but that is pretty much it.

The client, as written, knows what EventGate is expecting and also expects certain things in order to satisfy it (e.g. $schema). At some point in the abstraction chain I think we can instantiate this as a client for EventGate. Perhaps there is a need for a more generic layer that does not know what a stream is (see above comment), but doesn't that layer basically boil down to mw.eventLog.enqueue ?

In this case it does. The default configuration should live in code as a defaultStreamConfig object and be used and maybe even merged with any loaded stream config. The stream names are set in the code in the produce() call

Then it sounds like configuration. The class of issues I pointed out on the patch notes still exists though.

EL works as is now without stream config in both mediawiki-vagrant and in beta. If we force a dependency on EventStreamConfig, and also registering configs, it won't anymore.

This is a good reason why this client should probably be distinct from EventLogging. The dependencies are trying to tell us something...

If the game here is more about eliminating EventStreamConfig as a hard dependency of EventLogging than about the pageload cost of adding a few extra stream config keys, then I'd like to decide what exactly EventLogging *is*. Especially if we have third party users that have loaded EventLogging (see: https://wikiapiary.com/wiki/Extension:EventLogging) and are actually using it, I understand why this is pure downside from them, since they presumably won't have the MEP backend that makes EventStreamConfig worthwhile.

But then why are we building this client in EventLogging?

The client, as written, knows what EventGate is expecting and also expects certain things in order to satisfy it (e.g. $schema).

The client does do these things because EventGate expects it, but that does not mean it is coupled to EventGate. Any HTTP POST accepting endpoint could be used to accept the data.

Then it sounds like configuration.

Ya, configuration with defaults :)

they presumably won't have the MEP backend that makes EventStreamConfig worthwhile

Cool I did not know about wikiapiary!

They also presumably don't have the Python eventlogging server side backend that makes EventLogging worthwhile.

then I'd like to decide what exactly EventLogging *is*

Hm, good luck? :p JK kinda. I'd also like to see a clean break from the EL extension, but we are in the minority.

The client, as written, knows what EventGate is expecting and also expects certain things in order to satisfy it (e.g. $schema).

The client does do these things because EventGate expects it, but that does not mean it is coupled to EventGate. Any HTTP POST accepting endpoint could be used to accept the data.

I take your point here, but you're describing dependency and I'm only saying they are coupled. Yes, this client functions in the absence of EventGate. However it does things it would not otherwise do were EventGate not present, and that can't be explained in the absence of EventGate. Its entire purpose is to know about EventGate and make sure that EventGate has the right data in the right format to function as intended. It does this to protect callers from having to know about EventGate ($schema being the exception). This is the sense in which I consider it coupled to EventGate.

They also presumably don't have the Python eventlogging server side backend that makes EventLogging worthwhile.

Very true. So should EventLogging be treated as a WMF-specific extension?

Nuria added a comment.Feb 26 2020, 5:45 AM

However it does things it would not otherwise do were EventGate not present, and that can't be explained in the absence of EventGate

mmm, no, that is the way it would be used in environments such us vagrant, for example, right? devs should not require any type of config to start sending events from mediawiki code at all in their test environment. Events might go to nowhere as there is not an http listener and that is OK, in this case the advantage is not only performance but simplicity.

Side note that when taking about perf I was thinking not only page weight but additional requests which are normally more costly. @Krinkle can chime in and correct me if needed.

jlinehan added a comment.EditedFeb 26 2020, 1:24 PM

For scale, the /*! */ comment blocks, which are not removed by our minification pipeline ...

I want to correct this. This is not true -- I checked and in production there are no comments. Probably there is a development build flag that preserves these comments in Vagrant and removes them in production.

Ottomata added a comment.EditedFeb 26 2020, 3:38 PM

Side note that when taking about perf I was thinking not only page weight but additional requests which are normally more costly

@Nuria, in either case, there will be no additional requests; the stream configs are specified by mediawiki-config and/or extension config serverside, and the JS version of the stream config is generated and given to the page at load time via ResourceLoader.

Very true. So should EventLogging be treated as a WMF-specific extension?

Its entire purpose is to know about EventGate and make sure that EventGate has the right data in the right format to function as intended. It does this to protect callers from having to know about EventGate ($schema being the exception).

I'd differ here, but maybe that is just because 'EventGate' is not really relevant. What is relevant is some concepts. This library is trying to abstract some details about certain fields of the events that we at WMF have adopted as conventions, but the meaning of those fields is more about some architecture concepts than it is about EventGate. In a good Stream Data Platform, any event should be schemaed, and there should be a way to map from any event to its schema. Confluent does this by adding an opaque schema_id binary integer at the front of every message in Kafka (which also means you need special decoders to read from Kafka). We've chosen JSON to avoid needing a schema to read data, but that doesn't mean we don't need to the schema for other reasons.

You could make an argument that including the stream name in the event is not needed. That is the only 'EventGate' specific piece of data here. We really only include it so EventGate could handle POST batches of events destined to multiple streams. If we didn't need that feature, we'd probably put the stream name in the API endpoint itself.

ANNYYYYWAAAYYYYY

@Nuria even though I lean towards not needing a config to produce, there is a problem Jason identified with the current code at least:

Let's say there is a stream config defined with a sample rate for the button.click stream.
In mediawiki-config:

$wgEventStreams = [
   [
      'stream' => 'button.click',
      'schema_title' => 'mediawiki/button/click',
      'sample_rate' => 0.5,
   ]
   // ...
];

Extension A JS module has instrumentation code to send events to button.click stream, but does not register that button.click to get any stream config for it. It expects to send button.click events unsampled.
Extension B JS module has instrumentation code to send events to button.click stream, and does register button.click and gets stream config for it.

If both Extension A and Extension B modules are loaded on the same page, button.click will be configured and sampling will be applied for BOTH extension's instrumentation produce calls.

I suggested 3 solutions:

  1. Require that modules load stream config for every stream they will use. (This is what Jason wants).
  1. Isolate the config per module. Marcel was trying to do this with his earlier patches. I'm not sure how this would work in practice though; each caller would have to specify which config to use, based on some key or instance var. I don't think RL has a way of knowing which context it is currently executing in. I.e. callers couldn't just do mw.eventStream.produce(), they'd have to do something like mw.eventStream.getInstance('ext.Module.Name').produce(), and each instance would contain only the stream config that was requested for that module. (We'd have to pass this module name to the hook that registers the config, but we could do that)
  1. Or we could just not worry about it and say that multiple modules producing to the same stream is a bad idea and not supported and you might get weird behavior.
jlinehan added a comment.EditedFeb 26 2020, 5:24 PM

This library is trying to abstract some details about certain fields of the events that we at WMF have adopted as conventions, but the meaning of those fields is more about some architecture concepts than it is about EventGate.

I understand this but the logical conclusion of this argument is that our client should be:

function produce( event )
{
    if ( "$schema" in event ) {
        navigator.sendBeacon( intakeURL, JSON.stringify( event ) );
    }
}

So, at whatever level of abstraction we choose to begin acknowledging the reality that EventGate and the strategy of having a common schema whose stipulated properties EventGate depends on to function, blah blah -- that is the level at which the client I am talking about sits.

The plan is that all instrumentation written for the web platform for the purpose of analytics, will be made consistent through the use of a piece of software -- I'm calling it a client -- which enforces that consistency. The properties that it will reserve for itself to manage can be whatever we want -- but they are probably going to include the properties defined in the common schema inherited by all analytics schemas. It is also possible (as I've argued, desirable) to implement other requirements, such as 'events produced to streams for which no stream configuration exists, are discarded'.

Are these requirements a property of stream-data-platform-client-with-stream-configuration? No, they are properties of the thing we are building. Our thing extends -- if you insist -- the notion of stream-data-platform-client-with-stream-configuration. It has an independent purpose which is more restrictive than the thing it extends. If we want to pick names for these things, to prevent ourselves or others conflating them, we can. That's a documentation task that is independent of the software written and the requirements it fulfills.

  1. Require that modules load stream config for every stream they will use. (This is what Jason wants).

To clarify, it's not necessary for the module which uses a stream to register the stream config. It is sufficient if any module in its dependency tree registers it. (@Ottomata can fact check me) So this requirement should be less tedious than it sounds.

Nuria added a comment.Feb 26 2020, 6:14 PM

So, unless @Krinkle disagrees per our conversation we are going to require a basic stream config for events to be sent, since this config is plasted in the page upon loading (yours truly had not grasped that earlier) perf concerns are small.

I talked with @Krinkle today and refactored a bit of the code we've been working on. I submitted a patch on top of Marcel's. He understands it, and I will add more docs and info about it tomorrow! But I have to run now see yaaaa!

jlinehan added a comment.EditedFeb 26 2020, 10:55 PM

I talked with @Krinkle today and refactored a bit of the code we've been working on. I submitted a patch on top of Marcel's. He understands it, and I will add more docs and info about it tomorrow! But I have to run now see yaaaa!

Let's talk... being able to specify override stream config in the produce method is a non-starter. The entire point of stream config being specified off-client is that it allows us to centralize and canonicalize the runtime behavior of our clients. This eliminates that and also appears to allow configuration to mutate during runtime? Maybe that's not the intention.

I thought we were trying to lower the entropy of this piece of software but every default, optional argument, and alternative configuration path increases that entropy. I would rather have simple software that covers 95% of cases than complicated software that gets the full 100%. All of the complexity gets added chasing that last 5% and we should just make the cases in that 5% responsible for themselves rather than accomodate them in the core interface.

We just went through this with the stream config question today. We either have a default that increases the complexity of the whole program and makes its behavior uncertain, because ~5 events don't need stream config, or we tell those ~5 events they need to do a bit of extra work to get with the program and follow the convention. I prefer the latter approach as long as it doesn't sacrifice performance, etc. I don't know if the problem that motivated this patch has the same theme, but if it does, I'll likely advocate that approach again.

I also saw the snippet of code that @Krinkle put on the patch, and I think we need to take another look at how this is supposed to work. I'll sit down and read over EventStreamConfig and educate myself but this code looks freaky. Let us meld brains tomorrow...

Ottomata added a comment.EditedFeb 27 2020, 2:54 PM

Ya sorry, I only submitted this here as a PoC so that Marcel could work with it, not as a final draft. We are still WIP on this thing!

Let's talk realtime but I will respond here too.

being able to specify override stream config in the produce method is a non-starter.

It is overridable in the eventLog.produce method, which is now stateless. This makes testing easier, and also makes debugging easier too. However, this is not the 'public' API. EventLogging users are expected to use the function generated by the registerEventStreams hook, which does not allow for overriding configs.

the problem that motivated this patch

I originally was trying to avoid having EventLogging users write any code before they could use e.g. mw.eventLog.produce. I did this by having them create a new RL module that called the registerEventStreamConfig hook to generate a JS snippet that would be that RL module's main file. Only the main packageFile in a RL module is loaded and executed by default; any other packageFiles must be manually required. Timo told me that forcing everyone to make a new RL module just to use EventLogging wasn't going to fly. Apparently more modules result in more complexity for RL when it is managing the large dependency tree (which is done on the client side). So, no matter what we do, we are going to have to ask EventLogging users to require something.

Since that is the case, I figured we could make them require something that returned something useful to them, not just some file that ran some code that changed global state in mw.eventLog. I went with returning a produce function that closed in all of the stream config that a EventLogging module has registered. Each module only can produce only streams that it has directly registered.

Change 575326 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/WikimediaEvents@master] Convert ResourceLoader modules to packageFiles

https://gerrit.wikimedia.org/r/575326

Change 575326 abandoned by Ottomata:
Convert ResourceLoader modules to packageFiles

Reason:
AH! https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/ /553356

https://gerrit.wikimedia.org/r/575326

jlinehan added a comment.EditedFeb 28 2020, 3:37 PM

What's worrying me here is that we aren't providing good enough metaphors/lies and the result is that people will start asking why, why, why, which will lead to frustration.

In order to not be confused, people will need to understand:

  • Instrumentation is done per-module
  • That EventStreamConfig provides a hook (a what?) that you must use to populate a packageFile in either Resources.php or extension.json
  • That the output of this hook is a blob of some kind that you must require
  • That you should require this blob in the main package of the module only
  • That this require has no side effects and is local only to your module
  • That it introduces a page-load cost and should be removed when you are not calling instrumentation
  • The relation between a stream, stream config, and this code I need to put in my module

There are so many weird gotchas here that people will be uncomfortable not knowing. I imagine most of the time this config will end up left in as boilerplate, once they get it working, and just change the strings with the stream names. Since instrumentation comes and goes, moves around, gets adjusted, and less senior developers are often asked to work on it, I'm more uncomfortable than I otherwise would be, asking people to tolerate a big mystery in the middle of their module. It sort of runs counter to goals like reducing mental load, increasing certainty, allaying concerns about performance, etc. I also don't really see anything similar to this in the other modules out there, which worries me a little.

Proposal

Right now, we have chosen to attach the stream config blob to the module that needs it, because that way they are cached together and you can compose modules together using RL in the way that we need to do.

But don't we already have a system for composing dependencies together? Isn't it called ResourceLoader? What stops us from having some extension, for example EventStreamConfig, contain a module for every stream config? Then, (application) modules that want to use that stream just do

dependencies: [
    'mw.whatever',
    'ext.eventStreamConfig.editAttemptStep'
]

in their extension.json or in Resources.php

We still have

  • No additional HTTP request
  • Stream config gets cached
  • Stream config is requested by the module that uses it

But now also have

  • No need to change application code in the module (i.e. no need for require)
  • The set of stream configs added to the page is the minimal set that satisfies the modules (ResourceLoader deduplicates like it would for any other dependency)
  • You can add the stream config anywhere in the RL dependency tree and everything in the loaded page could use it.
  • Adding instrumentation means adding a single string to your config

Does this mean we abandon using mediawiki config? Not really. I bet we could come up with some kind of way to do code generation to place the stream config in this extension as modules each time a config deploy went out or something. Once the stream has a module in the extension, it's actual content can be changed in mediawiki config. This is just a way to get it into RL. EventStreamConfig's extension.json would look like

resourceModules: {
    editAttemptStep: {
        packageFiles: [
            { "name": "streamconfig.json", "callback": "EventStreamConfig::get", "arg": "editAttemptStep" }
         ]
    }
}

and then the module would just be:

mw.eventStream.configure( require( './streamconfig.json' ) );

I'm curious why this approach wasn't something we went with -- was it explored when we started having conversations last year with Timo? Maybe it was rejected for a reason I'm not seeing? I would much rather solve this problem...

jlinehan added a comment.EditedFeb 28 2020, 4:48 PM

Other things to ponder that this would allow:

Per-stream dependencies

It's easy to see how we could construct a dependency chain

SomeModule --> EventStreamConfig.<stream>  --> EventLogging.thickClient

so that code that only needs lightweight stream configuration could be handled by a different dependency, like:

SomeModuleLoadedEverywhere --> EventStreamConfig.global.<stream> --> EventLogging.thinClient

Or simply depend on, and call, the thin client directly, without going through EventStreamConfig.

The thick client (or anything else) doesn't have to go out on any page that doesn't need any stream config this way. In general we could have classes of streams that require either more or less in terms of dependencies and that might be useful. The thin client (which may go out on all pages as it does currently) can be extremely thin, even thinner than it is today (imagine moving sampling to a thick client for example).

Per-client stream configuration reconfiguration

You could imagine some kind of A/B test stream config that would flip a coin (or perhaps do other things) only after it gets to the client. This is awkward to do currently and the logic would either need to be baked into the client itself, or the caller code would need to manipulate it. Neither of these is great, but if we have the config itself in its own closure, it can do things like

var conf = require( './streamconfig.json' );

mw.eventStream.configure( conf[ coin_flip() ] );`

Is this a good idea? I have no idea, but there's a clear path forward if we wanted to start doing more complicated things with the configs. Having it in a module named after itself clearly enforces the abstraction that this behavior is kind of a property of the stream configuration, isolated from either the caller or the client.

Ottomata added a comment.EditedMar 2 2020, 4:03 PM

That you should require this blob in the main package of the module only

BTW, this is not true. It doesn't really matter where you require it :)

That this require has no side effects and is local only to your module

This seems like a pro rather than a con, no?

That it introduces a page-load cost and should be removed when you are not calling instrumentation

We could reconsider the idea of stream config defaults, which I really don't mind.

The relation between a stream, stream config, and this code I need to put in my module

Heheh, be careful what you wish for: "Oh, you want dynamic stream config? Ok, but things will be more complex!"

I've only skimmed and got a general gist of understanding, so will read more later, but here are some problems:

EventStreamConfig, contain a module for every stream config

This is not really different (and I think worse) than having each instrumentation extension create a new stream config module like I originally was doing, e.g. 'ext.wikimediaEvents.streamConfig'. Timo told us we couldn't create new modules for each extension just for this, let alone for every stream. Creating new modules is not cheap. RL has to manage complex dependency chains of modules on the client.

What stops us from having some extension, for example EventStreamConfig, contain a module for every stream config?

I really don't like this, it couples the WMF and instrumentation specific usages in a library.

jlinehan added a comment.EditedMar 2 2020, 4:58 PM

That this require has no side effects and is local only to your module

This seems like a pro rather than a con, no?

That it introduces a page-load cost and should be removed when you are not calling instrumentation

We could reconsider the idea of stream config defaults, which I really don't mind.

That isn't a list of cons, it's a list of things that developers using this software (as it's currently written in the most recent patchset) would need to know in order to use it. The point is that this list is long and full of jargon and behavior.

EventStreamConfig, contain a module for every stream config

This is not really different (and I think worse) than having each instrumentation extension create a new stream config module like I originally was doing, e.g. 'ext.wikimediaEvents.streamConfig'.

The approach is similar but I think that this arrangement brings other benefits as outlined elsewhere rather than just being a means to an end. It seems like this is the way that this software wants to be written in the RL environment, and the key reason I'm hearing that it can't be done that way is because RL doesn't magically run in constant time. This is fine, but before we gerrymander this software in order to appease the RL gods, I want to make sure I understand our options, including whether there is a reasonable budget that can be operated in, or balanced in other ways.

Creating new modules is not cheap. RL has to manage complex dependency chains of modules on the client.

Is creating them the thing that isn't cheap, or is resolving their dependencies the thing that isn't cheap (or both?).

What stops us from having some extension, for example EventStreamConfig, contain a module for every stream config?

I really don't like this, it couples the WMF and instrumentation specific usages in a library.

Modules gotta live somewhere so I put the event stream configs in the one called EventStreamConfig. It doesn't matter if it's WikimediaEvents etc.

I don't mind finding some way to do this if @Krinkle approves, but I doubt he will. @Krinkle a quick summary of the idea (Jason, correct me if I am wrong) is:

  • 1 module per stream declared somewhere (maybe WikimediaEvents extension.json?). This module uses the RL packageFiles callback to load stream config (and possibly configure a global or return a produce closure.)
  • Instrumentation user modules depend on this module to be able to produce to the stream.
Nuria added a subscriber: phuedx.EditedMar 2 2020, 9:18 PM

adding @phuedx to conversation as a user of this system.
I want to +1 andrew's comment pointing out that doing dynamic config client side that is different per module (which was a requirement of stream config) is really not that simple of a proposition and it is understable that complicates matter, @phuedx can probably advice here what level is tenable for users

Milimetric added a comment.EditedMar 4 2020, 5:34 AM

While there may be some weird requirements like dynamic client side config (I'm not sure what that means or why it's a requirement), I am very much in favor of Jason's approach in T238544#5926829. Haven't fully thought through the specific solution, but the approach of optimizing for simple mental models is correct for platform work.

EDIT: Andrew explains below that "dynamic client side config" stands for being able to change config without deployment, which is great. I didn't know that's what we were calling it and I've been paying at least a little bit of attention. I think the current discussion about complexity is great and we should have it until we are done having it.

I think we all keep talking in private channels about this stuff. Let's try to keep discussion here so we all have the same info. Here's an excerpt from Timo in our IRC convo (I don't think he'll mind me posting this).

Re why adding N more RL modules is bad:

I'm referring to the overhead of registring modules itself. Every page view goes like this: GET /wiki/Bananas. that returns text/html with 2 scripts: 1. <script src=" modules=startup"> and 2. <script>RLQ/push/PAGEMODULES=[…]</script> where PAGEMODULES is the list of higher-level modules that should be loaded on this page, usually queued by PHP via addModules(). modules=startup ships two important things 1) the module manifest, and 2) the definition of mw and mw.loader itself. Once they arrive, they inspect PAGEMODULES, resolve it against the tree recursively and then decide what to download. So if it contains "ext.visualEditor.mw" it will turn that into something like

load.php?modules=mediawiki.util,mediawiki.Title,oojs-ui,ext.eventLogging,ext.visualeditor.base,,ext.visualeditor.core,ext.visualEditor.mw

Then, it looks in that big module manifest and takes all the version hashes of the modules in this flat list, hashes that again, and makes it into the 'version=' parameter.

Every module registered has to be in that tree, which is loaded upfront regardless of what is or isn't loaded, regardles of whether it is loaded directly or as a dependency.

That initial 'startup' payload should remain small (<21K compress is our target). In general a "module" should be thought of as a bundle, not a component or file.

like dynamic client side config (I'm not sure what that means or why it's a requirement),

Dan, this is the whole reason we are doing EventStreamConfig in the first place. See T205319: Modern Event Platform: Stream Configuration user stories.

the approach of optimizing for simple mental models is correct for platform work

Indeed, but we are prioritizing via ResourceLoader Principles. Here's what we are trying to do.

  1. Not send stream config to pages that don't need it. This means that we need to somehow have instrumentation modules themselves specify which streams they will use. We can't centralize this in any one place, any more than we can centralize which modules are loaded on which pages. (If this isn't a real requirement, we can ship the entire $wgEventStreams via regular RL config exports to every page.)
  2. Have an easy way for instrumentation developers to produce events that obey the config for their destination stream.

How we do 2. is pretty debatable, and I'm personally not that opinionated on. (I COULD BE IF SOMEONE MADE ME!). Global configs, per module produce + stream config closures, explicit passing of stream config to a stateless produce function, whatever.

However, doing 1 is not as easy. Exporting the stream configs to the client side needs to be done via ResourceLoader packageFiles in order to reduce the number of HTTP requests the page has to make. My original approach (as well as Jason's most recent suggestion) does this while adding new RL modules, so that the code/config needed is auto loaded by ResourceLoader (each module has only a single entry point). Timo has told us that adding new RL modules either per extension (my approach) or (even worse) per stream (Jason's approach) is not good for client side RL reasons.

Also, I don't think the mental model for 1. is much different in any of these approaches. The only difference from a instrumentation developer perspective is the addition of a require somewhere in their instrumentation module code.


For clarity, here's how my approach would work.

Stream config must be defined in mediawiki-config in $wgEventStreams. This is how eventgate knows what schemas are allowed in a stream. E.g.

$wgEventStreams = [
    [
        'stream' => 'button.click',
        'schema_title' => 'mediawiki/button/click',
        'sample_rate' => 0.5,
    ],
    [
        'stream' => 'link.hover',
        'schema_title' => 'mediawiki/link/hover',
    ],
    ...
]

RL Module setup step. This only has to be done once per extension/module, not each time an instrumentation developer wants to use a new stream.

In extension.json in the RL module's list of packageFiles

{
     "name": "eventstreams.js",
     "callback": "EventLoggingHooks::registerEventStreams",
     // A MW config variable (e.g. $wgMyExtensionStreamNames)
     // that contains the list of streams to register.
     // (Alternatively, this can be the list of stream names provided here directly.)
     "callbackParam": "MyExtensionStreamNames"
 }

Somewhere in one of the RL modules, (probably in the main file, which is usually thee first one in the list of packageFiles), the new eventstreams.js file must be required to make its code run.

# What happens inside of eventstreams.js is up for debate.
# In my approach, it makes e.g mw.track('eventstream.<stream_name>', event) work.
require('./eventstreams.js');

Once the above is done (probably by us as we port Extensions away from EventLogging legacy eventLog.logEvent over to this new code), instrumentation developers only have to think about stream they want to add, not how the internals work.

Usage step. Declare the value of MyExtensionStreamNames in mediawiki-config.
$wgMyExtensionStreamNames should be defined to a list of streams that the MyExtension module will use. (We could make declaring streams for use by a module optional to use with sane defaults, I don't mind either way.)

$wgMyExtensionStreamNames = ['button.click', 'link.hover'];

Finally, wherever you are instrumenting code in your JS module, do:

mw.track('eventstream.button.click', buttonClickEvent);
jlinehan added a comment.EditedMar 4 2020, 5:32 PM

Was able to have a chance to talk to both Joaquin and Timo about this off-ticket. Things to think about.

There are concerns from Joaquin because people had trouble understanding and using the old system, that's part of why we're making a new system (at least from the Product perspective). He thinks it would set us back a good deal if we delivered something that was this much more complicated. I know to the people on this ticket it doesn't seem that much more complicated, but I think the reality is that we've been marinating in this stuff for quite a while. Obviously everything has a cost, and complexity isn't always avoidable, but in this case I think that the cost of making things simple is reasonable and justified. In fact, what I'm proposing wouldn't really make it less complicated overall, just less complicated for the end-user -- the complexity gets moved to a place where they don't need to interact with it.

The reason for Timo's concern with regard to performance is the size of the module manifest which is sent to the client. Every module increases the size of that manifest, which, if it gets out of hand, will impact pageload. He likes it better if modules are thought of as bundles rather than single files. That said, he seemed flexible and willing to work with us as long as we understood the performance constraints. I think there is a credible case from Product for increased usability first, with the understanding that we must create guidelines informed by performance that allow us to operate in a proper budget and not grow the number of modules in an irresponsible way. There is also the potential for us to group commonly used streams together into a single module, or use the indirection layer that Andrew already wrote about. I think we can be good citizens together.

About whether to instantiate a produce() function with a pre-loaded configuration, or have all loaded configs in a shared variable inside the MEP client.

I think that instantiation as currently written complicates the interface unnecessarily for little benefit. I understand the reasoning and in another situation I might agree with it, but I think that here it leads the user to a wrong way of thinking. Most obviously, the fact that each stream configuration is loaded independently per-module. They are not and never were intended to be independent in that way. The abstraction was of a client as a whole. At the moment, with our simple config examples, they seem like the same thing, but for example if we ever do any computation per-config, such as for example flipping a coin or generating a token, it will be done independently per-module in the current approach. This makes the state of the client no longer something that exists as a whole entity, but rather something that exists relative to a particular module. Unfortunately I do not believe that instrumentation or our software is organized in a way that can support that abstraction. There are other reasons but this one stands out to me.

It's also not something I am keen to mirror on the apps. The goal was to bring them precisely the kind of flexible mw.track-like approach, allowing them to get rid of their class-and-object based approach which had become messy and irritating.

I think a good compromise could be to make mw.track('stream.<name>', eventData) the primary interface. That would make the client and stream-configuration-loading (if done e.g. in the manner I propose) completely passive and opaque to the caller. There would be no possibility for the application to call .configure() because they would not even have a reference to mw.eventStream. One downside is that mw.track does not provide timestamps, they would only be added when we consumed the event from the queue. Probably workable. Other thoughts welcome.

Let me ask the conditional question: if we were able to allay the performance concerns, what concerns/objections/downsides do people see in the approach where we use modules to organize stream configurations?

Let me ask the conditional question: if we were able to allay the performance concerns, what concerns/objections/downsides do people see in the approach where we use modules to organize stream configurations?

None that I know of!

With your stream module RL dependency suggestion, I'm not sure where the module and packageFiles callbacks would live. Also, I'm not sure how it simplifies the mental model of adding a new stream. If each stream config is its own RL module, then adding a new stream would require creating a new module (in some external extension?) plus a main packageFile callback definition, right?

I like the RL dependency idea in general, but I'm not sure how it makes things simpler for users.

mforns added a comment.Mar 4 2020, 7:09 PM

Hey all!
One thought re. code complexity and people having difficulties using the code from the product perspective:

I agree that the code as written now is complex on the RL side (registerEventStreamsForModule method).
I believe this is a result of performance optimization to conform with RL particularities.
Whenever code gets optimized it usually gets more complex, no?

Now, this part of the code will probably be altered very few times, and likely not by instrumentation developers.
The code that hopefully will be used a lot, is the instrumentation interface and I believe it's pretty simple as written now:

  1. Specify the list of stream names that your module is going to use in extension.json.
  2. require('./eventstreams.js');
  3. mw.track('eventstream.button.click', buttonClickEvent);

We could maybe remove step 2, but it's just one simple line. I don't think this approach is complex at all from the point of view of the user.
I can see how an instrumentation developer would wonder how this is working, seems like dark magic from the outside!
But given a nice documentation and a couple examples, I think this would be totally fine, no?

I like the RL dependency idea in general, but I'm not sure how it makes things simpler for users.

I think we go from (currently) using the EventStreamConfig hook in extension.json, which implies to some extent understanding how package files work, etc. to (proposed) just listing a dependency on a module.

If I got that right, it's dramatically simpler.

@mforns I see what you're saying, but I think that's only part of the picture. As a dev, I'd need to have some mental model of how this config works, so I know how/where to change it. I personally can't just follow examples and directions without understanding what's going on.

I think we go from (currently) using the EventStreamConfig hook in extension.json, which implies to some extent understanding how package files work, etc. to (proposed) just listing a dependency on a module.

We do go from listing a dependency on a module, but someone has to define that module. This means that for every stream, a user must actually define a new module with a new packageFiles callback to the hook. They then can depend on that new module.

Restricted Application added a project: Analytics. · View Herald TranscriptMar 5 2020, 10:10 PM
jlinehan added a comment.EditedMar 6 2020, 3:48 PM

@Krinkle and I talked for >2 hours about this and @phuedx also took time to fill me in on his perspective. He's also going to post a comment and I think he has some good points to make.

For now, I want to make some decisions with you all. The things I think we should do are summarized below under tl;dr. Below that, for those who want to read, are the justification and summaries of my conversations. I tried my best to make it easy to read, but I probably failed! Sorry!

tl;dr

  • Stream configuration loading: One centralized module hooks in the entire stream configuration object (or some 'active' subset of it defined with a regex or indirection).
  • Public interface for produce: Produce has the form .produce(streamName, eventData).

Summary

Stream configuration loading

We've explored a spectrum of options.

  1. Each module configures its own client
    • Implementation: To produce to stream foo, a module
      • creates packageFile to hook, via EventStreamConfig, the stream config for foo from the MediaWiki config.
      • calls require() on the packageFile in the module body, bringing the stream config into the client.
  2. Modules require stream modules that configure the shared client (effect: the page (collectively) configures the shared client)
    • Implementation: To produce to stream foo, a module
      • adds a resource loader dependency to xxx.stream.foo
        • xxx.stream.foo has already been defined elsewhere.
        • xxx.stream.foo does exactly what the module itself does in option 1) in order to get the stream config into the client.
  3. Some centralized module configures the shared client
    • Implementation: To produce to stream foo, a module:
      • Ensures that foo is part of an "active group" of stream configurations
      • Adds a resource loader dependency to module centralizedModule
        • centralizedModule uses EventStreamConfig the same as above, except it will hook in all stream configs marked as 'active' in some way.
Option 1: Each module configures its own client (not recommended)

If two modules produce events to the same stream, the properties of that production process should be identical. It should not matter what part of the client the event comes from. When stream configuration is only static information, it's fine to instantiate independent copies because they will be identical. It is less clear how to handle something like:

'bucketExampleStream': {
      'sample_rate': 0.5
}

or

'bucketExampleStream': {
      'bucket': [0.5,0.5]
}

Here, we take some non-deterministic action. Without going into details, we must store the results of the "coin flip" (in the case of sample_rate) or "die roll" (in the case of bucket) in shared state that lives within mw.eventStream, so that the policy resulting from the non-deterministic action (flip or roll) can be applied to all events during a given period (pageview or session).

Because we require shared state, it weakens the argument for both an instantiation model of acquiring a produce() function, and the handling of stream configurations per-module using require(). Having individual modules handle config is redundant, and asking them to explicitly manipulate it as part of the initialization process sends the message that they are receiving an independent instance, which is not true. It also suggests that they have control over the stream configuration they provide, which is also not true. There is an argument that we can make this second thing true by treating the object coming out of require() as opaque, that is, to treat it as merely a "handle" or "token" representing the stream config, rather than as an actual object. Yet this is both hard to enforce in JS and logically equivalent to saying that we should use a string (the 'handle') which is the name of the stream, to represent the string configuration. See below for more about the produce() function.

The each-module-configures-its-own-client approach is interesting from a certain perspective, and it isn't impossible to make it work, but it is far from clear what the point is. It does not revolutionize our performance profile, and shifts most of the burden of knowledge to the end-user. On top of that, it insists on an abstraction that does not match reality. We will always be fighting it, and requiring our end-users to join us in that fight as well, policing a very subtle relationship that will inevitably return to bite to us and them.

Option 2: Modules require stream modules that configure the shared client (not recommended in short-term; interesting in medium-term)

This is my proposal in T238544#5926829 and is a variation of what @Ottomata had originally implemented. It hits the correct abstraction (stream configuration is a property of the page), the user interface is close to ideal, and we only load exactly one copy of the stream configs necessary for a page. But there are two flaws.

  1. Performance concern: Every module registered with RL grows the size of the manifest sent with the base module when a page is first loaded. Per @Krinkle, Growth of this number is almost certainly more harmful than growing the ultimate page size.
  2. Development concern: We don't know the mechanism by which the "stream config loader modules" would be created. Code generation seems possible, but it's not clear exactly how we do it.

Perhaps this is an avenue for future work. In the meantime...

Option 3: Some centralized module configures the shared client (recommended for start)

This is the simple approach. You have a ball of stream configs and you just load them all. If the ball gets too big, some have to go. Right now we understand very little about how this system is going to evolve, perform, or be used. I want to understand those things but the best way to do that is start from a position of simplicity, measure, and then proceed to optimize as needed. We will probably be tuning this throughout the next FY.

Before dismissing this approach, consider:

Pros:

  • @Krinkle says it's fine
  • The hook is isolated from the modules being instrumented, as in Option 2, simplifying things for the end-user.
  • There is one copy of stream configuration and it is loaded all at once in a single location away from the end-user, so .config() does not need to be a public interface.
  • There is only a single new module so the module manifest will not grow.
  • No code outside of mediawiki config needs to be altered by the end-user (they perhaps add a dependency to mw.eventStream).
  • There is a centralized record of exactly what is in production.

Cons:

  • You are loading more stream configs than you strictly need. Discussions with @Krinkle convinced me that the con is less scary than it sounds.
    • Option 1 also loads more stream configs than you strictly need (stream configs used by multiple modules are duplicated).
    • The ultimate page size is less of a concern in terms of performance than the size of the module manifest (which this minimizes)
    • Particularly after compression, it is reasonable to believe that most stream configs will be very small on the wire.
    • We can monitor the budget from a central location (mediawiki config)
Public interface for produce()

Option 3 (or 2) provides no reason that I can see to prefer any specification of produce other than .produce(streamName, eventData). The argument is largely laid out in the discussion of Option 1, above, but I'll synthesize some of what @phuedx and I talked about below.

The model of this system should be that the stream configuration is configuring the MEP client itself. The stream configuration is equipping the client with certain streams and information about those streams. The client's calling environment -- i.e., things the end-user instruments -- may try to produce events to any stream they wish, however only the events for the streams that the MEP client has been configured will be processed and sent. This intentionally and decisively removes the calling environment from any control over the policy of what is collected and how it is processed. All of that policy should be controlled solely from the stream configuration, which is housed in mediawiki config in our environment.

It's tempting to imagine the caller passing both an event's data, and instructions about how to process that data (stream configuration), to produce an event. But the entire benefit of the stream configuration system lies in the fact that the instructions about how to process the data can be decisively isolated from the caller that provides that data. By depriving the caller of the autonomy to specify how an event is processed, we centralize and regularize all of the ad-hoc-ness that exists today in the system, and causes real actual problems constantly for product teams and analysts. This architectural advantage is possibly the single greatest feature of this system, and the key reason why it will be used so successfully. It needs to be protected at the same level as performance concerns.

mforns added a comment.Mar 6 2020, 4:58 PM

Option 3 sounds great to me (given that @Krinkle approves potential performance issues).
I believe it's definitely the simplest and most natural from both our perspective and the instrumentation developer perspective.
I also like the approach for the produce() public interface.
+1

Ok, I had a pass at doing this idea. Here's the basic gist:

  • EventLogging extension.json defines a stream_configs.json packageFile rendered from EventLoggingHooks::loadEventStreamConfigs
  • EventLoggingHooks::loadEventStreamConfigs gets a list of stream names out of $wgEventLoggingStreamNames and uses EventStreamConfig to get a stream -> configs mapping object. This is returned and rendered as the stream_configs.json packageFile.
  • core.js requires './stream_configs.json' and stores it in an immutable streamConfigs object.
  • mw.eventLog.submit( stream, event ) sends the event according to streamConfigs.
  • mw.track( 'eventstream.<stream_name>', event ) calls mw.eventLog.submit

I also added mw.eventLog.getStreamConfig(stream) and mw.eventLog.getStreamConfigs() for debugging purposes, so you can access the streamConfigs without having the ability to modify them.

I was trying to get away without the extra $wgEventLoggingStreamNames variable and just export all stream configs. But, we forgot that e.g. $wgEventStreams is not really a 'stream' -> {config} map, it is a list of stream names or patterns to stream configs. We need the list of streams we want to export. At least now it is centralized! The list is specific to EventLogging. I actually like this, as it lets us be very explicit about what streams EventLogging is able to produce to.

Instrumentation devs now have to do 3 things:

  • mediawiki-config: Have an entry in $wgEventStreams
  • mediawiki-config: List their stream in $wgEventLoggingStreamNames
  • In their instrumentation module: Just call mw.track('eventstream.<stream>', event) or mw.eventLog.submit(stream, event)

@mforns and @jlinehan this is just my pass at it. Please feel free to modify at will to your liking!

@Krinkle does this sound ok to you?

@Ottomata If I understand correctly, this proposal would require that the configuration data:

  • … for all active streams are downloaded and initialised upfront (bandwidth + CPU). E.g. no matter for which user(s) or feature(s), no matter if only within a certain interaction of a rarely used feature. Most features do not load on page views.
  • … is loaded with high-priority on all page views globally beforea any feature code can load. This as it would be a dependency for the mw.eventLog.submit function to exist, which will be dependend on by at least one thing loaded on page views, even if not called directly.

This is risky from a perf perspective as it leaves no room for you to scale the number of active streams, or extend the feature set of EvetnStreams configuration. We'd have to agree on a tight budget and stay within that. It also means that the only lever I would have in the future is an all-or-nothing approach (encouraging teams to turn instrumentations off) – Boiling frog, etc.

Status quo. The older EventLogging stack we have today is similar and that took a lot of effort last year to try and contain (T187207). It is now under control as we took away the ability to have variable size cost per-schema. Every active schema only has a fixed cost in the critical JS path (the schema revision number). Anything else, such as config vars for sampling rate, A-B test buckets, is part of the JS code or packaged config for individual features.

Developer experience. I understand your proposal would look as follows, is that right?

EventLogging/extensions.json
"ext.eventLogging": {
   "packageFiles": [
     "index.js",
     "core.js",
     "stream.js",
     { "name": "activeStreamConfigs.json",
       "callback": "EventLoggingHooks::getActiveStreamConfigs" }
       # ^ calls EventStreamConfigs->get() for each stream in $wgEventLoggingStreamNames
mediawiki / LocalSettings.php
// Configure it
$wgEventStreams['MyFeature*'] = ;



// Try loading it
$wgEventLoggingStreamNames[] = 'MyFeatureSomething';
MyFeature/extensions.json
"ext.myfeature": {
   "packageFiles": [
     "index.js",
     "foo.js",
     "bar.js"
   ],
   "dependencies": [
     "ext.eventLogging"
MyFeature/resources/ext.myfeature/foo.js
var eventLog = require( 'ext.eventLogging' );

function onSomething() {
  
  eventLog.submit( 'MyFeatureSomething',  );

The various other proposals affectively all come down to having the foo.js file explicitly bundling any data it needs. This means no stream configs (nor any meta-data about them) is shipped unless and until the module that needs it is brought onto the page. E.g. if there are 100 features with streams, of which 20 used by the current user, and 3 run on the current page view at load time, only those 3 will get initialised at load time, instead of all 100.

In terms of developer experience, this other proposals are all very similar – essentially swapping one a line of PHP for one line of JSON.

mediawiki / LocalSettings.php
$wgEventStreams['MyFeature*'] = ;
MyFeature/extensions.json
"ext.myfeature": {
   "packageFiles": [
     "index.js",
     "foo.js",
     "bar.js",
     { "name": "stream.js", "callback": "EventLoggingHooks::getStreams": "arg": [ "MyFeatureSomething" ]
   ],
   "dependencies": [
     "ext.eventLogging"
MyFeature/resources/ext.myfeature/foo.js
var eventLog = require( './stream.js' );

function onSomething() {
  
  eventLog.submit( 'MyFeatureSomething',  );

I believe this would offer the same semantics and developers ergonomics in terms of requirements for you, but have the benefit of scaling it to many active streams and feature-rich configs without adding a significant perf cost or having to revisit how it all works.

Questions. Whether this is a risk today or something to worry about later, depends on:

  • What number of active stream configs will we allow?
  • and, How much data would each transfer to the client for its stream configuration? (minimum, typical, and maximum).

For comparison, enwiki currently has 70 active EventLogging schemas, costing 2K (1K compressed) via mw.loader.moduleRegistry['ext.eventLogging'].packageExports['data.json'].

Ottomata added a comment.EditedMar 9 2020, 7:49 PM

@Krinkle, my most recent suggestion was based on @jlinehan's conversation with you and the summary he made of it in his last comment. I implemented his Option 3, which has listed as a Pro

@Krinkle says it's fine

@jlinehan I guess it isn't fine?

If I had to express a preference for any of these solutions I'd choose my previous patch, the one where a developer adds the packageFile as a callback explicitly in their module, and then requires it. But, I really don't have a preference at all! I'm fine with anything as long as we can do it.

Something is getting lost in translation here. I'm scheduling a meeting to reconcile.

In today's meeting, we agreed that as long as the size of the shipped configs is within some limit, shipping them all at once via EventLogging is ok. That means we can proceed with reviewing my pass at Jason's Option 3 now.

What number of active stream configs will we allow?

I estimate in the lower hundreds range, possibly less.

How much data would each transfer to the client for its stream configuration? (minimum, typical, and maximum).

Guesses: minimum: ~20 bytes, typical: ~30 bytes, maximum: ~100-200 bytes.

Initially, I'd thought thought that this was premature optimisation. However, @jlinehan's proposal in T238544#5926829 makes sense to me as a customer and from a performance perspective.

Making Resource Loader the level at which a customer integrates the MEP client with their system makes sense as:

  • It's familiar
  • Resource Loader provides other integrations to make developer's lives easier. Why shouldn't instrumentation be easy?
  • We can leverage already long-established patterns and practices to avoid performance pitfalls outside of the boundary of the MEP client

But the entire benefit of the stream configuration system lies in the fact that the instructions about how to process the data can be decisively isolated from the caller that provides that data. By depriving the caller of the autonomy to specify how an event is processed, we centralize and regularize all of the ad-hoc-ness that exists today in the system, and causes real actual problems constantly for product teams and analysts.

I'd actually go a little further than the original proposal and say that the function that the customer uses to produce an event should only accept the event, i.e.

MyFeature/extension.json
"ext.myfeature": {
   "packageFiles": [
     "index.js",
     "foo.js",
     "bar.js",
     { "name": "stream.js", "callback": "EventLoggingHooks::getStreams": "arg": [ "MyFeatureSomething" ]
   ],
   "dependencies": [
     "ext.eventLogging"
MyFeature/resources/ext.myfeature/foo.js
//
// Produces an event to the MyFeatureSomething stream.
var produce = require( './stream.js' );

function onSomething() {
  /* ... */
  produce( event );
}

produce in foo.js, then, is a highly-specialised, stateful MEP client that will call the more general, stateless MEP client.

This architecture has a number of desirable properties:

  • It scales (discussed above at length)
  • You're in control of the whole system
    • It is fragmented but it isn't ad-hoc: You're still in control of the contents of stream.js
  • The customer doesn't know or have to care about how their events are reaching their destination nor the detail of how they are selected

As your customer, I cannot stress that last point enough.

Please forgive my late response. I'm happy to see that y'all have decided on a course of action!

Change 554893 abandoned by Ottomata:
[WIP] Adds ext.eventLogging.client module

Reason:
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/ /573677 instead

https://gerrit.wikimedia.org/r/554893

Ottomata moved this task from Backlog to In Progress on the Event-Platform board.Mar 20 2020, 2:19 PM

Change 583447 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] Display submitted event in debugMode

https://gerrit.wikimedia.org/r/583447

Change 556221 merged by Ottomata:
[mediawiki/vagrant@master] Use eventgate for EventLogging

https://gerrit.wikimedia.org/r/556221

Jhernandez removed a subscriber: Jhernandez.Apr 2 2020, 6:46 PM

Change 583447 merged by Ottomata:
[mediawiki/extensions/EventLogging@master] Display submitted event in debugMode

https://gerrit.wikimedia.org/r/583447

jlinehan updated the task description. (Show Details)Apr 16 2020, 6:49 PM
Krinkle removed a subscriber: Krinkle.May 28 2020, 5:36 PM
Nuria moved this task from Project Newpyter to Event Platform on the Analytics board.
mpopov claimed this task.Jun 22 2020, 4:24 PM
mpopov updated the task description. (Show Details)

Change 602739 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[mediawiki/extensions/EventLogging@master] Process sampling rules in stream config

https://gerrit.wikimedia.org/r/602739

Change 602739 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Process sampling rules in stream config

https://gerrit.wikimedia.org/r/602739

jlinehan renamed this task from EventLogging MEP Upgrade to MEP Client MediaWiki.Wed, Aug 5, 12:53 PM
jlinehan renamed this task from MEP Client MediaWiki to MEP Client MediaWiki JS.Wed, Aug 5, 12:55 PM
jlinehan updated the task description. (Show Details)
jlinehan renamed this task from MEP Client MediaWiki JS to MEP Client MediaWiki JS (MVP).Wed, Aug 5, 1:10 PM
jlinehan closed this task as Resolved.
jlinehan updated the task description. (Show Details)
jlinehan updated the task description. (Show Details)