Page MenuHomePhabricator

Stream cc map should not be generated on every pageload
Closed, DeclinedPublic

Description

Given an example stream config:

let CONFIG = {
	"edit": {},
	"edit.growth.experiments.help_panel": {},
	"edit.mobile.app": {},
	"edit.mobile.web": {}
};

a map of which streams should be cc'd (when any given stream is logged to) looks like

{
   "edit": [ "edit.growth", "edit.mobile" ],
   "edit.growth": [ "edit.growth.experiments" ],
   "edit.growth.experiments": [ "edit.growth.experiments.help_panel" ],
   "edit.mobile": [ "edit.mobile.app", "edit.mobile.web" ]
}

This map can be generated via the following:

/**
 * Generate a map for stream cc'ing.
 *
 * This enables us to determine which streams can be cc'd when logging an event
 * to a stream (e.g. edit ~> edit.growth).
 *
 * Stream cc-ing should be done only 1 level deep to avoid duplication. For
 * example, 'edit' shouldn't cc 'edits.growth' AND 'edits.growth.test', since
 * 'edits.growth' would, in turn, cc 'edits.growth.test'.
 *
 * Instead of working stream by stream to find children streams for each parent
 * stream, we work backwards by constructing parent streams from the children.
 * This enables children streams such as 'edits.growth.test' to be cc'd when
 * logging events to 'edits' even when parent streams 'edits' and/or
 * 'edits.growth' are not explicitly present in the stream configuration.
 */
mapCopyStreams = function( config ) {
	var cc_map = {}; // map of streams which are to be cc'd
	for (stream in config) {
		let s = stream.split( '.' );
		let n_prefixes = s.length - 1;
		if ( n_prefixes > 1 ) {
			for (i = 1; i <= n_prefixes; i++) {
				var child = s.slice(0, i + 1).join( '.' );
				var parent = s.slice(0, i).join ( '.' );
				if ( !cc_map.hasOwnProperty( parent ) ) {
					Object.defineProperty( cc_map, parent, {
						value: [ child ],
						writable: true,
						enumerable: true
					});
				} else {
					if (!cc_map[parent].includes( child ) ) {
						cc_map[parent].push( child );
					}
				}
			}
		} else if ( n_prefixes === 1) {
			if ( !cc_map.hasOwnProperty( s[0] ) ) {
				Object.defineProperty( cc_map, s[0], {
					value: [ stream ],
					writable: true,
					enumerable: true
				});
			} else {
				if ( !(cc_map[parent].includes( stream )) ) {
					cc_map[s[0]].push( stream );
				}
			}
		} else {
			// no work to be done
		}
	}
	return cc_map;
};

and can be verified by running

let COPIED = mapCopyStreams( CONFIG );
console.log( JSON.stringify(COPIED, null, 2) );
{
  "edit": [
    "edit.growth",
    "edit.mobile"
  ],
  "edit.growth": [
    "edit.growth.experiments"
  ],
  "edit.growth.experiments": [
    "edit.growth.experiments.help_panel"
  ],
  "edit.mobile": [
    "edit.mobile.app",
    "edit.mobile.web"
  ]
}

This is quite an expensive operation to perform at every pageload! It'd be great if, within EventLogging code, that map was simply just available and accessible the same way the stream config is.

Event Timeline

mpopov moved this task from Incoming to Event Platform on the Analytics board.
mpopov renamed this task from EventStreamConfig should generate and provide the stream cc map to Stream cc map should not be generated on every pageload.Jun 24 2020, 12:24 PM

@Ottomata: do you think we can incorporate that algorithm into EventLogging in a way that the cc map isn't generated client-side on every pageload? If I write the PHP version, can we put it in EventLoggingHooks.php similar to loadEventStreamConfigs? It would generate the map from $config->get( 'EventLoggingStreamNames' ) and make it available client-side as config.streamCCMap (or something like that).

I don't fully grasp exactly what you need, but putting this logic in EventLogging ext makes the most sense. When I first saw this ticket titled for EventStreamConfig I wasn't so sure, but in EventLogging that seems fine. EventLogging is the only place that will know what to do with these 'CC' streams anyway :).

Alternatively, could you just make a config setting that indicated the parent stream for a 'CCed' child stream?

	"edit.mobile.app": {"cc_by": "edit.mobile" },

Then in EventLogging JS when you get a submit for "edit.mobile", you could iterate over all stream configs and look for ones with the cc_by (name TBD) setting and also submit to those. Or you could build up the map in the JS from stream config itself. An explicit config like this sounds saner than inferring some hierarchy from some naming convention, which will probably not be held by other non EventLogging streams (e.g. mediawiki.revision-create, mediawiki.page-delete, etc.).

Alternatively, could you just make a config setting that indicated the parent stream for a 'CCed' child stream?

	"edit.mobile.app": {"cc_by": "edit.mobile" },

Then in EventLogging JS when you get a submit for "edit.mobile", you could iterate over all stream configs and look for ones with the cc_by (name TBD) setting and also submit to those.

It'd be faster to lookup an array of stream names in a dictionary by stream name than to loop through the whole thing on every call.

Or you could build up the map in the JS from stream config itself. An explicit config like this sounds saner than inferring some hierarchy from some naming convention

That's what mapCopyStreams in the task description does, and it does it in a way that a parent stream does not have to be configured.

For example, suppose you have streams "edit.growth" and "edit.mobile" in the stream config but not "edit". If you have instrumentation that logs to "edit" stream (which does not exist), we want events to be auto-logged to "edit.growth" and "edit.mobile" streams via the cc feature.

I just don't think it'd be good to build that map client-side on every pageload, especially as the list of streams grows.

For example, suppose you have streams "edit.growth" and "edit.mobile" in the stream config but not "edit".

What happens if an instrumentation produces directly to "edit.growth"? Is that ok? What happens if "edit.growth.experiment" is also in stream config?

Generating whatever this is in the EventLogging extension should be fine. I'd be careful about using some a naming convention to figure out what gets 'CCed' to what though. It'll probably work in most cases, but I'd predict you'll run into some unexpected bugs when people start doing weird things with it.

You could make up another EventLogging config variable for this where you keep an explicit map?

$wgEventLoggingStreamCCMap = [
   "edit" => [ "edit.growth", "edit.mobile" ],
   "edit.growth" => [ "edit.growth.experiments" ],
   "edit.growth.experiments" => [ "edit.growth.experiments.help_panel" ],
   "edit.mobile" => [ "edit.mobile.app", "edit.mobile.web" ]
]

And export this to EventLogging JS via RL hooks like other configs are. Then if a stream is not in stream config, EventLogging could look up the stream name to see if it is in the streamCCMap and submit to each of those streams instead. Or do both?

Change 618306 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[mediawiki/extensions/EventLogging@master] [WIP] Generate Stream CC Map, CC streams in submit()

https://gerrit.wikimedia.org/r/618306

The WIP patch I uploaded is just so I don't lose the work I did on this, especially if we decide to keep the feature and use the implicit mapping then at least the PHP version of the JS algorithm in the description is already done.

Change 618306 abandoned by Bearloga:
[mediawiki/extensions/EventLogging@master] [WIP] Generate Stream CC Map, CC streams in submit()

Reason:
Likely going to use a manual, rather than automatic solution for stream cc-ing feature when the time comes

https://gerrit.wikimedia.org/r/618306

kzimmerman subscribed.

Closed as abandoned