Produce an instrumentation event stream using new EPC and EventGate from client side browsers
Closed, Resolved, Public

Description

Let's find an example instrumentation to produce through the new Event Platform components this quarter. This can be something ported over from EventLogging, or it can be a new instrumentation event stream altogether. Nuria suggested perhaps a new MediaViewer views event?

This should use all of the new analytics-focused Event Platform client-side components:

  • New schema in schemas/event/secondary repository
  • eventgate-analytics-external set up (otto)
  • new eventLogging.client RL module produces event (Jason)
  • Both eventgate-analytics-external and eventLogging.client use new EventStreamConfig extension (otto & Jason)
  • Camus imports events and Refine imports them into Hive event database.

This is a KR for Q3 2019-2020.

Event Timeline

I'm working on T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform and I think that figuring out how we want to namespace new schemas will help inform what we do for legacy ones.

What should we do? Should we have an instrumentation-specific directory? Are all of EventLogging's use cases considered 'instrumentation'? Isn't everything 'instrumentation' anyway? Is this all 'mediawiki' instrumentation? Should we have an 'analytics' prefix?

I'm trying to come up with some examples to bike shed but am only coming up with ones that would fit in a mediawiki namespace, like:

  • mediawiki/session/ping
  • mediawiki/mediaviewer/view
  • mediawiki/mobileweb/upload/interaction

Oh, maybe apps? I dunno. Are those mediawiki too? I think perhaps not.

  • ios-app/feed/interaction
  • android-app/feed/interaction

?

On the other hand, I don't really want to babysit instrumentation developer schema namespacing, so perhaps it would be simpler to have some top level namespace and keep schema hierarchies relatively flat?

  • instrumentation/mediawiki_session_ping
  • instrumentation/ios_app_feed_interaction
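To make the two naming options concrete, here is a minimal sketch of how a hierarchical name would collapse into the flat form. The function name `flatten_schema_name` and the default `instrumentation` top level are only illustrative, not an agreed convention.

```python
def flatten_schema_name(hierarchical: str, top_level: str = "instrumentation") -> str:
    """Collapse a hierarchical schema path into a single flat name.

    Hypothetical helper: path segments are joined with underscores and
    hyphens are normalized to underscores, then the result is placed
    under one top-level namespace.
    """
    parts = [part.replace("-", "_") for part in hierarchical.strip("/").split("/")]
    return f"{top_level}/{'_'.join(parts)}"


print(flatten_schema_name("mediawiki/session/ping"))
# instrumentation/mediawiki_session_ping
print(flatten_schema_name("ios-app/feed/interaction"))
# instrumentation/ios_app_feed_interaction
```

The trade-off is the one described above: the flat form needs no agreement on hierarchy, but it also loses the grouping that a deeper path gives for free.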

@jlinehan @Milimetric @Nuria thoughts?

In meeting today we made a decision. To be clear, here's what the 2 repositories are for, and also how to namespace 'analytics' schemas in the secondary repository.

primary (tier 1)

  • directly user affecting production feature schemas go in primary.

secondary (tier 2)

  • schemas for features that are not directly user-affecting go in secondary (top level hierarchy)
  • schemas for analytics (also tier 2, but grouped together for convenience) only go in analytics/ (instrumentation schemas will be here)
  • legacy eventlogging schemas go in analytics/legacy/
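The placement rules above can be written down as a small lookup. This is only a sketch of the decision as stated in this comment; the `SchemaKind` enum, the `schema_location` function, and the example schema names are hypothetical and do not exist in any repository.

```python
from enum import Enum
from typing import Tuple


class SchemaKind(Enum):
    USER_AFFECTING_FEATURE = "user_affecting_feature"  # tier 1
    OTHER_FEATURE = "other_feature"                    # tier 2, top level
    ANALYTICS = "analytics"                            # tier 2, instrumentation
    LEGACY_EVENTLOGGING = "legacy_eventlogging"        # tier 2, ported schemas


def schema_location(kind: SchemaKind, name: str) -> Tuple[str, str]:
    """Return (repository, path) for a schema, following the rules above."""
    if kind is SchemaKind.USER_AFFECTING_FEATURE:
        return ("primary", name)
    if kind is SchemaKind.OTHER_FEATURE:
        return ("secondary", name)
    if kind is SchemaKind.ANALYTICS:
        return ("secondary", f"analytics/{name}")
    if kind is SchemaKind.LEGACY_EVENTLOGGING:
        return ("secondary", f"analytics/legacy/{name}")
    raise ValueError(f"unknown schema kind: {kind!r}")


# Example names are made up for illustration.
print(schema_location(SchemaKind.ANALYTICS, "session_tick"))
print(schema_location(SchemaKind.LEGACY_EVENTLOGGING, "searchsatisfaction"))
```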

I brain bounced a tricky issue with @mforns yesterday. Camus needs to know which streams to import. We've always done this via messy whitelist or blacklist regexes. For eventlogging, all topics start with eventlogging_ so we import all of those. For Mediawiki EventBus events, we whitelist them all (semi) explicitly.

But we don't have an easy to match pattern like that for MEP events. We don't really have a convention for naming streams.
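A small sketch of why the regex approach stops working. The pattern and the topic list are illustrative assumptions, not the real Camus configuration: the `eventlogging_` prefix and 'eqiad.mediawiki.revision-create' come from this thread, while the other topic names are made up.

```python
import re

# Assumed form of the legacy rule: import every topic starting with "eventlogging_".
EVENTLOGGING_TOPIC_RE = re.compile(r"^eventlogging_.+")

TOPICS = [
    "eventlogging_SearchSatisfaction",   # illustrative legacy EventLogging topic
    "eqiad.mediawiki.revision-create",   # existing EventBus topic, not for this job
    "eqiad.mediawiki.session-ping",      # hypothetical MEP topic we would want
]


def select_topics(topics, pattern):
    """Return the topics whose names match the whitelist pattern."""
    return [topic for topic in topics if pattern.search(topic)]


print(select_topics(TOPICS, EVENTLOGGING_TOPIC_RE))
# ['eventlogging_SearchSatisfaction']
```

The MEP topic and the EventBus topic share the same shape, so no prefix regex can select one without the other.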

We came up with two ideas.

  1. Have eventgate-analytics-external add an extra prefix to the topic names. So stream 'mediawiki.session-ping' would go to a topic like 'eqiad.events.mediawiki.session-ping' (or something like that). Unfortunately we already have plenty of non-eventlogging topics that are not prefixed with 'events' or anything else, e.g. 'eqiad.mediawiki.revision-create', so anything we do here will be inconsistent and still require some special-case topic whitelisting.
  2. Use stream config. We'd augment our camus CLI wrapper to be able to look up active streams from the EventStreamConfig API and use them to match Kafka topics. This is more explicit, but feels a bit brittle. This would only really work for dynamic stream configs that we can get from the EventStreamConfig API, i.e. streams produced via eventgate-analytics-external. Those streams declared statically for production eventgate instances are locked away in helm / service-runner configs. I suppose eventgate could expose an endpoint to return known stream configs, buuut one thing at a time, eh?

AND! We actually don't have an explicit stream -> topics mapping stored anywhere; the topic names in Kafka are built by adding prefixes that are specified in eventgate's config.
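Option 2 could look roughly like the sketch below. It is not the actual refinery code: the function names, the shape of `stream_configs` (a mapping keyed by stream name, as one might get back from the EventStreamConfig API), and the default prefixes are all assumptions. Only the 'eqiad.' prefix appears in this thread. Fetching the configs over HTTP is left out on purpose.

```python
import re
from typing import Any, List, Mapping, Sequence

# Assumed topic prefixes; in practice these would have to match eventgate's config.
DEFAULT_PREFIXES = ("eqiad.", "codfw.")


def topics_for_stream(stream: str, prefixes: Sequence[str] = DEFAULT_PREFIXES) -> List[str]:
    """Derive the Kafka topic names for one stream by prepending each prefix."""
    return [f"{prefix}{stream}" for prefix in prefixes]


def build_topic_whitelist(
    stream_configs: Mapping[str, Mapping[str, Any]],
    prefixes: Sequence[str] = DEFAULT_PREFIXES,
) -> str:
    """Build one anchored regex that matches exactly the derived topics."""
    topics = sorted(
        topic
        for stream in stream_configs
        for topic in topics_for_stream(stream, prefixes)
    )
    # Escape each name so dots and hyphens are matched literally.
    return "^(" + "|".join(re.escape(topic) for topic in topics) + ")$"


# Hypothetical stream config, keyed by stream name.
whitelist = build_topic_whitelist({"mediawiki.session-ping": {}})
print(bool(re.match(whitelist, "eqiad.mediawiki.session-ping")))     # True
print(bool(re.match(whitelist, "eqiad.mediawiki.revision-create")))  # False
```

Because the stream -> topics mapping is recomputed from prefixes rather than read from anywhere, this breaks silently if the prefixes here drift from what eventgate uses, which is the brittleness mentioned above.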

I'm trying to refactor and simplify both the Refine and Camus job configs to make all of this less confusing. I'm inclined to option 2. If T229863: Refactor EventBus mediawiki configuration is done right we might be able to use EventStreamConfig to get the list of production MediaWiki streams too, but I'm not exactly sure how that will work without coupling production eventgate to the MediaWiki API.

@Nuria and @Milimetric for thoughts. :)

Change 593047 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] [WIP] Add python/refinery/eventstreamconfig.py and use in in bin/camus to build dynamic topic whitelist

https://gerrit.wikimedia.org/r/593047

Change 593047 merged by Ottomata:
[analytics/refinery@master] Add python/refinery/eventstreamconfig.py and use in in bin/camus to build dynamic topic whitelist

https://gerrit.wikimedia.org/r/593047

Change 594565 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add camus job event_dynamic_stream_configs

https://gerrit.wikimedia.org/r/594565

Change 594565 merged by Ottomata:
[operations/puppet@production] Add camus job event_dynamic_stream_configs

https://gerrit.wikimedia.org/r/594565

Change 595047 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri in beta and production on group0 wikis

https://gerrit.wikimedia.org/r/595047

Change 595047 merged by Ottomata:
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri in beta and production on group0 wikis

https://gerrit.wikimedia.org/r/595047

I was trying to follow the instructions listed in T248865#6011770 and was confused about secondary vs. primary. I didn't find it in the docs, and the explanation of the two repositories above is really helpful and clear, so I'd add it to the docs when possible! (I am sure that this is all work in progress etc., I am just adding some feedback from a n0000b point of view :)

Closing this, we are producing events like session tick!