
Decommission EventLogging backend components by migrating to MEP
Open, Medium, Public

Description

As discussed in T228175: Event Platform Client Libraries, we believe we can migrate existing EventLogging extension produced streams to Modern Event Platform (MEP) components. This will finally allow us to decommission the EventLogging backend pieces.

To support existing EventLogging events in eventgate-analytics, we need:

  • meta.wikimedia.org schemas ported to draft 7 JSONSchema in a git schema repo with common schema included via $ref.
  • stream config entry for each (active) EventLogging schema/stream.
  • Schema revision extension attributes changed to use the new semver schema version.
  • EL client side code adapted to produce full event (with capsule fields) and to POST to eventgate.
  • Resolve capsule userAgent type issues (This is a string in JSONSchema, and a struct in Hive)

Ideally, EventLogging will produce the full event including EventCapsule fields to eventgate-analytics-external, the same eventgate instance that new style schemas will use. The same Refine job we use for eventgate analytics events should be able to Refine the old EL style events. Not all fields from capsule will be set (e.g. seqId and recvFrom), but we can work with what we have on the client side. The main issue will be resolving the userAgent type discrepancy, as we will parse the user_agent during refinement.
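
A minimal sketch of what "producing the full event" on the client side could look like. `buildLegacyEvent` and its parameters are hypothetical and not the EventLogging extension's actual API; the capsule field names (`schema`, `wiki`, `webHost`, `event`) are assumptions based on the EventCapsule, and the example values are illustrative only.

```javascript
// Hypothetical helper: wraps event data in EventCapsule-style fields
// so the whole event can be POSTed to eventgate in one piece.
function buildLegacyEvent(schemaName, schemaUri, streamName, eventData, context) {
  return {
    $schema: schemaUri,            // versioned schema URI in the schema repo
    meta: { stream: streamName },  // stream the event belongs to
    schema: schemaName,            // assumed capsule field
    wiki: context.wiki,            // assumed capsule field
    webHost: context.webHost,      // assumed capsule field
    dt: context.now.toISOString(), // set on the client, so it is browser time
    event: eventData
    // seqId and recvFrom are deliberately absent: the client cannot know them.
    // userAgent is omitted here because user_agent is parsed during refinement.
  };
}

// Illustrative usage with made-up values.
const exampleEvent = buildLegacyEvent(
  'Test',
  '/analytics/legacy/test/1.0.0',
  'eventlogging_Test',
  { OtherMessage: 'test' },
  { wiki: 'testwiki', webHost: 'test.wikipedia.org', now: new Date() }
);
console.log(JSON.stringify(exampleEvent, null, 2));
```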

We'll start by migrating a single high volume EventLogging stream to MEP: SearchSatisfaction - T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform

Details

Project                              Branch      Lines +/-
operations/puppet                    production  +107 -47
operations/mediawiki-config          master      +3 -0
schemas/event/secondary              master      +138 -0
operations/mediawiki-config          master      +1 -1
operations/mediawiki-config          master      +8 -2
operations/mediawiki-config          master      +1 -2
operations/puppet                    production  +2 -2
analytics/refinery/source            master      +81 -4
analytics/refinery/source            master      +280 -156
analytics/refinery/source            master      +273 -152
operations/mediawiki-config          master      +42 -2
mediawiki/extensions/EventLogging    master      +1 -1
operations/mediawiki-config          master      +2 -2
operations/deployment-charts         master      +9 -3
operations/mediawiki-config          master      +6 -1
operations/mediawiki-config          master      +13 -3
operations/mediawiki-config          master      +22 -0
mediawiki/extensions/WikimediaEvents master      +2 -1
schemas/event/secondary              master      +327 -1 K
operations/puppet                    production  +1 -77
operations/puppet                    production  +129 -92
operations/puppet                    production  +2 -73
operations/puppet                    production  +99 -15
analytics/refinery/source            master      +6 -1
analytics/refinery/source            master      +559 -148
operations/mediawiki-config          master      +4 -1
operations/mediawiki-config          master      +10 -7
operations/deployment-charts         master      +3 -3
schemas/event/secondary              master      +231 -173
eventgate-wikimedia                  master      +25 -3
mediawiki/extensions/EventLogging    master      +142 -36
analytics/refinery/source            master      +187 -7

Related Objects

Status     Assigned
Open       Ottomata
Open       Ottomata
Open       jlinehan
Resolved   Ottomata
Resolved   Ottomata
Open       Ottomata
Open       Ottomata
Resolved   Ottomata
Duplicate  Ottomata
Resolved   Ottomata
Open       None
Resolved   Ottomata
Open       Ottomata
Open       None
Open       None
Open       None
Open       mpopov
Open       mpopov
Open       None
Open       None
Open       Ottomata

Event Timeline


After bouncing ideas around with Marcel today, we found a way to exclude analytics/legacy schemas from the robustness CI test that checks for snake_case. https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/589074/1/test/jsonschema/repository.test.js

I'll be modifying the EL client side patch to not lower case field names after all.
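
As a rough illustration of that kind of exclusion (not the actual code in repository.test.js): a snake_case check that skips anything under analytics/legacy. The function name, the regular expression, and the non-legacy schema path below are all assumptions made for the sketch.

```javascript
// Schemas under this path keep their original camelCase field names.
const LEGACY_PREFIX = 'analytics/legacy/';
// Assumed definition of snake_case; a leading `$` is allowed for `$schema`.
const SNAKE_CASE = /^\$?[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Returns dotted paths of property names that are not snake_case.
// Legacy schemas are exempt and always yield an empty list.
function snakeCaseViolations(schemaPath, schema) {
  if (schemaPath.startsWith(LEGACY_PREFIX)) {
    return [];
  }
  const violations = [];
  const visit = (node, prefix) => {
    for (const [name, child] of Object.entries(node.properties || {})) {
      const path = prefix ? `${prefix}.${name}` : name;
      if (!SNAKE_CASE.test(name)) {
        violations.push(path);
      }
      visit(child, path); // recurse into nested object properties
    }
  };
  visit(schema, '');
  return violations;
}
```

With this shape, a legacy schema that contains a field such as `userAgent` passes untouched, while the same field in a new-style schema would be reported.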

B. Each EL Schema will have a single corresponding stream, and each stream will be made up of (currently) 2 topics: one for each main DC. E.g. eqiad.eventlogging_NavigationTiming. The current topic is just eventlogging_NavigationTiming. Once we make this switch, no data will go to eventlogging_NavigationTiming topic anymore, data will only go to the DC prefixed ones.

I'm also reconsidering this. Perhaps keeping this migration as simple as possible is really the best thing to do. I can avoid prefixing by hardcoding an exception for streams that start with 'eventlogging_' in the eventgate-wikimedia code. We already have quite a few bits of wikimedia specific logic in there, so we might as well add some more! We can always revisit this decision after the migration and remove this logic to start producing to datacenter prefixed topics.
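
A sketch of the exception being described, assuming topics are otherwise named `<datacenter>.<stream>`. `topicForStream` is a hypothetical name, not the function in eventgate-wikimedia.

```javascript
const LEGACY_STREAM_PREFIX = 'eventlogging_';

// Maps a stream name to the Kafka topic it should be produced to.
// Legacy eventlogging_ streams keep their unprefixed topic; all other
// streams get the datacenter prefix.
function topicForStream(streamName, datacenter) {
  if (streamName.startsWith(LEGACY_STREAM_PREFIX)) {
    return streamName;
  }
  return `${datacenter}.${streamName}`;
}

console.log(topicForStream('eventlogging_NavigationTiming', 'eqiad'));
// → eventlogging_NavigationTiming
```

Removing the `if` branch later would switch legacy streams over to datacenter prefixed topics in one place.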

Change 589093 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Never topic prefix legacy eventlogging_.* streams

https://gerrit.wikimedia.org/r/589093

Alright, the changes I'm making today make the migration much simpler. By keeping the Kafka topics the same, and the schemas exactly the same except for adding some new required Event Platform schema fields ($schema, meta.stream, meta.dt, etc.), downstream consumers should be able to continue using the same legacy events in the existing eventlogging_<SchemaName> topics, and Refine will be able to work the same for all event data, legacy or not.

Once the schema, refinery, and eventgate-wikimedia changes are in place, all we have to do for SearchSatisfaction is add a $wgEventStreams config entry, and change its $wgEventLoggingSchemas entry to point at /analytics/legacy/searchsatisfaction/1.0.0. Everything after that should work transparently.
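
To make the two config changes concrete, here is a sketch that derives both values from a schema name. The real settings live in PHP in mediawiki-config; the object shapes and the `schema_title` key are assumptions, and `legacySettings` is a hypothetical helper.

```javascript
// Derives the per-schema settings for a legacy EventLogging schema.
function legacySettings(schemaName, version) {
  const schemaTitle = `analytics/legacy/${schemaName.toLowerCase()}`;
  return {
    // Assumed shape of one $wgEventStreams entry.
    eventStream: {
      stream: `eventlogging_${schemaName}`,
      schema_title: schemaTitle
    },
    // Assumed shape of the $wgEventLoggingSchemas override:
    // schema name mapped to a versioned schema URI instead of a revision id.
    eventLoggingSchemas: {
      [schemaName]: `/${schemaTitle}/${version}`
    }
  };
}

console.log(legacySettings('SearchSatisfaction', '1.0.0'));
```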

Nuria added a comment. Apr 16 2020, 2:44 PM

Once the schema, refinery, and eventgate-wikimedia changes are in place, all we have to do for SearchSatisfaction is add a $wgEventStreams config entry, and change its $wgEventLoggingSchemas entry to point at /analytics/legacy/searchsatisfaction/1.0.0. Everything after that should work transparently.

Nice

Change 585587 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Support POSTing legacy EventCapsule style events to EventGate

https://gerrit.wikimedia.org/r/585587

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
Ottomata added a subscriber: Analytics-Kanban.
Ottomata removed a subscriber: Analytics-Kanban.

Change 589093 merged by Ottomata:
[eventgate-wikimedia@master] Never topic prefix legacy eventlogging_.* streams

https://gerrit.wikimedia.org/r/589093

Krinkle removed a subscriber: Krinkle. Apr 20 2020, 6:54 PM

Change 589074 merged by Ottomata:
[schemas/event/secondary@master] Preserve camelCase capitalization in analytics/legacy schemas

https://gerrit.wikimedia.org/r/589074

Change 592664 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - Support backwards compatible eventlogging_ topic prefixing

https://gerrit.wikimedia.org/r/592664

Change 592664 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics-external - Support backwards compatible eventlogging_ topic prefixing

https://gerrit.wikimedia.org/r/592664

Change 586447 merged by jenkins-bot:
[analytics/refinery/source@master] Unify Refine transform functions and add user agent parser transform

https://gerrit.wikimedia.org/r/586447

Change 592726 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventStreams - Add SearchSatisfaction stream config and remove beta specific overrides

https://gerrit.wikimedia.org/r/592726

Change 592726 merged by Ottomata:
[operations/mediawiki-config@master] wgEventStreams - Add SearchSatisfaction stream config and remove beta specific overrides

https://gerrit.wikimedia.org/r/592726

Change 592735 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_

https://gerrit.wikimedia.org/r/592735

Change 592739 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] RefineTarget shouldRefine should consider both table whitelist and blacklist

https://gerrit.wikimedia.org/r/592739

Change 592735 merged by Ottomata:
[operations/mediawiki-config@master] wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_

https://gerrit.wikimedia.org/r/592735

Change 592756 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] refine.pp - Slight refactor to use new unified refine tranform functions

https://gerrit.wikimedia.org/r/592756

Change 592739 merged by jenkins-bot:
[analytics/refinery/source@master] RefineTarget shouldRefine should consider both table whitelist and blacklist

https://gerrit.wikimedia.org/r/592739

Change 592756 merged by Ottomata:
[operations/puppet@production] refine.pp - Slight refactor to use new unified refine tranform functions

https://gerrit.wikimedia.org/r/592756

Change 593573 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix refine_event table_blacklist_regex and remove absented mediawiki_events refine job

https://gerrit.wikimedia.org/r/593573

Change 593573 merged by Ottomata:
[operations/puppet@production] Refine - fix table_blacklist_regex and remove mediawiki_events refine job

https://gerrit.wikimedia.org/r/593573

Change 593594 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Factor out RefineFailuresChecker into the refine_job define

https://gerrit.wikimedia.org/r/593594

Change 593594 merged by Ottomata:
[operations/puppet@production] Factor out RefineFailuresChecker into the refine_job define

https://gerrit.wikimedia.org/r/593594

Change 593605 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove absented failed_flags_ refine::jobs

https://gerrit.wikimedia.org/r/593605

Change 593605 merged by Ottomata:
[operations/puppet@production] Remove absented failed_flags_ refine::jobs

https://gerrit.wikimedia.org/r/593605

Change 593610 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] [WIP] Add eventlogging_legacy job to refine EventLogging events from EventGate

https://gerrit.wikimedia.org/r/593610

Change 594981 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Add EventLogging legacy Test schema

https://gerrit.wikimedia.org/r/594981

Change 594981 merged by Ottomata:
[schemas/event/secondary@master] Add EventLogging legacy Test schema

https://gerrit.wikimedia.org/r/594981

Change 595025 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Add eventlogging_Test to wgEventStreams config

https://gerrit.wikimedia.org/r/595025

Change 595027 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/WikimediaEvents@master] Configure Test event stream to be sent via EventGate

https://gerrit.wikimedia.org/r/595027

Change 595025 merged by Ottomata:
[operations/mediawiki-config@master] Add eventlogging_Test to wgEventStreams config

https://gerrit.wikimedia.org/r/595025

Change 595027 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Configure Test event stream to be sent via EventGate

https://gerrit.wikimedia.org/r/595027

Change 595032 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce

https://gerrit.wikimedia.org/r/595032

Change 595032 merged by jenkins-bot:
[operations/mediawiki-config@master] Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce

https://gerrit.wikimedia.org/r/595032

Mentioned in SAL (#wikimedia-operations) [2020-05-07T20:10:07Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingStreamNames: set initial stream names, as yet unused - T238230 (duration: 01m 07s)

Change 595047 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri in beta and production on group0 wikis

https://gerrit.wikimedia.org/r/595047

Change 595047 merged by Ottomata:
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri in beta and production on group0 wikis

https://gerrit.wikimedia.org/r/595047

Woohoo! I just logged a Test event to eventgate in beta via mw.eventLog.logEvent("Test", {"OtherMessage": "test"}) and mw.track("event.Test", {"OtherMessage": "test"})! Both work just right!

Change 595634 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Configure wgEventLoggingSchemas overrides in beta and testwiki

https://gerrit.wikimedia.org/r/595634

Change 595634 merged by Ottomata:
[operations/mediawiki-config@master] Configure wgEventLoggingSchemas overrides in beta and testwiki

https://gerrit.wikimedia.org/r/595634

Change 595969 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate - Set NODE_EXTRA_CA_CERTS

https://gerrit.wikimedia.org/r/595969

Change 595969 merged by Ottomata:
[operations/deployment-charts@master] eventgate - Set NODE_EXTRA_CA_CERTS

https://gerrit.wikimedia.org/r/595969

Change 596034 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventStreams and wgEventLoggingStreamNames Use +deploymentwiki for beta

https://gerrit.wikimedia.org/r/596034

Change 596034 merged by jenkins-bot:
[operations/mediawiki-config@master] wgEventStreams and wgEventLoggingStreamNames Use +deploymentwiki for beta

https://gerrit.wikimedia.org/r/596034

Change 596049 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] Use array_merge instead of array + when merging wgEventLoggingSchemas

https://gerrit.wikimedia.org/r/596049

Change 596049 merged by Ottomata:
[mediawiki/extensions/EventLogging@master] wgEventLoggingSchemas should override extension attributes

https://gerrit.wikimedia.org/r/596049
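
For context on why this matters: in PHP, `$a + $b` keeps the left-hand value when a string key exists in both arrays, whereas `array_merge($a, $b)` lets the right-hand value win. The sketch below mimics both behaviours with JavaScript objects. It assumes the extension attributes are the left operand, and the revision id and schema URI values are made up for illustration.

```javascript
// Mimics PHP `$left + $right` for string keys: left-hand values win.
function phpArrayUnion(left, right) {
  return { ...right, ...left };
}

// Mimics PHP array_merge($left, $right) for string keys: right-hand values win.
function phpArrayMerge(left, right) {
  return { ...left, ...right };
}

// Illustrative values only.
const extensionAttributes = { Test: 12345 };
const wgEventLoggingSchemas = { Test: '/analytics/legacy/test/1.1.0' };

console.log(phpArrayUnion(extensionAttributes, wgEventLoggingSchemas).Test);
// → 12345 (the config override is ignored)
console.log(phpArrayMerge(extensionAttributes, wgEventLoggingSchemas).Test);
// → /analytics/legacy/test/1.1.0 (the config override wins)
```

Key order in the resulting objects differs from PHP's; only the "which value wins" behaviour is modelled here.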

And mw.eventLog.logEvent("Test", {"OtherMessage": "test"}) works from test.wikipedia.org too! It will also work from en.wikipedia.org after this week's MW train or after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/596049 is deployed, whichever comes first :)

SearchSatisfaction has been migrated to EventGate on deployment-prep beta wiki. :)

Change 601749 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Refactor JsonSchemaLoader into JsonLoader to allow for easy loading of remote JSON blobs

https://gerrit.wikimedia.org/r/601749

Change 603591 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Refactor JsonSchemaLoader into JsonLoader to allow for easy loading of remote JSON blobs

https://gerrit.wikimedia.org/r/603591

Change 601749 abandoned by Ottomata:
Refactor JsonSchemaLoader into JsonLoader to allow for easy loading of remote JSON blobs

Reason:
in favor of https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/603591

https://gerrit.wikimedia.org/r/601749

Change 603591 merged by Ottomata:
[analytics/refinery/source@master] Refactor JsonSchemaLoader into JsonLoader to allow for easy loading of remote JSON blobs

https://gerrit.wikimedia.org/r/603591

Migration plan:

0. Switch all refine jobs to refinery 0.0.126 and make eventlogging_analytics use event_transforms.

For each EventLogging schema:

  1. Create /analytics/legacy/<schema_name> schema.
  2. Evolve the eventlogging table to use the new schema, e.g.

    schema_name=searchsatisfaction
    table="event.${schema_name}"
    schema_uri="/analytics/legacy/${schema_name}/latest"

    echo "Evolving $table using schema at $schema_uri"
    spark2-submit \
      --conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar:/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar \
      --driver-java-options='-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080' \
      --class org.wikimedia.analytics.refinery.job.refine.tool.EvolveHiveTable \
      /srv/deployment/analytics/refinery/refinery-job.jar \
      --table="${table}" --schema_uri="${schema_uri}"

  3. Rolling deploy mediawiki-config changes (e.g. this one) to make EventLogging produce new schema data via EventGate.
  4. Once the schema's data is fully produced through EventGate, use a Refine job that uses the schema repo instead of meta.wm.org:
    • If this is the first EventLogging table migration, merge the patch to make the new Refine eventlogging_legacy job and add the table to it.
    • Otherwise, add the table to the Refine eventlogging_legacy job.

Change 605955 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] refine.pp - bump refinery jar version and make eventlogging_analytics use event_transforms

https://gerrit.wikimedia.org/r/605955

Change 605989 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] event_transforms - Set legacy eventlogging ip field if it exists

https://gerrit.wikimedia.org/r/605989

Change 605989 merged by Ottomata:
[analytics/refinery/source@master] event_transforms - Set legacy eventlogging ip field if it exists

https://gerrit.wikimedia.org/r/605989

Change 605955 merged by Ottomata:
[operations/puppet@production] refine.pp - bump version and make eventlogging_analytics use event_transforms

https://gerrit.wikimedia.org/r/605955

Mentioned in SAL (#wikimedia-analytics) [2020-06-16T19:41:43Z] <ottomata> bumping Refine refinery jar version to 0.0.127 - T238230

Mentioned in SAL (#wikimedia-operations) [2020-06-19T18:10:07Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt - T238230 (duration: 00m 59s)

Change 607017 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri for all wikis

https://gerrit.wikimedia.org/r/607017

Change 607017 merged by Ottomata:
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri for all wikis

https://gerrit.wikimedia.org/r/607017

Mentioned in SAL (#wikimedia-operations) [2020-06-22T13:19:27Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt and set wgEventLoggingServiceUri for all wikis - T238230 (duration: 00m 58s)

@Samwilson @Niharika Hello!

I'm looking for a candidate EventLogging schema stream to migrate to EventGate. The migration should be 100% backwards compatible. I was using SearchSatisfaction as my candidate schema, but on Friday I made a mistake and lost some data while doing the migration. This was user error on my part.

I'd like to try again, but before I do, I would like to prove that it works for a lower volume data stream. TemplateWizard looks like a good candidate. Would you mind if I used it as a guinea pig? I don't expect any issues (but I didn't last week either). No worries if you do mind, I can keep looking for a different candidate.

Thank you!

I think it'd be fine to use TemplateWizard logging as a guinea pig. I don't think anyone's doing much with the data at the moment.

Change 607333 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate

https://gerrit.wikimedia.org/r/607333

Change 607333 merged by Ottomata:
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate on group0

https://gerrit.wikimedia.org/r/607333

Mentioned in SAL (#wikimedia-operations) [2020-06-23T18:53:36Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on group0 - T238230 (duration: 01m 06s)

Change 607346 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate on all wikis

https://gerrit.wikimedia.org/r/607346

Change 607349 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Add simple script to help converting EventLogging metawiki schemas

https://gerrit.wikimedia.org/r/607349

Change 607346 merged by Ottomata:
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate on all wikis

https://gerrit.wikimedia.org/r/607346

Mentioned in SAL (#wikimedia-operations) [2020-06-23T19:16:32Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - T238230 (duration: 01m 05s)

Change 607349 merged by Ottomata:
[schemas/event/secondary@master] Add simple script to help converting EventLogging metawiki schemas

https://gerrit.wikimedia.org/r/607349

Mentioned in SAL (#wikimedia-operations) [2020-06-23T20:31:22Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - take 2 - T238230 (duration: 01m 06s)

Change 607520 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate SearchSatisfaction from EventLogging to EventGate on group1

https://gerrit.wikimedia.org/r/607520

Change 607520 merged by Ottomata:
[operations/mediawiki-config@master] Migrate SearchSatisfaction from EventLogging to EventGate on group1

https://gerrit.wikimedia.org/r/607520

Something I've overlooked:

Camus's eventlogging job uses the dt field for hourly partitioning. As we move events to EventGate, dt will now be set by EventLogging client side, which means it will be using the browser's time, which is untrustworthy. I don't know what can be done about this during the incremental roll out. E.g. right now SearchSatisfaction -> EventGate is deployed to only group0 wikis, so those ones have dt set by browsers, whereas all the others have dt set by eventlogging-processor. This could cause weird partitioning errors where data is written to camus partitions much after (or before) the current time.

As long as the browser dt isn't too far off (within 28 hours should be ok I think), then the data will be noticed by Refine and re-ingested. Once a schema is fully migrated to EventGate, we can configure it to be ingested by a Camus job that uses meta.dt instead of dt.
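
A small sketch of the concern: events are bucketed by the hour of `dt`, so a skewed browser clock lands an event in the wrong hourly partition, and only events within some window of the real time get picked up again. The partition path format, the function names, and treating the 28 hour figure as a hard cutoff are all assumptions for illustration; this is not how Camus or Refine are actually implemented.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical hourly partition key (UTC) derived from an event's dt.
function hourlyPartition(dt) {
  const d = new Date(dt);
  const pad = (n) => String(n).padStart(2, '0');
  return [
    d.getUTCFullYear(),
    pad(d.getUTCMonth() + 1),
    pad(d.getUTCDate()),
    pad(d.getUTCHours())
  ].join('/');
}

// True if the client-supplied dt is close enough to the receive time
// to still be noticed; 28 hours is the rough figure from the comment above.
function isWithinWindow(dt, receivedAt, windowHours = 28) {
  const skewMs = Math.abs(new Date(receivedAt).getTime() - new Date(dt).getTime());
  return skewMs <= windowHours * HOUR_MS;
}

// A browser clock that is one day behind puts the event in yesterday's partition.
console.log(hourlyPartition('2020-06-22T18:53:36Z')); // → 2020/06/22/18
console.log(isWithinWindow('2020-06-22T18:53:36Z', '2020-06-23T18:53:36Z')); // → true
```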

Ooof, but you can easily have outliers with offline features and buffered events sent in batch. The way goblin deals with late arrivals is cool, no?

Ah, for the most part, we won't be using the client's time for partitioning; it's only during this incremental rollout that things are weird.

Change 593610 merged by Ottomata:
[operations/puppet@production] Add eventlogging_legacy Refine job for events migrated to EventGate

https://gerrit.wikimedia.org/r/c/operations/puppet/+/593610

Nuria added a comment. Mon, Jul 6, 11:07 PM

scripts/eventlogging_legacy_schema_convert.js

Is this script just run via node in the repo where we store the schemas?