
Decommission EventLogging backend components by migrating to MEP
Open, Medium, Public

Description

As discussed in T228175: Event Platform Client Libraries, we believe we can migrate existing streams produced by the EventLogging extension to Modern Event Platform (MEP) components. This will finally allow us to decommission the EventLogging backend pieces.

To support existing EventLogging events in eventgate-analytics, we need the following:

  • meta.wikimedia.org schemas ported to draft 7 JSONSchema in a git schema repo with common schema included via $ref.
  • stream config entry for each (active) EventLogging schema/stream.
  • Schema revision extension attributes changed to use the new semver schema version.
  • EL client side code adapted to produce full event (with capsule fields) and to POST to eventgate.
  • Resolve capsule userAgent type issues (This is a string in JSONSchema, and a struct in Hive)
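
As a rough illustration of the first item, the sketch below ports a metawiki-style schema to a draft 7 layout. It assumes the metawiki schema marks required fields per property (`"required": true`, draft 3 style) and that the legacy fields end up under an `event` property next to the capsule fields. `COMMON_REF` and the function name are hypothetical; the real fragment path in the schema repo is not given in this task.

```python
import copy

# Hypothetical $ref target for the shared capsule/common fields.
# The real path in the schema repo is not stated in this task.
COMMON_REF = "/fragment/common/1.0.0#"


def to_draft7(legacy_schema, title, common_ref=COMMON_REF):
    """Convert a schema with per-property `required: true` flags into a
    draft 7 schema whose `event` property holds the legacy fields."""
    # Work on a copy so the caller's schema is left untouched.
    properties = copy.deepcopy(legacy_schema.get("properties", {}))
    required = []
    for name, spec in properties.items():
        # Draft 7 lists required property names on the parent object,
        # so the boolean flag is removed from each property.
        if spec.pop("required", False) is True:
            required.append(name)

    event = {"type": "object", "properties": properties}
    if required:
        event["required"] = sorted(required)

    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": title,
        "type": "object",
        "allOf": [
            {"$ref": common_ref},
            {"properties": {"event": event}},
        ],
    }
```

Calling `to_draft7(schema, "analytics/legacy/templatewizard")` would return a new dict and leave the input unchanged; nested objects are not converted recursively in this sketch.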

Ideally, EventLogging will produce the full event including EventCapsule fields to eventgate-analytics-external, the same eventgate instance that new style schemas will use. The same Refine job we use for eventgate analytics events should be able to Refine the old EL style events. Not all fields from capsule will be set (e.g. seqId and recvFrom), but we can work with what we have on the client side. The main issue will be resolving the userAgent type discrepancy, as we will parse the user_agent during refinement.

We'll start by migrating a single high volume EventLogging stream to MEP: SearchSatisfaction - T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform

Once T259163: Migrate legacy metawiki schemas to Event Platform is done, we should clean up all schemas on metawiki, either by deleting them or emptying out their content with {}.


Steps

Undeploy varnishkafka-eventlogging

  • Set ensure => 'absent' on varnishkafka::instance and nrpe::monitor_service in profile::cache::kafka::eventlogging.
  • Apply puppet on all varnish cache nodes and ensure varnishkafka-eventlogging is stopped and config files are removed.
  • Remove profile::cache::kafka::eventlogging from operations/puppet.

Undeploy eventlogging-processor

Delete the legacy backend eventlogging Kafka topics in Kafka jumbo-eqiad cluster:

To delete:

eventlogging-client-side
eventlogging-valid-mixed
eventlogging-virtualpageview

DO NOT delete any eventlogging_* topics. These are migrated eventlogging topics.
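
A minimal sketch of the guard described above, in case the deletion is scripted. It only selects topic names; it does not talk to Kafka. The function name and the example topic names in the usage note are illustrative assumptions.

```python
# The three legacy topics named in this task.
LEGACY_TOPICS_TO_DELETE = frozenset({
    "eventlogging-client-side",
    "eventlogging-valid-mixed",
    "eventlogging-virtualpageview",
})


def select_topics_to_delete(existing_topics):
    """Return the sorted subset of existing topics that is safe to delete."""
    selected = []
    for topic in existing_topics:
        if topic.startswith("eventlogging_"):
            # Migrated topics: never delete, even if listed by mistake.
            continue
        if topic in LEGACY_TOPICS_TO_DELETE:
            selected.append(topic)
    return sorted(selected)
```

Given a cluster listing that contains `eventlogging_SearchSatisfaction` and `eventlogging-client-side`, only the latter would be selected.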

Update gobblin job eventlogging_legacy.pull

The eventlogging_legacy Gobblin job is still using wildcard topic names to figure out which Kafka topics to import. Now that all legacy streams are migrated, we should use EventStreamConfig to determine which streams to import.

The eventlogging_legacy_test.pull job (running only in the analytics-test-hadoop cluster) already does this.
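
The difference between the two approaches can be sketched as follows. This is not the Gobblin job configuration. The wildcard pattern, the shape of the stream config dict, and the `topics` and `ingest_job` keys are all assumptions made for illustration.

```python
import fnmatch


def topics_from_wildcard(all_topics, pattern="eventlogging_*"):
    """Wildcard approach: pick every Kafka topic matching a pattern.

    The pattern is an assumption; the task only says "wildcard topic names".
    """
    return sorted(t for t in all_topics if fnmatch.fnmatchcase(t, pattern))


def topics_from_stream_config(stream_config, job_name="eventlogging_legacy"):
    """Stream-config approach: pick topics only from streams declared for the job.

    `stream_config` maps stream name -> settings. The `topics` and
    `ingest_job` keys are hypothetical names used for this sketch.
    """
    topics = set()
    for settings in stream_config.values():
        if settings.get("ingest_job") == job_name:
            topics.update(settings.get("topics", []))
    return sorted(topics)
```

With the config-driven version, a topic that happens to match the wildcard but has no declared stream would no longer be imported.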

Remove refine_eventlogging_analytics job

This is a Refine job dedicated to ingesting not-yet-migrated legacy EventLogging data. This is not to be confused with 'refine_eventlogging_legacy', which is used to ingest migrated legacy EventLogging data.

Now that we have finished the migration, we can delete the refine_eventlogging_analytics job.

Reconfigure refine_eventlogging_legacy job

We can now remove the manually maintained $eventlogging_legacy_table_include_list.

  • Remove puppet code that references $eventlogging_legacy_table_include_list and table_include_regex => $eventlogging_legacy_table_include_regex from the EventLogging Legacy data refine job configuration
  • Apply puppet on an-launcher1002
  • Ensure that refine_eventlogging_legacy job still works.

Decommission eventlog1003

meta.wikimedia.org schemas

There is not a consistent practice for 'deleting' metawiki schemas. Often, the content is just zeroed out by editing the schema page and deleting the content. We should probably do this for ALL metawiki schemas.

Or, we could decide to just leave them as is. The system won't use these anymore, so it doesn't really matter from a technical perspective.

Details

Repo | Branch | Lines +/-
operations/puppet | production | +6 -0
operations/mediawiki-config | master | +0 -5
operations/mediawiki-config | master | +0 -1
mediawiki/extensions/TemplateWizard | master | +1 -1
operations/puppet | production | +107 -47
operations/mediawiki-config | master | +3 -0
schemas/event/secondary | master | +138 -0
operations/mediawiki-config | master | +1 -1
operations/mediawiki-config | master | +8 -2
operations/mediawiki-config | master | +1 -2
operations/puppet | production | +2 -2
analytics/refinery/source | master | +81 -4
analytics/refinery/source | master | +280 -156
analytics/refinery/source | master | +273 -152
operations/mediawiki-config | master | +42 -2
mediawiki/extensions/EventLogging | master | +1 -1
operations/mediawiki-config | master | +2 -2
operations/deployment-charts | master | +9 -3
operations/mediawiki-config | master | +6 -1
operations/mediawiki-config | master | +13 -3
operations/mediawiki-config | master | +22 -0
mediawiki/extensions/WikimediaEvents | master | +2 -1
schemas/event/secondary | master | +327 -1 K
operations/puppet | production | +1 -77
operations/puppet | production | +129 -92
operations/puppet | production | +2 -73
operations/puppet | production | +99 -15
analytics/refinery/source | master | +6 -1
analytics/refinery/source | master | +559 -148
operations/mediawiki-config | master | +4 -1
operations/mediawiki-config | master | +10 -7
operations/deployment-charts | master | +3 -3
schemas/event/secondary | master | +231 -173
eventgate-wikimedia | master | +25 -3
mediawiki/extensions/EventLogging | master | +142 -36
analytics/refinery/source | master | +187 -7

Related Objects

Status | Assigned
Resolved | Ottomata
Open | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Open | Ottomata
Resolved | Gilles
Resolved | mforns
Resolved | ovasileva
Declined | Ottomata
Resolved | Ottomata
Resolved | mforns
Resolved | Mholloway
Resolved | Ottomata
Duplicate | None
Duplicate | None
Duplicate | None
Duplicate | None
Resolved | Mholloway
Duplicate | None
Resolved | Ottomata
Resolved | SBisson
Resolved | SBisson
Resolved | SBisson
Resolved | mforns
Resolved | Ottomata
Resolved | mforns
Resolved | Ottomata
Resolved | Ottomata
Declined | None
Declined | None
Resolved | bmansurov
Resolved | JAllemandou
Resolved | mforns
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Duplicate | None
Resolved | Ottomata
Declined | MNeisler
Resolved | SBisson
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Jdrewniak
Resolved | Ottomata
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Resolved | Pginer-WMF
Resolved | phuedx
Resolved | MMiller_WMF
Resolved | phuedx
Resolved | phuedx
Open | None
Resolved | phuedx
Resolved | phuedx
Open | None
Resolved | Etonkovidova
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Resolved | phuedx
Open | None
Open | None
Resolved | phuedx
Resolved | phuedx
Resolved | matmarex
Declined | None
Resolved | Ottomata
Resolved | mforns
Resolved | Ottomata
Declined | Ottomata
Resolved | Ottomata
Resolved | Sharvaniharan
Resolved | Sharvaniharan
Declined | Ottomata
Resolved | Ottomata
Resolved | ovasileva

Event Timeline


Change 601749 abandoned by Ottomata:
Refactor JsonSchemaLoader into JsonLoader to allow for easy loading of remote JSON blobs

Reason:
in favor of https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/603591

https://gerrit.wikimedia.org/r/601749

Change 603591 merged by Ottomata:
[analytics/refinery/source@master] Refactor JsonSchemaLoader into JsonLoader to allow for easy loading of remote JSON blobs

https://gerrit.wikimedia.org/r/603591

Migration plan:

0. Switch all refine jobs to refinery 0.0.126 and make eventlogging_analytics use event_transforms.

For each EventLogging schema

  1. Create /analytics/legacy/<schema_name> schema
  2. Evolve eventlogging table to use new schema, e.g.
schema_name=searchsatisfaction
table="event.${schema_name}"
schema_uri="/analytics/legacy/${schema_name}/latest"

echo "Evolving $table using schema at $schema_uri"
spark2-submit --conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar:/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar --driver-java-options='-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080' --class org.wikimedia.analytics.refinery.job.refine.tool.EvolveHiveTable /srv/deployment/analytics/refinery/refinery-job.jar --table="${table}" --schema_uri="${schema_uri}"
  3. Rolling deploy mediawiki-config changes (e.g. this one) to make EventLogging produce new schema data via EventGate.
  4. Once schema's data is fully produced through EventGate, use Refine job that uses schema repo instead of meta.wm.org:
    • If first EventLogging table migration, merge patch to make new Refine eventlogging_legacy job and add table to it.
    • else add table to Refine eventlogging_legacy job
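
Since step 2 is repeated for each schema, the table name and schema URI can be derived from the schema name. The helper below is a minimal sketch that only builds the two job arguments shown in the command above. The function name is hypothetical, and lowercasing the schema name is an assumption based on the `searchsatisfaction` example.

```python
def evolve_args(schema_name):
    """Build the --table and --schema_uri arguments for one legacy schema."""
    # Assumption: table and schema paths use the lowercased schema name.
    name = schema_name.lower()
    table = f"event.{name}"
    schema_uri = f"/analytics/legacy/{name}/latest"
    return [f"--table={table}", f"--schema_uri={schema_uri}"]
```

The returned list could be appended to the `spark2-submit` invocation, for example through `subprocess.run`, once the rest of the command line is in place.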

Change 605955 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] refine.pp - bump refinery jar version and make eventlogging_analytics use event_transforms

https://gerrit.wikimedia.org/r/605955

Change 605989 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] event_transforms - Set legacy eventlogging ip field if it exists

https://gerrit.wikimedia.org/r/605989

Change 605989 merged by Ottomata:
[analytics/refinery/source@master] event_transforms - Set legacy eventlogging ip field if it exists

https://gerrit.wikimedia.org/r/605989

Change 605955 merged by Ottomata:
[operations/puppet@production] refine.pp - bump version and make eventlogging_analytics use event_transforms

https://gerrit.wikimedia.org/r/605955

Mentioned in SAL (#wikimedia-analytics) [2020-06-16T19:41:43Z] <ottomata> bumping Refine refinery jar version to 0.0.127 - T238230

Mentioned in SAL (#wikimedia-operations) [2020-06-19T18:10:07Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt - T238230 (duration: 00m 59s)

Change 607017 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri for all wikis

https://gerrit.wikimedia.org/r/607017

Change 607017 merged by Ottomata:
[operations/mediawiki-config@master] Set wgEventLoggingServiceUri for all wikis

https://gerrit.wikimedia.org/r/607017

Mentioned in SAL (#wikimedia-operations) [2020-06-22T13:19:27Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt and set wgEventLoggingServiceUri for all wikis - T238230 (duration: 00m 58s)

@Samwilson @Niharika Hello!

I'm looking for a candidate EventLogging schema stream to migrate to EventGate. The migration should be 100% backwards compatible. I was using SearchSatisfaction as my candidate schema, but on Friday I made a mistake and lost some data while doing the migration. This was user error on my part.

I'd like to try again, but before I do, I'd like to prove that it works for a lower volume data stream. TemplateWizard looks like a good candidate. Would you mind if I used it as a guinea pig? I don't expect any issues (but I didn't last week either). No worries if you do mind, I can keep looking for a different candidate.

Thank you!

I think it'd be fine to use TemplateWizard logging as a guinea pig. I don't think anyone's doing much with the data at the moment.

Change 607333 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate

https://gerrit.wikimedia.org/r/607333

Change 607333 merged by Ottomata:
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate on group0

https://gerrit.wikimedia.org/r/607333

Mentioned in SAL (#wikimedia-operations) [2020-06-23T18:53:36Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on group0 - T238230 (duration: 01m 06s)

Change 607346 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate on all wikis

https://gerrit.wikimedia.org/r/607346

Change 607349 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Add simple script to help converting EventLogging metawiki schemas

https://gerrit.wikimedia.org/r/607349

Change 607346 merged by Ottomata:
[operations/mediawiki-config@master] Migrate TemplateWizard from EventLogging to EventGate on all wikis

https://gerrit.wikimedia.org/r/607346

Mentioned in SAL (#wikimedia-operations) [2020-06-23T19:16:32Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - T238230 (duration: 01m 05s)

Change 607349 merged by Ottomata:
[schemas/event/secondary@master] Add simple script to help converting EventLogging metawiki schemas

https://gerrit.wikimedia.org/r/607349

Mentioned in SAL (#wikimedia-operations) [2020-06-23T20:31:22Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - take 2 - T238230 (duration: 01m 06s)

Change 607520 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Migrate SearchSatisfaction from EventLogging to EventGate on group1

https://gerrit.wikimedia.org/r/607520

Change 607520 merged by Ottomata:
[operations/mediawiki-config@master] Migrate SearchSatisfaction from EventLogging to EventGate on group1

https://gerrit.wikimedia.org/r/607520

Something I've overlooked:

Camus's eventlogging job uses the dt field for hourly partitioning. As we move events to EventGate, dt will now be set by EventLogging client side, which means it will use the browser's time, which is untrustworthy. I don't know what can be done about this during the incremental rollout. E.g. right now SearchSatisfaction -> EventGate is deployed to only group0 wikis, so those have dt set by browsers, whereas all the others have dt set by eventlogging-processor. This could cause weird partitioning errors where data is written to Camus partitions long after (or before) the current time.

As long as the browser dt isn't too far off (within 28 hours should be ok I think), then the data will be noticed by Refine and re-ingested. Once a schema is fully migrated to EventGate, we can configure it to be ingested by a Camus job that uses meta.dt instead of dt.
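
To make the concern concrete, here is a small sketch of hourly partitioning by dt and of the 28 hour window mentioned in the comment above. The partition path layout, the timestamp format and the function names are illustrative assumptions, not the actual Camus or Refine logic.

```python
from datetime import datetime, timedelta, timezone

# Window taken from the comment above ("within 28 hours should be ok I think").
LOOKBACK = timedelta(hours=28)


def parse_dt(value):
    """Parse a UTC timestamp such as 2020-06-23T18:53:36Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def hourly_partition(dt):
    """Hourly partition path for a timestamp (the layout is illustrative)."""
    return dt.strftime("year=%Y/month=%m/day=%d/hour=%H")


def within_lookback(client_dt, received_at, window=LOOKBACK):
    """True if the client-set dt is within the window of the receive time,
    in either direction (browser clocks can be early or late)."""
    return abs(received_at - client_dt) <= window
```

An event whose browser clock is several days off would land in an hourly partition outside the window and, under this assumption, would not be picked up again.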

Ooof, but you can easily have outliers with offline features and buffered events sent in batch. The way Gobblin deals with late arrivals is cool, no?

Ah, for the most part, we won't be using the client's time for partitioning; it's only during this incremental rollout that things are weird.

Change 593610 merged by Ottomata:
[operations/puppet@production] Add eventlogging_legacy Refine job for events migrated to EventGate

https://gerrit.wikimedia.org/r/c/operations/puppet/+/593610

scripts/eventlogging_legacy_schema_convert.js

is this script just used via node on the repo in which we store the schemas?

Change 649594 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/TemplateWizard@master] Switch event to use the new platform

https://gerrit.wikimedia.org/r/649594

Change 650093 had a related patch set uploaded (by Awight; owner: Awight):
[operations/mediawiki-config@master] Migrate TemplateWizard to full "new" events

https://gerrit.wikimedia.org/r/650093

Change 649594 merged by jenkins-bot:
[mediawiki/extensions/TemplateWizard@master] Switch event to explicitly use the new platform

https://gerrit.wikimedia.org/r/649594

Change 650093 abandoned by Awight:
[operations/mediawiki-config@master] Migrate TemplateWizard to full "new" events

Reason:

https://gerrit.wikimedia.org/r/650093

Change 822666 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Remove reference to unreachable eventlogging-procesor service

https://gerrit.wikimedia.org/r/822666

Change 822666 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove reference to unreachable eventlogging-processor service

https://gerrit.wikimedia.org/r/822666

@Ottomata: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

Nope, not yet. T259163: Migrate legacy metawiki schemas to Event Platform and T282131: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned both need to be finished, and all the existing mobile app usages need to be reimplemented (in progress, IIUC), before we can actually decommission.

@Ottomata: Removing task assignee as this open task has been assigned for more than two years. See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

Change 982163 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] varnishkafka::instance - Add ensure param

https://gerrit.wikimedia.org/r/982163

Change 982163 merged by Ottomata:

[operations/puppet@production] varnishkafka::instance - Add ensure param

https://gerrit.wikimedia.org/r/982163