WikipediaPortal Event Platform Migration
Closed, Resolved · Public · 5 Estimated Story Points

Description

This task is about making the WikipediaPortal code itself work with Event Platform. Right now, it is a direct copy/paste of old code in the EventLogging extension. Because WikipediaPortal does not use the EventLogging extension itself, it is a completely separate codebase and client, and does not (yet) support the backwards compatible migration we are doing for all other schemas in T259163.

We need some team to own the WikipediaPortal codebase and event client, and make it POST events to EventGate.

Any of the tasks listed under 'Schemas produced by other software' in T259163: Migrate legacy metawiki schemas to Event Platform are instrumentations that needed to be manually rewritten, just like WikipediaPortal. You can click through those to find commits to do that. T271163: TranslationRecommendation* Schemas Event Platform Migration is probably the most like WikipediaPortal of them. In this change, the instrumentation code is very simple (it just hardcodes POSTing an event, so there's no real library/client code), but you can see how the code had to change to work with Event Platform.

See also: https://wikitech.wikimedia.org/wiki/Event_Platform/EventLogging_legacy.
Unless otherwise notified, client IP and consequently geocoded data will no longer be collected for this event data after this migration. Please let us know if this should continue to be captured. See also T262626.

Acceptance criteria
Analytics events are being sent from www.wikipedia.org to the Event Platform infrastructure, at a similar rate as before, and without any changes to the data being collected.

Migration Checklist

  • 1. Pick a schema to migrate
  • 2. Create a new task to track this schema's migration
  • 3. Create /analytics/legacy/ schema
  • 4. Edit-protect the metawiki Schema page at https://meta.wikimedia.org/wiki/Schema:WikipediaPortal
  • 5. Manually evolve the Hive table to use new schema
  • 6. Add entry to wgEventStreams, in operations/mediawiki-config
  • 7. Once the legacy stream's data is fully produced through EventGate, switch to using Refine job that uses schema repo instead of meta.wm.org
  • 8. Mark the schema as migrated in the EventLogging Schema Migration Audit spreadsheet

Event Timeline

Fair enough, @Ottomata! Is Q1 too late for this self-imposed deadline?

That'd be great, we can work with that! Thank you.

Hi @EYener just verifying: Can we expect this code to work with EventGate as EventLogging does now by the end of Q1 and schedule the migration for just after that happens?

Ottomata triaged this task as Medium priority.Jun 11 2021, 6:18 PM

Thank you! Yes we can get this done in Q1.

Great thanks, we'll aim to migrate this schema early in Q2 then.

Hi @EYener! Checking in, how's the Q1 migration of this instrumentation going?

Hi @Ottomata thanks for the ping - this is on my list of quarterly projects, and I've scheduled time out of this week to focus on it in earnest. I've read through the docs and I'll start on step 3 as outlined here this week. I might be pinging back here or on IRC if I run into any questions!

@EYener ah so, the ask is different than the steps outlined in T259163. This task is about making the WikipediaPortal code itself work with Event Platform. Right now, it is a direct copy/paste of old code in the EventLogging extension. Because WikipediaPortal does not use the EventLogging extension itself, it is a completely separate codebase and client, and does not (yet) support the backwards compatible migration we are doing for all other schemas in T259163.

We need some team to own the WikipediaPortal codebase and event client, and make it POST events to EventGate.

Am happy to jump in a meeting sometime to discuss more if that would help. :)

Hi @Ottomata! Actually that would be super helpful. Would you mind picking anything on my calendar that is open and works for you? I'll remove unnecessary events, and I have my working hours blocked off.

@EYener and I met today and we are going to have to sync up with some FRtech team members about this.

@EYener, so any of the tasks listed under 'Schemas produced by other software' in T259163: Migrate legacy metawiki schemas to Event Platform are instrumentations that needed to be manually rewritten, just like WikipediaPortal. You can click through those to find commits to do that. T271163: TranslationRecommendation* Schemas Event Platform Migration is probably the most like WikipediaPortal of them. In this change, the instrumentation code is very simple (it just hardcodes POSTing an event, so there's no real library/client code), but you can see how the code had to change to work with Event Platform.

Hey @Jdlrobson , after discussions with the data-engineering team, I agreed to implement this migration (since I originally wrote this code) and @ovasileva agreed to bring it through the Readers Web process for visibility. This work is planned for Q3 2022.

Jan can you scope the work involved in this ticket, so we can plan accordingly?

Checking in, @Jdrewniak how's this going? Can I help in any way?

@ovasileva curious for an update, how's this going?

hi @Ottomata, I'm just in the process of scoping out this work this week (I anticipate we'll get started on it shortly after that) and I'll reach out if I have any questions :)

Ok great! FYI i'm out for the next 1.5 weeks. Probably @mforns can help you if you need anything before then.

So I've looked into the event logging migration. The documentation is very good! I think I have everything I need to start work on this task.

From what I can tell this'll be most similar to the Recommendation API migration:
https://gerrit.wikimedia.org/r/c/research/recommendation-api/+/660658/7/recommendation/web/static/event_logger.js#b87

Most of the Portals custom event-logging implementation will stay the same. The only parts that have to change for this migration are the way the events are sent and some slight modifications to the data structure. The event endpoint should change from https://www.wikipedia.org/beacon/event to https://intake-analytics.wikimedia.org/v1/events and we'll have to modify the event metadata a bit. That'll mostly consist of adding this metadata to the event:

{
    "client_dt": "...",
    "$schema": "/analytics/legacy/WikipediaPortal",
    "meta": {
        "streamName": "eventlogging_WikipediaPortal",
        "domain": "www.wikipedia.org"
    }
}

These changes should mostly happen in the prepare() function here:
https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/portals/+/refs/heads/master/src/wikipedia.org/assets/js/event-logging-lite.js#184
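
As a rough illustration of the change described above, the following sketch wraps the existing event fields in Event Platform metadata. It is not the code from event-logging-lite.js: the names `prepareEvent`, `SCHEMA_URI` and `STREAM_NAME` are hypothetical, and the `$schema` value and the `meta.stream` key follow the deployed payload shown later in this thread rather than the draft above.

```javascript
// Sketch only: names are hypothetical, not taken from event-logging-lite.js.
const SCHEMA_URI = '/analytics/legacy/wikipediaportal/1.0.0';
const STREAM_NAME = 'eventlogging_WikipediaPortal';

/**
 * Wrap legacy event fields in the metadata Event Platform expects.
 * @param {Object} eventFields schema-specific fields, e.g. { event_type: 'landing' }
 * @param {string} hostname    host the page is served from
 * @param {Date}   now         client timestamp (injectable for testing)
 */
function prepareEvent(eventFields, hostname, now = new Date()) {
  return {
    event: eventFields,
    schema: 'WikipediaPortal',
    $schema: SCHEMA_URI,
    client_dt: now.toISOString(),
    meta: {
      stream: STREAM_NAME,
      domain: hostname
    }
  };
}

const example = prepareEvent(
  { event_type: 'landing' },
  'www.wikipedia.org',
  new Date('2022-03-21T21:46:08.308Z')
);
console.log(JSON.stringify(example, null, 2));
```

Passing the timestamp in as a parameter keeps the function deterministic, which makes the resulting structure easy to compare against the schema.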

I think testing the data structure can probably be done via curl (like T271163#6917256) or by verifying the events in the Data Lake via Hive or other means.
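
For anyone who prefers a script over curl, a test request might be put together roughly like this. This is a sketch under assumptions: `buildTestRequest` is a made-up helper, the `application/json` content type is assumed to be acceptable to EventGate, and the event shown is abbreviated (a real test event would need every field the schema requires). The field names follow the deployed payload shown later in this thread.

```javascript
// Assumption: EventGate accepts a JSON-encoded event in the POST body.
const EVENTGATE_URL = 'https://intake-analytics.wikimedia.org/v1/events';

// Hypothetical helper: returns the URL and fetch() options for one event.
function buildTestRequest(event) {
  return {
    url: EVENTGATE_URL,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(event)
    }
  };
}

const testRequest = buildTestRequest({
  $schema: '/analytics/legacy/wikipediaportal/1.0.0',
  meta: { stream: 'eventlogging_WikipediaPortal', domain: 'www.wikipedia.org' },
  client_dt: new Date().toISOString(),
  schema: 'WikipediaPortal',
  event: { event_type: 'landing' } // abbreviated, not a complete event
});

// To actually send it (Node 18+ or a browser), uncomment:
// fetch(testRequest.url, testRequest.options).then((res) => console.log(res.status));
console.log(testRequest.options.body);
```

Keeping the request construction separate from the network call lets the body be inspected before anything is sent to production intake.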

@Ottomata great, should we merge the schema addition to analytics/legacy before, after, or in tandem with the migrated instrumentation?

Schema should go first. Then stream config, then instrumentation deployment. :)

Change 685513 merged by jenkins-bot:

[schemas/event/secondary@master] Add WikipediaPortal to analytics/legacy

https://gerrit.wikimedia.org/r/685513

Change 772504 had a related patch set uploaded (by Jdrewniak; author: Jdrewniak):

[wikimedia/portals@master] Migrate wikipedia.org analytics to Event Platform

https://gerrit.wikimedia.org/r/772504

hi @Ottomata , I've merged the schema to analytics/legacy and have a patch up for the portals repo as well as mediawiki-config. I'm guessing we'll be able to test the new schema once it's been deployed via the train? (that's happening 4 times this week so it shouldn't be long)

The endpoint in the Portals repo has been changed to https://intake-analytics.wikimedia.org/v1/events and the data looks something like this:

{
	"event": {
		"session_id": "c5860eb99af6d7d9",
		"event_type": "landing",
		"referer": "http://localhost:8000/src/wikipedia.org/",
		"accept_language": "en",
		"cohort": "baseline"
	},
	"revision": 15890769,
	"schema": "WikipediaPortal",
	"$schema": "/analytics/legacy/wikipediaportal/1.0.0",
	"client_dt": "2022-03-21T21:46:08.308Z",
	"webHost": "localhost",
	"wiki": "metawiki",
	"meta": {
		"stream": "eventlogging_WikipediaPortal",
		"domain": "localhost"
	}
}
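
A quick local sanity check on a payload like the one above could look like the sketch below. The function name `findProblems` and the specific checks are illustrative assumptions; the authoritative validation rules are whatever the JSON schema in the schema repository defines.

```javascript
// Illustrative checks only; the real rules live in the JSON schema.
function findProblems(payload) {
  const problems = [];
  if (typeof payload.$schema !== 'string' ||
      !payload.$schema.startsWith('/analytics/legacy/')) {
    problems.push('$schema should point at an /analytics/legacy/ schema');
  }
  if (!payload.meta || typeof payload.meta.stream !== 'string') {
    problems.push('meta.stream is missing');
  }
  if (Number.isNaN(Date.parse(payload.client_dt))) {
    problems.push('client_dt is not a parseable timestamp');
  }
  if (typeof payload.event !== 'object' || payload.event === null) {
    problems.push('event fields are missing');
  }
  return problems;
}

// Payload copied from the example above (abbreviated event fields).
const samplePayload = {
  event: { session_id: 'c5860eb99af6d7d9', event_type: 'landing' },
  schema: 'WikipediaPortal',
  $schema: '/analytics/legacy/wikipediaportal/1.0.0',
  client_dt: '2022-03-21T21:46:08.308Z',
  meta: { stream: 'eventlogging_WikipediaPortal', domain: 'localhost' }
};

console.log(findProblems(samplePayload)); // empty array when all checks pass
console.log(findProblems({}));            // one message per failed check
```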

Great!

The schema gets deployed automatically, so it is out. https://schema.wikimedia.org/#!//secondary/jsonschema/analytics/legacy/wikipediaportal

What's the mediawiki-config patch? I can +1 if you like! :)

Thank you @Jdrewniak and @Ottomata! This is a schema that I've worked with before to have the data whitelisted for permanent data capture. Is there anything I'll need to do on my end to ensure that this process continues after the migration?

This migration is being done in a backwards compatible way, so the same system is sanitizing and keeping data. If this schema is listed in the allowlist file as described in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization, it will continue to be sanitized and kept permanently.

Great @Ottomata yes it is in that yaml file!

Change 772504 merged by jenkins-bot:

[wikimedia/portals@master] Migrate wikipedia.org analytics to Event Platform

https://gerrit.wikimedia.org/r/772504

Jdlrobson reassigned this task from Edtadros to Jdrewniak.
Jdlrobson added a subscriber: Edtadros.

Is this ready for QA? If so could you add some QA steps. If not, please move to doing.

Change 773373 had a related patch set uploaded (by Jdrewniak; author: Jdrewniak):

[wikimedia/portals@master] Followup 246c37b7, event-logging sends data in POST body not as URL params

https://gerrit.wikimedia.org/r/773373

Change 773373 merged by jenkins-bot:

[wikimedia/portals@master] Followup 246c37b7, event-logging sends data in POST body not as URL params

https://gerrit.wikimedia.org/r/773373
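
The difference this follow-up addresses can be sketched as below. Both helpers are hypothetical and the exact legacy URL format is an assumption; the point is only that the old style carries the serialized event in the query string, while Event Platform expects it in the POST body.

```javascript
// Legacy style (assumed shape): the serialized event rides in the query string.
function legacyBeaconUrl(baseUrl, event) {
  return baseUrl + '?' + encodeURIComponent(JSON.stringify(event));
}

// Event Platform style: the URL stays fixed and the event goes in the body.
function eventPlatformRequest(intakeUrl, event) {
  return { url: intakeUrl, body: JSON.stringify(event) };
}

const sampleEvent = { event: { event_type: 'landing' } };

console.log(legacyBeaconUrl('https://www.wikipedia.org/beacon/event', sampleEvent));

const req = eventPlatformRequest(
  'https://intake-analytics.wikimedia.org/v1/events',
  sampleEvent
);
console.log(req.url, req.body);

// In a browser, the POST could then be issued with:
// navigator.sendBeacon(req.url, req.body);
```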

Change 773380 had a related patch set uploaded (by Jdrewniak; author: Jdrewniak):

[operations/mediawiki-config@master] Bumping portals to master

https://gerrit.wikimedia.org/r/773380

@Ottomata I'm planning to deploy the change to the portals repo Thursday March 24, afternoon deploy window. After that, we should start seeing production events :)

Change 773380 merged by jenkins-bot:

[operations/mediawiki-config@master] Bumping portals to master

https://gerrit.wikimedia.org/r/773380

Mentioned in SAL (#wikimedia-operations) [2022-03-24T20:22:23Z] <thcipriani@deploy1002> Synchronized portals/wikipedia.org/assets: Config: [[gerrit:773380|Bumping portals to master (T282012)]] (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2022-03-24T20:23:16Z] <thcipriani@deploy1002> Synchronized portals: Config: [[gerrit:773380|Bumping portals to master (T282012)]] (duration: 00m 52s)

hi @Ottomata, I've clicked around Hue and verified that events are being logged successfully at about the same volume as before (about 25k per day). Would you like to verify this as well & sign this off?

Awesome! No, if you see them, that's great. There are some backend finalization steps that I can take from here. I'll try to get them in today or tomorrow.

Thank you!!!!

Change 775374 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Finalize WikipediaPortal eventlogging event platform migration

https://gerrit.wikimedia.org/r/775374

Ottomata updated the task description. (Show Details)

Change 775374 merged by Ottomata:

[operations/puppet@production] Finalize WikipediaPortal eventlogging event platform migration

https://gerrit.wikimedia.org/r/775374

Just waiting to get my (expired) edit-protect permissions on metawiki back, then I can close this!