
TranslationRecommendation* Schemas Event Platform Migration
Closed, Resolved (Public)

Description

See: https://wikitech.wikimedia.org/wiki/Event_Platform/EventLogging_legacy

We will keep client_ip and geocoded data for these schemas.

Status
  • 2021-02-22 - schemas merged and edit protected on metawiki.
  • 2021-02-23
    • Hive table evolved
    • event streams declared
  • 2021-04-30
    • Finalized backend migration for refine legacy and eventlogging processor

Recommendation API code changes needed

Since these events are sent to the legacy eventlogging backend from a custom client (not via the MW EventLogging extension), the code will need to be changed to send events to EventGate instead.

It looks as though this code has both JavaScript and Python logic to send events. Both implementations will need updating to send events to EventGate.

For more info see:

Event Timeline


@Isaac, let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of this migration.

Also, do you know what produces these events? I'm assuming it's a MW extension somewhere?

let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of this migration.

If it's not hard, I'd ask to retain the geocoded data. Client IP is nice for determining the number of unique users (UA+IP), but there's a user token in the data that also works for that purpose. I do use the geocoded data (country specifically) for looking at the geographic diversity of users of the system, though, so I would prefer to retain it.

Also, not ideal, but my understanding is that eventlogging also always shows up in webrequests so we can always extract that information even if it's not logged in the event.<schemaname> table. Is that going to change too?

Also, do you know what produces these events? I'm assuming it's a MW extension somewhere?

Yeah, the eventlogging is all coming from GapFinder (not technically an extension) which has this codebase -- specifically, UserAction, UIRequests, APIRequests.

If it's not hard, I'd ask to retain the geocoded data.

It isn't hard, we can do!

eventlogging also always shows up in webrequests so we can always extract that information even if it's not logged in the event.<schemaname> table. Is that going to change too?

All webrequests are logged, so the POST of the event will be available in the webrequest logs. However, it won't be possible to extract the event data from it, since the data is now sent as part of the POST body, which isn't logged in webrequest.

Yeah, the eventlogging is all coming from GapFinder (not technically an extension)

Ok, interesting; this will need code changes then. We'll put this off for now, but when we take it up, who should we work with to make the changes?

If it's not hard, I'd ask to retain the geocoded data.

It isn't hard, we can do!

Thanks!

All webrequests are logged, so the POST of the event will be available in the webrequest logs. However, it won't be possible to extract the event data from it, since the data is now sent as part of the POST body, which isn't logged in webrequest.

Ahh...bummer but makes sense.

Ok, interesting; this will need code changes then. We'll put this off for now, but when we take it up, who should we work with to make the changes?

I'd start with reaching out to Leila. If the work is more technical, Baha knows the most about the codebase. If it's just code review / guidance, I can probably do that. Fabian might have an interest. Either way, Leila will be able to direct appropriately.

Ottomata updated the task description.
Ottomata added a subscriber: leila.

@leila o/ We're getting closer to being done with the low hanging fruit parts of this EventLogging -> Event Platform migration. The TranslationRecommendation events that come from GapFinder are 'medium hanging' :) GapFinder will need code changes. Specifically, the code that sends events as URL-encoded JSON query parameters to /beacon/event needs to be changed to POST a fully formatted event to https://intake-analytics.wikimedia.org/v1/events?hasty=true, following the same logic as implemented in the EventLogging extension.

I don't think the code changes would be difficult, but I'd prefer if someone on your team could do the code. I'd be very available to advise and review. I understand if this is unexpected work that y'all don't have time for, but the sooner we get this done the better! The more migrations we finish, the sooner we can turn off the legacy eventlogging backend.
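For illustration, the change requested above could look roughly like the Python sketch below. It is not the patch that was eventually merged. The envelope fields copy the sample event quoted further down in this task; the helper names (build_event, build_request) and the assumption that the $schema path is the lowercased schema name are hypothetical, and the text/plain content type simply mirrors the manual curl call in this thread.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Endpoint taken from the comment above.
EVENTGATE_URL = "https://intake-analytics.wikimedia.org/v1/events?hasty=true"


def build_event(schema_name, schema_version, revision, web_host, payload, now=None):
    """Wrap a legacy EventLogging payload in the envelope seen in this task.

    Assumption: the $schema URI is /analytics/legacy/<lowercased name>/<version>,
    as in the sample event; `now` must be a timezone-aware datetime.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # ISO 8601 with millisecond precision and a trailing Z.
    client_dt = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return {
        "schema": schema_name,
        "$schema": f"/analytics/legacy/{schema_name.lower()}/{schema_version}",
        "revision": revision,
        "event": payload,
        "webHost": web_host,
        "client_dt": client_dt,
        "meta": {
            "stream": f"eventlogging_{schema_name}",
            "domain": web_host,
        },
    }


def build_request(event, url=EVENTGATE_URL):
    """Build (but do not send) the POST request carrying the event as JSON."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumption: text/plain, mirroring the curl example in this task.
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
```

Sending would then be a matter of passing the result of build_request(event) to urllib.request.urlopen; the sketch stops short of that so it can be exercised without network access.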

On it, meaning: I'll check on our end to see if we can pick this up and one of us will get back to you here.

Change 660658 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/recommendation-api@master] Send events to EventGate

https://gerrit.wikimedia.org/r/660658

Change 661399 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Migrate TranslationRecommendation from metawiki

https://gerrit.wikimedia.org/r/661399

Change 661399 merged by Ottomata:
[schemas/event/secondary@master] Migrate TranslationRecommendation from metawiki

https://gerrit.wikimedia.org/r/661399

Change 666392 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Declare TranslationRecommendation event streams

https://gerrit.wikimedia.org/r/666392

Change 666392 merged by Ottomata:
[operations/mediawiki-config@master] Declare TranslationRecommendation event streams

https://gerrit.wikimedia.org/r/666392

Mentioned in SAL (#wikimedia-operations) [2021-02-23T16:02:55Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Declare TranslationRecommendation event streams - T271163 (duration: 00m 58s)

@bmansurov you should be able to produce these events now using your code. This should work in both beta and production. See also https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Viewing_and_querying_events

Change 660658 merged by jenkins-bot:
[research/recommendation-api@master] Send events to EventGate

https://gerrit.wikimedia.org/r/660658

@Ottomata thanks for reviewing the patch. I've merged it and deployed it.

I think so. For example, here's what was sent:

{"schema":"TranslationRecommendationUIRequests","$schema":"/analytics/legacy/translationrecommendationuirequests/1.0.0","revision":15484897,"event":{"timestamp":1615601235,"userAgent":"Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0","sourceLanguage":"en","targetLanguage":"es","origin":"language_select","userToken":"43124d8a-9b6e-426a-9b5d-192de7db562f","requestToken":"dcbe04f9-8fe9-4336-b252-a88a33abec0f","campaign":""},"webHost":"recommend.wmflabs.org","client_dt":"2021-03-13T02:07:15.077Z","meta":{"stream":"eventlogging_TranslationRecommendationUIRequests","domain":"recommend.wmflabs.org"}}

And the response status code was 202.

Ok, that's good! Note that a 202 with hasty=true does not necessarily mean everything is working. See https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#Producer_types%3A_Guaranteed_and_Hasty

You could set hasty=false to test and get a sync response back from EventGate. Or, you could query Hive a few hours after and make sure that event made it through. Or, you could use the eventstreams-internal instance to view live events in Kafka.
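As a rough sketch of the first option, the helper below toggles the hasty query parameter on the intake URL and maps the two status codes seen in this task (202 with hasty=true, 201 from a synchronous POST) to what they tell the client. The function names are hypothetical, and any other status code is deliberately reported as unknown rather than guessed at.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_hasty(url, hasty):
    """Return `url` with its hasty query parameter set to true or false."""
    parts = urlsplit(url)
    # Drop any existing hasty parameter, keep everything else in order.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "hasty"]
    query.append(("hasty", "true" if hasty else "false"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def describe_status(status):
    """Interpret only the status codes that appear in this task."""
    if status == 201:
        return "accepted: EventGate responded after handling the event"
    if status == 202:
        return "received: hasty response, the event may still fail later"
    return "unknown: check the EventGate documentation for this status"
```

For a one-off check, the client could POST to with_hasty(url, False) and expect a 201, then switch back to the hasty URL for normal operation.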

I just looked in Hive and I did not see an event with "requestToken":"dcbe04f9-8fe9-4336-b252-a88a33abec0f"

Indeed. Those schema tables are all empty. I visited the eventgate-validation dashboard on Logstash, but I couldn't find any such requests. Where can I find more about those missing events?

I just manually POSTed your event from the CLI, and all went well:

curl -v -H 'Content-Type: text/plain' -d@/tmp/tr.json 'https://intake-analytics.wikimedia.org/v1/events' | jq .
...
< HTTP/2 201

Oh! I see. Legacy EventLogging data has always filtered out 'non WMF hostnames' from being included in the final refined tables. recommend.wmflabs.org is not currently considered a WMF hostname, but it probably should be. I'll file a task.

This is not a problem with new event platform streams, as they don't have non-WMF domains filtered out, just tagged with a boolean is_wmf_domain field.
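The difference between the two behaviours can be illustrated with a toy Python sketch. This is not the actual refine code: the hostname suffix list and the function names are made up for the example, and only the is_wmf_domain field name and the webHost and meta.domain fields come from this task.

```python
# Hypothetical allow-list, for illustration only; the real rule lives in the
# refine pipeline and is not reproduced here.
WMF_HOSTNAME_SUFFIXES = (".wikipedia.org", ".wikimedia.org")


def is_wmf_hostname(hostname):
    """Toy check: does the hostname end with one of the assumed suffixes?"""
    return hostname.endswith(WMF_HOSTNAME_SUFFIXES)


def legacy_refine(events):
    """Legacy behaviour: events from non-WMF hostnames are dropped."""
    return [e for e in events if is_wmf_hostname(e["webHost"])]


def event_platform_refine(events):
    """Event Platform behaviour: every event is kept and tagged instead."""
    return [
        {**e, "is_wmf_domain": is_wmf_hostname(e["meta"]["domain"])}
        for e in events
    ]
```

Under this toy rule, an event from recommend.wmflabs.org disappears from the legacy output but survives in the Event Platform output with is_wmf_domain set to False, which matches the empty tables observed above.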

From what I can tell, these events are fully migrated on the clients. Proceeding with the rest of the backend migration.

Change 683984 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Finalize TranslationRecommendation refine migration

https://gerrit.wikimedia.org/r/683984

Change 683984 merged by Ottomata:

[operations/puppet@production] Finalize TranslationRecommendation refine migration

https://gerrit.wikimedia.org/r/683984

Change 772507 had a related patch set uploaded (by Jdrewniak; author: Jdrewniak):

[operations/mediawiki-config@master] Enable EventGate logging for WikipediaPortal schema

https://gerrit.wikimedia.org/r/772507

Change 772507 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable EventGate logging for WikipediaPortal schema

https://gerrit.wikimedia.org/r/772507

Mentioned in SAL (#wikimedia-operations) [2022-03-22T20:24:08Z] <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: 17caf0359b99b69c0b3e0d7a5fa2f5c7fb7464ef: Enable EventGate logging for WikipediaPortal schema (T271163) (duration: 01m 54s)