
Source geolocation directly rather than using IP in schema
Open, Medium, Public

Description

Following the discussion on https://phabricator.wikimedia.org/T288853 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/713526, it's worth exploring whether we can source geolocation directly using existing APIs or cookies available to clients (app and web), rather than passing the user's IP to the backend in some way and looking it up there in the geolocation database.

This would mean that when writing schemas or setting up streams, users would specify that they want the 'location' and not the 'ip', since in practice the IP is used exclusively for determining geolocation today.

Event Timeline

Responding to @Tgr's question from T287121: Add geolocation information to Growth schemas:

@Krinkle pointed out on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/713526 that collecting IPs when we only need high-level geolocation is unnecessarily privacy-invasive. It's possible to collect geodata manually, but adding that for five different schemas is a bit of a chore. @Ottomata, is there any chance that will be supported in the foreseeable future? (I.e., a fragment similar to #client_ip that results in the geolocation data, but not the IP, being automatically added in the refine step.)

IIRC, there is a limited geolocation HTTP header that is set by Varnish (I don't remember the name of the header). Instrumentation code could just set that in http.request_headers['X-Geolocation'] (or whatever the header name is), since http.request_headers is a map type field and would not require schema changes.

This wouldn't be automated though, instrumentation code would do it. Would that work?

This wouldn't be automated though, instrumentation code would do it. Would that work?

Depends on the definition of "instrumentation code". To me that means client-side JS calling mw.eventLog or PHP code.

By the way, to my knowledge, most schemas using Geo already do this (and have for years) on the instrumentation side: the geo value is an explicit part of the event object, sent as one of its keys, with the value read from the Geo cookie.

If I understand correctly, you (plural) are suggesting not to adopt this in the newer schemas that started capturing IP addresses instead, but rather to leapfrog to a more automated injection: essentially adding it to the data automatically, within the event-intake service or some later processing step, for schemas that ask for it. I'd support that. I suppose we'd have to make sure that this information is indeed reliably available for the eventgate beacon requests, given that eventgate no longer uses a wiki domain (and thus does not get as much of the text-lb/varnish treatment).

Oh, it's a cookie?

GeoIP=US:NY:Brooklyn:40.70:-73.97:v4;

https://wikitech.wikimedia.org/wiki/Geolocation
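
For illustration, here is a minimal sketch of splitting that cookie value into named parts. The function name parseGeoIP and the field names (country, region, city, latitude, longitude, ipVersion) are assumptions inferred from the example value above, not an existing API; check them against the wikitech page before relying on them.

```javascript
// Hypothetical helper: field names are inferred from the example value
// 'US:NY:Brooklyn:40.70:-73.97:v4' and are not part of any existing API.
function parseGeoIP(value) {
  // Drop a trailing ';' in case the value was copied straight from a cookie header.
  const parts = String(value).replace(/;$/, '').split(':');
  if (parts.length !== 6) {
    // Unknown layout: return null rather than guessing.
    return null;
  }
  const [country, region, city, lat, lon, ipVersion] = parts;
  return {
    country,
    region,
    city,
    // Empty coordinates are reported as null instead of 0.
    latitude: lat === '' ? null : Number(lat),
    longitude: lon === '' ? null : Number(lon),
    ipVersion
  };
}
```

A schema that only needs high-level geolocation could then keep just the country (and perhaps the region) and drop the rest.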

I guess we'd want a similar field for cookie values we want to collect then?

Depends on the definition of "instrumentation code". To me that means client-side JS calling mw.eventLog or PHP code.

Yes, whatever constructs the event.

event.http.cookies['GeoIP'] = getCookie('GeoIP');
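
A slightly fuller sketch of the line above, assuming a hand-written getCookie helper and a hypothetical http.cookies map field (the thread only proposes such a field; it is not in the schema today). The optional cookieString parameter and the attachGeoCookie name are illustrative only, and the parameter exists so the code can be exercised outside a browser.

```javascript
// Hypothetical helper: reads one cookie by name from a cookie string.
// Falls back to document.cookie when running in a browser.
function getCookie(name, cookieString) {
  const source = cookieString !== undefined
    ? cookieString
    : (typeof document !== 'undefined' ? document.cookie : '');
  for (const pair of source.split(';')) {
    const idx = pair.indexOf('=');
    if (idx === -1) {
      continue; // Skip fragments that are not key=value pairs.
    }
    if (pair.slice(0, idx).trim() === name) {
      const raw = pair.slice(idx + 1).trim();
      try {
        return decodeURIComponent(raw);
      } catch (e) {
        return raw; // Malformed escape sequence: return the raw value.
      }
    }
  }
  return null;
}

// Assumption: `http.cookies` is a map field like `http.request_headers`.
function attachGeoCookie(event, cookieString) {
  const value = getCookie('GeoIP', cookieString);
  if (value !== null) {
    event.http = event.http || {};
    event.http.cookies = event.http.cookies || {};
    event.http.cookies.GeoIP = value;
  }
  return event;
}
```

If the cookie is absent, the event is returned unchanged, so the instrumentation does not send an empty or placeholder location.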
odimitrijevic moved this task from Incoming to Event Platform on the Analytics board.
odimitrijevic added a project: Event-Platform.