
Add geolocation information to Growth schemas
Closed, Resolved (Public)

Description

In the fiscal year that has just started, the Growth team's focus will be on the Newcomer Experience Pilots. This work has a geographic component: we're going to be interested in analyzing newcomer behaviour and usage of Growth team features on a per-country basis. To enable that analysis, we need to add geolocation information to the schemas the team uses (this was removed as part of the migration process for legacy schemas, and never added to Event Platform schemas). The team has a mix of legacy and Event Platform schemas that we need this added to.

Legacy schemas:

Event Platform schemas:

We are not looking to retain any geolocation data beyond 90 days, as none of the schemas have a structure that would allow for that. Also, of the schemas listed, only ServerSideAccountCreation is on the allowlist.

The team also maintains the legacy NewcomerTask schema, but that schema does not need this information, as it's only used to store task-specific information to be connected with HomepageModule, HelpPanel, and link_suggestion_interaction.

Event Timeline

nettrom_WMF added a subscriber: Ottomata.

From what I've been able to find, this is the first time this has been requested, and so I'm unsure what exactly to ask for and how to do this. I'm hoping someone from Data Engineering can help us out. I suspect @Ottomata is the right person to tag first, so adding him here.

If you add the http.client_ip field to your schemas, EventGate will automatically populate it. If this field exists, the Hive Refine step will then automatically add the geocoded data field to the Hive table.

https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#http_information
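The two automatic steps described above can be pictured with a small sketch. This is only a toy model of the described behaviour, not EventGate or Refine code: the function names, the `schema_has_client_ip` flag, the stub geocoder, and the output field name `geocoded_data` are all assumptions made for illustration.

```python
from typing import Callable


def populate_client_ip(event: dict, schema_has_client_ip: bool, request_ip: str) -> dict:
    """Intake step (toy model): fill http.client_ip only if the schema declares it."""
    if not schema_has_client_ip:
        return event
    enriched = dict(event)
    http = dict(enriched.get("http", {}))
    # Assumption: a value already present on the event is left untouched.
    http.setdefault("client_ip", request_ip)
    enriched["http"] = http
    return enriched


def refine(event: dict, geocode: Callable[[str], dict]) -> dict:
    """Refine step (toy model): add geocoded data when http.client_ip is present."""
    ip = event.get("http", {}).get("client_ip")
    if ip is None:
        return event
    refined = dict(event)
    refined["geocoded_data"] = geocode(ip)  # field name is an assumption
    return refined


def stub_geocoder(ip: str) -> dict:
    """Hypothetical stand-in for a real IP-to-location lookup."""
    return {"country_code": "ZZ"}


incoming = {"wiki": "my_wiki"}
with_ip = populate_client_ip(incoming, schema_has_client_ip=True, request_ip="127.0.0.1")
refined_event = refine(with_ip, stub_geocoder)
```

The point of the sketch is that producers never set the field themselves: whether geolocation shows up downstream depends only on whether the schema declares `http.client_ip`.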

These schemas are all migrated to Event Platform, so the steps are the same for each of them. :)

Note that there is a fragment/http/client_ip schema that should be $refed to add this field. See other schemas that $ref client_ip, like virtualpageview.

@mewoph : Can we add the reference to client_ip to https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/704402 so we also get that done and only have one version change to the schema? Based on the code for VirtualPageview that @Ottomata mentions above, all that should be needed is to add - $ref: /fragment/http/client_ip/1.0.0# to the referenced fragments.
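As a rough illustration of that change, the sketch below models a schema as a plain Python dict and appends the client_ip fragment reference to its `allOf` list. Only the `$ref` value `/fragment/http/client_ip/1.0.0#` comes from this thread; the helper name `add_fragment_ref`, the example schema, and the placeholder fragment path are hypothetical, and the real change is an edit to the schema's source file rather than Python code.

```python
import copy

CLIENT_IP_REF = "/fragment/http/client_ip/1.0.0#"  # value quoted in the comment above


def add_fragment_ref(schema: dict, ref: str = CLIENT_IP_REF) -> dict:
    """Return a copy of `schema` whose allOf list references `ref` exactly once."""
    updated = copy.deepcopy(schema)
    all_of = updated.setdefault("allOf", [])
    if not any(entry.get("$ref") == ref for entry in all_of):
        all_of.append({"$ref": ref})
    return updated


# Hypothetical, heavily trimmed schema; the real one lives in the schema repository.
example_schema = {
    "title": "analytics/example",                     # placeholder title
    "allOf": [{"$ref": "/fragment/example/1.0.0#"}],  # placeholder fragment
}

updated_schema = add_fragment_ref(example_schema)
```

Calling `add_fragment_ref` a second time leaves the list unchanged, which mirrors the intent of folding the reference into the existing patch so the schema only needs one version bump.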

I'm not great at Gerrit, so I'm unsure how I'd go about adding it to that patch. If you give or point me to some instructions, I can probably do it, though.

Hi @nettrom_WMF, I can add it to the existing patch.

Change 704402 had a related patch set uploaded (by MewOphaswongse; author: MewOphaswongse):

[schemas/event/secondary@master] Add a link: Update schema to support edit mode and link inspector toggles; add client_ip

https://gerrit.wikimedia.org/r/704402

Change 704402 merged by jenkins-bot:

[schemas/event/secondary@master] Add a link: Update schema to support edit mode and link inspector toggles; add client_ip

https://gerrit.wikimedia.org/r/704402

Change 710310 had a related patch set uploaded (by MewOphaswongse; author: MewOphaswongse):

[schemas/event/secondary@master] Add client_ip to Growth schemas

https://gerrit.wikimedia.org/r/710310

Change 710311 had a related patch set uploaded (by MewOphaswongse; author: MewOphaswongse):

[mediawiki/extensions/GrowthExperiments@master] Use updated schemas with client_ip

https://gerrit.wikimedia.org/r/710311

Sample payload from updated schemas

HomepageVisit

"http": {
  "request_headers": {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1"
  },
  "client_ip": "127.0.0.1"
},
"schema": "HomepageVisit",
"wiki": "my_wiki",
"webHost": "localhost:8080",
"$schema": "/analytics/legacy/homepagevisit/1.2.0",
"client_dt": "2021-08-05T17:11:51Z"

HomepageModule

"$schema": "/analytics/legacy/homepagemodule/1.3.0",
"client_dt": "2021-08-05T17:02:43.108Z",
"meta": {
  "stream": "eventlogging_HomepageModule",
  "domain": "localhost",
  "id": "a876ef03-b2d0-414b-a496-ff6a4d8b5774",
  "dt": "2021-08-05T17:02:44.116Z",
  "request_id": "f38ef720-f60e-11eb-8e57-dd5d9374b9c7"
},
"http": {
  "client_ip": "127.0.0.1",
  "request_headers": {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1"
  }
}

HelpPanel

"$schema": "/analytics/legacy/helppanel/1.1.0",
"client_dt": "2021-08-05T17:04:23.849Z",
"meta": {
  "stream": "eventlogging_HelpPanel",
  "domain": "localhost",
  "id": "b8d0b7a0-dbe7-43ce-b92f-88bc72e1685b",
  "dt": "2021-08-05T17:04:24.914Z",
  "request_id": "2f9a5d40-f60f-11eb-905b-ba280dc5deb6"
},
"http": {
  "client_ip": "127.0.0.1",
  "request_headers": {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1"
  }
}

ServerSideAccountCreation

event: {
  "meta": {
    "domain": "localhost",
    "stream": "eventlogging_ServerSideAccountCreation",
    "id": "7d2d859c-f535-4bb1-8a85-598423fad779",
    "dt": "2021-08-05T20:12:50.405Z",
    "request_id": "82137f60-f629-11eb-b3e5-dce138bec23e"
  },
  "http": {
    "request_headers": {
      "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
    },
    "client_ip": "127.0.0.1"
  },
  "event": {
    "userId": 25,
    "userName": "MewCampaign1",
    "isSelfMade": true,
    "campaign": "testCampaign",
    "displayMobile": false,
    "token": "",
    "userBuckets": "",
    "isApi": false
  },
  "schema": "ServerSideAccountCreation",
  "wiki": "my_wiki",
  "webHost": "localhost:8080",
  "$schema": "/analytics/legacy/serversideaccountcreation/1.1.0",
  "client_dt": "2021-08-05T20:12:50Z"
}

Change 710349 had a related patch set uploaded (by MewOphaswongse; author: MewOphaswongse):

[schemas/event/secondary@master] Add client_ip to serversideaccountcreation schema

https://gerrit.wikimedia.org/r/710349

Change 710350 had a related patch set uploaded (by MewOphaswongse; author: MewOphaswongse):

[mediawiki/extensions/Campaigns@master] Use serversideaccountcreation schema version 1.1.0 (client_ip added)

https://gerrit.wikimedia.org/r/710350

Change 710349 merged by jenkins-bot:

[schemas/event/secondary@master] Add client_ip to serversideaccountcreation schema

https://gerrit.wikimedia.org/r/710349

Change 710310 merged by jenkins-bot:

[schemas/event/secondary@master] Add client_ip to Growth schemas

https://gerrit.wikimedia.org/r/710310

Change 710311 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Use updated schemas with client_ip

https://gerrit.wikimedia.org/r/710311

Change 710350 merged by jenkins-bot:

[mediawiki/extensions/Campaigns@master] Use serversideaccountcreation schema version 1.1.0 (client_ip added)

https://gerrit.wikimedia.org/r/710350

Etonkovidova subscribed.

Checked in production (wmf.18)

  • schemas HomepageModule (/analytics/legacy/homepagemodule/1.3.0) and HelpPanel (/analytics/legacy/helppanel/1.1.0) have updated versions
  • Logstash doesn't record any errors for the schemas

@Krinkle pointed out on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/713526 that collecting IPs when we only need high-level geolocation is unnecessarily privacy-invasive. It's possible to collect geodata manually, but adding that for five different schemas is a bit of a chore. @Ottomata, is there any chance that will be supported in the foreseeable future (i.e. a fragment similar to #client_ip that results in the geolocation data, but not the IP, being automatically added in the refine step)?
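To make the request concrete, here is a minimal sketch of what "geolocation but not the IP" could look like as a processing step. It is purely illustrative and does not describe any existing fragment or refine behaviour: the function name `geolocate_without_ip`, the stub geocoder, and the `geocoded_data` / `country_code` field names are assumptions.

```python
import copy
from typing import Callable


def geolocate_without_ip(event: dict, geocode: Callable[[str], dict]) -> dict:
    """Return a copy of `event` with coarse geodata added and http.client_ip removed."""
    result = copy.deepcopy(event)
    ip = result.get("http", {}).pop("client_ip", None)
    if ip is not None:
        # Keep only the high-level field needed for per-country analysis.
        result["geocoded_data"] = {"country_code": geocode(ip).get("country_code")}
    return result


def stub_geocoder(ip: str) -> dict:
    """Hypothetical stand-in for a real IP-to-location lookup."""
    return {"country_code": "ZZ", "city": "Nowhere"}


raw_event = {
    "http": {"client_ip": "127.0.0.1", "request_headers": {}},
    "wiki": "my_wiki",
}
stored_event = geolocate_without_ip(raw_event, stub_geocoder)
```

In this sketch the IP is used only transiently for the lookup, so the stored event carries the country code and nothing finer-grained.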