
Clients may need to set wiki/domain/webHost in event metadata
Closed, ResolvedPublic

Description

The legacy EventLogging capsule includes wiki and webHost for <technical reasons>. At analysis time, this really only matters when we need to focus on a particular wiki's web usage. It also provides a uniqueness element: on the web, where EL and persistently stored data (with the exception of CentralAuth stuff) are isolated between wikis, we can use {wiki, session ID} as an even more unique session identifier. That covers the extremely rare case where two different sessions on two different wikis have the same ID in an overlapping time period and we are interested in analyzing sessions across all Wikipedia languages (for example).
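As a rough illustration of the {wiki, session ID} idea, here is a minimal sketch that groups events by that composite key. The event shape and the `wiki`, `session_id`, and `action` field names are assumptions made for the example, not the actual capsule schema.

```python
from collections import defaultdict


def group_by_session(events):
    """Group events by the composite key (wiki, session_id).

    Field names are hypothetical; they stand in for whatever the
    capsule or schema actually calls them.
    """
    sessions = defaultdict(list)
    for event in events:
        key = (event["wiki"], event["session_id"])
        sessions[key].append(event)
    return dict(sessions)


# Two wikis that happen to share the same session ID.
events = [
    {"wiki": "enwiki", "session_id": "abc123", "action": "view"},
    {"wiki": "jawiki", "session_id": "abc123", "action": "edit"},
    {"wiki": "enwiki", "session_id": "abc123", "action": "search"},
]
sessions = group_by_session(events)
```

Keyed on session ID alone, the three events above would collapse into one session; with the wiki included they stay separate.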

On mobile apps, it's really fuzzy. Currently, iOS and Android (and KaiOS too?) set the wiki/webHost based on context. If a user has English, Russian, and German as their languages in the app, "enwiki" (being the 1st preferred language) will be used both for events that are specific to English (interacting with Explore Feed, reading an article) and for events that are not specific to English (looking at a map of nearby articles).

It gets trickier for Suggested Edits on Android, a feature where users add/translate Wikidata descriptions for articles which have Wikidata items and add/translate captions for images on Commons (and soon will have computer vision-suggested tags). To which "wiki" do you attribute those analytics events? Wikidata? (Which doesn't even have EventLogging.) Wikimedia Commons? Currently, again, the user's 1st preferred language is used. P.S. The 1st preferred language can change during the usage of the app.

As we approach the finish line on a lot of the [Modern] Event Platform components, I wanted us to discuss the concept of domains/webHosts/wikis in event metadata. What are the technical reasons to have them, what are the use cases for them, and how should we handle cross-wiki interactions?

Suppose the COVID-19 article has a table of case counts by country which somehow uses a global source of data (e.g. Data namespace on Commons, some kind of a global Template which doesn't exist but has been talked about, or Wikidata) and suppose an editor can edit that data from any page where that data is used and have the changes propagate to other locations. Let's just assume that's how it worked. Suppose we had tracking attached to this feature. What does it mean to say "this event came from Japanese Wikipedia" when the event itself is not specific to Japanese Wikipedia, but is instead about editing something that doesn't even exist on any Wikipedia?

Strawman proposal: optional wiki & domain fields. Instrumentation sets them when it makes sense to, doesn't when there's no reason for them. domain doesn't make sense on mobile apps but makes sense for client-side error logging on the web. wiki doesn't always make sense, especially in the context of wiki-agnostic events. What do we lose by not requiring those fields in every event?
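To make the strawman concrete, here is a minimal sketch of an event builder in which both fields are optional and simply left out when the instrumentation has no meaningful value. The `build_event` helper, the schema names, and the flat event layout are hypothetical and not part of EventLogging or the Event Platform.

```python
from typing import Optional


def build_event(
    schema: str,
    data: dict,
    wiki: Optional[str] = None,
    domain: Optional[str] = None,
) -> dict:
    """Build an event, including wiki/domain only when they were provided.

    Hypothetical helper: the real client libraries may structure this
    differently.
    """
    event = {"schema": schema, "data": data}
    if wiki is not None:
        event["wiki"] = wiki
    if domain is not None:
        event["domain"] = domain
    return event


# Client-side error logging on the web: both fields make sense.
error_event = build_event(
    "client_error",
    {"message": "TypeError"},
    wiki="enwiki",
    domain="en.wikipedia.org",
)

# A wiki-agnostic app event: neither field is set.
settings_event = build_event("app_settings_change", {"setting": "theme"})
```

Under this sketch, an analysis that filters on wiki silently skips wiki-agnostic events, which is one concrete cost of not requiring the fields.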

Event Timeline

mpopov triaged this task as Medium priority. Apr 28 2020, 2:09 PM
mpopov created this task.
mpopov edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
mpopov moved this task from Inbox to Done! on the Better Use Of Data board.
mpopov moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Quick reply from my PoV:

meta.domain should be the domain part of the URL that the event pertains to. In cases where there is none, it can be omitted, or it can be filled in with whatever makes the most sense to the instrumentation developer.
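A minimal sketch of that rule, assuming the instrumentation has the relevant URL at hand: take the host part of the URL as meta.domain, and omit the field when there is no URL. The `with_meta_domain` helper and the example schema name are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def with_meta_domain(event: dict, url: Optional[str]) -> dict:
    """Return a copy of the event with meta.domain set from the URL's host.

    When there is no URL (or it has no host), meta.domain is omitted.
    """
    meta = dict(event.get("meta", {}))
    hostname = urlsplit(url).hostname if url else None
    if hostname:
        meta["domain"] = hostname
    return {**event, "meta": meta}


page_event = with_meta_domain(
    {"schema": "example"}, "https://en.m.wikipedia.org/wiki/Earth"
)
agnostic_event = with_meta_domain({"schema": "example"}, None)
```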

webHost and wiki are EventLogging EventCapsule fields that aren't in any MEP event schemas. We do have the convention of using database for mediawiki events, e.g. enwiki, metawiki, etc., as that is where that value comes from: the name of the MySQL database the wiki uses. There is also the concept of project families (e.g. wikipedia, wiktionary) and projects (e.g. English Wikipedia), but these aren't really 'domains', where a domain could be e.g. en.m.wikipedia.org.

See also: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L316-L384

Outcomes from yesterday's meeting:

  • clients will set meta.domain when relevant to context and omit it otherwise
    • on the web, this will always be the case because it's easy to do that
    • on the apps, where it's not always the case that the event should be attributed to a particular Wikipedia, it is up to the engineers to provide the domain in the instrumentation as needed
      • it doesn't always make sense to auto-fill meta.domain with the domain associated with the user's 1st preferred language inside the app (e.g. changing settings, managing reading lists, navigating map of nearby places with articles)
      • for Suggested Edits analytics, it might make more sense to use "wikidata.org", "commons.wikimedia.org", or "meta.wikimedia.org"?
  • backend will refine meta.domain when it's been set (T251320) and turn it into a struct in Hive so we can easily query events for project families and qualifiers like mobile site
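To illustrate what refining meta.domain into a struct could look like, here is a deliberately naive sketch. It is not the refinery's actual logic (see the Webrequest.java link above for that); the `DomainParts` fields and the splitting rules are assumptions, and taking the second-level label as the project family is a simplification.

```python
from typing import NamedTuple, Optional


class DomainParts(NamedTuple):
    project_family: str       # e.g. "wikipedia", "wikimedia", "wikidata"
    subdomain: Optional[str]  # e.g. "en", "commons"; None if absent
    is_mobile: bool           # True when an "m" qualifier is present


def split_domain(domain: str) -> DomainParts:
    """Naively split a domain into family, subdomain, and mobile flag."""
    labels = domain.lower().split(".")
    if len(labels) < 2:
        raise ValueError(f"not a usable domain: {domain!r}")
    project_family = labels[-2]  # label just before the TLD
    rest = labels[:-2]           # everything before the family
    is_mobile = "m" in rest
    leading = [label for label in rest if label not in ("m", "www")]
    subdomain = leading[0] if leading else None
    return DomainParts(project_family, subdomain, is_mobile)


mobile_enwiki = split_domain("en.m.wikipedia.org")
commons = split_domain("commons.wikimedia.org")
wikidata = split_domain("wikidata.org")
```

With a struct like this, a query can filter on the family or on the mobile flag directly instead of pattern-matching the raw domain string.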

A side-note from something @nshahquinn-wmf mentioned: in a bright and beautiful future where Wikipedia is a progressive web app on a single domain with all the languages available, we may want to stop using the domain to distinguish between desktop and mobile front-end sites, and instead switch to tracking "where the content came from" and "how was the content presented" separately.