
Change userAgent field to user_agent_map in EventCapsule
Closed, Resolved | Public | 21 Estimated Story Points

Description

The EventCapsule data that's present in all EventLogging schemas currently contains only raw, unparsed user agent data.

Many kinds of analyses would be much easier if we could store pre-parsed data derived from the user agent alongside it, as we do for the webrequest and pageview_hourly tables on Hive. Those tables contain the very useful user_agent_map field, with the following data extracted from the raw user agent (which remains available as a separate field in webrequest):
device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version.
I understand it is based on the ua-parser library, apart from the WMF-specific app version field. (The Analytics Engineering team has built a dashboard that uses this data and in August published a popular blog post about it.)

See also this thread on Analytics-l, where @Nuria already asserted that this should be feasible using the Python implementation of ua-parser. There is also T121550: eventlogging user agent data should be parsed so spiders can be easily identified {flea}, which I assume largely concerns the same work, but with a more specific purpose.

Event Timeline

  • for mysql/EL
  • install ua-parser python
  • process raw UA on capsule; we need to cache UA string versus parsed blob for the life of the application
  • insert json blob instead of raw user agent on table
  • document
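The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the Gerrit change: `to_user_agent_map` and `replace_user_agent` are hypothetical names, the event layout (a top-level `userAgent` key) is assumed from the capsule field name, and the parser is passed in as a function so the sketch does not depend on a particular library. The input shape mirrors the nested dict returned by the Python ua-parser's `user_agent_parser.Parse` (`user_agent`, `os` and `device` sub-dicts).

```python
import json


def to_user_agent_map(parsed):
    """Flatten a ua-parser style result into a user_agent_map-like dict."""
    ua = parsed.get("user_agent", {})
    os_info = parsed.get("os", {})
    device = parsed.get("device", {})
    return {
        "browser_family": ua.get("family"),
        "browser_major": ua.get("major"),
        "os_family": os_info.get("family"),
        "os_major": os_info.get("major"),
        "os_minor": os_info.get("minor"),
        "device_family": device.get("family"),
    }


def replace_user_agent(event, parse):
    """Return a copy of event whose userAgent is a JSON blob, not the raw string."""
    parsed = parse(event["userAgent"])
    updated = dict(event)  # shallow copy, the caller's event is left untouched
    updated["userAgent"] = json.dumps(to_user_agent_map(parsed), sort_keys=True)
    return updated
```

With ua-parser installed, `parse` could be `user_agent_parser.Parse` imported from `ua_parser`. The WMF-specific app version field is not covered here.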

Could we use an LRU cache like: https://pypi.python.org/pypi/pylru?
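For comparison, one option that avoids a new dependency (and therefore the packaging question) is the standard library's `functools.lru_cache`, available on Python 3. The sketch below uses it instead of pylru; `slow_parse` is a hypothetical stand-in for the real parser, and the `maxsize` value is an arbitrary assumption.

```python
import functools
import json


def slow_parse(ua_string):
    """Stand-in for an expensive user agent parser (hypothetical)."""
    return {"browser_family": ua_string.split("/")[0]}


@functools.lru_cache(maxsize=10000)
def parsed_ua_json(ua_string):
    """Return the JSON blob for a raw user agent string, memoized per string."""
    return json.dumps(slow_parse(ua_string))
```

Repeated calls with the same raw string skip the parser, and `parsed_ua_json.cache_info()` reports hits and misses, which would help when tuning the cache size. Caching the serialized string rather than the dict keeps callers from mutating cached values.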

Do we need to debianize it?

Nuria set the point value for this task to 13. Dec 15 2016, 5:33 PM

Take a look at queries that use UA.

Milimetric subscribed.

Reminder to also change the description of the userAgent field in the capsule: https://meta.wikimedia.org/wiki/Schema:EventCapsule

Change 333641 had a related patch set uploaded (by Fdans):
Changes UA string to JSON map

https://gerrit.wikimedia.org/r/333641

Change 333641 merged by Nuria:
Changes UA string to JSON map

https://gerrit.wikimedia.org/r/333641

Please note that this task is about adding the user agent map to the capsule and storing it alongside the existing raw user agent, not about replacing it. I think the task description is pretty clear in that respect, but looking at https://gerrit.wikimedia.org/r/333641 , there appears to be a serious misunderstanding about this.

Replacing the raw UA entirely is something we could discuss too, but this would need to be preceded by a consultation about the needs of the people who may rely on the raw string for their work. (Personally, I suspect that the parsed UA will be sufficient for my current work needs, provided that it really offers the same information as the aforementioned Hive tables, e.g. includes app version numbers for both our iOS and Android apps. But I haven't looked thoroughly at the implications and in any case am not the only person using this data in EL.)

Please note that this task is about adding the user agent map to the capsule and storing it alongside the existing raw user agent, not about replacing it.

Very sorry, this is indeed a misunderstanding. Our plan is to replace the user agent field; we have wanted to do this for a while for privacy reasons. See my original reply on the thread you mention:

"I think we can also probably consider doing the parsing in EL/MySQL so the
user agent is never raw on tables but rather always parsed. We could use
the python ua parser library ..."

Personally, I suspect that the parsed UA will be sufficient for my current work needs
provided that it really offers the same information as the aforementioned Hive tables, e.g. includes app version numbers for both our iOS and Android apps

This is a good catch, thank you. The code to add app version needs to be included.

We will work on this so that what is logged is identical to what Hive produces:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/UAParser.java#L133
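As a rough idea of what the app version step might look like in Python: the sketch below is not a port of the linked Java code. It assumes that app user agents carry a `WikipediaApp/<version>` token and that `"-"` is an acceptable placeholder for non-app traffic; both are assumptions to check against the Java implementation, and `wmf_app_version` is a hypothetical helper name.

```python
import re

# Assumption: app user agents contain a "WikipediaApp/<version>" token.
APP_VERSION_RE = re.compile(r"WikipediaApp/(\S+)")


def wmf_app_version(ua_string, default="-"):
    """Return the app version token, or default when the UA is not from the app."""
    match = APP_VERSION_RE.search(ua_string or "")
    return match.group(1) if match else default
```

The same function would apply to both the iOS and Android apps as long as they share that prefix.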

We have looked at other use cases and, if anything, it will be easy to query the parsed column rather than the raw one using json_extract.
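For analyses done outside the database, the client-side equivalent of json_extract is a small JSON lookup. A minimal sketch, assuming the column value is a JSON object with keys such as `browser_family`, and that some rows may still hold a raw string (for example, rows written before the change); `ua_field` is a hypothetical helper name.

```python
import json


def ua_field(user_agent_json, key, default=None):
    """Read one key from a JSON-encoded userAgent value."""
    try:
        parsed = json.loads(user_agent_json)
    except (TypeError, ValueError):
        return default  # not valid JSON, e.g. a legacy raw string or NULL
    if not isinstance(parsed, dict):
        return default
    return parsed.get(key, default)
```

For example, `ua_field(row_value, "browser_family")` returns the browser family for parsed rows and `None` for rows that cannot be decoded.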

Change 335145 had a related patch set uploaded (by Nuria):
Changes UA string to JSON map

https://gerrit.wikimedia.org/r/335145

Replacing the raw UA entirely is something we could discuss too, but this would need to be preceded by a consultation

As Nuria said, this goal has been established a long time ago. The consultation already happened and involved all the relevant mailing lists: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization

Replacing the raw UA entirely is something we could discuss too, but this would need to be preceded by a consultation

As Nuria said, this goal has been established a long time ago. The consultation already happened and involved all the relevant mailing lists: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization

That page doesn't mention a consultation of "all the relevant mailing lists" (which exactly?). In any case, it dates from 2014 and misses, for example, the app version use case discussed above; there may well be others that have come up in the years since then.

I'm not saying the consultation was perfect, only that it already happened. It's always good to store less private data; a new consultation to re-expand collection and storage of PII can be made later if necessary.

Nuria renamed this task from Add user_agent_map field to EventCapsule to Change userAgent field to user_agent_map in EventCapsule. Feb 1 2017, 8:01 PM

Change 335854 had a related patch set uploaded (by Nuria):
Adding uaprser to eventlogging deps

https://gerrit.wikimedia.org/r/335854

Change 335854 abandoned by Nuria:
Adding uaprser to eventlogging deps

Reason:
already done.

https://gerrit.wikimedia.org/r/335854

@Tbayer and @Nemo_bis FYI: we will be deploying this next week. After our work with @Krinkle, we have also included browser minor in the parsed user agent, as that is useful for perf folks. We likely need to add browser minor to the Hive user_agent_map too.

Change 335145 merged by Nuria:
[eventlogging] Change UA string to JSON map

https://gerrit.wikimedia.org/r/335145

We had to revert these changes due to an issue with database columns that we did not detect in beta: https://phabricator.wikimedia.org/T160454

Change 343895 had a related patch set uploaded (by Nuria):
[eventlogging] Change UA string to JSON map

https://gerrit.wikimedia.org/r/343895

Change 343895 merged by Ottomata:
[eventlogging] Change UA string to JSON map

https://gerrit.wikimedia.org/r/343895

Nuria changed the point value for this task from 13 to 21. Mar 29 2017, 6:59 PM

Changes have been deployed to production.

Reminder to also change the description of the userAgent field in the capsule: https://meta.wikimedia.org/wiki/Schema:EventCapsule

Reminder that this hasn't happened yet ;)

Hm, we could just edit the schema and then have a different revision than is used, but functionally the same, or we could update the revision id that's used by EL https://github.com/wikimedia/eventlogging/blob/75ab39cb1e9967b9c7eeeaf6d94200e8025e6c74/eventlogging/schema.py#L67

Sorry, I thought I had commented on this earlier. We do not need to change the capsule, as the type of the field hasn't changed. It is still a string.

Sorry, I thought I had commented on this earlier. We do not need to change the capsule, as the type of the field hasn't changed. It is still a string.

That may be true in a technical sense, but for people reading the documentation and constructing queries, it is important to know that the field now contains data in a valid JSON format.

Also, the documentation is not solely about the data type, but also about the field description, and there it is no longer true that this string is identical to the "User Agent from HTTP request".

Hm, we could just edit the schema and then have a different revision than is used, but functionally the same, or we could update the revision id that's used by EL https://github.com/wikimedia/eventlogging/blob/75ab39cb1e9967b9c7eeeaf6d94200e8025e6c74/eventlogging/schema.py#L67

Even if there are reasons against updating the revision ID in the code, it would be preferable to update the documentation anyway. It's actually pretty routine for the schema page on Meta to be a bit ahead of the code, cf. T153328#3147768, although ideally they should always be in sync. If no one else volunteers, I'll make the edit myself soon, but someone who worked on the actual implementation might be best qualified for that ;)

This comment was removed by Nuria.