EventLogging does not properly classify KaiOS user agents
Closed, Resolved · Public

Description

In November 2019, we updated ua-parser to detect KaiOS devices and deployed the new version (T237743). The updated detection has been working properly in the pageview_hourly data stream (T244547#5979835, T244547#5989789).

However, the updated detection does not seem to be functioning in EventLogging data streams. The InukaPageView stream is getting a lot of events from KaiOS devices, but none of them are being classified as KaiOS. 98% are being classified as Firefox OS.

SELECT
  COUNT(1) as kaios_events,
  SUM(CAST(useragent.os_family = "KaiOS" AS INT)) / COUNT(1) AS kaios_classified,
  SUM(CAST(useragent.os_family = "Firefox OS" AS INT)) / COUNT(1) AS firefox_os_classified
FROM event.inukapageview ipv
WHERE
  event.client_type IN ("kaios-web", "kaios-app") AND
  year > 0

   kaios_events  kaios_classified  firefox_os_classified
0        416003             0.000                  0.978

Perhaps the ua-parser version needs to be separately updated for EventLogging?

Event Timeline

Nuria assigned this task to Milimetric. Mar 27 2020, 4:04 PM
Nuria added a project: Analytics-Kanban.

Ok, I'm looking into this. Will document my debug steps.

  1. Looked at raw data: kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging-client-side | grep -i kaios | grep InukaPageView
  2. Applied UA parser to some of the user agent strings found:
ADD JAR hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.119.jar;
CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
select ua_parser('...');
-> "os_family":"KaiOS"
  3. Applied the Python version of UA parser to the same strings:
from ua_parser import user_agent_parser
user_agent_parser.Parse('...')
-> ... { os { family: u'KaiOS'
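
The spot check in the steps above can be repeated over a batch of user agent strings. A minimal sketch, assuming the parser is passed in as a callable that returns a dict shaped like the output of user_agent_parser.Parse; the helper name tally_os_families and the stub parser are hypothetical and exist only so the example runs without ua-parser installed:

```python
from collections import Counter


def tally_os_families(user_agents, parse):
    """Count the OS family that `parse` reports for each user agent string.

    `parse` is any callable returning a dict with an "os" -> "family" entry,
    for example ua_parser.user_agent_parser.Parse.
    """
    counts = Counter()
    for ua in user_agents:
        result = parse(ua)
        counts[result["os"]["family"]] += 1
    return counts


def stub_parse(ua):
    # Hypothetical stand-in for a real parser, used only so this sketch runs
    # on its own. It is NOT how ua-parser matches strings.
    family = "KaiOS" if "KaiOS/" in ua else "Other"
    return {"os": {"family": family}}


sample = [
    "Mozilla/5.0 (Mobile; ALCATEL4044T; rv:37.0) Gecko/37.0 Firefox/37.0 KaiOS/1.0",
]
print(tally_os_families(sample, stub_parse))
```

On a host with ua-parser installed, passing user_agent_parser.Parse in place of stub_parse would tally what that host's parser version actually reports.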

That's with the latest code. But indeed, if I run your query or look at eventlogging-valid-mixed or eventlogging_InukaPageView, I don't see the right os family in the user agent map. So maybe the code running somewhere is not the latest?

Milimetric triaged this task as High priority. Mar 27 2020, 8:39 PM
Milimetric moved this task from Incoming to Ops Week on the Analytics board.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

It looks like our python-ua-parser debian package deployed on eventlog1002 is old. I just had a quick go at repackaging it, but failed. It looks like some upstream stuff has changed enough that our debian/patches don't apply anymore.

But, overall, it kinda sucks that we have 2 separate sets of user agent parsing logic. In T238230 (Decommission EventLogging backend components by migrating to MEP), I'm currently working on getting Refine to parse EventLogging data the same way other data in Hadoop is parsed. The plan there is to slowly switch each topic over to the new system.

The right thing to do here would be to accelerate the migration of InukaPageView over to EventGate. We are waiting for some patches to be reviewed and merged and deployed before we can do this.

Or, we could take some time and rebuild the python-ua-parser package. This probably isn't THAT hard to do, I just wasn't able to do it in 5 minutes :p
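
One quick way to see which parser version a given host has installed is to ask Python's package metadata. This is only a sketch: it assumes Python 3.8+ (for importlib.metadata) and that the Debian package registers itself under the PyPI distribution name "ua-parser", neither of which is confirmed in this task. The helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "ua-parser" is the PyPI distribution name; assumed here to match what the
# Debian package registers on the host.
print(installed_version("ua-parser"))
```

A version string alone does not prove which regex definitions are bundled, so parsing a known KaiOS user agent string remains the more direct check.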

Milimetric added a comment. Edited · Apr 16 2020, 7:34 PM

What Andrew said. Also, as for the data, we keep the raw events as Varnish sees them for 90 days, so we have as much data as you have in event.inukapageview. Let us know if you need to re-parse past data, and I can backfill it for you (it's a bit of a pain, but if you need it, it's fine).

Thanks for figuring this out, y'all! Don't worry, there's no need to backfill the old data 😊

Nuria added a subscriber: Nuria. Edited · Apr 19 2020, 10:18 PM

@Ottomata let's please try updating the debian package

Done. Tested on eventlog1002 with

from ua_parser import user_agent_parser
user_agent_parser.Parse("Mozilla/5.0 (Mobile; ALCATEL4044T; rv:37.0) Gecko/37.0 Firefox/37.0 KaiOS/1.0")

Before:

{'user_agent': {'minor': '0', 'family': 'Firefox Mobile', 'major': '37', 'patch': None}, 'string': 'Mozilla/5.0 (Mobile; ALCATEL4044T; rv:37.0) Gecko/37.0 Firefox/37.0 KaiOS/1.0', 'device': {'family': 'Generic Smartphone', 'brand': 'Generic', 'model': 'Smartphone'}, 'os': {'minor': None, 'family': 'Firefox OS', 'major': None, 'patch_minor': None, 'patch': None}}

After:

{'user_agent': {'major': '37', 'minor': '0', 'patch': None, 'family': 'Firefox Mobile'}, 'string': 'Mozilla/5.0 (Mobile; ALCATEL4044T; rv:37.0) Gecko/37.0 Firefox/37.0 KaiOS/1.0', 'device': {'model': '4044T', 'brand': 'Alcatel', 'family': 'Alcatel 4044T'}, 'os': {'patch_minor': None, 'major': '1', 'minor': '0', 'patch': None, 'family': 'KaiOS'}}
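
The before/after difference is consistent with how ua-parser works: it tries an ordered list of regexes and the first match wins, so a rule list without a KaiOS entry falls through to a Firefox OS pattern. The sketch below illustrates that mechanism only. The two patterns are hypothetical and heavily simplified, not the actual uap-core definitions, and os_family is an illustrative helper rather than part of ua-parser.

```python
import re

# Hypothetical, heavily simplified rules. The real uap-core regexes differ.
OLD_OS_RULES = [
    (re.compile(r"\(Mobile;.*Gecko/.*Firefox/"), "Firefox OS"),
]
# Newer rule list: a KaiOS rule is tried before the Firefox OS rule.
NEW_OS_RULES = [
    (re.compile(r"KaiOS/(\d+)\.(\d+)"), "KaiOS"),
] + OLD_OS_RULES


def os_family(ua, rules):
    """Return the family of the first rule whose pattern matches, else 'Other'."""
    for pattern, family in rules:
        if pattern.search(ua):
            return family
    return "Other"


UA = "Mozilla/5.0 (Mobile; ALCATEL4044T; rv:37.0) Gecko/37.0 Firefox/37.0 KaiOS/1.0"
print(os_family(UA, OLD_OS_RULES))  # Firefox OS
print(os_family(UA, NEW_OS_RULES))  # KaiOS
```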

Nuria closed this task as Resolved. Apr 20 2020, 2:48 PM
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board. Apr 20 2020, 3:03 PM