
Update UA parser for better spider traffic classification {hawk} [8 pts]
Closed, Resolved | Public

Description

Can we auto-update the UA parser code? We need to determine this before starting the task.

It appears that Bing bots (e.g., bingbot and BingPreview) may be getting classified as user traffic. Should they be classified as spider traffic instead?

hive> select day, user_agent_map['browser_family'], count(*) from
wmf.webrequest where
year = 2015 and month = 7 and day < 15
and (hour = 1 or hour = 7 or hour = 13 or hour = 19)
and access_method = 'mobile web'
and uri_host like '%.m.wikipedia.org'
and is_pageview = true and agent_type = 'user'
group by day, user_agent_map['browser_family'];

1	Mobile Safari	12132057
1	Chrome Mobile	8801804
1	Android	3749250
1	Other	2528518
1	Opera Mini	1403237
1	Chrome	900893
1	IE Mobile	681078
1	Chrome Mobile iOS	621432
1	BingPreview	449956
1	UC Browser	297750
1	IE	204187
1	Firefox Mobile	178981
1	BlackBerry WebKit	152236
1	Amazon Silk	118894
1	Opera Mobile	102316
1	NetFront NX	64352
1	Firefox	53725
1	Yandex Browser	50416
1	UP.Browser	40531
1	NetFront	37946
1	Nokia Browser	29383
1	bingbot	28545
1	Nokia Services (WAP) Browser	24726
1	Opera	22199
1	Safari	19415
1	BlackBerry	18870
...
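
As a rough illustration of the reclassification being asked about, the sketch below tags a browser family as spider when its name matches a simple bot-like pattern, then sums the day-1 counts from the output above for the families it flags. The pattern and the function name `agent_type_for` are assumptions made for illustration; they are not the rules used by ua-parser or by the webrequest refinement code.

```python
import re

# Hypothetical pattern, for illustration only; not the production rule.
BOT_PATTERN = re.compile(r"bot|preview|spider|crawler", re.IGNORECASE)


def agent_type_for(browser_family):
    """Return 'spider' if the family name looks bot-like, else 'user'."""
    return "spider" if BOT_PATTERN.search(browser_family) else "user"


# A few day-1 rows copied from the query output above.
rows = [
    ("Mobile Safari", 12132057),
    ("Chrome Mobile", 8801804),
    ("BingPreview", 449956),
    ("bingbot", 28545),
]

# Pageviews currently counted as 'user' that this sketch would flag as spider.
misclassified = sum(
    count for family, count in rows if agent_type_for(family) == "spider"
)
print(misclassified)
```

Name matching on the browser family is only a spot check; the actual fix discussed in this task is updating the UA parser itself.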

Event Timeline

dr0ptp4kt raised the priority of this task from to Needs Triage.
dr0ptp4kt updated the task description.
dr0ptp4kt added a project: Analytics.
dr0ptp4kt moved this task to Incoming on the Analytics board.
dr0ptp4kt added a subscriber: dr0ptp4kt.
kevinator lowered the priority of this task from High to Medium. Aug 3 2015, 4:54 PM
kevinator moved this task from Incoming to Medium on the Analytics-Backlog board.
kevinator renamed this task from Classification of Bing robots as spider traffic instead of user traffic to Classification of Bing robots as spider traffic instead of user traffic {hawk}. Aug 3 2015, 4:58 PM

@dr0ptp4kt is this urgent? If so, we'll look into manually updating the UA parser to the latest codebase, which should help.
Otherwise, we want to look into how to auto-update the code properly.

@kevinator, I don't think it's urgent urgent. The only thing I would say is it may make sense to spot check any stats looking at the past couple months in case it has outsize influence. I think the aggregation of PVs from these UAs and other similar UAs is probably more interesting for looking at the agent_type = 'user' traffic than just looking at these specific UAs. Cool to learn you're looking at even more auto updating!

kevinator raised the priority of this task from Medium to High. Aug 7 2015, 4:21 PM
kevinator removed a project: Analytics-Kanban.
kevinator moved this task from Medium to Prioritized on the Analytics-Backlog board.
kevinator renamed this task from Classification of Bing robots as spider traffic instead of user traffic {hawk} to Update UA parser for better spider traffic classification {hawk} [ pts]. Aug 17 2015, 4:39 PM
kevinator updated the task description.
JAllemandou renamed this task from Update UA parser for better spider traffic classification {hawk} [ pts] to Update UA parser for better spider traffic classification {hawk} [5 pts]. Sep 7 2015, 10:52 AM
JAllemandou claimed this task.
JAllemandou edited projects, added Analytics-Kanban; removed Analytics-Backlog.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou renamed this task from Update UA parser for better spider traffic classification {hawk} [5 pts] to Update UA parser for better spider traffic classification {hawk} [8 pts]. Sep 7 2015, 1:43 PM

Thanks @QChris, I found ways to check the differences between the repos, and I now understand why we repackaged ua-parser.

It seems that ua-parser has now split its repos, with one uap-core repo and one repo per language.

uap-java contains the exact same code we are currently using, but uap-core contains many changes in the regexes.yaml file.

I'll raise the point in standup/tasking today as to how to update based on the new repos.
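
One way to check how far two copies of the regexes YAML file have drifted apart is to compare the sets of patterns they define. The sketch below is a minimal, standard-library-only approach that assumes each pattern sits on its own `- regex: '...'` line; the helper names and the sample content are made up for illustration and are not taken from either repo.

```python
import re

# Matches lines such as "  - regex: '(bingbot)/(\d+)'" (assumed file layout).
REGEX_LINE = re.compile(r"^\s*-\s*regex:\s*(.+?)\s*$")


def extract_patterns(yaml_text):
    """Collect the pattern strings found on '- regex:' lines."""
    patterns = set()
    for line in yaml_text.splitlines():
        match = REGEX_LINE.match(line)
        if match:
            # Drop the surrounding YAML quotes, if any.
            patterns.add(match.group(1).strip("'\""))
    return patterns


def diff_patterns(old_text, new_text):
    """Return (added, removed) patterns when going from old_text to new_text."""
    old, new = extract_patterns(old_text), extract_patterns(new_text)
    return sorted(new - old), sorted(old - new)


# Made-up sample content, not copied from uap-java or uap-core.
OLD_SAMPLE = r"""
user_agent_parsers:
  - regex: '(Firefox)/(\d+)'
"""
NEW_SAMPLE = r"""
user_agent_parsers:
  - regex: '(Firefox)/(\d+)'
  - regex: '(bingbot)/(\d+)\.(\d+)'
"""

added, removed = diff_patterns(OLD_SAMPLE, NEW_SAMPLE)
print(added, removed)
```

A line-based scan like this ignores the replacement fields attached to each pattern and the order of the entries, both of which can affect how a user agent is parsed, so it is only a first pass before a full diff.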

Change 238139 had a related patch set uploaded (by Joal):
Update code and separated repos scheme

https://gerrit.wikimedia.org/r/238139

Change 238139 merged by Nuria:
Update code and separated repos scheme

https://gerrit.wikimedia.org/r/238139

Change 238302 had a related patch set uploaded (by Joal):
Update ua-parser version to 1.3.0-wmf2 and tests

https://gerrit.wikimedia.org/r/238302

Change 238302 merged by Nuria:
Update ua-parser version to 1.3.0-wmf2 and tests

https://gerrit.wikimedia.org/r/238302

Change 237419 had a related patch set uploaded (by Joal):
Bump core and hive jar versions

https://gerrit.wikimedia.org/r/237419

Change 237419 merged by Joal:
Bump core and hive jar versions

https://gerrit.wikimedia.org/r/237419