
eventlogging user agent data should be parsed so spiders can be easily identified {flea}
Closed, Duplicate · Public

Description

This is a request from the Reading team.

We could pre-parse UA data using the ua-parser Python library.

The original request was to have an is_spider column similar to our Hive parsing.
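
A minimal sketch of what deriving such a flag could look like, using only the Python standard library. Everything named here is an assumption for illustration: the `SPIDER_PATTERN` regex, the `is_spider` and `refine_event` helpers, and the `userAgent` / `deviceFamily` event keys are hypothetical, and this is not the refinery logic. It also assumes that an upstream UA parser marks bots with a device family of "Spider", which should be checked against the parser actually used.

```python
import re

# Hypothetical pattern for illustration only; real bot detection
# would rely on a maintained list rather than this short regex.
SPIDER_PATTERN = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)


def is_spider(user_agent, device_family=None):
    """Return True if the user agent looks like an automated client.

    device_family is an optional value from an upstream UA parser;
    this sketch assumes such a parser labels bots as "Spider".
    """
    if device_family == "Spider":
        return True
    if not user_agent:
        return False
    return SPIDER_PATTERN.search(user_agent) is not None


def refine_event(event):
    """Return a copy of an event dict with an added is_spider field.

    The "userAgent" and "deviceFamily" keys are assumed names.
    """
    refined = dict(event)
    refined["is_spider"] = is_spider(
        event.get("userAgent"), event.get("deviceFamily")
    )
    return refined
```

Returning a copy rather than mutating the event keeps the raw record untouched, so the flag can be recomputed later if the detection rules change.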

Event Timeline

Nuria raised the priority of this task from to Needs Triage.
Nuria updated the task description.
Nuria added a project: Analytics-Backlog.
Nuria added a subscriber: Nuria.
Milimetric renamed this task from 'is_spider' column in eventlogging user agent data to 'is_spider' column in eventlogging user agent data {flea}. Dec 17 2015, 6:20 PM
Milimetric triaged this task as Medium priority.
Milimetric set Security to None.
Milimetric moved this task from Incoming to Backlog on the Analytics-Backlog board.

The easiest way to add user-agent refinement to eventlogging would be to run the refinery code, through Hive or Spark, on the eventlogging data stored in Hadoop.

Adding this column to the capsule requires work on the EL MySQL database side, which is having a lot of issues right now, because a new column needs to be added to every single table. This is not likely to get done in the near term.

Just a thought: it would be much better to reuse the Hadoop logic for this, as we have the code to parse bots ready to go.

Nuria renamed this task from 'is_spider' column in eventlogging user agent data {flea} to eventlogging user agent data should be parsed so spiders can be easily identified {flea}. Mar 7 2016, 5:19 PM
Nuria updated the task description.

Change 311127 had a related patch set uploaded (by Joal):
Add singleton capacity to UAParser

https://gerrit.wikimedia.org/r/311127

Change 311127 merged by jenkins-bot:
Add singleton capacity to UAParser

https://gerrit.wikimedia.org/r/311127

Update: we will be replacing the data held in the user agent column with the parsed version. Resolving this ticket as a duplicate.