Page MenuHomePhabricator

Refine should parse user agent field as it is done on refinery pipeline
Closed, ResolvedPublic21 Story Points

Event Timeline

Nuria created this task.Oct 17 2017, 9:32 PM
Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptOct 17 2017, 9:32 PM

Since the current code knows nothing about schemas ahead of time, this might be a little difficult.

We could add a pluggable transform step (function) that would map the JSON DataFrame before it is converted to a hive schema and 'refined'.

Perhaps! We could detect the existence of a JSON string in a string field, and auto-parse it as a JSON sub object and use a struct. Since EventLogging Analytics auto converts userAgent into a JSON string, this would cause the userAgent field in Hive to be transformed into a struct.

Or, we could just leave this as is. The data in userAgent is sanitized and is queryable via Hive JSON function.

hive (otto_json_refine_test)> select json_tuple(userAgent, 'browser_family', 'browser_major') from centralnoticebannerhistory where year='2017' and month='8' and day='11' limit 4;
OK
c0	c1
Chrome	60
Chrome	59
Chrome	60
Safari	10
Nuria added a comment.Nov 2 2017, 8:09 PM

Doesn't seem that json-tuple is that friendly on a where clause right? I just cannot get it to work in something like:

select json_tuple(userAgent, 'browser_family', 'browser_major') from centralnoticebannerhistory h, lateral view json_tuple(userAgent, 'browser_family', 'browser_major) as ua bf, bm where year='2017' and month='8' and day='11' and bm like '55' limit 4

Yeah a little nasty.

It'd be better to do this in EventLogging processor or something though. We're already parsing the user agent into a JSON string there. The only reason we didn't turn it into an object was for compatibility with existing MySQL table varchar userAgent fields.

I'm already going to have to do some custom tranformation to solve T179540, perhaps we can just take the userAgent dict and serialize that to a JSON string before inserting into MySQL too.

Nuria added a comment.Nov 2 2017, 8:58 PM

I agree to do it in EL makes most sense , totally, the field should not be coupled to the ability of the storage to handle / not to handle a nested object.

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.Nov 6 2017, 4:14 PM

Change 393613 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Pop capsule up to top level before applying format caster functions

https://gerrit.wikimedia.org/r/393613

Change 393613 merged by Milimetric:
[eventlogging@master] Prefer event capsule defined fields over those found in raw log line

https://gerrit.wikimedia.org/r/393613

Mentioned in SAL (#wikimedia-operations) [2017-11-28T14:40:54Z] <otto@tin> Started deploy [eventlogging/analytics@c464b8c]: Fixing bug where userAgent set by client producer was not used T178440

Mentioned in SAL (#wikimedia-operations) [2017-11-28T14:41:00Z] <otto@tin> Finished deploy [eventlogging/analytics@c464b8c]: Fixing bug where userAgent set by client producer was not used T178440 (duration: 00m 02s)