Page MenuHomePhabricator

UAParser should skip parsing User-Agent strings with too many digits
Closed, ResolvedPublic3 Story Points

Description

The webrequest refine job just encountered a UA like:

Mozilla/5.0 (iPhone; CPU iPhone OS 7_5 like Mac OS X) AppleWebKit/19822339667222323933282213333434478878822231377998767332218874522198879677665983774456349576112944300985599848738873998889778898337233422187988557789982445897762223998977689827761221441991214887556887113722333422399878366532952235922176855545655632887888899555448875549988733888945584822189332889658888799852222121236445665456678363346638665221221444573323399821129557764877687887767772236888979821622711998889288876887765444244121499845546623554396588788659669821133466861199824453259965299499715549857742110776098872821134231796111233458765432509879812189999558899907998355711222176667112778700933275876328764683870990965 (KHTML, like Gecko) Version/5.1 Mobile/9334 Safari/7548.320

This long digit string is causing the regex parser to timeout. We should just add a MAX_UA_DIGIT_COUNT in UAParser.java to skip parsing UA strings with too many digits, like we do if they bigger than MAX_UA_LENGTH.

Event Timeline

Ottomata created this task.May 22 2019, 8:46 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 22 2019, 8:46 PM
fdans triaged this task as High priority.Jun 3 2019, 3:52 PM
fdans moved this task from Incoming to Ops Week on the Analytics board.
mforns claimed this task.Jun 4 2019, 12:21 PM
mforns added a project: Analytics-Kanban.
mforns added a comment.Jun 4 2019, 1:03 PM

The example in the task description has 623 digits in the number after AppleWebKit/.
What do you think is a good threshold?

Here's a small study of user_agent digit count in wmf.webrequest:

select int((size(split(user_agent, '[0-9]')) - 1) / 10), count(*) from wmf.webrequest where webrequest_source='text' and year=2019 and month=5 and day=1 and hour=0 group by int((size(split(user_agent, '[0-9]')) - 1) / 10);

digit count (tens)	frequency
0	22158157
1	19973930
2	128126942
3	19886709
4	1339899
5	291360
6	66892
7	8065
8	1361
9	1493
10	509
11	79
12	49
15	23
16	16
18	15
21	9
24	8
27	5
30	5
33	5
36	5
39	5
42	3
45	3
48	3
50	22

It seems that the long tail is already very narrow at MAX_UA_DIGIT_COUNT=200.
If we used 200, we'd be throwing away around 0.00003%, which I believe is neglectable (3 in 10 million).

If we used 100, we'd be throwing away around 0.0001%, which I think is still super acceptable (one in a million).
And we'd be cutting off user_agents with more than 100 digits, which I'd say is already weird.

Dunno, to make it conservative, maybe pick MAX_UA_DIGIT_COUNT=200?

Change 514295 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Do not parse webrequest user agents with too many digits

https://gerrit.wikimedia.org/r/514295

Nuria added a subscriber: Nuria.EditedJun 4 2019, 1:35 PM

Edited: changed 200 to 400
On my work on bots i found that 400 was a good threshold. Anything beyond that was more often than not malicious.

Anything sounds fine to me. 200 or 400 :)

mforns added a comment.Jun 4 2019, 4:18 PM

I think @Nuria is talking about the length threshold, no?

mforns added a comment.Jun 4 2019, 4:42 PM

As per Nuria's CR comment I believe she is referring to the MAX_UA_LENGTH limit and not the MAX_UA_DIGIT_COUNT when she mentions to change to 400.

I did this small analysis of the user_agent lengths:

select int(length(user_agent) / 10) as len, count(*) as num from wmf.webrequest where webrequest_source='text' and year=2019 and month=5 and day=1 and hour=0 group by int(length(user_agent) / 10) order by len limit 1000;

len (in tens!)	num
0	4011236
1	3841091
2	2280603
3	2901686
4	6633041
5	2466390
6	11541987
7	11313006
8	3338428
9	852109
10	6419009
11	33017069
12	41952199
13	41413132
14	8601067
15	2033431
16	5637764
17	817504
18	656053
19	863553
20	928940
21	172661
22	45226
23	6211
24	74115
25	5193
26	18066
27	7995
28	3105
29	2335
30	589
31	163
32	81
33	97
34	13
35	2
36	120
37	39
38	13
39	29
41	17
44	1
48	33
60	23
66	40
72	15
84	9
96	8
105	16
108	5
121	5
133	5
145	5
157	5
169	3
181	3
193	3
203	25

I agree we could decrease the current MAX_UA_LENGTH=1024 to MAX_UA_LENGTH=400/500.
Regarding the MAX_UA_DIGIT_COUNT=200, it might still be needed for user agents that are smaller than 400/500 but have lots of numbers in them.

If we reduce MAX_UA_LENGTH, then perhaps we don't need MAX_UA_DIGIT_COUNT? I'm fine with having both if you prefer.

Change 514295 merged by jenkins-bot:
[analytics/refinery/source@master] Do not parse webrequest user agents with too many digits

https://gerrit.wikimedia.org/r/514295

mforns moved this task from In Code Review to Done on the Analytics-Kanban board.Jun 6 2019, 2:31 PM
Nuria closed this task as Resolved.Jun 10 2019, 5:14 PM
Nuria set the point value for this task to 3.