
Fix failing webrequest hours (upload and text 2018-06-14-11)
Closed, ResolvedPublic


Oozie jobs have been failing at least a few times each. More investigation needed.

Event Timeline

JAllemandou updated the task description.
JAllemandou added a project: Analytics-Kanban.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
This comment was removed by akosiaris.
akosiaris raised the priority of this task from Lowest to Needs Triage. Jun 15 2018, 9:56 AM
akosiaris removed a project: acl*security.
akosiaris changed the visibility from "Custom Policy" to "Public (No Login Required)".
akosiaris moved this task from In Code Review to In Progress on the Analytics-Kanban board.
akosiaris added a subscriber: akosiaris.

Change 440533 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update UA parsing to limit agent length

One problem is related to user-agent parsing for very long strings:

sudo -u hdfs spark2-shell --master yarn --conf spark.dynamicAllocation.maxExecutors=256 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar

val p = "/wmf/data/raw/webrequest/webrequest_upload/hourly/2018/06/14/11"
val rdd = spark.sparkContext.sequenceFile[Long, String](p).map(_._2).toDS
val df = spark.read.json(rdd)
val uas ="user_agent").as[String]
uas.count
// res1: Long = 170125632

uas.filter(_.length > 512).count
// res2: Long = 64

val t1 = System.currentTimeMillis
uas.filter(_.length <= 512).map(ua => UAParser.getInstance.getUAMap(ua)).count()
// res6: Long = 170125568
val t2 = System.currentTimeMillis
println(t2 - t1)

The command uas.filter(_.length > 512).map(ua => UAParser.getInstance.getUAMap(ua)).count() did not finish after more than half an hour (for only 64 rows!).
I suggest we don't even try to parse user-agent strings longer than 512 characters. See the associated patch.
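A minimal sketch of the guard suggested above, as plain Scala: skip parsing for oversized user-agent strings and return nothing instead. The wrapper object and method names are illustrative, not the actual refinery-source API; the parse function is passed in so the sketch stays independent of UAParser.

```scala
// Hypothetical sketch of a UA-length guard (names are illustrative).
object UaLengthGuard {
  // Limit proposed in the patch for this task (later raised to 1024).
  val MaxUaLength = 512

  // Parse only strings at or under the limit; return None for oversized ones
  // so the caller can record an "unknown" user agent instead of hanging.
  def parseIfShortEnough(ua: String,
                         parse: String => Map[String, String]): Option[Map[String, String]] =
    if (ua.length <= MaxUaLength) Some(parse(ua)) else None
}
```

In the real job the fallback branch would presumably emit the same "unknown" UA map used for unparseable agents, so downstream schemas are unaffected.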

Change 440533 merged by Nuria:
[analytics/refinery/source@master] Update UA parsing to limit agent length

I did another quick check this morning: there are some valid user-agent strings longer than 512 characters in our faulty hour (9 out of 64). The other 55 are identical to each other, each of length 2035.
I also successfully parsed user-agents with a length limit of 1024 over the faulty hour, and double-checked how many user-agents would not have been parsed under various limits for another full day of raw webrequest:

  • Total number of rows for that day: 3626986512
max UA length | unparsed UA rows | unparsed UA rows %
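The per-limit check described above can be sketched in plain Scala (standing in for the Spark job, so it stays self-contained): given the lengths of the user-agent strings, count how many rows each candidate limit would leave unparsed. The function name and the limit values are illustrative.

```scala
// Hypothetical sketch: for each candidate UA-length limit, count the rows
// whose user-agent string exceeds it and would therefore go unparsed.
def unparsedCounts(uaLengths: Seq[Int], limits: Seq[Int]): Map[Int, Long] =
  limits.map(l => l -> uaLengths.count(_ > l).toLong).toMap
```

On the cluster the same shape would run over the `uas` Dataset from the shell session above, e.g. `uas.filter(_.length > l).count()` for each limit `l`.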

Something interesting to notice here: the reason parsing failed on our faulty hour is not only the length of the UA, but also its shape. It has the exact shape of a valid user agent, with huge numbers in place of versions. I assume the regexes parsing the UAs don't expect that many successive digits, so they cannot run at a reasonable speed.

I have provided a patch raising the UA-length limit to 1024, and successfully refined the faulty upload and text hours in test mode with it.

Vvjjkkii renamed this task from Fix failing webrequest hours (upload and text 2018-06-14-11) to azaaaaaaaa. Jul 1 2018, 1:03 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed JAllemandou as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii edited subscribers, added: JAllemandou; removed: gerritbot, Aklapper.
CommunityTechBot renamed this task from azaaaaaaaa to Fix failing webrequest hours (upload and text 2018-06-14-11). Jul 2 2018, 6:30 AM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to JAllemandou.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot edited subscribers, added: gerritbot, Aklapper; removed: JAllemandou.