Several Oozie jobs have each failed at least a few times. More investigation is needed.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Update UA parsing to limit agent length | analytics/refinery/source | master | +43 -10
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | JAllemandou | T197281 Fix failing webrequest hours (upload and text 2018-06-14-11)
Resolved | | ema | T198152 Size of headers processed by varnish?
Event Timeline
Change 440533 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update UA parsing to limit agent length
One problem is related to user-agent parsing for very long strings:
```
sudo -u hdfs spark2-shell --master yarn \
  --conf spark.dynamicAllocation.maxExecutors=256 \
  --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
```
```scala
import org.wikimedia.analytics.refinery.core.UAParser

val p = "/wmf/data/raw/webrequest/webrequest_upload/hourly/2018/06/14/11"
val rdd = spark.sparkContext.sequenceFile[Long, String](p).map(_._2).toDS
val df = spark.read.json(rdd)
val uas = df.select("user_agent").rdd.map(_.getString(0))

uas.count                         // res1: Long = 170125632
uas.filter(_.length > 512).count  // res2: Long = 64

val t1 = System.currentTimeMillis
uas.filter(_.length <= 512).map(ua => UAParser.getInstance.getUAMap(ua)).count()  // res6: Long = 170125568
val t2 = System.currentTimeMillis
println(t2 - t1)  // 530902
```
The command `uas.filter(_.length > 512).map(ua => UAParser.getInstance.getUAMap(ua)).count()` did not finish after more than half an hour, for only 64 rows, while the 170 million shorter rows parsed in roughly nine minutes.
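To confirm the hang comes from the parser itself rather than from Spark, one could bound a single parse attempt with a timeout in the same shell session. This is a diagnostic sketch only; the 30-second bound and the `longUa` name are illustrative, not part of the session above:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Pull one of the 64 over-long strings to the driver and try to parse it
// locally, bounded by a timeout.
val longUa = uas.filter(_.length > 512).first()
val attempt = Future(UAParser.getInstance.getUAMap(longUa))
try {
  Await.result(attempt, 30.seconds)
  println("parse completed")
} catch {
  case _: java.util.concurrent.TimeoutException =>
    // The stuck thread keeps running after the timeout; acceptable for a
    // one-off check in the shell.
    println("parse did not complete within 30s")
}
```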
I suggest we don't even try to parse user-agent strings longer than 512 characters. See the associated patch.
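A minimal sketch of such a guard, expressed in the Spark session above (the `None` fallback and the `MaxUaLength` name are illustrative; the actual patch changes refinery-source itself):

```scala
// Skip the parser entirely for over-long user agents; treat them as unparsed.
// 512 is the limit suggested here (later raised to 1024).
val MaxUaLength = 512
val parsed = uas.map { ua =>
  if (ua != null && ua.length <= MaxUaLength) Some(UAParser.getInstance.getUAMap(ua))
  else None  // over-long (or null) UAs never reach the regexes
}
parsed.count  // forces evaluation without the pathological strings
```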
Change 440533 merged by Nuria:
[analytics/refinery/source@master] Update UA parsing to limit agent length
I did another quick check this morning: there are some valid user-agent strings longer than 512 characters in our faulty hour (9 of the 64). The other 55 are all identical, each 2035 characters long.
I also successfully parsed user-agents with a length limit of 1024 over the faulty hour, and double-checked how many user-agents would not have been parsed with various limits over another full day of raw webrequest:
- Total number of rows for that day: 3626986512
max UA length | unparsed UA rows | unparsed UA rows %
---|---|---
512 | 1959 | 0.00005%
1024 | 665 | 0.00002%
1536 | 391 | 0.00001%
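For reference, a sketch of how such counts can be gathered, assuming `uas` holds the full day's user-agent strings as in the session above:

```scala
// Count rows whose UA exceeds each candidate limit; cache the lengths so
// the raw data is scanned once rather than re-read per threshold.
val lengths = uas.map(_.length).cache()
val total = lengths.count()
Seq(512, 1024, 1536).foreach { limit =>
  val unparsed = lengths.filter(_ > limit).count()
  println(f"$limit%5d | $unparsed%6d | ${100.0 * unparsed / total}%.5f%%")
}
```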
Something interesting to notice here: parsing failed on our faulty hour not only because of the length of the UA, but also because of its shape. It has the exact shape of a valid user agent, with huge numbers in place of version numbers. I assume the regexes used to parse UAs don't expect that many successive digits, and backtrack badly on them instead of failing at a reasonable speed.
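A toy demonstration of that effect (this pattern is illustrative, not one of the actual UA-parsing regexes): nested quantifiers make a failing match on a long digit run cost exponential time.

```scala
import java.util.regex.Pattern

// A nested-quantifier pattern that requires a trailing dot; a pure digit
// run can never match, so the engine backtracks through every way of
// splitting the digits between the inner and outer quantifier.
val p = Pattern.compile("([0-9]+)+\\.")
for (n <- 20 to 28 by 2) {
  val digits = "9" * n  // no trailing '.', so the match must fail
  val t0 = System.nanoTime
  p.matcher(digits).matches()
  println(f"length $n%2d: ${(System.nanoTime - t0) / 1e6}%.1f ms")  // roughly doubles per extra digit
}
```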
I have provided a patch updating the UA-length limit to 1024, and successfully refined the faulty upload and text hours in test mode with it.