We noticed that web requests referred from http://google.com weren't being identified by the referer classifier UDF because the regex required a period that should have been optional. This has been patched, but our data collection codebase "golden" still needs to be updated to use the new JARs (e.g. /home/bearloga/Code/refineries/refinery-hive-0.0.28-SNAPSHOT.jar).
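To illustrate the kind of bug involved, here is a minimal sketch in Python. Both patterns are hypothetical stand-ins written for this example, not the actual regex from the refinery source; the only point is to show how a required period can exclude a bare http://google.com referer while an optional one accepts it.

```python
import re

# Hypothetical patterns for illustration only; the real regex may differ.
# BEFORE: a subdomain followed by a period is required ahead of "google".
BEFORE = re.compile(r"^https?://(?:[\w-]+\.)google\.[a-z]+")
# AFTER: the subdomain-plus-period group is optional.
AFTER = re.compile(r"^https?://(?:[\w-]+\.)?google\.[a-z]+")


def is_google(pattern, referer):
    """Return True if the referer URL matches the given pattern."""
    return pattern.search(referer) is not None


for url in ("http://www.google.com/", "http://google.com/"):
    print(url, is_google(BEFORE, url), is_google(AFTER, url))
```

With these stand-in patterns, the bare-domain referer is only recognized once the period is made optional.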
Specifically, we need to update the scripts external_traffic/search_referers.R and portal/referers.R, which use an outdated JAR that was packaged before the patch was finalized. Changes in the new JAR include:
- RefererClassifyUDF became SmartReferrerClassifierUDF
- The output value "Search engine" became "external (search engine)"
- The output value "Other" became "external"
- Additionally, SmartReferrerClassifierUDF outputs:
  - "internal" where the referer is wikipedia/mediawiki/etc. (if we include this in the aggregate dataset, it would resolve T129137)
  - "none" if empty or "-"
  - "unknown" otherwise (see classifyReferer() in Webrequest.java)
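The label changes listed above could be captured in a small lookup, which may help when relabeling historical data or updating dashboard code that still refers to the old values. This is only a sketch: the names OLD_TO_NEW, NEW_LABELS, and relabel are hypothetical and do not exist in golden or in the dashboards.

```python
# Hypothetical helper; names are assumptions made for this example.
# Old RefererClassifyUDF labels mapped to their SmartReferrerClassifierUDF equivalents.
OLD_TO_NEW = {
    "Search engine": "external (search engine)",
    "Other": "external",
}

# Every label the new UDF can emit, per the list above.
NEW_LABELS = {
    "external (search engine)",
    "external",
    "internal",
    "none",
    "unknown",
}


def relabel(label):
    """Translate an old label to the new one; pass new labels through unchanged."""
    new_label = OLD_TO_NEW.get(label, label)
    if new_label not in NEW_LABELS:
        raise ValueError(f"Unrecognized referer class: {label!r}")
    return new_label


print(relabel("Search engine"))
print(relabel("internal"))
```

Raising on unrecognized values is a design choice for the sketch, so that any label not covered by the list above is noticed rather than silently carried through.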
Requests from http://google.com weren't being counted as coming from Google, but they will be once golden is updated. We therefore expect the share of referral traffic attributed to Google to rise after the patch, and this change should be annotated on the External Traffic and Portal::External Traffic dashboards.
Finally, the external traffic part of the Portal dashboard (which references "Search engine" and "Other" labels) would need to be updated to use the new labels once golden has been patched.