We noticed that web requests coming from http://google.com weren't getting identified by the referer classifier UDF because of a non-optional period in the regex. This has been [[ https://gerrit.wikimedia.org/r/#/c/277679/ | patched ]], but now our data collection codebase "**golden**" needs to be updated to use the new JARs (e.g. **/home/bearloga/Code/refineries/refinery-hive-0.0.28-SNAPSHOT.jar**).
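For illustration, here is a minimal sketch of how a non-optional period can cause this kind of miss. Both patterns are hypothetical stand-ins written for this example; they are not the actual regex from the refinery source.

```python
import re

# Hypothetical patterns for illustration only; NOT the real refinery regex.
# STRICT makes "www" optional but still requires the period that follows it,
# so a bare "google.com" host cannot match.
STRICT = re.compile(r"^https?://(www)?\.google\.")
# LENIENT makes the whole "www." prefix (period included) optional.
LENIENT = re.compile(r"^https?://(www\.)?google\.")


def matches(pattern, referer):
    """Return True if the referer matches the given compiled pattern."""
    return pattern.match(referer) is not None


for url in ("http://www.google.com", "http://google.com"):
    print(url, "strict:", matches(STRICT, url), "lenient:", matches(LENIENT, url))
```

Under these assumed patterns, only the lenient version identifies the bare-domain referer.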
Specifically, we need to update the scripts **external_traffic/search_referers.R** and **portal/referers.R**, both of which use an outdated JAR that was packaged before the [[ https://gerrit.wikimedia.org/r/#/c/247601/ | patch ]] was [[ https://git.wikimedia.org/commit/analytics%2Frefinery%2Fsource.git/80c6918206d6e46b8d6975c81bce76a6337acdf9 | finalized ]]. Changes include:
- RefererClassifyUDF became SmartReferrerClassifierUDF
- The output value "Search engine" became "external (search engine)"
- The output value "Other" became "external"
- Additionally, SmartReferrerClassifierUDF outputs:
  - "internal" where the referer is wikipedia/mediawiki/etc. (if we include this in the aggregate dataset, it would resolve T129137)
  - "none" if empty or "-"
  - "unknown" otherwise (see [[ https://git.wikimedia.org/blob/analytics%2Frefinery%2Fsource.git/master/refinery-core%2Fsrc%2Fmain%2Fjava%2Forg%2Fwikimedia%2Fanalytics%2Frefinery%2Fcore%2FWebrequest.java#L179 | classifyReferer() ]] in [[ https://git.wikimedia.org/blob/analytics%2Frefinery%2Fsource.git/master/refinery-core%2Fsrc%2Fmain%2Fjava%2Forg%2Fwikimedia%2Fanalytics%2Frefinery%2Fcore%2FWebrequest.java | Webrequest.java ]])
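As a rough mental model of the five output values listed above, the sketch below mimics the classification in Python. It is a simplified approximation, not a port of classifyReferer(): the domain lists are assumed and incomplete, and treating an unparseable referer as "unknown" is our assumption about what "otherwise" covers.

```python
from urllib.parse import urlparse

# Assumed, incomplete lists for illustration only; the real classifier
# in Webrequest.java has its own rules.
INTERNAL_SUFFIXES = ("wikipedia.org", "mediawiki.org", "wikimedia.org")
SEARCH_ENGINE_NAMES = ("google", "bing", "yahoo")


def classify_referer(referer):
    """Approximate the output labels of SmartReferrerClassifierUDF."""
    # "none" if the referer is empty or "-"
    if referer is None or referer.strip() in ("", "-"):
        return "none"
    host = urlparse(referer).hostname
    # Assumption: a referer with no parseable host counts as "unknown"
    if host is None:
        return "unknown"
    # "internal" for Wikimedia-run domains and their subdomains
    if any(host == s or host.endswith("." + s) for s in INTERNAL_SUFFIXES):
        return "internal"
    # Search engines are matched on a whole host label, e.g. "google" in google.com
    if any(name in host.split(".") for name in SEARCH_ENGINE_NAMES):
        return "external (search engine)"
    return "external"


for example in ("http://google.com", "https://en.wikipedia.org/wiki/Foo", "-", "http://example.com/"):
    print(example, "->", classify_referer(example))
```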
Because requests from http://google.com weren't being counted as coming from Google but will be once **golden** is updated, we expect Google's share of referral traffic to rise. This change should be annotated on the External Traffic and Portal::External Traffic dashboards.
Finally, the external traffic part of the Portal dashboard (which references the "Search engine" and "Other" labels) will need to be updated to use the new labels once **golden** has been patched.
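One way to keep historical rows comparable with newly collected ones is to translate the old labels to the new ones when the data is loaded. The sketch below shows the idea in Python; the column name `referer_class` and the row layout are hypothetical, and the actual dashboard code may handle this differently.

```python
# Old label -> new label, taken from the list of changes above.
LABEL_MAP = {
    "Search engine": "external (search engine)",
    "Other": "external",
}


def relabel(rows, column="referer_class"):
    """Return copies of the rows with old labels translated to new ones.

    `column` is a hypothetical field name; labels that are not in
    LABEL_MAP (for example "internal" or "none") pass through unchanged.
    """
    return [{**row, column: LABEL_MAP.get(row[column], row[column])} for row in rows]


old_rows = [
    {"date": "2016-03-01", "referer_class": "Search engine", "pageviews": 10},
    {"date": "2016-03-01", "referer_class": "Other", "pageviews": 5},
]
print(relabel(old_rows))
```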