
Update data collections to use latest refinery (UDF) version
Closed, Resolved · Public · 2 Estimate Story Points


We noticed that web requests coming from weren't getting identified by the referer classifier UDF because of a non-optional period in the regex. This has been patched, but now our data collection codebase "golden" needs to be updated to use the new JARs (e.g. /home/bearloga/Code/refineries/refinery-hive-0.0.28-SNAPSHOT.jar).

Specifically, we need to update the scripts external_traffic/search_referers.R and portal/referers.R which use an outdated JAR that was packaged before the patch was finalized. Changes include:

  • RefererClassifyUDF became SmartReferrerClassifierUDF
  • The output value "Search engine" became "external (search engine)"
  • The output value "Other" became "external"
  • Additionally, SmartReferrerClassifierUDF outputs:
    • "internal" where the referer is wikipedia/mediawiki/etc. (if we include this in the aggregate dataset, it would resolve T129137)
    • "none" if empty or "-"
    • "unknown" otherwise (see classifyReferer() in
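
The label changes listed above can be captured as a small lookup table, which may help when bringing previously collected data in line with the new output. This is a minimal Python sketch rather than the code used in golden (whose scripts are written in R); the names `OLD_TO_NEW`, `NEW_LABELS`, and `relabel` are hypothetical.

```python
# Old RefererClassifyUDF outputs mapped to their SmartReferrerClassifierUDF
# equivalents, as listed in the task description.
OLD_TO_NEW = {
    "Search engine": "external (search engine)",
    "Other": "external",
}

# Every label the new UDF can emit, according to the task description.
NEW_LABELS = {
    "external (search engine)",
    "external",
    "internal",
    "none",
    "unknown",
}


def relabel(label):
    """Translate an old label to its new equivalent; pass new labels through."""
    if label in OLD_TO_NEW:
        return OLD_TO_NEW[label]
    if label in NEW_LABELS:
        return label
    # Fail loudly instead of silently keeping a label neither UDF produces.
    raise ValueError(f"unrecognized referer class: {label!r}")
```

Raising on unrecognized values is a design choice made for this sketch: it surfaces rows that match neither format instead of carrying them forward unnoticed.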

Because requests from weren't being counted as coming from Google but will be after golden is updated, we expect the % of referral traffic from Google to rise after the patch, and this change should be annotated on the External Traffic and Portal::External Traffic dashboards.

Finally, the external traffic part of the Portal dashboard (which references "Search engine" and "Other" labels) would need to be updated to use the new labels once golden has been patched.


Related Gerrit Patches:
wikimedia/discovery/dashboard : master : Deploy new dashboard versions
wikimedia/discovery/prince : master : Annotate data format change
wikimedia/discovery/wonderbolt : master : Update to use new format
wikimedia/discovery/prince : master : Update dashboard to use new referer traffic format
wikimedia/discovery/golden : master : Switch to final referer classifier UDF

Event Timeline

mpopov created this task. Mar 15 2016, 11:22 PM
Restricted Application added a subscriber: Aklapper. Mar 15 2016, 11:22 PM
mpopov updated the task description. Mar 15 2016, 11:36 PM
Deskana triaged this task as Medium priority. Mar 17 2016, 8:16 PM
mpopov claimed this task. Apr 1 2016, 6:54 PM
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 2.

Change 281064 had a related patch set uploaded (by Bearloga):
Switch to final referer classifier UDF

Had to modify the data files to be consistent with the new data that's gonna be coming in as a result of

Checking the scripts right now before merging, then backfilling from 7 March 2016 using the new refinery. (A typo, a missing space, caused the external traffic and portal referer dashboard data to stop collecting after 6 March 2016.)

Dashboards are currently set up to work with the old data format, so I'll have to fix the External Traffic and Portal dashboards on Monday to work with the new data format.

Change 281064 merged by Bearloga:
Switch to final referer classifier UDF

Change 281472 had a related patch set uploaded (by Bearloga):
Update dashboard to use new referer traffic format

Change 281472 merged by Bearloga:
Update dashboard to use new referer traffic format

Change 281488 had a related patch set uploaded (by Bearloga):
Update to use new format

Change 281488 merged by Bearloga:
Update to use new format

Change 281492 had a related patch set uploaded (by Bearloga):
Annotate data format change

Change 281492 merged by Bearloga:
Annotate data format change

mpopov added a comment. Apr 4 2016, 7:11 PM

Data collection scripts: updated. Data: backfilled from 2016-03-07 using the new refinery version. Dashboards: updated to use new data format & annotated to indicate when the switch happened.
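
A backfill like the one described amounts to re-running the collection once per day starting at 2016-03-07. The Python sketch below only enumerates the dates; `backfill_dates` is a hypothetical helper, and the end date is an arbitrary example because the task does not state when the backfill stopped.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Return every date from start to end, inclusive (empty if end < start)."""
    n_days = (end - start).days
    return [start + timedelta(days=i) for i in range(n_days + 1)]


# The start date comes from the task; the end date is only an example value.
dates = backfill_dates(date(2016, 3, 7), date(2016, 3, 9))
```

Each returned date would then be passed to the collection script for that day.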

Will deploy to production later this week if no bugs/issues are reported.

P.S. Going forward, our scripts should only use finalized UDFs to avoid situations like this. Hive queries should import a statically-named refinery JAR that's always kept up to date (see: for more details)
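
One way to follow this advice is to build the Hive preamble in a single place, so every query registers the UDF from the same stable path. This is a rough Python sketch under stated assumptions: `hive_preamble` is a hypothetical helper, and the JAR path, function name, and class package in the usage lines are placeholders, not the real refinery values. Only the `ADD JAR` and `CREATE TEMPORARY FUNCTION ... AS` statements are standard Hive syntax.

```python
def hive_preamble(jar_path, function_name, class_name):
    """Build the Hive statements that load a JAR and register a UDF from it."""
    return (
        f"ADD JAR {jar_path};\n"
        f"CREATE TEMPORARY FUNCTION {function_name} AS '{class_name}';"
    )


# All three arguments are placeholders chosen for illustration.
preamble = hive_preamble(
    "/path/to/refinery-hive.jar",  # hypothetical statically-named JAR
    "classify_referer",  # hypothetical function alias
    "org.example.SmartReferrerClassifierUDF",  # placeholder package name
)
```

With the path defined once, moving to a newer refinery build means replacing the file at that path, with no edits to individual queries.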

Change 281666 had a related patch set uploaded (by Bearloga):
Deploy new dashboard versions

Change 281666 merged by Bearloga:
Deploy new dashboard versions

debt closed this task as Resolved. Jun 7 2016, 7:43 PM