
Update data collections to use latest refinery (UDF) version
Closed, Resolved | Public | 2 Estimated Story Points

Description

We noticed that web requests coming from http://google.com weren't getting identified by the referer classifier UDF because of a non-optional period in the regex. This has been patched, but now our data collection codebase "golden" needs to be updated to use the new JARs (e.g. /home/bearloga/Code/refineries/refinery-hive-0.0.28-SNAPSHOT.jar).
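To illustrate the kind of bug described above, here is a minimal Python sketch. Both patterns are hypothetical stand-ins, not the actual regex from the refinery source; they only show how a required "subdomain plus period" prefix can exclude a bare-domain referer such as http://google.com.

```python
import re

# Hypothetical patterns for illustration only; the real refinery regex differs.
# STRICT requires some subdomain followed by a period before "google".
STRICT = re.compile(r"^https?://[^/]+\.google\.[a-z.]+(/|$)")
# LENIENT makes the "subdomain." prefix optional, so bare domains match too.
LENIENT = re.compile(r"^https?://([^/]+\.)?google\.[a-z.]+(/|$)")


def matches(pattern, referer):
    """Return True if the referer URL is recognised by the given pattern."""
    return pattern.search(referer) is not None


if __name__ == "__main__":
    for url in ("http://www.google.com/", "http://google.com"):
        print(url, matches(STRICT, url), matches(LENIENT, url))
```

Under these assumed patterns, the strict form recognises www.google.com but misses google.com, while the lenient form recognises both.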

Specifically, we need to update the scripts external_traffic/search_referers.R and portal/referers.R which use an outdated JAR that was packaged before the patch was finalized. Changes include:

  • RefererClassifyUDF became SmartReferrerClassifierUDF
  • The output value "Search engine" became "external (search engine)"
  • The output value "Other" became "external"
  • Additionally, SmartReferrerClassifierUDF outputs:
    • "internal" where the referer is wikipedia/mediawiki/etc. (if we include this in the aggregate dataset, it would resolve T129137)
    • "none" if empty or "-"
    • "unknown" otherwise (see classifyReferer() in Webrequest.java)
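The label changes listed above can be sketched as a small lookup, for example when making previously collected data consistent with the new output. This is a minimal Python sketch, not the code used in golden (whose scripts are written in R); the function name `relabel` is hypothetical, and it assumes the two renames listed in this task are the only old labels present in the data.

```python
# Renames described in this task: old UDF label -> new UDF label.
OLD_TO_NEW = {
    "Search engine": "external (search engine)",
    "Other": "external",
}

# Every label the new UDF is described as producing in this task.
NEW_LABELS = {
    "external (search engine)",
    "external",
    "internal",
    "none",
    "unknown",
}


def relabel(value):
    """Translate an old referer class label to its new equivalent.

    Labels that are already in the new format pass through unchanged.
    Anything else raises, so unexpected values are not silently kept.
    """
    if value in NEW_LABELS:
        return value
    if value in OLD_TO_NEW:
        return OLD_TO_NEW[value]
    raise ValueError(f"Unrecognised referer class label: {value!r}")
```

Raising on unrecognised labels is a deliberate choice here: a silent pass-through could hide rows that still use a label not covered by this task.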

Because requests from http://google.com weren't being counted as coming from Google but will be after golden is updated, we expect the % of referral traffic from Google to rise after the patch, and this change should be annotated on the External Traffic and Portal::External Traffic dashboards.

Finally, the external traffic part of the Portal dashboard (which references "Search engine" and "Other" labels) would need to be updated to use the new labels once golden has been patched.

Event Timeline

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 2.

Change 281064 had a related patch set uploaded (by Bearloga):
Switch to final referer classifier UDF

https://gerrit.wikimedia.org/r/281064

Had to modify the data files to be consistent with the new data that's gonna be coming in as a result of https://gerrit.wikimedia.org/r/#/c/281064/

Checking the scripts right now before merging and backfilling from 7 March 2016 using the new refinery. (A typo -- a missing space -- caused the external traffic and portal referer dashboard data to stop collecting after 6 March 2016.)

Dashboards are currently set up to work with the old data format, so I'll have to fix the External Traffic and Portal dashboards on Monday to work with the new data format.

Change 281064 merged by Bearloga:
Switch to final referer classifier UDF

https://gerrit.wikimedia.org/r/281064

Change 281472 had a related patch set uploaded (by Bearloga):
Update dashboard to use new referer traffic format

https://gerrit.wikimedia.org/r/281472

Change 281472 merged by Bearloga:
Update dashboard to use new referer traffic format

https://gerrit.wikimedia.org/r/281472

Change 281488 had a related patch set uploaded (by Bearloga):
Update to use new format

https://gerrit.wikimedia.org/r/281488

Change 281488 merged by Bearloga:
Update to use new format

https://gerrit.wikimedia.org/r/281488

Change 281492 had a related patch set uploaded (by Bearloga):
Annotate data format change

https://gerrit.wikimedia.org/r/281492

Change 281492 merged by Bearloga:
Annotate data format change

https://gerrit.wikimedia.org/r/281492

Data collection scripts: updated. Data: backfilled from 2016-03-07 using the new refinery version. Dashboards: updated to use new data format & annotated to indicate when the switch happened.

Will deploy to production later this week if no bugs/issues are reported.

P.S. Going forward, our scripts should only use finalized UDFs to avoid situations like this. Hive queries should import a statically-named refinery JAR that's always kept up to date (see https://meta.wikimedia.org/wiki/Discovery/Analytics#User-defined_Functions for more details).
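As a rough sketch of that idea, the Python helper below builds the Hive preamble from a single statically-named JAR path, so no version number is hard-coded in individual queries. The JAR path, the Java package name, and the helper name `hive_preamble` are all placeholders assumed for illustration; the linked page is the place to find the real values.

```python
# Placeholder values, assumed for illustration only.
STATIC_JAR = "/path/to/refinery-hive.jar"
UDF_CLASS = "org.example.SmartReferrerClassifierUDF"


def hive_preamble(function_name, jar=STATIC_JAR, udf_class=UDF_CLASS):
    """Return the Hive statements that load a JAR and register a UDF."""
    return (
        f"ADD JAR {jar};\n"
        f"CREATE TEMPORARY FUNCTION {function_name} AS '{udf_class}';"
    )


if __name__ == "__main__":
    print(hive_preamble("classify_referer"))
```

Because every query gets its preamble from one place, pointing the scripts at a newer refinery build would only mean updating the file behind the static path.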

Change 281666 had a related patch set uploaded (by Bearloga):
Deploy new dashboard versions

https://gerrit.wikimedia.org/r/281666

Change 281666 merged by Bearloga:
Deploy new dashboard versions

https://gerrit.wikimedia.org/r/281666