Page MenuHomePhabricator

Create user defined function to classify network origin of an IP address
Closed, ResolvedPublic

Description

Create a new user defined function (UDF) for use in Hive queries that can generate a network classification ("internal", "external", "labs") for a given IPv4 or IPv6 address. This UDF will be used in ad-hoc queries and Hive scripts to broadly categorize the origin of Action API requests. It may also be found to be useful for other reporting needs.

Details

Related Gerrit Patches:
analytics/refinery/source : masterRename network_origin UDF partitions
analytics/refinery/source : masterAdd UDF for network origin
analytics/refinery/source : masterRename ipAddressMatcherCache -> trustedProxiesCache

Event Timeline

bd808 claimed this task.
bd808 raised the priority of this task from to Medium.
bd808 updated the task description. (Show Details)
bd808 added subscribers: Nuria, bd808, Ainali.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 13 2015, 6:34 PM
bd808 moved this task from To Do to In Dev/Progress on the User-bd808 board.Nov 13 2015, 6:35 PM
Ainali removed a subscriber: Ainali.Nov 13 2015, 9:50 PM

Change 253045 had a related patch set uploaded (by BryanDavis):
Rename ipAddressMatcherCache -> trustedProxiesCache

https://gerrit.wikimedia.org/r/253045

Change 253046 had a related patch set uploaded (by BryanDavis):
Add UDF for network origin

https://gerrit.wikimedia.org/r/253046

Change 253046 merged by Nuria:
Add UDF for network origin

https://gerrit.wikimedia.org/r/253046

Change 253045 merged by Nuria:
Rename ipAddressMatcherCache -> trustedProxiesCache

https://gerrit.wikimedia.org/r/253045

bd808 moved this task from In Dev/Progress to Done on the User-bd808 board.Nov 19 2015, 12:03 AM

Change 254170 had a related patch set uploaded (by BryanDavis):
Rename network_origin UDF partitions

https://gerrit.wikimedia.org/r/254170

Milimetric moved this task from Incoming to Radar on the Analytics-Backlog board.Nov 19 2015, 6:22 PM

Change 254170 merged by Nuria:
Rename network_origin UDF partitions

https://gerrit.wikimedia.org/r/254170

The new UDF can be used from my homedir on stat1002 until such time as the shared refinery jars are updated:

ADD JAR /home/bd808/projects/analytics-refinery-source/refinery-core/target/refinery-core-0.0.23-SNAPSHOT.jar;
ADD JAR /home/bd808/projects/analytics-refinery-source/refinery-hive/target/refinery-hive-0.0.23-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION network_origin as 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

USE wmf;

SELECT month,
    day,
    network_origin(client_ip) as network,
    COUNT(*) as hits
FROM webrequest
WHERE year = 2015
 AND month = 11
 AND day = 2
GROUP BY month, day, network_origin(client_ip)
LIMIT 1000;
bd808 closed this task as Resolved.Dec 23 2015, 4:50 PM

Function is in /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar on stat1002

ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION network_origin as 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF'

SELECT month,
    day,
    network_origin(client_ip) as network,
    COUNT(*) as hits
FROM webrequest
WHERE year = 2015
 AND month = 11
 AND day = 2
GROUP BY month, day, network_origin(client_ip)
LIMIT 1000;
bd808 moved this task from Done to Archive on the User-bd808 board.Dec 31 2015, 6:09 AM