Page MenuHomePhabricator

Create user defined function to classify network origin of an IP address
Closed, ResolvedPublic

Description

Create a new user defined function (UDF) for use in Hive queries that can generate a network classification ("internal", "external", "labs") for a given IPv4 or IPv6 address. This UDF will be used in ad-hoc queries and Hive scripts to broadly categorize the origin of Action API requests. It may also be found to be useful for other reporting needs.

Event Timeline

bd808 claimed this task.
bd808 raised the priority of this task from to Medium.
bd808 updated the task description. (Show Details)
bd808 added subscribers: Nuria, bd808, Ainali.

Change 253045 had a related patch set uploaded (by BryanDavis):
Rename ipAddressMatcherCache -> trustedProxiesCache

https://gerrit.wikimedia.org/r/253045

Change 253046 had a related patch set uploaded (by BryanDavis):
Add UDF for network origin

https://gerrit.wikimedia.org/r/253046

Change 253046 merged by Nuria:
Add UDF for network origin

https://gerrit.wikimedia.org/r/253046

Change 253045 merged by Nuria:
Rename ipAddressMatcherCache -> trustedProxiesCache

https://gerrit.wikimedia.org/r/253045

Change 254170 had a related patch set uploaded (by BryanDavis):
Rename network_origin UDF partitions

https://gerrit.wikimedia.org/r/254170

Change 254170 merged by Nuria:
Rename network_origin UDF partitions

https://gerrit.wikimedia.org/r/254170

The new UDF can be used from my homedir on stat1002 until such time as the shared refinery jars are updated:

ADD JAR /home/bd808/projects/analytics-refinery-source/refinery-core/target/refinery-core-0.0.23-SNAPSHOT.jar;
ADD JAR /home/bd808/projects/analytics-refinery-source/refinery-hive/target/refinery-hive-0.0.23-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION network_origin as 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

USE wmf;

SELECT month,
    day,
    network_origin(client_ip) as network,
    COUNT(*) as hits
FROM webrequest
WHERE year = 2015
 AND month = 11
 AND day = 2
GROUP BY month, day, network_origin(client_ip)
LIMIT 1000;

Function is in /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar on stat1002

ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION network_origin as 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF'

SELECT month,
    day,
    network_origin(client_ip) as network,
    COUNT(*) as hits
FROM webrequest
WHERE year = 2015
 AND month = 11
 AND day = 2
GROUP BY month, day, network_origin(client_ip)
LIMIT 1000;