Page MenuHomePhabricator

Create user defined function to classify network origin of an IP address
Closed, ResolvedPublic

Description

Create a new user defined function (UDF) for use in Hive queries that can generate a network classification ("internal", "external", "labs") for a given IPv4 or IPv6 address. This UDF will be used in ad-hoc queries and Hive scripts to broadly categorize the origin of Action API requests. It may also be found to be useful for other reporting needs.

Event Timeline

bd808 created this task.Nov 13 2015, 6:34 PM
bd808 claimed this task.
bd808 raised the priority of this task from to Normal.
bd808 updated the task description. (Show Details)
bd808 added subscribers: Nuria, bd808, Ainali.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 13 2015, 6:34 PM
bd808 moved this task from To Do to In Dev/Progress on the User-bd808 board.Nov 13 2015, 6:35 PM
Ainali removed a subscriber: Ainali.Nov 13 2015, 9:50 PM

Change 253045 had a related patch set uploaded (by BryanDavis):
Rename ipAddressMatcherCache -> trustedProxiesCache

https://gerrit.wikimedia.org/r/253045

Change 253046 had a related patch set uploaded (by BryanDavis):
Add UDF for network origin

https://gerrit.wikimedia.org/r/253046

Change 253046 merged by Nuria:
Add UDF for network origin

https://gerrit.wikimedia.org/r/253046

Change 253045 merged by Nuria:
Rename ipAddressMatcherCache -> trustedProxiesCache

https://gerrit.wikimedia.org/r/253045

bd808 moved this task from In Dev/Progress to Done on the User-bd808 board.Nov 19 2015, 12:03 AM

Change 254170 had a related patch set uploaded (by BryanDavis):
Rename network_origin UDF partitions

https://gerrit.wikimedia.org/r/254170

Milimetric moved this task from Incoming to Radar on the Analytics-Backlog board.Nov 19 2015, 6:22 PM

Change 254170 merged by Nuria:
Rename network_origin UDF partitions

https://gerrit.wikimedia.org/r/254170

bd808 added a comment.Dec 10 2015, 5:24 PM

The new UDF can be used from my homedir on stat1002 until such time as the shared refinery jars are updated:

ADD JAR /home/bd808/projects/analytics-refinery-source/refinery-core/target/refinery-core-0.0.23-SNAPSHOT.jar;
ADD JAR /home/bd808/projects/analytics-refinery-source/refinery-hive/target/refinery-hive-0.0.23-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION network_origin as 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

USE wmf;

SELECT month,
    day,
    network_origin(client_ip) as network,
    COUNT(*) as hits
FROM webrequest
WHERE year = 2015
 AND month = 11
 AND day = 2
GROUP BY month, day, network_origin(client_ip)
LIMIT 1000;
bd808 closed this task as Resolved.Dec 23 2015, 4:50 PM

Function is in /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar on stat1002

ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION network_origin as 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF'

SELECT month,
    day,
    network_origin(client_ip) as network,
    COUNT(*) as hits
FROM webrequest
WHERE year = 2015
 AND month = 11
 AND day = 2
GROUP BY month, day, network_origin(client_ip)
LIMIT 1000;
bd808 moved this task from Done to Archive on the User-bd808 board.Dec 31 2015, 6:09 AM