Fix non MapReduce execution of GeoCode UDF
Open, High, Public

Description

Currently, Hive jobs that use MaxMind data fail when they are executed locally (without MapReduce). For example:

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION geocoded_data as 'org.wikimedia.analytics.refinery.hive.GeocodedDataUDF';
SELECT geocoded_data(ip) from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 11 and day = 14 and hour = 0 limit 10;

Stacktrace logged by hive-server2 on an-coord1001:

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating get_geo_data(ip)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:463)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:294)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:769)
        at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy21.fetchResults(Unknown Source)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:462)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:696)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating get_geo_data(ip)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:154)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2071)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:458)
        ... 24 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating get_geo_data(ip)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:82)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:98)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:425)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:417)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:140)
        ... 26 more
Caused by: java.lang.NullPointerException
        at org.wikimedia.analytics.refinery.hive.GetGeoDataUDF.evaluate(GetGeoDataUDF.java:153)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
        at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
        ... 31 more
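
The trace above shows the NullPointerException raised inside evaluate() while rows are pushed through FetchTask, that is, with no MapReduce job involved. One plausible reading, not confirmed in this task, is that the UDF builds its MaxMind reader in a hook that only runs inside MapReduce tasks (Hive documents GenericUDF.configure(MapredContext) as being called only at MapRedTask runtime), so the field is still null on the local fetch path. Below is a minimal sketch of guarding against that with lazy initialization. Every name in it (LazyGeoUdfSketch, GeoLookup, ensureLookup) is hypothetical and it does not reproduce the refinery code:

```java
import java.util.function.Supplier;

/**
 * Sketch of a UDF-like class whose lookup resource is created lazily.
 * All names are hypothetical; this is not the refinery implementation.
 */
final class LazyGeoUdfSketch {

    /** Stand-in for an expensive resource such as a MaxMind database reader. */
    interface GeoLookup {
        String countryOf(String ip);
    }

    private final Supplier<GeoLookup> factory;
    private GeoLookup lookup; // stays null until first needed

    LazyGeoUdfSketch(Supplier<GeoLookup> factory) {
        this.factory = factory;
    }

    /** Hook that, by assumption, only a MapReduce task calls. */
    void configure() {
        ensureLookup();
    }

    /** Called for every row, whether or not configure() ran first. */
    String evaluate(String ip) {
        ensureLookup(); // guards against a null field on the local fetch path
        return ip == null ? null : lookup.countryOf(ip);
    }

    private void ensureLookup() {
        if (lookup == null) {
            lookup = factory.get();
        }
    }

    public static void main(String[] args) {
        LazyGeoUdfSketch udf = new LazyGeoUdfSketch(() -> ip -> "country-for-" + ip);
        // Simulate local execution: evaluate() without a prior configure().
        System.out.println(udf.evaluate("192.0.2.1"));
    }
}
```

With this shape, evaluate() behaves the same whether or not the MapReduce-only hook ran beforehand. Whether the actual fix takes this approach is not stated here.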

Event Timeline

Restricted Application added a subscriber: Aklapper. · Fri, Nov 15, 6:18 PM
Ottomata triaged this task as High priority. · Mon, Nov 18, 4:39 PM
Ottomata moved this task from Incoming to Operational Excellence on the Analytics board.

Change 551776 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics_cluster::coordinator: add geoip database

https://gerrit.wikimedia.org/r/551776

Change 551776 merged by Elukey:
[operations/puppet@production] profile::analytics_cluster::coordinator: add geoip database

https://gerrit.wikimedia.org/r/551776

elukey added a comment. · Edited · Tue, Nov 19, 2:37 PM

@JAllemandou I reverted the patch because I hadn't realized that class { 'geoip': } is already included on an-coord1001:

elukey@an-coord1001:~$ ls /usr/share/GeoIP/
GeoIP2-City.mmdb	     GeoIP2-Country.mmdb  GeoIPASNum.dat    GeoIPCity.dat  GeoIPNetSpeedCell.dat  GeoIPRegion.dat  GeoLite2-City.mmdb  GeoLiteCityv6.dat
GeoIP2-Connection-Type.mmdb  GeoIP2-ISP.mmdb	  GeoIPASNumv6.dat  GeoIP.dat	   GeoIPNetSpeed.dat	  GeoIPv6.dat	   GeoLiteCity.dat     GeoLite.dat

Is there something else missing?

As far as I can see from https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetGeoDataUDF.java, we need /usr/share/GeoIP/GeoIP2-City.mmdb, which is present on an-coord1001:

elukey@an-coord1001:~$ ls /usr/share/GeoIP/GeoIP2-City.mmdb
/usr/share/GeoIP/GeoIP2-City.mmdb
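
The ls output above confirms that the file exists for the shell user. Whether the JVM running the query can also read it is a separate question. A minimal sketch of that check from Java follows. The class and method names (GeoDbCheck, isReadableFile) are hypothetical, and the default path is simply the one quoted above:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: reports whether a database file is readable by this JVM. */
final class GeoDbCheck {

    /** Returns true if the location is an existing regular file this process can read. */
    static boolean isReadableFile(String location) {
        Path path = Paths.get(location);
        return Files.isRegularFile(path) && Files.isReadable(path);
    }

    public static void main(String[] args) {
        // Default to the path mentioned in the comment above; pass another path as the first argument.
        String location = args.length > 0 ? args[0] : "/usr/share/GeoIP/GeoIP2-City.mmdb";
        System.out.println(location + " readable: " + isReadableFile(location));
    }
}
```

The result only reflects the account that runs the check, so it is informative only when run as the same user as the Hive server process.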

elukey updated the task description. · Tue, Nov 19, 2:42 PM
mpopov added a subscriber: mpopov. · Wed, Nov 20, 10:02 PM
mforns reassigned this task from elukey to JAllemandou. · Mon, Nov 25, 5:35 PM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.
mforns added a subscriber: elukey.

Change 553726 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Fix GetGeoDataUDF and underlying function

https://gerrit.wikimedia.org/r/553726

JAllemandou renamed this task from "Add MaxMind DB files on an-coord1001 for hive local-jobs using UDF to succeed" to "Fix non MapReduce execution of GeoCode UDF". · Fri, Nov 29, 1:20 PM
JAllemandou moved this task from In Progress to In Code Review on the Analytics-Kanban board.
JAllemandou updated the task description.