
third party domain data is getting refined
Closed, Resolved · Public · 5 Estimated Story Points

Description

The refine code is supposed to filter out third-party domains, but third-party data is still getting refined:

nuria@an-coord1001:~$ more /etc/refinery/refine/refine_eventlogging_analytics.properties
database = event
hive_server_url = an-coord1001.eqiad.wmnet:10000
input_path = /wmf/data/raw/eventlogging
input_path_regex = eventlogging_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)
input_path_regex_capture_groups = table,year,month,day,hour
output_path = /wmf/data/event
schema_base_uri = eventlogging
should_email_report = true
since = 26
table_blacklist_regex = ^Edit|ChangesListHighlights|InputDeviceDynamics|PageIssues$
to_emails = analytics-alerts@wikimedia.org
transform_functions = org.wikimedia.analytics.refinery.job.refine.deduplicate_eventlogging,org.wikimedia.analytics.refinery.job.refine.geocode_ip,org.wikimedia.analytics.refinery.job.refine.eventlogging_filter_is_allowed_hostname
until = 2
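
As a reading aid for the config above, here is a minimal sketch of how input_path_regex and input_path_regex_capture_groups could pair up to turn a raw path into partition fields. The example path and the helper name partition_from_path are assumptions made for illustration, and whether Refine searches or fully matches the path is not stated in this task.

```python
import re

# Values copied from refine_eventlogging_analytics.properties above.
INPUT_PATH_REGEX = r"eventlogging_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)"
CAPTURE_GROUPS = "table,year,month,day,hour".split(",")


def partition_from_path(path):
    """Map each regex capture group to its configured name (hypothetical helper)."""
    match = re.search(INPUT_PATH_REGEX, path)
    if match is None:
        return None
    return dict(zip(CAPTURE_GROUPS, match.groups()))


# Hypothetical raw path built from input_path plus the pattern's layout.
example_path = "/wmf/data/raw/eventlogging/eventlogging_VirtualPageView/hourly/2019/07/19/20"
partition = partition_from_path(example_path)
print(partition)
```

The captured values are strings, so "07" keeps its leading zero here; any conversion to integers would happen elsewhere.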

select webhost, count(*) from virtualpageview where year=2019 and month=07 and day=19 and webhost not like "%wiki%" group by webhost;

0s.oj2q.o5uww2lqmvsgsyjon5zgo.cmle.ru	9
dakaita.com	9
zhwp.iotac.xyz	2
1937-engara.tryitforfree.at-wt.com	1
wb.v2dd.com	14
en.w.meaqua.org	1
z5h64q92x9.net	211
speechpanel.readspeaker.com	1
web.archive.org	7
w.upupming.site	337
0s.mvxa.o5uww2lqmvsgsyjon5zgo.cmle.ru	25
0s.pjua.o5uww2lqmvsgsyjon5zgo.dresk.ru	3
zh.100ke.info	2

Event Timeline

Restricted Application added subscribers: Stang, Aklapper.

Unit tests pass, so something about the way this change is applied to the DataFrame must keep it from working. Pinging @Ottomata so we can work on this together tomorrow.

fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans moved this task from Operational Excellence to Data Quality on the Analytics board.

The issue was the camelCase field name "webHost" (Spark) versus the column name "webhost" in Hive.
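
To illustrate how a case mismatch like this can let rows through silently, here is a minimal sketch in plain Python. It is not the refinery code: the allow-list regex, the function name filter_allowed_hostname, and the "skip the filter when the column is absent" behaviour are all assumptions chosen to show the failure mode.

```python
import re

# Hypothetical allow-list; the real refinery rule is not shown in this task.
ALLOWED_HOST = re.compile(r"(^|\.)(wikipedia|wikimedia)\.org$")


def filter_allowed_hostname(rows, column):
    """Keep rows whose `column` value matches ALLOWED_HOST.

    Assumption: if no row carries `column` under that exact name,
    the filter does nothing and returns every row unchanged.
    """
    if not any(column in row for row in rows):
        return list(rows)
    return [row for row in rows if ALLOWED_HOST.search(row.get(column, ""))]


# Rows keyed the way the Hive table names the column (all lowercase).
hive_rows = [
    {"webhost": "en.wikipedia.org"},
    {"webhost": "w.upupming.site"},
    {"webhost": "web.archive.org"},
]

# Looking up the camelCase name finds nothing, so nothing is filtered.
unfiltered = filter_allowed_hostname(hive_rows, "webHost")
# Looking up the lowercase name applies the filter as intended.
filtered = filter_allowed_hostname(hive_rows, "webhost")

print(len(unfiltered), [row["webhost"] for row in filtered])
```

A test fixture that uses the camelCase key would pass against such a filter, which is consistent with the unit tests working while production data went through unfiltered.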

Change 533629 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery/source@master] Correcting column name as spark is case sensitive

https://gerrit.wikimedia.org/r/533629

Nuria set the point value for this task to 5.

Corrected; the code will be deployed with the next refinery deployment.

select webhost, count(*) from virtualpageview where year=2019 and month=07 and day=19 and hour=20 and webhost not like "%wiki%" group by webhost;

Total MapReduce CPU Time Spent: 3 minutes 37 seconds 520 msec
OK
webhost _c1
Time taken: 20.075 seconds

We need to increment the refinery-jar version in refine puppet code

Change 533629 merged by jenkins-bot:
[analytics/refinery/source@master] Correcting column name as spark is case sensitive

https://gerrit.wikimedia.org/r/533629