
third party domain data is getting refined
Closed, Resolved · Public · 5 Story Points

Description

The refine code is supposed to filter out third-party domains, but third-party data is still getting refined:

nuria@an-coord1001:~$ more /etc/refinery/refine/refine_eventlogging_analytics.properties
database = event
hive_server_url = an-coord1001.eqiad.wmnet:10000
input_path = /wmf/data/raw/eventlogging
input_path_regex = eventlogging_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)
input_path_regex_capture_groups = table,year,month,day,hour
output_path = /wmf/data/event
schema_base_uri = eventlogging
should_email_report = true
since = 26
table_blacklist_regex = ^Edit|ChangesListHighlights|InputDeviceDynamics|PageIssues$
to_emails = analytics-alerts@wikimedia.org
transform_functions = org.wikimedia.analytics.refinery.job.refine.deduplicate_eventlogging,org.wikimedia.analytics.refinery.job.refine.geocode_ip,org.wikimedia.analytics.refinery.job.refine.eventlogging_filter_is_allowed_hostname
until = 2

select webhost, count(*) from virtualpageview where year=2019 and month=07 and day=19 and webhost not like "%wiki%" group by webhost;

0s.oj2q.o5uww2lqmvsgsyjon5zgo.cmle.ru	9
dakaita.com	9
zhwp.iotac.xyz	2
1937-engara.tryitforfree.at-wt.com	1
wb.v2dd.com	14
en.w.meaqua.org	1
z5h64q92x9.net	211
speechpanel.readspeaker.com	1
web.archive.org	7
w.upupming.site	337
0s.mvxa.o5uww2lqmvsgsyjon5zgo.cmle.ru	25
0s.pjua.o5uww2lqmvsgsyjon5zgo.dresk.ru	3
zh.100ke.info	2
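
For readers unfamiliar with the properties file above, here is a minimal sketch of how `input_path_regex` and `input_path_regex_capture_groups` fit together. The regex and group names are copied from the config. The helper `partition_from_path` and the example path are hypothetical, and this is plain Python rather than the Refine job's own code.

```python
import re

# Values copied from refine_eventlogging_analytics.properties.
INPUT_PATH_REGEX = r"eventlogging_(.+)/hourly/(\d+)/(\d+)/(\d+)/(\d+)"
CAPTURE_GROUPS = "table,year,month,day,hour".split(",")


def partition_from_path(path):
    """Map capture-group names to the values found in path, or None if no match.

    Hypothetical helper for illustration only.
    """
    match = re.search(INPUT_PATH_REGEX, path)
    if match is None:
        return None
    return dict(zip(CAPTURE_GROUPS, match.groups()))


# Hypothetical raw path under input_path; not taken from the cluster.
example_path = (
    "/wmf/data/raw/eventlogging/eventlogging_VirtualPageView/hourly/2019/07/19/20"
)
print(partition_from_path(example_path))
```

Under these assumptions, the captured `table` value is what `table_blacklist_regex` would be matched against, and the remaining groups become the `year`, `month`, `day` and `hour` partitions used in the query above.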

Details

Related Gerrit Patches:
analytics/refinery/source : master · Correcting column name as spark is case sensitive

Event Timeline

Nuria created this task. · Jul 20 2019, 4:54 AM
Restricted Application added a project: Internet-Archive. · Jul 20 2019, 4:54 AM
Restricted Application added subscribers: Cosine02, Aklapper.
Nuria added a comment. · Edited · Jul 20 2019, 5:30 AM

...

Nuria added a subscriber: Ottomata. · Edited · Jul 22 2019, 3:46 AM

Unit tests work fine, so something about the way this change is applied to the df must keep it from working. Pinging @Ottomata so we can work on this together tomorrow.

fdans triaged this task as High priority. · Jul 22 2019, 4:01 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans moved this task from Operational Excellence to Data Quality on the Analytics board.
Nuria claimed this task. · Jul 22 2019, 8:51 PM
Nuria added a project: Analytics-Kanban.
Nuria moved this task from Next Up to Paused on the Analytics-Kanban board. · Jul 30 2019, 4:09 PM
Nuria moved this task from Paused to In Progress on the Analytics-Kanban board. · Aug 23 2019, 4:03 PM

The issue was a case mismatch: the Spark code referenced the camel-case column "webHost", while the column in Hive is named "webhost".
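
A minimal sketch of how a case mismatch like this can let rows slip through unfiltered. This is plain Python, not the refinery code. It assumes the filter silently passes data through when the column it looks up is absent, which is a guess about the failure mode. The helpers `filter_rows` and `looks_like_wiki` are hypothetical, and the "wiki" substring check merely mirrors the heuristic in the query from the description.

```python
def filter_rows(rows, host_column, is_allowed):
    """Keep rows whose host is allowed.

    Assumption for illustration: if host_column is not present (the lookup is
    case sensitive), the rows are returned untouched instead of raising.
    """
    if not rows or host_column not in rows[0]:
        return rows  # column not found, so the filter does nothing
    return [row for row in rows if is_allowed(row[host_column])]


def looks_like_wiki(host):
    """Stand-in predicate mirroring the query's '%wiki%' heuristic."""
    return "wiki" in host


# Column named as in Hive: all lowercase.
rows = [{"webhost": "en.wikipedia.org"}, {"webhost": "dakaita.com"}]

unfiltered = filter_rows(rows, "webHost", looks_like_wiki)  # camel case: no match
filtered = filter_rows(rows, "webhost", looks_like_wiki)    # exact name: filter applies

print(len(unfiltered), len(filtered))
```

Under the same assumption, a unit test whose fixture uses the key "webHost" would pass, which is consistent with the tests working while production data stayed unfiltered.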

Change 533629 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery/source@master] Correcting column name as spark is case sensitive

https://gerrit.wikimedia.org/r/533629

Nuria set the point value for this task to 5.
Nuria added a comment. · Aug 31 2019, 1:34 PM

Corrected. The code will be deployed with the next refinery release. After the fix, the same query for one hour returns no rows:

select webhost, count(*) from virtualpageview where year=2019 and month=07 and day=19 and hour=20 and webhost not like "%wiki%" group by webhost;

Total MapReduce CPU Time Spent: 3 minutes 37 seconds 520 msec
OK
webhost _c1
Time taken: 20.075 seconds

We need to increment the refinery-jar version in the refine puppet code.

Change 533629 merged by jenkins-bot:
[analytics/refinery/source@master] Correcting column name as spark is case sensitive

https://gerrit.wikimedia.org/r/533629

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. · Sep 13 2019, 2:16 AM
Nuria closed this task as Resolved. · Sep 19 2019, 6:27 PM