Refine should add field to indicate if event is from wikimedia domain instead of filtering
Closed, Resolved, Public

Description

Due to T256674, no records have been refined into the event.wdqs_external_sparql_query table since https://gerrit.wikimedia.org/r/c/operations/puppet/+/592756/ was merged. Events from query.wikidata.org were being filtered out.

The motivation for filtering events was to prevent non-Wikimedia sites from emitting junk events via the EventLogging extension. We are trying to unify the ingestion logic for both production and analytics data, so in the future we will not be able to tell as easily from Kafka topic names which data is 'external' and which is 'internal'.

We could find other ways to do this (a tag in EventStreamConfig?) and Refine these datasets differently, but I think it would make more sense not to filter any data at all at Refine time. Instead, we should add a boolean field like is_wikimedia_source during refinement that tags each event appropriately. Users can then filter on this field.
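To make the proposal concrete, here is a minimal sketch of tagging instead of filtering. It is not the Refine implementation: the `meta.domain` field, the suffix list, and the helper names `is_wikimedia_domain` and `tag_event` are all assumptions made for illustration.

```python
from typing import Any, Optional

# Assumption: an illustrative, incomplete list of Wikimedia-owned domain suffixes.
WIKIMEDIA_DOMAIN_SUFFIXES = ("wikipedia.org", "wikidata.org", "wikimedia.org")


def is_wikimedia_domain(domain: Optional[str]) -> bool:
    """Return True if the domain is, or is a subdomain of, a listed suffix."""
    if not domain:
        return False
    domain = domain.lower().rstrip(".")
    return any(
        domain == suffix or domain.endswith("." + suffix)
        for suffix in WIKIMEDIA_DOMAIN_SUFFIXES
    )


def tag_event(event: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the event with a boolean is_wikimedia_source field.

    Assumption: the emitting domain is available at event["meta"]["domain"].
    No event is dropped; events without a domain are tagged False.
    """
    domain = (event.get("meta") or {}).get("domain")
    return {**event, "is_wikimedia_source": is_wikimedia_domain(domain)}


# Hypothetical sample events.
events = [
    {"meta": {"domain": "query.wikidata.org"}, "query": "SELECT ..."},
    {"meta": {"domain": "example.com"}, "query": "junk"},
    {"query": "no meta at all"},
]

# Refine-time step: keep everything, only add the tag.
tagged = [tag_event(e) for e in events]

# Downstream step: users decide what to keep in their derived data.
wikimedia_only = [e for e in tagged if e["is_wikimedia_source"]]
```

With this approach the filtering decision moves to the consumer: all three sample events survive the tagging step, and only the downstream selection narrows them to the one from query.wikidata.org.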

Event records are analogous to webrequest records. It will be hard to determine (in an automated way) at ingestion time what is valuable data and what is not. We should just keep the data and allow users to make this decision in their derived data.

Event Timeline

fdans triaged this task as Medium priority. Jul 13 2020, 4:55 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.

Change 646828 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Refine - Add is_wmf_domain transform function

https://gerrit.wikimedia.org/r/646828

Change 646828 merged by Ottomata:
[analytics/refinery/source@master] Refine - Add is_wmf_domain transform function

https://gerrit.wikimedia.org/r/646828

Change 654308 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump refine jar version to refinery-job 0.0.143

https://gerrit.wikimedia.org/r/654308

Change 654308 merged by Razzi:
[operations/puppet@production] Bump refine jar version to refinery-job 0.0.143

https://gerrit.wikimedia.org/r/654308