Due to T256674, no records have been refined into the event.wdqs_external_sparql_query table since https://gerrit.wikimedia.org/r/c/operations/puppet/+/592756/ was merged. Events from query.wikidata.org were being filtered out.
The motivation for filtering events was to prevent non-Wikimedia sites from emitting junk events via the EventLogging extension. We are trying to unify the ingestion logic for both production and analytics data, so in the future we will not be able to tell as easily from Kafka topic names which data is 'external' and which is 'internal'.
We could find other ways to do this (a tag in EventStreamConfig?) and Refine these datasets differently, but I think it would make more sense to not filter any data at all at Refine time. Instead, we should add a boolean field like is_wikimedia_source during refinement that tags each event appropriately. Users can then filter on this field.
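As a rough illustration of the tagging idea, here is a minimal Python sketch. It is not the Refine implementation. The suffix list, the helper names, and the use of meta.domain as the field to inspect are all assumptions made for the example.

```python
# Hypothetical list of domain suffixes treated as Wikimedia-run.
# This list is an assumption for illustration, not the real configuration.
WIKIMEDIA_DOMAIN_SUFFIXES = (
    "wikipedia.org",
    "wikidata.org",
    "wikimedia.org",
    "mediawiki.org",
)


def is_wikimedia_domain(domain):
    """Return True if domain equals, or is a subdomain of, a listed suffix."""
    if not domain:
        return False
    domain = domain.lower().rstrip(".")
    return any(
        domain == suffix or domain.endswith("." + suffix)
        for suffix in WIKIMEDIA_DOMAIN_SUFFIXES
    )


def tag_event(event):
    """Return a copy of the event with is_wikimedia_source added.

    Assumes the emitting host is in event["meta"]["domain"]. Nothing is
    dropped: events from unknown hosts are kept and tagged False.
    """
    domain = (event.get("meta") or {}).get("domain")
    tagged = dict(event)
    tagged["is_wikimedia_source"] = is_wikimedia_domain(domain)
    return tagged


events = [
    {"meta": {"domain": "query.wikidata.org"}, "query": "SELECT ..."},
    {"meta": {"domain": "example.com"}, "query": "junk"},
    {"query": "no meta at all"},
]
tagged_events = [tag_event(e) for e in events]

# Downstream users decide for themselves what to keep.
wikimedia_only = [e for e in tagged_events if e["is_wikimedia_source"]]
```

In this sketch all three events survive tagging, and only the downstream filter narrows them to the query.wikidata.org event.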
Event records are analogous to webrequest records. It will be hard to determine (in an automated way) at ingestion time what is valuable data and what is not. We should just keep the data and allow users to make this decision in their derived data.