The 0.36.0 release of Superset adds a specific setting to allow Druid datasources (as we use them) to be visible/usable. Superset upstream suggested mapping every Druid datasource as a table.
Steps to create Druid tables: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Druid_datasources_vs_Druid_tables
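The setting referenced above lives in Superset's config module. A minimal sketch, assuming the 0.36.0 flag is named `DRUID_IS_ACTIVE` (check the release notes for your deployment):

```python
# superset_config.py (sketch; flag name assumed from the 0.36.0 changelog)
# When set to False, the native Druid connector UI is hidden, and Druid is
# reached only through SQLAlchemy, i.e. datasources mapped as tables.
DRUID_IS_ACTIVE = True
```

With the native connector hidden, each Druid datasource has to be registered as a table on the Druid SQLAlchemy database, per the wikitech steps linked above.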
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | odimitrijevic | T211706 Superset Updates
Resolved | | elukey | T249681 Experiment with Druid and SqlAlchemy
Resolved | | cchen | T251857 Create Druid tables for Druid datasources in Superset
Event Timeline
@elukey I created Druid tables from all Druid datasources we have except the following:
- event_readingdepth
- mediawiki_history_beta
- mediawiki_history_reduced*
- netflow
- tbayer_popups
- test_elukey_webrequest
- test_elukey_wmf_netflow
- test_kafka_event_centralnoticeimpression
- test_wmf_netflow
The error message while creating the tables was:

> Table [druid.mediawiki_history_reduced_2019_12] could not be found, please double check your database connection, schema, and table name, error: This result object does not return rows. It has been closed automatically.
Please confirm these sources are not active or not in use.
Another thing: for the source event_pageissues, the "eventCount" column is missing from the newly created table.
Sources are old, I confirm! Thanks :)
No idea about event_pageissues, will investigate :)
@fdans if you have time this week can you check this task and see what's the issue with event_pageissues?
@cchen just to clarify, is this eventCount column that count measure that druid adds to the datasource, or is it a column that should have been ingested as part of the job?
@fdans this eventCount is a column that should have been ingested as part of the job. This column is in the Druid datasource, but when mapping the datasource to a Druid table, this column is missing.
event_pageissues: Druid datasource and Druid table
This is the puppet configuration for the job that pulls from Hive and index to Druid:
    profile::analytics::refinery::job::eventlogging_to_druid_job { 'pageissues':
        ensure     => 'absent',
        job_config => {
            dimensions => 'event.action,event.editCountBucket,event.isAnon,event.issuesSeverity,event.issuesVersion,event.namespaceId,event.sectionNumbers,revision,wiki,useragent.browser_family,useragent.browser_major,useragent.browser_minor,useragent.device_family,useragent.is_bot,useragent.os_family,useragent.os_major,useragent.os_minor',
        },
    }
A couple of notes:
- the job is currently absented; IIUC it was a one-off, so only some data is present in Druid
- eventCount is not among the dimensions, so I think it was added manually to Superset as an extra column. In theory it shouldn't be needed, since it seems to be a measure (for example, in the Druid table the event count can easily be retrieved via SQL).
@cchen I checked the Druid segments for event_pageissues on Druid and I noticed that they span 2018-10-03 to 2018-11-05. If I explore the datasource with a simple Table chart for the past 7 days, I get "No data" back, but if I do the same with the Druid SQL table I see a count. This is very strange, maybe a bug in Superset? (Or maybe me not seeing the obvious, lacking coffee?)
Also, do you need eventCount in this case for some chart/query? (Just to understand the use case, I suspect that we could live without it in the Druid SQL Table).
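For reference, the check above against the Druid SQL table can be sketched as a plain HTTP call to Druid's SQL endpoint. The broker URL below is a placeholder, and the query simply reproduces eventCount as a COUNT(*) over the last 7 days:

```python
import json
import urllib.request

# Placeholder broker URL; /druid/v2/sql/ is Druid's SQL HTTP endpoint.
BROKER = "http://druid-broker.example:8082/druid/v2/sql/"

def seven_day_count_query(table):
    """Druid SQL reproducing the 'eventCount' measure as a plain COUNT(*)."""
    return (
        f'SELECT COUNT(*) AS "eventCount" FROM "{table}" '
        "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY"
    )

def run_query(sql, broker=BROKER):
    # Druid's SQL endpoint accepts a JSON body of the form {"query": "..."}.
    req = urllib.request.Request(
        broker,
        data=json.dumps({"query": sql}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For a datasource whose segments end in 2018-11-05, this query should return a count of 0, which is what makes the non-zero result seen in Superset look like a bug.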
I checked the new version of Superset in staging (under testing) and the bug seems gone, weird...
@elukey I forgot to click the "Submit" button...
There are no charts/queries using this. I was checking each table to make sure it matches its datasource. And yes, we can live without it!