Page MenuHomePhabricator

Create Druid tables for Druid datasources in Superset
Closed, ResolvedPublic

Description

The 0.36.0 release of Superset adds a specific setting to allow Druid datasources (as we use them) to be visible/usable. Superset upstream suggested mapping every Druid datasource as a table.
Steps to create Druid tables: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Druid_datasources_vs_Druid_tables

Event Timeline

cchen triaged this task as Medium priority.May 5 2020, 6:03 AM
cchen moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

@elukey I created Druid tables from all Druid datasources we have except the followings:

  • event_readingdepth
  • mediawiki_history_beta
  • mediawiki_history_reduced*
  • netflow
  • tbayer_popups
  • test_elukey_webrequest
  • test_elukey_wmf_netflow
  • test_kafka_event_centralnoticeimpression
  • test_wmf_netflow

The error message while creating the table was

Table [druid.mediawiki_history_reduced_2019_12] could not be found, please double check your database connection, schema, and table name, error: This result object does not return rows. It has been closed automatically.

Please confirm these sources are not active or not in use.

Another thing is, for source event_pageissues, the "eventCount" column is missing from the newly created table.

Sources are old, I confirm! Thanks :)

No idea for even_pageissues, will investigate :)

@fdans if you have time this week can you check this task and see what's the issue with event_pageissues?

@cchen just to clarify, is this eventCount column that count measure that druid adds to the datasource, or is it a column that should have been ingested as part of the job?

@fdans this eventCount is a column that should have been ingested as part of the job. This column is in the Druid datasource, but when mapping the datasource to a Druid table, this column is missing.

event_pageissues: Druid datasource and Druid table

This is the puppet configuration for the job that pulls from Hive and index to Druid:

profile::analytics::refinery::job::eventlogging_to_druid_job { 'pageissues':
    ensure     => 'absent',
    job_config => {
        dimensions => 'event.action,event.editCountBucket,event.isAnon,event.issuesSeverity,event.issuesVersion,event.namespaceId,event.sectionNumbers,revision,wiki,useragent.browser_family,useragent.browser_major,useragent.browser_minor,useragent.device_family,useragent.is_bot,useragent.os_family,useragent.os_major,useragent.os_minor',
    },
}

A couple of notes:

  • the job is currently absented, IIUC it was a one off, so only some data is present in Druid
  • eventCount is not among the dimensions, so I think it was added manually to superset as extra column, but in theory it shouldn't be needed since it seems a measure (for example, in the Druid table the event count can be easily retrieved via SQL).

@cchen I checked the Druid segments for event_pageissues on Druid and I noticed that they are from 2018-10-03 to 2018-11-05. If I explore the datasource with a simple Table Chart for the past 7 days, I see a "No data" returned, meanwhile if I do the same with the Druid SQL Table I see a count that appears. This is something very strange, maybe a bug in Superset? (or maybe me not seeing the obvious lacking coffee?)

Also, do you need eventCount in this case for some chart/query? (Just to understand the use case, I suspect that we could live without it in the Druid SQL Table).

I checked in the new version of Superset in staging (under testing) and the bug seems gone, weird..

@elukey I forgot to click "submit" button...

There's no charts/query using this. I was checking each table to make sure they are the same as datasource. And yes we can live without this!

Close the ticket since there's no further updates to this task.