Create Druid tables for Druid datasources in Superset
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	cchen
	May 5 2020, 6:02 AM

Description

The 0.36.0 release of Superset adds a specific setting to allow Druid datasources (as we use them) to be visible/usable. Superset upstream suggested mapping every Druid datasource as a table.
Steps to create Druid tables: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Druid_datasources_vs_Druid_tables

Related Objects
Search...

Status	Assigned	Task
Resolved	odimitrijevic	T211706 Superset Updates
Resolved	elukey	T249681 Experiment with Druid and SqlAlchemy
Resolved	cchen	T251857 Create Druid tables for Druid datasources in Superset

Event Timeline

cchen created this task.May 5 2020, 6:02 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 5 2020, 6:02 AM

cchen triaged this task as Medium priority.May 5 2020, 6:03 AM

cchen added a parent task: T249681: Experiment with Druid and SqlAlchemy.

cchen moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

@elukey I created Druid tables from all Druid datasources we have except the followings:

event_readingdepth
mediawiki_history_beta
mediawiki_history_reduced*
netflow
tbayer_popups
test_elukey_webrequest
test_elukey_wmf_netflow
test_kafka_event_centralnoticeimpression
test_wmf_netflow

The error message while creating the table was

Table [druid.mediawiki_history_reduced_2019_12] could not be found, please double check your database connection, schema, and table name, error: This result object does not return rows. It has been closed automatically.

Please confirm these sources are not active or not in use.

Another thing is, for source event_pageissues, the "eventCount" column is missing from the newly created table.

Sources are old, I confirm! Thanks :)

No idea for even_pageissues, will investigate :)

@fdans if you have time this week can you check this task and see what's the issue with event_pageissues?

elukey added a project: Analytics.May 12 2020, 6:12 AM

cchen moved this task from Doing to Needs Review on the Product-Analytics (Kanban) board.May 12 2020, 5:15 PM

@cchen just to clarify, is this eventCount column that count measure that druid adds to the datasource, or is it a column that should have been ingested as part of the job?

@fdans this eventCount is a column that should have been ingested as part of the job. This column is in the Druid datasource, but when mapping the datasource to a Druid table, this column is missing.

event_pageissues: Druid datasource and Druid table

This is the puppet configuration for the job that pulls from Hive and index to Druid:

profile::analytics::refinery::job::eventlogging_to_druid_job { 'pageissues':
    ensure     => 'absent',
    job_config => {
        dimensions => 'event.action,event.editCountBucket,event.isAnon,event.issuesSeverity,event.issuesVersion,event.namespaceId,event.sectionNumbers,revision,wiki,useragent.browser_family,useragent.browser_major,useragent.browser_minor,useragent.device_family,useragent.is_bot,useragent.os_family,useragent.os_major,useragent.os_minor',
    },
}

A couple of notes:

the job is currently absented, IIUC it was a one off, so only some data is present in Druid
eventCount is not among the dimensions, so I think it was added manually to superset as extra column, but in theory it shouldn't be needed since it seems a measure (for example, in the Druid table the event count can be easily retrieved via SQL).

@cchen I checked the Druid segments for event_pageissues on Druid and I noticed that they are from 2018-10-03 to 2018-11-05. If I explore the datasource with a simple Table Chart for the past 7 days, I see a "No data" returned, meanwhile if I do the same with the Druid SQL Table I see a count that appears. This is something very strange, maybe a bug in Superset? (or maybe me not seeing the obvious lacking coffee?)

Also, do you need eventCount in this case for some chart/query? (Just to understand the use case, I suspect that we could live without it in the Druid SQL Table).

I checked in the new version of Superset in staging (under testing) and the bug seems gone, weird..

elukey moved this task from Incoming to Radar on the Analytics board.May 26 2020, 7:36 AM

@elukey I forgot to click "submit" button...

There's no charts/query using this. I was checking each table to make sure they are the same as datasource. And yes we can live without this!

Close the ticket since there's no further updates to this task.

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:44 AM

elukey mentioned this in T263972: Remove support for the (deprecated) Druid datasources (in favor of Druid Tables) on Superset.Sep 28 2020, 7:50 AM

Create Druid tables for Druid datasources in SupersetClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

Create Druid tables for Druid datasources in Superset
Closed, ResolvedPublic
Actions

Related Objects
Search...