Superset is now able to query Druid via SQLAlchemy. Upstream suggests that this is the "supported" way from now on, so we should try it and see if we can migrate charts to it.
Description
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | odimitrijevic | T211706 Superset Updates | |||
Resolved | elukey | T249681 Experiment with Druid and SqlAlchemy | |||
Resolved | cchen | T251857 Create Druid tables for Druid datasources in Superset |
Event Timeline
The 0.36.0 release of Superset adds a specific setting to allow Druid datasources (as we use them) to be visible/usable. Upstream is clearly in favor of using sqlalchemy and we should migrate our charts' datasources to it before we have too many :)
I added a "Database" to the Superset config, and I was able to create a druid.edits_hourly table with the following config:
Database: Druid SQL Analytics
Schema: druid
table: edits_hourly
It seems that a manual action is needed for all the tables, there is no quick way to get a list of table via metadata query (even if in sqllab is possible). I followed up with Upstream and they confirmed that this is a limitation/choice.
Verified in staging that a user without admin permissions can add tables from Druid, so it should work as Druid datasources.
I'll follow up over email with people using Superset to start testing charts with Druid SQLAlchemy tables, and add some documentation to the Superset page.
Mentioned in SAL (#wikimedia-analytics) [2020-04-30T10:36:20Z] <elukey> run superset init to add missing perms on an-tool1005 and analytics-tool1004 - T249681
Discoveries made today:
- superset init needs to be run on every upgrade, otherwise some permissions may not be created on the db tables related to new features. I was getting a HTTP 401 trying to modify a chart with a new datasource and now the issue is gone.
- the current superset version, 0.35.2, seems to be problematic when changing charts manually. If I select a chart, open the explore window (clicking on its name) and try to change the datasource I get an error message telling me that there is no way to modify the datetime column. In Superset 0.36.0 (currently in staging) seems to have two new dropdown to help with this, so moving charts to druid tables will have to wait 0.36 deployment (see related task under the parent Superset Updates).
Tested 4 tables and 2 dashboards with Druid tables:
- Features: features are all working with Druid tables;
- Load time: longer load time in refreshing charts and dashboards, but acceptable. Got Druid error while using pageviews_hourly table and group by project and countries:
druid error: Resource limit exceeded (io.druid.query.ResourceLimitExceededException): query[97442d5e-abf1-49a9-a2bc-39c11f6f8cb9] url[http://druid1001.eqiad.wmnet:8083/druid/v2/] timed out or max bytes limit reached.
- Data: discrepancy using the same data and same filters, see following examples
Example 1 Pageviews_daily
Example 2 Virtualpageviews_hourly
- Data: discrepancy using the same data and same filters, see following examples.
This is due to the semantics in time-range querying being different from the datasource and the SQL druid modes:
- In datasource mode, the date-times specified in the time section of the UI are passed to druid as an Interval whose en-date is exluded
- In SQL mode, the end-date is passed with an <= operator ... There fore it's included, and as we have hourly data (in pageview_hourly at least), we get an extra hour added to the number.
I obtained the same exact numbers using end-date 2020-03-31T23:59:59 in SQL mode.
Got Druid error while using pageviews_hourly table and group by project and countries:
This is expected as sql access uses presto which will not be able to deal with our largest datatsources
Closing ticket as it seems experiments are done, we expect performance of presto to improve even more when it is deployed to the whole cluster (right now it is only present in 5 nodes)