
Experiment with Druid and SqlAlchemy
Closed, ResolvedPublic3 Estimated Story Points

Description

Superset is now able to query Druid via SQLAlchemy. Upstream suggests that this is the "supported" way from now on, so we should try it and see if we can migrate charts to it.

Event Timeline

The 0.36.0 release of Superset adds a specific setting to allow Druid datasources (as we use them) to be visible/usable. Upstream is clearly in favor of using SQLAlchemy, and we should migrate our charts' datasources to it before we have too many :)

I added a "Database" to the Superset config, and I was able to create a druid.edits_hourly table with the following config:

Database: Druid SQL Analytics
Schema: druid
table: edits_hourly
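
For context, a Database entry like this one points Superset at the Druid broker's SQL endpoint through the pydruid SQLAlchemy dialect. A minimal sketch of the same access path outside Superset (the broker hostname and port below are placeholders, not our actual broker):

```
# Minimal sketch: query druid.edits_hourly over SQLAlchemy with the pydruid
# dialect, the same mechanism behind a "Druid SQL Analytics" Database entry.
# The broker hostname and port are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("druid://druid-broker.example.org:8082/druid/v2/sql/")

with engine.connect() as conn:
    rows = conn.execute(text("SELECT __time FROM druid.edits_hourly LIMIT 5"))
    for row in rows:
        print(row)
```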

It seems that a manual action is needed for every table; there is no quick way to get a list of tables via a metadata query (even though it is possible in SQL Lab). I followed up with upstream and they confirmed that this is a limitation/choice.
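
For the record, the list of datasources can still be pulled by hand from Druid's INFORMATION_SCHEMA, either in SQL Lab or over the same SQLAlchemy connection; it is only the automatic table discovery in the UI that is missing. A sketch (broker host/port are placeholders):

```
# Sketch: list Druid datasources via INFORMATION_SCHEMA, since the Superset
# UI does not enumerate them automatically. Broker host/port are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("druid://druid-broker.example.org:8082/druid/v2/sql/")

with engine.connect() as conn:
    tables = conn.execute(text(
        "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
        "WHERE TABLE_SCHEMA = 'druid' ORDER BY TABLE_NAME"
    ))
    for (name,) in tables:
        # Each of these still has to be added to Superset as a table by hand.
        print(name)
```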

Verified in staging that a user without admin permissions can add tables from Druid, so this should work the same way as Druid datasources do.

I'll follow up over email with people using Superset to start testing charts with Druid SQLAlchemy tables, and add some documentation to the Superset page.

Mentioned in SAL (#wikimedia-analytics) [2020-04-30T10:36:20Z] <elukey> run superset init to add missing perms on an-tool1005 and analytics-tool1004 - T249681

Discoveries made today:

  • superset init needs to be run on every upgrade, otherwise some permissions may not be created in the DB tables related to new features. I was getting an HTTP 401 when trying to modify a chart with a new datasource, and now the issue is gone.
  • the current Superset version, 0.35.2, seems to be problematic when changing charts manually. If I select a chart, open the explore view (by clicking on its name) and try to change the datasource, I get an error message telling me that there is no way to modify the datetime column. Superset 0.36.0 (currently in staging) seems to have two new dropdowns to help with this, so moving charts to Druid tables will have to wait for the 0.36 deployment (see the related task under the parent Superset Updates).

Tested 4 tables and 2 dashboards with Druid tables:

  • Features: all features are working with Druid tables;
  • Load time: longer load times when refreshing charts and dashboards, but acceptable. Got a Druid error while using the pageviews_hourly table and grouping by project and country:
druid error: Resource limit exceeded (io.druid.query.ResourceLimitExceededException): query[97442d5e-abf1-49a9-a2bc-39c11f6f8cb9] url[http://druid1001.eqiad.wmnet:8083/druid/v2/] timed out or max bytes limit reached.
  • Data: discrepancies when using the same data and the same filters, see the following examples:

Example 1 Pageviews_daily

Druid Datasource

[Screenshot: Screen Shot 2020-05-05 at 4.02.56 AM.png]

Druid Table

[Screenshot: Screen Shot 2020-05-05 at 4.02.48 AM.png]

Example 2 Virtualpageviews_hourly

Druid Datasource

[Screenshot: Screen Shot 2020-05-05 at 4.07.12 AM.png]

Druid Table

[Screenshot: Screen Shot 2020-05-05 at 4.07.21 AM.png]

> Data: discrepancies when using the same data and the same filters, see the following examples.

This is due to the time-range query semantics differing between the Druid datasource mode and the Druid SQL mode:

  • In datasource mode, the date-times specified in the time section of the UI are passed to Druid as an Interval whose end-date is excluded.
  • In SQL mode, the end-date is passed with an <= operator ... Therefore it is included, and since we have hourly data (in pageview_hourly at least), we get an extra hour added to the numbers.

I obtained the exact same numbers using end-date 2020-03-31T23:59:59 in SQL mode.
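
To make the difference concrete, here is a sketch of the two filters expressed as Druid SQL, using the pageview_hourly table mentioned above (the broker host/port and the date range are illustrative placeholders):

```
# Sketch of the two end-date semantics described above, as Druid SQL over
# SQLAlchemy. Table name follows pageview_hourly from the comment above;
# broker host/port and the date range are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("druid://druid-broker.example.org:8082/druid/v2/sql/")

# Datasource mode: the UI time range becomes a Druid Interval, i.e. the
# end-date is EXCLUDED: [2020-03-01T00:00, 2020-04-01T00:00).
datasource_like = text("""
    SELECT COUNT(*) FROM druid.pageview_hourly
    WHERE __time >= TIMESTAMP '2020-03-01 00:00:00'
      AND __time <  TIMESTAMP '2020-04-01 00:00:00'
""")

# SQL mode: the end-date goes through a <= comparison, so an end-date of
# 2020-04-01T00:00 pulls in one extra hour; capping it at 23:59:59 instead
# reproduces the datasource-mode numbers.
sql_mode_like = text("""
    SELECT COUNT(*) FROM druid.pageview_hourly
    WHERE __time >= TIMESTAMP '2020-03-01 00:00:00'
      AND __time <= TIMESTAMP '2020-03-31 23:59:59'
""")

with engine.connect() as conn:
    print(conn.execute(datasource_like).scalar())  # both counts should match
    print(conn.execute(sql_mode_like).scalar())
```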

> Got a Druid error while using the pageviews_hourly table and grouping by project and country:

This is expected, as SQL access uses Presto, which will not be able to deal with our largest datasources.

Closing the ticket as the experiments seem to be done; we expect the performance of Presto to improve even more when it is deployed to the whole cluster (right now it is only present on 5 nodes).

Nuria set the point value for this task to 3.
Nuria set Final Story Points to 3.