Page MenuHomePhabricator

Presto error in Superest - only when grouping
Open, Needs TriagePublic

Description

Hi all,

I am working with the event_santized.centralnoticebannerhistory data in Superset. I've done some data transformation and cleaning within Superset by creating columns that are meaningful to reflect how Fundraising teams digest campaign data. The end results can be seen in this dashboard. We have data for the FY2021 India campaign and the FY1920 en6C banner campaign working well with these created fields!

For this year's FY2021 en6C banner campaign, I am getting an error that seems to be tied to the created field 'campaign'. However, this error only appears when aggregated (not when rows are selected ungrouped), and only appears for the first two days of the en6C campaign, Nov. 30, 2020 and Dec. 1, 2020.

I have a s[[ https://superset.wikimedia.org/r/400 | eperate tab ]] of the dashboard displaying the issue. The 'Ungouped' view shows data flowing through without issue on Nov. 30 and Dec. 1; the 'Grouped' view shows the Presto error when grouping this same data. The 'Grouped, Starting Dec. 2' view is identical to the 'Grouped' view with the date range changed.

I have looked through the raw data in hive to see if I could identify any data quality errors on Nov. 30 and Dec. 1 in the event.l array, which is the original source of the 'campaign' attribute and have not been able to find anything (though would welcome ideas, of course).

Thank you for your help looking into this issue - please let me know if there is anything else I can provide!

Event Timeline

EYener created this task.Dec 18 2020, 3:57 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 18 2020, 3:57 PM

Hi @EYener - Sorry for not getting to this problem faster.

query issue:
By exploring the dashboards you mentioned, I found a problematic query being:

SELECT split_part(event.l[cardinality(event.l)],'|', 2) AS "campaign",
       COUNT(*) AS "count"
FROM "event_sanitized"."centralnoticebannerhistory"
WHERE concat(substr(dt, 1, 10), ' ', substr(dt, 12, 8)) >= '2020-11-30 00:00:00.000000'
  AND concat(substr(dt, 1, 10), ' ', substr(dt, 12, 8)) < '2020-12-02 00:00:00.000000'
  AND split_part(event.l[cardinality(event.l)],'|', 2) LIKE 'C2021_en6C%'
GROUP BY split_part(event.l[cardinality(event.l)],'|', 2)
ORDER BY "count" DESC
LIMIT 50;

for instance.

I did some tests on data quality and found that there is a single row on 2020-12-01 that has an empty event.l field, leading to presto failing to run the WHERE clause AND split_part(event.l[cardinality(event.l)],'|', 2) LIKE 'C2021_en6C%' (cardinality of the list is 0, and presto tells that index of an array starts at 1).

To make the query succesfull you should add: AND cardinality(event.l) > 0 before the the clause filtering for banner-name (I tested, it works :).

Performance improvement:
I wish to add something about query performance: your queries currently filter dates using the dt field only. You would gain a good amount of performance by adding limits using the year, month, day etc fields. Indeed, while filtering for exactly the same thing functionaly, the latter fields are partitions for the dataset, meaning the querying system knows how to only read data for the given periods when those fileds are specified.
For instance, your query written this way (including correction ;):

SELECT split_part(event.l[cardinality(event.l)],'|', 2) AS "campaign",
       COUNT(*) AS "count"
FROM "event_sanitized"."centralnoticebannerhistory"
WHERE concat(substr(dt, 1, 10), ' ', substr(dt, 12, 8)) >= '2020-11-30 00:00:00.000000'
  AND concat(substr(dt, 1, 10), ' ', substr(dt, 12, 8)) < '2020-12-02 00:00:00.000000'
  AND cardinality(event.l) > 0
  AND split_part(event.l[cardinality(event.l)],'|', 2) LIKE 'C2021_en6C%'
GROUP BY split_part(event.l[cardinality(event.l)],'|', 2)
ORDER BY "count" DESC
LIMIT 50;

        campaign         |  count  
-------------------------+---------
 C2021_en6C_m_FR         | 1344564 
 C2021_en6C_dsk_FR       |  881575 
 C2021_en6C_low_twins_FR |   22091 
 C2021_en6C_low_other_FR |     546 
(4 rows)

Query 20210113_164959_00098_dcwyu, FINISHED, 5 nodes
Splits: 22,600 total, 22,600 done (100.00%)
0:16 [198M rows, 709MB] [12.6M rows/s, 45.3MB/s]

The line saying Splits: 22,600 tells you how many files-split the system has read (in that case 22,600).
Now let's add year, month, day filtering:

SELECT split_part(event.l[cardinality(event.l)],'|', 2) AS "campaign",
       COUNT(*) AS "count"
FROM "event_sanitized"."centralnoticebannerhistory"
WHERE year = 2020 AND ((month = 11 AND day = 30) OR (month = 12 AND day = 1))
  AND concat(substr(dt, 1, 10), ' ', substr(dt, 12, 8)) >= '2020-11-30 00:00:00.000000'
  AND concat(substr(dt, 1, 10), ' ', substr(dt, 12, 8)) < '2020-12-02 00:00:00.000000'
  AND cardinality(event.l) > 0
  AND split_part(event.l[cardinality(event.l)],'|', 2) LIKE 'C2021_en6C%'
GROUP BY split_part(event.l[cardinality(event.l)],'|', 2)
ORDER BY "count" DESC
LIMIT 50;

        campaign         |  count  
-------------------------+---------
 C2021_en6C_m_FR         | 1344544 
 C2021_en6C_dsk_FR       |  881568 
 C2021_en6C_low_twins_FR |   22091 
 C2021_en6C_low_other_FR |     546 
(4 rows)

Query 20210113_165713_00099_dcwyu, FINISHED, 5 nodes
Splits: 325 total, 325 done (100.00%)
0:03 [2.27M rows, 208MB] [671K rows/s, 61.4MB/s]

As you can see, the number of splits has dropped significantly :)
Using that trick will help our queries perform better, and also will help the system perform better for others - I'd be glad if you can get used to it :)
Many thanks :)

razzi edited projects, added Analytics-Radar; removed Analytics.Jan 14 2021, 5:59 PM

Hi @JAllemandou thanks for the reply! I am pulling this task back up and opened the dashboard to implement these suggestions. However, I encountered a new error on all charts:

presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=21

I am not seeing this issue in Hive when I query directly. Can you give me any guidance as to what the issue might be?

Also, do you have any best practices on how to modify a Superset chart query directly? As you suggest, the addition of partitions in the WHERE clause will be a great optimization. I know these can be inserted via in-chart filters, but I'm curious if there is a way to modify the resulting query directly as that would be a less manual process for multi-chart dashboards.

Thank you!

Hi @EYener

presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=21

I have not experienced that error when opening the dashboard. If this problem is still ongoing on your side, let's try a collaborative debugging session - either planned or impromptu (irc ping for instance).

Also, do you have any best practices on how to modify a Superset chart query directly?

I don't have a good way to update many charts easily :( I have tried, adding a 'cutom sql' filter containing :

((year = 2020 AND month = 12 AND day > 1) OR (year = 2021 AND month = 1 AND day = 1))

has worked, but it would still mean a lot of manual copy-paste.
I don't know how feasible it would be, but maybe creating a view on presto (dedicated table using a query) could simplify: you create and save a presto query filtering for only the dates you're interested in using partition fields, keeping everything else. And then you use this query as source of all your charts. Not sure if it represents less work though.

Finally, this need you have is something other users will have, and we have technical solution that could help with being better at this partitioning problem - basically, being able to use dt filters and have partitions filtered from that. We are far from being able to do it now, but I wanted to let you know that we have those issues in mind :)

EYener added a subscriber: elukey.Thu, Jan 28, 4:38 PM

Hi all, I spoke to @JAllemandou today about the issue I am seeing while editing this dashboard:

presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event_sanitized/CentralNoticeBannerHistory/year=2021/month=1/day=9/hour=21

and it appears to be permissions-based. CC @elukey here are the steps to recreate:

  • Select a chart in the dashboard, such as this one (if link does not work, the dashboard is the Banner History MVP dashboard in Superset> FY2021 | en6C Banner Campaign tab > Status Codes | FY2021 | en6C Campaign chart).
  • This should be the edit chart mode. If it doesn't take you there directly, explore the chart so that you can edit the in-chart filters
  • Add or change a filter. For instance, add a filter for year = 2020 to add a partition to the query.

This should result in an error (at least on my user!) Thank you!

elukey added a comment.Fri, Feb 5, 7:36 AM

After a bit of digging, the problem was found in Superset, namely @EYener's username was EYener instead of eyener. This seems to be a little change but since the username is the one that Hadoop uses to verify membership to analytics-privatedata-users, everything needs to be lowercase. I think that Erin's account was created a long time ago, and the user auto-creation was for some reason not selecting the correct uid from LDAP when auto-creating the user upon first login. We didn't really notice anything weird up to Presto since Druid is not authenticated, so the username didn't matter a lot, but now it does.

I went through all the usernames and corrected the ones with camel case, in theory we shouldn't see the problem again, but in case we'll have to check what's wrong on the Superset side.

Thanks again for helping with the authentication issue @elukey. I'm sure it's all fixed now, but I also wanted to note that it looks like my email address in Superset is listed as EYener@email.notfound - just in case this is helpful.

To continue the conversation on the original presto error with you and @JAllemandou, after implementing the suggestions above from @JAllemandou's original suggestion, I'm still getting an error that

presto error: SQL array indices start at 1

but only in certain circumstances. Here is a link to the dashboard where I am working with these filters. The top 3 charts in the dashboard (Impression Count Pie Chart | Banners Selected | FY2021 | en6C Campaign ; Status Codes | FY2021 | en6C Campaign ; Status Codes | FY2021 | en6C Campaign | % of Total Impressions) have the adjusted filters.

  • The adjusted filter suggestion seems to works well in a pie chart, but not in a table or in an area chart
  • It seems that the position of the in-chart filter does not affect the position in the WHERE clause of the resulting query. IE, if the position of my in-chart filter is to stack cardinality(event.l) > 0 campaign LIKE en6C% it doesn't necessarily mean that the WHERE clause will read WHERE cardinality(event.l) > 0 AND campaign LIKE en6C% So instead of individual filter elements based on the columns & partitions I need to filter on, I've added a "run-on" filter condition to this chart, which forces the order in the WHERE clause.
  • I have not used the single, large filter in this chart name, so the order of the WHERE clause elements is different and the query is failing.

Should order in the WHERE be impactful? Thanks again!

Thanks again for helping with the authentication issue @elukey. I'm sure it's all fixed now, but I also wanted to note that it looks like my email address in Superset is listed as EYener@email.notfound - just in case this is helpful.

Yep it is done via the user auto-creation, we don't really use the email so all good. Thanks for reporting!

Thanks @elukey! I also wanted to ask about caching; it appears that caching is no longer working for this dashboard - do you know of a possible cause for this?

Thanks @elukey! I also wanted to ask about caching; it appears that caching is no longer working for this dashboard - do you know of a possible cause for this?

Yep! We opened T273850 for this, basically it was brought up to us that any cached dashboard/widget skips security check with the current config, so we'll keep it disabled for a bit (until we figure out a solution).

Should order in the WHERE be impactful?

It is! Partition fields of the where clause are worked on their own, so their order is not of importance, but for other fields it is. Clauses are executed from top to bottom, so if you have split_part(event.l[cardinality(event.l)],'|', 4) = '6' filter before cardinality(event.l) > 0, you end up with an error. More explanations on specific cases below :)

  • The adjusted filter suggestion seems to works well in a pie chart, but not in a table or in an area chart

On this pie chart, the event.r IS NULL filter removed the problematic row, meaning the query would even work without the cardinality(event.l) > 0 filter.

  • It seems that the position of the in-chart filter does not affect the position in the WHERE clause of the resulting query.

Not exactly - From my tests here is how where clauses are ordered:

  • Generated-clauses first, in order of appearance in the filter-box. IE the clause that don't use and-written SQL.
  • Then hand-written SQL clauses, bundled into an single clause (between parenthesis), in order of appearance in the filter-box.

This makes it tricky to get the order right in the mind, as the two classes of clauses are separated in the resulting query but can be mixed in the box.

Now with that ordering understanding, the cardinality(event.l) > 0 hand-written SQL clause will always happen AFTER a split_part(event.l[cardinality(event.l)],'|', 4) = '6' generated clause.
The solution I have found for this is to add the event_l_card computed field to the table definition (defined as cardinality(event.l), numeric type). You can now use that field in a filter using the UI, making the clause being generated and therefore going first if ordered correctly in the filter-box :)
What a mess :)

Let me know this all makes sense!