Page MenuHomePhabricator

Superset Presto LIMIT >10000 error
Closed, ResolvedPublic

Description

Running query below, selected pulldown LIMIT 100000, and added LIMIT 100000 to query.

First got error message (!) 10000 rows returned It appears that the number of rows in the query results displayed was limited on the server side to the 10000 limit.

Ran query again and got (!) 11161 rows returned It appears that the number of rows in the query results displayed was limited on the server side to the 11161 limit.

(Result should be 80633 rows).

WITH parsed_json AS (
SELECT
event.app_install_id AS install_id,
event.session_token AS session_id,
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks,
event.time_spent
FROM mobilewikiappsuggestededits
WHERE year = 2021 AND month = 4
AND event.edit_tasks != '{}'
AND event.time_spent >= 0
AND regexp_like(useragent.wmf_app_version, '-r-')
)
SELECT
install_id,
session_id,
-- Session summary:
SUM(time_spent) AS total_time_spent, -- in seconds
COALESCE(SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-c'), 'suc'), 0)), 0) AS n_edits 
FROM parsed_json
GROUP BY install_id, session_id
WITH parsed_json AS (
SELECT
event.app_install_id AS install_id,
event.session_token AS session_id,
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks,
event.time_spent
FROM mobilewikiappsuggestededits
WHERE year = 2021 AND month = 4
AND event.edit_tasks != '{}'
AND event.time_spent >= 0
-- AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
AND regexp_like(useragent.wmf_app_version, '-r-')
)
SELECT
install_id,
session_id,
-- Session summary:
SUM(time_spent) AS total_time_spent, -- in seconds
COALESCE(SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-c'), 'suc'), 0)), 0) AS n_edits 
FROM parsed_json
GROUP BY install_id, session_id
WITH parsed_json AS (
SELECT
event.app_install_id AS install_id,
event.session_token AS session_id,
CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks,
event.time_spent
FROM mobilewikiappsuggestededits
WHERE year = 2021 AND month = 4
AND event.edit_tasks != '{}'
AND event.time_spent >= 0
-- AND useragent.wmf_app_version = '2.7.50322-r-2020-06-08' -- note: new ver replaces 'add-description' with 'a-d' and 'successes' with 'suc'
AND regexp_like(useragent.wmf_app_version, '-r-')
)
SELECT
install_id,
session_id,
-- Session summary:
SUM(time_spent) AS total_time_spent, -- in seconds
COALESCE(SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-c'), 'suc'), 0)), 0) AS n_edits 
FROM parsed_json
GROUP BY install_id, session_id
LIMIT 100000

Event Timeline

SNowick_WMF renamed this task from Superset Presto LIMIT <10000 error to Superset Presto LIMIT >10000 error .May 12 2021, 12:52 AM
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

I'm just as lost as you are so far... it's expected behavior and not a bug, but I can't figure out what configuration triggers it and why "https://github.com/apache/superset/blob/9773aba522e957ed9423045ca153219638a85d2f/superset/translations/en/LC_MESSAGES/messages.json#L1017"

My guess is that it's some kind of quota like the timeout but for number of records, and maybe when the query is fresh it can "sneak by" a few more results the second time. Queries like this would probably benefit from a more official pipeline though, maybe? Like an AirFlow job running spark sql? Or is it pretty ad-hoc?

Hi @SNowick_WMF, I double checked the number of expected rows and got 11161, not 80633 as you mentioned.

I ran:

WITH parsed_json AS (
  SELECT
  event.app_install_id AS install_id,
  event.session_token AS session_id,
  CAST(JSON_PARSE(event.edit_tasks) AS MAP(VARCHAR, MAP(VARCHAR, INTEGER))) AS edit_tasks,
  event.time_spent
  FROM event.mobilewikiappsuggestededits
  WHERE year = 2021 AND month = 4
  AND event.edit_tasks != '{}'
  AND event.time_spent >= 0
  AND regexp_like(useragent.wmf_app_version, '-r-')
),

t AS (
SELECT
install_id,
session_id,
-- Session summary:
SUM(time_spent) AS total_time_spent, -- in seconds
COALESCE(SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-c'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 'a-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-d'), 'suc'), 0)) + SUM(COALESCE(ELEMENT_AT(ELEMENT_AT(edit_tasks, 't-c'), 'suc'), 0)), 0) AS n_edits 
FROM parsed_json
GROUP BY install_id, session_id
)

select count(1) FROM t

I let you close or update the task as needed.

Milimetric moved this task from In Progress to Done on the Analytics-Kanban board.
Milimetric added a subscriber: Milimetric.

Hi thanks for looking into this - I ran the query again on Superset and got the 11161, but when I ran it from Jupyter I got the higher result but I omitted the event.edit_tasks, then I didn't follow up on this. I will close the task.

@Milimetric This query was part of a bigger report and I am in the process of moving everything, was thinking of using Reportupdater but Airflow would work too.

@SNowick_WMF, Reportupdater is fine, it's what's available right now. We don't want to slow you down waiting for AirFlow