Page MenuHomePhabricator

Efficiency of querying views
Closed, ResolvedPublic

Description

The query plan for the all_donations_cube_dev is pretty complex and does not allow for a 60 second execution in Superset for even a simple query. Comparing these two query plans, for instance:

explain select sum(total_amount) from analytics_ad_hoc.all_donations_cube_dev where receive_date>='2020-07-01';

and

explain select sum(total_amount) from analytics.all_donations_cube where receive_date>='2020-07-01';

The table from analytics executes in a few seconds while the view from analytics_ad_hoc spins (I haven't successfully executed this, I just know it takes a while).

Using views would be great when available if there is a way to compensate for this / help it along. If you have any ideas, that would be greatly appreciated!

Otherwise, tables work well, but of course take up more resources.

Related Objects

StatusSubtypeAssignedTask
ResolvedDwisehaupt

Event Timeline

This query takes 42.7 seconds on last run to execute in Superset:

SELECT donation_type AS donation_type,
       count(id) AS `Donations`,
       count(DISTINCT id) AS `of Donations`
FROM analytics.all_donations_cube
WHERE receive_date >= STR_TO_DATE('2019-07-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
  AND receive_date < STR_TO_DATE('2020-07-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
GROUP BY donation_type

This query takes 42.7 seconds on last run to execute in Superset:

SELECT donation_type AS donation_type,
       count(id) AS `Donations`,
       count(DISTINCT id) AS `of Donations`
FROM analytics.all_donations_cube
WHERE receive_date >= STR_TO_DATE('2019-07-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
  AND receive_date < STR_TO_DATE('2020-07-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
GROUP BY donation_type

I think you can cut the time significantly by adding an index to 'donation_type' and simplifying the query. Since 'id' is primary key, it's unique for every row so count(distinct(id)) is the same as count(id) and count(*). With added index this ran in about half the time for the same result:

SELECT donation_type, count(*) AS `Donations` FROM all_donations_cube WHERE receive_date BETWEEN '2019-07-01' AND '2020-07-01' GROUP BY donation_type;

I thought indexing receive_date would help but in my testing it did not!

This query from today's slow_log takes about 55s on frdev:

SELECT month_received_sorted AS month_received_sorted,
       gateway_not_null AS gateway_not_null,
       COUNT(DISTINCT id) AS total_donations
FROM all_donations_cube
WHERE receive_date >= STR_TO_DATE('2020-07-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
  AND receive_date < STR_TO_DATE('2021-07-01 00:00:00.000000', '%Y-%m-%d %H:%i:%s.%f')
GROUP BY month_received_sorted,
         gateway_not_null
ORDER BY total_donations DESC
LIMIT 10000;

Modified query runs in around 10s:

SELECT DATE_FORMAT(receive_date, '%y_%M') AS month_received_sorted,
gateway_not_null,
COUNT(*) AS total_donations
FROM all_donations_cube 
WHERE receive_date BETWEEN '2020-07-01' AND '2021-07-01'
GROUP BY month_received_sorted, gateway_not_null
ORDER BY total_donations DESC;

Correction: I later realized that I had misunderstood the month_received_sorted column, so the DATE_FORMAT clause is wrong. Revised, runs in about 13s:

SELECT month_received_sorted,
gateway_not_null,
COUNT(*) AS total_donations
FROM all_donations_cube 
WHERE receive_date BETWEEN '2020-07-01' AND '2021-07-01'
GROUP BY month_received_sorted, gateway_not_null
ORDER BY total_donations DESC;

I took a look at how the online_goal_tracking_FY2021 view is created and use. I'm not super familiar with mysql views but based on some reading my first thought was to add a DATE column to all_donations_cube, and index that as well as donation_type, and also to add an index that covers both. Testing a couple of queries from Big Fancy Dashboard the improvement was very modest, 10% at best.

So instead I tried replacing online_goal_tracking_FY2021 with an actual table populated by the same query that creates the view.

CREATE TABLE online_goal_tracking_FY2021_test2 ( 
  receive_day date DEFAULT NULL,
  daily_total decimal(42,2) DEFAULT NULL,
  cumulative_total decimal(64,2) DEFAULT NULL,
  distance_to_goal decimal(65,2) DEFAULT NULL,
  distance_to_stretch_goal decimal(65,2) DEFAULT NULL,
  KEY receive_day (receive_day)
) ENGINE=InnoDB DEFAULT CHARSET=binary

With the above indexes already added to all_donations_cube, it took about 16 seconds to populate the table. After doing so queries that were taking 5-10s using the view were reduced to the single millisecond range. My thinking is that unless there's a really good reason I'm missing to use a view here, it may be a whole lot more efficient to use regular tables and refresh them whenever all_donations_cube is updated.

Jgreen triaged this task as Medium priority.
Jgreen moved this task from Up Next to Done on the fundraising-tech-ops board.