Page MenuHomePhabricator

Investigate whether we can use mysql EXPLAIN checks to prevent runaway dedupe queries
Open, Stalled, MediumPublicSpike

Description

Spike?

Event Timeline

Dwisehaupt changed the subtype of this task from "Task" to "Spike".Feb 8 2024, 5:29 PM
Dwisehaupt subscribed.

Changing to spike after discussion in standup.

So, it looks like we can get advance warnings on queries that will be expensive to run. EXPLAIN gives us back two valuable bits of data that we can check before executing the query

In real terms this info looks like this:

MariaDB [civicrm]> EXPLAIN SELECT * FROM civicrm_contribution;
+------+-------------+----------------------+------+---------------+------+---------+------+------+-------+
| id   | select_type | table                | type | possible_keys | key  | key_len | ref  | rows | Extra |
+------+-------------+----------------------+------+---------------+------+---------+------+------+-------+
|    1 | SIMPLE      | civicrm_contribution | ALL  | NULL          | NULL | NULL    | NULL | 114  |       |
+------+-------------+----------------------+------+---------------+------+---------+------+------+-------+

Taken from MySQL.com regarding the use of ALL as an access type:

"A full table scan is done for each combination of rows from the previous tables. This is normally not good if the table is the first table not marked const, and usually very bad in all other cases. Normally, you can avoid ALL by adding indexes that enable row retrieval from the table based on constant values or column values from earlier tables."

As a starting point, we could test out making an EXPLAIN query in advance of any generated dedupe queries. We could detect the access type of ALL and determine a row count threshold that, if breached, will cause the UI to show a warning informing the user that they're potentially about to run a slow and expensive query.

greg triaged this task as Medium priority.Feb 27 2024, 10:33 PM

Okay, I'm giving up on this. Although it's a nice idea, it's probably going to be a ton of work and result in some ugly code. I'll leave my findings for posterity.

I did start adding a UI prompt to the angular code here prior to dispatching the dedupe API call and went down the road of adding a new API action called civicrm_api3_dedupe_getduplicates_explain just to test, but it's gonna be a real can of worms to turn the standard civicrm_api3('Contact', 'get'.. into an EXPLAIN query.

I started adding to https://github.com/wikimedia/wikimedia-fundraising-crm/blob/f04c98fa8834cf592cec68b3d471229d645aba99/drupal/sites/all/modules/civicrm/api/v3/Dedupe.php#L140:

function civicrm_api3_dedupe_getduplicates_explain($params) {
  $options = _civicrm_api3_get_options_from_params($params);
  $validFieldsForRetrieval = civicrm_api3('Contact', 'getfields', ['action' => 'get'])['values'];
  $filteredCriteria = isset($params['criteria']['contact']) ? array_intersect_key($params['criteria']['contact'], $validFieldsForRetrieval) : [];
  $count=0;

  // only EXPLAIN worthy queries
  if (!empty($params['criteria']) || !empty($options['limit']) || !empty($params['rule_group_id'])) {
    // somehow extract the SQL from this before it's executed and pipe it to EXPLAIN $query?
    $explainQueryResult = civicrm_api3('Contact', 'get', array_merge([
      'options' => ['limit' => $options['limit']],
      'return' => 'id',
      'check_permissions' => TRUE,
      'contact_type' => civicrm_api3('RuleGroup', 'getvalue',
        ['id' => $params['rule_group_id'], 'return' => 'contact_type']),
    ], $filteredCriteria));

    $count = $explainQueryResult['count'];
  }

  return civicrm_api3_create_success($count);
}

The problem is that we'd need to add code either via hooks or inside the API action to intercept the SQL, as I couldn't see an easy way to do this already. The query builder eventually builds the SQL down here but it's probably going to be a ton of work to cleanly extract that so for that reason I think we need to look at other options :(