Page MenuHomePhabricator

Investigate queue consumer slowdown in pretest of July 13 2022
Closed, ResolvedPublic

Description

From 16:00 to 19:00 UTC there was a banner test with lots of traffic.

The donation queue consumer was only processing about 3 per second, far too slow to keep up with donations. Some donors had to wait more than an hour for their thank you letters.

There was some manual merge activity going on in the Civi UI, but no long-running queries.

We turned on debug logging for the process-control output for a few of the consumer runs. debug output can be found in log files from
donations_queue_consume-20220713-180502.log to
donations_queue_consume-20220713-182302.log

The longest time gap between subsequent lines there is between the 'Beginning DB transaction' log line that includes a json blob of the donation and the 'Contribution array for civicrm_contribution_add()' log line. This is about 0.2 seconds. Let's see what's happening in there and if the June 30th Civi core update had anything to do with the slowdown.

Event Timeline

greg subscribed.

Is QC debug logging enabled?

After seeing the slowdown, we enabled debug logging for 20 minutes to see if the additional log lines would help pinpoint the slow part of the code. They didn't narrow it down a lot, and the code ran just as fast/slow with debug logging as it had been running before.

There's an en6C pre-test today so that would be a good time to do some investigation.

Change 817874 had a related patch set uploaded (by Ejegg; author: Ejegg):

[wikimedia/fundraising/crm@master] Debug logging for queue speed investigation

https://gerrit.wikimedia.org/r/817874

Change 817874 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Debug logging for queue speed investigation

https://gerrit.wikimedia.org/r/817874

Change 817888 had a related patch set uploaded (by Ejegg; author: Ejegg):

[wikimedia/fundraising/crm@deployment] Debug logging for queue speed investigation

https://gerrit.wikimedia.org/r/817888

Change 817888 merged by Ejegg:

[wikimedia/fundraising/crm@deployment] Debug logging for queue speed investigation

https://gerrit.wikimedia.org/r/817888

So we enabled some extra debug logging for some of today's test, and also looked more closely at the timers we've had all along.
Our average time for the 'create contact' timer is running 150-160 ms when traffic is heavy: https://frmon.frdev.wikimedia.org/d/Pq1YNMviz/fundraising-overview?orgId=1&from=1658952969524&to=1658968183364&viewPanel=33
Last December that timer was closer to 45 ms: https://frmon.frdev.wikimedia.org/d/Pq1YNMviz/fundraising-overview?orgId=1&from=1638543600000&to=1638640799000&viewPanel=33

The average 'create contact' time has indeed been higher since the CiviCRM core upgrade of June 30th.

The 'create contact' timer would better be called 'create or update contact', since it encompasses both the check for a matching contact (usually taking around 5ms in this test) and then either creating a new contact or updating the matching contact's address, custom fields, and preferred language.

In the current test we found that creating a new contact was still fairly fast (around 40ms), but that updating a contact took from 140-260 ms.

During this run, the majority of donations came from repeat donors, so the code more often went through the longer update path:
22,363 updates vs 13,185 new contacts

The extra logging was on from file
donations_queue_consume-20220727-212901.log
through to when the queue cleared out at
donations_queue_consume-20220728-003202.log

So we should probably dig into CiviCRM core changes in updating an existing contact between version 5.48 (deployed in April) and version 5.51.

Hmm interesting - that was a moderated version jump which makes it harder to bi-sect but will ponder what changed

Cstone renamed this task from Investigate queue consumer slowdown in pretest of June 13 2022 to Investigate queue consumer slowdown in pretest of July 13 2022.Jul 28 2022, 3:53 PM

So I realize we're missing a data point here - if the ration of repeat to new donors is somehow a lot higher now than it was in December, that might explain the increase in average processing time without blaming a big core change.

We used an old script on Staging to test performance and I was able to replicate a fairly consistent slow down in data processing for updates. I was also able to observe the greetings were populated for update mode but not create - the speed impact of that is worth checking out

modenumberspeedcommand
create2120 per minute.drush @wmff qperf-d 2 "Import 2 contacts to warm up the cache, create mode"
update2120 per minute.drush @wmff qperf-d 2 "Import 2 contacts to warm up the cache, update mode"
create500638 per minutedrush @wmff qperf-d 500 "Import 500 contacts -create mode"
update500231 per minutedrush @wmff qperf-d 500 "Import 500 contacts -update mode"

Change 819766 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Blunt force trauma

https://gerrit.wikimedia.org/r/819766

Hacking out greeting process (per the very blunt patch) put the update to the same speed as the create - so that seems to be where the slow down is. We are being asked for this data in Acoustic and for for major gifts so options are

  • say No
  • improve the speed of processing them ( I have some ideas about how to do this as the code duplicates effort)
  • do greeting processing as a 'do-later' cron job (ie add after the fact)

Change 819767 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Improve performance of greeting procssing

https://gerrit.wikimedia.org/r/819767

I tried a performance improvement on token processing per https://github.com/civicrm/civicrm-core/pull/24129 - but speed for update is fairly unchanged - so calling three times vs once is not the magic trick

Change 819768 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Update to latest civiToken

https://gerrit.wikimedia.org/r/819768

Hmm I also tried updating the tokens extension - no change

Average performance is 219 per minute

Marginal improvement disabling it - Average performance is 227 per minute - not the magic bullet (on a second run it was slightly slower - so no change IMHO)

Change 819770 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Do not merge - remove legacy token evaluation

https://gerrit.wikimedia.org/r/819770

Also no great change from hacking out legacy token eval from the processor Average performance is 222 per minute

Change 819771 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Do not merge - remove token processor smarty eval

https://gerrit.wikimedia.org/r/819771

Or even hacking out the smarty handling... Average performance is 222 per minute

Well - I guess if I can replicate it on staging I can replicate locally so time for some php profiling....

Change 819773 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Do not merge Cache getfields

https://gerrit.wikimedia.org/r/819773

OK - some more updates

Tech details (how)

  • I managed to find my code from before to analyse the query log - I generated it on Staging by setting

define('CIVICRM_DEBUG_LOG_QUERY', 'q'); in civicrm.settings.php ('q' is just a random string that is included in the output file name).

I then ran the qperf drush @wmff qperf-d 5 "Import 5" to import 5 contacts & commented the above out again.

I dug into the file created in drupal/sites/default/files/civicrm/ConfigAndLog and copied one row's worth of queries to a file called 'out.txt' on my local.

Also on my local I enabled the systemtools (Home for WMF helpers) extension and then ran the following

echo '{"fileName":"/srv/civi-sites/wmff/out.txt", "version":4}' | drush @wmff cvapi  Querylog.parse --in=json

With a couple of tweaks the above generated a csv of all the queries that ran -

https://docs.google.com/spreadsheets/d/1vKBDVlY38tp_zzoRo0w1OYA-edsGWwGWFutqbHB0O90/edit?usp=sharing

Analysis

  • the total query-time in the import for one update-row was 0.05122 seconds - so if it were just the queries & no php time we could do 20 rows per second
  • the total query-time in the import for queries relating to the greeting processing was 0.003453
  • the total query-time in the import for queries relating to truncating the acl cache was 0.009524 - see below for comments on this
  • it seems unlikely that the queries are a significant factor, although I did put up an upstream PR to reduce them - see https://github.com/civicrm/civicrm-core/pull/24129 - so I'm going to focus on php profiling tomorrow ( I always thought a php slow-down was more likely, I just did sql first as it is easier).

ACL truncate

  • the acl truncates account for nearly 20%
  • the queries are not called when the acl_cache_refresh_mode setting is deterministic

drush @wmff cvapi Setting.get sequential=1 return="acl_cache_refresh_mode"

  • it IS deterministic on prod - I can't quite see where that is being set - maybe somewhere outside the webroot
  • so I added the setting to civicrm.settings.php on staging - $civicrm_setting['domain']['acl_cache_refresh_mode'] = 'deterministic';

Change 820064 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Tweaks to query file parsing to get more data

https://gerrit.wikimedia.org/r/820064

Change 819766 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Blunt force trauma

Reason:

just testing

https://gerrit.wikimedia.org/r/819766

Change 820234 had a related patch set uploaded (by Ejegg; author: Ejegg):

[wikimedia/fundraising/crm@master] Add timers for insert + update API calls

https://gerrit.wikimedia.org/r/820234

Change 820258 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Improve performance of greeting procssing

https://gerrit.wikimedia.org/r/820258

Change 819766 restored by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Blunt force trauma

https://gerrit.wikimedia.org/r/819766

Change 820064 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Tweaks to query file parsing to get more data

https://gerrit.wikimedia.org/r/820064

Change 820576 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm/civicrm@master] Fix cache miss in FastArrays on 'has()'

https://gerrit.wikimedia.org/r/820576

Change 819770 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Do not merge - remove legacy token evaluation

Reason:

https://gerrit.wikimedia.org/r/819770

Change 819771 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Do not merge - remove token processor smarty eval

Reason:

https://gerrit.wikimedia.org/r/819771

Change 819766 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Blunt force trauma

Reason:

https://gerrit.wikimedia.org/r/819766

Change 820234 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Add timers for insert + update API calls

https://gerrit.wikimedia.org/r/820234

Change 820576 merged by jenkins-bot:

[wikimedia/fundraising/crm/civicrm@master] Fix cache miss in FastArrays on 'has()'

https://gerrit.wikimedia.org/r/820576

We just deployed the has patch - it affects update only & we got a clear win on update

image.png (1×2 px, 88 KB)

Hmm & then it all went haywire - possibly due to the dedupe job kicking off? - but, hard to see on the output but we had lower lows for create contact (marginally) after the second patch

image.png (1×2 px, 93 KB)

Hmm Update - the dip is the 'sweet spot' between the two patches .... it MUST be the flush but? I can see we are adding new option values (duplicate prefixes) way too often so I might check that first

NULL |      NULL |          NULL | 2022-08-17 22:25:26 | 62fd6a47258fdSXgz |        NULL | Insert     | NULL | NULL  |  34777 |
89996Dr.1188Dr.NULL001188NULL001NULLNULLNULL2022-08-17 22:38:5862fd6dcae37e6IjoxNULLInsertNULLNULL34778
90007Jr2901JrNULL002901NULL001NULLNULLNULL2022-08-17 22:43:3362fd6e7e5543emon9NULLInsertNULLNULL34779
90016Dr.1189Dr.NULL001189NULL001NULLNULLNULL2022-08-17 23:06:1062fd741fa87a2ORoSNULLInsertNULLNULL34780
90027Jr2902JrNULL002902NULL001NULLNULLNULL2022-08-17 23:15:1362fd763a590a8J5kyNULLInsertNULLNULL34781
90036Dr.1190Dr.NULL001190NULL001NULLNULLNULL2022-08-17 23:21:4562fd77a3a2bf0BfciNULLInsertNULLNULL34782

image.png (1×2 px, 85 KB)

UPDATE - my bad - the return to slowness was cos I messed up the submodule update and DROPPED the good patch - when deploying the medium patch. So the graph tells us what it looks like it tells us...

Change 819767 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Improve performance of greeting procssing

Reason:

https://gerrit.wikimedia.org/r/819767

Change 819773 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Do not merge Cache getfields

Reason:

https://gerrit.wikimedia.org/r/819773

Change 820258 abandoned by Eileen:

[wikimedia/fundraising/crm/civicrm@master] Improve performance of greeting procssing

Reason:

https://gerrit.wikimedia.org/r/820258

Change 819768 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Update to latest civiToken

https://gerrit.wikimedia.org/r/819768

Eileenmcnaughton claimed this task.
Eileenmcnaughton moved this task from Backlog to Done on the Fundraising Sprint Potato board.
Eileenmcnaughton set Final Story Points to 8.