
Track rate of new donations added to civicrm by the donations queue consumer, in prometheus
Closed, Resolved (Public)


In order to be able to get a sense for the absolute maximum number of donations our system can handle before messages start backing up in the queue, I would like to start tracking how quickly the donations queue consumer is adding new donations to civicrm.

Note: This is not total number of messages consumed in one run. We'd need a new counter for new donations added.

It would be especially useful if this were done in such a way that an accurate rate of consumption could be captured regardless of load volume during any particular run (assuming there is at least one valid donation to work with). That way, we could detect performance hits in the off-season.

Event Timeline

K4-713 renamed this task from Track rate of new donations added to divicrm by the donations queue consumer, in prometheus to Track rate of new donations added to civicrm by the donations queue consumer, in prometheus.Oct 27 2017, 11:02 PM
K4-713 created this task.
Ejegg added a subscriber: Ejegg.Nov 7 2017, 6:06 PM

Oh hey, looks like we ARE already counting this correctly after all. The donations queue consumer is only incrementing counts on real donation imports - duplicate messages throw an exception and skip that part.

Ejegg closed this task as Invalid.Nov 8 2017, 3:06 AM
jgleeson reopened this task as Open.EditedNov 17 2017, 1:40 PM

I've reopened this task as we're currently not reporting the rate of consumption from the perspective of average messages consumed per time period.

I'm going to record and export metrics which report the number of donations processed per (1/5/10/30) seconds by using the following calculation:

// Determine our donations-processed-per-second rate and multiply that value by our metric time period.
$donationsProcessedPerSecond = $totalDonationsProcessed / $totalProcessTimeInSeconds;
$metric = $donationsProcessedPerSecond * $metricTimePeriod;
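As a minimal Python sketch of that calculation (the function name is illustrative, and the sample inputs are taken from the figures reported later in this task):

```python
def scaled_rate(total_processed, total_seconds, period_seconds):
    # Donations processed per second, scaled up to the reporting period.
    per_second = total_processed / total_seconds
    return per_second * period_seconds

# e.g. 5 donations processed in ~2.475 seconds, reported over 1/5/10/30 s periods
metrics = {p: scaled_rate(5, 2.4749479293823, p) for p in (1, 5, 10, 30)}
```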
jgleeson claimed this task.Nov 17 2017, 1:42 PM
jgleeson moved this task from Backlog to Doing on the Fundraising Sprint Vaporwerewolf board.
jgleeson added a comment.EditedNov 17 2017, 4:43 PM

There's a problem with my above calculation: the formula for scaling up metricTimePeriods assumes that the same number of donations is processed per second across all metricTimePeriods, which is not guaranteed; in fact, it's probably unlikely.

For example, one particularly meaty message may take 0.5 seconds to process, whereas the following five messages could take 0.1 seconds each. Using totalAmount/totalProcessingTime blends these differences together, distorting the real per-message processing times.
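A quick numerical sketch of that blending effect, using the hypothetical timings from the example above:

```python
# Hypothetical per-message processing times: one slow message, five fast ones.
times = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

# The blended average: 6 messages in ~1.0 s looks like ~0.167 s per message...
average_rate = len(times) / sum(times)

# ...which hides the fact that one message took five times as long as the rest.
slowest = max(times)
```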

We could make the metrics less general by recording gateway-specific consumption rates, which would give visibility into average processing times per gateway. This still doesn't solve the processing-time blending problem, but it would make it easier to identify slow messages by gateway.

jgleeson added a comment.EditedNov 17 2017, 6:12 PM

Using the above calculation, I am getting the following sample output data:

donations_average_processed_per_5_seconds 8.6004311140635
donations_average_processed_per_10_seconds 17.200862228127
donations_average_processed_per_30_seconds 51.602586684381
donations_average_processed_per_second 1.7200862228127

Note: I think this is a good opportunity to look at using metric labels so the output could look something like

donations_average_processing_rate{seconds=1} 1.7200862228127
donations_average_processing_rate{seconds=5} 8.6004311140635
donations_average_processing_rate{seconds=10} 17.200862228127
donations_average_processing_rate{seconds=30} 51.602586684381
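As a sketch of how such labelled lines could be rendered (note that the Prometheus text exposition format requires label values to be double-quoted, so `seconds=1` would actually be emitted as `seconds="1"`; the helper name here is illustrative):

```python
def format_metric(name, value, **labels):
    # Render one line of Prometheus text exposition format.
    # Label values are double-quoted, as the format requires.
    if labels:
        body = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"

line = format_metric("donations_average_processing_rate", 1.7200862228127, seconds=1)
# → donations_average_processing_rate{seconds="1"} 1.7200862228127
```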

Ejegg added a comment.Nov 17 2017, 6:33 PM

One more wrinkle to consider is that we're not running these jobs flat out. For example, we can send between 10 and 11 thank you emails per second when we're creating CiviMail records. However, we only run the thank you job for 40 seconds in every 2 minutes, so the overall average rate is more like 3.5 per second.
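The arithmetic behind that overall rate, as a quick sketch (numbers taken from the comment above):

```python
burst_rate = 10.5        # ~10-11 thank you emails per second while the job runs
duty_cycle = 40 / 120    # the job only runs for 40 seconds in every 2 minutes
overall_rate = burst_rate * duty_cycle   # ≈ 3.5 emails per second overall
```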

Ejegg added a comment.Nov 17 2017, 6:34 PM

The scheduling data is a bit scattered, however. The fact that it runs every 2 minutes is controlled by a crontab generated by process-control, and the fact that it only runs 40 seconds each time is controlled by a setting in CiviCRM

jgleeson added a comment.EditedNov 20 2017, 2:39 PM

One more wrinkle to consider is that we're not running these jobs flat out. For example, we can send between 10 and 11 thank you emails per second when we're creating CiviMail records. However, we only run the thank you job for 40 seconds in every 2 minutes, so the overall average rate is more like 3.5 per second.

Hey Elliot, for the first attempt at measuring processing times, I have recorded the time spent during processing only (regardless of how long that is), effectively wrapping the dequeuing stage. That should avoid any potential inconsistencies introduced by hard time limits.
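One way to wrap just the processing stage with a timer, as a sketch (the message list and the import callable are hypothetical stand-ins for the real queue consumer):

```python
import time

def timed_consume(messages, import_donation):
    # Start the clock only once processing begins, so idle time between
    # job runs and any hard time limits don't skew the measured rate.
    start = time.monotonic()
    processed = 0
    for message in messages:
        import_donation(message)  # hypothetical import step
        processed += 1
    elapsed = time.monotonic() - start
    return processed, elapsed
```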

jgleeson changed the task status from Open to Stalled.Nov 20 2017, 3:20 PM

tl;dr: Being iterative is okay; we can make more graphs for more context later, but I still want this one more or less as described.

I really like the idea of pulling apart the times to process messages on a per-gateway basis. However, we shouldn't go down any per-gateway roads until we get our private stats box. An average for all incoming donation messages will still be useful, even when we can also pull out specifics - it will enable us to identify things that are misbehaving by taking longer than the average message. And, an average would still let us see when we've introduced new performance problems.

Your points about a total system average are well-taken. Ideally, I would prefer having something that took the pauses between jobs into account, so any tuning we do on the job frequency and duration would show as a visible gain on a graph somewhere. I think we can go forward with this and strip that one off into a new card, though. The thought occurs that this sort of ideal processing time would help us know how well we've tuned the system, when we have both next to each other.

The following is a sample stats output, including a first attempt at handling Prometheus metric labels:

donations_enqueued_age_amazon 2405424
donations_enqueued_age_amazon 2405425
donations_enqueued_age_globalcollect 2405425
donations_enqueued_age_worldpay 2405425
donations_enqueued_age_worldpay 2405426
donations_gateway_amazon 2
donations_gateway_globalcollect 1
donations_gateway_worldpay 2
donations_overall_donations 5
donations_overall_processing_time 2.4749479293823
donations_transaction_age_amazon 62587210
donations_transaction_age_amazon 62523625
donations_transaction_age_globalcollect 62415462
donations_transaction_age_worldpay 62366114
donations_transaction_age_worldpay 62575646
donations_average_donations_processed{filter="1s"} 2.0202445233859
donations_average_donations_processed{filter="5s"} 10.10122261693
donations_average_donations_processed{filter="10s"} 20.202445233859
donations_average_donations_processed{filter="30s"} 60.607335701578
donations_average_enqueued_age_amazon 2405424.5
donations_average_enqueued_age_globalcollect 2405425
donations_average_enqueued_age_worldpay 2405425.5
donations_average_transaction_age_amazon 62555417.5
donations_average_transaction_age_globalcollect 62415462
donations_average_transaction_age_worldpay 62470880
donations_overall_average_enqueued_age 2405425
donations_overall_average_transaction_age 62493611.4

I'm now thinking "filter" might not be the best label. Does anyone have any suggestions for a better naming convention?

jgleeson added a comment.EditedDec 31 2017, 5:37 PM

The default "filter" label name has been removed in favour of more sensible key=value usage. Currently producing:

donations_average_processed{period="1s"} 3.2845685177583
donations_average_processed{period="5s"} 16.422842588792
donations_average_processed{period="10s"} 32.845685177583
donations_average_processed{period="30s"} 98.53705553275