Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | None | T322791 Review our queue speed - is it OK? | |||
Resolved | None | T323081 Oh Behave Dedupe |
Event Timeline
Time boxed spike in current sprint.
Big email send on Thursday to use as a data point.
We just had a 1 hour test run - observations
The overall pattern was "around 6 donations per second, disrupted by a period of deadlocks". I feel like we were at 7 (up from 4) after I did the first round of Redis fixes & am hoping we might get that '1' back with the patch I hope to roll out during traffic tomorrow
- The deduper should not run during periods of high traffic - but it DID start running before the test was over
I'm gonna spin off a phab for that
- I analysed the Redis misses during the test. I rolled out one fix - but that was of limited impact (which I expected it to be) & mostly just made the output a little easier. I further identifed another fix that I think is likely to be more helpful - https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/crm/+/856703 - but the test ended before I could roll it out. It would be good to test that tomorrow. I will comment futher on T322172 but I think there is more room to move here.
Image of processing speed graph (linked above)
I've been looking at the raw stats generated during tonight's traffic. There's a reasonably consistent negative correlation between donations processed and processing rate, i.e. the more we process, the slower the processing rate. The performance is degrading for some reason. If we can exclude other factors and confirm this is definitely happening, possibly by recording the processing of only non-recurring donations, we could try thinking of ways to prevent this performance drop over time by maybe caching connections and such.
Note: there are lots of things that could influence the processing rates. e.g. a recurring donation import is slower than a one-time. Our import stages breakouts show us this https://frmon.frdev.wikimedia.org/d/Pq1YNMviz/fundraising-overview?refresh=1m&orgId=1&viewPanel=33&from=now-3h&to=now
Here's some sample data:
donations_count{gateway="adyen"} 412
donations_count{gateway="amazon"} 11
donations_count{gateway="paypal_ec"} 329
donations_count{gateway="all"} 752
donations_processing_rate{period="batch"} 172.2097260952
donations_processing_rate{period="1s"} 4.3667684575741
donations_count{gateway="adyen"} 383
donations_count{gateway="paypal_ec"} 356
donations_count{gateway="amazon"} 16
donations_count{gateway="all"} 755
donations_processing_rate{period="batch"} 172.62205195427
donations_processing_rate{period="1s"} 4.3737169814203
donations_count{gateway="adyen"} 162
donations_count{gateway="paypal_ec"} 155
donations_count{gateway="amazon"} 9
donations_count{gateway="all"} 326
donations_processing_rate{period="batch"} 48.713936090469
donations_processing_rate{period="1s"} 6.6921301410456
donations_count{gateway="paypal_ec"} 300
donations_count{gateway="adyen"} 372
donations_count{gateway="amazon"} 8
donations_count{gateway="ingenico"} 2
donations_count{gateway="all"} 682
donations_processing_rate{period="batch"} 157.74275708199
donations_processing_rate{period="1s"} 4.3234948635108
donations_count{gateway="adyen"} 278
donations_count{gateway="paypal_ec"} 186
donations_count{gateway="amazon"} 9
donations_count{gateway="ingenico"} 1
donations_count{gateway="astropay"} 1
donations_count{gateway="all"} 475
donations_processing_rate{period="batch"} 82.511066913605
donations_processing_rate{period="1s"} 5.7568035145802
donations_count{gateway="adyen"} 137
donations_count{gateway="paypal_ec"} 111
donations_count{gateway="amazon"} 4
donations_count{gateway="all"} 252
donations_processing_rate{period="batch"} 34.203144073486
donations_processing_rate{period="1s"} 7.3677437214126
There might be something in the mix that is slow & longer ones have more chance of hitting it?
Ok this graph does a better job of showing how things are slowing down as the volume picks up https://frmon.frdev.wikimedia.org/d/Pq1YNMviz/fundraising-overview?refresh=1m&orgId=1&from=now-2d&to=now&viewPanel=22
I've created T323273: Update DonationStatsCollector to factor in the duplicate donations when calculating processing rates and run time. as a follow on to @Ejegg's dequeue stats patch