Page MenuHomePhabricator

Fix stuck recurring contributions, add alarm
Closed, ResolvedPublic

Description

looks like 782 contribution_recur rows stuck in status 'Processing'

Many of them are old globalcollect recurrings that should have been charged on 2020-03-26. See if we can get them started again, hopefully tokenized for the new API. @MBeat33 what's your sense for these - should we just start charging them again after 4 months if the cards are still valid? Or is that too long? What would you consider the cutoff period for donations too old to try to restart?

We should also add an icinga alarm to alert us once a day if there are more than 5 rows in that status. @Jgreen
or @Dwisehaupt can you help with that? This is the query to get that count:

SELECT count(*)
FROM civicrm_contribution_recur
WHERE contribution_status_id = 14;

Event Timeline

Icinga isn't really designed to do an alarm in that fashion. It's more along the lines of alerting when something actively moves in and out of threshold.

I would definitely start capturing this as a metric in prometheus. Then we can add it to any current and future civicrm health dashboards. Additionally, we could add a once a day alert (most likely in the form of email) to report if it is out of threshold. Email may not be the best method, but it is easy to get started on that route until we find the best method.

I'll start a task for the metric collection.

@Ejegg sorry I missed the initial ping on this one - 782 is enough donors that I would think about asking the email team if we can send an explanation email along the lines of "Due to an error, your recurring donation paused for the last 4 months, but we have restarted it and apologize for the inconvenience."

Would there be a way to dedupe this list to avoid sending the email to any CIDs that have created a new recurring since February 26th?

@jkumalah just fyi this task may affect recurring donation metrics to some degreee

See if we can get them started again, hopefully tokenized for the new API.

@Ejegg how much certainty do we have that we can restart these? If there's technical risk maybe we could email the donors asking them to create new ones, as an alternative.

@MBeat33 Sure, it might be better to ask them to start new ones (and certainly would be easier for fr-tech!). I'll put on the fileserver a list of emails, names, & languages, with anyone who has a new donation removed.

Change 614898 had a related patch set uploaded (by Dwisehaupt; owner: Dwisehaupt):
[operations/puppet@production] Add icinga check for recurring contributions in a processing state

https://gerrit.wikimedia.org/r/614898

@Cstone in my own db on frdev I did this to pull details from these stuck contributions, then remove any email with a newer recurring donation.

create table T258013 as
select email, first_name, last_name, left(preferred_language, 2), next_sched_contribution_date
from civicrm.civicrm_contribution_recur cr
         inner join civicrm.civicrm_contact c on cr.contact_id = c.id
         inner join civicrm.civicrm_email em on em.contact_id = c.id and em.is_primary = 1
WHERE contribution_status_id = 14;

Query OK, 783 rows affected (0.04 sec)

delete T258013
from T258013
         inner join civicrm.civicrm_email em on em.email = T258013.email and em.is_primary = 1
         inner join civicrm.civicrm_contribution_recur cr on em.contact_id = cr.contact_id
where cr.start_date > '2020-02-26'
  and cr.contribution_status_id <> 14;

Query OK, 13 rows affected (0.02 sec)

Then I just manually deleted the 3 with very recent next_sched dates which I think are due to the deadlocks on old-to-new, leaving 767 rows. Does that look good to you?

If so we can go ahead and cancel the lot of them with

update civicrm_contribution_recur set contribution_status_id=3 where contribution_status_id=14 and next_sched_contribution_date < '2020-07-01';

@MBeat33 I uploaded the list to the fileserver under Tech/T258013.csv and Tech/T258013.xlsx in case it's tricky to get the CSV to show up right.

Icinga alarm added to check_fundraising_job where other recurring alerts are triggered. I have found a way to have it notify once per day.

The check is running and correctly pulling the contribution numbers. Alerting will be turned on after @Jgreen review.

Jul 20 23:07:11 frdb1002 nagios_nsca[28941]: frdb1002 check_recurring_contrib_processing 2 CRITICAL recurring_contrib_processing=783 [critical >=500]

@Ejegg Nice, looks good! Yeah there are 3 stuck from the deadlocks.

Thank you, @Ejegg we could probably divide the list among DS agents and just use the Send Failure Notification button, so if you could include the CID # in each line that would enable us to work the fastest

Even a list of just the CIDs would work, and could be put in a google spreadsheet rather than the server

Change 614898 merged by Jgreen:
[operations/puppet@production] Add icinga check for recurring contributions in a processing state

https://gerrit.wikimedia.org/r/614898

Icinga alarm added to check_fundraising_job where other recurring alerts are triggered. I have found a way to have it notify once per day.

The check is running and correctly pulling the contribution numbers. Alerting will be turned on after @Jgreen review.

Jul 20 23:07:11 frdb1002 nagios_nsca[28941]: frdb1002 check_recurring_contrib_processing 2 CRITICAL recurring_contrib_processing=783 [critical >=500]

It's merged as of this AM, and currently alerting: "CRITICAL recurring_contrib_processing=783 [critical =500]"

@Jgreen Icinga alarm added to check_fundraising_job where other recurring alerts are triggered. I have found a way to have it notify once per day.

Will this be an alert that allows us to know when a certain number of recurring donations fail, to avoid us finding ourselves in a similar situation again? Is this the type of solution you were looking for @MBeat33 ?

Thank you!

David and Dylan confirmed the alerts will take the form of Icinga failmails sent to the fr-tech list, which I'm on, so that works for me.

If there is a specific alert message I can keep an eye out for, please let me know, @Jgreen (will the failmails say CRITICAL recurring_contrib_processing?)

Also, it would be good to know at what point we should act on the alert emails. Am I correct that the threshold is 500 (or is that just what it says right now)? 500 might be on the high end in terms of a threshold. In terms of actual donors whose recurrings aren't processing, would it be possible to set that number lower so that we can detect any issues sooner?

also, @Ejegg do you know when you might be able to generate the list of CIDs?

@jrobell @MBeat33 The icinga alert is a second layer of alerting on top of any failmails. It is currently set to send email alerts if there are more than 5 donations in that particular state. The current high limit of 500 is when it will actually send a page to fr-tech-ops. Hopefully, with the failmails and email alerts, any issues would be addressed before we got to the point of paging. We can always adjust these values as needed.

Right now, the alert is acknowledged and isn't sending pages since we are aware of the issue and fr-tech are working to resolve it.

Many thanks for the clarification, @Dwisehaupt and of course now I see the 5 rows thing in the original ask ;)

@MBeat33 OK, those files have been updated with CIDs (and a direct URL in the last column). Looks like one of our 3-digit CID donors got hit by this!

Thanks, @Ejegg I d/l-ed both files but I don't see the URL column, I just get email firstname lastname language code and date - even in the .xls - am I missing something?

Oops, my file manager was waiting to confirm that I wanted to replace 'em. Sorry about that - they're actually updated now!

Got them now, thanks!

It looks like the existing recurring donations are still marked as Status: Processing. Does that matter, @Ejegg ? Should we cancel the existing ones before pressing Send Failure Notification?

Thank you for offering to batch cancel these, @Ejegg !

OK, they should all be marked as 'Cancelled' now, with cancel_date of 2020-07-22 and cancel_reason as 'Stuck for months in Processing status'

MBeat33 claimed this task.

Thank you for the followup. Adri is following up with these donors, so I'll change this to Resolved