Page MenuHomePhabricator

Civi unresponsive for Engage in evenings
Closed, ResolvedPublic

Description

Hi - Engage informed me that Civi has been unresponsive in the evenings their time and they aren't able to save contributions. Please see their note below and let me know if you need any additional info.

"We have run into an issue a few times this week where our operators are working in Civi and around 5:30 PM EST (it happened last night [12/8] at 5:45PM) the system becomes unresponsive when attempting to save a transactions (the circle just spins and spins). Our operators then close their browser and are able to sign in again, however they receive the same issue and are unable to save."

Event Timeline

@LeanneS is checking if this was just one day or if this occurs at the same time every day.

Update from Engage that it also happened on Monday afternoon:

"This happened on Monday (12/5) around 4:20 PM.

We worked in the system until 6:00 PM on 12/6 and 9:40 PM on 12/7 without any issues."

I took a look at the scheduled jobs and could see nothing that would cause a regular slow query at this time. I also trawled through the slow queries over the last 24 hours. I found one that can be improved easily and caused 12 instances of slowness to Megan or Michael over the period (CRM-19752). I also found that the query to determine whether to run dedupes is taking longer than it needs to as there is no index on civicrm_contact.created_date. I can see how to improve that.

The latter item will not have caused any slowness for Engage, the former probably will not have but it's possible.

I'm not sure where to find slow query logs older than 24 hours - @Jgreen ?

https://docs.google.com/spreadsheets/d/1PQBSB_cx6umWoCRTXJrjVE-BaPaTK3MNHF5dk7DOjFQ/edit#gid=0

I'll take another look at the log tomorrow to see what has gone on since now

grep 'Query_time: [2-9][0-9]' fundraisingdb-mysql-slow
grep 'Query_time: [1-9][0-9][0-9]' fundraisingdb-mysql-slow

I'm not sure where to find slow query logs older than 24 hours - @Jgreen ?

All the ./remote/* logs rotate into the archive, at indium:/a/archive/indium/logs

I'm pretty certain the issue on 8 Dec was caused by the silverpop export being run at an odd time - see the highlighted row below. These rows all refer to the timing of a mysql query in the silverpop export 'INSERT INTO silverpop_export_highest...' which happens about 1.5 hours in

query timequery_durationquery_start_time_estestimated silverpop start in estestimated silverpop end in est
2016-11-23 09:13:20798.6742016-11-23 04:13:202016-11-23 02:43:202016-11-23 07:43:20
2016-11-24 09:14:24812.4212016-11-24 04:14:242016-11-24 02:44:242016-11-24 07:44:24
2016-11-26 09:16:45809.7472016-11-26 04:16:452016-11-26 02:46:452016-11-26 07:46:45
2016-11-29 09:13:22803.0352016-11-29 04:13:222016-11-29 02:43:222016-11-29 07:43:22
2016-11-30 09:18:56847.0972016-11-30 04:18:562016-11-30 02:48:562016-11-30 07:48:56
2016-12-01 09:16:46828.3242016-12-01 04:16:462016-12-01 02:46:462016-12-01 07:46:46
2016-12-02 09:17:10853.2412016-12-02 04:17:102016-12-02 02:47:102016-12-02 07:47:10
2016-12-03 09:20:00879.4422016-12-03 04:20:002016-12-03 02:50:002016-12-03 07:50:00
2016-12-05 09:20:58864.7532016-12-05 04:20:582016-12-05 02:50:582016-12-05 07:50:58
2016-12-06 07:24:24854.9772016-12-06 02:24:242016-12-06 00:54:242016-12-06 05:54:24
2016-12-07 07:28:54911.5692016-12-07 02:28:542016-12-07 00:58:542016-12-07 05:58:54
2016-12-07 11:42:15918.3812016-12-07 06:42:152016-12-07 05:12:152016-12-07 10:12:15
2016-12-08 19:23:22930.3912016-12-08 14:23:222016-12-08 12:53:222016-12-08 17:53:22
2016-12-09 07:26:28915.4142016-12-09 02:26:282016-12-09 00:56:282016-12-09 05:56:28
2016-12-10 07:32:41931.1872016-12-10 02:32:412016-12-10 01:02:412016-12-10 06:02:41
2016-12-11 07:32:20917.3882016-12-11 02:32:202016-12-11 01:02:202016-12-11 06:02:20
2016-12-12 07:31:07918.0972016-12-12 02:31:072016-12-12 01:01:072016-12-12 06:01:07
DStrine moved this task from Sprint +1 to Triage on the Fundraising-Backlog board.

Please see the latest update from Engage today:

I am writing regarding a few errors we are encountering when processing in Civi. The movie and image attached are happening at random times during processing. The movie was recorded on Tuesday night (12/13) around 7:00 PM EST, it shows the screen in a constant loop unable to update transactions, in this situation we are not able enter any information. If we close the browser, reopen it, log back into Civi and attempt to update a record we received the same result.

The included image shows an error we are receiving when entering Benevity batches into Civi without a batch number. The error happened on and off yesterday, this picture was taken at 5:40 PM EST. The problem we are having is when the screen loads to enter a transaction in random cases there is not an area to enter a batch number, our operators are then forced to refresh the page until the batch number area appears (this can be anywhere from one refresh to twenty). We are then able to enter without a problem for a random amount of time before it appears again. If we try to enter the information when the batch number area is not visible we receive the attached error regarding batch number and reaching the server.

One additional item I would like to point out is that both computers used to access Civi experience these problems; however, they may not be at the same time. Meaning one operator may be enter without an issues while the other is experiencing the batch number/server issue. When the updating issue occurs it has effected both computers at the same time (though do to the time of day that it has happened only one operator is entering).

Could somebody text me the next time this happens? My cell is on the
collab wiki fundraising contacts page. Looking at database activity while
it's happening is often easier than trying to make sense of the past event
from logs, so I will try to take a look in realtime.

@Jgreen Definitely. I'll request that they notify me as soon as they experience any issues and then text you.

@LeanneS
FYI: In backlog grooming, we reviewed this ticket. We are going to leave it in the Analysis column for now. However, it happens again, we will pull it into the current sprint.

DStrine lowered the priority of this task from High to Low.Jan 26 2017, 12:34 AM

Just checking back on this since there hasn't been an update, is it still a problem?

@Jgreen Thanks for checking. I haven't heard of any issues recently.

Jgreen claimed this task.

@Jgreen Thanks for checking. I haven't heard of any issues recently.

Ok, I'll close it. Please reopen if it comes up again.