Page MenuHomePhabricator

Mailing data double counting in CiviCRM?
Closed, ResolvedPublic4 Story Points

Description

CPS, our analytics consultants, found what appears to be a problem with the mailing data in CiviCRM. I asked CPS to pull data showing how interacting with our emails impacts future donations. They were pulling a huge number of rows using the query below and parsing the raw data to try and answer my questions:

SELECT
        mp.email,
        mp.event_type,
        mp.mailing_identifier,
        mp.recipient_action_datetime
    FROM civicrm.civicrm_mailing_provider_data mp
    WHERE recipient_action_datetime >= '2017-07-01 00:00:00' AND recipient_action_datetime <= '2018-01-01 00:00:00';

As I understood it, the civicrm_mailing_provider_data table would only retain up to one engagement event per mailing_identifier per email address, i.e. if a donor RECEIVED and OPENED an email, the table would only retain the OPENED event since it happened more recently. However, CPS' analysis showed not only did the table have multiple events per email/mailing_id, but it suggested that the same user received the same email more than once. 89% of mailings show they were SENT twice, once at the time of the send and once about 7 hours later. This document helps detail how single mailings receive multiple engagements

.

I don't know how to troubleshoot this very well myself without breaking Civi/Excel by trying to deal with huge amounts of data, but what CPS found seems compelling. Are we double counting data somehow? It's important to us to be able to use this data to segment our lists along with the obvious analytical uses, so it would help to figure out what CPS is seeing.

Let me know how I can help and if you want CPS to come in on this task.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 24 2018, 12:10 AM
CCogdill_WMF triaged this task as Normal priority.Jul 24 2018, 12:10 AM

So I don't think the spec is to only keep the most recent action - this was implemented to keep both received & opened actions in the DB (with them being filtered at the point of usage) - obviously we can discuss a change of spec at any time!

However, I agree about duplicate activities. The unique index for the table includes the timestamp and the fact there are consistent entries with a 7 hour gap suggests a change in timestamp handling at some point - resulting in entries being seen as different entries. So I guess what we need to do is

  1. see if timestamp is correctly recorded now
  2. prune incorrect entries if so (otherwise fix & start from 1 again :-)

Ah cool, good to know my understanding re: most recent action was wrong. That's good for us at the end of the day!

Also good to know CPS was reading the data right re: double counting. Let me know if there's anything I or CPS can do to help at this point.

@CCogdill_WMF I found one where it seems like more than timestamp - am hoping you can help me interpret where I would see the equivalent in the UI

SELECT contact_identifier, mailing_identifier, event_type, recipient_action_datetime, contact_id FROM civicrm_mailing_provider_data WHERE email = 'morena.chessa@gmail.com' AND mailing_identifier = 'sp55879559';
+--------------------+--------------------+------------+---------------------------+------------+

contact_identifiermailing_identifierevent_typerecipient_action_datetimecontact_id

+--------------------+--------------------+------------+---------------------------+------------+

227681638341sp55879559Sent2018-03-05 23:19:3321652642
227681638341sp55879559Open2018-03-06 02:14:4521652642
227681638341sp55879559Sent2018-03-29 13:41:5521652642

Hmm - I tracked this contact through & found 'View Contact Insights' civicrm/contact/view?reset=1&cid=21237654 - and it seems the same email really was sent to him twice?

Hmm - but for this contact contact/contactInsight.action?recipientId=283516586776&listId=9644238 I can't see any 'insights'

Eileenmcnaughton moved this task from Backlog to Doing on the Fundraising Sprint Owls board.

Change 447975 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update Omnimail silverpop to manage set date to GMT

https://gerrit.wikimedia.org/r/447975

Change 449944 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update Omnimail silverpop pluging

https://gerrit.wikimedia.org/r/449944

Change 447975 abandoned by Eileen:
Update Omnimail silverpop to manage set date to GMT

https://gerrit.wikimedia.org/r/447975

Change 450905 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update Omnimail test to relfect package update

https://gerrit.wikimedia.org/r/450905

Change 449944 abandoned by Eileen:
Update Omnimail silverpop plugin and update the patch to reflect change

https://gerrit.wikimedia.org/r/449944

Change 450905 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Update Omnimail test to relfect package update

https://gerrit.wikimedia.org/r/450905

Eileenmcnaughton added a comment.EditedSep 12 2018, 10:31 PM

I did some checks to see how far the cleanup script has gotten. It has cleaned up to the date of 2017-08-07 - here are the stats on how many records still need cleaning

Sep 20171338913
Oct 2017?* (4359110 as at 20 Sep)
Nov 20176251819
Dec 20170
20180
  • update - I am pretty sure I munged the October number as it seems too low & too similar to the Sep number missing a digit so I removed it

sql to generate above (for Sep)

SELECT count(*) FROM civicrm_mailing_provider_data mp INNER JOIN civicrm_mailing_provider_data mp2 ON mp.contact_identifier = mp2.contact_identifier AND mp.mailing_identifier = mp2.mailing_identifier AND mp.event_type = mp2.event_type AND mp.recipient_action_datetime < mp2.recipient_action_datetime WHERE mp2.contact_identifier IS NOT NULL AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) = 420 AND mp.event_type = 'Sent' AND mp.recipient_action_datetime BETWEEN '2017-09-01' AND '2017-09-30';

We are processing around 500k rows per day so are about 2 weeks away from data cleanup completion @CCogdill_WMF

Just did a check on progress

drush vget | grep omni
civicrm_repair_omnirecipient_last_processed_timestamp: "2017-10-01 00:00:00"

SO Oct & Nov 2017 to go

Eileenmcnaughton set the point value for this task to 4.Sep 18 2018, 5:40 AM

my updated figures suggest we are still 20 days away - looks like I stuff up Oct before @Ejegg what do you think - should we speed up the schedule since that seems like it will conflict with traffic. No evidence it has been upsetting the server so far

Ejegg added a comment.Sep 20 2018, 2:44 AM

Yeah, let's run this full tilt till it's done (with pauses for 1 hr tests)

@Ejegg I just pushed a commit to double the speed from 3000 to 6000 every 9 minutes - wanna check it & we'll deploy. If that doesnt' start tripping over itself we could increase more in a day or so

Note it has that threshhold handling so it shoudl self-pause for one hour tests like some of the other jobs

@CCogdill_WMF the job to remove duplicate entries has completed & all the rows where there are 2 entries 7 hours apart should be fixed. I'm now digging into whether there are other patterns that indicate issues

OK - job not done - what's happened is we have cross a daylight savings barrier & the difference is now 480 rather than 420 so I need to tweak to handle 480 & re-kick off

Ejegg closed this task as Resolved.Oct 2 2018, 7:37 PM
Eileenmcnaughton added a comment.EditedTue, Feb 12, 6:11 AM

I am just following up to check if all the relevant data was deleted. I can see that

  1. the job completed on the 8 hour (480 mins) gap (I ran it & output was
 drush civicrm-repair-omnirecipients --batch=6000 --threshold=550 --threshold_period=4

Processed omnirecipient fixes to 2019-02-16 00:00:00
  1. from commentary above it completed on the 7 hour (420 mins) gap
  1. I've established there are none of these EARLIER than 2017-04-17
SELECT
mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,
mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,
   TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff
  
     FROM civicrm_mailing_provider_data mp

   INNER JOIN civicrm_mailing_provider_data mp2
     ON mp.contact_identifier = mp2.contact_identifier
        AND mp.mailing_identifier = mp2.mailing_identifier
        AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime
  
       WHERE mp2.`contact_identifier` IS NOT NULL
       AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) < 1440 
       AND mp.event_type = 'Sent'
     AND mp.recipient_action_datetime < '2017-04-17'
   LIMIT 20;

Empty set (3 min 11.58 sec)

  1. I've found there are none after 2017-09-21
SELECT
   ->    mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,
   ->    mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,
   ->       TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff
   ->   
   ->         FROM civicrm_mailing_provider_data mp
   -> 
   ->       INNER JOIN civicrm_mailing_provider_data mp2
   ->         ON mp.contact_identifier = mp2.contact_identifier
   ->            AND mp.mailing_identifier = mp2.mailing_identifier
   ->            AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime
   ->   
   ->           WHERE mp2.`contact_identifier` IS NOT NULL
   ->           AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) < 1440 
   ->           AND mp.event_type = 'Sent'
   ->         AND mp.recipient_action_datetime > '2017-09-21'
   ->       LIMIT 20;

Empty set (6 min 18.63 sec)

  1. However there ARE some within that date range - I've spotted a 300 minute (5 hour) gap in April and a 840 minute (12 hour) gap in Sep.

Here is a sample - I spot checked a few

SELECT
    -> mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,mp.contact_id,
    -> mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,
    ->    TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff
    ->   
    ->      FROM civicrm_mailing_provider_data mp
    -> 
    ->    INNER JOIN civicrm_mailing_provider_data mp2
    ->      ON mp.contact_identifier = mp2.contact_identifier
    ->         AND mp.mailing_identifier = mp2.mailing_identifier
    ->         AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime
    ->   
    ->        WHERE mp2.`contact_identifier` IS NOT NULL
    ->        AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) < 1440 
    ->        AND mp.event_type = 'Sent'
    ->      AND mp.recipient_action_datetime > '2017-04-16'
    ->    LIMIT 20;
contact_identifierrecipient_action_datetimeevent_typecontact_idci2date_time2event_type2time_diff
1000831399532017-09-18 08:03:09Sent66992801000831399532017-09-18 13:03:09Sent300
1000831399562017-09-18 08:02:25Sent110670761000831399562017-09-18 13:02:25Sent300
1000831399602017-09-18 08:03:09Sent159196051000831399602017-09-18 13:03:09Sent300
1000831399642017-09-18 08:03:14Sent159516691000831399642017-09-18 13:03:14Sent300
1000831399652017-09-18 08:05:27Sent158815261000831399652017-09-18 13:05:27Sent300
1000831399682017-09-18 08:03:14Sent66993051000831399682017-09-18 13:03:14Sent300
1000831399692017-09-18 08:03:09Sent66993071000831399692017-09-18 13:03:09Sent300
1000831399742017-09-18 01:02:27Sent66993141000831399742017-09-18 15:02:27Sent840
1000831399752017-09-18 08:03:08Sent66993151000831399752017-09-18 13:03:08Sent300
1000831399762017-09-18 08:02:27Sent159369281000831399762017-09-18 13:02:27Sent300
1000831399812017-09-18 08:03:53Sent66993251000831399812017-09-18 13:03:53Sent300
1000831399832017-09-18 08:03:08Sent66993281000831399832017-09-18 13:03:08Sent300
1000831399852017-09-18 08:05:28Sent66993301000831399852017-09-18 13:05:28Sent300
1000831399862017-09-18 08:02:27Sent66993321000831399862017-09-18 13:02:27Sent300
1000831399902017-09-18 08:03:08Sent66993361000831399902017-09-18 13:03:08Sent300
1000831399922017-09-18 08:02:27Sent158708721000831399922017-09-18 13:02:27Sent300
1000831399952017-09-18 08:03:08Sent66993511000831399952017-09-18 13:03:08Sent300
1000831399962017-09-18 08:04:43Sent66993531000831399962017-09-18 13:04:43Sent300
1000831399972017-09-18 08:04:43Sent66993541000831399972017-09-18 13:04:43Sent300
1000831399982017-09-18 08:03:14Sent66993551000831399982017-09-18 13:03:14Sent300

+--------------------+---------------------------+------------+------------+--------------+---------------------+-------------+-----------+
20 rows in set (0.01 sec)

For now I propose to run the job again with a 5 hour gap & then do some checks - & run again as appropriate

In preparation I reset the drupal param

drush vset civicrm_repair_omnirecipient_last_processed_timestamp "2017-02-16 00:00:00"