Page MenuHomePhabricator

Mailing data double counting in CiviCRM?
Closed, ResolvedPublic4 Story Points

Description

CPS, our analytics consultants, found what appears to be a problem with the mailing data in CiviCRM. I asked CPS to pull data showing how interacting with our emails impacts future donations. They were pulling a huge number of rows using the query below and parsing the raw data to try and answer my questions:

SELECT
        mp.email,
        mp.event_type,
        mp.mailing_identifier,
        mp.recipient_action_datetime
    FROM civicrm.civicrm_mailing_provider_data mp
    WHERE recipient_action_datetime >= '2017-07-01 00:00:00' AND recipient_action_datetime <= '2018-01-01 00:00:00';

As I understood it, the civicrm_mailing_provider_data table would only retain up to one engagement event per mailing_identifier per email address, i.e. if a donor RECEIVED and OPENED an email, the table would only retain the OPENED event since it happened more recently. However, CPS' analysis showed not only did the table have multiple events per email/mailing_id, but it suggested that the same user received the same email more than once. 89% of mailings show they were SENT twice, once at the time of the send and once about 7 hours later. This document helps detail how single mailings receive multiple engagements

.

I don't know how to troubleshoot this very well myself without breaking Civi/Excel by trying to deal with huge amounts of data, but what CPS found seems compelling. Are we double counting data somehow? It's important to us to be able to use this data to segment our lists along with the obvious analytical uses, so it would help to figure out what CPS is seeing.

Let me know how I can help and if you want CPS to come in on this task.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 24 2018, 12:10 AM
CCogdill_WMF triaged this task as Normal priority.Jul 24 2018, 12:10 AM

So I don't think the spec is to only keep the most recent action - this was implemented to keep both received & opened actions in the DB (with them being filtered at the point of usage) - obviously we can discuss a change of spec at any time!

However, I agree about duplicate activities. The unique index for the table includes the timestamp and the fact there are consistent entries with a 7 hour gap suggests a change in timestamp handling at some point - resulting in entries being seen as different entries. So I guess what we need to do is

  1. see if timestamp is correctly recorded now
  2. prune incorrect entries if so (otherwise fix & start from 1 again :-)

Ah cool, good to know my understanding re: most recent action was wrong. That's good for us at the end of the day!

Also good to know CPS was reading the data right re: double counting. Let me know if there's anything I or CPS can do to help at this point.

Eileenmcnaughton added a comment.EditedJul 24 2018, 11:51 PM

@CCogdill_WMF I found one where it seems like more than timestamp - am hoping you can help me interpret where I would see the equivalent in the UI

SELECT contact_identifier, mailing_identifier, event_type, recipient_action_datetime, contact_id FROM civicrm_mailing_provider_data WHERE email = 'example@gmail.com' AND mailing_identifier = 'sp55879559';
+--------------------+--------------------+------------+---------------------------+------------+

contact_identifiermailing_identifierevent_typerecipient_action_datetimecontact_id

+--------------------+--------------------+------------+---------------------------+------------+

227681638341sp55879559Sent2018-03-05 23:19:3321652642
227681638341sp55879559Open2018-03-06 02:14:4521652642
227681638341sp55879559Sent2018-03-29 13:41:5521652642

Hmm - I tracked this contact through & found 'View Contact Insights' civicrm/contact/view?reset=1&cid=21237654 - and it seems the same email really was sent to him twice?

Hmm - but for this contact contact/contactInsight.action?recipientId=283516586776&listId=9644238 I can't see any 'insights'

Eileenmcnaughton moved this task from Backlog to Doing on the Fundraising Sprint Owls board.

Change 447975 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update Omnimail silverpop to manage set date to GMT

https://gerrit.wikimedia.org/r/447975

Change 449944 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update Omnimail silverpop pluging

https://gerrit.wikimedia.org/r/449944

Change 447975 abandoned by Eileen:
Update Omnimail silverpop to manage set date to GMT

https://gerrit.wikimedia.org/r/447975

Change 450905 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Update Omnimail test to relfect package update

https://gerrit.wikimedia.org/r/450905

Change 449944 abandoned by Eileen:
Update Omnimail silverpop plugin and update the patch to reflect change

https://gerrit.wikimedia.org/r/449944

Change 450905 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Update Omnimail test to relfect package update

https://gerrit.wikimedia.org/r/450905

Eileenmcnaughton added a comment.EditedSep 12 2018, 10:31 PM

I did some checks to see how far the cleanup script has gotten. It has cleaned up to the date of 2017-08-07 - here are the stats on how many records still need cleaning

Sep 20171338913
Oct 2017?* (4359110 as at 20 Sep)
Nov 20176251819
Dec 20170
20180
  • update - I am pretty sure I munged the October number as it seems too low & too similar to the Sep number missing a digit so I removed it

sql to generate above (for Sep)

SELECT count(*) FROM civicrm_mailing_provider_data mp INNER JOIN civicrm_mailing_provider_data mp2 ON mp.contact_identifier = mp2.contact_identifier AND mp.mailing_identifier = mp2.mailing_identifier AND mp.event_type = mp2.event_type AND mp.recipient_action_datetime < mp2.recipient_action_datetime WHERE mp2.contact_identifier IS NOT NULL AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) = 420 AND mp.event_type = 'Sent' AND mp.recipient_action_datetime BETWEEN '2017-09-01' AND '2017-09-30';

We are processing around 500k rows per day so are about 2 weeks away from data cleanup completion @CCogdill_WMF

Just did a check on progress

drush vget | grep omni
civicrm_repair_omnirecipient_last_processed_timestamp: "2017-10-01 00:00:00"

SO Oct & Nov 2017 to go

Eileenmcnaughton set the point value for this task to 4.Sep 18 2018, 5:40 AM

my updated figures suggest we are still 20 days away - looks like I stuff up Oct before @Ejegg what do you think - should we speed up the schedule since that seems like it will conflict with traffic. No evidence it has been upsetting the server so far

Ejegg added a comment.Sep 20 2018, 2:44 AM

Yeah, let's run this full tilt till it's done (with pauses for 1 hr tests)

@Ejegg I just pushed a commit to double the speed from 3000 to 6000 every 9 minutes - wanna check it & we'll deploy. If that doesnt' start tripping over itself we could increase more in a day or so

Note it has that threshhold handling so it shoudl self-pause for one hour tests like some of the other jobs

@CCogdill_WMF the job to remove duplicate entries has completed & all the rows where there are 2 entries 7 hours apart should be fixed. I'm now digging into whether there are other patterns that indicate issues

OK - job not done - what's happened is we have cross a daylight savings barrier & the difference is now 480 rather than 420 so I need to tweak to handle 480 & re-kick off

Ejegg closed this task as Resolved.Oct 2 2018, 7:37 PM
Eileenmcnaughton added a comment.EditedFeb 12 2019, 6:11 AM

I am just following up to check if all the relevant data was deleted. I can see that

  1. the job completed on the 8 hour (480 mins) gap (I ran it & output was
 drush civicrm-repair-omnirecipients --batch=6000 --threshold=550 --threshold_period=4

Processed omnirecipient fixes to 2019-02-16 00:00:00
  1. from commentary above it completed on the 7 hour (420 mins) gap
  1. I've established there are none of these EARLIER than 2017-04-17
SELECT
mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,
mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,
   TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff
  
     FROM civicrm_mailing_provider_data mp

   INNER JOIN civicrm_mailing_provider_data mp2
     ON mp.contact_identifier = mp2.contact_identifier
        AND mp.mailing_identifier = mp2.mailing_identifier
        AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime
  
       WHERE mp2.`contact_identifier` IS NOT NULL
       AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) < 1440 
       AND mp.event_type = 'Sent'
     AND mp.recipient_action_datetime < '2017-04-17'
   LIMIT 20;

Empty set (3 min 11.58 sec)

  1. I've found there are none after 2017-09-21
SELECT
       mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,
       mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,
          TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff
      
            FROM civicrm_mailing_provider_data mp
    
          INNER JOIN civicrm_mailing_provider_data mp2
            ON mp.contact_identifier = mp2.contact_identifier
               AND mp.mailing_identifier = mp2.mailing_identifier
               AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime
      
              WHERE mp2.`contact_identifier` IS NOT NULL
             AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) < 1440 
              AND mp.event_type = 'Sent'
            AND mp.recipient_action_datetime > '2017-09-21'
         LIMIT 20;

Empty set (6 min 18.63 sec)

  1. However there ARE some within that date range - I've spotted a 300 minute (5 hour) gap in April and a 840 minute (12 hour) gap in Sep.

Here is a sample - I spot checked a few

SELECT
     mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,mp.contact_id,
     mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,
        TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff
       
         FROM civicrm_mailing_provider_data mp
     
        INNER JOIN civicrm_mailing_provider_data mp2
          ON mp.contact_identifier = mp2.contact_identifier
             AND mp.mailing_identifier = mp2.mailing_identifier
            AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime
       
            WHERE mp2.`contact_identifier` IS NOT NULL
            AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) < 1440 
            AND mp.event_type = 'Sent'
          AND mp.recipient_action_datetime > '2017-04-16'
       LIMIT 20;
contact_identifierrecipient_action_datetimeevent_typecontact_idci2date_time2event_type2time_diff
1000831399532017-09-18 08:03:09Sent66992801000831399532017-09-18 13:03:09Sent300
1000831399562017-09-18 08:02:25Sent110670761000831399562017-09-18 13:02:25Sent300
1000831399602017-09-18 08:03:09Sent159196051000831399602017-09-18 13:03:09Sent300
1000831399642017-09-18 08:03:14Sent159516691000831399642017-09-18 13:03:14Sent300
1000831399652017-09-18 08:05:27Sent158815261000831399652017-09-18 13:05:27Sent300
1000831399682017-09-18 08:03:14Sent66993051000831399682017-09-18 13:03:14Sent300
1000831399692017-09-18 08:03:09Sent66993071000831399692017-09-18 13:03:09Sent300
1000831399742017-09-18 01:02:27Sent66993141000831399742017-09-18 15:02:27Sent840
1000831399752017-09-18 08:03:08Sent66993151000831399752017-09-18 13:03:08Sent300
1000831399762017-09-18 08:02:27Sent159369281000831399762017-09-18 13:02:27Sent300
1000831399812017-09-18 08:03:53Sent66993251000831399812017-09-18 13:03:53Sent300
1000831399832017-09-18 08:03:08Sent66993281000831399832017-09-18 13:03:08Sent300
1000831399852017-09-18 08:05:28Sent66993301000831399852017-09-18 13:05:28Sent300
1000831399862017-09-18 08:02:27Sent66993321000831399862017-09-18 13:02:27Sent300
1000831399902017-09-18 08:03:08Sent66993361000831399902017-09-18 13:03:08Sent300
1000831399922017-09-18 08:02:27Sent158708721000831399922017-09-18 13:02:27Sent300
1000831399952017-09-18 08:03:08Sent66993511000831399952017-09-18 13:03:08Sent300
1000831399962017-09-18 08:04:43Sent66993531000831399962017-09-18 13:04:43Sent300
1000831399972017-09-18 08:04:43Sent66993541000831399972017-09-18 13:04:43Sent300
1000831399982017-09-18 08:03:14Sent66993551000831399982017-09-18 13:03:14Sent300

+--------------------+---------------------------+------------+------------+--------------+---------------------+-------------+-----------+
20 rows in set (0.01 sec)

For now I propose to run the job again with a 5 hour gap & then do some checks - & run again as appropriate

In preparation I reset the drupal param

drush vset civicrm_repair_omnirecipient_last_processed_timestamp "2017-02-16 00:00:00"

Thanks for continuing to check on this, Eileenmcnaughton! Do you have an
estimate on when this follow up job will finish? And should we expect the
data to need to be cleaned up on a consistent basis?

Maybe relatedly - I noticed we're pulling this data from IBM every 10
minutes. Is that frequency contributing at all to the double counting? I
don't see a reason we need to update Civi that frequently. I'd recommend
scaling down to 1-2 times per day, or even once an hour, just for the sake
of less stress on the system.

@CCogdill_WMF so

  1. the underlying reason for the double counting was poor timezone handling which we fixed around Nov 2017 I believe.
  2. The job was running last year but got turned off during Big English - we cleaned up most of the Sent ones then & the rest since but I realised this week we'd priorised Sent ones & not done the others yet
  3. the frequent job is just the little one grabbing high level info on mailings so they show up in Civi quickly - (mailing text, subject, stats) - it shouldn't be causing much load

This is a pretty good query to use to check status

SELECT      mp.contact_identifier, mp.recipient_action_datetime, mp.event_type,mp.contact_id,      mp2.contact_identifier as ci2, mp2.recipient_action_datetime as date_time2, mp2.event_type as event_type2,         TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) as time_diff                  FROM civicrm_mailing_provider_data mp               INNER JOIN civicrm_mailing_provider_data mp2           ON mp.contact_identifier = mp2.contact_identifier              AND mp.mailing_identifier = mp2.mailing_identifier             AND mp.event_type = mp2.event_type  AND mp.recipient_action_datetime < mp2.recipient_action_datetime                     WHERE mp2.`contact_identifier` IS NOT NULL             AND TIMESTAMPDIFF(MINUTE,mp.recipient_action_datetime , mp2.recipient_action_datetime) IN             (60,120,180,240,300, 360, 420,480,540, 600, 660, 720, 780, 840, 900, 960, 1020, 1080, 1140, 1200, 1260, 1320, 1480)            
 AND mp.event_type = mp2.event_type       
  
### the following 2 lines control scope to speed it up 
  AND mp.recipient_action_datetime > '2017-12-17'    
    LIMIT 20;
Eileenmcnaughton added a comment.EditedMar 7 2019, 12:57 AM

Note the above basically looks for gaps that line up with hours - some will be legit

Also I note lots of 'Open' actions with regular gaps later on - that might be client polling rather than our data

Ah okay, thanks for explaining re those frequent jobs.

I don't totally understand what the above query is looking for. I'm asking
because Jasmeet, our analyst, wants to do some work with this email data
but I'm not sure when I can tell her the data is usable. Do you think the
rate of these duplicate records is low enough for us to work with it as/is?

@CCogdill_WMF so on SENT actions we should be pretty much OK. For other event types there are probably some date ranges that are still pretty mucky - are you going to be focussing on particular event types / date ranges - I can do some more specific checks

Ah great, thanks! OPEN would be the other event type I want to focus on in
the near-term.

@CCogdill_WMF to be honest I'm not sure the duplicates MATTER for affecting the Open data. The reason is that we get a lot of 'real' duplicate opens - ie. the email clients for whatever reason send multiple Open requests back.( I think it depends on the email client ) So I kinda think your data querying will need to eliminate duplicates (real & otherwise) on Open somehow if you are checking the ones that got opened

Fair point, I'll let Jasmeet know. Thank you!

Change 503542 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Adjust omnirecipient repair to reflect a daylight savings change.

https://gerrit.wikimedia.org/r/503542

Change 503542 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Adjust omnirecipient repair to reflect a daylight savings change.

https://gerrit.wikimedia.org/r/503542