
Record duplicates when processing Engage import
Closed, ResolvedPublic

Description

Logging background for discussion on how to move forward with the increasing number of duplicates that pop up while processing the Engage Individual import specifically - though this could potentially affect other imports as we get larger gift files at end of year. My thought is that we want to be careful about how long the dedupe takes on our end before we can get the gifts into Civi.

After talking at fortnightly @Eileenmcnaughton has said she will look into a more automatic merge capability. If that is not possible we may have to move forward with letting the gifts through instead of deduping on the front end. We want to explore what Civi can do before we go that route.

Screenshot of what the error looks like:

Screenshot 2024-09-04 at 4.58.34 PM.png (400×962 px, 50 KB)

Event Timeline

Just looking at one of the rows - there are in fact a dozen duplicates for the contact in question, not just the 2 identified. The contact's address (altered here) in the Engage import is

123 W Main St

But there were unmatched variations like

	123 W Main St Apt 1 
	123 W Main St Apt #1
	123 W Main St Unit 1
  • we need to be sure that our address standardisation is consistent with what we get from Engage (I think in this case the one coming in IS what we want to standardise on, although the others would all have come via them in one way or another in the past)
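To make the standardisation question concrete, here is a minimal sketch of how the variants above could be reduced to a comparable form. The function name `normalise_street` and the short list of unit words are assumptions for illustration only - this is not what Civi or Engage actually do.

```python
import re

# Hypothetical, deliberately tiny list of secondary-unit designators.
UNIT_WORDS = r"(?:apt|apartment|unit|ste|suite)"


def normalise_street(street):
    """Split a street line into (base address, unit), lower-cased."""
    text = re.sub(r"\s+", " ", street.strip().lower())
    # Trailing unit such as "apt 1", "apt #1", "unit 1" or a bare "#1".
    m = re.search(rf"\s(?:{UNIT_WORDS}\b\s*#?\s*|#\s*)(\w+)$", text)
    if m:
        return text[:m.start()], m.group(1)
    return text, ""


for variant in ("123 W Main St Apt 1 ", "123 W Main St Apt #1",
                "123 W Main St Unit 1", "123 W Main St"):
    print(normalise_street(variant))
```

With this sketch the three unit variants compare equal to each other, but "123 W Main St" (no unit) still differs from them - whether a missing unit should count as a match is a policy decision, not something the code can settle.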

Another note - the rule in question is

email match
OR first + last + street address
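As a rough illustration of that rule (not the actual Civi dedupe rule implementation), the sketch below treats two contacts as a match when the emails are equal, or when first name, last name and street address are all equal. The `Contact` class and its field names are hypothetical, and empty values never count as a match.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    # Hypothetical minimal contact record for illustration.
    first_name: str = ""
    last_name: str = ""
    email: str = ""
    street_address: str = ""


def _norm(value):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def _same(a, b, field):
    left, right = _norm(getattr(a, field)), _norm(getattr(b, field))
    return bool(left) and left == right


def is_rule_match(a, b):
    """Email match OR first + last + street address match."""
    name_and_address = all(
        _same(a, b, f) for f in ("first_name", "last_name", "street_address")
    )
    return _same(a, b, "email") or name_and_address
```

Note that the street address comparison here is exact after trimming, so "123 W Main St" and "123 W Main St Apt 1" would not match on the second branch of the rule.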

Ok - so there are some specific historical data issues I'm seeing - noting them down as I look at the rows

  1. Example 1 has donated online and via Engage. The contact's name is clearly (anonymised to) Van Damme but in the Engage import it is 'Damme' - it seems they have given us both in the past

action - checking in with @MDemosWMF on whether Engage is being consistent with these - note the contact merge link is https://civicrm.wikimedia.org/civicrm/contact/merge?reset=1&cid=63217599&oid=5522839

Status - merge pending - left for Melanie/ Ellen to eyeball

  1. Example 2 has 2 found matches but 12 in the DB, with different address variants. Of the 2 found matches 1 has the same email & one has the same address - but the one with the correct email has *nearly* the same address - ie "123 W Main St" vs "123 W Main #1"

These have both come from Engage in the past - the most recent for the 'incorrect' may be May 9th, 2023 and the correct in 2022

action - what does Engage's address standardisation look like?

status - I merged them all & resolved the row

Example 3

Import Row
First Name : 'Bob & Mary'
Last Name: "Smith"
email: "bob@example.org"
Street address : 123 Main Street

Existing contacts

1:
First Name : 'Bob'
Last Name: "Smith"
email: "bob@example.org"
Street address : 123 Main Street

2:

First Name : 'Bob & Mary'
Last Name: "Smith"
email: ""
Street address : 123 Main Street

What is the right outcome here? The 'Bob & Mary' has no email - I suspect these would be merged by DR as

First Name : 'Bob'
Last Name: "Smith"
email: "bob@example.org"
Street address : 123 Main Street
Partner: Mary

But notably there is no 'normal' way to find these - ie the 2 contacts do not have an email match (1 has no email) and they do not have a first+last+address match (the first names differ, although the address is shared), so it is the 3rd contact record - ie the one coming from Engage - that allows us to see they are a match.
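One way to picture why the import row matters: if matches are treated as edges in a graph, the two existing contacts only end up in the same group once the import row is added as a third node. The sketch below is an illustration under that assumption (exact, case-insensitive comparison and hypothetical field names), not how the import code works.

```python
from itertools import combinations


def rule_match(a, b):
    """Email match OR first + last + street address (exact, case-insensitive)."""
    def same(key):
        return bool(a[key]) and a[key].lower() == b[key].lower()
    return same("email") or all(same(k) for k in ("first", "last", "street"))


def cluster(records):
    """Group record indexes into connected components of the match graph."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if rule_match(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


existing_1 = {"first": "Bob", "last": "Smith",
              "email": "bob@example.org", "street": "123 Main Street"}
existing_2 = {"first": "Bob & Mary", "last": "Smith",
              "email": "", "street": "123 Main Street"}
import_row = {"first": "Bob & Mary", "last": "Smith",
              "email": "bob@example.org", "street": "123 Main Street"}

print(cluster([existing_1, existing_2]))              # two separate groups
print(cluster([existing_1, existing_2, import_row]))  # one group of three
```

The import row matches contact 1 on email and contact 2 on name + address, which is what links the two existing records.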

Example 4 is a contact with prior Engage giving

The contact has 2 matches based on Name + address.

Both have prior Engage donations, they have different addresses - one is on hold. This duplication would be picked up by Sandra doing a dedupe pass for name & address over the Smart group I created - Engage. They are definitely mergeable.

So I guess the question is: could we merge them while doing the import?

I think it would be reasonable to attempt a safe-mode merge on potential duplicates found during import where there is a name + address match + any additional safety checks we want to try - that might be a question for @SHust - ie

"What additional checks should we do when doing a safe-mode merge on contacts with a name + address match?" Some thoughts are

  • that they both have prior engage donations (engage gateway only)
  • that the address has 'enough' data (City? Postal Code?)
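A minimal sketch of what those two checks might look like as a gate in front of a safe-mode merge attempt. Everything here is hypothetical: the dictionary keys (`gateways`, `street_address`, `city`, `postal_code`) and the choice of which address fields count as 'enough' are assumptions to be replaced by whatever DR decides.

```python
# Assumption: these are the fields that make an address complete 'enough'.
REQUIRED_ADDRESS_FIELDS = ("street_address", "city", "postal_code")


def has_engage_gift(contact):
    """True if the contact has at least one prior donation via the engage gateway."""
    return "engage" in contact.get("gateways", ())


def address_is_complete(contact):
    """True if every required address field is present and non-empty."""
    return all(contact.get(field) for field in REQUIRED_ADDRESS_FIELDS)


def passes_extra_checks(a, b):
    """Both contacts must pass every check before a safe-mode merge is attempted."""
    return all(has_engage_gift(c) and address_is_complete(c) for c in (a, b))
```

If either contact fails a check, the pair would simply be left for manual review rather than merged.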

Example 5 is a contact with prior online giving & no prior engage giving

There are 2 matches - which I will merge

  1. Name + Address + different email
  2. Address + email match. First name is 'B M' rather than 'Bob Mathew'

this is a duplicate that we would really struggle to pick up any other way. I guess what could happen is

  1. a new contact is created
  2. DR manually reviews & merges the contacts with the matching emails, on the basis that an email match is enough (OR perhaps our rule can do that automatically) BUT
  3. DR manually reviews & merges based on an address match

it's a bit yuck. Thinking.

The last one was a match with 2 contacts where both had

  • the same name
  • the same address
  • previous engage donations
  • no email

One had only 1 donation, dating back to 2020

In this case they were mergeable with a safe merge & I did merge them

OK so my thinking is

  1. let's see if @SHust has any capacity to do some deduping of the Group I created 'Engage'
  2. let's consider doing a safe-merge attempt in the import code if there are multiple matches. This MIGHT be on contacts with an email match or first+last+address so let's just discuss if we want any additional safeguards before attempting it (@SHust will know what they consider)
  3. I noticed a flaw with the automated script - it automatically dedupes the 'latest' contacts - but it does this by tracking the contact ID - which means that we don't get dedupe running on 'just modified' contacts
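To illustrate point 3 with made-up data: a cursor based on the highest contact ID seen only ever picks up newly created contacts, while a cursor based on modified date also picks up older contacts that an import has just touched. The field names and dates below are invented for the example and are not taken from the script.

```python
from datetime import datetime

contacts = [
    {"id": 101, "modified": datetime(2024, 9, 1)},
    {"id": 102, "modified": datetime(2024, 9, 5)},  # old ID, edited by a recent import
    {"id": 250, "modified": datetime(2024, 9, 4)},  # newly created
]


def batch_by_id(contacts, last_seen_id):
    """Select contacts created since the last run (ID-based cursor)."""
    return [c["id"] for c in contacts if c["id"] > last_seen_id]


def batch_by_modified(contacts, last_run):
    """Select contacts created OR modified since the last run."""
    return [c["id"] for c in contacts if c["modified"] > last_run]


print(batch_by_id(contacts, last_seen_id=200))                    # [250]
print(batch_by_modified(contacts, last_run=datetime(2024, 9, 3)))  # [102, 250]
```

Contact 102 is the case the ID-based approach misses.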

@Eileenmcnaughton Thanks for the thorough investigation here! With Engage I believe the historical rule for their address data entry has been to match how it is written on the check - this was decided before my time, but I'm guessing we thought this would stay pretty consistent (which it looks like it either hasn't or it's due to human error). Looks like some of these dupes stem from variations in data entry on Engage's side.

We've recognized this need for standardization and started discussing at the offsite. I think the work @NNichols is currently doing to figure out standardization of addresses in Civi will be implemented with Engage and also with our internal data entry. Once we have those rules set and we implement across the board to match with USPS best formatting practices, hopefully these instances will decrease over time. I think it is still worth cleaning up what we currently have in your Engage group if possible!

@Eileenmcnaughton and @MDemosWMF I reviewed the Engage group, and we can certainly assist by deduping records using the first+last name and address rule, as this appears to be the primary cause of many duplicates.

@MDemosWMF All duplicate records with mailing address conflicts from your 'Engage' group have been successfully deduped by Poliane. I hope this helps your team!

@SHust Thank you that's wonderful! @Eileenmcnaughton what do you suggest for next steps?

@MDemosWMF I'm very curious to see what happens with the next import - ie are there notably fewer duplicates?

@Eileenmcnaughton I ran this by Erica today and she brought up the idea of possibly letting the gifts pass through the import, but having the duplicates funnel into some kind of 'Engage dedupe group' on the backend that could be worked on by DR. Do you think that is doable?
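A rough sketch of that pass-through idea, to help the discussion. All names are hypothetical, and the choice to attach the gift to the lowest-ID match when several contacts match is purely an assumption for the example - the real behaviour would need to be agreed.

```python
def import_row(row, find_matches, create_contact, review_group):
    """Let the gift through; queue ambiguous matches for later dedupe by DR.

    find_matches(row) returns a list of matching contact IDs,
    create_contact(row) returns a new contact ID, and review_group is a
    set standing in for an 'Engage dedupe group'.
    """
    matches = sorted(find_matches(row))
    if not matches:
        return create_contact(row)
    if len(matches) > 1:
        # Instead of failing the row, remember the candidates for review.
        review_group.update(matches)
    # Assumption: attach the gift to the lowest contact ID.
    return matches[0]


review_group = set()
contact_id = import_row(
    {"first": "Bob", "last": "Smith"},
    find_matches=lambda row: [63, 12],
    create_contact=lambda row: 999,
    review_group=review_group,
)
print(contact_id, sorted(review_group))
```

The point of the design is that the import never blocks on a duplicate: the gift is recorded straight away and the clean-up happens afterwards on the backend.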

@Elbar53 Can you also let us know if you see fewer duplicate errors in the import moving forward, so we can see what kind of effect this might have?

@Eileenmcnaughton In the most recent file we still got back 14 errors so I think we will have to try something else. Let me know what you think of the idea above!

@Eileenmcnaughton @dkozlowski as we are heading into peak season, any thoughts on this? As things stand, we don't want the up front dedupe for peak season given the resources available for data entry.


Change #1076895 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Minor tidy up in config function

https://gerrit.wikimedia.org/r/1076895

Change #1077120 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] WIP fix import handling of duplicates

https://gerrit.wikimedia.org/r/1077120

Change #1077121 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Fix our import hook for when contact ID is present

https://gerrit.wikimedia.org/r/1077121

Change #1077122 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Create a group to add imported duplicates to

https://gerrit.wikimedia.org/r/1077122

Change #1077124 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1077124

Change #1077121 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Fix our import hook for when contact ID is present

https://gerrit.wikimedia.org/r/1077121

Change #1077122 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Create a group to add imported duplicates to

https://gerrit.wikimedia.org/r/1077122

Change #1077124 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1077124

Change #1076895 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Minor tidy up in config function

https://gerrit.wikimedia.org/r/1076895

Change #1078489 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1078489

AKanji-WMF set Final Story Points to 4.

Change #1078489 merged by Ejegg:

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1078489

Change #1080842 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1080842

Change #1080842 abandoned by Eileen:

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1080842

Change #1082313 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1082313

Change #1082313 merged by Ejegg:

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1082313

Change #1117273 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1117273

Change #1117273 merged by Eileen:

[wikimedia/fundraising/crm@master] Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1117273

Change #1126717 had a related patch set uploaded (by Eileen; author: Eileen):

[wikimedia/fundraising/crm@master] Reapply our hack Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1126717

Change #1126717 merged by Eileen:

[wikimedia/fundraising/crm@master] Reapply our hack Temporary fix for handling duplicate contacts on import

https://gerrit.wikimedia.org/r/1126717