
Revisit deduper interface with a view to making it possible to do a search in search kit to dedupe
Open, Needs Triage | Public | 8 Estimated Story Points

Description

When the deduper was written, APIv4 was not mature, so the criteria it uses are built on APIv3. That was a step forward at the time, but now that APIv4 is mature and search kit is in play, it would be better if search kit criteria could be used. I have some thoughts about how we could do this. I think it requires some work in core to make the code handle APIv4 criteria, and then some thought about how to adapt the Angular interface in response.

Event Timeline

@Eileenmcnaughton, noting that in conversation with @SHust this was raised as a DR priority.

Hi @SHust, just want to flag that this is on our radar. However, it is relatively large in scope, so we need to determine when we can address it.

Does addressing this issue help the runaway dedupe query lockups we're seeing during Sprint Y?

@AKanji-WMF not really. The issue is that the name & address rule was not really returning any matches anymore, so it was being used with a very broad net to find the stragglers. I think @Wfan ran a query & found something like 150 potential duplicates in 100k contacts.

Originally this rule was used with Major Gift contacts because it is more important to find duplicates in those contacts. However, over time the DR team have gone through most of the database with this rule, so it's very much diminishing returns. A while back I did some queries and put the likely matches into a group in CiviCRM, which Sandra deduped, so really the task of deduping by name & address in historical records can be seen as 'done'.

For perspective, there was no deduping done at all for nearly the first 10 years of using CiviCRM, and when we did start deduping, various code & manual efforts were kicked off to go through the existing contacts. We are probably at the point now where we need to assess our goals & where the data integrity pain points lie. For example, there are quite a few records in the database where the first name and last name are both in the first name field, but these might also be older records that perhaps we should not be investing time into.

FYI, in case this info helps, based on the issue found in https://phabricator.wikimedia.org/T353971:

  1. When the query under 'name & address rule' crashed Civi, we still had about 4.8k donors with different emails to dedupe.
  2. Three queries pulled on Friday under the 'Individual (Supervised) Name and Address' + 50000 first matches + group + Eileen Merge-Picks contain donors with different email addresses as well as donors with the same emails (one set of CID examples: 70511 and 61477112). That wasn't the case before, and thankfully these queries are working, i.e. not breaking Civi.
  3. I think the combination of the 'CID >' clause and the high number of first matches to find has something to do with the odd behavior. I would love to test it with different parameters when someone is available to stop the queries if needed.
greg added a subscriber: Heatherjo550.