
Epic: Dedupe V2: resolve top conflicts
Closed, Resolved (Public)

Description

Here is some analysis on names and addresses. We will start with section 1:
https://docs.google.com/document/d/1EuDVzvWip-UOQUN7V_voHrtvEnafocyDvb1lH4LQ3h0/edit#

Related Objects

Status     Assigned
Resolved   None
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Duplicate  Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Open       None
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton

Event Timeline

  1. Conflicts on language ('en_US' vs. 'en', 'en_HK' vs. 'en', 'en_NO' vs. 'no_NO', 'en_CA' vs. 'en'). Fix requires discussion (T146344)
  2. (DONE) Conflicts on privacy fields (do_not_trade, do_not_email). Fix is technical only - we have agreed to prioritise a 'yes' value for these fields (T143856)
  3. Conflicts on no_bulk_email. Fix is technical only; it differs from #2 in that a different table and a different part of the code are involved
  4. (DONE) Conflicts on capitalisation of names. Fix is technical - Major Gifts, DS & CC all seemed to agree with prioritising based on the number of capital letters, in the belief that more capitals show more deliberation. Adam is not sold on this; he has a higher tolerance for lack of caps & a lower tolerance for us making value judgements on data (T145032)
  5. Post code suffix being treated as a conflict rather than merged in - technical only
  6. Name data quality - e.g. full name in first name field. Needs discussion
  7. Name data common variations - Tim vs Timothy. Needs some digging / discussion. Caitlin C suggested she might be able to source a list of common variations. Adam suggested the Levenshtein algorithm
  8. Addresses - common variations - we would probably get a bunch just by stripping the '.'s out when comparing - e.g. that would catch 'Ave.' vs 'Ave' but not 'Avenue' vs 'Ave'. Once again, the Levenshtein algorithm or sourcing a common list are options
  9. Addresses - exposing history. This is not so much a conflict as an alternative to resolving the conflicts. Currently we are throwing a conflict for major gifts but selecting the most recent address for other donors. In discussion with Major Gifts it seems that if we expose address history (stored in our logging tables) they might be quite happy with just choosing the most recent, as we do for non-major gifts. This feature was also wanted by Caitlin & Michael. Note this would make #8 redundant, but we might still want to look at #5 to ensure we are keeping the post code suffix where possible. Major Gifts may also feel less comfortable with allowing less complete but more recent addresses to take priority - e.g. country only - although the address history may be sufficient here (T142549)
  10. Not a conflict but a related feature request - find a way to make it clear which contacts have been rejected for merge based on conflicts - perhaps create a scheduled activity against them with the details, and possibly have some way of visually flagging contacts with a non-completed activity of that type. We would have to figure out how these activities get cleaned up once the conflict is resolved.
  11. (DONE) We might need to open a conversation about the anonymous ones & the nobody@wikimedia.org pseudo-email (T143062)
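
For discussion, here is a rough Python sketch of the mechanical rules described in #2, #4, #5 and #8, plus the Levenshtein distance Adam suggested for #7 and #8. It is not the dedupe code itself: every function name is hypothetical, the privacy flags are assumed to be booleans, and the post code rule assumes a suffix is separated by '-' or a space. Each resolver returns None when it cannot settle the conflict, so the pair would still go to a human.

```python
def resolve_privacy_flag(a: bool, b: bool) -> bool:
    """#2: for do_not_trade / do_not_email, a 'yes' on either contact wins."""
    return a or b


def resolve_name_case(a: str, b: str):
    """#4: if two names differ only by case, keep the one with more capitals.

    Returns None when the names differ by more than case (a real conflict).
    """
    if a.casefold() != b.casefold():
        return None

    def capitals(s: str) -> int:
        return sum(ch.isupper() for ch in s)

    return a if capitals(a) >= capitals(b) else b


def resolve_postal_code(a: str, b: str):
    """#5: if one code is the other plus a suffix, keep the longer one.

    Assumption: the suffix is introduced by '-' or a space.
    """
    a, b = a.strip(), b.strip()
    if a == b:
        return a
    short, long_ = sorted((a, b), key=len)
    if long_.startswith(short) and long_[len(short):len(short) + 1] in ("-", " "):
        return long_
    return None


def normalise_street(s: str) -> str:
    """#8: strip '.', collapse whitespace and ignore case before comparing."""
    return " ".join(s.replace(".", "").split()).casefold()


def levenshtein(a: str, b: str) -> int:
    """#7 / #8: edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

With these definitions, normalise_street('12 Main Ave.') equals normalise_street('12 Main Ave'), while 'Avenue' and 'Ave' still differ, as #8 predicts. The distance between 'Tim' and 'Timothy' is 4, which is more than the length of 'Tim' itself, so a distance threshold on its own may not be enough for nickname pairs. That may be an argument for the sourced list of common variations.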

Other related issues on my radar

  1. The broken 'batch merge all' button - probably a recent error
  2. Fixing up the UI so it's possible to retrieve some (with a limit) & then grab the next batch
  3. E-notices showing up in the Jenkins output
Eileenmcnaughton closed subtask Restricted Task as Resolved. Sep 13 2016, 1:50 AM
Eileenmcnaughton closed subtask Restricted Task as Resolved. Jan 20 2022, 1:56 AM