
[epic] Make deduping reversible
Closed, ResolvedPublic

Description

We need to be able to undo dedupe merges before we start merging large batches of contacts.

This means:

  • Keep an audit log of merges.
  • Keep records of which entities were associated with the deleted contact.
  • Provide a GUI for undoing a merge.

Description of CiviCRM audit log functionality

CiviCRM keeps a separate log table for each table in the database (except cache tables). By default the log tables use the ARCHIVE storage engine, which is optimised for writes but has poor read performance. Each log table holds a copy of every insert, update or delete, plus a snapshot of each row's state at the moment logging is turned on. In addition to the fields of the parent table, the log tables record the contact id of the person who made the change, a timestamp, and a connection id; changes made across several tables in one operation share the same connection id.

The log tables are kept up to date by triggers.
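For illustration, the pattern looks roughly like this - a minimal sketch, not CiviCRM's actual generated DDL, showing only two of the mirrored columns (the log_* column names follow CiviCRM's convention; the session variables are set by CiviCRM when it connects):

```sql
-- Each log table mirrors its parent and adds bookkeeping columns.
CREATE TABLE log_civicrm_contact (
  id INT UNSIGNED,           -- mirrored from civicrm_contact...
  display_name VARCHAR(128), -- ...along with every other parent column
  log_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- when the change happened
  log_conn_id VARCHAR(24),   -- groups changes made over one connection
  log_user_id INT UNSIGNED,  -- contact id of whoever made the change
  log_action ENUM('Initialization','Insert','Update','Delete')
) ENGINE = ARCHIVE;

-- One AFTER trigger per action copies the new row state into the log.
CREATE TRIGGER civicrm_contact_after_update
AFTER UPDATE ON civicrm_contact FOR EACH ROW
  INSERT INTO log_civicrm_contact (id, display_name, log_conn_id, log_user_id, log_action)
  VALUES (NEW.id, NEW.display_name, @uniqueID, @civicrm_user_id, 'Update');
```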

CiviCRM provides a tab in the interface to view the change log for a given contact along with a button to reverse any changes.

Configuration options

CiviCRM supports the log tables being in a different database.
Although CiviCRM defaults to the ARCHIVE engine, I have in the past converted specific tables to InnoDB and added targeted indexes. In general this means tables like log_civicrm_contact, log_civicrm_address etc, but we probably would not alter tables like log_civicrm_option_value, which would only ever be interrogated directly via MySQL rather than through the UI. Indexes would go on the contact_id and connection_id fields, as sketched below.
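A rough sketch of that conversion (index names arbitrary; table and column names per CiviCRM's standard log schema):

```sql
-- Convert the hot log tables to InnoDB and index the lookup fields.
ALTER TABLE log_civicrm_contact
  ENGINE = InnoDB,
  ADD INDEX index_id (id),                   -- contact lookups (id is the contact id here)
  ADD INDEX index_log_conn_id (log_conn_id); -- per-merge lookups

ALTER TABLE log_civicrm_address
  ENGINE = InnoDB,
  ADD INDEX index_contact_id (contact_id),
  ADD INDEX index_log_conn_id (log_conn_id);
```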

Performance
Without converting some tables to InnoDB and adding indexes we won't get acceptable performance in the UI for viewing changes. Write speed is in theory faster for ARCHIVE than InnoDB, but I have not previously found a noticeable performance issue with writes. We should try to do some performance benchmarking.

Disk space
If we want reversibility then storing a large amount of extra data is unavoidable. Other sites I have dealt with don't cull this data, but it's something we might need to consider. Planning for that is a little tricky, since we probably need to keep the most recent copy of each record rather than simply deleting everything before a certain date (see the sketch below).

Our data will be smaller if we add fewer indexes, but if we want to cull data intermittently we need to think about indexes to facilitate the delete queries.
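To make the "keep the most recent copy" constraint concrete, a cull might look something like this (hypothetical sketch; note also that ARCHIVE tables don't support DELETE at all, so culling presupposes the InnoDB conversion):

```sql
-- Delete log rows older than a cutoff, but never a record's newest row,
-- so the latest state of every contact remains recoverable.
DELETE old
FROM log_civicrm_contact AS old
JOIN (
  SELECT id, MAX(log_date) AS latest
  FROM log_civicrm_contact
  GROUP BY id
) AS keep ON keep.id = old.id
WHERE old.log_date < '2015-01-01'  -- example cutoff
  AND old.log_date < keep.latest;
```

This is exactly the kind of delete query that needs index support (on id and log_date) to run in reasonable time.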

Risks

On other sites I have occasionally seen ARCHIVE table corruption. This usually happens on particularly high-use tables, like the group_contact table. I suspect the dedupe speed improvement we did may also fix this, and I think converting important tables to InnoDB will address it.

Note that Fuzion uses this audit logging on every site we deal with; only the scale differs. On larger sites we do experience slowness rendering the change log. The conversions I mentioned help, but I believe there is also a query in there that needs review.

Steps

  1. We need to test turning on logging on Staging to see how it performs.
    1. I need to provide a mysql file to Jeff to create the triggers & log tables
    2. Jeff to run the sql, determining how much extra disk is used
    3. general click-around by me
  2. Identify and address specific points of slowness in the click-around (I can probably do a first pass on another test DB at around 500k contacts)
  3. Review & address the acceptability of the reversal interface
    1. I'm pretty sure the interface for deletes is unacceptable.
  4. Look at benchmarking our import jobs and comparing with and without logging
  5. A minor task identified by Adam previously is to add an activity to a contact on delete, identifying the deleted contact as a merge-delete (sketched below)
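For step 5, the recorded data would be roughly as below (hypothetical sketch: a "merge delete" activity type would have to be created first, the contact ids are made up, and in practice this would go through the CiviCRM API rather than raw SQL):

```sql
-- Record a merge-delete activity against the surviving contact.
INSERT INTO civicrm_activity (activity_type_id, subject, activity_date_time)
VALUES (@merge_delete_type_id,                        -- hypothetical new activity type
        'Contact 12345 deleted by merge into 67890',  -- made-up contact ids
        NOW());

INSERT INTO civicrm_activity_contact (activity_id, contact_id, record_type_id)
VALUES (LAST_INSERT_ID(), 67890, 3);  -- 3 = activity target in a default install
```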

The highest risk of this derailing is that it is just too unwieldy - which should reveal itself on staging.

Related Objects

Status     Assigned
Resolved   Eileenmcnaughton
Resolved   None
Duplicate  None
Resolved   Eileenmcnaughton
Resolved   None
Resolved   None
Resolved   Eileenmcnaughton
Open       None
Resolved   awight
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Jgreen
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Jgreen
Invalid    Jgreen
Resolved   Eileenmcnaughton

Event Timeline

awight raised the priority of this task to High.
awight updated the task description. (Show Details)
awight added subscribers: awight, Aklapper, atgo.

@Eileenmcnaughton suggests turning on CiviCRM's logging function, which inserts into audit tables on every insert, update, and delete. This should capture all the data we need to reverse a merge, though it will take some clever logic to group all of the merge's entity changes together.

Actually the grouping is not so tough - it's intrinsic in the connection_id stored in the logging table. I'm just not clear how it looks in a merge situation, or how it would look in a batch merge. But if we get the mysql permission added to turn logging on we can test (or actually I could just test elsewhere, but I'd like to see it on staging).
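Roughly, once we have any one row from a merge, the connection id pulls up everything else it touched; a sketch (contact id made up):

```sql
-- Find the connection id from the deleted contact's Delete row...
SELECT log_conn_id INTO @merge_conn
FROM log_civicrm_contact
WHERE id = 12345 AND log_action = 'Delete'
ORDER BY log_date DESC
LIMIT 1;

-- ...then interrogate every logged table for the same merge.
SELECT * FROM log_civicrm_email   WHERE log_conn_id = @merge_conn;
SELECT * FROM log_civicrm_address WHERE log_conn_id = @merge_conn;
-- etc. for the other log tables
```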

atgo renamed this task from Make deduping reversible to [epic] Make deduping reversible.Jan 19 2016, 8:14 PM
atgo set Security to None.
atgo moved this task from Q4 FY21-22 to Epics in progress on the Fundraising-Backlog board.

Logging is now on & functional on staging. I tinkered around somewhat and was able to successfully merge and un-merge my mother.

I have been tracking most of the changes I've made to get to this point against https://issues.civicrm.org/jira/browse/CRM-18105 - some of which I have already started upstreaming & some of which are in progress.

WRT the overall approach vs simply recording the data in a custom way for reversing:

Advantages

  • Approach is consistent with core - not custom code
  • We can put some unit tests for it into core
  • We get additional audit tracking of everything that happens
  • Less risk of us missing something, because everything is actually stored
  • Less new code to write (although a fair few things to fix)

Disadvantages

  • More disk space use
  • Risk of archive table crashing
  • Need to consider an archiving or data destruction strategy at some point

Seemingly neutral

  • Performance - my performance tests have remained in the same ballpark after enabling logging. It might be slightly worse if we go down the merge-hook path, but that isn't certain, as it should only really affect merges

Disk space vs ARCHIVE tables
I have converted all the tables that affect retrieval performance (from the contact change-log / dedupe reversal point of view) to indexed InnoDB tables. The current disk space is around 20 GB; additional disk space for backups / replication will also be required. If the concern about ARCHIVE tables being potentially less stable outweighs the disk space concern, we should convert the rest. I expect that would add another 2-4 GB, since most of the large tables are already converted (and it's hard to make a case that 20 GB is OK but 25 GB isn't). This might be offset by shortening the log_conn_id field to 17 characters - it's currently 24, but based on the latest discussions with Tim we are thinking 17 is enough - https://issues.civicrm.org/jira/browse/CRM-18193
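That offset would just be a column change on each log table; a sketch, assuming the 17-character length from the CRM-18193 discussion holds:

```sql
-- Shrink log_conn_id from 24 to 17 characters on each log table.
ALTER TABLE log_civicrm_contact MODIFY log_conn_id VARCHAR(17);
ALTER TABLE log_civicrm_address MODIFY log_conn_id VARCHAR(17);
-- ...and so on for the remaining log tables.
```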

Ongoing space use
Since the logging tables are insert-only - rows are never deleted - they will keep growing. Currently we mostly insert rather than update, so I think the growth rate will be similar to our main DB's, although a big merge would create a chunk of growth at once. My feeling is this issue can be pushed out: if growth becomes a problem we can look at archiving / culling down the track, but others might have other thoughts. The logging tables take up less space than the main DB tables, as they use a more compressed format and have fewer indexes, but they never remove data; they only add a row for each transaction.
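If we do want to keep an eye on it, growth is easy to watch from information_schema (sketch; assumes the CiviCRM database is named 'civicrm', and sizes are approximate for InnoDB):

```sql
-- Total on-disk size of each log table, biggest first.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'civicrm'
  AND table_name LIKE 'log\_%'
ORDER BY size_mb DESC;
```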

Here are a couple of screenshots showing Kermit Trump's change log after I merged Kermit Frog into him, and the report of minor changes you get by clicking 'Insert' next to the Contact Merged activity.

[Screenshot 2016-03-11 15.24.55.png: Kermit Trump's change log after the merge]

[Screenshot 2016-03-11 15.25.21.png: the change log report of minor changes]

Note - the entity tag for 'Sarah's contacts' on Kermit Trump (civicrm/contact/view?reset=1&cid=14833281) went missing in the merge + un-merge, so I should re-test to try to replicate that - it's probably some argy-bargy around not having the unique-ID change on staging. However, this does highlight that I can actually go back to the log & do forensics here.

CiviCRM supports the log tables being in a different database.

That seems like a great idea... Also, our culling strategy could potentially be simply to move old audit data to colder storage, say after 1 year.

On other sites I have occasionally seen ARCHIVE table corruption.

I think this disqualifies the ARCHIVE engine from consideration. See also the Wikipedia article: "The engine is not ACID compliant." The corruption probably came from simultaneous writes from multiple threads.
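Until everything is converted, corruption can at least be detected and usually rebuilt - the MySQL manual lists CHECK TABLE and REPAIR TABLE as working for ARCHIVE tables:

```sql
-- Detect and rebuild a corrupted ARCHIVE log table (e.g. the high-use
-- group_contact log mentioned above).
CHECK TABLE log_civicrm_group_contact;
REPAIR TABLE log_civicrm_group_contact;
```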

Can you confirm that the performance hit was also minimal when using indexed InnoDB tables as the audit storage backend? I couldn't quite parse whether you had done that experiment yet.

add an activity to a contact on delete identifying the deleted contact as a merge-delete

And maybe a re-dupe activity for good measure, so we can find the reverted merges easily.

length of the log_conn_id field

Could be a bigger problem. The MySQL manual says that CONNECTION_ID only gives us a number that is "unique among the set of currently connected clients". How about we throw the PID in there as well?
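Anything that adds a monotonic component would do; for example, a timestamp prefix instead of the PID would also survive server restarts (illustrative only - the actual format is being settled in CRM-18193):

```sql
-- CONNECTION_ID() values are recycled as clients disconnect, so prefix
-- with the time: a 10-digit timestamp + '_' + an up-to-6-digit
-- connection id happens to fit the proposed 17 characters.
SET @uniqueID = CONCAT(UNIX_TIMESTAMP(), '_', CONNECTION_ID());
```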

I tinkered around somewhat and was able to successfully merge and un-merge my mother.

That's just... a sci-fi no-no. I would stick to experimenting on frogs :p

The logging tables take up less space than the [....]

There was something missing here.

  1. Culling - cold storage might make sense too. In general I feel we don't have to solve this yet.
  2. Yep - I'm still getting around 300 writes per second with most tables converted to InnoDB. I'll convert the rest & retry.
  3. Yeah - that makes sense too - perhaps within core it should only be enabled when logging is enabled, to reduce clutter (although not disabled when logging is disabled).
  4. The unique connection ID ticket is here: https://issues.civicrm.org/jira/browse/CRM-18193 - let's keep discussion of that issue over there.
  5. My mum is doing OK :-)

@Jgreen how feasible is the sort of extra disk use mentioned above? https://phabricator.wikimedia.org/T111704#2110819

Short term our most constrained machine is the master db, which is at 71% capacity on its 1.2TB data partition. We could also reduce utilization, for example by finally slaying faulknerdb (354GB) or by reducing expire_logs_days somewhat from 10 days (~100GB). So we can handle 20-30GB of live DB, plus the corresponding backups.

We will probably replace the db servers in FY2017, at which point we will increase capacity and performance.

Also have you talked to Jaime about storage engines? He would be a great resource.