
[epic] Make deduping reversible
Closed, ResolvedPublic

Description

We need to be able to undo dedupe merges before we start merging large batches of contacts.

This means:

  • Keep an audit log of merges.
  • Keep records of which entities were associated with the deleted contact.
  • Provide a GUI for undoing a merge.

Description of CiviCRM audit log functionality

CiviCRM keeps a separate log table for each table in the database (except cache tables). By default the log tables use the ARCHIVE storage engine, which is optimised for writes but has poor read performance. Each log table holds a copy of every insert, update or delete, plus a snapshot of each row's state at the moment logging is turned on. In addition to the fields of the parent table, the log tables record the contact id of the person who made the change, a timestamp, and a connection id; changes made across several tables in one operation share the same connection id.

The log tables are kept up to date by triggers.
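For illustration, the pattern looks roughly like this - a minimal sketch, not CiviCRM's actual generated DDL, showing only two of the mirrored columns (the log_* column names follow CiviCRM's convention; the session variables are set by CiviCRM when it connects):

```sql
-- Each log table mirrors its parent and adds bookkeeping columns.
CREATE TABLE log_civicrm_contact (
  id INT UNSIGNED,           -- mirrored from civicrm_contact...
  display_name VARCHAR(128), -- ...along with every other parent column
  log_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- when the change happened
  log_conn_id VARCHAR(24),   -- groups changes made over one connection
  log_user_id INT UNSIGNED,  -- contact id of whoever made the change
  log_action ENUM('Initialization','Insert','Update','Delete')
) ENGINE = ARCHIVE;

-- One AFTER trigger per action copies the new row state into the log.
CREATE TRIGGER civicrm_contact_after_update
AFTER UPDATE ON civicrm_contact FOR EACH ROW
  INSERT INTO log_civicrm_contact (id, display_name, log_conn_id, log_user_id, log_action)
  VALUES (NEW.id, NEW.display_name, @uniqueID, @civicrm_user_id, 'Update');
```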

CiviCRM provides a tab in the interface to view the change log for a given contact along with a button to reverse any changes.

Configuration options

CiviCRM supports the log tables being in a different database.
Although CiviCRM defaults to the ARCHIVE engine, I have in the past converted specific tables to InnoDB and added targeted indexes. In general this means tables like log_civicrm_contact, log_civicrm_address etc, but we probably would not alter tables like log_civicrm_option_value, which would only ever be interrogated directly via MySQL rather than through the UI. Indexes would go on the contact_id and connection_id fields, as sketched below.
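A rough sketch of that conversion (index names arbitrary; table and column names per CiviCRM's standard log schema):

```sql
-- Convert the hot log tables to InnoDB and index the lookup fields.
ALTER TABLE log_civicrm_contact
  ENGINE = InnoDB,
  ADD INDEX index_id (id),                   -- contact lookups (id is the contact id here)
  ADD INDEX index_log_conn_id (log_conn_id); -- per-merge lookups

ALTER TABLE log_civicrm_address
  ENGINE = InnoDB,
  ADD INDEX index_contact_id (contact_id),
  ADD INDEX index_log_conn_id (log_conn_id);
```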

Performance
Without converting some tables to InnoDB and adding indexes we won't get acceptable performance in the UI for viewing changes. Write speed is in theory faster for ARCHIVE than InnoDB, but I have not previously found a noticeable performance issue with writes. We should try to do some performance benchmarking.

Disk space
If we want reversibility then storing a large amount of extra data is unavoidable. Other sites I have dealt with don't cull this data, but it's something we might need to consider. Planning for that is a little tricky, since we probably need to keep the most recent copy of each record rather than simply deleting everything before a certain date (see the sketch below).

Our data will be smaller if we add fewer indexes, but if we want to cull data intermittently we need to think about indexes to facilitate the delete queries.
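To make the "keep the most recent copy" constraint concrete, a cull might look something like this (hypothetical sketch; note also that ARCHIVE tables don't support DELETE at all, so culling presupposes the InnoDB conversion):

```sql
-- Delete log rows older than a cutoff, but never a record's newest row,
-- so the latest state of every contact remains recoverable.
DELETE old
FROM log_civicrm_contact AS old
JOIN (
  SELECT id, MAX(log_date) AS latest
  FROM log_civicrm_contact
  GROUP BY id
) AS keep ON keep.id = old.id
WHERE old.log_date < '2015-01-01'  -- example cutoff
  AND old.log_date < keep.latest;
```

This is exactly the kind of delete query that needs index support (on id and log_date) to run in reasonable time.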

Risks

On other sites I have occasionally seen ARCHIVE table corruption. This usually happens on particularly high-use tables, like the group_contact table. I suspect the dedupe speed improvement we did may also fix this, and I think converting important tables to InnoDB will address it.

Note that Fuzion uses this audit logging on every site we deal with; only the scale differs. On larger sites we do experience slowness rendering the change log. The conversions I mentioned help, but I believe there is also a query in there that needs review.

Steps

  1. We need to test turning on logging on Staging to see how it performs.
    1. I need to provide a mysql file to Jeff to create the triggers & log tables
    2. Jeff to run the sql, determining how much extra disk is used
    3. general click-around by me
  2. Identify and address specific points of slowness in the click-around (I can probably do a first pass on another test DB at around 500k contacts)
  3. Review & address the acceptability of the reversal interface
    1. I'm pretty sure the interface for deletes is unacceptable.
  4. Look at benchmarking our import jobs and comparing with and without logging
  5. A minor task identified by Adam previously is to add an activity to a contact on delete, identifying the deleted contact as a merge-delete (sketched below)
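For step 5, the recorded data would be roughly as below (hypothetical sketch: a "merge delete" activity type would have to be created first, the contact ids are made up, and in practice this would go through the CiviCRM API rather than raw SQL):

```sql
-- Record a merge-delete activity against the surviving contact.
INSERT INTO civicrm_activity (activity_type_id, subject, activity_date_time)
VALUES (@merge_delete_type_id,                        -- hypothetical new activity type
        'Contact 12345 deleted by merge into 67890',  -- made-up contact ids
        NOW());

INSERT INTO civicrm_activity_contact (activity_id, contact_id, record_type_id)
VALUES (LAST_INSERT_ID(), 67890, 3);  -- 3 = activity target in a default install
```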

The highest risk of this derailing is that it is just too unwieldy - which should reveal itself on staging.

Related Objects

Status     Assigned
Resolved   Eileenmcnaughton
Resolved   None
Duplicate  None
Resolved   Eileenmcnaughton
Resolved   None
Resolved   None
Resolved   Eileenmcnaughton
Open       None
Resolved   awight
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Jgreen
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Eileenmcnaughton
Resolved   Jgreen
Invalid    Jgreen
Resolved   Eileenmcnaughton

Event Timeline

awight raised the priority of this task to High.
awight updated the task description. (Show Details)
awight added subscribers: awight, Aklapper, atgo.

@Eileenmcnaughton suggests turning on CiviCRM's logging function, which inserts into audit tables on every insert, update, and delete. This should capture all the data we need to reverse a merge, though it will take some clever logic to group all of the merge's entity changes together.

Actually the grouping is not so tough - it's intrinsic in the connection_id stored in the logging table. I'm just not clear how it looks in a merge situation, or how it would look in a batch merge. But if we get the mysql permission added to turn logging on we can test (or actually I could just test elsewhere, but I'd like to see it on staging).
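Roughly, once we have any one row from a merge, the connection id pulls up everything else it touched; a sketch (contact id made up):

```sql
-- Find the connection id from the deleted contact's Delete row...
SELECT log_conn_id INTO @merge_conn
FROM log_civicrm_contact
WHERE id = 12345 AND log_action = 'Delete'
ORDER BY log_date DESC
LIMIT 1;

-- ...then interrogate every logged table for the same merge.
SELECT * FROM log_civicrm_email   WHERE log_conn_id = @merge_conn;
SELECT * FROM log_civicrm_address WHERE log_conn_id = @merge_conn;
-- etc. for the other log tables
```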

atgo renamed this task from Make deduping reversible to [epic] Make deduping reversible.Jan 19 2016, 8:14 PM
atgo set Security to None.
atgo moved this task from Q4 FY21-22 to Epics in progress on the Fundraising-Backlog board.

Logging is now on & functional on staging. I tinkered around somewhat and was able to successfully merge and un-merge my mother.

I have been tracking most of the changes I've made to get to this point against https://issues.civicrm.org/jira/browse/CRM-18105 - some of which I have already started upstreaming & some of which are in progress.

WRT the overall approach vs simply recording the data in a custom way for reversing:

Advantages

  • Approach is consistent with core - not custom code
  • We can put some unit tests for it into core
  • We get additional audit tracking of everything that happens
  • Less risk of us missing something, because everything is actually stored
  • Less new code to write (although a fair few things to fix)

Disadvantages

  • More disk space use
  • Risk of archive table crashing
  • Need to consider an archiving or data destruction strategy at some point

Seemingly neutral

  • Performance - my performance tests have remained in the same ballpark after enabling logging. It might be slightly worse if we go down the merge-hook path, but that isn't certain, as it should only really affect merges

Disk space vs ARCHIVE tables
I have converted all the tables that affect retrieval performance (from the contact change-log / dedupe reversal point of view) to indexed InnoDB tables. The current disk space is around 20 GB; additional disk space for backups / replication will also be required. If the concern about ARCHIVE tables being potentially less stable outweighs the disk space concern, we should convert the rest. I expect that would add another 2-4 GB, since most of the large tables are already converted (and it's hard to make a case that 20 GB is OK but 25 GB isn't). This might be offset by shortening the log_conn_id field to 17 characters - it's currently 24, but based on the latest discussions with Tim we are thinking 17 is enough - https://issues.civicrm.org/jira/browse/CRM-18193
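That offset would just be a column change on each log table; a sketch, assuming the 17-character length from the CRM-18193 discussion holds:

```sql
-- Shrink log_conn_id from 24 to 17 characters on each log table.
ALTER TABLE log_civicrm_contact MODIFY log_conn_id VARCHAR(17);
ALTER TABLE log_civicrm_address MODIFY log_conn_id VARCHAR(17);
-- ...and so on for the remaining log tables.
```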

Ongoing space use
Since the logging tables are insert-only - rows are never deleted - they will keep growing. Currently we mostly insert rather than update, so I think the growth rate will be similar to our main DB's, although a big merge would create a chunk of growth at once. My feeling is this issue can be pushed out: if growth becomes a problem we can look at archiving / culling down the track, but others might have other thoughts. The logging tables take up less space than the main DB tables, as they use a more compressed format and have fewer indexes, but they never remove data; they only add a row for each transaction.
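If we do want to keep an eye on it, growth is easy to watch from information_schema (sketch; assumes the CiviCRM database is named 'civicrm', and sizes are approximate for InnoDB):

```sql
-- Total on-disk size of each log table, biggest first.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'civicrm'
  AND table_name LIKE 'log\_%'
ORDER BY size_mb DESC;
```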

Here are a couple of screenshots showing Kermit Trump's change log after I merged Kermit Frog into him, and the report of minor changes you get by clicking 'Insert' next to the Contact Merged activity.

[Screenshot 2016-03-11 15.24.55.png: Kermit Trump's change log after the merge]

[Screenshot 2016-03-11 15.25.21.png: the change log report of minor changes]

Note - the entity tag for 'Sarah's contacts' on Kermit Trump (civicrm/contact/view?reset=1&cid=14833281) went missing in the merge + un-merge, so I should re-test to try to replicate that - it's probably some argy-bargy around not having the unique-ID change on staging. However, this does highlight that I can actually go back to the log & do forensics here.

CiviCRM supports the log tables being in a different database.

That seems like a great idea... Also, our culling strategy could potentially be simply to move old audit data to colder storage, say after 1 year.

On other sites I have occasionally seen ARCHIVE table corruption.

I think this disqualifies the ARCHIVE engine from consideration. See also the Wikipedia article: "The engine is not ACID compliant." The corruption probably came from simultaneous writes from multiple threads.
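Until everything is converted, corruption can at least be detected and usually rebuilt - the MySQL manual lists CHECK TABLE and REPAIR TABLE as working for ARCHIVE tables:

```sql
-- Detect and rebuild a corrupted ARCHIVE log table (e.g. the high-use
-- group_contact log mentioned above).
CHECK TABLE log_civicrm_group_contact;
REPAIR TABLE log_civicrm_group_contact;
```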

Can you confirm that the performance hit was also minimal when using indexed InnoDB tables as the audit storage backend? I couldn't quite parse whether you had done that experiment yet.

add an activity to a contact on delete identifying the deleted contact as a merge-delete

And maybe a re-dupe activity for good measure, so we can find the reverted merges easily.

length of the log_conn_id field

Could be a bigger problem. The MySQL manual says that CONNECTION_ID only gives us a number that is "unique among the set of currently connected clients". How about we throw the PID in there as well?
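Anything that adds a monotonic component would do; for example, a timestamp prefix instead of the PID would also survive server restarts (illustrative only - the actual format is being settled in CRM-18193):

```sql
-- CONNECTION_ID() values are recycled as clients disconnect, so prefix
-- with the time: a 10-digit timestamp + '_' + an up-to-6-digit
-- connection id happens to fit the proposed 17 characters.
SET @uniqueID = CONCAT(UNIX_TIMESTAMP(), '_', CONNECTION_ID());
```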

I tinkered around somewhat and was able to successfully merge and un-merge my mother.

That's just... a sci-fi no-no. I would stick to experimenting on frogs :p

The logging tables take up less space than the [....]

There was something missing here.

  1. Culling - cold storage might make sense too. In general I feel we don't have to solve this yet.
  2. Yep - I'm still getting around 300 writes per second with most tables converted to InnoDB. I'll convert the rest & retry.
  3. Yeah - that makes sense too - perhaps within core it should only be enabled when logging is enabled, to reduce clutter (although not disabled when logging is disabled).
  4. The unique connection ID ticket is here: https://issues.civicrm.org/jira/browse/CRM-18193 - let's keep discussion of that issue over there.
  5. My mum is doing OK :-)

@Jgreen how feasible is the sort of extra disk use mentioned above? https://phabricator.wikimedia.org/T111704#2110819

Short term our most constrained machine is the master db, which is at 71% capacity on its 1.2TB data partition. We could also reduce utilization, for example by finally slaying faulknerdb (354GB) or by reducing expire_logs_days somewhat from 10 days (~100GB). So we can handle 20-30GB of live DB, plus the corresponding backups.

We will probably replace the db servers in FY2017, at which point we will increase capacity and performance.

Also have you talked to Jaime about storage engines? He would be a great resource.