
map out a civicrm.log_% table data expiration/delete strategy
Closed, Declined · Public

Description

CiviCRM's database uses log_% tables to store each change to the corresponding base tables. The log_% tables are populated by triggers that copy every write on a base table into its log table. Row compression is in use. The log tables grow without expiration, and with about four years of historical data they account for about 50% of the total disk footprint of the civicrm database.

Extrapolating from testing with log_civicrm_activity, if we were to trim these tables to records newer than 1 year, we would recover about 308GB, or 37% of the overall civicrm database footprint.

We've talked about expiration timelines, and it sounds as though FR-Tech agrees that one year may be an acceptable maximum age for data in log_% tables. This task is to settle that question and to develop a process for purging old data that can be automated. There are also several log tables that are not useful, which we should be able to disable altogether.

Here's a summary of action items (an SQL sketch of the drop pattern follows the list):

  • deploy civicrm patch to disable trigger/table creation for unnecessary tables
  • drop trigger civicrm_financial_item_after_insert
  • drop trigger civicrm_financial_item_after_update
  • drop trigger civicrm_financial_item_after_delete
  • drop table log_civicrm_financial_item
  • drop trigger civicrm_entity_financial_trxn_after_insert
  • drop trigger civicrm_entity_financial_trxn_after_update
  • drop trigger civicrm_entity_financial_trxn_after_delete
  • drop table log_civicrm_entity_financial_trxn
  • delete from log_civicrm_entity_tag (actually this was done as a table swap and drop to recover the 14GB it consumed)
  • drop trigger civicrm_financial_trxn_after_insert
  • drop trigger civicrm_financial_trxn_after_update
  • drop trigger civicrm_financial_trxn_after_delete
  • drop table log_civicrm_financial_trxn
  • drop trigger civicrm_line_item_after_insert
  • drop trigger civicrm_line_item_after_update
  • drop trigger civicrm_line_item_after_delete
  • drop table log_civicrm_line_item
  • drop trigger civicrm_mailing_after_insert
  • drop trigger civicrm_mailing_after_update
  • drop trigger civicrm_mailing_after_delete
  • drop table log_civicrm_mailing
  • drop trigger civicrm_mailing_job_after_insert
  • drop trigger civicrm_mailing_job_after_update
  • drop trigger civicrm_mailing_job_after_delete
  • drop table log_civicrm_mailing_job
  • drop trigger civicrm_subscription_history_after_insert
  • drop trigger civicrm_subscription_history_after_update
  • drop trigger civicrm_subscription_history_after_delete
  • drop table log_civicrm_subscription_history
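
For reference, the drop items above translate to SQL along these lines (a minimal sketch in MySQL syntax, shown for civicrm_financial_item; the same pattern repeats for each table listed):

  -- remove the logging triggers from the base table, then drop the log table
  DROP TRIGGER IF EXISTS civicrm_financial_item_after_insert;
  DROP TRIGGER IF EXISTS civicrm_financial_item_after_update;
  DROP TRIGGER IF EXISTS civicrm_financial_item_after_delete;
  DROP TABLE IF EXISTS log_civicrm_financial_item;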

Additional things we haven't solved yet:

  • decide on date range to keep (>1y seems feasible)
  • map out a safe procedure to purge old data
  • script the purge procedure and time it on frdev
  • decide whether the purge procedure can happen live or requires an outage

Event Timeline

Jgreen triaged this task as Medium priority. Jul 6 2020, 5:54 PM
Jgreen moved this task from Triage to Up Next on the fundraising-tech-ops board.
Jgreen updated the task description.

Eileen notes (in tech talk): we can purge contacts deleted between 2016 and one year ago (until now we had only purged those deleted before 2016). Then we should also purge log table entries for those purged contacts.

We discussed the doc https://docs.google.com/document/d/1PAHrjf1Z6iBrpqGR5xGQWW89c7aT1MZs7DogRerY3Kg/edit#heading=h.4xybid309w6c and the log tables in tech talk, and identified 100GB of log tables we can stop logging and drop (or in one case truncate to get rid of legacy data) - this is pretty quick to do

we also want to

  1. amp up our contact deletes - propose permanently deleting all contacts deleted > 1 year ago. This is not a big job (1) - just a comms exercise
  2. clean up logs on purged contacts - might be a bit bigger...

Tables to drop - requires a small code change, reloading of triggers, and a table drop (truncate in one case)

log_civicrm_financial_item.ibd 23,425,048 25,849,868 0 25,849,868 1 Drop - little forensic value
log_civicrm_entity_financial_trxn.ibd 20,512,784 21,463,048 0 21,463,048 1 Drop - little forensic value
log_civicrm_entity_tag.ibd 14,549,000 14,266,372 0 14,266,372 2 Truncate - mostly full of discontinued 'silly' tags - let's just track what people actually add
log_civicrm_financial_trxn.ibd 13,127,692 13,709,320 0 13,709,320 1 Drop - little forensic value
log_civicrm_line_item.ibd 10,817,548 11,407,368 0 11,407,368 1 Drop - little forensic value
log_civicrm_mailing.ibd 5,435,400 5,541,892 0 5,541,892 1 Drop - this is a copy of a copy of silverpop data
log_civicrm_mailing_job.ibd 3,805,188 3,887,112 0 3,887,112 1 Drop - little forensic value

@Jgreen given it's just a drop are you OK 'just dropping' the triggers on the non 'log-version' of the above tables? (we will add commits but this is more about outage avoidance)

Yes, that sounds do-able!

@Eileenmcnaughton re. the list of tables to drop, can you clarify these two?

log_civicrm_entity_tag -- I'm not sure what truncate means, are we just dropping all the current rows and leaving the trigger in place?

log_civicrm_subscription_history -- This is listed as 'Drop' on the Civicrm Database Growth spreadsheet https://docs.google.com/spreadsheets/d/1FqQtn4XnXBnAIH8_FjFEp0mDty5RD85gJvWBh2xXyyY/edit?usp=sharing but not included in your list above.

Jgreen updated the task description.

@Jgreen I can take a stab at those answers
for log_civicrm_entity_tag, yep, that's the idea. We used to automatically apply a couple of tags to contacts and contributions during queue consumption, but we no longer do. So 99%+ of the existing log table is that automatic useless stuff. We can drop all the old stuff and just log new changes, which will presumably be human-initiated.

log_civicrm_subscription_history should definitely be dropped along with associated triggers - the subscription_history table itself is basically a log of what groups people have been assigned to, so a log table on those values is redundant.

Ok great, thanks!
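
For the log_civicrm_entity_tag plan above, either of the following clears the historical rows while leaving the triggers on civicrm_entity_tag (and therefore future logging) in place. This is a sketch rather than the exact statements run; the swap-and-drop form is what the task description notes was actually used to recover the ~14GB.

  -- simplest form: empty the log table in place
  TRUNCATE TABLE log_civicrm_entity_tag;

  -- swap-and-drop variant: create an empty copy, swap it in atomically, drop the old one
  CREATE TABLE log_civicrm_entity_tag_new LIKE log_civicrm_entity_tag;
  RENAME TABLE log_civicrm_entity_tag     TO log_civicrm_entity_tag_old,
               log_civicrm_entity_tag_new TO log_civicrm_entity_tag;
  DROP TABLE log_civicrm_entity_tag_old;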

Change 610359 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Exclude discussed tables from logs

https://gerrit.wikimedia.org/r/610359

I just pushed up the patch that will exclude the tables from the regenerated triggers

@Ejegg @Eileenmcnaughton should this be pending deployment? I see the patch above was +2ed.

Change 610359 merged by Eileen:
[wikimedia/fundraising/crm@master] Exclude discussed tables from logs

https://gerrit.wikimedia.org/r/610359

What's the definition of done on this task? I see a merged patch, a related task, and a lot of comments. I'm curious what the next steps are.

We are missing parameters and a process for expiring old log table data. At a high level I think we can script a table-by-table process that selects rows newer than X days into a new table, and replaces the old table. But we need FR-Tech input on how to minimize disruption, to make sure we're not breaking anything by doing so, and to make a final decision on how many days of data we need to keep.
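
A minimal sketch of that per-table idea, using log_civicrm_activity as an example, 550 days as a placeholder cutoff, and the standard log_date column on the log tables:

  -- copy only the rows we want to keep into a fresh table
  CREATE TABLE log_civicrm_activity_trimmed LIKE log_civicrm_activity;
  INSERT INTO log_civicrm_activity_trimmed
  SELECT * FROM log_civicrm_activity
  WHERE log_date >= NOW() - INTERVAL 550 DAY;
  -- atomically swap the trimmed copy into place, then drop the original
  RENAME TABLE log_civicrm_activity         TO log_civicrm_activity_old,
               log_civicrm_activity_trimmed TO log_civicrm_activity;
  DROP TABLE log_civicrm_activity_old;

Anything written to the live table while the copy runs would be missing from the trimmed table, which is part of the disruption question raised above.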

My understanding was that we'd discussed a year for the log tables, and the only question was whether it was a rolling one year or one year as measured from a fixed point in the year. I might be wrong though.

That's my understanding too, I just wasn't confident we had arrived at a final decision.

I think we have a lot of flexibility here. I'm pretty sure we could choose either 1 exact year or 1 year from a date and be fine. @Ejegg @Eileenmcnaughton or @XenoRyet could confirm though.

Maybe the next step should be solving the mechanics of the old data purge and getting that process going. We can work with a semi-arbitrary range, like '550 days from today', and fine-tune the timeframe afterward.

I support the rolling 550 day mark. That gives one full calendar year plus coverage to get a full fiscal year.

I think some tables have more value for old data than others - notably contact & email data

We could design a purge script that uses per-table max-age settings?

Testing on frdev1001/dev_civicrm, I wrote a script that creates log_*_postpurge tables for each log_ table, and populates them with the past 750 days of data, figuring 550 days discussed above plus ~200 days since dev_civicrm was reloaded. This takes about 3:40 to run and reduces the log table data footprint from 551GB to 151GB.

I'm thinking of a slightly different strategy which could possibly be done live with under a minute of disruption (see the sketch after these steps).

  1. flush tables with read lock
  2. move each log_ table aside and generate a new one with autoincrement matching the old table's current position
  3. unlock tables
  4. backfill each log_ table with the past 550 days from the old log table
  5. sanity check old vs new table (is row count good enough?)
  6. drop the old log table
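
A rough SQL sketch of steps 2 and 4-6 for a single table (log_civicrm_activity as an example; the 550-day window and the AUTO_INCREMENT value are placeholders, and the read lock from steps 1 and 3 is what keeps that value from moving between reading it and swapping):

  -- create an empty replacement positioned at the old table's auto_increment
  CREATE TABLE log_civicrm_activity_new LIKE log_civicrm_activity;
  ALTER TABLE log_civicrm_activity_new AUTO_INCREMENT = 123456789;  -- old table's current position
  -- swap it in; RENAME TABLE is atomic, so writes resume against the new table immediately
  RENAME TABLE log_civicrm_activity     TO log_civicrm_activity_old,
               log_civicrm_activity_new TO log_civicrm_activity;
  -- backfill the kept window behind the live table (original log_id values are preserved)
  INSERT INTO log_civicrm_activity
  SELECT * FROM log_civicrm_activity_old
  WHERE log_date >= NOW() - INTERVAL 550 DAY
    AND log_id < 123456789;   -- only rows that existed before the swap
  -- sanity check: counts over the kept window should match before dropping
  SELECT COUNT(*) FROM log_civicrm_activity_old WHERE log_date >= NOW() - INTERVAL 550 DAY;
  SELECT COUNT(*) FROM log_civicrm_activity WHERE log_id < 123456789;
  -- drop the old table once the check passes
  DROP TABLE log_civicrm_activity_old;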

The more I think about this the more I'd like to identify the tables that tell us about the contact & leave them out. I'm comfortable being more aggressive on all the other tables but I think the most likely thing for us to want to know after 6 months is address history & history of opt out etc.

I think contribution & activity data is large & I think they have less longer term value

No problem, this is baked into the script I'm working on. It gets the list of log_ tables from the database and then checks its config file for a TableMaxDays setting for each individual table. If a table is not in the configuration, it won't be touched. I'm testing with 750 days arbitrarily just to get a sense of things.

Testing on the frdev1001 dev_civicrm database, purging to 750 days for all log tables reduced the database footprint from 862GB to 652GB, and took 3:29 to run. During that time specific tables are locked for a few seconds to do the swap, and most of the time is spent backfilling the freshly swapped-in table. Four tables are missing the log_id auto_increment column; the script passes over those by design.
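
One way to list the log_ tables that lack an auto-increment log_id column (the ones the script skips), assuming MySQL's information_schema and a schema named civicrm:

  SELECT t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.columns c
         ON c.table_schema = t.table_schema
        AND c.table_name   = t.table_name
        AND c.column_name  = 'log_id'
        AND c.extra LIKE '%auto_increment%'
  WHERE t.table_schema = 'civicrm'
    AND t.table_name LIKE 'log\_%'
    AND c.column_name IS NULL;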

@Jgreen those tables that we plan to drop altogether were included in the 200GB? Will we drop those today on live during the outage?

I didn't drop them, we can recover an additional 25GB by dropping them. I didn't check their size beforehand, but they're 85GB in the live db.

We're not doing any database work today, I think there are campaigns running so we wanted to minimize disruption. We're still planning to do a database maintenance outage in the next month or so.

@Jgreen was I wrong in thinking the civi upgrade is today?

We're doing the OS (and thus php) upgrade today.

@Jgreen I think we should also drop the triggers & tables from the top of this phab today - holding off for an outage is pretty cautious IMHO, since the tables are completely unused without the triggers & both should be super quick to drop - so it would be good to use today's outage to do it

Jgreen updated the task description.

Done!

Cool - & that was 85 GB?

Should have been, yes. Civicrm database footprint is currently 753GB.

@Jgreen that is about 200GB less than it was at the start I think (+ 50GB off the drupal DB - definitely something)

Plus the pgehres database (300GB, but still need to remove from frdb1002 etc) and faulkner (350GB). We're making progress!

Ok, another run:

  • dev_civicrm fresh restore from prod 2020-08-07 (after table truncating and drops on this task)
  • across-the-board keep <= 550 days
  • run time was 2:20
  • footprint reduction of 190GB (from 723GB to 533GB)

@Eileenmcnaughton you mentioned excluding some tables and being more aggressive with others. Assuming we're working with 550 days otherwise, can you let me know which tables should be handled differently?

I think the tables would be

civicrm_contact
civicrm_email
civicrm_phone
civicrm_address
civicrm_note
+ contact custom tables - ie
SELECT table_name FROM civicrm_custom_group WHERE extends IN ('Contact', 'Organization', 'Household', 'Individual') AND table_name != 'wmf_donor'

Marginal
civicrm_entity_tag
civicrm_group_contact
civicrm_relationship

@Ejegg @mepps - thoughts?

Crunching the numbers on this, excluding these tables from a 550 day purge cycle makes a difference of about 5.6GB today, so excluding them or maybe purging them to more like 5 years seems feasible.

The three marginal tables are about 1.2GB total after 550 day purge, so I don't think it helps much to treat them more aggressively.

Here's the first pass at a purge script

and the config file for it

Jgreen lowered the priority of this task from Medium to Low. Mar 11 2021, 8:14 PM
Jgreen removed Jgreen as the assignee of this task. Sep 7 2021, 3:17 PM
Jgreen moved this task from Blocked to Done on the fundraising-tech-ops board.

Declining this task because the architecture of the log tables makes it impossible to delete old rows without breaking the diff history. The diffs as implemented rely on the oldest record as a starting reference, and expiring old revision data renders the more recent revision data unusable. The log table scheme will have to change before we can expire anything.
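
A greatly simplified illustration of the problem (not CiviCRM's actual report query; the contact id and the display_name column are just examples): a log row only becomes a meaningful "what changed" entry when compared against an earlier row for the same entity, and that chain bottoms out at the oldest record, so deleting the early rows leaves the remaining rows with nothing to diff against.

  SELECT cur.log_date,
         cur.display_name,
         (SELECT p.display_name
          FROM log_civicrm_contact p
          WHERE p.id = cur.id
            AND p.log_date < cur.log_date
          ORDER BY p.log_date DESC
          LIMIT 1) AS previous_display_name   -- NULL once the earlier rows are gone
  FROM log_civicrm_contact cur
  WHERE cur.id = 12345
  ORDER BY cur.log_date;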

I am going through old tickets around this topic & collating links to them.

It seems worth noting on this ticket that in the course of this sprint I have accessed log records more than one year old at least 4 times for various forensics. In one case I was looking for data as far back as 2015 - which I didn't find. In at least 2 of these instances the data I did use was outside the 550 days mentioned here too.