Page MenuHomePhabricator

map out a civicrm.log_% table data expiration/delete strategy
Open, MediumPublic

Description

Civicrm's database uses log_% tables to store each change to corresponding tables. The log_% are populated by triggers that copy each write to a table to the corresponding log table. Row compression is in use. They grow without expiration, and with about four years of historical data they account for about 50% of the total disk footprint of the civicrm database.

Extrapolating from testing with log_civicrm_activity, if we were to trim these tables to records newer than 1 year, we would recover about 308GB or 37% the overall civicrm database footprint.

We've talked about expiration timelines and it sounds as though FR tech agrees that 1 year may be an acceptable max age for data in log_% tables. This task is to address that question and to develop a process to purge old data that can be automated. There are also several log tables that are not useful, which we should be able to disable altogether.

Here's a summary of action items:

  • deploy civicrm patch to disable trigger/table creation for unnecessary tables
  • drop trigger civicrm_financial_item_after_insert
  • drop trigger civicrm_financial_item_after_update
  • drop trigger civicrm_financial_item_after_delete
  • drop table log_civicrm_financial_item
  • drop trigger civicrm_entity_financial_trxn_after_insert
  • drop trigger civicrm_entity_financial_trxn_after_update
  • drop trigger civicrm_entity_financial_trxn_after_delete
  • drop table log_civicrm_entity_financial_trxn
  • delete from log_civicrm_entity_tag
  • drop trigger civicrm_financial_trxn_after_insert
  • drop trigger civicrm_financial_trxn_after_update
  • drop trigger civicrm_financial_trxn_after_delete
  • drop table log_civicrm_financial_trxn
  • drop trigger civicrm_line_item_after_insert
  • drop trigger civicrm_line_item_after_update
  • drop trigger civicrm_line_item_after_delete
  • drop table log_civicrm_line_item
  • drop trigger civicrm_mailing_after_insert
  • drop trigger civicrm_mailing_after_update
  • drop trigger civicrm_mailing_after_delete
  • drop table log_civicrm_mailing
  • drop trigger civicrm_mailing_job_after_insert
  • drop trigger civicrm_mailing_job_after_update
  • drop trigger civicrm_mailing_job_after_delete
  • drop table log_civicrm_mailing_job
  • drop trigger civicrm_subscription_history_after_insert
  • drop trigger civicrm_subscription_history_after_update
  • drop trigger civicrm_subscription_history_after_delete
  • drop table log_civicrm_subscription_history

Additional things we haven't solved yet:

  • decide on date range to keep (>1y seems feasible)
  • map out a safe procedure to purge old data
  • script the purge procedure and time it on frdev
  • decide whether the purge procedure can happen live or requires an outage

Event Timeline

Jgreen created this task.Jul 6 2020, 5:24 PM
Jgreen triaged this task as Medium priority.Mon, Jul 6, 5:54 PM
Jgreen moved this task from Triage to Up Next on the fundraising-tech-ops board.
Jgreen updated the task description. (Show Details)Mon, Jul 6, 7:57 PM
Jgreen updated the task description. (Show Details)
Jgreen updated the task description. (Show Details)Mon, Jul 6, 8:12 PM
DStrine moved this task from Triage to FR-Ops on the Fundraising-Backlog board.Mon, Jul 6, 8:18 PM
DStrine moved this task from FR-Ops to Triage on the Fundraising-Backlog board.Tue, Jul 7, 5:32 PM
DStrine removed a project: Epic.Tue, Jul 7, 5:58 PM
Ejegg added a subscriber: Ejegg.Tue, Jul 7, 10:10 PM

Eileen notes (in tech talk): We can purge deleted contacts from 2016-one year ago (we had only purged those deleted before 2016 till now). Then we should also purge log table entries for those purged contacts.

Eileenmcnaughton added a comment.EditedTue, Jul 7, 10:51 PM

We discussed the doc https://docs.google.com/document/d/1PAHrjf1Z6iBrpqGR5xGQWW89c7aT1MZs7DogRerY3Kg/edit#heading=h.4xybid309w6c and log tables in tech talk and identified 100GB of log table we can stop logging / drop (or in one case truncate to get rid of legacy data) - this is pretty quick to do

we also want to

  1. amp up our contact deletes - propose permanently deleting all contacts deleted > 1 year ago. This is not a big job (1) - just a comms exercise
  2. clean up logs on purged contacts - might be a bit bigger...

Tables to drop - requires small change, reloading of triggers & table drop / truncate in 1 case

log_civicrm_financial_item.ibd 23,425,048 25,849,868 0 25,849,868 1 Drop - little forensic value
log_civicrm_entity_financial_trxn.ibd 20,512,784 21,463,048 0 21,463,048 1 Drop - little forensic value
log_civicrm_entity_tag.ibd 14,549,000 14,266,372 0 14,266,372 2 Truncate - mostly full of discontinued 'silly' tags - let's just track what people actually add
log_civicrm_financial_trxn.ibd 13,127,692 13,709,320 0 13,709,320 1 Drop little forensic value
log_civicrm_line_item.ibd 10,817,548 11,407,368 0 11,407,368 1 Drop little forensic value
log_civicrm_mailing.ibd 5,435,400 5,541,892 0 5,541,892 1 Drop - this is a copy of a copy of silverpop data
log_civicrm_mailing_job.ibd 3,805,188 3,887,112 0 3,887,112 1 Drop - little forensic value

Eileenmcnaughton added a comment.EditedTue, Jul 7, 10:58 PM

@Jgreen given it's just a drop are you OK 'just dropping' the triggers on the non 'log-version' of the above tables? (we will add commits but this is more about outage avoidance)

Jgreen updated the task description. (Show Details)Wed, Jul 8, 1:55 PM

@Jgreen given it's just a drop are you OK 'just dropping' the triggers on the non 'log-version' of the above tables? (we will add commits but this is more about outage avoidance)

Yes, that sounds do-able!

Jgreen updated the task description. (Show Details)Wed, Jul 8, 2:06 PM
Jgreen added a comment.Wed, Jul 8, 2:10 PM

@Eileenmcnaughton re. the list of tables to drop, can you clarify these two?

log_civicrm_entity_tag -- I'm not sure what truncate means, are we just dropping all the current rows and leaving the trigger in place?

log_civicrm_subscription_history -- This is listed as 'Drop' on the Civicrm Database Growth spreadsheet https://docs.google.com/spreadsheets/d/1FqQtn4XnXBnAIH8_FjFEp0mDty5RD85gJvWBh2xXyyY/edit?usp=sharing but not included in your list above.

Jgreen updated the task description. (Show Details)Wed, Jul 8, 2:16 PM
Jgreen updated the task description. (Show Details)
Ejegg added a comment.Wed, Jul 8, 4:16 PM

@Jgreen I can take a stab at those answers
for log_civicrm_entity_tag, yep, that's the idea. We used to automatically apply a couple of tags to contacts and contributions during queue consumption, but we no longer do. So 99%+ of the existing log table is that automatic useless stuff. We can drop all the old stuff and just log new changes, which will presumable be human-initiated.

log_civicrm_subscription_history should definitely be dropped along with associated triggers - the subscription_history table itself is basically a log of what groups people have been assigned to, so a log table on those values is redundant.

Ejegg updated the task description. (Show Details)Wed, Jul 8, 4:17 PM
Jgreen added a comment.Wed, Jul 8, 5:13 PM

@Jgreen I can take a stab at those answers
for log_civicrm_entity_tag, yep, that's the idea. We used to automatically apply a couple of tags to contacts and contributions during queue consumption, but we no longer do. So 99%+ of the existing log table is that automatic useless stuff. We can drop all the old stuff and just log new changes, which will presumable be human-initiated.

log_civicrm_subscription_history should definitely be dropped along with associated triggers - the subscription_history table itself is basically a log of what groups people have been assigned to, so a log table on those values is redundant.

Ok great, thanks!

Jgreen updated the task description. (Show Details)Wed, Jul 8, 5:14 PM
mepps added a subscriber: mepps.Wed, Jul 8, 7:18 PM

Change 610359 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Exclude discussed tables from logs

https://gerrit.wikimedia.org/r/610359

I just pushed up the patch that will exclude the tables from the regenerated triggers

Jgreen updated the task description. (Show Details)Wed, Jul 8, 8:23 PM
mepps added a comment.Thu, Jul 9, 5:25 PM

@Ejegg @Eileenmcnaughton should this be pending deployment? I see the patch above was +2ed.

Jgreen updated the task description. (Show Details)Thu, Jul 9, 6:40 PM

Change 610359 merged by Eileen:
[wikimedia/fundraising/crm@master] Exclude discussed tables from logs

https://gerrit.wikimedia.org/r/610359

mepps added a comment.Mon, Jul 20, 7:17 PM

What's the definition of done on this task? I see a merged patch, a related task, and a lot of comments. I'm curious what the next steps are.

What's the definition of done on this task? I see a merged patch, a related task, and a lot of comments. I'm curious what the next steps are.

We are missing parameters and a process for expiring old log table data. At a high level I think we can script a table-by-table process that selects rows newer than X days into a new table, and replaces the old table. But we need FR-Tech input on how to minimize disruption, to make sure we're not breaking anything by doing so, and to make a final decision on how many days of data we need to keep.

mepps added a comment.Tue, Jul 21, 2:27 PM

My understanding was that we'd discussed a year for the log tables but the only question was whether it was a 1 year ago or 1 year as specified at a specific time of the year. I might be wrong though.

My understanding was that we'd discussed a year for the log tables but the only question was whether it was a 1 year ago or 1 year as specified at a specific time of the year. I might be wrong though.

That's my understanding too, I just wasn't confident we had arrived at a final decision.

mepps added a subscriber: XenoRyet.EditedTue, Jul 21, 4:21 PM

I think we have a lot of flexibility here. I'm pretty sure we could choose either 1 exact year or 1 year from a date and be fine. @Ejegg @Eileenmcnaughton or @XenoRyet could confirm though.

My understanding is that we have a lot of flexibility here. I'm pretty sure we could choose either 1 exact year or 1 year from a date and be fine. @Ejegg @Eileenmcnaughton or @XenoRyet could confirm though.

Maybe the next step should be solving the mechanics of the old data purge and getting that process going. We can work with a semi-arbitrary range, like "550 days from today' and fine tune the timeframe afterward.

Jgreen updated the task description. (Show Details)Tue, Jul 21, 5:09 PM
mepps added a comment.Tue, Jul 21, 5:09 PM

That sounds good to me @Jgreen.

I support the rolling 550 day mark. That gives one full calendar year plus coverage to get a full fiscal year.

I think some tables have more value for old data than others - notably contact & email data

I think some tables have more value for old data than others - notably contact & email data

We could design a purge script that uses per-table max-age settings?

Jgreen added a comment.EditedThu, Jul 23, 9:15 PM

Testing on frdev1001/dev_civicrm, I wrote a script that creates log_*_postpurge tables for each log_ table, and populates them with the past 750 days of data, figuring 550 days discussed above plus ~200 days since dev_civicrm was reloaded. This takes about 3:40 to run and reduces the log table data footprint from 551GB to 151GB.

I'm thinking of a slightly different strategy which could possibly be done live with under a minute of disruption.

  1. flush tables with read lock
  2. move each log_ table aside and generate a new one with autoincrement matching the old table's current position
  3. unlock tables
  4. backfill each log_ table with the past 550 days from the old log table
  5. sanity check old vs new table (is row count good enough?)
  6. drop the old log table

The more I think about this the more I'd like to identify the tables that tell us about the contact & leave them out. I'm comfortable being more aggressive on all the other tables but I think the most likely thing for us to want to know after 6 months is address history & history of opt out etc.

I think contribution & activity data is large & I think they have less longer term value

The more I think about this the more I'd like to identify the tables that tell us about the contact & leave them out. I'm comfortable being more aggressive on all the other tables but I think the most likely thing for us to want to know after 6 months is address history & history of opt out etc.

I think contribution & activity data is large & I think they have less longer term value

No problem, this is baked into the script I'm working on. It gets the list of log_ tables from the database and then checks its config file for a TableMaxDays setting for each individual table. If a table is not in the configuration, it won't be touched. I'm testing with 750 days arbitrarily just to get a sense of things.