Page MenuHomePhabricator

Answer some data size/churn questions for the Donor ID export
Closed, ResolvedPublic

Description

Questions from Jon's continuity document during his 3-week vacation that we should answer before he returns:

  • Stability for the format: What is the format of the CSV provided by CiviCRM that the MediaWiki maintenance script will need to parse?
  • How many database rows do we expect to require writes to the initial execution of the maintenance script?
  • Following the initial execution, if we operate on diffs with a regular cadence (E.g. only rows that have changed in the CSV) what is the rate with which we expect rows to change for:
    • All types of donors
    • Only lapsed donors

Event Timeline

I can't comment on that document, but here are some rough numbers:

Export format is contact_id INT, email_address VARCHAR(255), donor_status_id INT, do_not_solicit TINYINT(1)

Expect ~40m initial rows

Currently for Silverpop, we export daily, but send any contacts modified in past 7 days. This includes ~250k contacts normally, ~1m when donation volume is higher. We could shorten the modified filter and reduce this to a daily ~50k in normal conditions. Most of these will not have changed in any of the fields included in the export, so we could reduce further if needed, but we haven't currently set up anything to handle this.

If we only want to look at lapsed donors, expect about 10% of the above numbers.

Let's add the answers to this to the existing README so it doesn't get lost in a phab task over time.

Thanks in advance for your answers!

This includes ~250k contacts normally, ~1m when donation volume is higher

Could you categorize this based on donor segment ? For example I imagine certain segment changes may have fewer rows. If early versions of the product only care about active and lapsed donors, we may choose to only send diffs for changes between those segments e.g. lapsed > active; active -> lapsed so it will be useful to have these numbers broken down!

We could shorten the modified filter and reduce this to a daily ~50k in normal conditions.

Could you expand on what shortening the modified filter would mean?

Expect ~40m initial rows

FYI I think we can work with diffs from the beginning and there will be no need for an initial load. Since we need user consent, nobody (or at least very few) will have consented on the first run so all the data will be ignored. This does mean changes will be slow to propagate but I think that should be fine. We may need to do a manual one-off transfer at some future data so that number is still useful!

@Jdlrobson-WMF

Could you categorize this based on donor segment ? For example I imagine certain segment changes may have fewer rows. If early versions of the product only care about active and lapsed donors, we may choose to only send diffs for changes between those segments e.g. lapsed > active; active -> lapsed so it will be useful to have these numbers broken down!

Our export that includes 250k - 1m rows is made up of contacts who have been modified in the 7 day window. Since the overwhelming majority will have been modified because of a donation, these are going to be almost entirely active donors (consecutive, new, active) or active recurring donors (monthly or annual), specifically from lapsed/LYBUNT/etc -> active/consecutive/new, plus some active recurring -> delinquent recurring and delinquent recurring -> lapsed recurring when recurring donations fail or are cancelled. I imagine we are interested in all of these. There will also be lots of active -> active who have been modified on fields we don't care about, but as we have written the export currently, we don't differentiate, we just export contacts who have been modified in some way. This is definitely not optimized, in other words.

The only time when the one-time donor statuses are otherwise changing is once per year at the fiscal year start (when many donors will go from active -> LYBUNT, LYBUNT -> lapsed). All of the one-time statuses are based on having made a donation within a fiscal year or not, so one-time donors are only stepping downwards in status when the fiscal year rolls over.

As mentioned, these statuses will be changing when we update the framework at the start of the fiscal.

Could you expand on what shortening the modified filter would mean?

Currently, for Acoustic, we export any contacts who have been modified within 7 days (I'm not entirely clear on why we do this). But we could change that to two days here or even store the last export date so that we can give an exact diff. Fewer days, fewer rows.

FYI I think we can work with diffs from the beginning and there will be no need for an initial load. Since we need user consent, nobody (or at least very few) will have consented on the first run so all the data will be ignored. This does mean changes will be slow to propagate but I think that should be fine. We may need to do a manual one-off transfer at some future data so that number is still useful!

This part isn't entirely clear to me. How do we ensure that a donor who consents has their status appropriately set on the MW end within a reasonable time frame after consenting? If MW is ignoring any donor in our file that has not consented at the time and we are only sending MW donors who have been modified within a day or two in CiviCRM, then there could be up to a year delay (until the next fiscal year) before the donor's status is updated on MW. E.g. if someone donates on Dec 15 and consents on Jan 1, their status will be empty in MW until July 1, unless their contact record is modified in CiviCRM for some other reason.

This part isn't entirely clear to me. How do we ensure that a donor who consents has their status appropriately set on the MW end within a reasonable time frame after consenting? If MW is ignoring any donor in our file that has not consented at the time and we are only sending MW donors who have been modified within a day or two in CiviCRM, then there could be up to a year delay (until the next fiscal year) before the donor's status is updated on MW. E.g. if someone donates on Dec 15 and consents on Jan 1, their status will be empty in MW until July 1, unless their contact record is modified in CiviCRM for some other reason.

This still needs to clarified. I am waiting on some clarity on the product side of things (including confirmation of the donor segments and initial experiments that we'll use in the first use cases) before recommending one but I have a few ideas (sharing to help inform the export script.

For someone who donates on Dec 15 and consents on Jan 1 there are a few scenarios:

  1. If they consent via user preferences, they will be identifiable in the table by donor segment (e.g. let's say Consenting User is a segment and has the value 0.5). In addition to the daily/weekly diff transfer, a monthly transfer of all active donor statuses could be sent. We find all users with segment 0.5 and then filter the CSV to find corresponding users and update just those entries.
  2. If we detect the presence of the donation cookie we can determine how recent their donation was (up to 150 days) and assign them to one of the segments. The segment may be incorrect but that might be acceptable.
  3. We can time shift the diffs to allow time for consent e.g. for Jan 15th, perhaps we send diffs from previous month (Dec 1st-December 15th) and for Jan 30th, we send December 15th-31st etc.. etc.. (provided we are okay with the trade off of losing accuracy for certain users)

Hope this makes sense.

@Jdlrobson-WMF I think your option 1 could make sense (and would be easy for us to provide).

Two other options we could consider are:

  1. In the future, we would like to get the fact that the donor has consented on MW back to CiviCRM. If we have that information, we could include those contacts in the next export from Civi to MW. That might indicate we should move to sending data back to Civi from MW sooner rather than later.
  2. If we decide in the future that we want to be able to provide the donor status on MW immediately after opt in, we could provide an Civi API endpoint that takes an email and returns donor status (and do_not_solicit). This would be fairly simple to implement.

@Lars great, yes these are great other options to consider. Note for end of Q4, we just need to prove out that we can flow data from CiviCRM to MediaWiki so let's focus on diffs for now.

Do we have full control over when we send the diffs? e.g. on 15th March we send diffs for the period of February and 15th April we send for the period of March etc (for example) ?

@Jdlrobson-WMF In theory, yes, we could specify an arbitrary range of modified dates, however we only use this broader export script by specifying the number of days to look back, so that use would be safer (and has been our operating assumption here). If there's a strong need to specify other date ranges, we'd have to take a look at the details.