
Huge decrease in contacts from the main database update
Closed, Resolved · Public

Description

Hi there,

The _all_wikimedia database update is showing a massive decrease in contacts that exceeds the number of contacts the Master Suppression List gained. Usually we consider a small drop fine as long as the difference is not larger than the number of new MSL contacts, but as you can see from the screenshot below, the main database dropped by over 1 million. This happened specifically between yesterday (11/24) and today (11/25). Please see the attached screenshot for more details.

Please let us know if you have any questions. Setting priority as high for now.

Event Timeline

KHaggard created this task. Nov 25 2019, 5:05 PM
Restricted Application added a subscriber: Aklapper. Nov 25 2019, 5:05 PM
KHaggard triaged this task as High priority. Nov 25 2019, 5:06 PM
jgleeson added a subscriber: jgleeson. (Edited) Nov 25 2019, 10:48 PM

It looks like the latest export, DatabaseUpdate-20191125082239.csv, contains 17,052,039 records, which is higher than the total count in the screen grab on this ticket by a million and change. According to the logs, the upload of that file to Silverpop's servers completed successfully, so I'm wondering if something happened in the steps that take place after we upload the list to Silverpop.

Do we have access to any logs on the Silverpop side to help give us an idea why they're only seeing ~16M rows vs. our ~17M?

I checked the file that was sftp'd to Silverpop, and the file size is 3,885,445,068 bytes, the same as our local copy, which has 17,052,083 rows in it. So what is getting uploaded seems right.
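(For reference, a check like that can be scripted. Below is a minimal sketch using paramiko; the host, credentials, and paths are hypothetical placeholders, not our real configuration.)

```python
# Sketch: compare the remote (Silverpop-side) file size against our local
# copy, and count the local data rows. All connection details are made up.
import os
import paramiko

LOCAL_PATH = "/srv/exports/DatabaseUpdate-20191125082239.csv"
REMOTE_PATH = "/upload/DatabaseUpdate-20191125082239.csv"

def local_row_count(path):
    """Count data rows, excluding the header line."""
    with open(path, "rb") as f:
        return sum(1 for _ in f) - 1

transport = paramiko.Transport(("sftp.silverpop.example", 22))
transport.connect(username="wikimedia", password="...")
sftp = paramiko.SFTPClient.from_transport(transport)

remote_size = sftp.stat(REMOTE_PATH).st_size
local_size = os.path.getsize(LOCAL_PATH)
print(f"remote={remote_size} local={local_size} match={remote_size == local_size}")
print(f"local rows: {local_row_count(LOCAL_PATH)}")

sftp.close()
transport.close()
```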

And the file we upload has no duplicate email addresses?

The code should have pre-merged those.
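(A quick way to verify the no-duplicates assumption against the export file itself; a sketch, assuming the CSV has an "email" column, which is a guess at the layout.)

```python
# Sketch: scan the export CSV for duplicate email addresses.
# The "email" column name is assumed, not confirmed.
import csv
from collections import Counter

def duplicate_emails(path):
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["email"].strip().lower()] += 1
    return {addr: n for addr, n in counts.items() if n > 1}

dupes = duplicate_emails("DatabaseUpdate-20191125082239.csv")
print(f"{len(dupes)} duplicated addresses")
```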

The file is still present in the SFTP folder. I thought they were deleted on upload?

They should be... we aren't able to access the files after upload.

Hmm, I can definitely see it there now... is it accessible now?

Does the upload start automatically? Could it be a timing issue, e.g. the file was only partially uploaded when the Silverpop process ran?
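(If it is a partial-upload race, a common guard is to upload under a temporary name the importer's pattern won't match and rename only once the transfer completes. A sketch with hypothetical paths; not necessarily what our uploader does today, just an illustration of the pattern.)

```python
# Sketch: upload under a dot-prefixed ".part" name that a
# "DatabaseUpdate-*" glob won't match, then rename when complete,
# so the consumer never picks up a half-written file.
import os
import paramiko

def safe_upload(sftp, local_path, remote_path):
    directory, name = os.path.split(remote_path)
    tmp = os.path.join(directory, "." + name + ".part")
    sftp.put(local_path, tmp)      # slow; file is invisible to the glob
    sftp.rename(tmp, remote_path)  # effectively instant; appears complete
```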

Update: the _all_wikimedia data job runs successfully, but the Master Suppression List job does not. I am, however, able to manually run the MSL job with success.

I got today's imports and am attaching them below. The numbers look healthier, but you can see that the main database picked up yesterday's (11/25) file, so a partial upload yesterday seems possible. I wonder if 9am GMT is too early for _all_wikimedia and the Master Suppression List, or maybe something is slowing the process down to the point where Silverpop starts before the files are ready?

I'm interested to see what today's upload count is. According to the source data we should have 16,952,513 records in the _all_ list.

@KHaggard looks like you posted just as I did :)

OK, so it looks like today's import is actually yesterday's, which feels like a timing problem. @KHaggard, can you show us how you are running these imports on the Silverpop side?

Sure! Let me know the best way to show you, or I'll try writing it out here; just let me know if it makes sense.

I set up a recurring job for the main database and the unsubscribes. This runs daily at 9am GMT, and I get an email of the job completion every morning. If a job fails, it lets me know, and I try to manually run it on this screen:

If it runs successfully, that means the file was ready to go but didn't get picked up at the right time by Silverpop. If it doesn't run, then there's a deeper problem I can't troubleshoot (the file is missing, etc.).

Ultimately, it's an automated process that I just pop into whenever there's a problem. Is this helpful? Let me know if you need to know how I set up the recurring job or if you need any other info :)

Thanks @KHaggard

How do you select the files to be used for those jobs? Does it ask you to specify a path or something like that?

KHaggard added a comment. (Edited) Nov 26 2019, 9:36 PM

We tell IBM to look for "DatabaseUpdate-*" and "Unsubscribes-*", which matches the start of the FTP file names. It then grabs the oldest file first, since we don't specify a date. So when multiple days are backed up, we manually run the data job via the screen in the screenshot I added, as many times as it takes to catch up.
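(So the pickup presumably behaves roughly like the sketch below; Silverpop's actual implementation isn't visible to us, and the directory path is made up.)

```python
# Sketch of oldest-first selection over the patterns described above.
import glob
import os

candidates = glob.glob("/upload/DatabaseUpdate-*")
if candidates:
    oldest = min(candidates, key=os.path.getmtime)
    print(f"would import {oldest} first")
```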

Ejegg added a subscriber: Ejegg. Nov 27 2019, 4:19 PM

OK, the timing of the upload does seem to be the issue. It was finishing before 9:00 until the morning of Nov 25, but on that day it was still uploading at 9:00. Yesterday and today it has been uploading late too.

It looks like we've been running another intensive database job (repairing old bulk mailing records that were imported FROM Silverpop) around the clock since Nov 24th. I'd guess that's why the export and upload have taken so much longer these past few days.

We will start the export one hour earlier while the repair job is running.

Thanks @Ejegg! I'll update this thread tomorrow morning to let you know if the totals look healthy again :)

Ejegg added a comment. Nov 27 2019, 5:30 PM

@KHaggard since today's upload was late too, this morning's file might still be sitting there waiting for processing.

To make sure it doesn't mess us up tomorrow morning, can you process and delete it via the Silverpop UI, or should we go in via SFTP to do that?

Hey @Ejegg, I manually ran the import job again, so it should be out of the way for the new file tomorrow. But would you be willing to confirm that it has been deleted from SFTP?

@KHaggard confirmed, the upload directory looks empty via SFTP.

Thanks - the jobs this morning ran without failure and the numbers look healthy to me. Attaching below.

@Ejegg One thing I would like to ask about: even though the jobs are running smoothly again, the total number of rows is still roughly 1 million below what it used to be. Is this worth looking into further? Attaching the jobs from 11/21 (before the timing issue started). You can see the total was 17 million then, while today's total is 16 million.

Ejegg added a comment. Dec 2 2019, 1:24 PM

@KHaggard thanks for including the unsubscribes in that screenshot. It looks like a drop of 401,614 donors on the master list is roughly matched by a rise of 426,685 donors on the unsubscribes list. Does that sound like a reasonable number of unsubscribes for 10 days of heavy email?

@Ejegg nope, that sounds really high. We have a total suppression list of about 2M; 426k would be roughly a 20% increase in our total suppression list.

Ejegg added a comment. (Edited) Dec 2 2019, 2:06 PM

@CCogdill_WMF those import screenshots show it was 3.27M on Nov 21st and 3.69M on Dec 1st. I'm sure it jumped quite a bit when we started putting people with latest_optin_response=NO on the suppression list (Oct 31st).

Ah, there must be duplicate rows on that import or something. We show 3,065,196 in IBM now. Higher than when I last checked :)

The optin_response change makes sense. Just wanted to flag that it wasn't a normal increase for 10 days of sending.

Ejegg added a comment. Dec 2 2019, 2:18 PM

OK, so the new issue is not missing donors, but an unexpected number of donors marked as unsubscribed in the database. Maybe we should open a new ticket so as not to muddy the issue?
I wonder if these are related to the jobs @Eileenmcnaughton has been running over the past couple of weeks to catch up on importing missing events (including unsubscribes) FROM Silverpop.

I can open a new Phab task. Just confirming: contacts on the MSL should not be included in the _all_wikimedia database contact source update job, is that correct? So when there's an increase in the Master Suppression List, a corresponding decrease in the main database is expected?

Ejegg added a comment. Dec 2 2019, 5:00 PM

Yep, we only export a given email address to one list or the other, never both.
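(In other words, the export partitions contacts roughly as in this sketch; the "email" field name is hypothetical.)

```python
# Sketch: every contact lands on exactly one list, never both.
def partition(contacts, suppressed_emails):
    main_list, msl = [], []
    for contact in contacts:
        if contact["email"].lower() in suppressed_emails:
            msl.append(contact)
        else:
            main_list.append(contact)
    return main_list, msl
```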

KHaggard added a comment. (Edited) Dec 2 2019, 5:44 PM

Thanks! New ticket is here: T239635

We are purging the MSL contacts from the _all_wikimedia database to see whether that affects tomorrow's totals, or whether it will look the same and reimport the purged contacts. I'll comment on the jobs tomorrow morning in the new ticket.

DStrine closed this task as Resolved. Dec 10 2019, 8:57 PM