
Nurse love & iterate on our targetsmart batch job
Closed, Resolved | Public | 4 Estimated Story Points


My initial plan is to run it just one at a time, but depending on how long it is really taking we might need to investigate a more multithreaded approach. Alternatively we might decide it's ticking along fine & will get there by the time we need it.

Event Timeline

Eileenmcnaughton updated the task description.
Eileenmcnaughton set the point value for this task to 4.

@DStrine should we bump this out of the sprint? It's really the phab I created to track things once we've started off the background task, so I would not expect to start it until next sprint.

Change 532475 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Add ability to track multiple targetsmart jobs

Change 532475 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Add ability to track multiple targetsmart jobs

Change 532492 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@deployment] Add ability to track multiple targetsmart jobs

Change 532492 merged by Eileen:
[wikimedia/fundraising/crm@deployment] Add ability to track multiple targetsmart jobs

I now have 8 jobs running, so all 8 csv files are importing. That puts us at around 80k rows per hour being imported, so about 2.5 days I think.
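For reference, the arithmetic behind that estimate can be sketched as below. It only uses the two figures quoted in this comment (about 80k rows per hour across 8 jobs, about 2.5 days); the function names are made up for illustration, and a constant import rate is an assumption that the real jobs may not satisfy.

```python
# Rough ETA arithmetic for the import, assuming a constant combined rate.
ROWS_PER_HOUR = 80_000   # combined rate across all jobs, as quoted above
JOB_COUNT = 8            # number of jobs running in parallel
DAYS_ESTIMATE = 2.5      # estimate quoted above


def implied_rows(rows_per_hour: float, days: float) -> int:
    """Rows that a constant rate would import over the given number of days."""
    return round(rows_per_hour * 24 * days)


def hours_remaining(rows_left: int, rows_per_hour: float) -> float:
    """Hours needed to import rows_left at a constant rate."""
    return rows_left / rows_per_hour


rows_left = implied_rows(ROWS_PER_HOUR, DAYS_ESTIMATE)
per_job_rate = ROWS_PER_HOUR / JOB_COUNT

print(f"implied rows still to import: {rows_left}")
print(f"rows per hour per job: {per_job_rate:.0f}")
print(f"hours remaining: {hours_remaining(rows_left, ROWS_PER_HOUR):.0f}")
```

In other words, the 2.5 day figure corresponds to roughly 4.8 million rows still to import at the quoted rate, or about 10k rows per hour per job.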

We have been hitting rows that require manual intervention but at this stage there doesn't seem to be a remedy that is better than intervening & editing the tsv file once a row reveals itself as having poor data (often escaping).
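If it ever becomes worth pre-screening the files instead of waiting for a row to reveal itself, a very rough heuristic could look like the sketch below. It is not what the import job does: the function name, the assumption that the first line is a header, and the two checks (field count and unbalanced double quotes) are all illustrative guesses about what "poor data" might look like.

```python
def find_suspect_rows(lines, delimiter="\t"):
    """Return (line_number, reason) pairs for rows that look risky to import.

    Heuristic only: the first line is treated as the header, and a row is
    flagged if its field count differs from the header's or if it contains
    an odd number of double quotes.
    """
    suspects = []
    expected_fields = None
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        fields = row.split(delimiter)
        if expected_fields is None:
            expected_fields = len(fields)
            continue
        if len(fields) != expected_fields:
            suspects.append(
                (number, f"expected {expected_fields} fields, found {len(fields)}")
            )
        elif row.count('"') % 2 == 1:
            suspects.append((number, "unbalanced double quote"))
    return suspects


# Made-up sample data, not rows from the real files.
sample = [
    "id\tfirst_name\tlast_name\n",
    "1\tAnn\tLee\n",
    '2\tWilliam "Bill\tSmith\n',
    "3\tJo\n",
]
for line_number, reason in find_suspect_rows(sample):
    print(line_number, reason)
```

A check like this would only catch the crudest cases; rows with balanced but unexpected quoting would still need to be found by the import itself.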

A smaller number of failures relate to contention.

A bit more than 50% of the data is in at the moment

Here is a list of the files we are currently importing

File | rows | current row | ID of last contact in the file | has finished?

Note that targetsmart1.tsv is actually the targetsmart9 zip - the actual targetsmart 1 zip is fully imported

Also note I'm seeing Smarty template cruft clog up the templates_c dir. I think we just bear with it & clear caches at the end. I cleared them just now & it caused a bit of a server hang while it processed.

Update on where the various jobs are at

[targetsmart_progress1] => 436000
[targetsmart_progress2] => 366000
[targetsmart_progress3] => 373247
[targetsmart_progress4] => 389091
[targetsmart_progress5] => 402500
[targetsmart_progress6] => 377977
[targetsmart_progress7] => 358240
[targetsmart_progress8] => 386410
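To total up the numbers above, a small script along these lines would do. It assumes the progress values are pasted in exactly the format shown; the regular expression and function name are illustrative and not part of the import job.

```python
import re

# Matches lines such as "[targetsmart_progress3] => 373247".
PROGRESS_LINE = re.compile(r"\[targetsmart_progress(\d+)\]\s*=>\s*(\d+)")


def parse_progress(text):
    """Map job number to rows processed, from pasted progress output."""
    return {int(job): int(rows) for job, rows in PROGRESS_LINE.findall(text)}


pasted = """
[targetsmart_progress1] => 436000
[targetsmart_progress2] => 366000
[targetsmart_progress3] => 373247
[targetsmart_progress4] => 389091
[targetsmart_progress5] => 402500
[targetsmart_progress6] => 377977
[targetsmart_progress7] => 358240
[targetsmart_progress8] => 386410
"""

progress = parse_progress(pasted)
total = sum(progress.values())
slowest_job = min(progress, key=progress.get)

print(f"rows processed across {len(progress)} jobs: {total}")
print(f"job with the lowest counter: {slowest_job}")
```

For the values above this gives 3,089,465 rows across the 8 jobs, with job 7 showing the lowest counter. The counters are per file, so a lower number does not by itself mean that job is further from finishing.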

I think by the end of my day I should be able to start turning them off

I just checked the last contact in each file (since the csv rows & wc rows don't quite add up) and all but 2 files have finished - 4 & 6 (updated the grid above too)

[targetsmart_progress4] => 525091
[targetsmart_progress6] => 513977

Great news! Were there many rows with bad data that caused issues?

I would say dozens of rows in the end. A bunch of them were like first name = 'William "Bill"' & for some I moved 'Bill' to the nickname field when I dug into them

Ah got it. Well that's not TOO bad then.
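For the first name = 'William "Bill"' case mentioned above, the manual fix could be approximated by something like the following. This is a sketch only: the pattern, the function name, and the decision to leave a value untouched when nothing precedes the quoted part are assumptions, and the actual fixes were made by hand on a case by case basis.

```python
import re

# A first name followed by a double-quoted nickname, e.g. William "Bill".
QUOTED_NICKNAME = re.compile(r'^(?P<first>.*?)\s*"(?P<nick>[^"]+)"\s*$')


def split_nickname(first_name):
    """Return (first_name, nickname); nickname is None if no pattern matches."""
    match = QUOTED_NICKNAME.match(first_name)
    if not match or not match.group("first"):
        # Nothing to split, or only a quoted value with no name before it.
        return first_name, None
    return match.group("first"), match.group("nick")


print(split_nickname('William "Bill"'))
print(split_nickname("Ann"))
```

With dozens of rows rather than thousands, editing by hand was probably the cheaper option; a helper like this would mainly be useful if the same pattern turns up in future files.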