
Nurse love & iterate on our targetsmart batch job
Closed, Resolved | Public | 4 Estimated Story Points

Description

My initial plan is to run it just one at a time, but depending on how long it is really taking we might need to investigate a more multithreaded approach. Alternatively we might decide it's ticking along fine & will get there by the time we need it.

Event Timeline

Eileenmcnaughton updated the task description.
Eileenmcnaughton set the point value for this task to 4.

@DStrine should we bump this out of the sprint? It's really the phab I created to track things once we've started off the background task, so I would not expect to start it until next sprint.

Change 532475 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Add ability to track multiple targetsmart jobs

https://gerrit.wikimedia.org/r/532475

Change 532475 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Add ability to track multiple targetsmart jobs

https://gerrit.wikimedia.org/r/532475

Change 532492 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@deployment] Add ability to track multiple targetsmart jobs

https://gerrit.wikimedia.org/r/532492

Change 532492 merged by Eileen:
[wikimedia/fundraising/crm@deployment] Add ability to track multiple targetsmart jobs

https://gerrit.wikimedia.org/r/532492

I now have 8 jobs running, so all 8 csv files are importing. That puts us at around 80k rows per hour being imported. About 2.5 days, I think.

We have been hitting rows that require manual intervention, but at this stage there doesn't seem to be a better remedy than intervening & editing the tsv file once a row reveals itself as having poor data (often an escaping problem).

A smaller number of failures were related to contention.
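
One way to surface problem rows before a job trips on them would be to pre-scan each file. This is only a minimal sketch, assuming tab-delimited files with a header row. The function name and the two heuristics (wrong field count, embedded double quote) are illustrative and are not what the import job itself does.

```python
import io


def find_suspect_rows(lines, delimiter="\t"):
    """Yield (line_number, reason) for rows that look likely to need a manual fix.

    Assumptions (not from this task): the first line is a header, and a
    well-formed row has the same number of fields as the header.
    """
    header = None
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        if header is None:
            header = fields
            continue
        if len(fields) != len(header):
            yield line_number, f"expected {len(header)} fields, got {len(fields)}"
        elif any('"' in field for field in fields):
            yield line_number, "embedded double quote"


# Made-up sample data, not taken from the real files.
sample = io.StringIO(
    'id\tfirst_name\tlast_name\n'
    '1\tAnn\tLee\n'
    '2\tWilliam "Bill"\tSmith\n'
    '3\tJo\tDoe\textra\n'
)
suspects = list(find_suspect_rows(sample))
for line_number, reason in suspects:
    print(f"line {line_number}: {reason}")
```

A scan like this only flags candidates; each flagged row would still need a human decision about how to fix it.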

A bit more than 50% of the data is in at the moment

Here is a list of the files we are currently importing

File              rows    current row  ID of last contact in the file  has finished?
targetsmart1.tsv  508067  265000       13301238                        yes
targetsmart2.tsv  538865  237000       680546                          yes
targetsmart3.tsv  560716  208992       15797606                        yes
targetsmart4.tsv  555166  220871       10347071                        no
targetsmart5.tsv  558470  234300       18092693                        yes
targetsmart6.tsv  553819  216977       24462112                        no
targetsmart7.tsv  561033  208240       13642012                        yes
targetsmart8.tsv  555835  222200       4643306                         yes

Note that targetsmart1.tsv is actually the targetsmart9 zip - the actual targetsmart 1 zip is fully imported
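
For a rough idea of how much work is left in these 8 files, the figures in the file list above can be combined with the 80k per hour rate mentioned earlier. This is a back-of-the-envelope sketch only: it assumes each entry reads as a six-digit row count followed by a six-digit current row, that the rate stays constant, and that the row counts are accurate.

```python
# (rows, current row) per file, read from the file list above.
FILES = {
    "targetsmart1.tsv": (508067, 265000),
    "targetsmart2.tsv": (538865, 237000),
    "targetsmart3.tsv": (560716, 208992),
    "targetsmart4.tsv": (555166, 220871),
    "targetsmart5.tsv": (558470, 234300),
    "targetsmart6.tsv": (553819, 216977),
    "targetsmart7.tsv": (561033, 208240),
    "targetsmart8.tsv": (555835, 222200),
}
ROWS_PER_HOUR = 80_000  # combined rate across all 8 jobs, as quoted in this task

total_rows = sum(rows for rows, _ in FILES.values())
imported = sum(current for _, current in FILES.values())
remaining = total_rows - imported
hours_left = remaining / ROWS_PER_HOUR

print(f"{imported:,} of {total_rows:,} rows imported in these files")
print(f"{remaining:,} rows left, roughly {hours_left:.0f} hours at {ROWS_PER_HOUR:,} rows/hour")
```

With these inputs the remaining work comes to a little over 32 hours, but manual interventions and contention failures would stretch that.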

Also note I'm seeing Smarty template cruft clog up the templates_c dir. I think we just bear with it & clear caches at the end. I cleared them just now & it caused a bit of a server hang while it processed.

Update on where the various jobs are at

[targetsmart_progress1] => 436000
[targetsmart_progress2] => 366000
[targetsmart_progress3] => 373247
[targetsmart_progress4] => 389091
[targetsmart_progress5] => 402500
[targetsmart_progress6] => 377977
[targetsmart_progress7] => 358240
[targetsmart_progress8] => 386410

I think by the end of my day I should be able to start turning them off

I just checked the last contact in each file (since the csv rows & wc rows don't quite add up) and all but 2 files (4 & 6) have finished. I updated the grid above too.

[targetsmart_progress4] => 525091
[targetsmart_progress6] => 513977
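
A check like this can be scripted. The sketch below compares the physical line count (what `wc -l` reports) with the number of parsed records and returns the ID on the last record. One possible reason for a mismatch is quoted fields containing newlines, though this task does not say that is the cause here. The column name `contact_id` and the sample data are hypothetical.

```python
import csv
import io


def summarize_tsv(text, id_column="contact_id"):
    """Return (physical_lines, parsed_records, last_id) for TSV text.

    `contact_id` is a hypothetical column name used for illustration.
    """
    physical_lines = text.count("\n")  # what `wc -l` would report, header included
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    parsed_records = 0
    last_id = None
    for record in reader:
        parsed_records += 1
        last_id = record[id_column]
    return physical_lines, parsed_records, last_id


# Made-up sample: the second record has a newline inside a quoted field.
sample = (
    "contact_id\taddress\n"
    "101\t1 Main St\n"
    '102\t"Flat 2\n9 High St"\n'
    "103\t5 Oak Ave\n"
)
lines, records, last_id = summarize_tsv(sample)
print(f"wc-style lines: {lines}, parsed records: {records}, last id: {last_id}")
```

Comparing the last ID in the file with the last contact the job imported gives a finish check that does not depend on the two counts agreeing.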

Great news! Were there many rows with bad data that caused issues?

I would say dozens of rows in the end. A bunch of them were like first name = 'William "Bill"' & for some I moved 'Bill' to the nickname field when I dug into them
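
For the quoted-nickname pattern specifically, a small helper could propose the split before the row reaches the import. This is a hedged sketch, not what was actually done here (those rows were fixed by hand); the function name and regex are assumptions.

```python
import re

# Matches a first name followed by a nickname in double quotes, e.g. William "Bill".
QUOTED_NICKNAME = re.compile(r'^\s*(?P<first>[^"]+?)\s*"(?P<nick>[^"]+)"\s*$')


def split_nickname(first_name):
    """Return (first_name, nickname); nickname is None when no quoted part is found."""
    match = QUOTED_NICKNAME.match(first_name)
    if match is None:
        return first_name, None
    return match.group("first"), match.group("nick")


print(split_nickname('William "Bill"'))
print(split_nickname("Ann"))
```

Names that do not fit the pattern are returned unchanged, so anything unusual still falls through to manual review.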

Ah got it. Well that's not TOO bad then.