My initial plan is to run it just one at a time, but depending on how long it is really taking we might need to investigate a more multithreaded approach. Alternatively we might decide it's ticking along fine & will get there by the time we need it.
Description
Details
Status | Subtype | Assigned | Task |
---|---|---|---|
Resolved | | Eileenmcnaughton | T227513 Investigate: TargetSmart Civi import |
Resolved | | Eileenmcnaughton | T228715 Nurse love & iterate on our targetsmart batch job |
Event Timeline
@DStrine should we bump this out of the sprint? It's really the phab task I created to track things once we've started off the background task, so I would not expect to start it until next sprint.
Change 532475 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@master] Add ability to track multiple targetsmart jobs
Change 532475 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Add ability to track multiple targetsmart jobs
Change 532492 had a related patch set uploaded (by Eileen; owner: Eileen):
[wikimedia/fundraising/crm@deployment] Add ability to track multiple targetsmart jobs
Change 532492 merged by Eileen:
[wikimedia/fundraising/crm@deployment] Add ability to track multiple targetsmart jobs
I now have 8 jobs running, so all 8 csv files are importing. That puts us at around 80k rows per hour being imported, so about 2.5 days to finish, I think.
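The estimate above is just remaining rows divided by throughput. A minimal sketch of that arithmetic follows. The 80k per hour rate is from the comment; the `rows_remaining` value is an assumption, chosen only because it is the backlog that 2.5 days at that rate would imply, and the function name is hypothetical.

```python
def hours_remaining(rows_remaining: int, rows_per_hour: float) -> float:
    """Return the hours needed to import rows_remaining at a steady rate."""
    if rows_per_hour <= 0:
        raise ValueError("rows_per_hour must be positive")
    return rows_remaining / rows_per_hour


# Rate quoted in the comment above (all 8 jobs combined).
ROWS_PER_HOUR = 80_000
# ASSUMPTION: illustrative backlog, not a figure from the task.
rows_remaining = 4_800_000

hours = hours_remaining(rows_remaining, ROWS_PER_HOUR)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
```

Re-running this with the real remaining row count and a freshly measured rate would show whether the jobs are still on track.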
We have been hitting rows that require manual intervention, but at this stage there doesn't seem to be a better remedy than intervening & editing the tsv file once a row reveals itself as having poor data (often an escaping problem).
A smaller number of failures are related to contention.
A bit more than 50% of the data is in at the moment.
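One way to find some of the poorly escaped rows before the import trips over them is a pre-scan of the tsv file. The sketch below is not part of the actual import job: the function name and the two heuristics (wrong field count, odd number of double quotes in a field) are assumptions about what "poor data" looks like, and it will not catch every case.

```python
import csv
import io


def find_suspect_rows(lines, delimiter="\t"):
    """Return the line numbers of rows that look badly escaped.

    `lines` is any iterable of text lines, such as an open file.
    Quotes are read literally (QUOTE_NONE) so that a stray quote cannot
    swallow the following lines while we scan.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    suspects = []
    for row in reader:
        wrong_width = len(row) != expected
        odd_quotes = any(field.count('"') % 2 for field in row)
        if wrong_width or odd_quotes:
            suspects.append(reader.line_num)
    return suspects


# Hypothetical sample data, not taken from the real files.
sample = (
    "id\tfirst_name\tlast_name\n"
    "1\tAnn\tLee\n"
    '2\tWilliam "Bill\tSmith\n'
    "3\tBob\n"
)
print(find_suspect_rows(io.StringIO(sample)))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` sample.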
Here is a list of the files we are currently importing
File | Rows | Current row | ID of last contact in the file | Has finished? |
---|---|---|---|---|
targetsmart1.tsv | 508067 | 265000 | 13301238 | yes |
targetsmart2.tsv | 538865 | 237000 | 680546 | yes |
targetsmart3.tsv | 560716 | 208992 | 15797606 | yes |
targetsmart4.tsv | 555166 | 220871 | 10347071 | no |
targetsmart5.tsv | 558470 | 234300 | 18092693 | yes |
targetsmart6.tsv | 553819 | 216977 | 24462112 | no |
targetsmart7.tsv | 561033 | 208240 | 13642012 | yes |
targetsmart8.tsv | 555835 | 222200 | 4643306 | yes |
Note that targetsmart1.tsv is actually from the targetsmart9 zip - the actual targetsmart1 zip is already fully imported.
Also note I'm seeing Smarty template cruft clog up the templates_c dir - I think we just bear with it & clear caches at the end. I cleared them just now & it caused a bit of a server hang while it processed.
Update on where the various jobs are at:
[targetsmart_progress1] => 436000
[targetsmart_progress2] => 366000
[targetsmart_progress3] => 373247
[targetsmart_progress4] => 389091
[targetsmart_progress5] => 402500
[targetsmart_progress6] => 377977
[targetsmart_progress7] => 358240
[targetsmart_progress8] => 386410
I think by the end of my day I should be able to start turning them off.
I just checked the last contact in each file (since the csv rows & wc rows don't quite add up) and all but 2 files have finished: 4 & 6 (I updated the grid above too).
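One common reason for a csv row count and a `wc -l` line count to disagree is a quoted field that contains a newline: `wc` counts physical lines, while a csv parser counts records. Whether that is the cause here is an assumption; the sketch below only illustrates the effect and shows how the last record could be read to check the final contact. The function name and sample data are hypothetical.

```python
import csv
import io


def count_lines_and_records(text, delimiter="\t"):
    """Return (physical line count, csv record count, last record)."""
    physical_lines = text.count("\n")  # what `wc -l` would report
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    records = 0
    last_record = None
    for row in reader:
        records += 1
        last_record = row
    return physical_lines, records, last_record


# Hypothetical sample: the second record spans two physical lines.
sample = (
    "id\tnote\n"
    '1\t"line one\nline two"\n'
    "2\tplain\n"
)
lines, records, last = count_lines_and_records(sample)
print(lines, records, last)
```

Comparing the ID in `last` with the last contact actually imported is then a more reliable "has it finished?" check than comparing counts.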
[targetsmart_progress4] => 525091 [targetsmart_progress6] => 513977
I would say dozens of rows needed manual fixes in the end. A bunch of them were like first name = 'William "Bill"', & for some I moved 'Bill' to the nickname field when I dug into them.
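The 'William "Bill"' case could in principle be handled automatically rather than by hand. This is a minimal sketch of that idea, assuming the nickname always appears as a trailing double-quoted word; the function name is hypothetical and this is not what the import code actually does.

```python
import re

# A first name followed by a trailing double-quoted nickname.
NICKNAME_PATTERN = re.compile(r'^(?P<first>.+?)\s*"(?P<nick>[^"]+)"\s*$')


def split_nickname(first_name):
    """Split 'William "Bill"' into ('William', 'Bill').

    Returns (first_name, None) when no quoted nickname is present.
    """
    cleaned = first_name.strip()
    match = NICKNAME_PATTERN.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("first"), match.group("nick").strip()


print(split_nickname('William "Bill"'))
print(split_nickname("Ann"))
```

Values with an unbalanced quote do not match the pattern and are returned unchanged, so they would still need a manual look.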