Page MenuHomePhabricator

Matching Gifts database data cleanup
Closed, ResolvedPublic

Description

The file that was pulled has a lot of bad unusable data in it.

We have some recommendations from our Trilogy rep to remove some columns altogether including the most problematic column (contains concatenated XML data that reads as bad data in Acoustic): the last column (H - subsidiaries).

There are also a lot of empty and blank rows (first 34 rows in the file and a lot more sprinkled in the middle).

Nora will be able to provide more information as to which columns are actually needed.

Event Timeline

From my understanding, the only usable columns we have that can be pulled into the email automation are:

column B: id (should be named employer_id)
column D: matching_gifts_provider_info_url
column E: minimum_gift_matched_usd
column F: name (should be named employer_name)

Other columns have data that contain broken links or blank cells, but I want to defer to Nora on if she still wants to keep the column or get it removed altogether since the automation won't be able to tell which links are broken and just display anything that is not_blank.

Just spoke with Nora on the Civi call - big green light to get rid of the subsidiaries column and _only_ the subsidiaries column. All the other columns are fine.

If possible, I'd still like if the column name for id --> employer_id and name --> employer_name for consistency (and to prevent confusion for future folks that might touch this campaign/import in Acoustic).

cc-ing @Ejegg

Thank you!

Change 623432 had a related patch set uploaded (by Ejegg; owner: Ejegg):
[wikimedia/fundraising/tools@master] Remove subsidiaries, rename 2 employer fields

https://gerrit.wikimedia.org/r/623432

@MNoorWMF if you have an example of a bad link handy, could you paste it here?

The patch in review will fix the two column names requested changed and will remove 'subsidiaries'.

Fixing the bad links and removing the blank lines will need another little fix.

@Ejegg sure, I've got a handful just from clicking randomly around the database:

Column G URLs that are bad links:
http://forms.matchinggifts.com/Arete Associates Matching Gift Form.pdf
http://forms.matchinggifts.com/Autozone_Matching Gift Form.pdf
https://www.easymatch.com/alexion/

Column A URLs that are bad links:
http://forms.matchinggifts.com/Allegis Group Matching Gift Guidelines.pdf
https://secure3.easymatch.com/alexiongive/Applications/MatchingGifts/default.aspx?skip=guideline
http://www.anz.com/about-us/corporate-responsibility/employees/attracting-retaining-talent/remuneration-benefits/

"bad link" reasons:
-spaces in URL leading to broken link (most common)
-page simply does not exist
-404 error

Change 623432 merged by jenkins-bot:
[wikimedia/fundraising/tools@master] Remove subsidiaries, rename 2 employer fields

https://gerrit.wikimedia.org/r/623432

Hi @MNoorWMF - So a lot of those 'bad links' are actually working. Spaces are valid characters in URLs, though they can confuse Phabricator's auto-linkifier.

http://forms.matchinggifts.com/Arete Associates Matching Gift Form.pdf
http://forms.matchinggifts.com/Autozone_Matching Gift Form.pdf

The alexion form redirecting to a 404 page seems to reflect that HEPdata is out-of-date for some companies. It's the same link at https://javamatch.matchinggifts.com/search/companyprofile/wikimedia_iframe/66157 .

We just deployed the column change today, so the export tomorrow morning will have the 'subsidiaries' column deleted and will have the ID and Name columns prefixed by employer_.

I'll keep looking at why we might have completely blank rows.

Thank you so much Elliott!

I'll take a look at the new dB tomorrow after the import and create the relational table for this then. cc @KHaggard :)

Hmm, looks like the blanks are actually stored in the DB rather than being introduced in the export. I can probably just delete those rows, then look at the process that pulls the info in from HEPdata to see about discarding them on the way in.

I went ahead and deleted the current blank rows. Here's the task to make sure new ones don't creep in: T264944

DStrine subscribed.

@MNoorWMF we think this is good to close for now. Could you confirm? We also have this as a follow on. https://phabricator.wikimedia.org/T264944 but in general the weird rows won't be available in the drop down in the UI so they won't affect outgoing emails.