Coinbase import fails with database error
Closed, Resolved · Public · 4 Estimated Story Points

Description

See https://docs.google.com/document/d/1qVkza3ia-43WtqoRIpyn3wxrHxqThE5-71NbZ8tKbD8/edit

Unknown database error when running the import on the staging server.

Event Timeline

awight raised the priority of this task to High.
awight updated the task description.
awight added subscribers: Aklapper, atgo, awight and 3 others.

Looking in the logs, the problem was an insert into civicrm_address with a Swedish address that contained ä characters in the street and city. However, I can import a similar / anonymized record locally with no error. 'Describe civicrm_address' gives identical results locally and on staging.

It may be the db charset/collation. On production and on my local machine:

SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| latin1                   | latin1_swedish_ci    |
+--------------------------+----------------------+

but on staging:

SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| binary                   | binary               |
+--------------------------+----------------------+

Why production and my local machine have latin1_swedish_ci for collation is a mystery to me, but it seems to let us insert funky characters.

The binary charset should be more forgiving, if anything. Good to catch this bug before altering the production charset to binary, though.

My local box matches staging:

show create table civicrm_address;
.... DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Cannot reproduce the error yet.

Another theory: we're seeing a secondary error, thrown either while formatting the initial exception or during watchdog logging. check_plain is used in both cases.

Aargh, in fact, the first line of the screenshot is a warning:

Warning: htmlspecialchars(): Invalid multibyte sequence in argument in check_plain() (line 1567 of /X/drupal/includes/bootstrap.inc).

Another warning appears in the logs,

Cannot find state: [Stockholm] (country 1204)

Here's the error:

$backTrace = string(3521) "#0 /X/civicrm/CRM/Core/Error.php(897): CRM_Core_Error::backtrace("backTrace", TRUE)
#1 [internal function](): CRM_Core_Error::exceptionHandler(Object(DB_Error))
#2 /X/civicrm/packages/PEAR.php(931): call_user_func((Array:2), Object(DB_Error))
#3 /X/civicrm/packages/DB.php(976): PEAR_Error->PEAR_Error("DB Error: unknown error", -1, 16, (Array:2), "INSERT INTO civicrm_address (contact_id , location_type_id , is_primary , is_...") 
#4 /X/civicrm/packages/PEAR.php(564): DB_Error->DB_Error(-1, 16, (Array:2), "INSERT INTO civicrm_address (contact_id , location_type_id , is_primary , is_...") 
#5 /X/civicrm/packages/DB/common.php(1927): PEAR->raiseError(NULL, -1, NULL, NULL, "INSERT INTO civicrm_address (contact_id , location_type_id , is_primary , is_...", "DB_Error", TRUE)
#6 /X/civicrm/packages/DB/mysqli.php(929): DB_common->raiseError(-1, NULL, NULL, NULL, "1366 ** Incorrect string value: '\xCC_gen ...' for column 'street_address' at...")
#7 /X/civicrm/packages/DB/mysqli.php(403): DB_mysqli->mysqliRaiseError()
#8 /X/civicrm/packages/DB/common.php(1238): DB_mysqli->simpleQuery("INSERT INTO civicrm_address (contact_id , location_type_id , is_primary , is_...") 
#9 /X/civicrm/packages/DB/DataObject.php(2442): DB_common->query("INSERT INTO civicrm_address (contact_id , location_type_id , is_primary , is_...") 
#10 /X/civicrm/packages/DB/DataObject.php(1060): DB_DataObject->_query("INSERT INTO civicrm_address (contact_id , location_type_id , is_primary , is_...") 
#11 /X/civicrm/CRM/Core/DAO.php(450): DB_DataObject->insert() 
#12 /X/civicrm/CRM/Core/BAO/Address.php(165): CRM_Core_DAO->save()
#13 /X/sites/all/modules/wmf_civicrm/wmf_civicrm.module(1221): CRM_Core_BAO_Address::add((Array:13), FALSE)
#14 /X/sites/all/modules/wmf_civicrm/wmf_civicrm.module(1165): wmf_civicrm_message_address_update((Array:51), 11455211)
#15 /X/sites/all/modules/wmf_civicrm/wmf_civicrm.module(1151): wmf_civicrm_message_location_update((Array:51), (Array:5))
#16 /X/sites/all/modules/wmf_civicrm/wmf_civicrm.module(221): wmf_civicrm_message_location_insert((Array:51), (Array:5))
#17 /X/sites/all/modules/offline2civicrm/ChecksFile.php(79): wmf_civicrm_contribution_message_import((Array:51))
#18 /X/sites/all/modules/offline2civicrm/offline2civicrm.module(168): ChecksFile->import()
#19 /X/drupal/includes/form.inc(1513): offline2civicrm_import_checks_form_submit((Array:28), (Array:20))

I can reproduce the error using raw SQL: insert into civicrm_address set street_address = x'CC';
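For the curious, the rejected byte really is invalid UTF-8 on its own. Here's a quick Python sketch (outside the import code, mirroring the \xCC_gen fragment from the stack trace): 0xCC is a UTF-8 lead byte that promises one continuation byte in the 0x80-0xBF range, and '_' (0x5F) isn't one, which is why MySQL throws "Incorrect string value".

```python
# 0xCC is a UTF-8 lead byte expecting one continuation byte
# (0x80-0xBF); followed by '_' (0x5F) the sequence is invalid,
# which is why MySQL raises "Incorrect string value".
data = b"\xcc_gen"  # fragment from the stack trace

try:
    data.decode("utf-8")
    reason = None
except UnicodeDecodeError as e:
    reason = e.reason

print(reason)  # invalid continuation byte
```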

This does look like a charset encoding mismatch, as @Ejegg was speculating. On staging I ran "select @@character_set_connection;" via drush civicrm-sql-cli, and it came back binary. My local box says utf8, which sounds like a more correct setting.

This is getting kinda scary. Drupal includes code to always set the connection charset to utf8, so how can we be getting binary? That would lead to exactly the data corruption scenario we're most concerned about during the upgrade.

Err--nvm about that: the Drush command starts a separate mysql client, and its command-line params are only loosely based on the configuration and on drupal/database/mysql/database.inc.

In terms of priority: is this something to dig into more while the upgrade is running, i.e. not a blocker, but needing fairly immediate action?

Agreed! We might also find that it's only an issue on the staging server...

Eileenmcnaughton added a subscriber: LeanneS.

Leanne - are you able to put a copy of the csv you were trying to import on the file server and give me the location? From the current investigation, it seems to have had Swedish characters in the address.

@Eileenmcnaughton Of course. This is the file path: Fundraising - Tech - Major Gifts - Coinbase. Thanks!

@LeanneS
Coinbase gives us a CSV file, right? Or are you opening an XLS and resaving it before the import?

The root issue seems to be a bad file encoding that makes it impossible to read non-Latin letters. Do you see "Järfälla" as the street address for that Stockholm donor, or is it garbled for you?

@awight It comes as an encrypted csv. I then decrypt and resave for import. The address for the Swedish donor shows as garbled.

@LeanneS
Sort of great news! Please tell Coinbase to send us "UTF-8" encoded CSVs, and that should solve our problem.

awight lowered the priority of this task from High to Medium. Oct 26 2015, 10:20 PM
awight moved this task from Triage to Sprint +2 on the Fundraising-Backlog board.
awight moved this task from Sprint +2 to Blocked or not fr-tech on the Fundraising-Backlog board.

Sounds good! I'll make the request.

@awight I know the action item lies with us (MGF) now, but can we wait to deprioritize this until we hear back from Coinbase? My goal is for this to stay on both of our teams' radars until it's conclusively resolved.

awight raised the priority of this task from Medium to High. Oct 26 2015, 10:56 PM

Yes, actually it's best if your team manages the priorities! I had reduced the priority to make the other issues pop out, but realize now this wasn't very helpful.

Coinbase was able to make the change to UTF-8 encoded CSVs. There still appear to be some odd-looking characters in it, not just with Swedish donors but with others as well. @Eileenmcnaughton I can provide it on the server for reference.

@LeanneS: Please do upload to the fileshare when you get the chance! It sounds like tech should take a look at the new file before you import it, so we don't get garbled donor data.

@awight I've added this latest file to the same folder on the server.

The file from the 19th looks like it's valid UTF-8 too. The problematic character noted in the error message was encoded as C3A4, which is correct for an a-umlaut (ä). There's no byte order mark at the beginning of either file, which might be our issue.

@Ejegg
I see 0xcc5f for the umlaut in the CSV file, which is also what shows up in the stack trace. Where do you see 0xc3a4 in an error message? I don't know much about charset encoding, though... I'm probably missing something huge.

The new file has 0xccc7 where they probably meant "ë", plus some amazing unescaped HTML injection test attacks. An improvement, of sorts!

I'm using xxd and libreoffice.

@awight, I was using bless, but xxd gives the same result when I run it on Report_orders_10-19-2015.csv straight from the file server. (slightly redacted)

00001070: XXXX 20XX XX2c 2c4a c3a4 7266 c3a4 6c6c  XX XX,,J..rf..ll

Similarly, I'm seeing the ë in the 10/26 file dumped as c3ab. Are your file dates 10/20 14:45 and 10/28 19:32?
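As a sanity check (a throwaway Python snippet, not part of any import code), the byte pairs in those dumps are exactly the UTF-8 encodings of the letters we expect:

```python
# UTF-8 encodings matching the hex dumps above:
print("ä".encode("utf-8").hex())  # c3a4
print("ë".encode("utf-8").hex())  # c3ab

# And decoding the well-formed pair round-trips cleanly:
print(b"\xc3\xa4".decode("utf-8"))  # ä
```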

@Ejegg
What is going on...

90202e2afe451ff450cf63c28b720a2d  Report_orders_10-19-2015.csv
9a6608a6aee57d7f35838f90a969d249  Report (Orders) - 10-26-15.csv

Okaaaay, who's mutating our files?

4937917d7e4ed292d1fe066cdad96a47  Report_orders_10-19-2015.csv
0c7e6631295200b7ff3dd4bb5dd9645a  Report_Orders_10-26-2015.csv

Your files have different names... They must have re-exported the 10-19 file, and we're looking in different dirs?

I'm pulling from smb://filesrv1/fundraising/Tech/Major%20Gifts/Coinbase

Aha! I snagged 'em from Tech/CoinBase. Guess someone's resaving them into the MG folder with a UTF-unaware application. @LeanneS, it looks like the original downloads are OK to import, but some copies got corrupted. If you need to edit any of them before upload, make sure any special characters and accents show up correctly before and after saving.

@LeanneS
Try this ghastly workaround: http://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically#answer-6488070

One thing I'm still concerned about is that saving the file from Excel might leave it mangled again. We can help you check for this: just open a file using the UTF-8 hack above, then save a copy and upload it to the file server so we can check the new encoding.

Update: it turns out we don't have to open and resave the file in Excel at all. In the future, let it be known that Excel is just completely incapable of importing a UTF-8 CSV. Use LibreOffice instead if you need to edit one of these files.
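For anyone scripting around this later: Python's "utf-8-sig" codec writes the three-byte BOM (EF BB BF) that lets Excel auto-detect UTF-8, which is essentially what the StackOverflow hack does by hand. A minimal sketch, with made-up row contents for illustration:

```python
import csv
import io

# Write a CSV with a UTF-8 BOM so Excel auto-detects the
# encoding ("utf-8-sig" prepends EF BB BF to the stream).
buf = io.BytesIO()
text = io.TextIOWrapper(buf, encoding="utf-8-sig", newline="")
writer = csv.writer(text)
writer.writerow(["street_address", "city"])
writer.writerow(["Järfälla 1", "Stockholm"])  # hypothetical row
text.flush()

raw = buf.getvalue()
print(raw[:3])                # b'\xef\xbb\xbf'  (the BOM)
print(b"J\xc3\xa4rf" in raw)  # True
```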

@LeanneS can you document this in Collab and/or on your training document?

@awight Is this a problem we will encounter with any other imports or MGF data?

Very possible--the risk factors that combine to make this a problem are:

  • Batch file contains international characters.
  • File comes from the processor in CSV format.
  • Data needs to be edited by our staff.

@LeanneS can you look at the comment above and tell @awight and me if you foresee this being an issue with other imports? If so, let's look at changing our processes (re Excel vs. LibreOffice) to avoid this.

@CaitVirtue and @awight the only other import I could foresee this potentially being an issue with is the foreign check import, if there are international characters. Looking at the last few import files from Engage, there weren't any international characters, though. I can't remember ever seeing checks sent with international characters, so it might be a nonissue, but I would have to review more international checks to confirm.

Adam's short term workaround is fine for now, but is not a good long term solution. Can we table this until Q3?

Totally. For the record, the correct solution is to have Coinbase save in a CSV format that Excel can understand. I believe that would be UTF-16; we'll get into the details in Q3, though.
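A quick sketch on that UTF-16 point (whether Coinbase can actually export this way is an open question): a UTF-16 file starts with a byte-order mark, which is what lets Excel identify the encoding without guessing.

```python
import codecs

# UTF-16 little-endian bytes with the BOM Excel looks for.
payload = codecs.BOM_UTF16_LE + "Järfälla".encode("utf-16-le")

print(payload[:2])   # b'\xff\xfe'  (UTF-16 LE byte-order mark)
print(payload[2:4])  # b'J\x00'     ('J' in UTF-16 LE)
```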

@awight should we take this out of sprint and/or update it otherwise?

Work will continue in the subtask.