
Wmfdata-Python's CSV loading cannot handle standard quoted CSV values
Closed, Declined · Public

Description

The canonical country dataset is stored as a CSV; after changes, it needs to be reloaded into the Data Lake. So far, we have used Wmfdata-Python's hive.load_csv function to do this.

However, after a recent change, the dataset has a value containing a comma (Bonaire, Sint Eustatius, and Saba). By default, Pandas handles this by surrounding the field in double quotes. This is in line with RFC 4180, which is the only formal specification for CSV and is well worth following.
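To make the behavior concrete, here is a minimal sketch using a hypothetical two-column miniature of the country dataset (the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the country dataset: one value contains a comma.
df = pd.DataFrame({
    "name": ["Bonaire, Sint Eustatius, and Saba"],
    "is_eu": [False],
})

# By default, pandas uses csv.QUOTE_MINIMAL: only fields containing the
# delimiter (or quotes/newlines) are wrapped in double quotes, per RFC 4180.
out = df.to_csv(index=False)
print(out)
# name,is_eu
# "Bonaire, Sint Eustatius, and Saba",False
```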

However, Hive cannot parse this properly. ROW FORMAT DELIMITED supports only escaping with a specified character, not quoting. There is a CSV SerDe (OpenCSVSerde) which can handle quoting, but it treats all fields as strings, which does not work with (for example) the new boolean is_eu field.
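The failure mode of delimiter-only parsing can be illustrated in a few lines of Python. Splitting on the delimiter (roughly what ROW FORMAT DELIMITED does) mangles the quoted field, while a quote-aware parser recovers it:

```python
import csv
import io

# An RFC 4180 line: the commas inside the quoted field are data, not delimiters.
line = '"Bonaire, Sint Eustatius, and Saba",false\n'

# Delimiter-only splitting, roughly analogous to ROW FORMAT DELIMITED:
naive = line.rstrip("\n").split(",")
print(naive)
# ['"Bonaire', ' Sint Eustatius', ' and Saba"', 'false']  -- 4 fields, all wrong

# A quote-aware parser recovers the intended 2 fields:
proper = next(csv.reader(io.StringIO(line)))
print(proper)
# ['Bonaire, Sint Eustatius, and Saba', 'false']
```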

The best long-term solution would be to rebuild the CSV loading functionality on top of PySpark's CSV support. This would pair well with finishing the Parquet loading patch, since both require the same machinery for loading data into HDFS.

As a short-term workaround, it is possible to write CSV files that Hive can load by skipping quoting in favor of escaping. In Pandas, for example: df.to_csv("df.csv", quoting=csv.QUOTE_NONE, escapechar="\\").
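Expanded into a runnable sketch (same hypothetical two-column dataset as above), the workaround looks like this. QUOTE_NONE disables quoting entirely, so the embedded commas are backslash-escaped instead, which ROW FORMAT DELIMITED can be told about via ESCAPED BY:

```python
import csv

import pandas as pd

df = pd.DataFrame({
    "name": ["Bonaire, Sint Eustatius, and Saba"],
    "is_eu": [False],
})

# No quoting; commas inside values are escaped with a backslash instead.
df.to_csv("df.csv", index=False, quoting=csv.QUOTE_NONE, escapechar="\\")

with open("df.csv") as f:
    print(f.read())
# name,is_eu
# Bonaire\, Sint Eustatius\, and Saba,False
```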

Event Timeline

nshahquinn-wmf renamed this task from Wmfdata-Python's CSV loading cannot handle standard quoted CSV fields to Wmfdata-Python's CSV loading cannot handle standard quoted CSV values.Jan 27 2023, 12:26 AM
mpopov moved this task from Triage to Backlog on the Product-Analytics board.

As @mforns reminded me, another workaround for this is to simply use TSV instead of CSV.
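A sketch of the TSV variant, again with the hypothetical miniature dataset: with a tab delimiter, commas in the data are no longer special, so pandas has no reason to quote anything, and FIELDS TERMINATED BY '\t' can parse the file directly:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bonaire, Sint Eustatius, and Saba"],
    "is_eu": [False],
})

# Tab-delimited output: no field contains a tab, so nothing gets quoted.
df.to_csv("df.tsv", sep="\t", index=False)

with open("df.tsv") as f:
    print(f.read())
# name	is_eu
# Bonaire, Sint Eustatius, and Saba	False
```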

We will soon deprecate the whole Hive module (T384541), so there's no point fixing this.