The canonical country dataset is stored as a CSV file; whenever it changes, it needs to be reloaded into the Data Lake. So far, we have used Wmfdata-Python's hive.load_csv function for this.
However, a recent change introduced a value containing a comma ("Bonaire, Sint Eustatius, and Saba"). By default, Pandas handles this by surrounding the field in double quotes. This is in line with RFC 4180, which is the only formal specification for CSV and therefore worth following.
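Pandas is not needed to see this behavior; Python's standard-library csv module follows the same RFC 4180 convention by default, so a minimal illustration (writing to an in-memory buffer, with illustrative field values):

```python
import csv
import io

# With the default quoting mode (QUOTE_MINIMAL), only fields that
# contain the delimiter (or quotes/newlines) get wrapped in double
# quotes, as RFC 4180 prescribes.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Bonaire, Sint Eustatius, and Saba", "BQ", "false"])

print(buffer.getvalue())
# "Bonaire, Sint Eustatius, and Saba",BQ,false
```

The comma-containing field comes out double-quoted, which is exactly the form Hive's ROW FORMAT DELIMITED tables cannot parse.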
However, Hive cannot handle this properly: ROW FORMAT DELIMITED supports only escaping with a specified character, not quoting. There is a CSV SerDe (OpenCSVSerde) which can handle quoting, but it treats all fields as strings, which does not work with (for example) the new boolean is_eu field.
The best long-term solution would be to rebuild the CSV loading functionality on top of PySpark's CSV support, which handles RFC 4180-style quoting. This would pair well with finishing the Parquet loading patch, since both require the same machinery for loading data into HDFS.
As a short-term workaround, it is possible to write CSV files that Hive can load by skipping quoting in favor of escaping. In Pandas, for example: df.to_csv("df.csv", quoting=csv.QUOTE_NONE, escapechar="\\") (the QUOTE_NONE constant comes from the standard library's csv module). The table's ROW FORMAT DELIMITED clause must then declare the same escape character with ESCAPED BY.
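As a sketch of what those settings produce (shown here with the standard-library csv module so it runs without Pandas; to_csv with the same quoting and escapechar arguments writes the same form):

```python
import csv
import io

# QUOTE_NONE disables quoting entirely; embedded delimiters are
# instead prefixed with the escape character, which a Hive table
# declared with ROW FORMAT DELIMITED ... ESCAPED BY '\\' can read.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, escapechar="\\")
writer.writerow(["Bonaire, Sint Eustatius, and Saba", "BQ", "false"])

print(buffer.getvalue())
# Bonaire\, Sint Eustatius\, and Saba,BQ,false
```

Note that with QUOTE_NONE, the csv writer raises an error if a field contains the delimiter and no escapechar is set, so the two options must be used together.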