Page MenuHomePhabricator

Excel does not recognize Quarry CSV output as UTF-8
Closed, ResolvedPublic

Description

Please set CVS download as unicode
for example see
http://quarry.wmflabs.org/query/1105
and compare the results on the web output and CVS downloads file

Event Timeline

Yamaha5 raised the priority of this task from to Needs Triage.
Yamaha5 updated the task description. (Show Details)
Yamaha5 added a project: Tool-Database-Queries.
Yamaha5 changed Security from none to None.
Yamaha5 added a subscriber: Yamaha5.

Looks fine to me, can you re-check? I fixed a few Unicode issues in the past, this might have been fixed alongside.

Now I get query and download the cvs file and the result is
روز_آزادی_(آفریقای_جنوبی)
روز_آزادی_مستندات
روز_آزادی_نرم‌افزار
روز_آفریقا
روز_آنزاک
روز_آنلاین
روز_استقلال
روز_استقلال_(هند)
روز_استقلال_ایالات_متحده_آمریکا
روز_افتتاحیه_کلوزآپ
روز_اژدها
روز_باران_(فیلم)

and it should be

روز_آزادی_(آفریقای_جنوبی)
روز_آزادی_مستندات
روز_آزادی_نرم‌افزار
روز_آفریقا
روز_آنزاک
روز_آنلاین
روز_استقلال
روز_استقلال_(هند)
روز_استقلال_ایالات_متحده_آمریکا
روز_افتتاحیه_کلوزآپ
روز_اژدها
روز_باران_(فیلم)

How are you opening the CSV file? I'm getting the latter when I open a CSV file.

CSV stores no encoding information, so you should tell Excel the file is UTF-8 when opening it. This seems to be non-trivial (see https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically )

One option might be to add .xlsx output, which probably supports non-MBCS encodings.

yuvipanda renamed this task from Query tool's CVS download is not unicode to Excel does not recognize Quarry CSV output as UTF-8.Dec 2 2014, 10:35 AM

Saving as UTF-8, with 'BOM', with a .txt extension also seems to work (but not with a .csv extension)

Hmm, so perhaps we just need to output a BOM. But that makes me feel all kinds of weird for UTF-8, for unknown reasons :)

After some more testing, Excel seems to understand the CSV if it's in UTF-16le instead of utf-8. That should be easy to add as an extra option (one could call it 'UTF-16 CSV' or 'Excel' ;-))

Change 217962 had a related patch set uploaded (by Merlijn van Deen):
Add UTF-16 CSV output option

https://gerrit.wikimedia.org/r/217962

Change 217962 merged by jenkins-bot:
Add UTF-16 CSV output option

https://gerrit.wikimedia.org/r/217962

UTF-8 with BOM also works fine, and should be easier to implement.

So a patch for UTF-16 was merged, is Excel happy now?

No, the 'UTF-16' seems to actually be UTF-8...

Btw, separate from the encoding... what Excel considers to be a "CSV" actually depends on language settings in Windows. For example if you use a German Windows, the delimiter character by default will be a ";" and not a "," , So technically it would then be "semicolon seperated values" ("SSV"). Then if you double click on it Excel just opens it.. otherwise you have to manually specify which character is the delimeter and "import" the file in Excel or change it in Windows settings itself.

One option might be to add .xlsx output, which probably supports non-MBCS encodings.

+1

I cannot think of a clean way to hack a BOM into the response, or cleanly set the response encoding in the generator. Will add a https://github.com/jmcnamara/XlsxWriter-based xlsx format output.

Wierd. There is no way to append three characters string to the file at start?

Wierd. There is no way to append three characters string to the file at start?

The response is constructed from a generator that's unaware of outside parameters (encoding / BOM), and once the response is constructed, I'm unaware of a way to modify it (for some code that's aware of them).

And you can't catch the file after the generator? Curiouser and curiouser.

And regarding why not unconditionally add BOM:

Hmm, so perhaps we just need to output a BOM. But that makes me feel all kinds of weird for UTF-8, for unknown reasons :)

... because BOM doesn't make sense for UTF-8 encodings (the UTF-8 byte-order is constant, unlike in UTF-16). There may be csv parsers that does not expect a BOM and may have unexpected outputs when BOM is included (BOM in first header); it's usually Microsoft programs that wouldn't function without BOM.

Also considering T76126#2327571 & T76126#800965, why bother to add such a feature if it won't work consistently?

So this why I asked you to add this format, not replace. For me it always worked, so you don't remove anything, there will be at least the same functionality. But you are the boss.

Change 367331 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[analytics/quarry/web@master] output: Add xlsx with the support of xlsxwriter

https://gerrit.wikimedia.org/r/367331

^ Will test, right now I'm downloading the vagrant box.

Test with SELECT "首页" AS page_title; and latest patch I got:

Looks nice.
I don't know why, but it is opened by 7-Zip, not Excel. I hope it's phabticator problem, not Quarry's.

Mentioned in SAL (#wikimedia-cloud) [2017-07-30T20:10:23Z] <zhuyifei1999_> Yuvi gave me access after I asked about T76126

Change 368597 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[operations/puppet@production] Quarry: Add package 'python-xlsxwriter'

https://gerrit.wikimedia.org/r/368597

Change 367331 merged by jenkins-bot:
[analytics/quarry/web@master] output: Add xlsx with the support of xlsxwriter

https://gerrit.wikimedia.org/r/367331

Mentioned in SAL (#wikimedia-cloud) [2017-07-30T21:14:51Z] <zhuyifei1999_> sudo git fetch; sudo git checkout 172eb7e on /srv/quarry T76126

Mentioned in SAL (#wikimedia-cloud) [2017-07-30T21:17:07Z] <zhuyifei1999_> sudo service uwsgi restart T76126

Package python-xlsxwriter was hand installed on quarry-main-01 with sudo apt install python-xlsxwriter.

This should be now fixed; please reopen if you experience issues.

@zhuyifei1999, thank you very much! I have not excel today, so I asked somebody to check. He is absolutely happy.

Change 368597 merged by Rush:
[operations/puppet@production] Quarry: Add package 'python-xlsxwriter'

https://gerrit.wikimedia.org/r/368597