Query results are downloaded in wrong encoding
Closed, Resolved · Public

Description

The menu "Download" of the Wikidata Query Service (WDQS) UI lets users export the results of their queries... in an unknown encoding.

This encoding should be UTF-8.

Until a few days ago, the export was in UTF-8.

wdqs-download.png (342×324 px, 22 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.

It doesn't appear to be valid Windows-1252 either.

Capture du 2017-05-19 22-57-49.png (702×869 px, 102 KB)

Sample query

abian renamed this task from Use UTF-8, not Windows-1252, for files with query results from the WDQS to Use UTF-8 for files with query results from the WDQS. May 19 2017, 9:43 PM
abian updated the task description.

I've tried several encodings, including all the ISO-8859 variants (ISO-8859-1 through ISO-8859-15), but none seems to match the encoding used...
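A minimal sketch of that kind of trial-and-error check, in Python. The helper name `decode_with_candidates` and the sample bytes are illustrative assumptions, not something taken from this task:

```python
def decode_with_candidates(data, encodings):
    """Return {encoding: decoded text, or None if the bytes are invalid in it}."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = data.decode(encoding)  # strict decoding
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Hypothetical sample: the UTF-8 bytes of "é".
sample = "é".encode("utf-8")
candidates = ["utf-8", "cp1252", "iso-8859-1", "iso-8859-15"]
for encoding, text in decode_with_candidates(sample, candidates).items():
    print(encoding, repr(text))
```

Note that single-byte encodings such as ISO-8859-1 accept every byte value, so a decode that merely succeeds proves nothing; the decoded text has to be compared with what the query was expected to return.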

@abian could you please add:

  • the query that you were running
  • the browser (type and version) that you were using
  • which of the download options you chose
Smalyshev triaged this task as Medium priority. May 20 2017, 6:40 PM

@Smalyshev in my case, I tried several queries on Ubuntu and Windows, with Chrome and Firefox, and with every option (JSON, TSV, CSV, verbose or not), always opening the file with LibreOffice Calc (version 5.1.6.2). The problem is always the same.

The original query (on Wikidata:Bistro) was this one (instance of family name with writing system Latin script)

I can reproduce it with the link Ash_Crow shared. I ran that query and chose Download -> CSV in the menu. I've attached the resulting file. When I open the file in Notepad++, it already looks strange, with line breaks within values.
I'm using Firefox 53.0.2 (64-bit) on Windows 10. Doing the same in Chrome 58.0.3029.110 (64-bit) on the same computer gives the same result: values with line breaks in them.

None of the provided formats (verbose or non-verbose JSON file, verbose or non-verbose TSV file, CSV file) is correct. This problem doesn't seem to depend on the web browser or the operating system.

You can also use this query for testing. You should be able to download and properly read all the test characters from the results using UTF-8 encoding and without any line break between them.

However, these are my current results.
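The two expectations stated above (readable as UTF-8, no line break inside a value) can be checked mechanically on a downloaded TSV file. The following Python sketch is only an illustration; the function name `check_tsv_bytes` and the sample rows are assumptions, not part of the WDQS code:

```python
def check_tsv_bytes(data, expected_columns):
    """Return a list of human-readable problems found in a downloaded TSV file."""
    try:
        text = data.decode("utf-8")  # strict: raises on any invalid byte sequence
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8: {} at byte {}".format(exc.reason, exc.start)]

    problems = []
    for number, row in enumerate(text.split("\n"), start=1):
        if row.endswith("\r"):
            row = row[:-1]  # tolerate CRLF line endings
        if row == "":
            continue  # tolerate blank lines, e.g. after a trailing newline
        if "\r" in row:
            problems.append("row {}: carriage return inside a value".format(number))
        if row.count("\t") + 1 != expected_columns:
            problems.append("row {}: expected {} columns".format(number, expected_columns))
    return problems


# Hypothetical two-column file that is correctly encoded.
good = "item\tlabel\nQ3161723\tJan Mlčoch\n".encode("utf-8")
print(check_tsv_bytes(good, expected_columns=2))  # prints []
```

A value split by a stray line break shows up as a row with too few columns, and a file written in a single-byte encoding usually fails the strict UTF-8 decode as soon as it contains an accented character.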

Esc3300 renamed this task from Use UTF-8 for files with query results from the WDQS to [bug] Use UTF-8 for files with query results from the WDQS. May 21 2017, 4:26 AM
Esc3300 updated the task description.
Smalyshev renamed this task from [bug] Use UTF-8 for files with query results from the WDQS to Query results are downloaded in wrong encoding. May 21 2017, 7:05 AM

I suspect the wrong version of download.js was deployed in the last GUI deployment. I'll redeploy the GUI and see if that fixes things.

Mentioned in SAL (#wikimedia-operations) [2017-05-21T09:06:45Z] <smalyshev@tin> Started deploy [wdqs/wdqs@227ab25]: Redeploy GUI due to breakage in T165228

Mentioned in SAL (#wikimedia-operations) [2017-05-21T09:07:04Z] <smalyshev@tin> Finished deploy [wdqs/wdqs@227ab25]: Redeploy GUI due to breakage in T165228 (duration: 00m 19s)

Smalyshev claimed this task.

Seems to be fixed now after the redeploy. Please reload the GUI (clear the cache, etc.) and try again. If it still happens, please reopen.

I keep getting the same results on every computer. Should we wait a few hours or days?

Mentioned in SAL (#wikimedia-operations) [2017-05-22T06:00:29Z] <smalyshev@tin> Started deploy [wdqs/wdqs@e4301da]: Redeploy GUI due to breakage in T165228

Mentioned in SAL (#wikimedia-operations) [2017-05-22T06:02:19Z] <smalyshev@tin> Finished deploy [wdqs/wdqs@e4301da]: Redeploy GUI due to breakage in T165228 (duration: 01m 50s)

Definitely resolved. Thank you.

It seems that this bug is back... Or a very similar one at least (maybe not the same cause but clearly the same effect).

Today I ran this query: http://tinyurl.com/yad7ah6w and the result is apparently not UTF-8.

Looks like there was some breakage in download.js between 1.4.4 (which worked) and 1.4.7 (which doesn't). I'll try to figure out where it broke and downgrade the build to a known working version.
1.4.4 seems to work fine, 1.4.6 is broken.
Reported it as: https://github.com/rndme/download/issues/56

Change 364832 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui@master] Fix downloader.js to 1.4.4 to resolve bad non-ASCII downloads

https://gerrit.wikimedia.org/r/364832

Change 364832 merged by jenkins-bot:
[wikidata/query/gui@master] Fix downloader.js to 1.4.4 to resolve bad non-ASCII downloads

https://gerrit.wikimedia.org/r/364832

Smalyshev raised the priority of this task from Medium to High. Jul 13 2017, 12:36 AM

Mentioned in SAL (#wikimedia-operations) [2017-07-13T19:59:16Z] <smalyshev@tin> Started deploy [wdqs/wdqs@a32dbeb]: Redeploy GUI due to breakage in T165228

Mentioned in SAL (#wikimedia-operations) [2017-07-13T20:01:35Z] <smalyshev@tin> Finished deploy [wdqs/wdqs@a32dbeb]: Redeploy GUI due to breakage in T165228 (duration: 02m 19s)

AJF subscribed.

Results are downloading in a non-UTF-8 encoding. Please see the output of this example query.

@Lucas_Werkmeister_WMDE TSV download format. Opened in Notepad++, the encoding is reported as ANSI, whereas it has always been UTF-8. Some of the encoding issues introduce new lines into the download, corrupting both the entities and the file. For example (sample taken from the example query linked previously):

http://www.wikidata.org/entity/Q3157864 Jacques-Antoine-Marie Lemoine 3 Jacques-antoine-marie lemoine
http://www.wikidata.org/entity/Q3157864 Jacques-Antoine-Marie Lemoine 3 Jacques-Antoine-Marie Lemoyne
http://www.wikidata.org/entity/Q3161723 Jan Ml
och 3 Jan Mlcoch
http://www.wikidata.org/entity/Q1964408 Nan Hoover 6 Nancy Dodge Browne

This is even prior to converting back to UTF-8.
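The split "Jan Ml" / "och" row is consistent with one specific failure mode, offered here as a hypothesis that this thread does not confirm: if each UTF-16 code unit of the text is reduced to its low byte when the file is written, then "č" (U+010D) becomes the byte 0x0D, a carriage return. The Python sketch below only demonstrates that arithmetic; `truncate_to_low_bytes` is a hypothetical helper, not code from download.js:

```python
def truncate_to_low_bytes(text):
    """Keep only the low 8 bits of each UTF-16 code unit (hypothetical failure mode)."""
    units = text.encode("utf-16-le")
    # Each code unit is two bytes, little-endian, so the low byte comes first.
    return units[0::2]


print(truncate_to_low_bytes("Jan Mlčoch"))  # prints b'Jan Ml\roch'
print("Jan Mlčoch".encode("utf-8"))         # prints b'Jan Ml\xc4\x8doch'
```

Under the same assumption, a character in the U+0080 to U+00FF range such as "é" would be written as the single byte 0xE9, which would be one possible reason for an editor to label the file as "ANSI".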

Smalyshev removed a project: Patch-For-Review.

This is still occurring. I just queried and downloaded (in Chrome, TSV) four different results. Two of the outputs were encoded in UTF-8, two in ANSI.

Example queries:

Output in ANSI
Output in UTF-8

Hmm, I think this may be related to the downloadjs version being bumped to 1.4.7 in 56f9d9aea62e2b4100ea3be3fd728c5fd2116082. I think 1.4.7 is buggy; see https://github.com/rndme/download/issues/56.

Change 395591 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui@master] DownloadJS back to 1.4.4

https://gerrit.wikimedia.org/r/395591

Hm, that might also be why T178564: SVG Image query result downloads use incorrect encoding still seems to be broken. I’ll try it out tomorrow.

Most likely. I think package.json allows comments, so I'll leave a comment there.

Change 395595 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui-deploy@production] Merging from 36c776f28febfa6e837c099a5f479f63c35ff225:

https://gerrit.wikimedia.org/r/395595

Change 395596 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui@master] Add comment about downloadjs bug

https://gerrit.wikimedia.org/r/395596

Change 395595 merged by Smalyshev:
[wikidata/query/gui-deploy@production] Merging from 36c776f28febfa6e837c099a5f479f63c35ff225:

https://gerrit.wikimedia.org/r/395595

Change 395591 merged by jenkins-bot:
[wikidata/query/gui@master] DownloadJS back to 1.4.4

https://gerrit.wikimedia.org/r/395591

Change 395596 merged by jenkins-bot:
[wikidata/query/gui@master] Add comment about downloadjs bug

https://gerrit.wikimedia.org/r/395596

Okay, switching between downloadjs 1.4.4 and 1.4.7 fixes and breaks T178564: SVG Image query result downloads use incorrect encoding locally, respectively. But it’s still broken on query.wikidata.org – I take it the version change isn’t deployed yet?

Suggestion for the future: would it be possible to add an automated test that catches encoding issues before users run into them?

Change 396325 had a related patch set uploaded (by Jonas Kress (WMDE); owner: Jonas Kress (WMDE)):
[wikidata/query/gui@master] Add test for DownloadJS utf-8

https://gerrit.wikimedia.org/r/396325

I wrote a test to check the encoding.
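The actual test lives in the Gerrit change above. As a language-neutral illustration of the idea only, a round-trip check can feed known non-ASCII samples through the encoder and verify that the bytes decode back to the same text as UTF-8. Everything in this Python sketch (the sample list, `failing_samples`, both encoders) is assumed for the example:

```python
SAMPLES = ["plain ASCII", "café", "Jan Mlčoch", "日本語", "😀"]


def failing_samples(encode, samples=SAMPLES):
    """Return the samples whose encoded bytes do not decode back as UTF-8."""
    failures = []
    for text in samples:
        try:
            ok = encode(text).decode("utf-8") == text
        except UnicodeDecodeError:
            ok = False
        if not ok:
            failures.append(text)
    return failures


def utf8_encoder(text):
    """Stand-in for a correct export path."""
    return text.encode("utf-8")


def low_byte_encoder(text):
    """Stand-in for a broken export path that keeps one byte per UTF-16 code unit."""
    return text.encode("utf-16-le")[0::2]


print(failing_samples(utf8_encoder))      # prints []
print(failing_samples(low_byte_encoder))  # every non-ASCII sample is reported
```

Including samples both below and above U+00FF matters, because a test that uses only ASCII data passes with either encoder.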

Change 396325 merged by jenkins-bot:
[wikidata/query/gui@master] Add test for DownloadJS utf-8

https://gerrit.wikimedia.org/r/396325

Smalyshev claimed this task.

Should be fine now.