
Some charsets characters not converted into UTF-8 correctly
Closed, Resolved · Public · 8 Story Points

Event Timeline

LuisVilla raised the priority of this task to Needs Triage.
LuisVilla updated the task description.
LuisVilla added a project: Citoid.
LuisVilla added a subscriber: LuisVilla.
Restricted Application added a subscriber: Aklapper. · Apr 12 2015, 2:09 AM
Mvolz moved this task from Backlog to IO Tasks on the Citoid board. · Apr 13 2015, 7:11 AM
Mvolz added a subscriber: Elitre.
Mvolz renamed this task from Character problem in citoid-generated citation to French special characters not represented correctly. · Apr 13 2015, 7:14 AM
Mvolz updated the task description.
Mvolz set Security to None.
JeanFred moved this task from IO Tasks to Backlog on the Citoid board. · Apr 13 2015, 4:59 PM
Mvolz moved this task from Backlog to IO Tasks on the Citoid board. · Apr 13 2015, 9:28 PM
Mvolz added a subscriber: Mvolz. · Apr 13 2015, 9:30 PM

Additional info:

Quote characters seem to come through fine on localhost using curl, so that might be an extension-end problem.

Accent characters not so much; maybe a known issue with the request library? The suggestion there is to just take in the binary and use iconv to do all character decoding/encoding: https://github.com/request/request/issues/118

> Accent characters not so much; maybe a known issue with the request library?

Nope, the actual problem is that the site is served by MS IIS, and the encoding of the page is set to latin1 (see the page's Content-Type meta tag).

> Suggestion is to just take in the binary and use iconv to do all character decoding/encoding

Yup. iconv-lite seems to be a good choice for that (it is already a dependency of body-parser, which we pull in, but we cannot reuse it because of the way the deploy repo works).

IIS being IIS, it does not set a correct Content-Type header (the charset part is missing), so one way to get around this is to:

  • fetch the HTML into a Buffer object
  • get the charset from the HTML (no parsing, just a regex)
  • decode the buffer via iconv-lite
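
A rough sketch of the three steps above, written in Python rather than Node purely so that it runs with the standard library alone. `bytes.decode` stands in for iconv-lite, and the helper names (`sniff_charset`, `decode_page`), the 4096-byte scan window, and the sample page are all made up for illustration; this is not the code from any actual patch.

```python
import re

# Assumption: a simple regex is enough to spot the charset in a <meta> tag,
# covering both <meta charset="..."> and the http-equiv Content-Type form.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Step 2: read the declared charset straight from the raw bytes."""
    # Only the start of the page is scanned (hypothetical 4096-byte window).
    match = META_CHARSET_RE.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw: bytes) -> str:
    """Step 3: decode the buffer with whatever charset the page declares."""
    charset = sniff_charset(raw)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 instead of failing.
        return raw.decode("utf-8", errors="replace")


# Step 1 would be the HTTP fetch into a byte buffer; here a made-up
# latin1-encoded page plays the role of the fetched body.
page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1">'
    "<title>Décret et arrêté</title></head></html>"
).encode("latin-1")

text = decode_page(page)
```

The extra cost mentioned below comes from having to hold the whole body as bytes and scan it before any decoding can happen.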

Personally, I don't like it, as it introduces quite a big overhead, but I currently do not see another way around it (trying to get the French government to either switch to something other than IIS or configure it correctly is a lost cause IMO :P)

Small OT: @LuisVilla funny coincidence you chose the city where I used to live (Rennes) :)

It looks like this isn't limited to French characters. In another diff, the title of the page (a page from the website of Tokyo's police department) was mangled:

  • Page title: グラフ警視庁 組織図・体制 :警視庁
  • Citoid output: �O���t�x�����@�g�D�}�E�̐��@�F�x����

Am I correct in assuming this is the same issue?

> Am I correct in assuming this is the same issue?

Yup, the content is not UTF-8-encoded. Thanks for reporting!
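
To illustrate why the output looks like that, here is a small sketch (Python, standard library only). It assumes the page bytes are Shift_JIS and were read as UTF-8. The sample is just the last three characters of the title above. Bytes that are valid Shift_JIS are mostly invalid as UTF-8, so a UTF-8 decoder substitutes the replacement character U+FFFD for them.

```python
# Sample text: the last three characters of the page title quoted above.
title = "警視庁"

# What the server would send if the page is Shift_JIS-encoded (assumption).
raw = title.encode("shift_jis")

# Reading those bytes as UTF-8 produces replacement characters (U+FFFD).
wrong = raw.decode("utf-8", errors="replace")

# Reading them with the charset the page actually uses restores the text.
right = raw.decode("shift_jis")

print(repr(wrong))
print(right)
```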

Jdforrester-WMF renamed this task from French special characters not represented correctly to UTF8 characters not represented correctly. · Apr 21 2015, 5:35 PM
Jdforrester-WMF triaged this task as Unbreak Now! priority.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Argh.

Mvolz claimed this task. · Apr 23 2015, 4:34 PM
Mvolz updated the task description. · Apr 28 2015, 9:51 AM
Mvolz renamed this task from UTF8 characters not represented correctly to Some charsets characters not converted into UTF-8 correctly. · Apr 28 2015, 10:57 AM

So, for the three URLs we have so far, this is what the HTML and the response declare:

  • it and fr: the HTML sets charset=iso-8859-1. The fr response has no encoding set; the it response has charset=ISO-8859-1 set.
  • jp: the HTML sets charset=Shift_JIS. The response has Content-Type set but no encoding.

So it makes sense that, for the two with no encoding set in the response, we have no choice but to try to read it directly from the HTML. The fact that the Italian one
http://www.corriere.it/esteri/15_marzo_27/aereo-germanwings-indizi-interessanti-casa-copilota-ff5e34f8-d446-11e4-831f-650093316b0e.shtml looked okay to me from curl when I looked at it before is consistent with that...

...but now it's not working again, so I'm not sure what's going on there.
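
For reference, the lookup order discussed here (charset from the response header if present, otherwise from the HTML, otherwise a default) could be sketched as below. This is Python with made-up helper names and an assumed UTF-8 default; it only illustrates the precedence and is not a description of any actual patch.

```python
import re
from typing import Optional

# Assumption: simple regexes are good enough for both places a charset can appear.
HEADER_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, if any."""
    if not content_type:
        return None
    match = HEADER_CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_html(raw: bytes) -> Optional[str]:
    """Return the charset declared in a <meta> tag near the top of the page."""
    match = META_CHARSET_RE.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type: Optional[str], raw: bytes,
                 default: str = "utf-8") -> str:
    """Header first, then the HTML declaration, then the default."""
    return charset_from_header(content_type) or charset_from_html(raw) or default


# The three situations listed above, with stand-in bodies:
it_case = pick_charset("text/html; charset=ISO-8859-1", b"<html></html>")
fr_case = pick_charset(None, b'<meta http-equiv="Content-Type" '
                             b'content="text/html; charset=iso-8859-1">')
jp_case = pick_charset("text/html", b'<meta charset="Shift_JIS">')
```

Whether the header should win over the HTML declaration when the two disagree is a judgment call that this sketch does not settle.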

Change 207071 had a related patch set uploaded (by Mvolz):
[WIP] encoding things

https://gerrit.wikimedia.org/r/207071

Change 207071 merged by Mobrovac:
Improve reading encoding from scraped pages

https://gerrit.wikimedia.org/r/207071

mobrovac closed this task as Resolved. · Apr 29 2015, 7:51 PM
mobrovac removed a project: Patch-For-Review.
mobrovac removed a subscriber: gerritbot.

The fix has been deployed, so resolving. Please check (in prod or beta, but also on citoid.wmflabs.org) and reopen the issue if it persists or if other instances of the same problem are found.