Some charsets characters not converted into UTF-8 correctly
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To

Authored By

	LuisVilla
	Apr 12 2015, 2:09 AM

Description

Put a source with "légales" in the title into citoid:

http://www.insee.fr/fr/ppp/bases-de-donnees/recensement/populations-legales/departement.asp?dep=35#top

Leads to a citation with "l�gales"

https://en.wikipedia.org/w/index.php?title=User%3ALuisVilla%2Fsandbox%2Fcitoidtest2&diff=656057913&oldid=656057041

Same issue with French quotes:

http://www.corriere.it/esteri/15_marzo_27/aereo-germanwings-indizi-interessanti-casa-copilota-ff5e34f8-d446-11e4-831f-650093316b0e.shtml

http://en.wikipedia.beta.wmflabs.org/w/index.php?title=User%3AElitest%2Fsandbox&diff=212317&oldid=84590

Details

	Subject	Repo	Branch	Lines +/-
	Improve reading encoding from scraped pages	mediawiki/services/citoid	master	+165 -66


Related Objects

Mentioned In: rGCIT3733cbf88ff3: Improve reading encoding from scraped pages
T94767: Perform a weekly review of edits made with VisualEditor

Event Timeline

LuisVilla created this task. Apr 12 2015, 2:09 AM

LuisVilla raised the priority of this task to Needs Triage.

LuisVilla updated the task description.

LuisVilla added a project: Citoid.

LuisVilla subscribed.

Restricted Application added a subscriber: Aklapper. Apr 12 2015, 2:09 AM

Mvolz moved this task from Backlog to IO Tasks on the Citoid board. Apr 13 2015, 7:11 AM

Mvolz merged a task: T94213: Unrecognized quotes.

Mvolz added a subscriber: Elitre.

Mvolz renamed this task from Character problem in citoid-generated citation to French special characters not represented correctly. Apr 13 2015, 7:14 AM

Mvolz updated the task description.

Mvolz set Security to None.

JeanFred moved this task from IO Tasks to Backlog on the Citoid board. Apr 13 2015, 4:59 PM

Mvolz moved this task from Backlog to IO Tasks on the Citoid board. Apr 13 2015, 9:28 PM

Additional info:

Quote characters seem to come through fine on localhost using curl, so that might be an extension-end problem.

Accent characters not so much; maybe a known issue with the request library? Suggestion is to just take in the binary and use iconv to do all character decoding/encoding: https://github.com/request/request/issues/118

In T95833#1204831, @Mvolz wrote:

Accent characters not so much; maybe a known issue with the request library?

Nope, the actual problem is that that site is served by MS IIS, and the encoding of the page is set to latin1 (see the page's Content-Type meta tag).

Suggestion is to just take in the binary and use iconv to do all character decoding/encoding

Yup. iconv-lite seems to be a good choice for that (it is already a dependency of body-parser, which we pull in, but we cannot reuse it due to the way the deploy repo works).

IIS being IIS, it does not set the correct Content-Type header (it lacks the charset part), which means that a way to get around this is:

1. fetch the HTML into a Buffer object
2. get the charset (no parsing, just RegEx)
3. decode the buffer via iconv-lite
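
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: citoid itself is Node.js and the thread proposes iconv-lite, whereas this sketch uses Python's built-in codecs so that it stays self-contained. The names `META_CHARSET_RE`, `sniff_meta_charset` and `decode_html` are hypothetical, and the regular expression is a rough guess at "no parsing, just RegEx", not the pattern from the merged patch.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta> tag, covering both
# <meta charset="..."> and <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Assumption: the declaration sits near the top of the page, so only the
    # first few kilobytes are scanned.
    match = META_CHARSET_RE.search(raw[:4096])
    return match.group(1).decode("ascii") if match else default


def decode_html(raw: bytes) -> str:
    """Decode the raw response body using the charset sniffed from the HTML."""
    charset = sniff_meta_charset(raw)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 rather than failing.
        return raw.decode("utf-8", errors="replace")
```

The body is kept as bytes until the charset is known, which is the point of fetching into a Buffer first: once the bytes have been decoded with the wrong charset, the original characters cannot be recovered.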

Personally, I don't like it as it introduces quite a big overhead, but currently do not see another way around it (trying to get the French government to either switch to something other than IIS or get them to configure it correctly is a lost cause IMO :P)

Small OT: @LuisVilla funny coincidence you chose the city where I used to live (Rennes) :)

It looks like this isn't limited to French characters. In another diff, the title of the page (a page from the website of Tokyo's police department) was mangled:

Page title: グラフ警視庁　組織図・体制　：警視庁
Citoid output: �O��t�x��@�g�D�}�E�̐��@�F�x��

Am I correct in assuming this is the same issue?

In T95833#1224909, @gpaumier wrote:

Am I correct in assuming this is the same issue?

Yup, the content is not utf8-encoded. Thanks for reporting!
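
The mangling reported in this task can be reproduced in a few lines. The sketch below is an illustration only, assuming the pages were served as ISO-8859-1 and Shift_JIS bytes (as the thread states) and then decoded as UTF-8 with invalid bytes replaced by U+FFFD; the helper name `decode_as_utf8` is hypothetical and Python's built-in codecs stand in for whatever the service actually used.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Bytes as the two sites would have sent them (assumed encodings from the thread).
french_bytes = "légales".encode("iso-8859-1")
japanese_title = "グラフ警視庁　組織図・体制　：警視庁"
japanese_bytes = japanese_title.encode("shift_jis")

# Wrong: treating non-UTF-8 bytes as UTF-8 yields replacement characters.
mangled_french = decode_as_utf8(french_bytes)
mangled_japanese = decode_as_utf8(japanese_bytes)

# Right: decoding with the charset the page declares restores the text.
fixed_french = french_bytes.decode("iso-8859-1")
fixed_japanese = japanese_bytes.decode("shift_jis")
```

In the French case the single byte for "é" is not valid UTF-8 on its own, so it becomes one replacement character, which matches the "l�gales" seen in the citation.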

Argh.

gpaumier mentioned this in T94767: Perform a weekly review of edits made with VisualEditor. Apr 21 2015, 10:36 PM

Mvolz claimed this task. Apr 23 2015, 4:34 PM

Rdicerb awarded a token. Apr 23 2015, 8:13 PM

Mvolz updated the task description. Apr 28 2015, 9:51 AM

Mvolz renamed this task from UTF8 characters not represented correctly to Some charsets characters not converted into UTF-8 correctly. Apr 28 2015, 10:57 AM

So, for the three URLs we have so far:

Two (it/fr) set charset=iso-8859-1 in the HTML. fr: no encoding set in the response; it: charset=ISO-8859-1 set in the response.
One (jp) sets charset=Shift_JIS in the HTML; Content-Type is set, but no encoding is set in the response.

So it makes sense that for the two with no encoding set in the response, we have no choice but to try to read it directly from the HTML. The fact that the Italian one
http://www.corriere.it/esteri/15_marzo_27/aereo-germanwings-indizi-interessanti-casa-copilota-ff5e34f8-d446-11e4-831f-650093316b0e.shtml looked okay to me from curl when I looked at it before is consistent with that...

...but now it's not working again, so I'm not sure what's going on there.
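
One way to express the fallback discussed in this comment is sketched below: use the charset from the response's Content-Type header when it has one, otherwise the charset declared in the HTML, otherwise UTF-8. The order of precedence is an assumption for illustration and is not confirmed to be what the merged patch does; `charset_from_content_type` and `choose_charset` are hypothetical names, and Python's standard library is used only to keep the sketch self-contained.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type header, if any."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when the header has no charset parameter.
    return msg.get_content_charset()


def choose_charset(
    content_type_header: Optional[str],
    meta_charset: Optional[str],
    default: str = "utf-8",
) -> str:
    """Assumed precedence: response header, then HTML <meta>, then the default."""
    return charset_from_content_type(content_type_header) or meta_charset or default
```

Applied to the three URLs above, the Italian page would be decoded using its header, while the French and Japanese pages would fall through to the charset declared in their HTML.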

Change 207071 had a related patch set uploaded (by Mvolz):
[WIP] encoding things

https://gerrit.wikimedia.org/r/207071

gerritbot added a project: Patch-For-Review. Apr 28 2015, 1:18 PM

Change 207071 merged by Mobrovac:
Improve reading encoding from scraped pages

https://gerrit.wikimedia.org/r/207071

Mvolz mentioned this in rGCIT3733cbf88ff3: Improve reading encoding from scraped pages. Apr 29 2015, 7:46 PM

The fix has been deployed, so resolving. Please check (in prod or beta, but also citoid.wmflabs.org) and reopen the issue if it persists (or other instances of the same problem are found).

Jdforrester-WMF added projects: VisualEditor 2014/15 Q4 blockers, WMF-deploy-2015-04-29_(1.26wmf4). Apr 29 2015, 8:30 PM

Jdforrester-WMF edited a custom field.

Jdforrester-WMF moved this task from Nominated to Done on the VisualEditor 2014/15 Q4 blockers board.
