Put source with "légales" in the title into citoid:
Leads to a citation with "l�gales"
Same issue with French quotes:
Subject | Repo | Branch | Lines +/-
---|---|---|---
Improve reading encoding from scraped pages | mediawiki/services/citoid | master | +165 -66
Additional info:
Quote characters seem to come through fine on localhost using curl, so that might be an extension-end problem.
Accent characters not so much; maybe a known issue with the request library? Suggestion is to just take in the binary and use iconv to do all character decoding/encoding: https://github.com/request/request/issues/118
Nope, the actual problem is that that site is served by MS IIS, and the encoding of the page is set to latin1 (see the page's Content-Type meta tag).
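To illustrate why a latin1 page produces "l�gales", here is a minimal sketch using only Node's built-in Buffer. The variable names are made up for the example; it is not taken from citoid's code.

```javascript
// "légales" as a latin1 (ISO-8859-1) server sends it: "é" is the single byte 0xE9.
const latin1Bytes = Buffer.from('l\u00e9gales', 'latin1');

// Read as UTF-8, 0xE9 announces a multi-byte sequence that never arrives,
// so the decoder substitutes U+FFFD (the "�" replacement character).
const wrong = latin1Bytes.toString('utf8');

// Read with the charset the page actually uses, the text survives.
const right = latin1Bytes.toString('latin1');

console.log(JSON.stringify(wrong)); // "l�gales"
console.log(JSON.stringify(right)); // "légales"
```

Once the replacement character has been produced, the original byte is gone, which is why the decoding has to happen on the raw bytes rather than on an already-decoded string.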
Suggestion is to just take in the binary and use iconv to do all character decoding/encoding
Yup. iconv-lite seems to be a good choice for that (it is already a dependency of body-parser, which we pull in, but we cannot reuse it because of the way the deploy repo works).
IIS being IIS, it does not set the correct Content-Type header (it lacks the charset part), which means that a way to get around this is:
Personally, I don't like it as it introduces quite a big overhead, but currently do not see another way around it (trying to get the French government to either switch to something other than IIS or get them to configure it correctly is a lost cause IMO :P)
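A workaround along these lines could be sketched as follows, assuming the body is fetched as raw bytes, the charset is sniffed from the HTML meta tag, and only then is the body decoded. The helper names `sniffMetaCharset` and `decodeBody` are hypothetical, and the built-in `TextDecoder` stands in for iconv-lite to keep the example self-contained; this is not the code from the actual patch.

```javascript
// Hypothetical helper: look for a charset declared in a <meta> tag within
// the first bytes of the raw body. Reading those bytes as latin1 is safe
// here because the declaration itself is plain ASCII.
function sniffMetaCharset(rawBody) {
  const head = rawBody.subarray(0, 1024).toString('latin1');
  const match = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode the raw bytes with the sniffed charset,
// falling back to UTF-8 when the page declares nothing.
function decodeBody(rawBody) {
  const charset = sniffMetaCharset(rawBody) || 'utf-8';
  // TextDecoder throws a RangeError for labels it does not know.
  return new TextDecoder(charset).decode(rawBody);
}

// A made-up latin1 page, similar in spirit to the one in this report.
const page = Buffer.from(
  '<html><head><meta http-equiv="Content-Type" ' +
  'content="text/html; charset=iso-8859-1">' +
  '<title>Mentions l\u00e9gales</title></head></html>',
  'latin1'
);

console.log(sniffMetaCharset(page));
console.log(decodeBody(page).includes('Mentions l\u00e9gales'));
```

The overhead mentioned above comes from this extra pass: the body has to be held as bytes and scanned once before it can be decoded.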
Small OT: @LuisVilla funny coincidence you chose the city where I used to live (Rennes) :)
It looks like this isn't limited to French characters. In another diff, the title of the page (a page from the website of Tokyo's police department) was mangled:
Am I correct in assuming this is the same issue?
So far we have three URLs. This is what each one declares in its HTML and in its response:
Two (it/fr) set charset=iso-8859-1 in the HTML. fr: no encoding set in the response; it: charset=ISO-8859-1 set in the response.
One (jp) sets charset=Shift_JIS in the HTML; the Content-Type header is set, but no encoding is set in the response.
So it makes sense that for the two with no encoding set in the response, we have no choice but to try to read it directly from the HTML. The fact that the Italian one,
http://www.corriere.it/esteri/15_marzo_27/aereo-germanwings-indizi-interessanti-casa-copilota-ff5e34f8-d446-11e4-831f-650093316b0e.shtml, looked okay to me from curl when I looked at it before is consistent with that...
...but now it's not working again, so I'm not sure what's going on there.
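The three cases can be summarised with a small sketch of how a charset might be chosen. The precedence used here (response header first, then the HTML declaration, then UTF-8) is an assumption for illustration, the function names are hypothetical, and the header values are made up to match the descriptions rather than captured from the real sites.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, or null when the header carries none.
function charsetFromHeader(contentType) {
  const match = /charset\s*=\s*"?([\w-]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Assumed precedence: header, then the charset found in the HTML, then UTF-8.
function resolveCharset(contentType, metaCharset) {
  return charsetFromHeader(contentType) ||
    (metaCharset ? metaCharset.toLowerCase() : null) ||
    'utf-8';
}

// fr: no encoding in the response, iso-8859-1 in the HTML
console.log(resolveCharset('text/html', 'iso-8859-1'));                     // iso-8859-1
// it: ISO-8859-1 in both the response and the HTML
console.log(resolveCharset('text/html; charset=ISO-8859-1', 'iso-8859-1')); // iso-8859-1
// jp: Content-Type set without an encoding, Shift_JIS in the HTML
console.log(resolveCharset('text/html', 'Shift_JIS'));                      // shift_jis
```

Under this assumption, only the Italian page can be decoded from the header alone; the other two depend entirely on what the HTML declares.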
The fix has been deployed, so resolving. Please check (in prod or beta, but also citoid.wmflabs.org) and reopen the issue if it persists (or other instances of the same problem are found).