
GWT created a page that had "�" in the text
Closed, Declined · Public

Description

https://commons.wikimedia.org/w/index.php?oldid=122700114

I'm not sure whether this is because the source material was bad, but I'd appreciate a double-check on this; I'd hate for our servers to be spitting out pages that indicate poor Unicode support.

Event Timeline

MarkTraceur assigned this task to dan-nl.
MarkTraceur raised the priority of this task from to Medium.
MarkTraceur updated the task description. (Show Details)
MarkTraceur subscribed.
Restricted Application added a subscriber: Aklapper.

@MarkTraceur,
It looks like there was a problem with the original metadata that Fæ cleaned up; see this revision comparison. But I'd double-check with Fæ to make sure. Not sure how to add @Fae to this task. Will @Fae do it?

Yeah, this is not a GWT "bug", but it is likely to be a stumbling block for users. The source website (NYPL) had catalogue pages with inconsistent character encoding, probably because they had been created at different times on different original systems before being put online in one database. In some cases the HTML meta tag actually contradicted the encoded content of the page. As a result, even though I had run tests on several hundred descriptions, this still caught me out and was a complex bit of "housekeeping".

I find this a horrid problem. It might be an idea for the manual to recommend tools (like iconv) and provide some examples of how to correct source text files so that they are all "UTF-8 ready" before feeding them into GWT.

The HTML source contains the byte sequence EF BF BD, that is, a proper replacement character; it's not the browser that transforms an invalid byte sequence into a replacement character. It would be easy to set up an AbuseFilter rule to catch edits which add such characters so errors of this kind don't go unnoticed.

It would be interesting to know at what point exactly the original byte sequence is transformed (it's probably libxml in GWT that does it).

As for fixing manually, chardet claims to autodetect the real encoding with some sophistication.
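A hedged sketch of how chardet might be applied before conversion (chardet is a third-party package; the function name is mine, and its guesses and confidence values vary with input length, so short strings can be misdetected):

```python
import chardet

def detect_and_decode(raw: bytes) -> str:
    """Guess the encoding of raw bytes with chardet and decode with the guess."""
    guess = chardet.detect(raw)          # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    encoding = guess['encoding'] or 'utf-8'  # detection can return None; fall back
    return raw.decode(encoding, errors='replace')
```

For batch jobs it would be worth logging the reported confidence and flagging low-confidence files for manual review rather than trusting the guess blindly.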

In T90887#1072134, @Tgr wrote:

It would be easy to set up an abusefilter rule to catch edits which add such characters so errors of this kind don't go unnoticed.

AbuseFilter ignores wikitext during upload: T89252

Aklapper added a subscriber: dan-nl.

@dan-nl: I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!