
CopyPatrol incorrectly encodes non-ASCII letters (with diacritics) in article titles, so the links do not work
Open, Needs Triage · Public · BUG REPORT

Description

Steps to Reproduce:

  1. Open CopyPatrol for Czech Wikipedia
  2. Look for a page name that contains a letter with a diacritic; the letter is displayed as "�"
  3. Click on it

Actual Results:
Wikipedia page is opened with:
Bad title
The requested page title contains an invalid UTF-8 sequence.
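A minimal sketch of how such a link can end up with an invalid UTF-8 sequence, assuming (not confirmed in this task) that the title reaches the link as Latin-1 bytes instead of UTF-8. The title `Usnesení` and the helper `is_valid_utf8` are illustrative only and are not taken from CopyPatrol's code.

```python
from urllib.parse import quote, unquote_to_bytes

# Illustrative title containing "í"; not taken from CopyPatrol itself.
title = "Usnesení"

# Percent-encode the title twice: once as UTF-8 (what MediaWiki expects)
# and once as Latin-1 (the assumed faulty path).
good = quote(title, encoding="utf-8")
bad = quote(title, encoding="latin-1")


def is_valid_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded bytes form valid UTF-8."""
    try:
        unquote_to_bytes(percent_encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(good, is_valid_utf8(good))
print(bad, is_valid_utf8(bad))
```

In the Latin-1 case "í" becomes the single byte 0xED, which on its own is not a complete UTF-8 sequence, so a server that decodes the path as UTF-8 has to reject it.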

Event Timeline

Janbery created this task. Feb 9 2020, 6:51 PM
Restricted Application added a project: Community-Tech. Feb 9 2020, 6:51 PM
Restricted Application added a subscriber: Aklapper.

@Janbery: Ahoj, your task description above shows � (a square with a question mark in it). Is that really the character you meant? Just trying to make sure this is not a browser issue on your side. Note that there is no page for that character either: https://cs.wikipedia.org/wiki/%EF%BF%BD , so I do not see a bug in CopyPatrol here, only expected and correct behavior...

@Aklapper: Ahoj, I mean when an article title contains diacritics, for example "í". It looks like this:

Janbery updated the task description. Feb 9 2020, 8:31 PM

Ah, thanks! Sorry, I misunderstood. I can confirm the problem.

Aklapper renamed this task from "Unable to open article links with diacritic in CopyPatrol" to "CopyPatrol incorrectly encodes non-ASCII letters (with diacritics) in article titles, so the links do not work". Feb 9 2020, 8:51 PM
Aklapper updated the task description.
Aklapper added a project: I18n.

It looks like the incorrect page title is in the EranBot database:

MariaDB [s51306__copyright_p]> select page_title from copyright_diffs where page_title like 'Usne%';
+-------------------------+
| page_title              |
+-------------------------+
| Usnesen�_zastupitelstva  |
+-------------------------+
1 row in set (0.06 sec)
Janbery updated the task description. Feb 13 2020, 6:34 PM

Any news on this problem?

It might be a simple issue of changing the db charset, or adding a SET NAMES to the client.
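To illustrate the kind of mismatch this would fix, here is a small sketch that needs no database. It assumes the intended title is "Usnesení_zastupitelstva" (inferred from the query output above, not confirmed) and that one side of the connection writes Latin-1 while the other reads UTF-8. The `SET NAMES utf8mb4` string is the standard MariaDB/MySQL statement for declaring the connection charset; whether it is the right fix for EranBot is an assumption.

```python
# Assumption: this is the title the bot meant to store.
intended = "Usnesení_zastupitelstva"

# Bytes as they would travel over a connection that uses Latin-1.
latin1_bytes = intended.encode("latin-1")

# A reader that expects UTF-8 cannot decode the lone 0xED byte and
# substitutes U+FFFD, the replacement character seen in the report.
seen = latin1_bytes.decode("utf-8", errors="replace")

# When both sides agree on UTF-8, the round trip is lossless.
round_trip = intended.encode("utf-8").decode("utf-8")

# Statement a client could issue right after connecting so that the
# server treats the connection as UTF-8 (hypothetical placement).
SET_NAMES_SQL = "SET NAMES utf8mb4"

print(seen)
print(round_trip)
```

If the stored bytes are already damaged, setting the connection charset only helps new rows; existing titles would still need to be repaired separately.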