CopyPatrol incorrectly encodes non-ASCII letters (with diacritics) in article titles, so the links do not work
Closed, Resolved · Public · BUG REPORT

Description

Steps to Reproduce:

  1. Open CopyPatrol for Czech Wikipedia
  2. Look for a page name that contains a letter with a diacritic; the letter is displayed as "�"
  3. Click on it

Actual Results:
Wikipedia page is opened with:
Bad title
The requested page title contains an invalid UTF-8 sequence.

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Janbery: Ahoj, your task description above shows � (a square with a question mark in it). Is that really intentionally the character that you were looking for? Just trying to make sure this is not a browser issue on your side. Note that there is no such page either for that character: https://cs.wikipedia.org/wiki/%EF%BF%BD , so I do not see any bug in CopyPatrol here but expected and correct behavior...

@Aklapper: Ahoj, I mean when the article title contains a diacritic, for example "í". It looks like this:

Ah, thanks! Sorry, I misunderstood. I can confirm the problem.

Aklapper renamed this task from Unable to open article links with diacritic in CopyPatrol to CopyPatrol incorrectly encodes non-ASCII letters (with diacritics) in article titles, so the links do not work. Feb 9 2020, 8:51 PM
Aklapper updated the task description.
Aklapper added a project: I18n.

It looks like the incorrect page title is in the EranBot database:

MariaDB [s51306__copyright_p]> select page_title from copyright_diffs where page_title like 'Usne%';
+-------------------------+
| page_title              |
+-------------------------+
| Usnesen�_zastupitelstva  |
+-------------------------+
1 row in set (0.06 sec)

It might be a simple matter of changing the database charset, or of having the client issue a SET NAMES statement.

This affects every wiki, not only Czech Wikipedia.

I think the character set on the database is fine. I manually updated a row to use the right character and it stored properly. Example: https://copypatrol.toolforge.org/fr/?id=64032167

So the issue must be with the EranBot code.
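
For illustration, here is a minimal Python sketch of how a title could end up in this state. It assumes, and nothing in this task confirms it, that the title bytes were Latin-1 rather than UTF-8 at some point between the bot and the database. The full title "Usnesení_zastupitelstva" is also a guess based on the row above. This is not EranBot's actual code.

```python
from urllib.parse import quote

# Assumed original title; the "í" is inferred from the garbled row above.
title = "Usnesení_zastupitelstva"

# Hypothetical failure: the title is serialized as Latin-1 instead of UTF-8.
latin1_bytes = title.encode("latin-1")   # "í" becomes the single byte 0xED

# A lone 0xED is not valid UTF-8, so a UTF-8 reader substitutes U+FFFD (�).
garbled = latin1_bytes.decode("utf-8", errors="replace")

# Percent-encoded forms of each variant, as they would appear in a link.
good_link = quote(title)            # UTF-8 bytes of "í": %C3%AD
raw_link = quote(latin1_bytes)      # raw 0xED byte: an invalid UTF-8 sequence
replaced_link = quote("\ufffd")     # the replacement character itself

print(garbled)
print(good_link, raw_link, replaced_link)
```

Under that assumption, a link built from the raw bytes carries %ED, which MediaWiki would reject as an invalid UTF-8 sequence, while the replacement character itself encodes to %EF%BF%BD, the URL quoted earlier in this task.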

MusikAnimal claimed this task.
MusikAnimal moved this task from Backlog to Done on the CopyPatrol board.

Looks like the fix for T273017 also fixed this! See for example https://copypatrol.toolforge.org/fr/?id=68138278 and https://copypatrol.toolforge.org/cs/?id=68154284

Historical records will still show the � in place of characters with diacritics, but everything should appear correct from now on.

Sorry it took so long to fix this! Resolving.