
wrong encoding of the header for reflinks.py
Closed, Resolved · Public · BUG REPORT

Description

For example, see this edit:
https://ru.wikipedia.org/w/index.php?diff=109706980&diffmode=source

For the link https://www.pravda.ru/culture/music/21-01-2008/252537-mandrik-0/ the bot added this title:

п║п╣я─пЁп╣п╧ п°п╟п╫п╢я─п╦п╨: п╒п╟п╫я├я▀ п╡ я┬п╬я┐-п╠п╦п╥п╫п╣я│п╣ - п╢п╣п╩п╬ п©п╣я─я│п©п╣п╨я┌п╦п╡п╫п╬п╣

But it should have been:

Сергей Мандрик: Танцы в шоу-бизнесе - дело перспективное

Event Timeline

If the page declares no charset, the following encodings are supported: ['koi8-r', 'windows-1251', 'utf-8']. However, only the first one is tried, which produces the result shown above. The second fails with a decoding error, and the third would work. How can I determine that 'utf-8' is the right encoding in this case and 'koi8-r' is not?
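
To illustrate the question, here is a minimal sketch (not the pywikibot code) that reproduces the garbled title and shows one common heuristic: try strict UTF-8 first and fall back to the single-byte encodings only if that fails. The helper name `decode_first_valid` and the candidate order are assumptions made for this example, as is the premise that the page bytes are UTF-8.

```python
TITLE = "Сергей Мандрик: Танцы в шоу-бизнесе - дело перспективное"
RAW = TITLE.encode("utf-8")  # assumption: the page serves UTF-8 bytes


def decode_first_valid(raw, candidates=("utf-8", "windows-1251", "koi8-r")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the bytes without an error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Decoding the UTF-8 bytes as KOI8-R raises no error for this title;
# it silently produces the mojibake from the report.
mojibake = RAW.decode("koi8-r")
print(mojibake)

# With strict UTF-8 tried first, the title is recovered.
text, used = decode_first_valid(RAW)
print(used, text)
```

Trying UTF-8 first helps because text in a single-byte Cyrillic encoding rarely happens to be valid UTF-8, so a successful strict UTF-8 decode is a strong signal. The sketch cannot tell 'windows-1251' and 'koi8-r' apart, though, since both accept almost any byte sequence; that distinction needs a statistical check of the decoded result.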

Xqt triaged this task as Medium priority. Oct 30 2020, 1:04 PM

Change 637696 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Try all encodings until the result looks valid

https://gerrit.wikimedia.org/r/637696

The result after this patch:

C:\pwb\GIT\core>pwb reflinks -page:user:xqt/Test -simulate
Retrieving 1 pages from wikipedia:de.
No charset found for https://www.pravda.ru/culture/music/21-01-2008/252537-mandrik-0/
https://www.pravda.ru/culture/music/21-01-2008/252537-mandrik-0/ : Decoding error - 'koi8-r' codec can't decode bytes in position 1566-1567:  п╢п╣п╩п╬ п©п╣я─я│п©
https://www.pravda.ru/culture/music/21-01-2008/252537-mandrik-0/ : Decoding error - 'charmap' codec can't decode byte 0x98 in position 14272: character maps to <undefined>
Found no section that can be preceded by a new references section.
Placing it before interwiki links, categories, and bottom templates.


>>> Benutzer:Xqt/Test <<<
@@ -131 +131 @@
- * 3 муж — Сергей Мандрик — хореограф, художественный руководитель балета «Street Jazz»<ref>[https://www.instagram.com/mandriksj/ Сергей Мандрик (@mandriksj) • Instagram photos and videos<!-- Заголовок добавлен ботом -->]</ref>, руководил танцевальными номерами участников «[[Фабрика звёзд (Россия)|Фабрики звезд-7]]»<ref>[https://www.pravda.ru/culture/music/21-01-2008/252537-mandrik-0/]</ref>
+ * 3 муж — Сергей Мандрик — хореограф, художественный руководитель балета «Street Jazz»<ref>[https://www.instagram.com/mandriksj/ Сергей Мандрик (@mandriksj) • Instagram photos and videos<!-- Заголовок добавлен ботом -->]</ref>, руководил танцевальными номерами участников «[[Фабрика звёзд (Россия)|Фабрики звезд-7]]»<ref>[https://www.pravda.ru/culture/music/21-01-2008/252537-mandrik-0/ Сергей Мандрик: Танцы в шоу-бизнесе - дело перспективное<!-- Automatisch generierter Titel -->]</ref>

@@ -312 +312,4 @@
- [[Категория:Мираж (группа)]]
+ [[Категория:Мираж (группа)]]
+
+ == Einzelnachweise ==
+ <references />

Edit summary: Bot: Korrektes Referenzformat (siehe [[mw:Manual:Pywikibot/refLinks]])
Do you want to accept these changes? ([y]es, [N]o, [a]ll, [q]uit): q

User quit ReferencesRobot bot run...

0 pages read
0 pages written
0 pages skipped
Execution time: 5 seconds
Script terminated successfully.

C:\pwb\GIT\core>

Change 637696 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] use chardet to find a valid encoding

https://gerrit.wikimedia.org/r/637696
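
The merged change's title mentions chardet; the patch itself is not reproduced here. As a rough sketch of the idea only, assuming the third-party `chardet` package is installed, its `chardet.detect()` function can guess an encoding from raw bytes. The helper name `guess_and_decode` and the UTF-8 fallback are assumptions for this example, not the pywikibot implementation.

```python
try:
    import chardet  # third-party package; assumed to be installed
except ImportError:
    chardet = None  # the sketch then relies on the fallback encoding only


def guess_and_decode(raw, fallback="utf-8"):
    """Hypothetical helper: decode bytes using the detector's guess,
    falling back to `fallback` when there is no usable guess."""
    guess = chardet.detect(raw) if chardet else {"encoding": None}
    encoding = guess.get("encoding") or fallback
    try:
        return raw.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        # Unknown codec name or wrong guess: decode leniently instead.
        return raw.decode(fallback, errors="replace"), fallback


title = "Сергей Мандрик: Танцы в шоу-бизнесе - дело перспективное"
decoded, encoding = guess_and_decode(title.encode("utf-8"))
print(encoding, decoded)
```

Detection is statistical, so short or ambiguous inputs can still be guessed wrongly; the `confidence` value in the dictionary returned by `chardet.detect()` can be checked before trusting the guess.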

Xqt claimed this task.