Problem with encoding in Zotero results coming from Citoid
Closed, Resolved, Public

Description

We have noticed that Citoid has started having problems with certain news websites when users generate a reference from a URL: some characters are not displayed properly.

Examples of URLs which do not work properly:

Czech:
https://zpravy.idnes.cz/george-soros-osobnost-roku-the-financial-times-frg-/zahranicni.aspx?c=A181219_100000_zahranicni_kha
https://domaci.ihned.cz/c1-66399770-v-dole-na-karvinsku-explodoval-metan-na-miste-zahynul-jeden-hornik-dalsi-se-pohresuji

Arabic:
http://nna-leb.gov.lb/ar/show-report/371/

Hebrew:
https://www.ynet.co.il/articles/0,7340,L-5037054,00.html

French:
https://www.insee.fr/fr/statistiques/zones/2021173

Screenshot of reference generated from the first above-mentioned URL:

Bez názvu.png (537×988 px, 32 KB)

This is caused by Zotero, which passes the buffer to jsdom but does not pass along the Content-Type header needed to decode it properly, because jsdom does not accept it:

https://github.com/zotero/translation-server/issues/77
https://github.com/jsdom/jsdom/issues/2495

Event Timeline

An example would be http://nna-leb.gov.lb/ar/show-report/371/, from which Citoid extracts metadata containing Unicode replacement characters (U+FFFD).

[
  {
    "key": "U5U8J9AW",
    "version": 0,
    "itemType": "webpage",
    "tags": [],
    "title": "������� ���� ������ ������ ������ ������ ��� ����",
    "websiteTitle": "������� ������� �������",
    "url": "http://nna-leb.gov.lb/ar/show-report/371/",
    "abstractNote": "\t����� ����� �������\t����� - ������� ���� ������� ������ \"��� ������\" � ���� ����� ���� ����� ������ ������ �����. ���� �� ��� ����� 126 ���ǡ �� ���� �� ������ ������� ����� ��� ����",
    "language": "ar",
    "accessDate": "2019-01-17",
    "author": [
      [
        "",
        "������� ������� �������"
      ]
    ],
    "source": [
      "Zotero"
    ]
  }
]
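
The output above is consistent with bytes in a legacy Arabic encoding being decoded as UTF-8. Below is a minimal Python sketch of that failure mode. It assumes the page is served as windows-1256, which this task does not state, and `decode_as_utf8` is a hypothetical helper name, not anything from Citoid or Zotero.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


# Assumption: the source page is encoded as windows-1256 (cp1256 in Python).
raw = "عاما،".encode("cp1256")  # bytes DA C7 E3 C7 A1

garbled = decode_as_utf8(raw)
print(garbled)  # three U+FFFD characters followed by U+01E1
```

The last two bytes (C7 A1) happen to form a valid UTF-8 sequence for U+01E1, which would explain the stray "ǡ" after "126" in the abstractNote above.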

Not sure if this is a different issue, but for https://www.ynet.co.il/articles/0,7340,L-5037054,00.html (correct encoding: Unicode)
Citoid generates the wrong encoding too:

[
  {
    "key": "XXVKJEYY",
    "version": 0,
    "itemType": "newspaperArticle",
    "tags": [],
    "title": "מ\"×�×ž×–×•× ×¡\" ועד \"קמיצ'לי\": ×”×ª×•×›× ×™×•×ª שבערוץ 2 ישמחו לשכוח",
    "publicationTitle": "Ynet",
    "url": "https://www.ynet.co.il/articles/0,7340,L-5037054,00.html",
    "language": "he",
    "abstractNote": "מ\"שישי שו×�ו\" ×¢×� צביקה הדר, של×� שרדה יותר מחמש ×ª×•×›× ×™×•×ª, דרך \"דרוש ×ž× ×”×™×’\", שזרקה לפח פרקי×� שהיו ×ž×•×›× ×™×� לשידור ועד \"המשימה: ×�×ž×–×•× ×¡\" של×� הצדיקה ×�ת העלויות הגבוהות - ×�×¡×¤× ×• בעבורכ×� ×�ת ×”×ª×•×›× ×™×•×ª והרגעי×� שג×� בערוץ 2 היו מעדיפי×� לשכוח. צפו",
    "date": "2017-01-11",
    "libraryCatalog": "Ynet",
    "accessDate": "2019-01-19",
    "shortTitle": "מ\"×�×ž×–×•× ×¡\" ועד \"קמיצ'לי\"",
    "author": [
      [
        "",
        "סמדר ×©×™×œ×•× ×™"
      ]
    ],
    "source": [
      "Zotero"
    ]
  }
]
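
The Hebrew output above looks like UTF-8 bytes read as windows-1252. The Python sketch below reproduces that pattern under stated assumptions: `decode_as_windows_1252` is a hypothetical helper, and Python's `cp1252` codec is used as a stand-in for whatever decoder actually ran, so its handling of undefined bytes may differ from jsdom's.

```python
def decode_as_windows_1252(raw: bytes) -> str:
    """Decode bytes as windows-1252; bytes undefined in Python's codec become U+FFFD."""
    return raw.decode("cp1252", errors="replace")


original = "אמזונס"
garbled = decode_as_windows_1252(original.encode("utf-8"))
print(garbled)

# Each Hebrew letter is two UTF-8 bytes (D7 xx), so every letter turns into
# "×" followed by one more character. When no byte was lost, the damage can
# be reversed by re-encoding and decoding correctly:
single = decode_as_windows_1252("מ".encode("utf-8"))
print(single.encode("cp1252").decode("utf-8"))
```

The first letter, alef, encodes as D7 90. Python's codec treats 0x90 as undefined, so that byte is lost and the round trip only works for letters whose bytes all map to a windows-1252 character.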
Mvolz triaged this task as High priority.
Mvolz moved this task from Backlog to Zotero on the Citoid board.
Mvolz renamed this task from Problem with encoding in Citoid - generation of reference from certain websites to Problem with encoding in Zotero results coming from Citoid. Jan 25 2019, 10:45 AM
Mvolz added subscribers: jeblad, Danmichaelo.

The problem seems to be that jsdom, the library Zotero is using, doesn't accept content-type headers and automatically decodes everything as windows-1252 :/

https://github.com/jsdom/jsdom/issues/2495
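
To illustrate what honouring the content-type header means here, the Python sketch below picks the charset from a `Content-Type` value before decoding the body. It is not Zotero's or jsdom's code: the function names are hypothetical, and defaulting to UTF-8 when no charset is given is an assumption of the sketch.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(raw: bytes, content_type) -> str:
    """Decode a response body using the charset named in its Content-Type."""
    charset = charset_from_content_type(content_type)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 (assumption of this sketch).
        return raw.decode("utf-8", errors="replace")


body = "عاما،".encode("cp1256")
print(decode_body(body, "text/html; charset=windows-1256"))
```

With the header passed through, the windows-1256 bytes decode to readable Arabic instead of being guessed at.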

I think this is a fairly new regression. Can we revert to an older Zotero until it is fixed? (Or an older jsdom, as long as that doesn't involve security issues or break dependencies.)

Unfortunately not. This is a regression in translation-server version 2. We can't revert to the older Zotero because the virtual machine that used to run it has literally been destroyed. The old one only ran on Ubuntu and now everything is Debian. (T204500)

I wonder if this should have higher priority, as it is pretty damaging for the editors. It is already pretty difficult to convince them to add sources, and fixing encoding errors isn't fun at all.

Using a lib that only uses Windows-1252 is a pretty weird thing to do…

Good work on figuring it out! :D

Mvolz added a subscriber: marcella.

Well, Martynas at Zotero figured it out :).

This is the highest priority it can reasonably be given, I think. Unbreak Now! tasks are mid-deployment-cycle type fixes. We could consider a hotfix that returns two citations, one from Zotero and one from Citoid, so the user could pick the one that does the better job, but that would mean all requests would show at least two citations, and some might appear identical. @marcella, thoughts?

Change 487124 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/zotero@master] Update Zotero; fix encoding issues

https://gerrit.wikimedia.org/r/487124

Change 487130 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] Update tests for zotero encoding issue

https://gerrit.wikimedia.org/r/487130

Change 487130 merged by jenkins-bot:
[mediawiki/services/citoid@master] Update tests for zotero encoding issue

https://gerrit.wikimedia.org/r/487130

Change 487124 merged by jenkins-bot:
[mediawiki/services/zotero@master] Update Zotero; fix encoding issues

https://gerrit.wikimedia.org/r/487124

Mvolz removed a project: Patch-For-Review.

Zotero did a quick fix for us and I've now deployed it, so hopefully things are all set now.

I confirm that the links above, which previously did not work, are now loading properly. Thank you, and thanks Zotero :)