
U+0200F in query service response for GND ID
Closed, Resolved · Public · Bug Report

Description

Problem:
About 60 GND ID statements have an invisible Unicode character at the end of the value when accessed via the query service. The character is U+0200F (“RIGHT-TO-LEFT MARK”, https://decodeunicode.org/en/u+0200F).
These values have been entered by different users.
The values are fine in the Item UI as well as in the JSON and TTL exports. The issue shows up in the CSV export and in the query service UI.
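
To check exported values for this kind of character, something like the following might help. It is only a sketch: the function name `invisible_chars` and the sample value are made up for illustration, and it flags every character in the Unicode "Cf" (format) category, which may be broader than what actually causes trouble here.

```python
import unicodedata
from typing import List, Tuple


def invisible_chars(value: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXX NAME') for every format-category (Cf) character in value."""
    found = []
    for i, ch in enumerate(value):
        # RIGHT-TO-LEFT MARK and similar invisible marks are in category "Cf"
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((i, f"U+{ord(ch):04X} {name}"))
    return found


# Hypothetical GND ID value with a trailing RIGHT-TO-LEFT MARK
sample = "118540238\u200f"
print(invisible_chars(sample))
```

Running this over the values in the CSV export should list the affected statements, while values that are clean return an empty list.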

Example:
This file contains Items that have a GND ID statement with the issue, as well as the values for these statements:

Possibly related:

Event Timeline

Restricted Application added a project: Wikidata. Jun 11 2019, 3:10 PM
Restricted Application added a subscriber: Aklapper.
elal added a subscriber: elal. Jun 11 2019, 3:14 PM
Smalyshev changed the subtype of this task from "Task" to "Bug Report". Jun 11 2019, 9:42 PM
Smalyshev added subscribers: Igorkim78, Smalyshev.

It's the same issue as T197447. We probably need to think about a more systemic solution for this. Changing the ICU collation may increase data size and lead to other complications, so maybe we should filter the data instead and remove characters like U+0200F?
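
As a rough illustration of the filtering idea (not the actual implementation in wikidata/query/rdf), a filter could look like the sketch below. The set `INVISIBLE` and the function `strip_invisible` are hypothetical; which characters really need to be removed is an assumption here.

```python
# Hypothetical set of characters to drop; the real list would need to be
# decided based on which characters actually trip up the collation.
INVISIBLE = {
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
}


def strip_invisible(value: str) -> str:
    """Return value with every character listed in INVISIBLE removed."""
    return "".join(ch for ch in value if ch not in INVISIBLE)


# Hypothetical GND ID value with a trailing RIGHT-TO-LEFT MARK
cleaned = strip_invisible("118540238\u200f")
print(repr(cleaned))
```

An explicit list keeps the filter narrow, so that format characters which are legitimately needed in other values (for example in labels in right-to-left scripts) are not removed by accident.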

Smalyshev triaged this task as Medium priority. Jun 12 2019, 4:57 AM

Change 516981 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Filter invisible characters that cause trouble in Blazegraph ICU collation

https://gerrit.wikimedia.org/r/516981

Change 516981 merged by jenkins-bot:
[wikidata/query/rdf@master] Filter invisible characters that cause trouble in Blazegraph ICU collation

https://gerrit.wikimedia.org/r/516981

Smalyshev closed this task as Resolved. Jun 17 2019, 9:57 PM
Smalyshev claimed this task.