
U+0200F in query service response for GND ID
Closed, Resolved · Public · Bug Report

Description

Problem:
About 60 GND ID statements have an invisible Unicode character at the end of the value when accessed via the query service. It's U+0200F (“RIGHT-TO-LEFT MARK”, https://decodeunicode.org/en/u+0200F).
These values have been entered by different users.
The values are fine in the Item UI as well as in the JSON and TTL exports. The issue shows up in the CSV export and in the query service UI.
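One way to confirm which exported values are affected is to inspect the end of each string for invisible format characters. The following is a minimal sketch, not part of the original report: the function name `trailing_format_chars` and the sample ID are hypothetical, and it assumes that checking the Unicode general category `Cf` (which includes RIGHT-TO-LEFT MARK) is a good enough test.

```python
import unicodedata


def trailing_format_chars(value: str) -> list:
    """Return the Unicode names of invisible format characters
    (general category Cf) found at the end of `value`, in order."""
    names = []
    for ch in reversed(value):
        if unicodedata.category(ch) != "Cf":
            break
        # Fall back to the code point if the character has no name.
        names.append(unicodedata.name(ch, "U+%04X" % ord(ch)))
    names.reverse()
    return names


# Hypothetical sample value, made up for illustration only.
sample = "118540238\u200f"
print(repr(sample), trailing_format_chars(sample))
```

Printing with `repr` makes the otherwise invisible `\u200f` escape visible, which is useful when comparing CSV output against the JSON export.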

Example:
This file contains Items that have a GND ID statement with the issue as well as the values for these statements:

Possibly related:

Event Timeline

Restricted Application added a subscriber: Aklapper.
Smalyshev changed the subtype of this task from "Task" to "Bug Report". · Jun 11 2019, 9:42 PM
Smalyshev added subscribers: Igorkim78, Smalyshev.

It's the same issue as T197447. We probably need to think about a more systemic solution for this... Changing the ICU collation may increase data size and lead to other complications, so maybe filter the data and remove characters like U+0200F?
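The filtering idea could look roughly like the sketch below. This is an illustration only, not the code from the patch. The set `INVISIBLE` is an assumption: the task names only U+200F, and LEFT-TO-RIGHT MARK (U+200E) is added here purely as an example of a similar character. The function name `strip_invisible` is also hypothetical.

```python
# Assumed set of characters to drop. Only U+200F is named in this task;
# U+200E is included as an illustrative, similar character.
INVISIBLE = {"\u200e", "\u200f"}

# str.translate deletes every code point that is mapped to None.
_DELETE_TABLE = {ord(ch): None for ch in INVISIBLE}


def strip_invisible(value: str) -> str:
    """Return `value` with the assumed invisible characters removed."""
    return value.translate(_DELETE_TABLE)


# Hypothetical sample value, made up for illustration only.
print(repr(strip_invisible("118540238\u200f")))
```

Filtering at load time leaves the collation settings untouched, at the cost of the stored literal no longer being byte-identical to the value entered by the user.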

Change 516981 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Filter invisible characters that cause trouble in Blazegraph ICU collation

https://gerrit.wikimedia.org/r/516981

Change 516981 merged by jenkins-bot:
[wikidata/query/rdf@master] Filter invisible characters that cause trouble in Blazegraph ICU collation

https://gerrit.wikimedia.org/r/516981

Smalyshev claimed this task.