
WDQS confuses strings with and without U+00AD
Closed, Resolved · Public

Description

Last February I removed all soft hyphens (U+00AD) from the labels on Q57730933, but over time, different users have (inadvertently) added them back in. The third user who did this was able to tell me that their source for the labels is WDQS. They provided this query, which (currently) reproduces the problem. If you run this query and copy and paste the label "Hajo Beeckman" into a capable text editor, you will notice it has a soft hyphen between k and m, while if you copy it from the web interface, it doesn't. So it seems to me that WDQS returns stale data, at least in this case.

Event Timeline

Restricted Application added a project: Wikidata. Apr 16 2019, 9:07 PM
Restricted Application added a subscriber: Aklapper.

The data seems to be updated in the database (revision 915966700). I suspect it may be the same issue as with T197447 - Blazegraph conflates strings with some invisible characters and strings without them.

Smalyshev renamed this task from "WDQS returns stale data" to "WDQS confuses strings with and without U+00AD". Apr 18 2019, 9:09 PM
Smalyshev triaged this task as Medium priority.

Since collation is non-identical, I would have expected some kind of stringprep-like transformation to prevent these kinds of problems, with the added benefit of cleaner input.
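To illustrate what a stringprep-like transformation could look like, here is a minimal Python sketch. It assumes that dropping the characters RFC 3454 table B.1 "maps to nothing" (which include U+00AD and the zero-width space) would be acceptable for labels; that is an assumption, not something WDQS or Wikidata is known to do. The function name and the sample string are made up for illustration.

```python
import stringprep


def strip_mapped_to_nothing(text: str) -> str:
    """Drop characters that RFC 3454 table B.1 maps to nothing.

    Hypothetical helper: only the B.1 mapping step of stringprep is
    applied here, not a full profile (no case folding, no NFKC).
    """
    return "".join(ch for ch in text if not stringprep.in_table_b1(ch))


# Illustrative label with a soft hyphen (U+00AD) between "k" and "m".
dirty = "Hajo Beeck\u00adman"
clean = strip_mapped_to_nothing(dirty)

# The two strings look the same when rendered but differ in length.
print(len(dirty), len(clean))
```

Applying such a mapping on input would mean the store never sees both variants of a string, so it would not matter whether it can tell them apart.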

Smalyshev closed this task as Resolved. May 2 2019, 9:33 PM

This is fixed now.

Indeed! Could you provide a link to the change(s)?

There's no link, I just manually fixed the data for now.

Bdijkstra added a comment. Edited May 3 2019, 10:34 PM

I guess you fixed only this example? I can perhaps produce more. I have a list of suspect items, though I don't know how to formulate a query to check all labels of one or more items. And then there are probably other bad labels sourced from other wikis and other problematic characters like the zero-width space...
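One way to check labels outside of a query is to scan them locally once they have been fetched. The Python sketch below assumes the labels are already available as a mapping from language code to label text; how they are obtained is left open, and the function name and sample data are hypothetical. It flags every character in Unicode general category Cf (format), which covers both the soft hyphen and the zero-width space.

```python
import unicodedata


def find_format_chars(labels):
    """Return (language, index, code point, name) for each Cf character.

    `labels` is assumed to map a language code to a label string.
    """
    hits = []
    for lang, label in labels.items():
        for index, ch in enumerate(label):
            if unicodedata.category(ch) == "Cf":
                code_point = f"U+{ord(ch):04X}"
                hits.append((lang, index, code_point, unicodedata.name(ch, "UNNAMED")))
    return hits


# Made-up sample data: one label with a soft hyphen, one without.
sample = {"nl": "Hajo Beeck\u00adman", "en": "Hajo Beeckman"}
for hit in find_format_chars(sample):
    print(hit)
```

Category Cf also contains characters that are legitimate in some scripts, such as the zero-width joiner and non-joiner, so hits are candidates for review rather than definite errors.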

@Bdijkstra we don't have a generic solution for this yet, but if you provide a list of bad literals, it can be fixed.