Last February I removed all soft hyphens (U+00AD) from the labels on Q57730933, but over time, different users have (inadvertently) added them back in. The third user who did this was able to tell me that their source for the labels is WDQS. They provided this query, which (currently) reproduces the problem. If you run this query and copy+paste the label "Hajo Beeckman" into a capable text editor, you will notice it has a soft hyphen between k and m, while if you do the same from the web interface, it doesn't. So it seems to me that WDQS returns stale data, at least in this case.
The data seems to be updated in the database (revision 915966700). I suspect it may be the same issue as with T197447 - Blazegraph conflates strings with some invisible characters and strings without them.
Since collation is non-identical, I would have expected some kind of stringprep-like transformation to prevent these kinds of problems, with the added benefit of cleaner input.
I guess you fixed only this example? I can perhaps produce more. I have a list of suspect items, though I don't know how to formulate a query to check all labels of one or more items. And then there are probably other bad labels sourced from other wikis and other problematic characters like the zero-width space...
@Bdijkstra we don't have a generic solution for this yet, but if you provide a list of bad literals, it can be fixed.
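One way such a list of bad literals could be put together is to scan label strings for invisible characters offline. The following is only a minimal sketch, not anything WDQS or Wikidata provides: the function names `find_invisible` and `strip_invisible` are hypothetical, and it assumes that every character in Unicode category Cf (format) is suspect. That assumption covers the soft hyphen (U+00AD) and the zero-width space (U+200B), but it also flags joiners and directional marks that are legitimate in some scripts, so hits would still need manual review.

```python
import unicodedata


def find_invisible(label):
    """Return (index, code point, name) for each format (Cf) character in label.

    Assumption: any Cf character is worth reporting; this is a heuristic.
    """
    hits = []
    for i, ch in enumerate(label):
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits


def strip_invisible(label):
    """Return label with all format (Cf) characters removed."""
    return "".join(ch for ch in label if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    # The label from this task, with a soft hyphen between "k" and "m".
    label = "Hajo Beeck\u00adman"
    for index, codepoint, name in find_invisible(label):
        print(f"{label!r}: {codepoint} {name} at index {index}")
    print(repr(strip_invisible(label)))
```

Run over the labels of the suspect items, the reported positions and code points give the literals to fix, and `strip_invisible` gives a candidate replacement for each.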