Page MenuHomePhabricator

[Problem] `Special:SetAliases`, `Special:NewItem` and `Special:NewProperty` separates aliases with | character but have no way to make a | part of one alias' text
Open, Needs TriagePublic

Description

Problem

Special:SetAliases, Special:NewItem and Special:NewProperty pages do not allow the user to enter an alias that contains a | character in it's text. This is because those pages provide one field to enter aliases that instructs the user to enter all of them, separated with a | character.

Since a | character might have legitimate use-cases, being part of the alias text, such as in the following:

we have to find a way to allow the user to use |s in aliases when they edit them through Special:SetAliases, Special:NewItem and Special:NewProperty pages.

Solution

Iteration 1 Acceptance Criteria

  • When Special:SetAliases or Special:SetLabelDescriptionAliases is given an item or a property that already has an alias containing | save action should fail with the error message TODO.
    • such case should be tracked (i.e. counted). data should be split by the special page used

      pages fail and inform the user that there's already one or more alias on the item that contain a |, and that they will have to edit the item/property on their page (after creating them if they do not exist yet) in order to be able to set aliases with |s in them.
    • a data point is collected on how many times this failure happens (only count per week/monght is enough) to understand the size of the issue and impact of the solution

This iteration is taken as a first step to eliminate the unintended wrong aliases being set on items/properties, and to collect enough data to know how big the problem is. Given that T118065 was created in 2015 and never been addressed, it could be that there's still no enough usage of those pages in combination with this case to justify costs of later iterations yet.


the next iteration can be either done as an A/B test or one alternative is selected to be implemented, that's decided depending on learnings from prev. iteration

Iteration 2.A Acceptance Criteria

  • on Special:SetAliases, Special:NewItem and Special:NewProperty pages, in alaises field, the user can continue to split aliases using |s, and can also escape those that must be part of the alias text by prefixing them with another pipe ||.

Iteration 2.B Acceptance Criteria

  • without javascript provide the user a way to reload the page with as many aliases fields to fill in one-by-one as they need
  • with javascript use the TagMultiselectWidget from OOUI

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 28 2019, 1:31 PM
Addshore triaged this task as Normal priority.Mar 28 2019, 1:34 PM
Addshore moved this task from Incoming to Ready to estimate on the Wikidata-Campsite board.
This comment was removed by Greta_Doci_WMDE.

Change 502530 had a related patch set uploaded (by Greta WMDE; owner: Greta Doçi):
[wikibase/javascript-api@master] Replace pipe character with the standard alternative character

https://gerrit.wikimedia.org/r/502530

I'm sorry for my rude comment at https://gerrit.wikimedia.org/r/502530. I made that in a hurry.

The two relevant lines of code are these:

Note that neither of these is wrong. The ability to enter multiple aliases separated with a pipe character is a documented feature of the special page: "Several aliases are separated by a pipe (|) character." I would argue we are not really talking about a bug here, but about a missing feature that was forgotten in the design of the special page.

Possible solutions I can think of:

  • Let the special page fail hard when one of the existing aliases the code is going to implode() already contains a pipe character. This is not really a solution, but prevents unintentional data loss.
  • Ask the user to escape pipe characters with a backslash when they aren't meant to be separation characters. Before doing the implode(), the code also adds backslashes to all | as well as \ in the existing aliases. After the explode, these (but only these two) are removed. The biggest issue with this solution is that it is not user friendly, and introduces a new escaping rule that was (to my knowledge) never used in Wikibase before.
  • The most user friendly solution I can think of is to use the TagMultiselectWidget from OOUI: https://doc.wikimedia.org/oojs-ui/master/demos/?page=widgets&theme=wikimediaui&direction=ltr&platform=desktop. The problem with this is that it does not work without JavaScript, but the main purpose of the special page is to provide a no-JS fallback. This effectively means this solution can not be used – except OOUI provides a no-JS alternative for this, which I think it doesn't, but I might miss something.
  • The special page can be redesigned to have one input field for each alias. At the end is an "add" button that reloads the special page (remember, it must work without JavaScript) and adds a new empty field for entering another alias. To make this easier for the user, the special page could always present 10 input fields, and the "add" button always add 10 more. Empty ones and duplicates are ignored anyway (the special page does not even need to do anything for this, I believe the data model classes already take care of this).

Whatever route you decide to go, to me it feels surprisingly expensive. I wonder why this task does not have any estimate, discussion, or investigation results?

Sorry for the confusion, and for not updating the ticket before uploading the patch. After consultation with Lydia, we decided that we disallow | characters in aliases and make item pages compatible with the current behavior of Special:SetAliases.

We have checked some of the existing cases (aliases with |) and they all seemed to be non-legitimate (looks like entered by mistake, or some aliases are including wikitext image tags and some seemed to confuse it maybe because they came from Special:SetAliases and thought it functions the same .. etc).

​That will have one minor impact on summary messages when users try to edit aliases of an item that already have some aliases containing a |. Though that seems acceptable.

The plan is to announce this change before deploying it so that we allow for feedback in case anyone has legitimate reasons to ask for allowing | characters in aliases.

alaa_wmde renamed this task from Special:SetAliases should not be destructive when editing aliases that include the | character to Item pages should split alias entries containing | character to be consistent with Special:SetAliases.Apr 10 2019, 2:44 PM
alaa_wmde updated the task description. (Show Details)
alaa_wmde added subscribers: Lea_Lacroix_WMDE, alaa_wmde.
alaa_wmde added a comment.EditedApr 10 2019, 2:57 PM

I updated task description according to what Lydia, Greta and I decided after discussing this issue yesterday. (We will still confirm with a feedback mail on relevant mailing-list)

@thiemowmde I'm sorry again about the confusion caused by the patch being in review before the task description was updated. The input you provided is very helpful, and I will take these options to the mainling-list feedback email that we plan to send out asap.

Are we really sure disallowing aliases with |s in is the right path forward?
What are we going to do with the 56k aliases that currently have pipes in them? (Note this is 56k alias strings, not 56k items (it will be more items I expect).

This also feels like a restriction on the data model that could be solved in a better way.
If we look at this task from a wikibase perspective disallowing a | character seems very silly.

alaa_wmde added a comment.EditedApr 11 2019, 2:29 PM

Are we really sure disallowing aliases with |s in is the right path forward?
What are we going to do with the 56k aliases that currently have pipes in them? (Note this is 56k alias strings, not 56k items (it will be more items I expect).

we can migrate them if we decide for that.. eps. that the ones we looked at all seemed to contain a | due to a bot or mistake more than intended by the user. I can be wrong, is there a way to get the full list? we thought we would actually ask for feedback on our mailing lists for use cases instead.

This also feels like a restriction on the data model that could be solved in a better way.
If we look at this task from a wikibase perspective disallowing a | character seems very silly.

It is a restriction, but is it an undesirable one? I cannot think of any use case where a | is an important part of an alias (I mean | is more of a computer character than any specific language character that is irreplaceable) . Especially when we are using it ourselves as a separator in display and on Special:SetAliases with no one objecting to that of our users, I believe. or was this task created because some users wanted to add |s in Special:SetAliases?

It does sound to me that | then should be treated as a special character that is not allowed (then we would either error or split automatically) to make things more consistent, but that isn't the only solution of course and the suggested one previously (allowing | in Special:SetAliases) also makes it consistent.

I'm afraid I have to agree with @Addshore here. Sure, it might be that many of the aliases that currently contain pipe characters are mistakes. It might even be the vast majority. But even if this is the case, …

  • There are certainly items that actually have a title or alternative titles that require the use of one or more pipe characters. Why should we ban them from the Wikidata knowledge base? Just because of a bug? Why not fix it instead? I drafted a few alternatives above.
  • Disallowing a random low-ASCII (!) character just feels wrong. Again, why? Why not, let's say, disallow comma, semicolon, and more non-alphanumeric characters as well? They all might be from a broken import.
  • Wikibase disallows tabs and vertical whitespace (newlines and such) for good reasons. But we never blocked the use of a visible character, as far as I'm aware of.

I mean | is more of a computer character than any specific language character […]

I find this a very dangerous argument. I'm not aware of a policy that disallows to add and describe "computer stuff" in the Wikidata knowledge base. Even if https://www.wikidata.org/wiki/Q1132051 would be the only item that requires the use of an actual pipe character in labels and aliases, a single item like this is reason enough for me to heavily veto the idea of blocking any visible character – except it is for security reasons. @Lydia_Pintscher, I can't believe you found this a good idea.

Item pages should split alias entries containing | character to be consistent with Special:SetAliases

Oh my. I just realized this task was turned around, and the bug should now become a "feature". Why? Don't we have a UI that allows to press enter after each alias? Since when do users expect that a random character acts as a separator? May I see the places where users actually requested this?

Sorry for writing so much, but I feel there is an important lesson to learn here, despite this appearing like it would be a minor thing. I feel this highlights a quite fundamental misunderstanding.

See, limiting the use of a character is, in a way, not different from limiting the gender property to be binary male/female, or limiting a city's major to be an instance of human, or limiting units to registered SI units only. All these ideas sound good at first glance. After all, you must be a human being to become the major of a city, don't you? You can't be anything but male or female, right? All units are convertible to SI, aren't they? A pipe character can't be anything but a separator character, or can it?

The answer to all these questions is no.

  • I hope the gender example is obvious in 2019.
  • Wikidata knows a cat that was major of a city for a brief moment in history.
  • We got very precise measurements of ancient buildings that have been done in units we don't know exactly how to convert to SI.
  • A pipe character is a very old low-ASCII character that was put to use for a gazillion different things you could fill a book with. (Note: The "low" refers to the first 128 characters that have already been used on 7-bit computers in the 1960s.)

Wikidata's answer to all this was, to this date, to not hard-code arbitrary limits, but be open by default, and fix mistakes after they happened.

Supporting the full range of characters Unicode has to offer was, to this date, a consequence and important part of this "open by default" strategy. I can see reasons to have exceptions, e.g. security reasons, as mentioned already. But the issue discussed in this ticket here is certainly not one of these reasons, if you ask me.

Asking users to provide "legitimate reasons" to be allowed to use one of the (literally) most basic characters is, … how to put this? It scares me. It's like asking users to provide "legitimate reasons" to use Wikidata with a screenreader software, or asking users to first provide "legitimate reasons" why they want to become a community member before they are allowed to do their first edit.

@Lydia_Pintscher, what changed in Wikidata's philosophy since I left the team?

Ok let's take a step back and not make this more than it is.
We do have since very early in Wikidata the special page. It has always used the pipe character to separate aliases. For some reasons not known to me - probably oversight - the JS interface acts differently. This currently leads to data loss. That's not acceptable.
Now we have two ways forward to fix this:

  1. consistently treat the pipe character as an alias separator everywhere
  2. change the special page somehow

We are doing 1 now because that is relatively easily possible. It is also in line with what has been assumed when inputting the data in the examples we looked at. We will then see what the editors say and move forward from there.

I had a quick look around and found a few examples of legitimate aliases or labels with |s in:

and this only took me a couple of minutes and just looking in English.

It does look like lots of labels and aliases with |s in are indeed bad, but there are ligitimate cases in there.

We will then see what the editors say and move forward from there.

I think we need to see what the editors say before we do what is currently being suggested, as it changes the behavior for the majority of users, who will be using the JS UI not the special page., and also would disallow the use of | in aliases which has always been fine until now.

Disallowing |s in aliases will also mean that they should be disallowed in labels as otherwise this will cause us issues when items are merged which we would then have to consider and work around in some way. (as the label of the item being merged from becomes the aliases of an item being merged to)

Wow in minutes .. it took a query on wb_terms >5 minutes to get 50 results only. Same on Query Service. How did you find those?

As for the task, this makes the costs of solution 1 we did so far even higher than solution 2 (change SetAliases to support | somehow). Thanks @Addshore for pulling these missing data for us. (I hope we will start to include that from now on before hand when it is a solution task ;)

@Lydia_Pintscher I will update the task again with these cases, and take a look at the suggest options by @thiemowmde for solution 2 to see which one we go with and consult with you on Monday.

sarhan.alaa updated the task description. (Show Details)Apr 13 2019, 7:03 AM
sarhan.alaa updated the task description. (Show Details)Apr 13 2019, 7:05 AM
sarhan.alaa moved this task from incoming to ready to go on the Wikidata board.Apr 13 2019, 7:07 AM

Moved back to be briefly discussed, estimated and broken down.

Wow in minutes .. it took a query on wb_terms >5 minutes to get 50 results only. Same on Query Service. How did you find those?

I found some "legit" |'s in aliases by sticking a | in the wikidata.org search box. e.g. https://www.wikidata.org/wiki/Q120812 is in the top few results. I guess you could maybe do this more efficiently with direct elasticsearch queries if you want to get more data about the good vs bad ones (for sure there are lots that are mistakes)

alaa_wmde updated the task description. (Show Details)Apr 19 2019, 7:15 AM
alaa_wmde renamed this task from Item pages should split alias entries containing | character to be consistent with Special:SetAliases to `Special:SetAliases`, `Special:NewItem` and `Special:NewProperty` separates aliases with | character but have no way to make a | part of one alias' text.Apr 19 2019, 7:39 AM
alaa_wmde raised the priority of this task from Normal to Needs Triage.
alaa_wmde updated the task description. (Show Details)
alaa_wmde updated the task description. (Show Details)
alaa_wmde renamed this task from `Special:SetAliases`, `Special:NewItem` and `Special:NewProperty` separates aliases with | character but have no way to make a | part of one alias' text to [Problem] `Special:SetAliases`, `Special:NewItem` and `Special:NewProperty` separates aliases with | character but have no way to make a | part of one alias' text.Apr 19 2019, 7:43 AM

@Lydia_Pintscher I refined this task based on all the info we have got so far. From my perspective, it sounds ready to be discussed, estimated and broken down into iteration tasks and further sub-tasks if needed.

alaa_wmde updated the task description. (Show Details)Apr 19 2019, 8:05 AM

Ok let's move ahead with this.
Thanks all for figuring out a way forward.

Change 502530 abandoned by Thiemo Kreuz (WMDE):
Replace pipe character with the standard alternative character

https://gerrit.wikimedia.org/r/502530

WMDE-leszek updated the task description. (Show Details)May 7 2019, 1:09 PM
WMDE-leszek updated the task description. (Show Details)
WMDE-leszek updated the task description. (Show Details)May 7 2019, 1:20 PM

For "Iteration 1 Acceptance Criteria" rather than specifically counting, I think logging these would be enough, then we can count these in logstash.

Waiting on iteration 1 https://phabricator.wikimedia.org/T223270 to be completed and enough time to pass for data collection