Page MenuHomePhabricator

[Task] Apply normalization to string values in statements
Open, Needs TriagePublic

Description

I suggest to trim leading/trailing spaces around statements of string type, and apply unicode normalization.

Due to a c&p error I added some trainling spaces at a VIAF statement and had to remove them with an extra edit, see URL.

Example URL: https://www.wikidata.org/w/index.php?title=Q6486700&diff=10051134&oldid=10051107


Other issues:

  • CR
  • unprintable characters (if not included in unicode normalization)

Details

Reference
bz45925

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:32 AM
bzimport set Reference to bz45925.
bzimport added a subscriber: Unknown Object (MLST).
Raymond created this task.Mar 9 2013, 9:04 AM

The "malformed input" error is thrown by the RegexValidator in function buildStringType from WikibaseDataTypeBuilders. It often happens when copy-pasting text from other sources (see http://www.wikidata.org/wiki/Wikidata:Project_chat#Error when adding commonscat) and it is made worse by bug 63301 which sometimes adds newlines when pressing return to save the claim, thus triggering the error.

The trimming was previously done in ValueView but it has since been removed (see http://github.com/wmde/ValueView/commit/0a8350999bb1ee028db9487286173aee0b20640f). We could use the StringNormalizer class to trim the string and also remove incomplete UTF-8 sequences (see related bug 50486), however I'm not familiar with the data value processing code so I'm not sure where is the correct place to do it.

Just to be clear: ValueView did not trim the value, it just checked for empty or whitespace-only strings.

At the time of writing, the current behaviour seems to be: When a value with a leading space character is submitted, the back-end parser will return a "malformed input" error.
I wonder whether silent trimming should rather be implemented in the back-end parser instead of in the front-end.
Can some authority please decide on that?

IIRC When I discussed this with Daniel his comment was that the parser shouldn't do anything to the string but only parse it. That makes sense to me.
If we only do it in the frontend then these could still get in via the API for example. That seems sub-optimal.

Daniel: Can you chime in on the pros and cons of doing this in the backend please?

I said that? Hm, then I have to disagree with myself there... The parsevalue API module (resp. StringValueParser) should trim input (and apply utf8 normalization).

Hm... apparently, StringValueParser does not create StringValues from strings. We seem to use the NullParser for this, which returns an UnknownValue. Whatever.

So, let's have an actual StringValueParser that applies normalization.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:33 PM
Lydia_Pintscher added a project: patch-welcome.
Lydia_Pintscher set Security to None.
Lydia_Pintscher removed a subscriber: Wikidata-bugs.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 27 2015, 7:25 AM
thiemowmde renamed this task from Trim spaces around statements of string type to [Task] Trim spaces around statements of string type.Aug 13 2015, 3:19 PM
thiemowmde lowered the priority of this task from High to Medium.
thiemowmde updated the task description. (Show Details)
thiemowmde added a project: DataValues.

wbparsevalue now supports datatype=string. The UI should start using it. In fact, the UI should use the same code to parse, and at the same time validate, all data values. All special case code for string values should go away.

daniel raised the priority of this task from Medium to High.Aug 27 2015, 3:09 PM

Bumping to high because of implications missing unicode normalization. This could potentially break JSON dumps.

daniel renamed this task from [Task] Trim spaces around statements of string type to [Task] Apply normalization to string values in statements.Aug 27 2015, 3:11 PM
daniel updated the task description. (Show Details)

From what I wrote in T147037:

I think it's a frequent problem when strings start or end with odd characters or even include CR in the middle.

It happens a lot when trying to create statements for article titles. Oddly, I think the same text for labels tends to work.

Instead of rejecting it completely, it would be helpful if a cleaned version was offered for saving instead.

Esc3300 updated the task description. (Show Details)
daniel added a comment.Oct 2 2016, 4:34 PM

Please distinguish between

a) normalization applied by the API, when receiving a JASON structure representing a sting value.
b) normalization applied by the API when parsing user input, and returning a string value
c) normalization applied by the JavaScript widget before sending user input to the API

I believe we should do (b), but not (a). We can also do (c), but with (b) in place, this seems redundant.

The reason I believe we should not do (a) is that we are not dealing with user input then, but with a DataValue object. DataValue objects are either valid or they are not, we should not guess intent. Clients that supply DataValues should take care to apply any normalization they desire.

thiemowmde removed thiemowmde as the assignee of this task.Sep 4 2018, 12:02 PM
thiemowmde raised the priority of this task from High to Needs Triage.
thiemowmde removed a project: Patch-For-Review.