Page MenuHomePhabricator

[Task] Apply normalization to string values in statements
Open, Needs TriagePublic


I suggest to trim leading/trailing spaces around statements of string type, and apply unicode normalization.

Due to a c&p error I added some trainling spaces at a VIAF statement and had to remove them with an extra edit, see URL.

Example URL:

Other issues:

  • CR
  • unprintable characters (if not included in unicode normalization)



Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:32 AM
bzimport set Reference to bz45925.
bzimport added a subscriber: Unknown Object (MLST).

The "malformed input" error is thrown by the RegexValidator in function buildStringType from WikibaseDataTypeBuilders. It often happens when copy-pasting text from other sources (see when adding commonscat) and it is made worse by bug 63301 which sometimes adds newlines when pressing return to save the claim, thus triggering the error.

The trimming was previously done in ValueView but it has since been removed (see We could use the StringNormalizer class to trim the string and also remove incomplete UTF-8 sequences (see related bug 50486), however I'm not familiar with the data value processing code so I'm not sure where is the correct place to do it.

Just to be clear: ValueView did not trim the value, it just checked for empty or whitespace-only strings.

At the time of writing, the current behaviour seems to be: When a value with a leading space character is submitted, the back-end parser will return a "malformed input" error.
I wonder whether silent trimming should rather be implemented in the back-end parser instead of in the front-end.
Can some authority please decide on that?

IIRC When I discussed this with Daniel his comment was that the parser shouldn't do anything to the string but only parse it. That makes sense to me.
If we only do it in the frontend then these could still get in via the API for example. That seems sub-optimal.

Daniel: Can you chime in on the pros and cons of doing this in the backend please?

I said that? Hm, then I have to disagree with myself there... The parsevalue API module (resp. StringValueParser) should trim input (and apply utf8 normalization).

Hm... apparently, StringValueParser does not create StringValues from strings. We seem to use the NullParser for this, which returns an UnknownValue. Whatever.

So, let's have an actual StringValueParser that applies normalization.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:33 PM
thiemowmde renamed this task from Trim spaces around statements of string type to [Task] Trim spaces around statements of string type.Aug 13 2015, 3:19 PM
thiemowmde lowered the priority of this task from High to Medium.
thiemowmde updated the task description. (Show Details)
thiemowmde added a project: DataValues.

wbparsevalue now supports datatype=string. The UI should start using it. In fact, the UI should use the same code to parse, and at the same time validate, all data values. All special case code for string values should go away.

daniel raised the priority of this task from Medium to High.Aug 27 2015, 3:09 PM

Bumping to high because of implications missing unicode normalization. This could potentially break JSON dumps.

daniel renamed this task from [Task] Trim spaces around statements of string type to [Task] Apply normalization to string values in statements.Aug 27 2015, 3:11 PM
daniel updated the task description. (Show Details)

From what I wrote in T147037:

I think it's a frequent problem when strings start or end with odd characters or even include CR in the middle.

It happens a lot when trying to create statements for article titles. Oddly, I think the same text for labels tends to work.

Instead of rejecting it completely, it would be helpful if a cleaned version was offered for saving instead.

Please distinguish between

a) normalization applied by the API, when receiving a JASON structure representing a sting value.
b) normalization applied by the API when parsing user input, and returning a string value
c) normalization applied by the JavaScript widget before sending user input to the API

I believe we should do (b), but not (a). We can also do (c), but with (b) in place, this seems redundant.

The reason I believe we should not do (a) is that we are not dealing with user input then, but with a DataValue object. DataValue objects are either valid or they are not, we should not guess intent. Clients that supply DataValues should take care to apply any normalization they desire.

thiemowmde raised the priority of this task from High to Needs Triage.
thiemowmde removed a project: Patch-For-Review.

This has been known about for SEVEN YEARS?? Why do I even bother reporting anything

Because unfortunately there are too many to not loose track of some important ones :(

The only real option that I see working if we want to allow whitespace values in the UI is....

  • Have a list of whitespace chars that we want to trim
  • If a string is made up ONLY of these whitespace chars, do not trim
  • If a string has these whitespace chars at the beginning or end, and also has other chars, trim them?

If the cases of having whitespace at the start or end is all covered by the “only whitespace” case, then we can just add this validation in the API, and if commons calls this api (as it should) then this would fix both issues. T251380 and this

I suggest two separate data types for strings that should be trimmed (which should be the norm) and "binary" strings that aren't.