
Special:NewItem and Special:NewProperty allow creation of items with term language as any string
Closed, ResolvedPublic

Description

Special:NewItem and Special:NewProperty allow creation of items and properties with a term language not contained within the allowed list. For example "TEST".

Some level of validity checking still occurs, rejecting codes with bad characters (T138724).

This is likely very similar to T39459, although that task is about using uselang with an invalid code (which is different but related to this case), and T39459 doesn't actually allow the addition to go through as far as I can tell.

This may all be related to the fix for T115792 @ https://gerrit.wikimedia.org/r/#/c/291878/

See for example my test @ https://test.wikidata.org/w/api.php?action=wbgetentities&ids=Q2528

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. Jun 26 2016, 3:30 PM
Restricted Application added subscribers: Luke081515, TerraCodes, Urbanecm. Jun 26 2016, 3:31 PM
Addshore updated the task description. Jun 26 2016, 3:34 PM
Addshore awarded a token.
Nikki added a subscriber: Nikki. Jun 26 2016, 4:00 PM

I tested it on test.wikidata.org too and it seems that it uses whatever you type if you don't explicitly select an option. For example, if I type "English" and submit the form, it uses "English" as the language code. On top of that, filtering only seems to work if you type a language code, so typing "hr" will highlight Croatian, but typing "Croatian" does nothing. If you do explicitly select something, it only displays the language code which also might encourage people to try to "fix" it by entering the language name instead.

Nikki added a comment. Jun 26 2016, 4:07 PM

It seems I also can't remove the bad terms. When I tried to delete the "Türkçe" description on https://www.wikidata.org/wiki/Q6052351, I got "Unrecognized value for parameter 'language': Türkçe".

Change 296198 had a related patch set uploaded (by Addshore):
Don't allow bad lang codes in SpecialNewEntity

https://gerrit.wikimedia.org/r/296198

Once the fix is in place these should be easily fixable using wbeditentity (I think).

Addshore moved this task from incoming to ready to go on the Wikidata board. Jun 28 2016, 9:30 PM

Change 296198 merged by jenkins-bot:
Don't allow bad lang codes in SpecialNewEntity

https://gerrit.wikimedia.org/r/296198

Does this still need a backport? Or can this be closed as resolved?

@thiemowmde Depending on when this would normally get deployed, a backport would make sense.
We should also discuss cleaning up the bad data (and making sure that is possible).

Smalyshev added a subscriber: Smalyshev. Edited · Jul 9 2016, 12:52 AM

It is also impossible to delete such a value: see https://www.wikidata.org/wiki/Q208693 and try to remove the "aln espaniol" one; you'd get "Unrecognized value for parameter 'language': aln espaniol". I think deletion should be possible in this case.

https://www.wikidata.org/wiki/User:Pasleim/Language_statistics_for_items indicates we have a bunch of bad items. Too bad it doesn't list Q-ids for the items related to those strings; it would then be easier to get a bot to clean them up. I can try to make a tool that scans the dump for them, though.

I had a look at the wbsetlabel API module, but it's not possible to have API parameters that depend on each other. The only solution would be to make the language parameter a plain string, with no strict limitation to a set of known languages, and do the strict validation manually whenever a label is set.

Same must then be done for all API modules.

I suggest not doing this, because it makes the code error-prone and is not necessary after a proper cleanup.
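For illustration, the manual validation described above might look roughly like this. This is a hypothetical sketch, not Wikibase's actual implementation: the `ALLOWED_TERM_LANGUAGES` set and the `validate_term_language` helper are stand-ins for Wikibase's real content-languages list and validation code.

```python
# Hypothetical sketch: accept the language parameter as a free-form
# string, then validate it against a known set of allowed term language
# codes before applying the edit. ALLOWED_TERM_LANGUAGES is a tiny
# stand-in for Wikibase's real list of content languages.

ALLOWED_TERM_LANGUAGES = {"en", "de", "fr", "hr", "tr", "aln"}

def validate_term_language(code: str) -> str:
    """Return the normalized code if it is an allowed term language, else raise."""
    normalized = code.strip().lower()
    if normalized not in ALLOWED_TERM_LANGUAGES:
        # Mirrors the error message users saw: "Unrecognized value for
        # parameter 'language': ..."
        raise ValueError(f"Unrecognized value for parameter 'language': {code}")
    return normalized
```

With a check like this in every code path that sets a term, a value such as "TEST" or "Türkçe" would be rejected instead of stored.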

But removing the data may still be possible using wbeditentity?

I'll go and take a look now!

Addshore added a comment. Edited · Jul 11 2016, 12:18 PM

It is possible to remove the bad languages using wbeditentity:

https://www.wikidata.org/w/index.php?title=Q208693&diff=356535483&oldid=355910545

  1. Get the whole entity with wbgetentities
  2. Remove the bad language codes from the JSON
  3. Send the whole data back to wbeditentity with the CLEAR parameter set!
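The steps above can be sketched as follows. The filtering in step 2 is plain JSON manipulation on the standard wbgetentities entity shape (labels, descriptions, and aliases keyed by language code); steps 1 and 3 are only shown as comments, and the `ALLOWED` set and `strip_bad_languages` helper are illustrative stand-ins, not an existing tool.

```python
# Sketch of the three-step cleanup, assuming the standard wbgetentities
# JSON shape where labels/descriptions/aliases are keyed by language code.

ALLOWED = {"en", "de", "fr", "tr"}  # stand-in for the real allowed list

def strip_bad_languages(entity: dict) -> dict:
    """Return a copy of the entity with terms in unknown languages removed."""
    cleaned = dict(entity)
    for field in ("labels", "descriptions", "aliases"):
        if field in entity:
            cleaned[field] = {
                lang: value
                for lang, value in entity[field].items()
                if lang in ALLOWED
            }
    return cleaned

# Step 1: fetch the entity, e.g. via
#   /w/api.php?action=wbgetentities&ids=Q208693&format=json
# Step 2: cleaned = strip_bad_languages(entity)
# Step 3: POST the cleaned JSON back via action=wbeditentity with the
#         clear parameter set, so the stored entity is fully replaced.
```

Because wbeditentity with clear replaces the whole entity, the bad terms simply never reappear in the new revision.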

I've cleaned up most of the languages that were obviously bad. I didn't touch ones like aeb-latn, but if anybody knows how to easily get the list of allowed languages, it should be pretty easy to run a script to clean up the ones with non-matching language codes.

greg added a subscriber: greg. Jul 25 2016, 6:52 PM

Just to get this on the record on the task:
What is needed with this issue now? Is it still UBN!? Should follow-up occur in a separate task?

I think the UBN! part has been fixed: bad language codes can no longer be added, and the bad data has been cleaned up from Wikidata.

greg lowered the priority of this task from Unbreak Now! to High. Jul 25 2016, 8:06 PM

Per that, reducing to High. This is mostly so that when I review open UBN! tasks, my signal-to-noise ratio is slightly higher.

@Addshore / etc feel free to reprioritize if that makes sense for how you're using the priority in the context of Wikidata.

Addshore closed this task as Resolved. Jul 31 2016, 8:41 PM
Addshore claimed this task.

I think this can be closed (anyone feel free to re-open if you think otherwise)!

Restricted Application added a project: User-Addshore. Aug 15 2018, 10:30 AM