
Lexeme language code allows both upper and lowercase qids in -x-qid syntax
Closed, Declined | Public | 5 Estimated Story Points

Description

Problem:

  1. Currently, we treat uppercase and lowercase Q-ids (mis-x-Q1 and mis-x-q1) as two separate codes and allow both, even though they are semantically identical. This only happens with the Q-id part of the code.
  2. In the other parts of the language code, if you try to save mis and MIS, mis-x-Q1 and MIS-X-Q1, or en-gb and en-GB, the second of each pair is rejected. This can be improved.

Example:
See https://test.wikidata.org/wiki/Lexeme:L97

Example for the normalization:

  • HE-X-Q21283070 -> he-x-Q21283070
  • he-x-q21283070 -> he-x-Q21283070
  • He-X-q21283070 -> he-x-Q21283070
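The mapping above can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: `normalizeLexemeLanguageCode` is a hypothetical helper name, not an existing WikibaseLexeme function, and it assumes the item ID always appears as a trailing `-x-Q<digits>` suffix.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper (not part of WikibaseLexeme): lowercases the whole
 * code, then restores an uppercase "Q" on a trailing "-x-q<digits>" suffix.
 */
function normalizeLexemeLanguageCode( string $code ): string {
	$lower = strtolower( trim( $code ) );
	// Only the private-use item ID suffix gets its "q" uppercased.
	return preg_replace( '/-x-q([1-9]\d*)\z/', '-x-Q$1', $lower ) ?? $lower;
}

foreach ( [ 'HE-X-Q21283070', 'he-x-q21283070', 'He-X-q21283070' ] as $input ) {
	echo $input, ' -> ', normalizeLexemeLanguageCode( $input ), PHP_EOL;
}
```

Codes without an item ID part (for example en-GB) simply come out lowercased, since the suffix pattern does not match them.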

Notes:

  • ?l wikibase:lemma ?lemma . BIND( lang(?lemma) as ?code) renders lowercase (e.g. nan-x-q56929 )

Acceptance criteria:

  • If a user enters the code, it is saved in the described standardized way (no matter if parts of it were entered uppercase or lowercase).

This should automatically lead to:

  • If an editor enters both the Q and q version to the same Item then the edit should fail (with our standard message saying you cannot save two Lemmas with the same code).
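To illustrate why the duplicate check follows from a standardized spelling, here is a minimal sketch that groups lemma language codes which differ only in capitalization. The function name `findCaseInsensitiveDuplicates` is hypothetical, and the real check and its error message live elsewhere in the lexeme code.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical check: returns the groups of language codes that differ only
 * in capitalization, keyed by their lowercased form.
 *
 * @param string[] $codes
 * @return array<string, string[]>
 */
function findCaseInsensitiveDuplicates( array $codes ): array {
	$groups = [];
	foreach ( $codes as $code ) {
		$groups[strtolower( $code )][] = $code;
	}
	// Keep only the groups in which more than one spelling was entered.
	return array_filter( $groups, static fn ( array $group ): bool => count( $group ) > 1 );
}

$duplicates = findCaseInsensitiveDuplicates( [ 'mis-x-Q1', 'mis-x-q1', 'en' ] );
if ( $duplicates !== [] ) {
	echo 'Edit rejected, conflicting codes: ', json_encode( $duplicates ), PHP_EOL;
}
```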

Sprint 11 Planning - Notes

  • standard = everything is lowercase except the uppercase Q
  • reverts are handled by = reverting to the original version without normalization
  • for existing qids which don't fit the standardized way = they will be updated by the community after this is released
  • We will display language codes the way they were originally saved, to allow the community to fix them

Original:

See https://test.wikidata.org/wiki/Lexeme:L97

It treats mis-x-Q1 and mis-x-q1 as separate codes and allows both even though they are semantically identical.

This only happens with the qid part of the code. If you try to save mis and MIS, mis-x-Q1 and mis-X-Q1 or en-gb and en-GB, it rejects the second of each pair.

?l wikibase:lemma ?lemma . BIND( lang(?lemma) as ?code) renders lowercase (e.g. nan-x-q56929 )

Event Timeline

mis-x-Q1 validation happens in LexemeTermLanguageValidator:

		// ...
		if ( count( $parts ) > 1 && !$this->isValidItemId( $parts[1] ) ) {
			$context->addViolation( new InvalidItemId( $parts[1] ) );
		}
	}
	private function isValidItemId( $id ) {
		return preg_match( ItemId::PATTERN, $id );
	}

ItemId::PATTERN allows both lower- and uppercase Q. If we’d noticed this during initial development, we probably should’ve allowed only one of them here. But given that lexemes with both now exist on Wikidata, that might be too heavy-handed…

On the other hand – it’s in change-op validation, not in the core data model. So maybe if we add such a restriction, it only applies to new edits, and you can still edit existing lexemes with such language codes. Not sure.

The vast majority of the time people use Q, presumably because they are copying the ID from the item page. I just fixed all the ones which had q, but I'll have to wait until the next lexeme dump to find out if I got them all (or if any new ones appeared).

There were 10 lexemes in the 2020-11-27 lexeme dump which I've also fixed. They came from two users, Jacek Janowski and Tokyo Akademia, which was the case for most of the previous ones I fixed too.

@Lucas_Werkmeister_WMDE Would it also be possible to automatically convert them to upper case instead of rejecting them? That seems better if we know what's intended.

In general that should also be possible, though I haven’t yet looked at where in the code that would happen.

Personally, I'd use lowercase, as almost all language codes in Wikidata are lowercased and lang(?lemma) also gives lowercase, not uppercase. See https://w.wiki/3fnb

Manuel updated the task description.

Hi @ItamarWMDE this has acceptance criteria now! Ready? :)

karapayneWMDE set the point value for this task to 8.

I have now clarified the edge cases that we discovered in sprint planning and modified the task description accordingly:

  • If an editor enters both the Q and q version to the same Item then the edit should fail (with our standard message saying you cannot save two Lemmas with the same code). This is the standard behavior, so we will not have to deal with the edge case of "if an item has both Q and q" in a special way.
  • Reverts should just go back to the original version (without normalization). This way we will not have to deal with normalization edge cases.

breakdown notes

  • The scope of this task is to ensure that language codes that include a Q-id are saved in a standardized way

Subtask 1: make sure only codes with an uppercase Q-id are saved (sub-task will be created when this story is picked up)

  • show an error when saving a mis-x- language code with a lowercase item ID
  • this will need adjusting the LexemeTermLanguageValidator as discussed in T268689#6646820
  • this should get us ~80% of the benefit of this story
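A minimal sketch of what the stricter check in subtask 1 could look like. Everything here is an assumption for illustration: the constant `STRICT_ITEM_ID_PATTERN`, the function `hasValidItemIdPart`, and the splitting on `-x-` are stand-ins, not the actual LexemeTermLanguageValidator code or the real ItemId::PATTERN.

```php
<?php
declare( strict_types=1 );

// Assumed pattern for illustration: an item ID pattern without the
// case-insensitive flag, so only an uppercase "Q" matches.
const STRICT_ITEM_ID_PATTERN = '/^Q[1-9]\d{0,9}\z/';

/**
 * Hypothetical stand-in for the validator check: accepts "mis" and
 * "mis-x-Q1", rejects "mis-x-q1".
 */
function hasValidItemIdPart( string $languageCode ): bool {
	$parts = explode( '-x-', $languageCode, 2 );
	if ( count( $parts ) === 1 ) {
		return true; // no item ID part to check
	}
	return preg_match( STRICT_ITEM_ID_PATTERN, $parts[1] ) === 1;
}

var_dump( hasValidItemIdPart( 'mis-x-Q1' ) ); // bool(true)
var_dump( hasValidItemIdPart( 'mis-x-q1' ) ); // bool(false)
```

Because the check only rejects, existing lexemes keep whatever spelling they were saved with until someone edits them.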

Subtask 2: automatically normalize lowercase q in item IDs to uppercase Q when saving instead of showing an error

  • this task might benefit from having its priority reconfirmed by @Manuel
  • it might make sense to do this for other parts of the language code as well
  • This might have to happen in ChangeOpLemmaEdit

Yes, my intention was to normalize the capitalization of all parts of the string (mis-x- lowercase, Qid uppercase). From your question @Michael about reconfirmation, I assume that part 2 is more complex than we hoped for (in terms of work needed or the result we get, e.g. as it would require modifying a non-lexeme-specific part of the code, if I understand correctly)?

> From your question @Michael about reconfirmation, I assume that part 2 is more complex than we hoped for (in terms of work needed or the result we get, e.g. as it would require modifying a non-lexeme-specific part of the code, if I understand correctly)?

Mh, not necessarily "more complex than we hoped for", but surely more complex than part 1 and quite separate from part 1. So there is the option here to treat these two parts separately in terms of the product perspective, and the core functionality of this story would already be implemented with part 1.
Also, we could probably expand the scope of part 2 to "save any language code regardless of capitalization in the correct, standardized way instead of showing an error". For example, when the user enters en-GB, we could just save this as en-gb instead of showing an error about an unknown language code, like we currently do. (That error is also suboptimal copy, because the language code is in fact known; it is just capitalized incorrectly.)
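The expanded scope could be sketched as a case-insensitive lookup against the list of known codes. This is a sketch under assumptions: `canonicalizeLanguageCode` is a hypothetical name, and the `$known` array is an illustrative subset rather than the real list of lexeme language codes.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical lookup: maps user input to the canonical spelling of a known
 * language code, ignoring capitalization. Returns null for unknown codes.
 *
 * @param string[] $knownCodes canonical codes
 */
function canonicalizeLanguageCode( string $input, array $knownCodes ): ?string {
	$byLowercase = [];
	foreach ( $knownCodes as $known ) {
		$byLowercase[strtolower( $known )] = $known;
	}
	return $byLowercase[strtolower( trim( $input ) )] ?? null;
}

$known = [ 'en', 'en-gb', 'mis' ]; // illustrative subset, not the real list
var_dump( canonicalizeLanguageCode( 'en-GB', $known ) ); // string(5) "en-gb"
var_dump( canonicalizeLanguageCode( 'xx-YY', $known ) ); // NULL
```

With a lookup like this, the "unknown language code" error would only be shown for codes that are unknown regardless of capitalization.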

Thank you, @Michael, for pointing this out! That was actually the original intention of my AC. What we then added during sprint planning could also be read differently, so I fixed it!

This fell through the cracks as we removed it from our "Special:NewLexeme revival" work board without adding it to the new unified board. Adding it to the prioritized column of our sprint work board.

Sprint planning: subtask 1 accepted into sprint 6. Subtask 2 will be converted into a separate task and added to the product backlog.

karapayneWMDE changed the point value for this task from 8 to 5.

(Note: estimate of 5 is for subtask 1 [disallow lowercase qid], excluding subtask 2 [normalize case generally])

We have split this task in two parts (see T268689#8036398) and successfully deployed the first part: T317863: Only save Lexeme language codes with uppercase item IDs (mis-x-Qid), not lowercase (mis-x-qid)

I have decided not to follow up with the second part at this point in favor of a more general evaluation of normalization (see T319022#8307285).

-> Declining this parent task for now.