Page MenuHomePhabricator

Normalize pagenames/filenames on save in Wikibase
Open, Needs TriagePublic5 Estimated Story Points

Description

As a user, I want to enter pagenames/filenames without triggering a constraint violation because of (missing) underscores.

Problem:
Currently, Wikibase does not normalize pagenames/filenames on save (e.g. underscores in the input for properties of datatype Commons media are allowed).

At the same time, Wikidata’s quality constraints extension triggers a constraint violation after saving, if underscores are used. This is by design as to long-established Community practices.

As a result, this inconsistency leaves users with unnecessary manual work.

Example:
Currently, we don't normalize the file name ("_" not removed on save)

We do already normalize sitelinks ("_" removed on save)

BDD
GIVEN a Wikibase instance like Wikidata
WHEN a new edit is saved via UI or API
AND a pagename/filename is added or changed in that edit
THEN this pagename/filename is normalized on save ("my file_name.jpg" -> "my file name.jpg")

Acceptance criteria:

  • pagenames/filenames are normalized on save, as described in the BDD
  • This should be done considering alignment with the work done for T286047.
  • Wait until community communication has happened, as this could technically be considered a breaking change (planned for the week of July 19-23).

Original:
misleading: "Commons link should be well-formed." warning

If an image name (e.g. P18, P1442) does contain underscores '_', wikidata considers this to be ill-formed. Now

  • Grab fritz jellinek wiener zentralfriedhof 2020-01-30(2).jpg and
  • Grab_fritz_jellinek_wiener_zentralfriedhof_2020-01-30(2).jpg are identical by definition and the warning is an annoyance

So please can you silently do all the necessary underscore <-> space replacements by the underlying software instead of propagating it to the users.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Manuel renamed this task from Normalize file names to misleading: "Commons link should be well-formed." warning.Jun 21 2021, 8:30 PM
Manuel updated the task description. (Show Details)
Manuel renamed this task from misleading: "Commons link should be well-formed." warning to Normalize pagenames/filenames on save .Jun 22 2021, 11:43 AM
Manuel updated the task description. (Show Details)
Addshore set the point value for this task to 5.Jun 30 2021, 10:21 AM

I have been investigating this but I found a weird case, maybe introduced in new refactoring but ValueParsers are totally cut off from the rest of the code and are basically dead code now. For example check for usages of WikibaseRepo::getValueParserFactory is never used. I put die() in it and still managed to save an edit.

It is used, just not through the WikibaseRepo getter:

extension-repo.json
		"wbparsevalue": {
			"class": "\\Wikibase\\Repo\\Api\\ParseValue",
			"factory": "\\Wikibase\\Repo\\Api\\ParseValue::factory",
			"services": [
				"StatsdDataFactory",
				"WikibaseRepo.ApiHelperFactory",
				"WikibaseRepo.DataTypeFactory",
				"WikibaseRepo.DataTypeValidatorFactory",
				"WikibaseRepo.ExceptionLocalizer",
				"WikibaseRepo.PropertyDataTypeLookup",
				"WikibaseRepo.ValidatorErrorLocalizer",
				"WikibaseRepo.ValueParserFactory"
			]
		},

But I think the value parser is the wrong point to do this normalization anyways. We want to normalize on save, not on parse – even if someone directly creates the JSON for a commonsMedia value and submits it to the API without using wbparsevalue, and that JSON specifies the file name with an underscore, we still want to save it using a space, just as we do (I assume) for sitelinks.

For sitelinks, it looks like this happens in SetSiteLink::getChangeOp(), with a call to $site->normalizePageName(); the result ends up in SiteLinkChangeOpFactory::newSetSiteLinkOp().

For Commons media, it looks like we already normalize the page name in CachingCommonsMediaFileNameLookup, but that class is only used by CommonsMediaExistsValidator, which only checks that the normalized name isn’t null before throwing it away. (More or less the same thing seems to be true for geoshape and tabular data validation, too.)

But I think the value parser is the wrong point to do this normalization anyways. We want to normalize on save, not on parse – even if someone directly creates the JSON for a commonsMedia value and submits it to the API without using wbparsevalue, and that JSON specifies the file name with an underscore, we still want to save it using a space, just as we do (I assume) for sitelinks.

Hm, let me walk that back a bit. Sitelinks and commonsMedia values aren’t the same thing – I’m not sure if there are currently any places where we normalize data value contents.

Maybe we should instead:

  1. Make the value parser normalize page names. This should be a fairly straightforward change that we can make without breaking anything.
  2. Make the value validator reject non-normalized page names. This would be a breaking change to the API (announce in advance, waiting period, etc.) that we would make afterwards.

(“Normalize” here mainly means replacing underscores with spaces, though I could imagine some Unicode normalization as well.)

@Lucas_Werkmeister_WMDE:

We want to normalize on save, not on parse – even if someone directly creates the JSON for a commonsMedia value and submits it to the API without using wbparsevalue, and that JSON specifies the file name with an underscore, we still want to save it using a space, just as we do (I assume) for sitelinks.

I agree!

Hm, let me walk that back a bit. (...) I’m not sure if there are currently any places where we normalize data value contents.

I would not want to make things harder for API users. I think APIs should ideally be flexible in what they accept, but very standardized in what they give back. For me to better understand your point: Could you please elaborate a bit on why you walked back from your previous comment?

Because the data value is in the same format that you get back when you retrieve the entity data. We sometimes allow for some redundancy (mainly for backwards compatibility, I believe) – for example, both of these values

{
  "entity-type": "item",
  "numeric-id": 5
}
{
  "id": "Q5"
}

will get you

{
  "entity-type": "item",
  "numeric-id": 5,
  "id": "Q5"
}

in the entity data. (At least the entity data you get out of the API or Special:EntityData. I don’t know if we actually store the redundant bits internally.) But I’m not aware of any place where we try to “clean up” the value beyond that (e.g. by allowing " Q5 " in the "id"). And I think it would be reasonable for API users to expect that if they pass a data value into the low-level API, and it’s accepted, then they will get it back unmodified. (wbparsevalue would be the higher-level API that’s “allowed” to do normalization.)

Moving to stalled to reflect ongoing internal discussions about how to implement this.

We also looked through the other datatypes, and it looks like _ normalization is the only kind of normalization we’d implement at the moment (for the Commons media, tabular data, and geoshape datatypes). In the future we could think about normalizing quantities: 0001.01.0. (Currently, 0001.0 is rejected as an illegal data value, since DecimalValue::QUANTITY_VALUE_PATTERN doesn’t permit leading zeroes.)

Clarified scope of ticket as per discussion with @Lucas_Werkmeister_WMDE.

@Manuel @Lydia_Pintscher @Samantha_Alipio_WMDE

So judging from the open questions here we need a product decision to un-stall this.

Open questions:

  • Standard Mediawiki-normalization for pagenames/filenames in Wikidata uses the opposite normalization ("file_name.jpg" -> "file_name.jpg"). How much of a problem is this inconsistency?
  • Do we want this to happen on all Wikibases or only on Wikidata?

@toan: Yes, I will un-stall it when everything is ready.

Manuel renamed this task from Normalize pagenames/filenames on save to Normalize pagenames/filenames on save in Wikibase.Jul 13 2021, 6:53 PM
Manuel updated the task description. (Show Details)
Manuel updated the task description. (Show Details)

Change 709774 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Introduce normalization-related wiring and services

https://gerrit.wikimedia.org/r/709774

Change 709775 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Add normalization to StatementChangeOpFactory

https://gerrit.wikimedia.org/r/709775

Change 709776 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Define normalizer for commonsMedia property type

https://gerrit.wikimedia.org/r/709776

Change 709958 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@master] Temporarily skip LexemeFormsMergerTest

https://gerrit.wikimedia.org/r/709958

Change 709959 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@master] Update StatementChangeOpFactory constructor call

https://gerrit.wikimedia.org/r/709959

Change 710044 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Introduce DataValueNormalizer interface

https://gerrit.wikimedia.org/r/710044

Change 710045 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Add data-value-normalizers to data type definitions

https://gerrit.wikimedia.org/r/710045

Change 710046 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Introduce SnakNormalizer

https://gerrit.wikimedia.org/r/710046

Change 710047 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Introduce ReferenceNormalizer

https://gerrit.wikimedia.org/r/710047

Change 710048 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Introduce StatementNormalizer

https://gerrit.wikimedia.org/r/710048

Change 709774 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/extensions/Wikibase@master] Introduce normalization-related wiring and services

Reason:

split into several smaller changes

https://gerrit.wikimedia.org/r/709774

Change 710069 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Add API integration tests for data value normalization

https://gerrit.wikimedia.org/r/710069

Change 710044 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Introduce DataValueNormalizer interface

https://gerrit.wikimedia.org/r/710044

Change 710045 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Add normalizer-factory-callback to data type definitions

https://gerrit.wikimedia.org/r/710045

Change 710046 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Introduce SnakNormalizer

https://gerrit.wikimedia.org/r/710046

Change 710047 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Introduce ReferenceNormalizer

https://gerrit.wikimedia.org/r/710047

Change 710048 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Introduce StatementNormalizer

https://gerrit.wikimedia.org/r/710048

Change 709958 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Temporarily skip LexemeFormsMergerTest

https://gerrit.wikimedia.org/r/709958

Change 709775 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Add normalization to StatementChangeOpFactory

https://gerrit.wikimedia.org/r/709775

Change 709776 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Define normalizer for commonsMedia property type

https://gerrit.wikimedia.org/r/709776

Change 710069 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Add API integration tests for data value normalization

https://gerrit.wikimedia.org/r/710069

Change 709959 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Update StatementChangeOpFactory constructor call

https://gerrit.wikimedia.org/r/709959

Stalled until next week’s train, then we can enable this on Test Wikidata and announce it.

Change 714334 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/mediawiki-config@master] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis

https://gerrit.wikimedia.org/r/714334

Change 714334 merged by jenkins-bot:

[operations/mediawiki-config@master] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis

https://gerrit.wikimedia.org/r/714334

Mentioned in SAL (#wikimedia-operations) [2021-08-23T11:31:58Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:714334|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis (T251480)]] (1/2) (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2021-08-23T11:33:39Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/Wikibase.php: Config: [[gerrit:714334|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on test wikis (T251480)]] (2/2) (duration: 00m 57s)

I noticed one issue while testing the deployment above: if you add an image statement with a non-normalized value, it gets saved normalized, but if you then edit the statement again, you’ll still see the non-normalized value (with underscores). Only a reload solves this. We probably want to apply the data value returned by the API, I assume.

A related issue that is quite frustrating when adding images to Wikidata: you need to remove the 'File:' part, otherwise you get a pull-down list of possible matches where it's easy to then click on the wrong file. Could it be possible to submit 'File:' (and maybe also 'Image:' for legacy reasons), and for the code to then strip it out in the same way as underscores?

Change 715018 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/mediawiki-config@master] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis

https://gerrit.wikimedia.org/r/715018

Unassigning myself since I won’t be around on 6 September, someone else will have to deploy that config change.

Change 715018 merged by jenkins-bot:

[operations/mediawiki-config@master] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis

https://gerrit.wikimedia.org/r/715018

Mentioned in SAL (#wikimedia-operations) [2021-09-07T20:41:15Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715018|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis (T251480)]] (duration: 00m 59s)