
[Bug] Normalize titles before lookup in the SiteLinksTable
Open, MediumPublic

Description

SiteLinkTable should apply lightweight normalization to page titles before storing the. This would avoid issues when titles are passed with or without spaces as parameters to API calls, etc.

The following normalization should be applied:

  • strip leading and trailing whitespace
  • Unicode normalization
  • converting underscores to spaces (currently, the items_per_site table uses spaces in the page titles, in violation of current practice elsewhere in the database schema)
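The three steps listed above could be sketched roughly as follows. This is only an illustration in Python, not the Wikibase code: the function name `normalize_title_cheap` is hypothetical, and the choice of NFC is an assumption, since the task does not say which Unicode normalization form is meant.

```python
import unicodedata


def normalize_title_cheap(title: str) -> str:
    """Apply only the wiki-independent steps: no namespace handling,
    no first-letter capitalization, no redirect resolution."""
    # Unicode normalization. NFC is an assumption; the task names no form.
    title = unicodedata.normalize("NFC", title)
    # Convert underscores to spaces, matching titles stored with spaces.
    title = title.replace("_", " ")
    # Strip last, so spaces produced from leading/trailing underscores go too.
    return title.strip()
```

Stripping after the underscore conversion means an input such as `_Foo_` also loses the spaces that the underscores turn into.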

The following normalization should not be applied:

  • namespace normalization (this requires knowledge of the target wiki's config)
  • first letter capitalization (requires knowledge about the target wiki's content language, but also about namespaces)
  • redirect resolution (requires access to the target wiki's database)

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=45111

Details

Reference
bz45282

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:17 AM
bzimport set Reference to bz45282.
bzimport added a subscriber: Unknown Object (MLST).

..storing the? "Them" or something else?

I don't think this class is the correct place to do such a rewrite. This class should use whatever string is passed to it, or throw an error.

@jeblad: you are right, I also noticed that when poking at the issue yesterday.

The problem seems to be that SiteLinkTable's interface is a bit asymmetric: it stores information from SiteLink objects, but for queries, it takes a site ID and page title as a string. That is convenient, but introduces inconsistencies.

Perhaps the necessary normalization should be done in the SiteLink class, and we should use SiteLink instances for querying the SiteLinkTable. But even the SiteLink class doesn't have the necessary information (namely, whether the target is a MediaWiki instance). That would have to be done in the Site object.

So, this is my current take on the issue:

  • Site::normalizePageName() should get an option for enabling/disabling expensive canonical normalization. This is a core change.
  • SiteLinkTable should not take site id and page title as strings, but always operate on SiteLink instances.
  • SiteLink should provide a way to create an instance with or without "expensive" normalization, and always apply "cheap" normalization.
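To make the proposal above more concrete, here is a rough sketch of how the three pieces might fit together. It is written in Python purely for illustration; the real classes are PHP, and every name and signature below (`normalize_page_name`, `SiteLink.create`, the `canonicalize` hook, the in-memory table) is hypothetical and stands in for whatever the actual change would use.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


def cheap_normalize(title: str) -> str:
    # Wiki-independent steps only; NFC is an assumption.
    return unicodedata.normalize("NFC", title).replace("_", " ").strip()


@dataclass(frozen=True)
class Site:
    global_id: str
    # Hypothetical stand-in for the expensive, wiki-specific canonicalization
    # (namespaces, capitalization, redirects).
    canonicalize: Optional[Callable[[str], str]] = None

    def normalize_page_name(self, title: str, expensive: bool = True) -> str:
        title = cheap_normalize(title)
        if expensive and self.canonicalize is not None:
            title = self.canonicalize(title)
        return title


@dataclass(frozen=True)
class SiteLink:
    site: Site
    page: str

    @classmethod
    def create(cls, site: Site, title: str, expensive: bool = False) -> "SiteLink":
        # Cheap normalization always runs; the expensive part is opt-in.
        return cls(site, site.normalize_page_name(title, expensive=expensive))


class SiteLinkTable:
    """In-memory stand-in that only accepts SiteLink instances."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], str] = {}

    def save(self, link: SiteLink, item_id: str) -> None:
        self._rows[(link.site.global_id, link.page)] = item_id

    def get_item_id_for_link(self, link: SiteLink) -> Optional[str]:
        return self._rows.get((link.site.global_id, link.page))
```

Because both storing and querying go through `SiteLink.create`, a lookup for ` Some_Page ` and one for `Some Page` would hit the same row, which is the asymmetry this proposal is meant to remove.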

Related URL: https://gerrit.wikimedia.org/r/63967 (Gerrit Change I86c72ac3a9da52dfd3ee1aca86b247c59d3098ce)

Do we still need to do this or can this be closed?

Unicode normalization is still not applied consistently (this is relevant not only for the SiteLink table).

Perhaps we could file that as a separate bug and close this one.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Just checked all possible occurrences of this issue I could find:

  • CachingSiteLinkLookup does no normalization, but this is ok. It uses the unnormalized strings for caching, then delegates to other lookups that do normalization.
  • ModifyEntity passes unnormalized strings to SiteLinkLookup::getItemIdForLink.

Everything else looks fine to me. So I think only one issue is left.
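A small sketch of why caching on unnormalized strings is harmless, as described for CachingSiteLinkLookup above. This is Python for illustration only; the class and function names are hypothetical and the inner lookup is a plain dictionary rather than a database.

```python
from typing import Callable, Dict, Optional, Tuple

Lookup = Callable[[str, str], Optional[str]]


class CachingLookup:
    """Caches by the raw (site ID, title) pair and delegates on a miss."""

    def __init__(self, inner: Lookup) -> None:
        self._inner = inner
        self._cache: Dict[Tuple[str, str], Optional[str]] = {}
        self.inner_calls = 0  # counter for demonstration purposes only

    def get_item_id_for_link(self, site_id: str, title: str) -> Optional[str]:
        key = (site_id, title)  # raw, unnormalized key
        if key not in self._cache:
            self.inner_calls += 1
            self._cache[key] = self._inner(site_id, title)
        return self._cache[key]


def normalizing_lookup(rows: Dict[Tuple[str, str], str]) -> Lookup:
    """Inner lookup that normalizes the title before consulting the rows."""

    def lookup(site_id: str, title: str) -> Optional[str]:
        return rows.get((site_id, title.replace("_", " ").strip()))

    return lookup
```

Two spellings of the same title produce two cache entries and two delegated calls, but both return the same item ID, so the only cost is some redundant cache space.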

thiemowmde renamed this task from Normalize titles before lookup in the SiteLinksTable to [Bug] Normalize titles before lookup in the SiteLinksTable. Sep 11 2015, 9:15 AM
thiemowmde set Security to None.
thiemowmde removed subscribers: Wikidata-bugs, Abraham.

Change 237605 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Mark missing Unicode normalization with a FIXME

https://gerrit.wikimedia.org/r/237605