Make title-related properties consistent
Closed, ResolvedPublic

Description

We have quite a few properties related to the title and it's getting confusing. Not sure we can resolve it though, but let me document the current state:

  • The summary endpoint result has a title property that's the page title with spaces, not underscores.
  • The feed response has a title that uses underscores, and a normalizedtitle that uses spaces.
  • The trending endpoint has a title that uses underscores, and a normalizedtitle that uses spaces.
  • Same for the on_this_day endpoint.

So it seems that only the summary endpoint uses title as the display title rather than as the normalized article key.

Since changing that would be backwards-incompatible, we need to accumulate more backwards-incompatible changes before doing this. The list is currently:

  • For the summary endpoint, use underscores not spaces
  • Use snake_case
  • Add or modify normalized_title and display_title fields with an omitted namespace name (File:, User:, Special:, ...)
  • Add a localized namespace field, namespace_title (User, Usuario, Special, Especial, ...)
  • Update the API documentation

Also, @Niedzielski proposed to get rid of the namespace prefix for some of the keys:

https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACollage_of_Nine_Dogs.jpg

Current:
{
  ...,
  "title": "File:Collage of Nine Dogs.jpg",
  ...
}

Proposed:
{
  ...,
  "title": "File:Collage_of_Nine_Dogs.jpg", // Use this when constructing a URL
  "normalized_title": "Collage of Nine Dogs.jpg", // Use this for plain text
  "display_title": "<strong>Collage of Nine Dogs.jpg</strong>", // Use this for WebViews
  "namespace_title": "File", // Use this when you want to include the namespace in the rich or plain text title
  ...
}
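
A minimal Python sketch of how a client might consume the proposed fields. The field names come from the proposal above; the helper names (page_url, plain_text_title), the /wiki/<title> URL layout, and the sample dictionary are assumptions made for illustration, not part of any existing API.

```python
from urllib.parse import quote

# Sample response following the *proposed* format (not the current API output).
summary = {
    "title": "File:Collage_of_Nine_Dogs.jpg",
    "normalized_title": "Collage of Nine Dogs.jpg",
    "display_title": "<strong>Collage of Nine Dogs.jpg</strong>",
    "namespace_title": "File",
}


def page_url(base, summary):
    """Build a page URL from the underscore form of the title.

    safe="" percent-encodes every reserved character, including ':' and '/'.
    The '/wiki/' path layout is an assumption of this sketch.
    """
    return f"{base}/wiki/{quote(summary['title'], safe='')}"


def plain_text_title(summary, with_namespace=False):
    """Plain-text title, optionally prefixed with the localized namespace name."""
    name = summary["normalized_title"]
    namespace = summary.get("namespace_title")
    if with_namespace and namespace:
        return f"{namespace}:{name}"
    return name
```

With separate fields, the client never has to parse the namespace out of the title string; it only decides whether to prepend it.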

Event Timeline

Thanks for making this ticket…

I really like the format. @JoeWalsh @Jdlrobson @Mholloway what do you think?

cc @Niedzielski

Thanks @Pchelolo! I've updated the ticket to include all the changes from the original proposal. Please revert if I've screwed it up!

Seems good @Niedzielski. I've created this so we don't forget we wanted to do it, but most likely it's not gonna happen very soon: the changes are backwards-incompatible and there are lots of clients using the old format.

@Pchelolo would this be a good reason to version the summary end point?

Is that what you are heading towards by collecting all the backwards incompatible details?

@Fjalapeno We would eventually support Accept header content-type negotiation, but even with that it's a disruptive change, so we're definitely not doing it until we absolutely must. This change is not an absolute must, but I've created a task to not forget about it when something more important appears.

You should use displaytitle for plain text as well, just strip out the HTML tags. E.g. displaytitle: β-lactam antibiotic, normalized title: Β-lactam antibiotic - not so nice.
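
Stripping the tags could be done along these lines. This is a sketch using Python's standard html.parser module; the helper name strip_tags is hypothetical, and the italic input in the usage note is an illustrative value rather than real API output.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(display_title):
    """Return the display title with all HTML markup removed."""
    parser = _TextOnly()
    parser.feed(display_title)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

For example, strip_tags("<i>β</i>-lactam antibiotic") keeps the lowercase β that the editors chose, which a normalized title would not.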

Also you should probably specify whether any of those titles uses URL encoding, and whether the namespace name is canonical or localized. (I suppose the latter, if it's intended for generating URLs or reader-visible titles.)

@Niedzielski What's the use case for separating the namespace_title from the normalized_title and display_title?

@bearND, it's difficult for clients to parse titles correctly WRT localization concerns and special characters like :. Removing the namespace from title fields will give clients a lot of flexibility to manipulate the title. For example, if you're on the Talk:Barack_Obama page and want to construct a URL to the mainspace page, or if you're on the User:bearND page and want your UI just to say "bearND" instead of "User:bearND". I hope that the namespace name and number will also both be available (I'm sure you remember all the work we did in the Android app around namespace parsing, especially for File and Talk pages).

On a slightly unrelated note, I hope that title will be URL encoded and allow us to avoid bugs like we saw with Swedish Wikipedia's picture of the day (forward slash encoding problem), C++ (plus sign encoding problem), Bill & Ted's Excellent Adventure (ampersand encoding problem), and Talk:Street Fighter II: The World Warrior (colon encoding problem).
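
To make the encoding problems concrete, here is a sketch of encoding a title as a single URL path segment with Python's urllib.parse.quote. The helper name encode_title and the choice to replace spaces with underscores before encoding are assumptions of this sketch; "Foo/Bar" in the usage note is a made-up title standing in for the forward-slash case.

```python
from urllib.parse import quote


def encode_title(title):
    """Percent-encode a title for use as one URL path segment.

    quote() leaves '/' unescaped by default, so safe="" is passed to
    encode it (and ':', '+', '&') as well.
    """
    return quote(title.replace(" ", "_"), safe="")
```

With this, "C++" becomes "C%2B%2B" and "Foo/Bar" becomes "Foo%2FBar", so neither the plus signs nor the slash can be misread by a URL parser.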

There's a few things to think about here. Personally I would recommend dropping "namespace_title". In the example given I'd like to see:

"title": "File:Collage_of_Nine_Dogs.jpg", // Use this when constructing a URL or getting namespace
"display_title": "<strong>Collage of Nine Dogs.jpg</strong>", // Use this for rendering WebViews
  • Using 'Talk:' prefix on a non-English wikipedia will cause a redirect to the translated equivalent.
  • Personally I find it most useful to have the prefixed database key e.g. 'Talk:Barack Obama' and the display title "Barack Obama". You can safely split the string of the database key on ':' to get the namespace.
  • Every page has a talk page equivalent (and every talk page a page). The namespace number is always +1 for the talk page - so for namespace 0 the talk page is namespace 1.
  • I'd prefer to keep fields as minimal as possible in favour of descriptive fields. The types of title available in MediaWiki always caused me confusion when I first got into Wikimedia development.
  • I'd prefer to minimise processing on the client. For instance, when building https://en.wikipedia.org/api/rest_v1/page/mobile-sections/User%3Ajdlrobson I added new fields that only show up for user pages, e.g. username is provided separately and titles are kept untouched. Why not think about the sort of things we can't do with just display_title and title? It might be the case that we can create new fields that are more useful.

Personally I find it most useful to have the prefixed database key e.g. 'Talk:Barack Obama' and the display title "Barack Obama". You can safely split the string of the database key on ':' to get the namespace.

Note that displaytitle, as defined by MediaWiki, includes the namespace. (In practice, I don't think displaytitle ever differs from title in non-article namespaces.)

Every page has a talk page equivalent (and every talk page a page). The namespace number is always +1 for the talk page - so for namespace 0 the talk page is namespace 1.

That is not a safe assumption in general. Flow topics for example don't have talk pages (example).
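
The caveat can be made explicit in client code. A minimal sketch, assuming the usual MediaWiki convention that subject namespaces have even, non-negative ids with the talk namespace at the next odd id, and that negative (virtual) namespaces such as Special have no talk pages. The NO_TALK set and its single entry (2600, assumed here to be the Flow topic namespace id) are illustrative; a real client would take this from the wiki's configuration.

```python
from typing import Optional

# Namespaces assumed to have no talk counterpart (should come from site config).
NO_TALK = {2600}


def talk_namespace(ns: int) -> Optional[int]:
    """Return the talk namespace id for ns, or None when no talk page exists."""
    if ns < 0 or ns in NO_TALK:
        return None  # virtual namespaces and namespaces without talk pages
    if ns % 2 == 1:
        return ns  # already a talk namespace
    return ns + 1
```

Returning None instead of blindly adding 1 forces the caller to handle pages that have no talk page.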

A couple of thoughts:

  • The title property discussed here (with underscores & canonical namespace prefix) is typically called prefixedDBKey in MediaWiki. However, the REST API calls this property just title.
  • The HTML <strong> wrapper in display_title is somewhat surprising to me. Would all titles have the <strong> wrapper? If so, could we instead return just the human-readable title, and let clients add the wrappers / formatting they want?
  • Can we make the distinction between prefixed titles & un-prefixed ones more obvious / systematic? Perhaps use display_title_name to signal that only the name is returned, without the namespace? This would leave the option of adding display_title (with namespace) later, and perhaps also display_title_namespace.

The HTML <strong> wrapper in display_title is somewhat surprising to me. Would all titles have the <strong> wrapper? If so, could we instead return just the human-readable title, and let clients add the wrappers / formatting they want?

The display-title could also make parts of the title bold or italic etc. It's controlled by the article editors.

Ah, I see. Didn't know that formatting was allowed in DISPLAYTITLE values.

So, title in the summary endpoint is the odd man out. It would be good to know how this field is used by the various clients to see if we can reintroduce the underscores without too much harm.

In Android code I see the title property mainly used to create PageTitle objects and in one place to build a hash code for persisting dismissed feed cards.

  • In the former case having the underscores shouldn't hurt anything. It might even be better since in most cases it is to load the title. The PageTitle class has a getDisplayText() method which strips underscores.
  • In the latter case introducing the underscores would mean that previously dismissed cards in the feed would reappear. It's probably not the end of the world to have this happen one time during the transition period. Possibly some temporary workaround code could be added to the Android app to also calculate the legacy hash code for the dismissed cards. Not sure about other clients.
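
The temporary workaround for dismissed cards could look roughly like this. It is a sketch only: the Android app's actual hashing scheme is not shown in this thread, so card_key, is_dismissed, and the use of SHA-1 are hypothetical stand-ins.

```python
import hashlib


def card_key(title: str) -> str:
    """Stable key for a feed card, derived from the title string exactly as given."""
    return hashlib.sha1(title.encode("utf-8")).hexdigest()


def is_dismissed(title: str, dismissed_keys: set) -> bool:
    """Check both the new (underscore) key and the legacy (space) key."""
    new_key = card_key(title.replace(" ", "_"))
    legacy_key = card_key(title.replace("_", " "))
    return new_key in dismissed_keys or legacy_key in dismissed_keys
```

Checking both forms means a card dismissed under the old space-separated title stays dismissed after the endpoint switches to underscores.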

I'm personally fine letting the client derive the namespace string from the title (prefixedDBKey), as long as there is a good understanding on how to do this correctly in RTL languages on the client side.

If the service is to strip the namespace from display title then I tend to prefer display_title_name over display_title since, as @Tgr mentioned above and as defined by MediaWiki, displaytitle includes the namespace.

Long term I'm not sure what normalizedtitle (https://www.mediawiki.org/wiki/API:Query#Title_normalization) gives us. For display purposes display_title_name should be superior. For referencing pages title should be used.

@Jdlrobson wrote:
Personally I find it most useful to have the prefixed database key e.g. 'Talk:Barack Obama' and the display title "Barack Obama". You can safely split the string of the database key on ':' to get the namespace.

We still want to avoid this type of splitting on clients. It causes divergences in code and makes us rely on convention. If you need the namespace, we should deliver it to you separately.
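
To illustrate the concern, here is a sketch of why naive splitting is fragile: a colon can also appear in a main-namespace title, so a client needs the wiki's namespace list to split correctly. KNOWN_NAMESPACES is a tiny hypothetical English-only sample (real wikis have localized names and aliases), and "2001:_A_Space_Odyssey" in the usage note is an illustrative title.

```python
# Hypothetical sample; a real client would need every localized name and alias.
KNOWN_NAMESPACES = {"Talk", "User", "File", "Special"}


def naive_split(db_key):
    """Split on the first ':' with no knowledge of namespaces."""
    prefix, sep, name = db_key.partition(":")
    return (prefix, name) if sep else ("", db_key)


def checked_split(db_key):
    """Treat the prefix as a namespace only if it is a known namespace name."""
    prefix, sep, name = db_key.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        return prefix, name
    return "", db_key
```

naive_split("2001:_A_Space_Odyssey") reports a namespace called "2001", while checked_split leaves the title intact. Every client would have to carry and localize such a namespace list, which is the duplication a separate namespace field avoids.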

@bearND wrote:
I'm personally fine letting the client derive the namespace string from the title (prefixedDBKey), as long as there is a good understanding on how to do this correctly in RTL languages on the client side.

Same feedback here, we should be trying to consolidate common logic on the clients into properties that are consistent and reliable. Having a property is a lot easier than communicating an "understanding" and relying on convention.

@Jdlrobson wrote:
I'd prefer to minimise processing on the client…

Exactly… we should deliver the properties that the clients need so that they don't need to do any manipulation on the client side.

From what I saw, @Pchelolo's original proposed 4 fields seem to eliminate the need for the clients to do any further processing of the title. Is this true? Is there any use case not captured by those 4 fields (bikeshedding on naming aside)?

Since it's been decided that other backwards-incompatible changes are not happening yet, I'm lowering the priority of this.

Also it would be nice to list the uses for the titles by all the consumers.

  • Using 'Talk:' prefix on a non-English wikipedia will cause a redirect to the translated equivalent.

This is true. I'd hope that any RESTful service running on a non-English Wikipedia would accept Talk:Foo and respond with 303 See Other [0] with the appropriate Location HTTP header. The client would then request the resource pointed to by that header.

Having the localized namespace name allows the client to avoid this second request if it wants while the service remains RESTful.


[0] 301 Moved Permanently might be more appropriate. 302 Found (formerly Moved Temporarily) isn't appropriate in this situation for obvious reasons.

This is true. I'd hope that any RESTful service running on a non-English Wikipedia would accept Talk:Foo and respond with 303 See Other [0] with the appropriate Location HTTP header. The client would then request the resource pointed to by that header.

This is already the case, we just return 301 Redirect (see curl -i https://ru.wikipedia.org/api/rest_v1/page/summary/Talk:%D0%A2%D0%B0%D0%BD%D0%BA).

301 redirect makes more sense in this case I think, as the normalized namespace name is permanent.

Also, we never redirect for User and User_Talk namespaces, because the localized namespace name might be different depending on the user's gender (Участник vs Участница in Russian) and we don't know the gender to make an informed decision. We just return the content and disable Varnish caching for those namespaces (as we can't properly purge all the possible URL variants the content was accessed through).

@phuedx just want to make sure this doesn't get lost in your spec.

Like the "links" discussion, I think that grouping these under a common dictionary will be a good approach as well (titles: display, denormalized, normalized, namespace)

Change 384737 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Add 'titles' object to MCS summary output

https://gerrit.wikimedia.org/r/384737

Change 384737 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add 'titles' object to MCS summary output

https://gerrit.wikimedia.org/r/384737

Change 389545 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Add summary 2.0 common titles object to swagger spec

https://gerrit.wikimedia.org/r/389545

Change 389545 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add summary 2.0 common titles object to swagger spec

https://gerrit.wikimedia.org/r/389545