Page MenuHomePhabricator

RFC: Introduce PageIdentity to be used instead of WikiPage
Open, MediumPublic

Description

  • Affected components: None immediately, but PageIdentity is designed to replace many uses of the WikiPage and Title classes. So eventually: much of MediaWiki.
  • Engineer(s) or team for initial implementation: @daniel / Platform Engineering
  • Code steward: Platform Engineering

Motivation

Create a light weight replacement for WikiPage and Title to be used to represent editable wiki pages.

The Title class is not only large and has many dependencies, it's also conceptually overburdened: it is used to represent link targets on the one hand, and editable pages on the other. It's also an "active record" that uses global state to interact with the storage backend, which is something we want to move away from.

Similarly, the WikiPage class contains database logic and business logic, and relies on global state.

A light-weight value object to represent wiki pages is generally useful for decoupling, e.g. T208764: Remove cyclic dependency between Title and User classes, and to for serialized representation of deferred events, see e.g. T212482: RFC: Evolve hook system to support "filters" and "actions" only .

Requirements
  • PageIdentity must make available to following: the page's internal ID, the namespace number, the title (db key), and the ID of the wiki it belongs to, for cross-wiki operations.
  • PageIdentity (or a subclass or implementation) should be newable
  • PageIdentity (or a subclass or implementation) should be serializable
  • PageIdentity (or a subclass or implementation) should be a immutable pure value object

Exploration

One major aspect of the design of PageIdentity is interoperability with and transition from the Title class.

Considerations
  • To easy transition, it would be useful if Title could implement PageIdentity. This allows parameters of existing methods to be narrows from Title to PageIdentity.
  • To easy transition, it would be useful if WikiPage could implement PageIdentity. This allows parameters of existing methods to be narrows from WikiPage to PageIdentity.
  • It would be desirable of any instance of PageIdentity was guaranteed to represent a "real" page (not a special page, not an interwiki link, not a section). However, this desire conflicts with the wish to have Title implement PageIdentity.
  • PageIdentity and LinkTarget are distinct. Not all LinkTargets correspond to a PageIdentity (e.g. SpecialPages, interwiki linkls, section links, etc). Every PageIdentity has a corresponding LinkTarget, but it may not be trivial to determin if the PageIdentity belongs to another wiki.
Proposal

Make PageIdentity an interface and have Title implement that interface. Provide a pure value implementation, PageIdentityValue, as an alternative.

Title would again have a static newFromPage() method that would instantiate a Title from a PageIdentity, which would return the object unchanged if that object already was a Title instance.

We have been successfully using this approach with UserIdentity and LinkTarget.

The signature and semantics of the PageIdentity interface will have to make some concessions to be compatible with Title. In particular, since a Title object can represent things that are not editable wiki pages but merely link targets (Special pages, interwiki links, relative section links), PageIdentity will have to allow for such "improper" instances, and will need to offer a method for calling code to check whether it's a "proper" PageIdentity.

To provide a patch towards providing stronger guarantees, provide an additional interface, ProperPageIdentity, that represents the stricter semantics that PageIdentity should eventually represent (specifically, that it is always a page that exists or can be created). Ideally, ProperPageIdentity will become an alias for PageIdentity in the future.

Draft:

interface PageIdentity {
  function getId(): int; // see Title::getArticleID().  Throws if it's not a "proper" page!
  function canExist(): bool; // see Title::canExist.  False if not a proper page. Taint, should eventually go away.
  function exists(): bool; // see Title::exists.  False if not a proper page.
  function getWikiId(): string; // see RevisionRecord::getWikiId()
  function getNamespace(): int; // see Title::getNamespace
  function getDBkey(): string; // see Title::getDBkey
}

interface ProperPageIdentity extends PageIdentity {
  function getId(): int; // see Title::getArticleID().  Never throws.
  function canExist(): bool; // Always true.  Should eventually go away.
}

class PageIdentityValue implements ProperPageIdentity {
...
}

class Title implements PageIdentity {
  ...
  public static function newFromPage( ?PageIdentity $pageIdentity ): ?Title;
  public function getWikiId(); // always false, since Title only supports the local wiki
  public function getId( $wikiId ); // Throws if canExist() returns false. Also throws if $wikiId mismatches.
  public function toPageIdentity(): ProperPageIdentity; // throws if not a proper page
}

class WikiPage implements PageIdentity {
  ...
  function canExist(): bool; // see Title::canExist. Always true (needs fixing of some edge cases)
  function getWikiId( $wikiId ): string; // see RevisionRecord::getWikiId(); throws if $wikiId mismatches.
  function getNamespace(): int; // see Title::getNamespace
  function getDBkey(): string; // see Title::getDBkey
}

class TitleValue implements LinkTarget {
  public static function newFromPage( PageIdentity $page ): TitleValue; // throws if not local
}
PageIdentity and PageRecord

With the proposal of PageStore (T195069) also comes the concept of a PageRecord object, representing page level meta data such as the page's latest revision, touch date, etc. PageRecord should by a subtype of PageIdentity (extending interface of implementing class).

Notes
  • The idea to require a wiki ID in getId() to ensure the ID refers to the correct wiki in the context of cross-wiki enabled code allows us to safely introduce the idea of non-local page identities to existing code. If it does not cause unexpected problems, the intent is to use the same pattern in classes representing other entities, such as revisions or users.
  • PageIdentity instances may be defined relative to the local wiki, using a special value for the wiki ID. While this works nicely with the way MediaWiki handles database connections to different wikis, it may prove problematic when constructing cache keys or serializing instances of PageIdentity for consumption in the context of another wiki.

Related Objects

StatusSubtypeAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenNone
OpenNone
OpenNone
StalledNone
OpenNone
DuplicateNone
ResolvedNone
OpenNone
OpenNone
OpenNone
OpenNone
ResolvedNone
ResolvedNone
OpenNone
StalledNone
OpenNone
OpenNone
Opendaniel
OpenNone
OpenNone
OpenNone
OpenNone
Resolveddaniel
Resolveddaniel
Opendaniel
Resolveddaniel
OpenNone
Opendaniel
Resolveddaniel
ResolvedJdforrester-WMF
ResolvedPRODUCTION ERRORUmherirrender
Resolveddaniel
Resolveddaniel
Resolveddaniel

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
daniel updated the task description. (Show Details)
daniel added a subscriber: Vedmaka.

I would suggest that instead of (or in addition to) having assertProperPage(), have a throwing cast method in Title e.g. castToPageIdentity(). It would be type hinted as returning PageIdentity, and if the Title is not a "proper" page, it would throw.

Moving to "tune" per discussion in today's TechCom meeting. I'll update the experimental patch as discussed.

If PageRecord represents the page row from the database, and PageIdentity only the page ID and the TitleValue (a subset) - then how would one obtain such a PageIdentity object typically, without having the page row?

If a given code path has only an integer page ID but doesn't know if it exists, what value does PageIdentify add over just passing that integer to whatever method it would pass the PageIdentify to?

What could a PageIdentify (safely) be used for, besides retreiving a PageRecord and besides things that can or need a LinkTarget? An example use case would help.

It seems like the factory that would produce a PageRecord would only need to accept either an integer page ID, or a LinkTarget (e.g. PageRecordFactory->newFromId/newFromTitle). And for the use case of making cross-wiki log action links, we need the LinkTarget either way (and the interwiki prefix, which PageIdentify does not hold). We generally don't store or query foreign page IDs across wikis. Is that something we want to start doing? Using page IDs publicly and cross wikis also touches on T20493.

Krinkle renamed this task from Introduce PageIdentity to be used instead of Title to RC: Introduce PageIdentity to be used instead of Title.May 20 2020, 8:44 PM
Krinkle renamed this task from RC: Introduce PageIdentity to be used instead of Title to RFC: Introduce PageIdentity to be used instead of Title.

I feel there is still unclarity on how will this look at the end when the untangling is done and how will we get there? Will Title at some point stop extending PageIdentity (which means we need to place a many lot of deprecation warnings on methods taking PageIdentity, or is the plan to kill Title completely and avoid deprecations on the methods themselves?

Are we going to do mass-replacement of Title -> PageIdentity, or is it done case by case basis? If case-by-case, is there now sufficient alternatives (PageIdentity, LinkTarget, PageRecord?) to choose from to form the best interface and guidance which one to choose?

The proposal does not seem to be up to date with the proposed patch. It seems quite clear that option 2 is the preferred way to go.

I think getWikiId should not be in the interface, unless there is an agreement that all MediaWiki code must support non-local titles (and definition of what that means). It's hard to remove later, but easy to add later.

I think getWikiId should not be in the interface, unless there is an agreement that all MediaWiki code must support non-local titles (and definition of what that means). It's hard to remove later, but easy to add later.

I see your point, but we do have getWikiId() on RevisionRecord, and I was hoping to replace the usage of Title in RevisionStore with PageIdentity. getWikiId() is needed for parity with RevisionRecord.

If we have a PageIdentity that does not have a getWikiId(), what's the intended semantics? Is it guaranteed to be local? Then it can't be used with RevisionStore. Or does it give no guarantee what wiki it belongs to? Then it can't be used safely at all...

Do you have an idea how to get out of this?

I guess the same question applies to RevisionRecord... is every code expected to handle non-local revisions? It seems to have the same problem that callers need to explicitly check whether it is local or not instead of explicitly opting in to it. There doesn't seem to be many callers for RevisionRecord::getWikiId. Is there e.g. a task that explains why it was added?

I guess the same question applies to RevisionRecord... is every code expected to handle non-local revisions?

Most code should be oblivious of where the revision came from. At least in theory, only code that directly access the database needs to be aware of this.

It seems to have the same problem that callers need to explicitly check whether it is local or not instead of explicitly opting in to it.

Opting in would mean we create types that guarantee that they are always local. That means we have to make breaking changes to our interface when making code interwiki-ready.

I understand the concern, the thing is that that we already have the problem that we don't know for sure which wiki a revision, or user, or title belongs to. What I'm trying to do is to make it explicit.

There doesn't seem to be many callers for RevisionRecord::getWikiId. Is there e.g. a task that explains why it was added?

The getter isn't called much, that is true. It exists mostly to reflect the fact that a revision's identity is given not just be the revision ID, but by the revision ID in the context of a specific wiki. The same is true for pages (and users as well - UserIdentity should also have a wiki id). You can find code relying on this capability by looking for callers of RevisionStoreFactory::getRevisionStore(). The only production use right now seems to be Wikibase, though things like global user pages could/should probably also use this.

It is my understanding that the strategic direction is to enable more cross-wiki operation. I have been trying to model the basic concepts in the wiki with that in mind.

But perhaps my approach was not the best. Perhaps we need two versions of page, revision, user, etc - once that is guaranteed to be local, and one that can be remote. But in that case, the remote version needs to expose a wiki ID, and the local version would be a specialization of that, so it would also expose a wiki ID. Or the two types would have to be mutually incompatible. The remote version can't be a subtype of the local version, since it removes a guarantee, and would thus violate LSP.

Krinkle changed the task status from Open to Stalled.Jul 29 2020, 8:33 PM
Krinkle triaged this task as Medium priority.
daniel renamed this task from RFC: Introduce PageIdentity to be used instead of Title to RFC: Introduce PageIdentity to be used instead of WikiPage.Nov 30 2020, 3:51 PM
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)

Change 644281 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] EXPERIMENT: Introduce PageRecord interface

https://gerrit.wikimedia.org/r/644281

Change 645036 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Make WikiPage a ProperPageIdentity

https://gerrit.wikimedia.org/r/645036

I might be a little late to the show, but would love to leave my first impressions anyway. I hope it's helpful. Generally, it makes me very happy to see us moving towards tiny, pure value objects like this one. This makes all code using these (instead of almighty god objects like WikiPage) sooooo much easier to maintain. 🥳

However:

  • What is "proper"? "Proper" means "correct", "appropriate". How can a page be "proper" in contrast to an "improper", "incorrect", "inappropriate" page? I find this more and more confusing the more I think about it. I still don't fully get it, but believe this is about a page that is guaranteed to exist. Can we please come up with more descriptive terminology for this?
  • When a "proper" page is a page that exists, how can an "identity" be proper? I mean, an identity doesn't "exist". A page exists. A page does have an identity. But that "identity" is an abstract concept that, well, identifies something that exists, but doesn't "exist" independent from that thing.
  • The proposal also contains a PageRecord in addition to ProperPageIdentity. My issue with this is that "record" sounds like it's something that is already stored ("recorded") in the database. This makes it sound more like what I described as "proper" before: something that is guaranteed to exist (in the database). PageRecord even contains getLatest() and getTouched(), i.e. this is indeed a page that is already stored. If this is true, what's the point of having two names for the same concept? And if it's not the same, can we please find names that make it more obvious what the difference is?
  • There is also PageRecordValue that sounds like it's the exact same as PageRecord, just with implementation. Serious question: Why to we need the separate interface? The only difference is the constructor. But the constructor doesn't matter when we pass a PageRecordValue around as a value object. Or to ask the question in a different way: Is there a point of using PageRecordValue as a type hint, or should type hints always mention the PageRecord interface? If this is true, what's the point of having both, when only one is useful? I'm afraid we will end with lots of code that constructs a PageRecordValue (because that's the only one we can construct), but is type-hinted to return a PageRecord (because that's the only one allowed in type hints). Can you see how this is confusing?
  • The draft contains newFromPage, but castFromPageIdentity. Can we please change this to newFromPageIdentity and make it follow the existing, well established naming scheme we have in so many classes? I checked, and as of now there is exactly 1 castFrom… method in core (which was a mistake, in my opinion), but hundreds of newFrom….
  • asPageIdentity is a naming scheme that exists – let's see – zero times in existing core code. Can we stick to an existing naming scheme? I believe the most common naming scheme is get…. I understand this is not perfect. "Get" is used for way to many different things. Still I believe the suggested at… is not so much better that it's worth the confusion.
  • newFromPage sounds like it can consume a Page or WikiPage object. Please make sure this works. Otherwise rename it to newFromPageIdentity.

I might be a little late to the show, but would love to leave my first impressions anyway. I hope it's helpful. Generally, it makes me very happy to see us moving towards tiny, pure value objects like this one. This makes all code using these (instead of almighty god objects like WikiPage) sooooo much easier to maintain. 🥳

However:

  • What is "proper"? "Proper" means "correct", "appropriate". How can a page be "proper" in contrast to an "improper", "incorrect", "inappropriate" page? I find this more and more confusing the more I think about it. I still don't fully get it, but believe this is about a page that is guaranteed to exist. Can we please come up with more descriptive terminology for this?

Not quite. A "proper" page is one that might exist. One that can be edited/created. Examples of "improper" pages are special pages, section links, and interwiki links.

Ideally, PageIdentity would always be "proper". But Title objects can represent "improper" things. So to allow Title to implement PageIdentity, we need to allow PageIdentity to represent improper things.

Conceptually, that just means that Title shouldn't implement PageIdentity. But it must, to make the transition feasible: If Title doesn't implement PageIdentity, we have to deprecate and replace all methods that take a Title. If Title however implements PageIdentity, we can narrow the signature without breaking compatibility for callers.

This is a key aspect of the migration strategy, and has been discussed extensively in the context of this RFC. We have also been using this approach with LinkTarget.

  • When a "proper" page is a page that exists, how can an "identity" be proper? I mean, an identity doesn't "exist". A page exists. A page does have an identity. But that "identity" is an abstract concept that, well, identifies something that exists, but doesn't "exist" independent from that thing.

ProperPageIdentity is the identity of a proper page, not a proper identity of a page. A proper page doesn't have to exist, but it must be possible to create it.

One we have migrated away from Title, PageIdentity will adopt the same guarantees as ProperPageIdentity. They could become aliases.

  • The proposal also contains a PageRecord in addition to ProperPageIdentity. My issue with this is that "record" sounds like it's something that is already stored ("recorded") in the database. This makes it sound more like what I described as "proper" before: something that is guaranteed to exist (in the database). PageRecord even contains getLatest() and getTouched(), i.e. this is indeed a page that is already stored. If this is true, what's the point of having two names for the same concept? And if it's not the same, can we please find names that make it more obvious what the difference is?

PageRecords exist (or used to exist). A ProperPageIdentity may not exist, but can be created. A PageIdentity should also be able to be created, but PageIdentity instances may be "improper" for backwards compatibility reasons.

  • There is also PageRecordValue that sounds like it's the exact same as PageRecord, just with implementation. Serious question: Why to we need the separate interface?

So that WikiPage can implement PageRecord. That way, we can migrate methods that take a PageRecord by narrowing the signature, without braking compatibility for callers.

Why then create a new value object, instead of sticking with WikiPage? WikiPage is overburdened, but more importantly, it's mutable, and not serializable. We need immutable serializable representations of the key entities in our system (e.g. revisions, pages, users, etc) so they can be used to communicate across process boundaries (for Jobs, and in the future for async hook handlers / event listeners).

The only difference is the constructor. But the constructor doesn't matter when we pass a PageRecordValue around as a value object. Or to ask the question in a different way: Is there a point of using PageRecordValue as a type hint, or should type hints always mention the PageRecord interface?

No, the type hint should be PageRecord. The point of having a separate interface is that for at least quite a while, there will be two implementations (PageRecordValue and WikiPage).

If this is true, what's the point of having both, when only one is useful? I'm afraid we will end with lots of code that constructs a PageRecordValue (because that's the only one we can construct), but is type-hinted to return a PageRecord (because that's the only one allowed in type hints). Can you see how this is confusing?

No, I think that's perfectly normal: interfaces are declared to return or expect an abstract type, only the implementation that instantiates the object has knowledge of the concrete type. Seems like the classic open/closed approach to me.

  • The draft contains newFromPage, but castFromPageIdentity. Can we please change this to newFromPageIdentity and make it follow the existing, well established naming scheme we have in so many classes? I checked, and as of now there is exactly 1 castFrom… method in core (which was a mistake, in my opinion), but hundreds of newFrom….

There are two distinctions that I think are useful:

  • castFrom is guaranteed to work for null. newFrom generally will not.
  • newFrom should always return a new instance (we have counter-examples, and they keep causing confusion). castFrom may return the parameter directly, if it already implements the desired interface.

I find it problematic to have a method called new... that may return the same object multiple times, or may return null.

  • asPageIdentity is a naming scheme that exists – let's see – zero times in existing core code. Can we stick to an existing naming scheme? I believe the most common naming scheme is get…. I understand this is not perfect. "Get" is used for way to many different things. Still I believe the suggested at… is not so much better that it's worth the confusion.

You are right that it's not currently used in core, but it's quite common in extensions, see https://codesearch.wmcloud.org/search/?q=function%20as%5BA-Z%5D&i=nope&files=&repos=. I don't think it's terribly surprising.

That said, what is used in core is the to... prefix. I'd be fine with using that. https://codesearch.wmcloud.org/core/?q=function%20to%5BA-Z%5D&i=nope&files=&repos=.

  • newFromPage sounds like it can consume a Page or WikiPage object. Please make sure this works. Otherwise rename it to newFromPageIdentity.

The idea is that it will work for WikiPage, since WikiPage will be implementing PageRecord, which extends ProperPageIdentity, which extends PageIdentity. It will also work for Title, since Title implements PageIdentity.

I expect that we will remove the Page soon. It's ill-conceived, and we have been working to remove usages for a while.

I have some questions about how it would work in practice, and what requirements it will satisfy and/or want to satisfy but think are not feasible. I'm not sure how to phase this, so I'll ask it by way of reflecting that I think we want, and you can reflect back. The below isn't a proposal, but rather a reflection of what I think you want.

Status quo

  • class Title is widely used to in code that takes a parameter to identify a wiki page, such as to obtain a page ID or latest revision ID etc. It is currently only capable of providing that data for the wiki of the current global context. Although as a LinkTarget it can represent a link to a different site and/or section heading. But, those are not always in-farm wikis, or even wikis at all. There is no code currently that supports an interwiki Title as a way of accessing or storing information about a wiki page for another wiki.
  • class LinkTarget is widely used to represent things we can link to by URL, typically as instructed through [[wikitext]].

Intermediary stage

Introduce a new interface and value class that can represent any wiki page, effectively a (wiki-id, page-id) tuple. Similar to the RevisionStore does already, we would use this in new code to support representing and interacting with any wiki regardless of global context.

What we want to be possible:

  • 📝 (Backwards compatible) Classes that access or modify pages on the local wiki, which currenly take Title, should continue to support taking Title as-is.
  • 📝 To upgrade one-at-a-time classes to become wiki-agnostic and once done for a given class, expose that feature by making it accept the Page ID value objects whilst still also supporting Title objects for local-wiki use cases. This means Title and "proper" PageIdentity need to share a common type.
  • 📝 To upgrade one-at-a-time callers of such classes to pass in PageIdentity themselves, e.g. slowly pushing out and up the Title->PageID casting from low-level and leaf classes to their callers etc.

With this so far, I think the following would suffice. There appears to be no need for a "proper" distinction or to have temporary additional methods like "canExist" on the new interface. One would only accept PageIdentity in a service class once it is internally wiki-agnostic. For brand new code that works only with PageIdentity it would be trivial to cast to Title when interacting with older code. If we really wanted to, we could inverse this and accept PageIdentity in services that are not wiki-agnostic but then you'd need to be verify careful in placing a conditional at the start of each method that throws for non-local PageIdentity. I don't see what the benefit of such false advertising would be, though. It seems easy enough to cast to Title in new code when interacting with an older service class.

// These are always "proper"
interface PageIdentity  {
  function getId(): int; // see Title::getArticleID(), may throw if implemented by Title
  function getWikiId(): string; // see RevisionRecord::getWikiId()
  function getNamespace(): int; // see Title::getNamespace
  function getDBkey(): string; // see Title::getDBkey
}
class PageIdentityValue implements PageIdentity {  }
class WikiPage implements PageIdentity {  }

class Title implements PageIdentity { 
  public function getId(); // like getArticleID(), but throws if canExist()  or exists() returns false.
  public function getWikiId(); // always local wiki, e.g. wfWikiID() or false
  public function asPageIdentity(): ProperPageIdentity; // throws if no getId()
  public static function newFromPageIdentity( ?PageIdentity $pageIdentity ): ?Title;
}

I think however there additional thinks we/you want:

  • To be able to create (new) classes that only support "proper" PageIdentity objects, where the potential non-properness of a Title thus does not need to be worried about. These new classes need to be easy to interact with from older code, which is why a casting method from Title to PageIdentity is needed, although this may of course throw, which the caller would be responsible for handling.

This clashes with the earlier need for a common type between Title and PageIdentity. So a distinction here seems needed. Either by calling out "unproperness" or by calling out "properness". I have a gut-feeling that calling out "unproperness" might work better here as I think we would otherwise need several migrations to complete this, e.g. another one when removing "Proper" from type hints. But, I'll continue your example as-is for now:

interface PageIdentity  {
  function getId(): int; // see Title::getArticleID(), may throw if implemented by Title
  function getWikiId(): string; // …
  function getNamespace(): int; // …
  function getDBkey(): string; // …
}
+ interface ProperPageIdentity extends PageIdentity  {
+   function getId(): int; // never throws
+ }
± class PageIdentityValue implements ProperPageIdentity {  }
± class WikiPage implements ProperPageIdentity {  }

class Title implements PageIdentity { 
  public function getId(); // may throw if non-local or non-existent
  public function getWikiId(); // …
  public function asPageIdentity(): ProperPageIdentity; // …
  public static function newFromPageIdentity( ?PageIdentity $pageIdentity ): ?Title;
}

This means new services could accept ProperPageIdentity only, and if a call to such service where then added to older code, it would cast from Title and handle any exception.

End of transition

At some point Title would become deprecated and eventually removed. After that, we can remove the @throws doc from PageIdentify::getId, and thus the two interfaces become fully identical not just in interface but also in behaviour. Perhaps we would then make ProperPageIdentity a native alias to PageIdentity so that everything is allowed to cross-talk, and we may want to discourage use of ProperPageIdentity as it would signify something no longer relevant. We might however not bother deprecating it and just keeping it around as alias, and doing a find-replace in the code we can find for clarity.


I wrote this in response to the TechCom meeting where I perceived a shortcoming where some objects coudl be passed in somewhere without good type checks to prevent problems where e.g. a foreign page woudl be wrongly read as local. However, I now no longer see this. I probably misunderstood it.

How does this look now? Does this miss anything? It matches exactly your proposed structure, except for a few methods I left out. However, I think the description of the transition is perhaps differently. I think you intended for the new types to be much more widely adopted ahead of time any conversions, which would be good to understand why and what need that addresses, and then I can point out the issues I see with that :)

This clashes with the earlier need for a common type between Title and PageIdentity. So a distinction here seems needed. Either by calling out "unproperness" or by calling out "properness". I have a gut-feeling that calling out "unproperness" might work better here as I think we would otherwise need several migrations to complete this, e.g. another one when removing "Proper" from type hints.

That is what I initially thought as well, but decided against it for two reasons:

  1. The (proper) PageIdentity would be a subtype of ImproperPageIdentity, since it is more specific, provides more guarantees; but it makes no sense to say that the proper PageIdentity "is a" ImproperPageIdentity. The opposite however works: if PageIdentity may or may not be proper, ProperPageIdentity "is a" PageIdentity.
  2. If we had Title implement ImproperPageIdentity, we would convert method signatures that current hint against Title to hint against ImproperPageIdentity. But eventually, we want them to only accept proper PageIdentity, and we want to remove ImproperPageIdentity, so a second migration step would be needed. If we changed method signatures from Title to PageIdentity, they could remain as they are when PageIdentity startes to provide additional guarantees.

At some point Title would become deprecated and eventually removed. After that, we can remove the @throws doc from PageIdentify::getId, and thus the two interfaces become fully identical not just in interface but also in behaviour. Perhaps we would then make ProperPageIdentity a native alias to PageIdentity so that everything is allowed to cross-talk, and we may want to discourage use of ProperPageIdentity as it would signify something no longer relevant. We might however not bother deprecating it and just keeping it around as alias, and doing a find-replace in the code we can find for clarity.

That was exactly the idea.

I wrote this in response to the TechCom meeting where I perceived a shortcoming where some objects coudl be passed in somewhere without good type checks to prevent problems where e.g. a foreign page woudl be wrongly read as local. However, I now no longer see this. I probably misunderstood it.

I think I can still see a problem:

  • Let's say you have PageRestrictionManager::getPageRestrictions( Title $page ) which uses Title::getArticleID()
  • You convert it to PageRestrictionManager::getPageRestrictions( PageIdentity $page ) by just replacing getArticleID() with getId(), without checking getWikiId().
  • Now you (from somewhere) get a PageIdentity that refers to different wiki, and pass it to getPageRestrictions()
  • getPageRestrictions() will now use a page ID that belongs to a different database, potentially causing data corruption. This kinf of confusion was the root cause of T260485: CentralAuth uses wrong actor ID when locally suppressing the user (CVE-2020-25869).

One way to address this would be to change getId() to be getId( $wikiId = false ). getId() would throw if called with a $wikiId different from the one returned by getWikiId(). This way, naive callers would get an exception from non-local PageIdentities, and wiki-aware callers would have an easy way to ensure they are looking at PageIdentities belonging to the wiki they expect.

How does this look now? Does this miss anything? It matches exactly your proposed structure, except for a few methods I left out. However, I think the description of the transition is perhaps differently. I think you intended for the new types to be much more widely adopted ahead of time any conversions, which would be good to understand why and what need that addresses, and then I can point out the issues I see with that :)

I hardly see any difference at all... Can you highlight them again for me?

I'm unable to fit this thing in my head to be able to judge how the migration would go. Some general thoughts:

I'm still not fully convinced that all code needs to become wiki-aware. I think it's only small minority where this makes sense, hence only that part of code should be affected. But I might be wrong: maybe this enables global templates/gadgets/whatelse which means that through parser and other parts a lot more code suddenly needs to become wiki-aware?

I think getting this kind of change fully right in one go is impossible. I think we are close to or at the point of good enough to start trying this out while explicitly stating that we may do some tweaks until it's marked stable (in the next MW release, presumably).

There are some concerns of tech debt being left behind/carried over:

  • Adding type declarations to params and return values of the interfaces
  • Fixing inconsistent, misleading or unclear names

Thanks a lot for the responses. Unfortunately it appears like I failed explaining some of my concerns.

Proper

"Proper" is a confusing word with a wide variety of meanings that doesn't explain what it's meant to represent in this context here. I looked around and found "persistable" being used in other projects for exactly this concept. Can we change this?

Cast

"Cast" implies a well-known concept (mostly known from C) that is somewhat different from what the method actually does. There is also the naming scheme "from…" Still a rare one, but much more often than "cast…". Can we follow this and just remove the confusing extra prefix?

RE "Cast", I agree castFromPageIdentity is confusing here. Following the MW idiom means we'd call it newFromPageIdentity, or given it has a verb already the shorer fromPageIdentity could work as well I suppose.

RE "Proper", I think in this case we don't want to limit ourselves to the persistable aspect per-se. Rather I think in context of this proposal, it is more the general validity of the object that is in question, and this is only the name for the transitional phase. It would not be part of the interface long-term. I do agree "Proper" is quite ambiguous and even more so in "international" English. Perhaps "Valid" would capture better what we are after. Caling it "Persistable" suggests that non-persisting is also supported, whereas that's not the case. The invalid ones will essentially throw upon significant use.

Thanks, Krinkle!

[…] we don't want to limit ourselves to the persistable aspect per-se […] in context of this proposal, it is more the general validity of the object […]

I can see this, thanks. While "valid" is somewhat ambiguous as well, I agree that "valid" is notably better than "proper".

The last item of discussion besides naming seems to be how we avoid code that is not cross-wiki aware from getting confused by PageIdentity instances that belong to another wiki.

I have made two patches exploring different approaches:

Please let me know which one you prefer, and why.

Although I dislike having what is essentially the same assert method in three places in the first approach ( https://gerrit.wikimedia.org/r/c/mediawiki/core/+/650126 ) I do feel that avoiding the extra complexity of the extra classes and methods in the second approach is worth it. I'm pretty okay with using runtime assertions for something like this, I don't see that as a real downside.

I'm still not fully convinced that all code needs to become wiki-aware. I think it's only small minority where this makes sense, hence only that part of code should be affected. But I might be wrong: maybe this enables global templates/gadgets/whatelse which means that through parser and other parts a lot more code suddenly needs to become wiki-aware?

Yes, that among other things is the idea.

Eventually, we'd want to be able to construct a MediaWikiServices container for another wiki. This would require all services to be cross-wiki aware.

I think getting this kind of change fully right in one go is impossible. I think we are close to or at the point of good enough to start trying this out while explicitly stating that we may do some tweaks until it's marked stable (in the next MW release, presumably).

If PageIdentity remains unstable in the next release, this means we can't use it in hook signatures. Since one of the reasons to introduce PageIdentity is to avoid exposing Title in hooks signatures, that would be annoying.

My hope is that we can keep PageIdentity stable for callers. It's not intended to ever be stable to be implemented by extensions, so we will be free to add methods and parameters.

There are some concerns of tech debt being left behind/carried over:

  • Adding type declarations to params and return values of the interfaces
  • Fixing inconsistent, misleading or unclear names

That's always a tricky one - on the one hand, it's nice to clean things up while you are touching them. On the other hand, it's better to not change too much at once.

I have been looking into addressing the concerns raised above. Here is what I have come up with:

castFrom...: we already have Title::castFromLinkTarget, so I'll go with Title::castFromPageIdentity for consistency. Using the new... prefix for a method that is not guaranteed to return a new instance seems like an anti-pattern to me. Note that Title::castFromLinkTarget and Title::newFromLinkTarget both exist and have different semantics. I'd be fine with simply from..., but then castFromLinkTarget also needs to be renamed, and having fromLinkTarget and newFromLinkTarget do different things seems confusing. The cast prefix makes the distinction clear.

The name "castFromLinkTarget" of course is also my fault. I suppose the concern here is that "cast" in programming typically refers to a re-interpretation of an object reference, so a cast resulting in a new object would be unexpected. However, I believe this kind of "converting" cast is not unheard of - if my vague memories serve me right, at least in C++ casts can involve constructor based or operator based conversions that result in a new object.

as...: I'm changing this to the more conventional to... prefix.

Proper...: I would have been ok to change from "proper" to "valid", but Title has an isValid() method that defines "validity" differently from the idea of a "proper" PageIdentity. This would lead to a situation where a "valid" Title is not a "valid" PageIdentity, even though Title implements PageIdentity (e.g. a link to a Special page is a "valid" Title but not a "valid" PageIdentity). That seems extremely confusing, so I'm sticking with "proper". The only alternative I can think of that would be more accurate would be something like NonBogusPageIdentity... I don't quite understand the concern with the adjective - Webster gives sense (8) as "marked by suitability, rightness, or appropriateness". What's wrong with that?

LocalPageIdentity: duplicating all interfaces and value classes seems to do more harm than good by ading complexity and potential confusion. In any case, we still need to assert programmatically that a given PageIdentity belongs to the expected wiki by checking the wiki ID. So I'm going with assertWiki( $wikiId ) and getId( $wikiId = false ) without the extra types.

As discussed in the TechCom meeting on Wednesday, this is entering the Last Call period. If no objections remain unaddressed by January 6, this RFC will be approved as proposed.

Sorry if I repeat arguments that have already been shared, but: Avoiding the word "valid" just because there is already Title::isValid() doesn't really change much. We still have two concepts of being valid. One is just named after what is literally a synonym for "valid". My problem is that neither of the two words – as far as I understand them – gives any hint at what the difference is. It's effectively as if both are named "valid".

We can't and shouldn't rename isValid(). But we do have the chance to pick a name for the new class that helps people understand better what it represents.

As far as I understand the conversation above everything named "proper" is meant to be temporary anyway. So does it hurt to give it a longer name? The contrary, I would argue. A longer, "uglier" name makes it more obvious that it's temporary.

Sorry if I repeat arguments that have already been shared, but: Avoiding the word "valid" just because there is already Title::isValid() doesn't really change much. We still have two concepts of being valid.

That is correct - It's a direct consequence of the fact that Title is used to represent two different things (link targets and pages). Introducing PageIdentity is intended to allow us to fix that.

As far as I understand the conversation above everything named "proper" is meant to be temporary anyway. So does it hurt to give it a longer name? The contrary, I would argue. A longer, "uglier" name makes it more obvious that it's temporary.

Sure. What longer, clearer name do you have in mind?

I don't know. I still don't understand what it represents. See, that's why I'm here, writing this. Please find a name that helps people understand what this represents.

I don't know. I still don't understand what it represents. See, that's why I'm here, writing this. Please find a name that helps people understand what this represents.

I have tried my best, but apparently wasn't successful. Now what?

I don't know. I still don't understand what it represents. See, that's why I'm here, writing this. Please find a name that helps people understand what this represents.

I have tried my best, but apparently wasn't successful. The best explanation I have is that ProperPageIdentity represents the same thing as PageIdentity: a page that contains user generated content (as opposed to a special page, a section, or an external link). You could also say: anything that has a revision history. The difference between ProperPageIdentity and PageIdentity is that, for a transitional period, a PageIdentity object may actually be something else (e.g. a special page or section) and will throw if you try to use it in a "pagey" way. Eventually, this dodgy behavior will go away, and PageIdentity will become synonymous with ProperPageIdentity.

This is admittedly not very pretty, but it's consistent with the way we currently make assumptions about Title objects in some parts of the code, expecting them to refer to actual pages, with undefined behavior if they don't. The reasoning behind using this approach to the transition is explain in the conversation on this ticket.

So, given all this - do you have a suggestion how we could find a better name? To me, ProperPageIdentity is just a PageIdentity that isn't "bad".

it's too bad that we can't use git-like language such as tree-ish, commit-ish etc, e.g. Page-ishIdentity :-) I suppose it could be called PagelikeIdentity... (Isn't bike-shedding fun?)

it's too bad that we can't use git-like language such as tree-ish, commit-ish etc, e.g. Page-ishIdentity :-) I suppose it could be called PagelikeIdentity... (Isn't bike-shedding fun?)

Finding a descriptive name for the "unclean" version of the interface would be easier, but it would mean that we have to change that name in all code that uses it once thw two interfaces become synonymous. The idea is that we can use the interface with the weaker guarantees as a drop-in for Title now, improving guarantees later without having to change that code again. This means that the weaker "unclean" interface has to use the simple/permanent name, and the "stronger" interface has to use the longer "special" name.

So, as alternatives to ProperPageIdentity, all I can come up with are things like TruePageIdentity or GuaranteedPageIdentity or YesReallyAnActualPageIdentity...

Change 654517 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] WIP: strict return types for PageIdentity

https://gerrit.wikimedia.org/r/654517

Following is a summary from @daniel and I going through some of the questions and misunderstandings I had around this proposal. The main questions I had, in no particular order:

  • Do we really need a (temporary) divide in the new interface for both "poor" and "proper" page identities? And if so, can we quantify what problem it solves and (if any) what debt or other downsides are we exchanging to solve that problem?
  • The poor/proper divide suggests we need two transitions (Title to poor PageId, and then to proper PageId). Ending up with "Proper" as the only implementation seems odd and thus likely means we need a third (and breaking) transition to drop the "proper" prefix. Is there a way we can avoid this?
  • How does this relate to the overall direction of ensuring new service classes are taking their wiki state from service wiring only and thus can be constructed for wiki B within a PHP process/request for wiki A. Does it mandate that transition before the "proper" version can be adopted inside a class? Or is it a step along the way? Or not really related?
  • What does the lifecycle of existing code look like through each of the transition phases? Specifically, what benefits are gained at each step, and which risks/footguns are we (temporarily) exposing and what can we do to minimise or eliminate those.
Why a "non-proper" interface at first

As I understand it now, the main reason for having a "non-proper" intermediary state is for backward compatibility during the migration. We can't atomically change the whole universe at once, and that's generally a bad idea even if we could. We are going to have to first make it so that methods accepting Title objects today can be updated such that they (also) accept PageIdentityValue objects. This means they need a common interface, and that interface can't (currently) promise the strong/proper guruantees PageIdentityValue. Specifically the guruantee that A: it represents a valid title that could actually exist as a wiki page on this wiki (so not a special page, anchor link, interwiki etc.), and B: that it actually exists right now and has a page ID etc.

Also important to note is that Daniel's proposal does not involve any dedicated "non-proper" interface or class. The only implementation that would be non-proper is the existing Title class. The interface will be considered mostly internal (only "Stable to type").

This means there aren't as many transitions as I feared. We transition from Title signatures to PageIdentity signatures, and that's it for most code. At some point Title will be deprecated and some time later removed. At that point PageIdentityValue (which is always proper) is the only PageIdentity implementation and thus can always be expected to always be proper. The point in time where we make the "breaking change" to PageIdentity from non-proper to always proper would not actually be breaking or observable for anyone since at that point Title will have been removed already.

The only places that will make mention of ProperPageIdentity are type hints in a relatively small amount of new servicees that may choose to exclusively accepts PageIdentityValue, for which PageIdentityValue is the interface that bans the non-proper Title objects. If and when you need to interact with such new service from existing code, you'd simply ask Title to create a PageIdentityValue for you and pass that in. The caller at that point would also be responsible for catching or checking any non-proper scenarios.

Relation to local vs cross-wiki support

In addition to the proper vs non-proper semantics, there is another important factor that feels related to this RFC, and that's the support for representing pages on other wikis. And while that brings a lot of joy, it also introduces a footgun where a developer might use a PageIdentity object incorrectly, e.g. treat the page ID number as being for the local wiki instead.

This is not an entirely new problem. We've had numerous issues in the past where a Title or TitleValue is misinterpreted accross wikis. Such as T204792 (see associated patch) where jam: … is usually an interwiki code for jam.wikipedia.org, unless you're currently on din.wikipedia.org where "Jam" is the local word for "Talk". Or T260485 (restricted) and similar issues where an Actor ID or User ID is wrongly interpreted in the wrong wiki's context.

I guess there are two concerns here to think about:

  1. How can we make sure that for code that supports both Title and PageIdentityValue, they are not required to suddenly add support for complex cross-wiki cases, but also fail in a clear and obvious way if an unsuspecting caller provides a non-local PageIdentityValue?
  1. For code that does intend to support cross-wiki scenarios, can we do something to ergonomically make this "easy" to get right, and "hard" to do wrong?

The answer to both of these is that PageIdentity::getId (as to be implemented by Title and PageIdentityValue) will take a $wikiId parameter, which the caller will have to provide with the wiki they think they are operating on. Service classes such as RevisionStore, CommentStore that can be constructed separately for different wikis, generally know which wiki ID or db domain they are instantiated for so this should generally be as easy as $page->getId( $this->wikiId ).

For the more common case where code only works for the current request/wiki, they can omit the parameter which means "local". I was mistaken in my fear that this would be fragile. Omitting the parameter does not mean it forgoes the check. Omitting means it will mandate that the wikiId is boolean false, which is the value we use in service wiring (incl rdbms) to represent "current wiki". We may want to change that at some point, in which case this parameter will have to become mandatory, and then "local"-only classes will need to store the global wiki ID somewhere and pass it in the same way as newer service classes do, but for now this works fine.

Migration story
  1. New features can choose to type against ProperPageIdentity and work exclusively with PageIdentityValue. When interacting with legacy code, they may need to use Title::newFromPage as-needed to provide for that.
  2. Old methods that require Title today can be progressively expanded to require PageIdentity instead, which covers both Title and PageIdentityValue. It will likely need changes in its implementation to first only use the subset of methods that the interface provides, as well as to rename some method calls such as getArticleId to getId (which is a method we will backport to the Title class for compliance, so you can safely use this in Title-only code as well).
    • At this point, new services that only use PageIdentityValue (typed as ProperPageIdentity) can start to interact with this migrated old method directly, without needing to create a Title.
  3. Once core and bundled/wmf extensions have undergone this transition, we can deprecate `Title.
  4. Once Title is removed, we can update the documentation for PageIdentity such as canExist() is always true. getId() will continue to be throw-able since it gets for wikiId correctness, althoguh it will no longer throw for pages not existing since Title was the only case of non-existent page identities.
    • At this time we can remove ProperPageIdentity and alias it to PageIdentity. It can be kept around as long as we like. We may want to find/replace it in core code so as to remove it from our day-to-day visibility, but otherwise fine. We can also remove any remnant calls to canExist(). We might want to deprecate/remove it after several LTSes, but there is no rush to do that. I would suggest we keep it around and instead wait for another future migration in this area to deprecate/remove so that there is no disruption in the form of fatal errors of undefined methods that we can avoid by simply keeping a one-line method around.

Change 589353 merged by jenkins-bot:
[mediawiki/core@master] Introduce PageIdentity interface

https://gerrit.wikimedia.org/r/589353

Approved after last call per Wednesday's TechCom meeting.

I'm keeping the ticket open until the attached patches are merged.

Change 645036 merged by jenkins-bot:
[mediawiki/core@master] Make WikiPage a ProperPageIdentity

https://gerrit.wikimedia.org/r/645036

Jdforrester-WMF changed the task status from Stalled to Open.Jan 15 2021, 6:01 PM

RfC approved, patches are landing -> clearly not Stalled.

Change 645036 merged by jenkins-bot:
[mediawiki/core@master] Make WikiPage a ProperPageIdentity

https://gerrit.wikimedia.org/r/645036

This patch broke FlaggedRevs, and consequently broke CI for all extensions (T272170). Reverted in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/656468

Change 656944 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/FlaggedRevs@master] Use FlaggableWikiPage only for real pages

https://gerrit.wikimedia.org/r/656944

Change 656906 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Re-apply "Make WikiPage a ProperPageIdentity"

https://gerrit.wikimedia.org/r/656906

Change 656948 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Re-apply "Define equality for PageIdentity and LinkTarget"

https://gerrit.wikimedia.org/r/656948

Change 656949 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] RevisionRecord: add getPage()

https://gerrit.wikimedia.org/r/656949

Change 657134 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] EXP: Make WikiPage a PageIdentity

https://gerrit.wikimedia.org/r/657134

Change 656949 merged by jenkins-bot:
[mediawiki/core@master] RevisionRecord: add getPage()

https://gerrit.wikimedia.org/r/656949

Change 654517 merged by jenkins-bot:
[mediawiki/core@master] Strict return types for PageIdentity

https://gerrit.wikimedia.org/r/654517

Change 656944 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Use FlaggableWikiPage only for real pages

https://gerrit.wikimedia.org/r/656944

Sorry I am too late, but I don't understand how this would work in all cases? There is indeed a need for a simpler page identification, specially for out-of-wiki references, that use a combination of wiki name, numerical namespace and title, which is very inefficient.

However, some of the usages commented here cannot be service by that. When a page gets moved (renamed) we can alter the title on the page identity only (that is a win). However, links to the title don't get automatically updated -as it happens currently- as they link to a title, not a page. This means a costly job would have to update now lots of pagelink records to point to a new identity (and it may not exist or be a different identity with the old -new- title), rather than the current behaviour, where they always point to the right title.

If PageIdentity and PageTitle are merged, how will an ever-changing link meaning would be handled, one where we link to a valid page one time, then to a non created one, and other time to an invalid one (per design), without doing mass edits on the database? In other words, if in the end only "ProperPageIdenties" survive, how to model non-existent title references? This affects pagelinks, but also other extensions- who says that the extension owner intended to gather pages and not titles in all cases?

If PageIdentity and PageTitle are merged, how will an ever-changing link meaning would be handled, one where we link to a valid page one time, then to a non created one, and other time to an invalid one (per design), without doing mass edits on the database? In other words, if in the end only "ProperPageIdenties" survive, how to model non-existent title references? This affects pagelinks, but also other extensions- who says that the extension owner intended to gather pages and not titles in all cases?

I think there is a misunderstanding here PageIdentity is solely about how we represent things in PHP, during a single request. Nothing changes in the database, and nothing changes between requests. Also, PageIdentity does not represent link targets. We have the LinkTarget interface for that. The idea behind this RFC is to, in PHP, make a clear distinction between "things we can link to" and "things that can be edited". In the database, this distinction has always existed.

We may in the future normalize titles (or maybe link targets) in the database. That's entirely independent of this RFC.

With respect to renames, there is one case that we need to be careful of: in the request that actually renames a page, we currently update the Title objects (interestingly, by changing their IDs, not by changing their names). With immutable PageIdentity objects, that is not possible. We have to be careful to ensure that code that runs after the rename has occurred in the database doesn't have stale PageIdentity instances.

But again, on the database level, nothing at all changes.

My apologies, my confusion arose from mixing LinkTarget and Title. Which means this is a good change because Title was confusing for me. I am happy that, despite my confussion, that brought some considerations to you. Sorry again.