Page MenuHomePhabricator

RFC: Introduce PageIdentity to be used instead of Title
Open, Stalled, MediumPublic

Description

  • Affected components: None immediately, but PageIdentity is designed to replace many uses of the Title class. So eventually: much of MediaWiki.
  • Engineer(s) or team for initial implementation: @daniel / Platform Engineering
  • Code steward: Platform Engineering

Motivation

Create a light weight replacement for Title to be used to represent editable wiki pages.

The Title class is not only large and has many dependencies, it's also conceptually overburdened: it is used to represent link targets on the one hand, and editable pages on the other. It's also an "active record" that uses global state to interact with the storage backend, which is something we want to move away from.

A light-weight value object to represent wiki pages is generally useful for decoupling, see e.g. T212482: RFC: Evolve hook system to support "filters" and "actions" only and T208764: Remove cyclic dependency between Title and User classes.

Requirements
  • PageIdentity must make available to following: the page's internal ID, the namespace number, the title (db key), and the ID of the wiki it belongs to, for cross-wiki operations.
  • PageIdentity (or a subclass or implementation) should be newable
  • PageIdentity (or a subclass or implementation) should be serializable
  • PageIdentity (or a subclass or implementation) should be a immutable pure value object

Exploration

One major aspect of the design of PageIdentity is interoperability with and transition from the Title class. There is two approaches to this:

Option 1: composition

Make PageIdentity a value object, and let title contain an instance of PageIdentity, accessible via an asPageIdentity() method. Title would also have a static castFromPageIdentity() method that would instantiate a Title from a PageIdentity. For optimization, PageIdentity could have a reference to a Title "glued" to it, which can used by castFromPageIdentity() to avoid instantiating a new title on every call.

This approach allows the definition of PageIdentity to be entirely independent of Title. The new semantics and signatures can be defined without much though how they relate to the semantics of Title.

However, any method that takes a Title as a parameter will need to be deprecated and replaced and later removed, and every caller of such a method will have to be updated to use the new method. While this change is incomplete, there will be a lot of asPageIdentity() and castFromPageIdentity() calls, cluttering the code and causing overhead.

Draft:

interface PageIdentity {
  function getId(): int; // see Title::getArticleID()
  function getWikiId(): string; // see RevisionRecord::getWikiId()
  function exists() bool; // see Title::exists()
  function getNamespace(): int; // see Title::getNamespace
  function getName(): string; // see Title::getDBkey
}
Option 2: subtype

Make PageIdentity an interface and have Title implement that interface. Provide a pure value implementation, PageIdentityValue, as an alternative. Title would again have a static castFromPageIdentity() method that would instantiate a Title from a PageIdentity, which would return the object unchanged if that object already was a Title instance.

We have been successfully using this approach with UserIdentity and LinkTarget.

The signature and semantics of the PageIdentity interface will have to make some concessions to be compatible with Title. In particular, since a Title object can represent things that are not editable wiki pages but merely link targets (Special pages, interwiki links, relative section links), PageIdentity will have to allow for such "improper" instances, and will need to offer a method for calling code to check whether it's a "proper" PageIdentity. This is not very pretty, but no worse than the present situation, which has different parts of code use Title objects with different assumptions about what that Title objects represents.

For transitioning to the new type, we can now simply change type hints of paramters from Title to PageIdentity. This widens the type, and thus doesn't break calling code, and doesn't require deprecation (except for methods that are considered stable for overriding). Only methods that return a Title will need to be replaced. This makes for a much smoother transition, with less need to deprecate and remove methods.

Draft:

interface PageIdentity {
  function getId(): int; // see Title::getArticleID()
  function getWikiId(): string; // see RevisionRecord::getWikiId()
  function exists() bool; // see Title::exists()
  function getNamespace(): int; // see Title::getNamespace
  function getDBkey(): string; // see Title::getDBkey
  function assertProperPage(): bool; // Throws if canExist() returns false. Taint, to be deprecated.
}
PageIdentity and LinkTarget

The link target aspect of Title has long ago been factored out into the LinkTarget interface and TitleValue class. PageIdentity would now be introduced to represent the other aspect of Title. But how do the two relate?

To get a PageIdentity for a given LinkTarget, it will have to be looked up in the database - for now, by converting to a Title object via Title::castFromLinkTarget, and in the future using the PageStore service (T195069).

To make a LinkTarget from a given PageIdentity, TitleValue could have a static newFromPage() method. Or PageIdentity could have an asLinkTarget() method.

Alternatively, PageIdentity could extend the LinkTarget interface. In that case, PageIdentity would require getFragement() and getInterwiki() to always return null, getDBkey() to always return a non-empty string, and isExternal() to always return false. This would satisfy LSP, though it seems to reduce the LinkTarget interface significantly. A more narrow LocalLinkTarget interface sounds tempting, but a general LinkTarget cannot extend a more narrow LocalLinkTarget, since that would violate LSP (because "every LinkTarget is a LocalLinkTarget" is incorrect).

Conceptually, the question is how tightly we want to tie the ideas of LinkTarget and PageIdentity together. The represent rather different things, though saying that "every page is a link target" does make sense. Combining them closely is easy enough for local pages, but what about cross-wiki PageIdentities, that return something from getWikiId()? Should these also return the equivalent interwiki prefix from getInterwiki()? In what context would that interwiki prefix be valid - the local database, or the database indicated by getWIkiId()? Also, it seems odd for a PageIdentity to have a getFragment() method that is required to always return an empty string.

PageIdentity and PageRecord

With the proposal of PageStore (T195069) also comes the concept of a PageRecord object, representing page level meta data such as the page's latest revision, touch date, etc. PageRecord should by a subtype of PageIdentity (extending interface of implementing class).

Bike shed
  • What terminology should be used to represent the idea that a PageIdentity refers to a "real" page, rather than a "special" page or foreign link target? "article" is ambiguous...
  • What terminology should be used to identify PageIdentities that identify pages that belong to a different wiki, but one to which we still have direct database access via PageStore and RevisionStore, know the namespaces, etc? How does that relate to isExternal() and getInterwiki() defined in LinkTarget?

Related Objects

StatusSubtypeAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenNone
OpenNone
OpenNone
StalledNone
DuplicateNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
Resolvedppelberg
ResolvedKrinkle
OpenNone
OpenNone
OpenNone
StalledNone
OpenNone
OpenNone
Opendaniel
OpenNone
Stalleddaniel

Event Timeline

daniel created this task.Nov 5 2018, 7:59 PM
Aklapper renamed this task from Introduce PageIdentiy to be used instead of Title to Introduce PageIdentity to be used instead of Title.Nov 5 2018, 11:32 PM
Aklapper updated the task description. (Show Details)
tosfos reassigned this task from daniel to Vedmaka.Mar 4 2019, 3:33 PM
tosfos added a subscriber: tosfos.
daniel updated the task description. (Show Details)Mar 18 2019, 2:32 PM

Tagging this as an RFC. This is going to be used throughout the codebase, and there are nitty gritty questions open, so it should see wide feedback.

The description needs some polishing for this to work as an RFC, but I think the motivation and idea are pretty clear already.

daniel updated the task description. (Show Details)Apr 3 2019, 7:17 PM

Pinging for input: @tstarling, @brion, @Krinkle, @Tgr, @Anomie. Your take on the open questions mentioned in the description would help me move this forward.

  • Should PageIdentity be a LinkTarget, or have a LinkTarget? Having a LinkTarget seems cleaner, but inconvenient, and potentially confusing because Title will be implementing both, LinkTarget and PageIdentity. On the other hand, the notion of "remoteness" is different for LinkTargets and PageIdentities.

That's a good question. This may be a good time to think through the whole hierarchy of things that Title represents.

Note that a long-term request from our DBAs is to create a titles table so we can assign an ID to titles that don't exist, so e.g. pagelinks doesn't have to repeat pl_namespace+pl_title in every row. We may want to take that into consideration when designing interfaces here.

  1. A "title" is basically a local namespace ID + text. Someday the titles table might give it an optional ID too.
    • I don't know if we really have/use a concept of "interwiki titles", having an interwiki prefix + namespace ID + title, since MediaWiki generally has no way to know the namespace name to ID mapping for a foreign wiki.
  2. A "link" is either a "local link" or an "interwiki link". All links have an optional fragment. This is LinkTarget (and also TitleValue, which turns out to be poorly named).
    1. A "local link" is a "title" plus the optional fragment.
    2. An "interwiki link" is not a "title". It's an interwiki ID, plus a string that theoretically represents a "title" on the remote site but in practice may actually be another (fragmentless) "link", plus the optional fragment.
  3. A "page" is a "title" that has actual content and a page_id.
  4. Then we have the Title class, which is used to represent all of these concepts and has a lot of extra functionality too.
    • We might want to look at whether any of those extra concepts are further "things" versus "services". At first glance,
      • Depending on what exactly PageRecord contains, it might have that too: content model, language, length.

I'm not sure where exactly PageIdentity fits into that. For a name "PageIdentity" I'd think it should be somewhere around #3, but what's the point of it being separate from "PageRecord" then? Your description is a lot closer to it being #1, in which case "TitleIdentity" would be a better name. But you're also conflating it with #2, at which point I think we're not so much fixing the confusion that is Title as just splitting that confusion across multiple interfaces. The proliferation of using LinkTarget in place of Title in the guise of removing dependencies on Title probably didn't really help that situation.

So what I'd propose would be to make a TitleIdentity that represents #1, and maybe also has an accessor for the page_id. We already have LinkTarget to represent #2. I hope we don't really need a separate "identity" interface for #3 versus just using PageRecord for that.

If we do need "interwiki titles", the only relation that makes sense to me between TitleIdentity and InterwikiTitleIdentity would be if TitleIdentity is an instance of InterwikiTitleIdentity that's hard-coded to be local, even though that seems backwards. The point is that you could pass a TitleIdentity to something expecting a InterwikiTitleIdentity, but explicitly not vice versa. I'm wary of using InterwikiTitleIdentity as the general-purpose class for something that's so tied to the database-level representation as it seems too open to accidentally assuming a local title when it's not.

LinkTarget might have a TitleIdentity when it's representing a "local link", but should probably not be one even though that's not terribly convenient. We should be careful about letting LinkTarget have an InterwikiTitleEntity, as something like "foundationsite:advocacy" can't have one (the target is not even a wiki) and something like "de:wikt:Wiktionary:Foo" must be fully parsed to wikiid=dewiktionary ns=4 title=Foo rather than say wikiid=dewiki ns=0 title=wikt:Wiktionary:Foo (which would also mean that every wiki would have to be able to know that "Wiktionary" is a name for ns4 on dewiktionary when parsing "de:wikt:Wiktionary:Foo").

  • What should the relationship between PageIdentity and PageRecord (see T195069) be? PageRecord should probably have a PageIdentity, not be a PageIdentity, so that the contract of PageRecord can guarantee that PageRecords always refer to entries in the page table (and perhaps also the archive table).

Yes, have would be the right relationship here.

I'm not sure about the "(and perhaps also the archive table)" bit, since the archive table really just contains revisions (+ "titles") rather than actual pages. But that's not really on-topic for this task.

  • getArticleID is an akward method name to have in PageIdentity. It should have a getId() method instead. Should Title::getArticleID() be deprecated in favor of getId(), when Title implements PageIdentity? Is getId() a clear enough name in the context of Title?

As things stand now, that wouldn't be bad. If we add that titles table in the future, then getId() would be a confusing name since we'd have to distinguish between the title_id and the page_id.

Another option would be to make it getPageId(), and have Title::getArticleId() be an alias. That leaves getTitleId() open for the titles table.

BTW, there's a parallel here with the "actor" concept. A registered user always has both a user_id and an actor_id, while an IP user or interwiki user doesn't have a user_id and only gets an actor_id when something needs to refer to it. Similarly, a "page" will always have a page_id and a title_id, while a "title" doesn't have a title_id until something needs to refer to it. Obviously not a perfect analogy though: we don't allow converting IP users or interwiki users into registered users or vice versa, while we will have to allow making and unmaking the "page" for a "title".

  • How can the idea of a page "existing" be clarified in the contract if PageIdentity, while maintaining compatibility with the way exists(), canExist(), isSpecialPage() and isAlwaysKnown() work on Title?

It seems pretty straightforward to me:

  • exists() → Generally, the "title" has a "page".
    • The 'TitleExists' hook is a bit of a hack for previewing a page creation to properly render as if the creation were completed. We could replace it with any other mechanism that lets previews, including from MediaWiki-extensions-TemplateSandbox, achieve that. That might mean we move the method from the value object to a service, I suppose.
  • isSpecialPage() → This is really just a shorthand for ->getNamespaceId() === NS_SPECIAL. Seems trivial.
  • canExist() → Another shorthand, for whether ->getNamespaceId() allows page creation. Seems trivial.
  • isAlwaysKnown() → I note the doc comment already points out that it's misleadingly named and says it's semi-deprecated. Probably we should fully deprecate it in favor of the service having isKnown().

What problems do you see?

  • What terminology should be used to represent the idea that a PageIdentity refers to a "real" page, rather than a "special" page or foreign link target? "article" is ambiguous...
  • What terminology should be used to identify PageIdentities that identify pages that belong to a different wiki, but one to which we still have direct database access via PageStore and RevisionStore, know the namespaces, etc? How doe4s that relate to isExternal() and getInterwiki() defined in LinkTarget?

I question the assumptions behind these questions, see above.

function isCuratable: string; // it's a "real" page in a database we can access directly. Equivalent to Title::canExist()

Why invent a fancy new name?

@Anomie thank you for your detailed response. At a first glance, I think I agree with most of what you say. But before we discuss the details, I'd like to raise one thing I learned from trying to introduce TitleValue:

If we don't define the new interface in a way that Title can directly implement, adaption will be slow, difficult, and error prone. So while in theory I would very much like clear cut concepts with strong guarantees, it seems we'll have to live with some dirty compromised. One consequence of this is that, if Title implements new interface Foo, a Foo instance can be anything a Title can be: an existing page, a non existing page, a special page, an interwiki link, etc. I find this very annoying, but I don't see a good way around it.

That's a good point, but I'm afraid that making a "LinkTarget + page ID" interface isn't going to get us to a place that's better than just paring out some of the excess functionality out of Title while keeping Title itself.

I wonder if we can get someplace more sane by making internal subclasses of Title and privatizing some of its fields, since it looks like Title already isn't "newable". Have Title::makeTitle() return a LocalTitle (which implements TitleIdentity) or an InterwikiTitle (which doesn't) depending on whether it's local or not, and the same for the other factory methods.

The drawback there is that Phan probably wouldn't be able to know whether a Title actually does implement TitleIdentity when doing static analysis, we might have to add extra asserts or type declarations to make it happy.

@Anomie, my hope is that we can introduce narrower interfaces with weak guarantees now, and introduce extensions of such interfaces later that provide stronger guarantees. This may however result in somewhat awkward signatures (e.g. the sub-interface exposing an isLocal() method that it guarantees will always return true).

How would this help to remove the cyclic dependency between User and Title? I was looking at those classes last night, and the dependencies between them don't seem related to this proposal. There is User::isWatched() etc., which is already a wrapper around WatchedItemStore, so the dependency could be removed by just moving it fully to WatchedItemStore and deprecating the wrapper. Similarly Title::getUserPermissionsErrors() is now a deprecated wrapper for PermissionManager, and most of the other uses of User in Title could receive similar treatment.

User::isValidUserName() calls Title::newFromText() and then immediately extracts the canonical text form. This is apparently due to the incomplete prior round of abstraction, since it could apparently use MediaWikiTitleCodec instead.

Osnard added a subscriber: Osnard.Apr 12 2019, 12:19 PM

@tstarling This is not primarily motivated by the need to cut the dependencies between User and Title. Rather, it's motivated by the desire to have a light weight, directly constructable value object that basically wraps a page ID and also provides access to the page's namespace and title. Such a value object would implement PageIdentidy, and Title would also implement page identity, allowing for an easy transition.

The idea is that new MediaWiki core could would hint against PageIdentity instead of Title (or WikiPage) when it needs to know a page ID. This would allow us to resolve indirect dependencies between User and Title, e.g. via the planned RestrictionStore and BlockManager services, which User would delegate to (at least for now) and which contain code that deals with page IDs. Allowing these to function without the need to know about Title is the aspect of PageIdentity that ties this into the Tilte/User decoupling.

daniel added a comment.EditedMar 27 2020, 10:34 AM

It's unclear to me now whether we really need this in addition to PageRecord.

EDIT: I do think we need this. I have updated the task description since.

Change 589353 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Introduce PageIdentity interface

https://gerrit.wikimedia.org/r/589353

daniel updated the task description. (Show Details)Apr 17 2020, 8:57 AM
daniel updated the task description. (Show Details)Apr 17 2020, 9:03 AM
daniel updated the task description. (Show Details)Apr 17 2020, 10:22 AM
daniel removed Vedmaka as the assignee of this task.Apr 20 2020, 4:32 PM
daniel updated the task description. (Show Details)
daniel added a subscriber: Vedmaka.
daniel moved this task from P1: Define to P3: Explore on the TechCom-RFC board.Apr 27 2020, 9:25 AM
DannyS712 updated the task description. (Show Details)May 7 2020, 3:17 AM
DannyS712 added a subscriber: DannyS712.
Krinkle updated the task description. (Show Details)May 7 2020, 6:56 PM

I would suggest that instead of (or in addition to) having assertProperPage(), have a throwing cast method in Title e.g. castToPageIdentity(). It would be type hinted as returning PageIdentity, and if the Title is not a "proper" page, it would throw.

daniel moved this task from P3: Explore to P4: Tune on the TechCom-RFC board.May 13 2020, 9:11 PM

Moving to "tune" per discussion in today's TechCom meeting. I'll update the experimental patch as discussed.

Krinkle added a comment.EditedMay 13 2020, 9:15 PM

If PageRecord represents the page row from the database, and PageIdentity only the page ID and the TitleValue (a subset) - then how would one obtain such a PageIdentity object typically, without having the page row?

If a given code path has only an integer page ID but doesn't know if it exists, what value does PageIdentify add over just passing that integer to whatever method it would pass the PageIdentify to?

What could a PageIdentify (safely) be used for, besides retreiving a PageRecord and besides things that can or need a LinkTarget? An example use case would help.

It seems like the factory that would produce a PageRecord would only need to accept either an integer page ID, or a LinkTarget (e.g. PageRecordFactory->newFromId/newFromTitle). And for the use case of making cross-wiki log action links, we need the LinkTarget either way (and the interwiki prefix, which PageIdentify does not hold). We generally don't store or query foreign page IDs across wikis. Is that something we want to start doing? Using page IDs publicly and cross wikis also touches on T20493.

Krinkle renamed this task from Introduce PageIdentity to be used instead of Title to RC: Introduce PageIdentity to be used instead of Title.May 20 2020, 8:44 PM
Krinkle renamed this task from RC: Introduce PageIdentity to be used instead of Title to RFC: Introduce PageIdentity to be used instead of Title.

I feel there is still unclarity on how will this look at the end when the untangling is done and how will we get there? Will Title at some point stop extending PageIdentity (which means we need to place a many lot of deprecation warnings on methods taking PageIdentity, or is the plan to kill Title completely and avoid deprecations on the methods themselves?

Are we going to do mass-replacement of Title -> PageIdentity, or is it done case by case basis? If case-by-case, is there now sufficient alternatives (PageIdentity, LinkTarget, PageRecord?) to choose from to form the best interface and guidance which one to choose?

The proposal does not seem to be up to date with the proposed patch. It seems quite clear that option 2 is the preferred way to go.

I think getWikiId should not be in the interface, unless there is an agreement that all MediaWiki code must support non-local titles (and definition of what that means). It's hard to remove later, but easy to add later.

I think getWikiId should not be in the interface, unless there is an agreement that all MediaWiki code must support non-local titles (and definition of what that means). It's hard to remove later, but easy to add later.

I see your point, but we do have getWikiId() on RevisionRecord, and I was hoping to replace the usage of Title in RevisionStore with PageIdentity. getWikiId() is needed for parity with RevisionRecord.

If we have a PageIdentity that does not have a getWikiId(), what's the intended semantics? Is it guaranteed to be local? Then it can't be used with RevisionStore. Or does it give no guarantee what wiki it belongs to? Then it can't be used safely at all...

Do you have an idea how to get out of this?

I guess the same question applies to RevisionRecord... is every code expected to handle non-local revisions? It seems to have the same problem that callers need to explicitly check whether it is local or not instead of explicitly opting in to it. There doesn't seem to be many callers for RevisionRecord::getWikiId. Is there e.g. a task that explains why it was added?

I guess the same question applies to RevisionRecord... is every code expected to handle non-local revisions?

Most code should be oblivious of where the revision came from. At least in theory, only code that directly access the database needs to be aware of this.

It seems to have the same problem that callers need to explicitly check whether it is local or not instead of explicitly opting in to it.

Opting in would mean we create types that guarantee that they are always local. That means we have to make breaking changes to our interface when making code interwiki-ready.

I understand the concern, the thing is that that we already have the problem that we don't know for sure which wiki a revision, or user, or title belongs to. What I'm trying to do is to make it explicit.

There doesn't seem to be many callers for RevisionRecord::getWikiId. Is there e.g. a task that explains why it was added?

The getter isn't called much, that is true. It exists mostly to reflect the fact that a revision's identity is given not just be the revision ID, but by the revision ID in the context of a specific wiki. The same is true for pages (and users as well - UserIdentity should also have a wiki id). You can find code relying on this capability by looking for callers of RevisionStoreFactory::getRevisionStore(). The only production use right now seems to be Wikibase, though things like global user pages could/should probably also use this.

It is my understanding that the strategic direction is to enable more cross-wiki operation. I have been trying to model the basic concepts in the wiki with that in mind.

But perhaps my approach was not the best. Perhaps we need two versions of page, revision, user, etc - once that is guaranteed to be local, and one that can be remote. But in that case, the remote version needs to expose a wiki ID, and the local version would be a specialization of that, so it would also expose a wiki ID. Or the two types would have to be mutually incompatible. The remote version can't be a subtype of the local version, since it removes a guarantee, and would thus violate LSP.

Krinkle changed the task status from Open to Stalled.Jul 29 2020, 8:33 PM
Krinkle assigned this task to daniel.Jul 30 2020, 12:41 AM
Krinkle triaged this task as Medium priority.
Aklapper removed a subscriber: Anomie.Fri, Oct 16, 5:02 PM