Page MenuHomePhabricator

Introduce PageIdentity to be used instead of Title
Open, Needs TriagePublic

Description

Introducing a PageIdentity interface and a PageIdentityValue class would allow a lot of code to stop depending on the Title class, removing spurious dependencies dragged in by that. It also allows us to move towards using immutable, serializable value objects, instead of global-state-aware smart records. This is useful in the light of T212482: Evolve hook system to support "filters" and "actions" only and also T208764: Remove cyclic dependency between Title and User classes.

PageIdentity should be defined in a way that allows Title to implement it directly. It should be similar to LinkTarget (and could perhaps extend it), with the distinction that it also exposes a page ID. This (annoyingly) means that a PageIdentity does not always identify an entry in the page table - it may instead also refer to a non-existing page, a special page, or a foreign (interwiki) page. Another interface (or class), PageRecord, could in the future be used to represent things that live in the page table, and together with a PageStore could replace the WikiPage class. But that is beyond the scope of this proposal.

Open questions:

  • Should PageIdentity be a LinkTarget, or have a LinkTarget? Having a LinkTarget seems cleaner, but inconvenient, and potentially confusing because Title will be implementing both, LinkTarget and PageIdentity. On the other hand, the notion of "remoteness" is different for LinkTargets and PageIdentities.
  • What should the relationship between PageIdentity and PageRecord (see T195069) be? PageRecord should probably have a PageIdentity, not be a PageIdentity, so that the contract of PageRecord can guarantee that PageRecords always refer to entries in the page table (and perhaps also the archive table).
  • getArticleID is an akward method name to have in PageIdentity. It should have a getId() method instead. Should Title::getArticleID() be deprecated in favor of getId(), when Title implements PageIdentity? Is getId() a clear enough name in the context of Title?
  • How can the idea of a page "existing" be clarified in the contract if PageIdentity, while maintaining compatibility with the way exists(), canExist(), isSpecialPage() and isAlwaysKnown() work on Title?
  • What terminology should be used to represent the idea that a PageIdentity refers to a "real" page, rather than a "special" page or foreign link target? "article" is ambiguous...
  • What terminology should be used to identify PageIdentities that identify pages that belong to a different wiki, but one to which we still have direct database access via PageStore and RevisionStore, know the namespaces, etc? How doe4s that relate to isExternal() and getInterwiki() defined in LinkTarget?

Straw man interface proposal:

interface PageIdentity {
  function getId(): int; // 0 if not in the database (could be missing, or special, or interwiki, or...)
  function getLinkTarget(): LinkTarget; // or should this just extend LinkTarget?
  function getWikiId(): string; // or database domain? false for local? do we need isLocal()?
  function isCuratable: string; // it's a "real" page in a database we can access directly. Equivalent to Title::canExist()
}

Related Objects

StatusAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenNone
ResolvedAbit
OpenNone
OpenNone
OpenNone
OpenNone
DuplicateNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
Resolvedppelberg
ResolvedKrinkle
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
Opendaniel
OpenNone
OpenVedmaka

Event Timeline

daniel created this task.Nov 5 2018, 7:59 PM
Aklapper renamed this task from Introduce PageIdentiy to be used instead of Title to Introduce PageIdentity to be used instead of Title.Nov 5 2018, 11:32 PM
Aklapper updated the task description. (Show Details)
tosfos reassigned this task from daniel to Vedmaka.Mar 4 2019, 3:33 PM
tosfos added a subscriber: tosfos.
daniel updated the task description. (Show Details)Mar 18 2019, 2:32 PM

Tagging this as an RFC. This is going to be used throughout the codebase, and there are nitty gritty questions open, so it should see wide feedback.

The description needs some polishing for this to work as an RFC, but I think the motivation and idea are pretty clear already.

daniel updated the task description. (Show Details)Apr 3 2019, 7:17 PM

Pinging for input: @tstarling, @brion, @Krinkle, @Tgr, @Anomie. Your take on the open questions mentioned in the description would help me move this forward.

  • Should PageIdentity be a LinkTarget, or have a LinkTarget? Having a LinkTarget seems cleaner, but inconvenient, and potentially confusing because Title will be implementing both, LinkTarget and PageIdentity. On the other hand, the notion of "remoteness" is different for LinkTargets and PageIdentities.

That's a good question. This may be a good time to think through the whole hierarchy of things that Title represents.

Note that a long-term request from our DBAs is to create a titles table so we can assign an ID to titles that don't exist, so e.g. pagelinks doesn't have to repeat pl_namespace+pl_title in every row. We may want to take that into consideration when designing interfaces here.

  1. A "title" is basically a local namespace ID + text. Someday the titles table might give it an optional ID too.
    • I don't know if we really have/use a concept of "interwiki titles", having an interwiki prefix + namespace ID + title, since MediaWiki generally has no way to know the namespace name to ID mapping for a foreign wiki.
  2. A "link" is either a "local link" or an "interwiki link". All links have an optional fragment. This is LinkTarget (and also TitleValue, which turns out to be poorly named).
    1. A "local link" is a "title" plus the optional fragment.
    2. An "interwiki link" is not a "title". It's an interwiki ID, plus a string that theoretically represents a "title" on the remote site but in practice may actually be another (fragmentless) "link", plus the optional fragment.
  3. A "page" is a "title" that has actual content and a page_id.
  4. Then we have the Title class, which is used to represent all of these concepts and has a lot of extra functionality too.
    • We might want to look at whether any of those extra concepts are further "things" versus "services". At first glance,
      • Depending on what exactly PageRecord contains, it might have that too: content model, language, length.

I'm not sure where exactly PageIdentity fits into that. For a name "PageIdentity" I'd think it should be somewhere around #3, but what's the point of it being separate from "PageRecord" then? Your description is a lot closer to it being #1, in which case "TitleIdentity" would be a better name. But you're also conflating it with #2, at which point I think we're not so much fixing the confusion that is Title as just splitting that confusion across multiple interfaces. The proliferation of using LinkTarget in place of Title in the guise of removing dependencies on Title probably didn't really help that situation.

So what I'd propose would be to make a TitleIdentity that represents #1, and maybe also has an accessor for the page_id. We already have LinkTarget to represent #2. I hope we don't really need a separate "identity" interface for #3 versus just using PageRecord for that.

If we do need "interwiki titles", the only relation that makes sense to me between TitleIdentity and InterwikiTitleIdentity would be if TitleIdentity is an instance of InterwikiTitleIdentity that's hard-coded to be local, even though that seems backwards. The point is that you could pass a TitleIdentity to something expecting a InterwikiTitleIdentity, but explicitly not vice versa. I'm wary of using InterwikiTitleIdentity as the general-purpose class for something that's so tied to the database-level representation as it seems too open to accidentally assuming a local title when it's not.

LinkTarget might have a TitleIdentity when it's representing a "local link", but should probably not be one even though that's not terribly convenient. We should be careful about letting LinkTarget have an InterwikiTitleEntity, as something like "foundationsite:advocacy" can't have one (the target is not even a wiki) and something like "de:wikt:Wiktionary:Foo" must be fully parsed to wikiid=dewiktionary ns=4 title=Foo rather than say wikiid=dewiki ns=0 title=wikt:Wiktionary:Foo (which would also mean that every wiki would have to be able to know that "Wiktionary" is a name for ns4 on dewiktionary when parsing "de:wikt:Wiktionary:Foo").

  • What should the relationship between PageIdentity and PageRecord (see T195069) be? PageRecord should probably have a PageIdentity, not be a PageIdentity, so that the contract of PageRecord can guarantee that PageRecords always refer to entries in the page table (and perhaps also the archive table).

Yes, have would be the right relationship here.

I'm not sure about the "(and perhaps also the archive table)" bit, since the archive table really just contains revisions (+ "titles") rather than actual pages. But that's not really on-topic for this task.

  • getArticleID is an akward method name to have in PageIdentity. It should have a getId() method instead. Should Title::getArticleID() be deprecated in favor of getId(), when Title implements PageIdentity? Is getId() a clear enough name in the context of Title?

As things stand now, that wouldn't be bad. If we add that titles table in the future, then getId() would be a confusing name since we'd have to distinguish between the title_id and the page_id.

Another option would be to make it getPageId(), and have Title::getArticleId() be an alias. That leaves getTitleId() open for the titles table.

BTW, there's a parallel here with the "actor" concept. A registered user always has both a user_id and an actor_id, while an IP user or interwiki user doesn't have a user_id and only gets an actor_id when something needs to refer to it. Similarly, a "page" will always have a page_id and a title_id, while a "title" doesn't have a title_id until something needs to refer to it. Obviously not a perfect analogy though: we don't allow converting IP users or interwiki users into registered users or vice versa, while we will have to allow making and unmaking the "page" for a "title".

  • How can the idea of a page "existing" be clarified in the contract if PageIdentity, while maintaining compatibility with the way exists(), canExist(), isSpecialPage() and isAlwaysKnown() work on Title?

It seems pretty straightforward to me:

  • exists() → Generally, the "title" has a "page".
    • The 'TitleExists' hook is a bit of a hack for previewing a page creation to properly render as if the creation were completed. We could replace it with any other mechanism that lets previews, including from MediaWiki-extensions-TemplateSandbox, achieve that. That might mean we move the method from the value object to a service, I suppose.
  • isSpecialPage() → This is really just a shorthand for ->getNamespaceId() === NS_SPECIAL. Seems trivial.
  • canExist() → Another shorthand, for whether ->getNamespaceId() allows page creation. Seems trivial.
  • isAlwaysKnown() → I note the doc comment already points out that it's misleadingly named and says it's semi-deprecated. Probably we should fully deprecate it in favor of the service having isKnown().

What problems do you see?

  • What terminology should be used to represent the idea that a PageIdentity refers to a "real" page, rather than a "special" page or foreign link target? "article" is ambiguous...
  • What terminology should be used to identify PageIdentities that identify pages that belong to a different wiki, but one to which we still have direct database access via PageStore and RevisionStore, know the namespaces, etc? How doe4s that relate to isExternal() and getInterwiki() defined in LinkTarget?

I question the assumptions behind these questions, see above.

function isCuratable: string; // it's a "real" page in a database we can access directly. Equivalent to Title::canExist()

Why invent a fancy new name?

@Anomie thank you for your detailed response. At a first glance, I think I agree with most of what you say. But before we discuss the details, I'd like to raise one thing I learned from trying to introduce TitleValue:

If we don't define the new interface in a way that Title can directly implement, adaption will be slow, difficult, and error prone. So while in theory I would very much like clear cut concepts with strong guarantees, it seems we'll have to live with some dirty compromised. One consequence of this is that, if Title implements new interface Foo, a Foo instance can be anything a Title can be: an existing page, a non existing page, a special page, an interwiki link, etc. I find this very annoying, but I don't see a good way around it.

daniel moved this task from Inbox to Under discussion on the TechCom-RFC board.Apr 11 2019, 8:17 PM

That's a good point, but I'm afraid that making a "LinkTarget + page ID" interface isn't going to get us to a place that's better than just paring out some of the excess functionality out of Title while keeping Title itself.

I wonder if we can get someplace more sane by making internal subclasses of Title and privatizing some of its fields, since it looks like Title already isn't "newable". Have Title::makeTitle() return a LocalTitle (which implements TitleIdentity) or an InterwikiTitle (which doesn't) depending on whether it's local or not, and the same for the other factory methods.

The drawback there is that Phan probably wouldn't be able to know whether a Title actually does implement TitleIdentity when doing static analysis, we might have to add extra asserts or type declarations to make it happy.

@Anomie, my hope is that we can introduce narrower interfaces with weak guarantees now, and introduce extensions of such interfaces later that provide stronger guarantees. This may however result in somewhat awkward signatures (e.g. the sub-interface exposing an isLocal() method that it guarantees will always return true).

How would this help to remove the cyclic dependency between User and Title? I was looking at those classes last night, and the dependencies between them don't seem related to this proposal. There is User::isWatched() etc., which is already a wrapper around WatchedItemStore, so the dependency could be removed by just moving it fully to WatchedItemStore and deprecating the wrapper. Similarly Title::getUserPermissionsErrors() is now a deprecated wrapper for PermissionManager, and most of the other uses of User in Title could receive similar treatment.

User::isValidUserName() calls Title::newFromText() and then immediately extracts the canonical text form. This is apparently due to the incomplete prior round of abstraction, since it could apparently use MediaWikiTitleCodec instead.

Osnard added a subscriber: Osnard.Apr 12 2019, 12:19 PM

@tstarling This is not primarily motivated by the need to cut the dependencies between User and Title. Rather, it's motivated by the desire to have a light weight, directly constructable value object that basically wraps a page ID and also provides access to the page's namespace and title. Such a value object would implement PageIdentidy, and Title would also implement page identity, allowing for an easy transition.

The idea is that new MediaWiki core could would hint against PageIdentity instead of Title (or WikiPage) when it needs to know a page ID. This would allow us to resolve indirect dependencies between User and Title, e.g. via the planned RestrictionStore and BlockManager services, which User would delegate to (at least for now) and which contain code that deals with page IDs. Allowing these to function without the need to know about Title is the aspect of PageIdentity that ties this into the Tilte/User decoupling.