
Pages do not have stable identifiers
Closed, DeclinedPublic

Description

Problem
I was under the assumption that the pageid was a stable identifier (stable in the sense that it remains the same regardless of user action). However, that does not seem to be the case.

If I am a permissioned user and have a pageid I can request the title:
/api.php?action=query&format=json&pageids=4&formatversion=2

{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 4,
                "ns": 0,
                "title": "Gotham"
            }
        ]
    }
}

If I delete that page, I get a response that it's missing:
/api.php?action=query&format=json&pageids=4&formatversion=2

{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 4,
                "missing": true
            }
        ]
    }
}

While it's in the deleted state, the pageid does not exist (even if you have permission to see all of the deleted revisions).

If I restore the page, then all of a sudden it's back again:
/api.php?action=query&format=json&pageids=4&formatversion=2

{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 4,
                "ns": 0,
                "title": "Gotham"
            }
        ]
    }
}
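
The three responses above can be told apart with a small helper. This is only a minimal sketch that assumes the `formatversion=2` response shape shown above; the names `build_query_url` and `title_for_pageid` are made up for illustration and are not part of any MediaWiki client library.

```python
from urllib.parse import urlencode


def build_query_url(pageid, base="/api.php"):
    """Build the same query URL used in the examples above."""
    params = {
        "action": "query",
        "format": "json",
        "pageids": pageid,
        "formatversion": 2,
    }
    return base + "?" + urlencode(params)


def title_for_pageid(response, pageid):
    """Return the title for pageid, or None if the API reports it missing."""
    for page in response.get("query", {}).get("pages", []):
        if page.get("pageid") == pageid:
            # A deleted page comes back with "missing": true and no title.
            if page.get("missing"):
                return None
            return page.get("title")
    return None


# Sample responses copied from the description above.
live = {"batchcomplete": True,
        "query": {"pages": [{"pageid": 4, "ns": 0, "title": "Gotham"}]}}
deleted = {"batchcomplete": True,
           "query": {"pages": [{"pageid": 4, "missing": True}]}}
```

Note that a caller cannot distinguish "deleted" from "never existed" with this response alone, which is part of the problem described here.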

However, there are many ways in which restoring may not result in the same pageid.

The tricky part with trying to track pages across undeletion by the page_id is that you can get some unexpected situations:

  • Undeleting a subset of revisions, moving that page elsewhere, then undeleting the rest will assign a new page ID to the second batch even though they're at the old title.
  • Recreating the page at the same title will assign a new ID for the title.
    • Then undeleting the old revisions will keep that new page ID.
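
To make these scenarios concrete, here is a toy model of the behaviour described in the list above. It is not MediaWiki's implementation: the class `ToyWiki` and all of its methods are hypothetical, and it assumes that undeletion merges into whatever live page holds the title, and otherwise reuses the archived ID only when no live page currently has it.

```python
class ToyWiki:
    """Toy model: live revisions hang off a page ID, archived ones off a title."""

    def __init__(self):
        self.next_id = 1
        self.pages = {}    # title -> page_id for live pages
        self.live = {}     # page_id -> list of live revisions
        self.archive = {}  # title -> list of (old_page_id, revision)

    def _new_id(self):
        page_id = self.next_id
        self.next_id += 1
        return page_id

    def create(self, title, text):
        page_id = self._new_id()
        self.pages[title] = page_id
        self.live[page_id] = [text]
        return page_id

    def edit(self, title, text):
        self.live[self.pages[title]].append(text)

    def delete(self, title):
        page_id = self.pages.pop(title)
        revs = self.live.pop(page_id)
        self.archive.setdefault(title, []).extend((page_id, r) for r in revs)

    def move(self, old, new):
        # The page keeps its ID; only the title changes.
        self.pages[new] = self.pages.pop(old)

    def undelete(self, title, count=None):
        rows = self.archive.get(title, [])
        count = len(rows) if count is None else count
        restored = rows[:count]
        if not restored:
            raise ValueError("nothing to undelete at this title")
        self.archive[title] = rows[count:]
        if title in self.pages:
            # A live page already holds the title: merge into it.
            page_id = self.pages[title]
        else:
            old_id = restored[0][0]
            # Assumption: reuse the archived ID only if it is not in use.
            page_id = old_id if old_id not in self.live else self._new_id()
            self.pages[title] = page_id
            self.live[page_id] = []
        self.live[page_id].extend(r for _, r in restored)
        return page_id


# Scenario 1: partial undelete, move away, undelete the rest.
w = ToyWiki()
original = w.create("Gotham", "r1")
w.edit("Gotham", "r2")
w.delete("Gotham")
first_batch = w.undelete("Gotham", count=1)   # gets the original ID back
w.move("Gotham", "Metropolis")
second_batch = w.undelete("Gotham")           # original ID is taken, new ID

# Scenario 2: recreate at the same title, then undelete the old revisions.
w2 = ToyWiki()
old = w2.create("Gotham", "r1")
w2.delete("Gotham")
recreated = w2.create("Gotham", "new text")   # new ID for the same title
merged = w2.undelete("Gotham")                # old revisions join the new ID
```

In both scenarios the title "Gotham" ends up pointing at an ID different from the one originally stored, so a stored pageid no longer identifies what a user would call the same page.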

The problem is that we can store the pageid or the title in a database, but there isn't a way to ensure that in the future it still refers to the same page. I realize there is a deeper question here: if you change everything about a page, is it still the same page? I suppose I mean what users consider to be the same page. If a page can be deleted, restored, and moved and still be the same page, then it should have a stable id throughout all of those processes.

Solution
We could change our page deletion strategy from a hard delete (where the page is removed from the table) to a soft delete. This would involve adding a page_deleted column that would be either a nullable datetime recording when the page was deleted, or a boolean indicating whether or not the page is deleted. I think the former is better since it gives more information about the deletion.

This change would fix the API endpoints, as the page would no longer be missing (but perhaps the response should indicate that it has been deleted). If a user were to re-create the page, it would be recreated with the same id and its deleted status would be removed (although all of the existing revisions would remain deleted). Effectively, a deleted page is the same as a page with no revisions.
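
A rough sketch of the soft-delete idea, using an in-memory SQLite table as a stand-in for the real schema. The `page_deleted` column is the one proposed above; the table is heavily simplified (no namespace, no revisions), and the helpers `create_page`, `delete_page`, and `is_deleted` are hypothetical names used only for illustration.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page (
        page_id      INTEGER PRIMARY KEY,
        page_title   TEXT NOT NULL UNIQUE,
        page_deleted TEXT  -- NULL while live, ISO timestamp once deleted
    )
    """
)


def create_page(title):
    """Create a page, or revive a soft-deleted row so the ID is reused."""
    row = conn.execute(
        "SELECT page_id FROM page WHERE page_title = ?", (title,)
    ).fetchone()
    if row is None:
        cursor = conn.execute("INSERT INTO page (page_title) VALUES (?)", (title,))
        return cursor.lastrowid
    # The row survived deletion, so clearing the marker keeps the same ID.
    conn.execute("UPDATE page SET page_deleted = NULL WHERE page_id = ?", (row[0],))
    return row[0]


def delete_page(page_id):
    """Soft delete: keep the row and record when it was deleted."""
    stamp = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE page SET page_deleted = ? WHERE page_id = ?", (stamp, page_id)
    )


def is_deleted(page_id):
    row = conn.execute(
        "SELECT page_deleted FROM page WHERE page_id = ?", (page_id,)
    ).fetchone()
    return row is not None and row[0] is not None


gotham = create_page("Gotham")
delete_page(gotham)
deleted_state = is_deleted(gotham)
recreated = create_page("Gotham")
```

Because the row is never removed, a lookup by ID can report "deleted" instead of "missing", and recreation at the same title returns the original ID.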

Alternatively, if we don't want to change the way that page deletion works, we could just abstract this in the API. The API would basically also query for pages in the archive table. However, this adds a lot of overhead to API endpoints without actually fixing the underlying issue.

Use Cases

Work Around
Store the pageid and the title and assume that one or the other hasn't changed.
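
As a sketch of what this workaround might look like in practice: the function `resolve` and the `pages_by_id` mapping below are hypothetical, standing in for whatever lookup the consumer actually has available.

```python
def resolve(stored_id, stored_title, pages_by_id):
    """Resolve a stored (pageid, title) pair against the current live pages.

    pages_by_id maps live page IDs to their current titles.
    Returns (page_id, title, how), or None if neither half still matches.
    """
    # Prefer the ID: it survives a move, so only the title may have changed.
    if stored_id in pages_by_id:
        return stored_id, pages_by_id[stored_id], "by_id"
    # Fall back to the title: the ID may have changed after delete/recreate.
    for page_id, title in pages_by_id.items():
        if title == stored_title:
            return page_id, title, "by_title"
    return None


moved = resolve(4, "Gotham", {4: "Gotham City"})
recreated = resolve(4, "Gotham", {9: "Gotham"})
lost = resolve(4, "Gotham", {9: "Metropolis"})
```

This only works when at most one of the two values has changed. If the page was both moved and given a new ID, or if the ID or title now belongs to unrelated content, the lookup either fails or silently returns the wrong page.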

Event Timeline

This has nothing to do with the API. I'm not sure there's anything actionable here at all, but I'll give you the benefit of the doubt on that.

CommunityTechBot renamed this task from h8aaaaaaaa to Pages do not have stable identifiers. Jul 2 2018, 2:10 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Krinkle changed the task status from Open to Stalled. Jul 28 2018, 10:13 PM
Krinkle triaged this task as Low priority.
Krinkle moved this task from Untriaged to Schema changes on the MediaWiki-libs-Rdbms board.
Krinkle added a project: TechCom.
Krinkle moved this task from Schema changes to Usage problem on the MediaWiki-libs-Rdbms board.
Krinkle subscribed.

Marking stalled pending outcome of T20493.

In triaging this with TechCom, two things came up:

  • The concept of re-using IDs is not new; we already do it when restoring a page from the archive. Doing it on re-creation of a page as well seems sensible, although it is non-trivial with our current schema: it would require querying the archive table to find the old ID, there can in fact be more than one ID there so we'd have to decide which one to pick, and we would still be left with multiple IDs. If we go with this approach, it would make sense to normalise this first with an upgrade script.
  • The act of re-creating a deleted title is not the only thing to consider. There is also the act of renaming a page. We currently preserve IDs during a rename and that seems desirable given a page remains semantically the same. The problem is, however, that if the new title previously existed as a page, we would not re-use its ID, which would still leave us with multiple IDs for the same title.

There was consensus to postpone this until after T20493, which might end up solving it implicitly.

Krinkle moved this task from P1: Define to Old on the TechCom-RFC board.

I think this writeup misses the main problem: in MediaWiki, live revisions are associated with a page ID but deleted revisions are associated with a title. That allows for all kinds of weird things through the creative use of delete/undelete and page move: merging pages, splitting a page into multiple parts, transferring parts of a page history into another page, etc. This violates the assumption behind having a stable page identifier: that the page is an "atomic" building block that can be tracked over time as a single entity. Fixing that (if it should be fixed) is more of a UX / user expectations issue than a technical one.

  • The act of re-creating a deleted title is not the only thing to consider. There is also the act of renaming a page. We currently preserve IDs during a rename and that seems desirable given a page remains semantically the same. The problem is, however, that if the new title previously existed as a page, we would not re-use its ID, which would still leave us with multiple IDs for the same title.

My expectation would be that if you move a page, it would retain the id (as it does now), and if you recreate a page at the old title, a new id would be created for it (since it is a new entity and the old one still exists).

On second thought... recreating the page (whether by recreate or by using the same title) might have a different action depending on the state of the existing title.

If the title already exists in the database, then recreating the page with that title and restoring the page are effectively the same action. The only difference is that when recreating the page, the most recent revision would replace all of the existing content.

If you wanted to not have the revision history with a recreated page (why?) you'd have to move that revision history to a different title (we could do this automatically to continue the existing workflows). So for instance, if Saturn was deleted, and someone wanted to recreate it with a new revision history, the existing history/id would be moved to Saturn/Archive/1 or something like that. This would also allow an admin to restore a different revision history if they wanted to and easily track the number of times it's been deleted and what the pages looked like in each state.

If you wanted to not have the revision history with a recreated page (why?)

Maybe because the previous revision history was vandalism, a copyright violation, doxxing, a personal attack, or something along those lines? It certainly shouldn't be unhidden, and especially not by someone without the undelete right who happens to create a page at the new title.

you'd have to move that revision history to a different title (we could do this automatically to continue the existing workflows). So for instance, if Saturn was deleted, and someone wanted to recreate it with a new revision history, the existing history/id would be moved to Saturn/Archive/1 or something like that.

Note that the English Wikipedia's main article namespace doesn't have subpages as such. IMO it would be better to give wiki admins robust tools for splitting and merging (deleted) history where that's necessary rather than trying to automatically move things around.

Even better would be to conduct an actual consultation rather than relying on either of our individual opinions.

@Anomie sure! I didn't know mainspace doesn't have subpages. I totally understand that you wouldn't necessarily want revisions revealed, but we also have the concept of suppression to completely remove revisions if they contain information like in your examples. I suppose the whole deletion of pages/revisions should be reworked to a certain extent.

I suppose the whole deletion of pages/revisions should be reworked to a certain extent.

That's T20493.

Closing old RFC that is not yet on our 2020 process and does not appear to have an active owner. Feel free to re-open with our template or file a new one when that changes.


On the topic of this particular task, I don't think it currently describes a problem well for either end-users or developers. Page IDs are generally stable, but there are, and likely always will be, ways for the subject of an article to drift over time, at which point part of it will move and part of it will spawn a new page. Re-creations of the same topic under the same or a similar title also happen, and those are imho sensible to track with a different page ID since they are fresh revision histories.

Other issues that I think are better quantified and more actionable: