
Figure out long-term page rename handling and page history strategy, especially for old revisions
Closed, Invalid; Public

Description

For queries for a title & revision, we currently don't check whether the revision's title (part of the API response) actually corresponds to the requested title.

For the case where it does not match, @mobrovac came up with the great idea of redirecting to the proper title (and storing the revision table entry under the actual title).
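A minimal sketch of that check, in Python for brevity. The function name and the `/page/{title}/{revision}` URL layout are assumptions made up for illustration, not RESTBase's actual code or routes:

```python
from urllib.parse import quote


def check_revision_title(requested_title, revision_title, revision_id):
    """Compare the requested title with the title from the revision metadata.

    Returns (status, location): (200, None) when the titles agree, otherwise
    a 301 pointing at the title the revision actually belongs to.
    """
    if requested_title == revision_title:
        return 200, None
    # Hypothetical URL layout, used only for this example.
    location = "/page/{}/{}".format(quote(revision_title, safe=""), revision_id)
    return 301, location
```

With this, a request for 'cat' whose revision metadata reports 'Cat' would be redirected instead of being stored under the wrong title.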

For old revisions of pages that were renamed, this will result in revision entries only under the latest title. This is mildly in conflict with RESTBase's policy of providing cite-able stable URLs to old revisions, which don't change with renames. Fixing this likely involves walking the history backwards to find rename points, and populating the page_revisions table with the resulting information. This is not entirely trivial, so I think we should defer this for now. In the meantime, new entries for current revisions will be stored at the accurate locations.
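To give a rough idea of the history walk, here is a sketch in Python. It assumes we can already obtain, for each revision, the title the page had at that time (for example from the move log); how to get that information is exactly the non-trivial part and is not shown. The input shape and the function name are hypothetical:

```python
def find_rename_points(history):
    """Walk a page history backwards and report where the title changed.

    `history` is a list of (revision_id, title_at_that_revision) pairs,
    newest first. Returns a list of (revision_id, old_title, new_title),
    where revision_id is the first revision carrying the new title.
    """
    renames = []
    # Compare each revision with the one immediately before it.
    for (newer_rev, newer_title), (_, older_title) in zip(history, history[1:]):
        if newer_title != older_title:
            renames.append((newer_rev, older_title, newer_title))
    return renames
```

The resulting rename points would then be enough to decide under which title each old revision should be entered in the page_revisions table.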

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke subscribed.
GWicke renamed this task from "Double-check what happens if title & title corresponding to revision don't agree" to "Handle mis-matches between title & revision title". Jan 24 2015, 11:03 PM
GWicke updated the task description.
GWicke set Security to None.
GWicke added a subscriber: mobrovac.

I think we also have a potential conflict here. I completely agree that RESTBase should provide citable, stable API URIs. However, changing a page's title falls into a grey area, to some extent. If a request comes in for a page whose title has been changed, we have (basically) two choices:

  1. Provide the latest version of the page with the old title. This decision implies the user is aware of (or is somehow able to obtain) the information about the page title change.
  2. Provide the latest version of the page with the new title (which might be accomplished via redirection). Here, on the other hand, we lose to some extent URI stability (albeit only in edge-case scenarios).

In either case, however, I think we are going to have to walk through history. Imagine that for a page with some revisions (r1, r2, ..., rn) we change the title in revision rt (t < n) from titleA to titleB. Then, there is a request for titleB with revision rq (q < t). We then have the same two aforementioned choices.

So, maybe the solution would be to use page IDs internally and map page titles (which can change through history) to them?
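Something along these lines, as a rough Python sketch. The class and method names are invented for illustration, and it assumes a title only ever belonged to one page:

```python
class PageTitleHistory:
    """Map titles to page ids, and (page id, revision) to the title in use."""

    def __init__(self):
        self._titles = {}          # page_id -> sorted list of (first_rev, title)
        self._page_for_title = {}  # any current or former title -> page_id

    def record_title(self, page_id, first_rev, title):
        """Note that `title` applies to `page_id` starting at `first_rev`."""
        entries = self._titles.setdefault(page_id, [])
        entries.append((first_rev, title))
        entries.sort()
        self._page_for_title[title] = page_id

    def page_id_for(self, title):
        return self._page_for_title.get(title)

    def title_at(self, page_id, rev):
        """Return the title the page had at revision `rev`, or None."""
        current = None
        for first_rev, title in self._titles.get(page_id, []):
            if first_rev > rev:
                break
            current = title
        return current
```

For the example above, a request for titleB with revision rq (q < t) would resolve titleB to the page id, and `title_at` would return titleA for rq.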

@mobrovac, we can translate revision ids to the correct title at the time, and redirect requests for the revision using other names to that title + revision. Building the correct index for old revisions will involve a history walk, which is some work but not impossible either.

@mobrovac addressed the most important issue with https://github.com/wikimedia/restbase/commit/a06695e9aa807a23854d51a1ad84e70f4184b51d

Let's check if there are other cases we should handle, but I believe this is good enough to unblock the release.

This was further fixed by https://github.com/wikimedia/restbase/commit/efc2b2c4c386af6ab22fb13868e26dea0cdb4fee.

Revisions are now only stored using their *current* canonical title as returned in the revision metadata. This means that requests for old revisions of renamed pages will all be saved using the current name of that page, which isn't quite historically accurate. Changing this will require some more work, but is also not very high priority.

On second thought, I have doubts about the usefulness of redirecting the client to the proper location. With clients that follow redirects automatically (which many do), the fact that the redirect happened is often not easily recognizable for a consumer.

Throwing an error is another option, although it is a fairly draconian measure if all you got wrong is the capitalization (asked for 'cat' instead of 'Cat'). There is also the issue of page renames. I would not be surprised if revision 1 was actually the Main Page at the time, but was then renamed to Wikipedia:Wikipedians later. So, more to think about here.

> On second thought, I have doubts about the usefulness of redirecting the client to the proper location. With clients that follow redirects automatically (which many do), the fact that the redirect happened is often not easily recognizable for a consumer.

Clients should, however, honour responses, and if they encounter 301, that should be recorded by them and used in subsequent requests.

> Throwing an error is another option, although it is a fairly draconian measure if all you got wrong is the capitalization (asked for 'cat' instead of 'Cat').

Even if the title were completely changed, I don't think throwing an error would be appropriate. IMHO, you describe it correctly as draconian 😄

> There is also the issue of page renames. I would not be surprised if revision 1 was actually the Main Page at the time, but was then renamed to Wikipedia:Wikipedians later. So, more to think about here.

When I saw the task being filed, that was my first thought.

GWicke renamed this task from "Handle mis-matches between title & revision title" to "Figure out long-term page rename handling and page history strategy, especially for old revisions". Mar 15 2015, 4:17 PM
GWicke triaged this task as Medium priority.

> Clients should, however, honour responses, and if they encounter 301, that should be recorded by them and used in subsequent requests.

Yeah, but this doesn't happen that often in practice (at least not in Python / PHP / node / curl etc. clients), and the actual user (as in the person using a browser, the author of a script, or both) won't normally notice. Even if they wanted to find out, it's often hard to get information on whether a redirect happened behind the scenes. Effectively, the information available to the user is very similar to what we expose right now: basically none.
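To illustrate how much a client has to opt in: with Python's standard `urllib.request`, redirects are followed silently unless you install your own handler. A minimal sketch (the `RedirectRecorder` and `fetch_with_redirect_log` names are made up for this example):

```python
import urllib.request


class RedirectRecorder(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember every hop."""

    def __init__(self):
        self.seen = []  # list of (status_code, from_url, to_url)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.seen.append((code, req.full_url, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def fetch_with_redirect_log(url):
    """Fetch `url` and return (body, list of redirects followed on the way)."""
    recorder = RedirectRecorder()
    opener = urllib.request.build_opener(recorder)
    with opener.open(url) as response:
        return response.read(), recorder.seen
```

Very few scripts bother with anything like this, so in practice a 301 is invisible to most consumers.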

Pchelolo subscribed.

Given that we only store the latest revision now, this is no longer a problem.