Page MenuHomePhabricator

Deprecate and remove the public title/{title} endpoint
Closed, ResolvedPublic

Description

The /page/title/{title} endpoint provides metadata about the pages.

Right now it doesn't even use the page_revisions table in the storage, because we didn't do rerenders on global user blocks (see T120212) so to avoid exposing potentially private data in usernames we rely on MediaWiki API access checks. The page_revisions table is used for access checks for other endpoints, but that might change in the future with the introduction of the revision table.

The endpoint is used though for updating the revision table state for page deletes, moves and undeletes, but we can as well notify RESTBase about those by requesting the HTML with some special header or allow ChangeProp to make requests to /sys hierarchy and remove the public access to the endpoint.

We are unaware of any major user of the endpoint inside or outside of the WMF, and according to monitoring over the last 3 months the external request rate was 0 except one burst of requests probably made by some bot.

The response schema specified in the documentation is the following:

{
  "count": 0,
  "items": {
    "title": "string",
    "page_id": 0,
    "rev": 0,
    "tid": "string",
    "latest_rev": 0,
    "latest_tid": "string",
    "nextrev_tid": "string",
    "comment": "string",
    "renames": [
      "string"
    ],
    "restrictions": [
      "string"
    ],
    "tags": [
      "string"
    ],
    "user_id": 0,
    "user_text": "string",
    "timestamp": "2017-02-14T18:35:28.602Z",
    "redirect": true
  }
}

and it's supported only partially, for example we don't really expose the latest_rev field and a couple others are not supported either.

As a part of releasing REST API 1.0 we need to decide what to do with the endpoint. Options are:

  1. Add proper support for all specified fields or remove unsupported fields from the schema, mark endpoint stable.
  2. Delete the endpoint. In case we delete it the /page/revision/{revision} will go away too together with a listing of revisions at /page/revision/. The listings have been proven impossible to support with Cassandra, and the /revision/{revision} endpoint is practically the same as the /title/{title} endpoint. Also, it's the only one using secondary indexes in our system, so we might be able to simplify the system a bit and remove the usage of secondary indexes - less key spaces in Cassandra, less overall load on updating, less range queries.

I would personally vote for solution 2 - delete unused stuff. @GWicke @mobrovac @Eevans what's your vote on this?

Event Timeline

Pchelolo created this task.Feb 14 2017, 6:58 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 14 2017, 6:58 PM

One more option is to leave the endpoints for updates just hide it from the public merged spec.

I think it's useful to have the title -> (pageid, revision) connection. It is true that it is not used right now and it somewhat incomplete, but I don't think that's a good reason to remove it. Not being used now, does not mean people will not leverage it in the future, does it? What is your concern with keeping it? You mention purely implementation details, but I think that the conversation should be about the API endpoint, not its underlying implementation.

I think it's useful to have the title -> (pageid, revision) connection. It is true that it is not used right now and it somewhat incomplete, but I don't think that's a good reason to remove it. Not being used now, does not mean people will not leverage it in the future, does it? What is your concern with keeping it? You mention purely implementation details, but I think that the conversation should be about the API endpoint, not its underlying implementation.

Right now, when we're looking at all APIs for stability, is a great moment to reflect on this, because if we mark it as Stable - we're stuck with it for life. So I mostly wanna gather opinions on this and make a final decision: either go through the response schema, document and mark stable or delete. Don't want it to stay in experimental for another year.

Deleting (or possibly reimplementing as a simple proxy to MW API) it would provide a very significant simplification for the system:

The grant goal of the latest storage project is to get rid of the range read and update queries in revision retention policy. If we do that, secondary indexes are the only thing that would do range read then update queries. /revision/{revision} is the only place in the whole system where we use secondary indexes, so if we delete (or reimplement in a way that just proxies to MW API) that endpoint, we'd likely be able to delete support for secondary indexes completely.

The restriction table work for access checks has been somewhat stalled with other priorities, but eventual goal is to get it running. In case we do that, the page_revisions table will be not used at all, nighter by the API in question, nor for access checks. So, we might as well want to remove it.

But yes, you're right that's implementation details I'm mixing into the API discussion..

I'm fine with removing these end points from the public API until we have a better idea of how this should work. We can always re-add it later, and until then we can save some maintenance effort (and secondary indexes). We should be able to figure something out for the private updates, possibly with a restriction to 10.* IPs, and still exposing the title end point internally (without it showing up in the public spec).

My second preference would be to keep them around as experimental. However, since they don't seem to be useful to anybody right now the benefit of that seems to be limited.

Good points guys! Indeed, deleting it and then resurrecting it seems feasible.

As for the secret endpoint, instead of playing with IP address ranges, I think we should just set up a hidden domain in RESTBase, i.e. one that cannot be accessed from outside of production, and use the endpoint there. Something like restbase.svc.eqiad.wmnet:7231/{domain:private}/v1/{target-domain}/page/title/{title}.

As for the secret endpoint, instead of playing with IP address ranges, I think we should just set up a hidden domain in RESTBase, i.e. one that cannot be accessed from outside of production, and use the endpoint there. Something like restbase.svc.eqiad.wmnet:7231/{domain:private}/v1/{target-domain}/page/title/{title}.

Isn't that enough to just remove it from public docs? It's not security by obscurity, it's more like 'lack of advertisement'..

As for the secret endpoint, instead of playing with IP address ranges, I think we should just set up a hidden domain in RESTBase, i.e. one that cannot be accessed from outside of production, and use the endpoint there. Something like restbase.svc.eqiad.wmnet:7231/{domain:private}/v1/{target-domain}/page/title/{title}.

Isn't that enough to just remove it from public docs? It's not security by obscurity, it's more like 'lack of advertisement'..

IMHO, if we go with public removal, then we go with it truly. Removing it from the docs does not prevent people from starting to use it ;)

Okey, lemme try to prototype this.. I'll do it in a separate PR from the general stability bump PR (https://github.com/wikimedia/restbase/pull/764) for possibly of easy revert.

So, the overall agreement is to delete (hide) the following endpoints:

  1. /page/title/
  2. /page/title/{title}
  3. /page/title/{title}/
  4. /page/revision/
  5. /page/revision/{revision}

We need to send an announcement to the appropriate mailing list ASAP. I propose to set the sunset date to May 01 2017 since according to grafana there's no real usage of the endpoints.

Pchelolo renamed this task from Decide on the future of the public title/{title} endpoint to Deprecate and remove the public title/{title} endpoint.Feb 15 2017, 12:43 AM
Pchelolo triaged this task as Medium priority.
Pchelolo moved this task from doing to blocked on the Services board.Feb 23 2017, 11:32 PM
Pchelolo edited projects, added Services (blocked); removed Services (doing).

Deprecation was announced

mobrovac moved this task from blocked to doing on the Services board.May 8 2017, 8:37 PM
mobrovac edited projects, added Services (doing); removed Services (blocked).

I've had a look and I don't think removing both /page/revision/{revision} and /page/title/{title} would actually be good. Conceptually, revisions are the basis of MW content, so it is good to have them exposed. Practically, I don't see how we could achieve things like signalling the visibility change of a revision without it being exposed. I would propose to (a) remove the listing end points; (b) remove the latest_rev, latest_tid and nextrev_tid fields from the response.

In order to get rid of the secondary index on this bucket, we can remove either /page/revision or /page/title, in this way removing the need for the secondary index. As things stand, it would be easier to remove /page/revision, which seems reasonable since we treat the title as a kind of PK throughout RESTBase. Thoughts?

GWicke added a comment.EditedJul 5 2017, 4:04 PM

We have been working on moving restriction handling to a separate "restrictions" table, but have kind of dropped the ball on this half way through. See T148592 for the remaining steps needed there.

Once that is wrapped up, the revision table won't be needed for restriction tracking any more. I believe that means that the only reason to keep the table at all (and keep updating it) would be to ensure timely responses for a public end point. Given the low to non-existent usage of those public end points, it might be simpler to either remove them altogether, or replace them with a simple proxy request to the PHP API.

GWicke added a comment.EditedJul 5 2017, 6:02 PM

Practically, I don't see how we could achieve things like signalling the visibility change of a revision without it being exposed.

Do you have a use case in mind which requires this in the REST API? The main use cases I can come up with would be better off following an EventStream feed to pick up on suppressions after they downloaded content, rather than probing the REST API or action API for changes.

Practically, I don't see how we could achieve things like signalling the visibility change of a revision without it being exposed.

Do you have a use case in mind which requires this in the REST API? The main use cases I can come up with would be better off following an EventStream feed to pick up on suppressions after they downloaded content, rather than probing the REST API or action API for changes.

Yes, the revision visibility change :) One can change the visibility of various parts of a revision, and without exposing the revision in any way in the REST API, I don't see how we can update that.

@mobrovac, revision visibility changes are rare events, so it would be a lot more efficient for clients to follow a feed, rather than polling for repeatedly for each revision they have downloaded before. Nobody seems to be doing this polling in practice.

Yes, the revision visibility change :) One can change the visibility of various parts of a revision, and without exposing the revision in any way in the REST API, I don't see how we can update that.

We can add /{title}/{revision} into the picture. That will allow to remove secondary indexing at least.

@mobrovac, revision visibility changes are rare events, so it would be a lot more efficient for clients to follow a feed, rather than polling for repeatedly for each revision they have downloaded before. Nobody seems to be doing this polling in practice.

Oh no no no no no... Let's stick with just one practice for everything please... If we start using EventStreams for some updates then we need to switch everything to it..

GWicke added a comment.EditedJul 7 2017, 5:36 PM

@mobrovac, revision visibility changes are rare events, so it would be a lot more efficient for clients to follow a feed, rather than polling for repeatedly for each revision they have downloaded before. Nobody seems to be doing this polling in practice.

Oh no no no no no... Let's stick with just one practice for everything please... If we start using EventStreams for some updates then we need to switch everything to it..

I was talking about third party users, which @mobrovac mentioned as potentially needing the public revision information to check up on changes in visibility. This is not about how we update things ourselves.

@mobrovac, revision visibility changes are rare events, so it would be a lot more efficient for clients to follow a feed, rather than polling for repeatedly for each revision they have downloaded before. Nobody seems to be doing this polling in practice.

Oh no no no no no... Let's stick with just one practice for everything please... If we start using EventStreams for some updates then we need to switch everything to it..

I was talking about third party users, which @mobrovac mentioned as potentially needing the public revision information to check up on changes in visibility. This is not about how we update things ourselves.

I don't understand your comment. This is how we update things ourselves (by having these end points exposed in RESTBase and used by CP). As for third-party, that was a conceptual thing: revisions are first citizens in the MW world and it only seems right to expose them through the REST API (arguably, the REST API is much better for such purposes than the action API). EventStream falls into a different category, as it's targeting clients that follow changes, rather than those that just need/want to look up information about a specific revision.

I vote for re-scoping this ticket to deprecate and remove /page/revision* routes. This would allow us to still follow the restrictions all the while allowing us to get rid of secondary indices (which is a de-facto blocker for fully switching to Cassandra 3). To that end, PR #877 introduces the /page/title/{title}/{revision} end point which is used by CP to update the restrictions.

Mentioned in SAL (#wikimedia-operations) [2017-09-27T08:33:47Z] <mobrovac@tin> Started deploy [restbase/deploy@1cc530b]: Introduce the /page/title/{title}/{revision} end point - T158100

Mentioned in SAL (#wikimedia-operations) [2017-09-27T08:43:48Z] <mobrovac@tin> Finished deploy [restbase/deploy@1cc530b]: Introduce the /page/title/{title}/{revision} end point - T158100 (duration: 10m 01s)

Change 380932 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Config: Use title/{title}/{revision} instead of revision/{revision}

https://gerrit.wikimedia.org/r/380932

Change 380932 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Config: Use title/{title}/{revision} instead of revision/{revision}

https://gerrit.wikimedia.org/r/380932

Mentioned in SAL (#wikimedia-operations) [2017-09-27T08:53:48Z] <mobrovac@tin> Started deploy [changeprop/deploy@7a3dc66]: Use the /page/title/{title}/{revision} end point for revision-visibility-change - T158100

Mentioned in SAL (#wikimedia-operations) [2017-09-27T08:55:14Z] <mobrovac@tin> Finished deploy [changeprop/deploy@7a3dc66]: Use the /page/title/{title}/{revision} end point for revision-visibility-change - T158100 (duration: 01m 26s)

Mentioned in SAL (#wikimedia-operations) [2017-10-17T08:39:27Z] <mobrovac@tin> Started deploy [restbase/deploy@3bcd1b7]: Remove /page/revision and add history metrics end points - T158100 T175805

Mentioned in SAL (#wikimedia-operations) [2017-10-17T08:40:31Z] <mobrovac@tin> Finished deploy [restbase/deploy@3bcd1b7]: Remove /page/revision and add history metrics end points - T158100 T175805 (duration: 01m 05s)

The public /revision hierarchy was removed as well as secondary indexing usage in Cassandra. deleting the secondary indexes does not really delete the tables, so the cleanup should be done manually. Here's the list of tables that should be removed:

"local_group_wikisource_T_title__revisions".idx_by_rev_ever
"local_group_wikipedia_T_title__revisions".idx_by_rev_ever
"local_group_wikinews_T_title__revisions".idx_by_rev_ever
"local_group_wikiversity_T_title__revisions".idx_by_rev_ever
"local_group_phase0_T_title__revisions".idx_by_rev_ever
"local_group_wikibooks_T_title__revisions".idx_by_rev_ever
"local_group_wikiquote_T_title__revisions".idx_by_rev_ever
"local_group_wikivoyage_T_title__revisions".idx_by_rev_ever
"local_group_wiktionary_T_title__revisions".idx_by_rev_ever
"local_group_wikimedia_T_title__revisions".idx_by_rev_ever
"local_group_default_T_title__revisions".idx_by_rev_ever

Verified with nodetool there's no activity on those tables.

The above-listed tables have been dropped. We still have the deprecation notice on the documentation for /title/, /title/{title}/ and /title/{title}{/revision}. I propose to remove entirely the first two end points and change the stability note of the last one to Stable in PR #897.

Mentioned in SAL (#wikimedia-operations) [2017-10-25T10:31:20Z] <mobrovac@tin> Started deploy [restbase/deploy@2ae5b0c]: Remove /title/{title}/ and double-process all summaries - T158100

Mentioned in SAL (#wikimedia-operations) [2017-10-25T10:41:01Z] <mobrovac@tin> Finished deploy [restbase/deploy@2ae5b0c]: Remove /title/{title}/ and double-process all summaries - T158100 (duration: 09m 41s)

mobrovac closed this task as Resolved.Oct 25 2017, 10:58 AM
mobrovac claimed this task.
mobrovac edited projects, added Services (done); removed Services (doing).

The /title/ and /title/{title}/ end points and the /page/revision hierarchy have been dropped. Resolving.