Page MenuHomePhabricator

Curator gets number of intermediate revisions between two revisions
Closed, ResolvedPublic

Description

"As a Curator, I want to know how many revisions were made between two revisions I am comparing, so I can know how much work went into the differences I'm reviewing."

When displaying a diff between two revisions, our clients also show how many revisions happened between the two revisions. (See https://en.wikipedia.org/w/index.php?title=France&type=revision&diff=920186445&oldid=914890923 for an example, "(8 intermediate revisions by 7 users not shown)").

For "intermediate revisions", I think we are talking about non-deleted revisions, that are strictly between the two compared revisions, non-inclusive. So with a revision history with IDs in reverse chron order 50, 40, 30, 20, 10:

  • the number of intermediate revisions between ID 50 and 10 is 3
  • the number of intermediate revisions between 50 and 40 is 0
  • the number of intermediate revisions between 20 and 50 is 2
  • the number of intermediate revisions between 30 and 30 is an error

GET /page/{title}/history/edits?from=<from>&to=<to>

Return the number of intermediate revisions between revision with id from and revision with id to, non-inclusive

Request body: none

Notable request headers: none

Status codes:

  • 200: body is the intermediate revision count
  • 404: revision with id from and/or revision with id to do not exist
  • 403: revision with id from and/or revision with id to is unauthorized, or page that these are revisions of is unauthorized
  • 400: client error, such as if from and to are the same ID, or if from and to are not revisions of the same page

Notable response headers: none

Response body: JSON object with the following properties

  • count: integer of 0 or more representing count of non-inclusive non-deleted intermediate revisions between revision with id from and revision with id to (see above for examples)

Event Timeline

My inclination is that we should make this a separate endpoint for these reasons:

  • Other clients might not need this number when they ask for a diff
  • Other clients might need this number and not a diff
  • It's not directly related to the diff functionality, except that the same number is often shown on official clients

@Tsevener and @NHarateh_WMF it'd be great to get your input.

I believe that we need to copy the errors from compare endpoint in case of revisions do not belong to the same page -> return 400

What happens if the user doesn't have the right to read the page the revisions belong to? 403?

How is reverted revisions handled?

eprodromou updated the task description. (Show Details)

I believe that we need to copy the errors from compare endpoint in case of revisions do not belong to the same page -> return 400

What happens if the user doesn't have the right to read the page the revisions belong to? 403?

How is reverted revisions handled?

Counted just like any other revision.

GET /revision/{from}/intermediatecount/{to}

  • Do we envision expanding this endpoint to cover filters like T231590 in the future?
  • The /page/history/{title}/counts/edits is a special case of this endpoint as you can get the same information as /revision/<first_page_rev>/intermediatecount/<last_page_revision>. Given that the information provided by the endpoints is so similar that they even intersect, why are the endpoints named so differently and even belong to a different 'namespace'? one is 'page', other is 'revision'.

Implementation-wise (TO BE PROVEN/DISPROVEN) modifying the queries from T231590 to have constraints on rev_timestamp should not degrade the performance of those queries dramatically (again, this is a theory with no proof yet). So, why don't we merge these endpoints?

If I were to design these endpoints from scratch, I'd do something like:

/page/history/{title}/counts/edits?filter=anon - for the counts endpoint
and
/page/history/{title}/counts/edits/from/{rev_id}/to{rev_id}?filter=anon - for the intermediate count endpoint.

I'm not advocating for going all in, but at least the inconsistencies in naming should be addressed in my opinion.

GET /revision/{from}/intermediatecount/{to}

  • Do we envision expanding this endpoint to cover filters like T231590 in the future?

I don't think so, but I will confirm.

  • The /page/history/{title}/counts/edits is a special case of this endpoint as you can get the same information as /revision/<first_page_rev>/intermediatecount/<last_page_revision>. Given that the information provided by the endpoints is so similar that they even intersect, why are the endpoints named so differently and even belong to a different 'namespace'? one is 'page', other is 'revision'.

They are similar in implementation but different in interface. I think the distance to another revision is a property of the revision and not of the page. The total number of edits is a property of the page, though.

You're correct, the total number of edits could be retrieved by getting the distance between the first revision and the last revision and adding two. If you knew the IDs of the first and last revision, you could use this method.

If you didn't know the first and last revision ID, you'd have to scroll through the whole history to get the first and last ID, at which point you could probably count all the intermediate ones yourself.

I think we need to have interfaces that reflect what developers actually want, even if it's possible to get that same information some other way. If someone wants the distance between two revisions, let's have an endpoint that gives them the distance between two revisions.

If you didn't know the first and last revision ID, you'd have to scroll through the whole history to get the first and last ID, at which point you could probably count all the intermediate ones yourself.

Not really, we can just omit them and that would mean first to last as in my example. Or in future we can discuss if we want to have special first and last keywords that would generally be accepted in a revision id input. This however would work less well since we already decided not to do a hierarchical access to things with a required title parameter everywhere.

I think we need to have interfaces that reflect what developers actually want, even if it's possible to get that same information some other way.

That is true, but we also need to at least try to anticipate the future needs and asks, and about how all of this fits together, so the we create a comprehensive and consistent API and not a collection of one-offs.

What should happen if revision from is newer then revision to?

the number of intermediate revisions between 50 and 40 is 0

Is this an example of what happens or is this a typo?

What should happen if revision from is newer then revision to?

Always a positive value, regardless of which is newer.

the number of intermediate revisions between 50 and 40 is 0

Is this an example of what happens or is this a typo?

That is an example. It's a non-inclusive count, so if the two revisions are right after each other, the intermediate count is zero.

So, @Pchelolo can confirm, I missed a requirement that came up during our call to get the number of unique contributors as well as the number of unique edits between two revisions.

I think once that's clear, his design for making the requests as variations of the other counts becomes more clear.

I'm going to change this endpoint to:

/page/history/{title}/counts/edits?from=<from_id>&to=<to_id>

And I'll add one like:

/page/history/{title}/counts/editors?from=<from_id>&to=<to_id>

@Pchelolo let's discuss during the checkin call today.

Why are we adding from and to as query parameters? It would be a good practice to have parameters in the path as much as possible, cause it's making the paths more predictable - thus better for frontend caching.

Why are we adding from and to as query parameters? It would be a good practice to have parameters in the path as much as possible, cause it's making the paths more predictable - thus better for frontend caching.

I like query parameters for variations on an endpoint. Using the query parameters shows that this is a modification of the /page/history/{title}/counts/edits endpoint, and not a separate endpoint on its own.

Is your concern that client developers might put the parameters in different order?

Is your concern that client developers might put the parameters in different order?

Exactly. Clients might not even have control over how the query parameters in the request get ordered. It might be different for different browsers and different implementations of request libraries, but any reordering would split the caches. So, the less query parameters - the better.

To be honest, stylistically, I like the query parameters more as well, but we have to be aware of the consequences of the decision to use them.

@Pchelolo I feel like this is a pretty common issue. In general, I'd like to optimize for more friendly endpoints at the expense of caching -- especially if we haven't yet added in the caching headers or object caching in the endpoints.

Can you live with query params "for now" and we move to fixed params in the path if we find it's causing cache problems?

Can you live with query params "for now" and we move to fixed params in the path if we find it's causing cache problems?

In this case I think the answer will be yes. Thinking about it again, these endpoints would unlikely be used much, and would unlikely be very cache-friendly anyway, so it's fine.

My inclination is that we should make this a separate endpoint for these reasons:

  • Other clients might not need this number when they ask for a diff
  • Other clients might need this number and not a diff
  • It's not directly related to the diff functionality, except that the same number is often shown on official clients

@Tsevener and @NHarateh_WMF it'd be great to get your input.

Sorry for the delay on this feedback but a separate endpoint sounds good - we only show the label on one variation of the diff screen anyways so would be a waste to calculate it for every compare call.