Page MenuHomePhabricator

Implement GET Edit Count
Closed, ResolvedPublic3 Estimated Story Points

Description

This task depends on successful completion of T231598 and T231600

Acceptance Criteria:

  • Define route History count routes
    • /page/{title}/history/counts/edits
  • Define route handler for each count type
  • Request:
    • Must support HTTP GET only
    • Request body must be empty
  • Response
    • Response must return JSON
    • Response JSON must have structures:
{
    "edits": 753
}
  • edits: total number of edits

Event Timeline

WDoranWMF renamed this task from Implement History Count Endpoints to Implement GET History Count Endpoints.Aug 29 2019, 6:22 PM
WDoranWMF set the point value for this task to 3.

Namespacing, and the details of how it will impact this endpoint, are being discussed in task T232485.

So, if we're just returning a single value, I'm not crazy about the object and the level of indirection. Can we just return a number, without the object wrapper?

Are you planning to have one endpoint that combines all the count types? Something like /page/{title}/history/counts that returns all the counts or /page/{title}/history/counts?type=revertededits&type=botedits&type=anonedits that returns selected count types

Are you planning to have one endpoint that combines all the count types?

Hi, Natalia. @Anomie pointed out that getting these counts is pretty hard on our infrastructure, so if we get them as separate HTTP calls, each call has lower impact on our servers, and we can spread them out across multiple servers. I realize it means you need to make 5 different calls; is that going to work for you?

Thanks for clarifying @eprodromou! Yes, if it makes it easier for you, we can make multiple calls

eprodromou renamed this task from Implement GET History Count Endpoints to Implement GET Edit Count.Oct 2 2019, 9:59 PM
eprodromou updated the task description. (Show Details)
eprodromou added a subscriber: tstarling.

@tstarling pointed to an OWASP recommendation to always wrap JSON output in objects, so I'm going to accept that we'll have it wrapped.

Also, I'm splitting up this task into five tasks, one for each user story.

Change 540519 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Add core REST API endpoints for history edit counts

https://gerrit.wikimedia.org/r/540519

@eprodromou , could you reconsider splitting the task up? Most of the code is the same between these endpoints - they differ only in the query - so I implemented them using a common base class.

@BPirkle the intention was to give you a chance to split them up with @Pchelolo . I'm glad to heard we've got a patch for this one!

Hi, Natalia. @Anomie pointed out that getting these counts is pretty hard on our infrastructure, so if we get them as separate HTTP calls, each call has lower impact on our servers, and we can spread them out across multiple servers. I realize it means you need to make 5 different calls; is that going to work for you?

Hm... I am not sure this justification would be valid.

  1. All the load here is on database, so it doesn't really matter if a big request is landed on a single app server. If we do see that running 5 individual queries is more efficient than 1 mega-query - we can still do that on the server.
  2. 5 requests will always be slower then 1 request. All the set-up work will add up.
  3. If we every to long-term cache the results and use purging, we'd need 5 purges instead of 1

I am NOT advocating that merging these into 1 will necessarily be faster, but I have a feeling it will. It's certainly better from the API perspective.

@Anomie , discussion arose during our Green Team checking meeting today about 5 endpoints vs 1 endpoint. I know you are tasked with unrelated things right now, but if you have time to give an opinion it would be very helpful.

It appears that if we implement 5 queries, the IOS team will simply call all 5 of them together. So implementing 5 separate endpoints means the overhead of 5 calls, and potentially 5 separate cache entries to store/purge. Therefore implementing this as 1 endpoint would be more convenient for the IOS team, and appears to have some benefits. But it may have some drawbacks, such as holding a single process open longer.

It appeared in T231598: Compose Count Queries that you had concerns about one mega-query for all the results. An alternative "single endpoint" implementation would be to make 5 separate queries in one endpoint and then return all the data.

So the questions are:

  • do we correctly understand that you are not in favor of one "mega query" to supply the data for all 5 counts?
  • do you have an opinion on 5 separate endpoints vs 1 endpoint that does 5 separate queries?

Of course, things other than the IOS app may eventually use this data, so making a decision only on its behavior may be short sighted. But it will certainly be the initial consumer.

I have to wonder why the IOS team needs all 5 numbers on every page. Then again, I also have to wonder why they need some of these numbers at all.

If there will be clients that only need a subset of these numbers, it would be a win for them to not have to get the numbers they don't want. Particularly if they don't want the count of editors or count of anon edits.

A potential disadvantage to 5 endpoints is that it might encourage them to call all 5 in parallel rather than in series, making for a larger load spike.

I'm not going to try to make a decision between the options, I'll just provide the numbers.

I re-ran the comparison again. All were run with presumably-warm caches.

  • enwiki ANI:
    • For the individual queries, edits took 0.29 seconds, editors 6.83 (4.51 without STRAIGHT_JOIN), bots 0.05, anons 6.86 (14.86 without STRAIGHT_JOIN), reverts 2.21, total 16.24 seconds. Handler_read_key=3363179, Handler_read_next=4447423.
    • The combined query (subquery version) took 23.89 seconds. Handler_read_key=4172140, Handler_read_next=1127187, Handler_read_rnd_next=3254423, Handler_tmp_write=3254442.
  • enwiki Barack Obama:
    • For the individual queries, edits took 0.01 seconds, editors 0.20 (0.15 without STRAIGHT_JOIN), bots 0.00, anons 0.24 (0.23 without STRAIGHT_JOIN), reverts 0.11, total 0.56 seconds. Handler_read_key=90346, Handler_read_next=111075.
    • The combined query (subquery version) took 0.68 seconds. Handler_read_key=101161, Handler_read_next=28170, Handler_read_rnd_next=54684, Handler_tmp_write=54685.

So the mega-query still seems more expensive, at least on these two pages, unless you're doing the ANI query for anons without the STRAIGHT_JOIN.

I have to wonder why the IOS team needs all 5 numbers on every page. Then again, I also have to wonder why they need some of these numbers at all.

I think the answer to this is on the design mockup at T228783 - it seems to show all the numbers on the same screen, so all will be requested, and most likely all will be requested in parallel.

I have to wonder why the IOS team needs all 5 numbers on every page. Then again, I also have to wonder why they need some of these numbers at all.

Let's take it as a given that our colleagues are competent and would not waste our development time or our server resources asking for features that users don't need.

If there will be clients that only need a subset of these numbers, it would be a win for them to not have to get the numbers they don't want. Particularly if they don't want the count of editors or count of anon edits.

I agree on this. Also, it's possible we will add other counts in the future based on new needs, and bundling each one into the same endpoint doesn't seem like the best use of resources.

Given that 1) we have designs already for doing counts at different endpoints 2) iOS team has accepted this design 3) we may add more counts in the future and 4) some clients might not use all counts, I think that the right way to do it is with separate endpoints.

@Pchelolo let me know if you can't live with that.

Let's take it as a given that our colleagues are competent and would not waste our development time or our server resources asking for features that users don't need.

I've found that sometimes people can decide on a solution for their problem and ask for that solution, when in practice a different solution would actually serve their needs better or would serve other needs beyond their own. It's also possible they may not have correctly estimated the costs when making the request.[1] Thus, I tend to ask.

@Pchelolo's link to T228783 was more informative, as it actually does answer my semi-rhetorical question instead of criticizing me for asking. With that design, yes, they'd clearly need all the numbers for every page-history viewed. I'll leave to others deciding whether the discovered costs are enough to ask them if they want to reevaluate that design.

BTW, in looking at the design mock-up currently in the description of that task I note that it does not seem to be using the counts of total actors or of reverted edits since the change in T228783#5439449, while it does seem to want to know the count of "user edits" (meaning non-IP?) and the count of revisions with rev_minor_edit set. The "user edits" could be calculated as total edits - IP edits,[2] but the count of minor edits would need a new endpoint.

[1]: Personally, my own initial guess was that these would be significantly more costly than they turned out to be.
[2]: Technically wrong in that revision-deleted edits can't be put into either bucket, but probably close enough in practice. For that matter, the number we're returning for "anonymous edits" is also including imported edits that are attributed to an interwiki username, again technically wrong but in practice maybe ok.

BTW, in looking at the design mock-up currently in the description of that task I note that it does not seem to be using the counts of total actors or of reverted edits since the change in T228783#5439449, while it does seem to want to know the count of "user edits" (meaning non-IP?) and the count of revisions with rev_minor_edit set. The "user edits" could be calculated as total edits - IP edits,[2] but the count of minor edits would need a new endpoint.

We thought getting reverted edits would be more difficult than getting minor edits so we updated the design to include minor edits. We'll be happy with either.

We thought getting reverted edits would be more difficult than getting minor edits so we updated the design to include minor edits. We'll be happy with either.

I posted the query for minor edits at T231598#5560533.

Defining a "reverted edit" is certainly harder, but performance-wise it seems about the same for the two sample pages I've been testing.

@Pchelolo's link to T228783 was more informative, as it actually does answer my semi-rhetorical question instead of criticizing me for asking.

I apologize. I didn't mean it as a criticism. I know you're trying to make sure we do meaningful and useful work. I understand that it makes sense to question assumptions.

We thought getting reverted edits would be more difficult than getting minor edits so we updated the design to include minor edits. We'll be happy with either.

There's a separate user story for minor edit count; it's at T234941. There's not yet any engineering tasks for it. I think we could work on it in our next sprint, if we can get the query approval from DBA.

Change 540519 merged by jenkins-bot:
[mediawiki/core@master] Add core REST API endpoint for history edit counts

https://gerrit.wikimedia.org/r/540519

There's an inconsistency between the count and listing endpoints. that I feel like we could followup on really quick:

For history listing API we have 3 filters:

[ 'anonymous', 'bot', 'reverted' ]

For history counts API we have options:

[ 'anonedits', 'botedits', 'revertededits', 'edits', 'editors' ]

The first 3 have the same semantics, 'edits' is equivalent to the history list with no filter and 'editors' is a little different. Given that we might be adding more filters in the future, this will become a mess.

At least one thing that IMHO must be done is to bring 'anonymous' and 'anon' into either one of those in both APIs. It should be either filter=anonymous and anonymousedits or filter=anon and anonedits.

If we wanted to go further, we could make the APIs even more consistent and have something like /page/Bla/history/count/edits?filter=anonymous and page/Bla/history/count/editors - this will be much more discoverable and predictable in my opinion. Also we could connect the listing and counting APIs together with links. Maybe consider this for next version or squeeze it in while we still have time before the train.

What do you think about this idea in general @eprodromou

I like consistency. Any renaming we choose to do would be safe, fast, and would not threaten our deadline.

"unregistered" might be slightly more accurate than "anon" (or "IP"). But I'll leave the decision to you, I don't want to bikeshed.

The revisions being returned for this filter include both edits by IP users and imported edits that weren't assigned to a local account.

@Pchelolo I like the idea of making the count endpoints match the filters, and I appreciate the attention to detail.

But it's late on Friday before the weekend before the week this stuff is due.

Either I throw some late requirements changes at @BPirkle and @apaskulin, which kind of sucks to do, or I tell the iOS team they have to wait while we make yet another round of changes, which also kind of sucks.

We'll keep the endpoints as they are. We can talk about adding synonyms in the future if we need to (and possibly deprecating the old names).

We'll keep the endpoints as they are.

As I wrote

Maybe consider this for next version

so I'm ok with any outcome here.

Perhaps we could start some kind of 'future improvements' ticket, where we would through inconsistencies like this. Once we accumulate a bunch, we can make a round of improvements. A time before we qualify API as stable will be a good time.

Reviewed this on beta cluster. Seems good.