⚓ T231590 Implement GET Edit Count

	Subject	Repo	Branch	Lines +/-
	Add core REST API endpoint for history edit counts	mediawiki/core	master	+348 -2

Status	Assigned	Task
Invalid	None	T229661 Core REST API in MediaWiki
Resolved	• eprodromou	T231338 Page history API
Resolved	• eprodromou	T231348 Curator gets edit count
Resolved	BPirkle	T231590 Implement GET Edit Count
Resolved	• eprodromou	T231598 Compose Count Queries
Resolved	• eprodromou	T231600 Consult with DBAs on required Endpoint queries

WDoranWMF created this task.Aug 29 2019, 5:04 PM

WDoranWMF added parent tasks: T231348: Curator gets edit count, T231354: Curator gets bot edit count, T231352: Curator gets anonymous edit count, T231349: Curator gets editors count.Aug 29 2019, 5:07 PM

WDoranWMF moved this task from Backlog to User Stories Ready for Engineering Tasks on the Platform Team Workboards (User Stories) board.

WDoranWMF moved this task from User Stories Ready for Engineering Tasks to Engineering Tasks being created on the Platform Team Workboards (User Stories) board.Aug 29 2019, 5:51 PM

WDoranWMF renamed this task from Implement History Count Endpoints to Implement GET History Count Endpoints.Aug 29 2019, 6:22 PM

WDoranWMF mentioned this in T231598: Compose Count Queries.Aug 29 2019, 6:59 PM

WDoranWMF updated the task description. (Show Details)

WDoranWMF updated the task description. (Show Details)Aug 29 2019, 7:09 PM

WDoranWMF triaged this task as Medium priority.Sep 3 2019, 3:04 PM

WDoranWMF edited projects, added Platform Team Workboards (Green), Platform Team Initiatives (MW REST API in PHP); removed Platform Team Workboards (User Stories).

WDoranWMF moved this task from Backlog to Tasks Ready for Estimation on the Platform Team Workboards (Green) board.Sep 3 2019, 5:06 PM

WDoranWMF removed • eprodromou as the assignee of this task.Sep 3 2019, 8:57 PM

WDoranWMF updated the task description. (Show Details)Sep 4 2019, 12:43 PM

WDoranWMF updated the task description. (Show Details)Sep 4 2019, 12:52 PM

WDoranWMF set the point value for this task to 3.

WDoranWMF moved this task from Tasks Ready for Estimation to Ready on the Platform Team Workboards (Green) board.Sep 4 2019, 1:09 PM

BPirkle mentioned this in T232355: Implement Core REST API routes and handler stubs.Sep 9 2019, 3:36 PM

Namespacing, and the details of how it will impact this endpoint, are being discussed in task T232485.

• NHarateh_WMF subscribed.Sep 19 2019, 4:44 PM

So, if we're just returning a single value, I'm not crazy about the object and the level of indirection. Can we just return a number, without the object wrapper?

Are you planning to have one endpoint that combines all the count types? Something like /page/{title}/history/counts that returns all the counts or /page/{title}/history/counts?type=revertededits&type=botedits&type=anonedits that returns selected count types

BPirkle claimed this task.Oct 1 2019, 5:17 PM

BPirkle moved this task from Ready to Doing on the Platform Team Workboards (Green) board.

In T231590#5535950, @NHarateh_WMF wrote:

Are you planning to have one endpoint that combines all the count types?

Hi, Natalia. @Anomie pointed out that getting these counts is pretty hard on our infrastructure, so if we get them as separate HTTP calls, each call has lower impact on our servers, and we can spread them out across multiple servers. I realize it means you need to make 5 different calls; is that going to work for you?

Thanks for clarifying @eprodromou! Yes, if it makes it easier for you, we can make multiple calls

• eprodromou updated the task description. (Show Details)Oct 2 2019, 9:54 PM

@tstarling pointed to an OWASP recommendation to always wrap JSON output in objects, so I'm going to accept that we'll have it wrapped.

Also, I'm splitting up this task into five tasks, one for each user story.

Change 540519 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Add core REST API endpoints for history edit counts

https://gerrit.wikimedia.org/r/540519

gerritbot added a project: Patch-For-Review.Oct 3 2019, 3:22 AM

@eprodromou , could you reconsider splitting the task up? Most of the code is the same between these endpoints - they differ only in the query - so I implemented them using a common base class.

Anomie mentioned this in T234560: identify a reliable user field for use with the revision, page...maybe also the actor table.Oct 3 2019, 7:47 PM

Iflorez subscribed.Oct 3 2019, 8:45 PM

@BPirkle the intention was to give you a chance to split them up with @Pchelolo . I'm glad to heard we've got a patch for this one!

Hi, Natalia. @Anomie pointed out that getting these counts is pretty hard on our infrastructure, so if we get them as separate HTTP calls, each call has lower impact on our servers, and we can spread them out across multiple servers. I realize it means you need to make 5 different calls; is that going to work for you?

Hm... I am not sure this justification would be valid.

All the load here is on database, so it doesn't really matter if a big request is landed on a single app server. If we do see that running 5 individual queries is more efficient than 1 mega-query - we can still do that on the server.
5 requests will always be slower then 1 request. All the set-up work will add up.
If we every to long-term cache the results and use purging, we'd need 5 purges instead of 1

I am NOT advocating that merging these into 1 will necessarily be faster, but I have a feeling it will. It's certainly better from the API perspective.

@Anomie , discussion arose during our Green Team checking meeting today about 5 endpoints vs 1 endpoint. I know you are tasked with unrelated things right now, but if you have time to give an opinion it would be very helpful.

It appears that if we implement 5 queries, the IOS team will simply call all 5 of them together. So implementing 5 separate endpoints means the overhead of 5 calls, and potentially 5 separate cache entries to store/purge. Therefore implementing this as 1 endpoint would be more convenient for the IOS team, and appears to have some benefits. But it may have some drawbacks, such as holding a single process open longer.

It appeared in T231598: Compose Count Queries that you had concerns about one mega-query for all the results. An alternative "single endpoint" implementation would be to make 5 separate queries in one endpoint and then return all the data.

So the questions are:

do we correctly understand that you are not in favor of one "mega query" to supply the data for all 5 counts?
do you have an opinion on 5 separate endpoints vs 1 endpoint that does 5 separate queries?

Of course, things other than the IOS app may eventually use this data, so making a decision only on its behavior may be short sighted. But it will certainly be the initial consumer.

I have to wonder why the IOS team needs all 5 numbers on every page. Then again, I also have to wonder why they need some of these numbers at all.

If there will be clients that only need a subset of these numbers, it would be a win for them to not have to get the numbers they don't want. Particularly if they don't want the count of editors or count of anon edits.

A potential disadvantage to 5 endpoints is that it might encourage them to call all 5 in parallel rather than in series, making for a larger load spike.

I'm not going to try to make a decision between the options, I'll just provide the numbers.

I re-ran the comparison again. All were run with presumably-warm caches.

enwiki ANI:
- For the individual queries, edits took 0.29 seconds, editors 6.83 (4.51 without STRAIGHT_JOIN), bots 0.05, anons 6.86 (14.86 without STRAIGHT_JOIN), reverts 2.21, total 16.24 seconds. Handler_read_key=3363179, Handler_read_next=4447423.
- The combined query (subquery version) took 23.89 seconds. Handler_read_key=4172140, Handler_read_next=1127187, Handler_read_rnd_next=3254423, Handler_tmp_write=3254442.
enwiki Barack Obama:
- For the individual queries, edits took 0.01 seconds, editors 0.20 (0.15 without STRAIGHT_JOIN), bots 0.00, anons 0.24 (0.23 without STRAIGHT_JOIN), reverts 0.11, total 0.56 seconds. Handler_read_key=90346, Handler_read_next=111075.
- The combined query (subquery version) took 0.68 seconds. Handler_read_key=101161, Handler_read_next=28170, Handler_read_rnd_next=54684, Handler_tmp_write=54685.

So the mega-query still seems more expensive, at least on these two pages, unless you're doing the ANI query for anons without the STRAIGHT_JOIN.

I have to wonder why the IOS team needs all 5 numbers on every page. Then again, I also have to wonder why they need some of these numbers at all.

I think the answer to this is on the design mockup at T228783 - it seems to show all the numbers on the same screen, so all will be requested, and most likely all will be requested in parallel.

In T231590#5557632, @Anomie wrote:

I have to wonder why the IOS team needs all 5 numbers on every page. Then again, I also have to wonder why they need some of these numbers at all.

Let's take it as a given that our colleagues are competent and would not waste our development time or our server resources asking for features that users don't need.

If there will be clients that only need a subset of these numbers, it would be a win for them to not have to get the numbers they don't want. Particularly if they don't want the count of editors or count of anon edits.

I agree on this. Also, it's possible we will add other counts in the future based on new needs, and bundling each one into the same endpoint doesn't seem like the best use of resources.

Given that 1) we have designs already for doing counts at different endpoints 2) iOS team has accepted this design 3) we may add more counts in the future and 4) some clients might not use all counts, I think that the right way to do it is with separate endpoints.

@Pchelolo let me know if you can't live with that.

In T231590#5557686, @eprodromou wrote:

Let's take it as a given that our colleagues are competent and would not waste our development time or our server resources asking for features that users don't need.

I've found that sometimes people can decide on a solution for their problem and ask for that solution, when in practice a different solution would actually serve their needs better or would serve other needs beyond their own. It's also possible they may not have correctly estimated the costs when making the request.[1] Thus, I tend to ask.

@Pchelolo's link to T228783 was more informative, as it actually does answer my semi-rhetorical question instead of criticizing me for asking. With that design, yes, they'd clearly need all the numbers for every page-history viewed. I'll leave to others deciding whether the discovered costs are enough to ask them if they want to reevaluate that design.

BTW, in looking at the design mock-up currently in the description of that task I note that it does not seem to be using the counts of total actors or of reverted edits since the change in T228783#5439449, while it does seem to want to know the count of "user edits" (meaning non-IP?) and the count of revisions with rev_minor_edit set. The "user edits" could be calculated as total edits - IP edits,[2] but the count of minor edits would need a new endpoint.

[1]: Personally, my own initial guess was that these would be significantly more costly than they turned out to be.
[2]: Technically wrong in that revision-deleted edits can't be put into either bucket, but probably close enough in practice. For that matter, the number we're returning for "anonymous edits" is also including imported edits that are attributed to an interwiki username, again technically wrong but in practice maybe ok.

BTW, in looking at the design mock-up currently in the description of that task I note that it does not seem to be using the counts of total actors or of reverted edits since the change in T228783#5439449, while it does seem to want to know the count of "user edits" (meaning non-IP?) and the count of revisions with rev_minor_edit set. The "user edits" could be calculated as total edits - IP edits,[2] but the count of minor edits would need a new endpoint.

We thought getting reverted edits would be more difficult than getting minor edits so we updated the design to include minor edits. We'll be happy with either.

In T231590#5560492, @NHarateh_WMF wrote:

We thought getting reverted edits would be more difficult than getting minor edits so we updated the design to include minor edits. We'll be happy with either.

I posted the query for minor edits at T231598#5560533.

Defining a "reverted edit" is certainly harder, but performance-wise it seems about the same for the two sample pages I've been testing.

In T231590#5560422, @Anomie wrote:

@Pchelolo's link to T228783 was more informative, as it actually does answer my semi-rhetorical question instead of criticizing me for asking.

I apologize. I didn't mean it as a criticism. I know you're trying to make sure we do meaningful and useful work. I understand that it makes sense to question assumptions.

In T231590#5560492, @NHarateh_WMF wrote:

We thought getting reverted edits would be more difficult than getting minor edits so we updated the design to include minor edits. We'll be happy with either.

There's a separate user story for minor edit count; it's at T234941. There's not yet any engineering tasks for it. I think we could work on it in our next sprint, if we can get the query approval from DBA.

Change 540519 merged by jenkins-bot:
[mediawiki/core@master] Add core REST API endpoint for history edit counts

https://gerrit.wikimedia.org/r/540519

• NHarateh_WMF awarded a token.Oct 10 2019, 7:37 PM

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.2; 2019-10-15).Oct 10 2019, 8:00 PM

Maintenance_bot removed a project: Patch-For-Review.Oct 10 2019, 8:11 PM

There's an inconsistency between the count and listing endpoints. that I feel like we could followup on really quick:

For history listing API we have 3 filters:

[ 'anonymous', 'bot', 'reverted' ]

For history counts API we have options:

[ 'anonedits', 'botedits', 'revertededits', 'edits', 'editors' ]

The first 3 have the same semantics, 'edits' is equivalent to the history list with no filter and 'editors' is a little different. Given that we might be adding more filters in the future, this will become a mess.

At least one thing that IMHO must be done is to bring 'anonymous' and 'anon' into either one of those in both APIs. It should be either filter=anonymous and anonymousedits or filter=anon and anonedits.

If we wanted to go further, we could make the APIs even more consistent and have something like /page/Bla/history/count/edits?filter=anonymous and page/Bla/history/count/editors - this will be much more discoverable and predictable in my opinion. Also we could connect the listing and counting APIs together with links. Maybe consider this for next version or squeeze it in while we still have time before the train.

What do you think about this idea in general @eprodromou

BPirkle mentioned this in T234488: Implement bot edit count endpoint.Oct 11 2019, 3:30 AM

BPirkle mentioned this in T234487: Implement anonymous edit count endpoint.

BPirkle mentioned this in T234486: Implement reverted edits count endpoint.

BPirkle mentioned this in T234485: Implement Editors Count Endpoint.

I like consistency. Any renaming we choose to do would be safe, fast, and would not threaten our deadline.

"unregistered" might be slightly more accurate than "anon" (or "IP"). But I'll leave the decision to you, I don't want to bikeshed.

The revisions being returned for this filter include both edits by IP users and imported edits that weren't assigned to a local account.

• Pchelolo moved this task from Doing to Done on the Platform Team Workboards (Green) board.Oct 11 2019, 5:49 PM

@Pchelolo I like the idea of making the count endpoints match the filters, and I appreciate the attention to detail.

But it's late on Friday before the weekend before the week this stuff is due.

Either I throw some late requirements changes at @BPirkle and @apaskulin, which kind of sucks to do, or I tell the iOS team they have to wait while we make yet another round of changes, which also kind of sucks.

We'll keep the endpoints as they are. We can talk about adding synonyms in the future if we need to (and possibly deprecating the old names).

We'll keep the endpoints as they are.

As I wrote

Maybe consider this for next version

so I'm ok with any outcome here.

Perhaps we could start some kind of 'future improvements' ticket, where we would through inconsistencies like this. Once we accumulate a bunch, we can make a round of improvements. A time before we qualify API as stable will be a good time.

• eprodromou mentioned this in T235525: Make counts and history arguments consistent.Oct 15 2019, 3:14 PM

WDoranWMF mentioned this in T235564: Implement Integration Tests for Existing iOS Endpoints.Oct 15 2019, 9:28 PM

BPirkle mentioned this in T235572: Compose query for minor edit count.Oct 15 2019, 10:44 PM

• Pchelolo mentioned this in T234940: Curator gets number of intermediate revisions between two revisions.Oct 16 2019, 4:50 PM

Reviewed this on beta cluster. Seems good.

• eprodromou mentioned this in T236323: Deprecate "revertededits" in favour of "reverted".Oct 23 2019, 8:37 PM

• eprodromou mentioned this in T236324: Deprecate `anonedits` in favour of `anonymous`.Oct 23 2019, 8:39 PM

• eprodromou mentioned this in T236325: Deprecate `botedits` in favor of `bot`.

• eprodromou closed subtask T231598: Compose Count Queries as Resolved.Oct 29 2019, 4:23 PM

• eprodromou closed subtask T231600: Consult with DBAs on required Endpoint queries as Resolved.

Aklapper removed subscribers: Anomie, • NHarateh_WMF.Oct 16 2020, 5:41 PM