Investigate making a mobileapps wrapper for revision history batching
Closed, Resolved · Public · Spike

Description

Is a mobileapps / Node.js wrapper endpoint that batches history into significant changes feasible? Some thoughts/questions we seek to answer are:

  1. Can we grab the first page of revision history and batch it into major changes? (Experiment first with a simple added/deleted byte-size threshold.)
  2. Can we save that result in cache somewhere (associated with the latest revision it's generated on), to be appended to / modified as a new revision ID comes in?
  3. Is it possible to integrate alternative types (e.g. new talk page discussion, vandalism revert, etc.)?

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Jun 15 2020, 6:17 PM

Note this spike may need to come after or be investigated alongside a Spike ticket for these related notes from @JMinor:

Let's define what specific recent changes we need as it seems like there are "types"
Can we build 1 of these by hand first to see if it's workable?
What specific change types and filtering heuristics will we use to build the living doc timeline?

LGoto triaged this task as Medium priority. Jun 15 2020, 6:37 PM
LGoto removed a project: Epic.
LGoto moved this task from Needs Triage to Engineering Backlog on the Wikipedia-iOS-App-Backlog board.
This comment was removed by cmadeo.

Change 607802 had a related patch set uploaded (by Tsevener; owner: Tsevener):
[mediawiki/services/mobileapps@master] Add history batching route

https://gerrit.wikimedia.org/r/607802

I'd be interested to know what is needed in addition to already existing APIs, like the page history REST API or the API:Revisions Action API. What's the intended benefit of wrapping the other APIs?

@bearND The data I'm going for is in the "Recent changes view" mock shown in https://phabricator.wikimedia.org/T241253. I figure the basis of it will be one of those revision/history APIs you linked (with smaller changes detected, grouped and counted), but for each large change there's a snippet of the changed text itself. The mock also flags a revision that reverted vandalism and a revision that added a reference, and intersperses new talk page discussions. I think showing a snippet of the change will require pulling the revision diff, and interspersing talk page discussions would be a separate call to the revision API for the talk page. Those two things seemed like enough additional work that it would be nicer to wrap them up on the mobileapps end than to deal with the extra logic client-side.

I assume hitting the diff endpoint for every single revision would be a no-go, so my hope is that we find a balance where we hit diff on only *some* revisions. That, plus keeping the paging size small, should hopefully keep load time and server load from getting out of hand. Please feel free to correct me if I'm wrong. It's worth noting that this endpoint would only be called in an EN Wiki experiment variant, shown on a set number of articles (my guess would be in the range of 10-20 of the more popular articles; @JMinor will have more accurate numbers).

@JMinor @cmadeo I have a prototype endpoint propped up in a labs environment. Feel free to explore it. I'm hoping it's a shortcut to manually building one out to see what we ultimately want out of this. It's very slapdash so it might break on you, but you should be able to find some titles that don't break (like United_States, currently). You can access it like this:

First page: https://apps2.wmflabs.org/en.wikipedia.org/v1/page/history-batch/United_States
Subsequent pages: add the nextRvStartId value from the previous page as the query item rvstartid, e.g. https://apps2.wmflabs.org/en.wikipedia.org/v1/page/history-batch/United_States?rvstartid=964157912
The threshold separating small from large changes is 100 bytes, but you can adjust it with the threshold query item, e.g. https://apps2.wmflabs.org/en.wikipedia.org/v1/page/history-batch/United_States?threshold=50 or https://apps2.wmflabs.org/en.wikipedia.org/v1/page/history-batch/United_States?rvstartid=964157912&threshold=50
It analyzes 20 revisions per page; this is not configurable.
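
For anyone scripting against the prototype, here is a minimal Python sketch of how the request URLs above are put together. The helper name history_batch_url is hypothetical; only the path and the rvstartid and threshold query items come from the notes above, and the labs host may not stay up.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Base URL of the labs prototype described above (may not stay live).
BASE = "https://apps2.wmflabs.org/en.wikipedia.org/v1/page/history-batch"


def history_batch_url(title: str,
                      rvstartid: Optional[int] = None,
                      threshold: Optional[int] = None) -> str:
    """Build a request URL; omitted parameters fall back to server defaults."""
    params = {}
    if rvstartid is not None:
        # Value of nextRvStartId taken from the previous page's response.
        params["rvstartid"] = rvstartid
    if threshold is not None:
        # Byte threshold separating small from large changes.
        params["threshold"] = threshold
    url = f"{BASE}/{quote(title, safe='')}"
    return f"{url}?{urlencode(params)}" if params else url
```

Paging would then amount to reading nextRvStartId from each response and passing it back in as rvstartid until no further value is returned (an assumption about how the last page is signalled).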

I tried to make the response as verbose as possible so you could tinker with it and get a feel for what you'd get back when you adjust the threshold, but let me know if you are confused by any of it.

Here's what I have in this and some spike findings worth discussing:

  1. Small changes are chunked together and counted, large changes are shown. We determine this with byte size, not characters. I'm not aware of a way to get character differences from the revisions APIs. Unfortunately this means an editor could have deleted a lot and added a lot, but it wouldn't be detected as a significant change if the ultimate byte size evened out. Not sure how big of a deal this is or how often this happens.
  2. I'm detecting a vandalism revert with very crude logic (the tag is mw-rollback and the comment contains the words "revert" and "vandalism").
  3. For the large change snippet, I'm returning the line with the largest number of bytes changed from the diff endpoint. The diff endpoint works in wikitext, so this will need extra tinkering to see if we can return HTML there instead. Note that the largest-changed line could have both deleted and added characters, so it's worth considering whether we want that displayed and how we'd show it.
  4. I am interspersing new talk page topics. That's another one with some loose detection logic, so you might see some weirdness there. I'm making some effort to hide a topic if it was immediately rolled back in the subsequent revision, to avoid showing vandalism topics.
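
To make findings 1 and 2 concrete, here is a rough Python sketch of the grouping and the revert heuristic. It is not the prototype's code (that lives in the Node.js service); the revision fields (revid, delta, tags, comment), the output shape, the use of >= for the threshold comparison, and checking for a revert before checking size are all assumptions made for illustration.

```python
from typing import Dict, List

THRESHOLD_BYTES = 100  # prototype default; adjustable via the threshold query item


def is_vandalism_revert(rev: Dict) -> bool:
    """Crude check: rollback tag plus 'revert' and 'vandalism' in the comment."""
    comment = rev.get("comment", "").lower()
    return ("mw-rollback" in rev.get("tags", [])
            and "revert" in comment and "vandalism" in comment)


def batch_revisions(revs: List[Dict], threshold: int = THRESHOLD_BYTES) -> List[Dict]:
    """Group consecutive small changes into a count; emit large ones individually.

    Each revision dict is assumed to carry 'revid', 'delta' (net byte change),
    'tags' and 'comment'. These field names are hypothetical.
    """
    out: List[Dict] = []
    for rev in revs:
        if is_vandalism_revert(rev):
            out.append({"type": "vandalism-revert", "revid": rev["revid"]})
        elif abs(rev["delta"]) >= threshold:
            out.append({"type": "large", "revid": rev["revid"], "delta": rev["delta"]})
        elif out and out[-1]["type"] == "small":
            # Extend the current run of small changes.
            out[-1]["count"] += 1
        else:
            out.append({"type": "small", "count": 1})
    return out
```

Because only the net byte change is compared, a revision that deletes and adds roughly equal amounts of text has a delta near zero and lands in a "small" group, which is the limitation described in item 1.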

As an optimization I am immediately throwing out revisions that are under the small threshold so I don't have to analyze every single revision's diff. I expect this saves quite a bit on load time, but I haven't measured it against the unoptimized version. Unfortunately, if we keep this optimization I think it means we can't reliably determine the Reference added type, since we would need to fetch the diffs of small revisions as well to check for references. I'm pretty sure I can't detect the section added in the vandalism revert type for the same reason. I think I could determine some kind of section for large changes since I'm already fetching that diff; I just haven't tackled that part yet. What section do we show for changes that span across sections?
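
The trade-off above can be sketched roughly as follows, in Python for illustration only. The function names, the revision fields (revid, delta), and the simplified diff shape (a list of lines with text and changed_bytes) are all hypothetical; the real diff endpoint response is structured differently.

```python
from typing import Dict, List, Optional


def revisions_needing_diff(revs: List[Dict], threshold: int = 100) -> List[int]:
    """Return IDs of revisions whose net byte change meets the threshold.

    Everything below the threshold is dropped before any diff is fetched,
    so references added in small revisions are never seen.
    """
    return [r["revid"] for r in revs if abs(r["delta"]) >= threshold]


def largest_changed_line(diff_lines: List[Dict]) -> Optional[str]:
    """Pick the text of the diff line with the most changed bytes.

    'diff_lines' is a simplified, hypothetical shape: each entry has 'text'
    (wikitext) and 'changed_bytes'.
    """
    changed = [line for line in diff_lines if line["changed_bytes"] > 0]
    if not changed:
        return None
    return max(changed, key=lambda line: line["changed_bytes"])["text"]
```

With 20 revisions per page, the number of diff requests is bounded by however many of those revisions cross the threshold, rather than by the page size.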

Lastly, I'm deliberately not showing editor edit counts. If we heavily cache this endpoint, those counts would get out of date quickly. If we need that number to be fresh, I suggest we fetch editor counts client-side (hopefully there's a grouped call; I haven't looked yet).

Change 607802 abandoned by Tsevener:
[mediawiki/services/mobileapps@master] Add history batching route

Reason:
We will be working on this in a new labs instance rather than here and are working against a GitHub fork (https://github.com/tonisevener/mobileapps/tree/significant-changes). If we determine labs won't work I will open a new patch.

https://gerrit.wikimedia.org/r/607802