Investigate API and client architecture requirements for edit history, diffs and related user actions
Closed, Resolved · Public · Spike

Description

Comments discussion summary:
@Mholloway and @MusikAnimal advise against relying on the XTools API in production.


Spike findings:

What's available:

  • Edit metrics for sparkline. Caveat: We can only get metrics for up to 1 year. We can choose between daily and monthly granularity.
  • Page view count for a page.
  • ORES scores.

ORES caveats:

  1. This might be temporary, but a request for a larger page times out: https://en.wikipedia.org/w/index.php?title=Barack_Obama&oldid=910566164 (scores for revision 910566164 of Barack Obama on enwiki). It works for a smaller page like Morogoro on enwiki: https://ores.wikimedia.org/v3/scores/enwiki/911077246.
  2. Different wikis have different models. For example, we can get an articlequality score for enwiki articles, but only goodfaith and damaging scores for plwiki articles.
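
Because the available models differ per wiki, a client would probably read whichever scores come back rather than assume a fixed set. A minimal sketch, assuming the ORES v3 response nests results as wiki, then "scores", then revision ID, then model name. The helper name `extract_predictions` and the sample payload are illustrative, not real data:

```python
def extract_predictions(response, wiki, rev_id):
    """Return {model_name: prediction} for whichever models the wiki served."""
    models = response.get(wiki, {}).get("scores", {}).get(str(rev_id), {})
    predictions = {}
    for model, result in models.items():
        score = result.get("score")
        if score is not None:  # skip models that returned no score (e.g. an error)
            predictions[model] = score.get("prediction")
    return predictions


# Illustrative payload only (assumed shape, made-up revision ID and values).
sample = {
    "plwiki": {
        "scores": {
            "123": {
                "damaging": {"score": {"prediction": False}},
                "goodfaith": {"score": {"prediction": True}},
            }
        }
    }
}

print(extract_predictions(sample, "plwiki", 123))
```

A wiki or revision that is absent from the response yields an empty dict, so the UI can hide the score instead of failing.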

Caveat: We can't get the items listed below all at once. We request page history in batches (up to 500 revisions per batch). Currently, we request one batch of 51 revisions when the view loads and load more as the user scrolls. For example, the Barack Obama article on English Wikipedia has 7,412 editors; even if we requested the maximum batch size each time, it would take 15 requests to get all the editor data.
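
The request count above is a ceiling division. A small sketch of the arithmetic (the function name is illustrative):

```python
import math


def requests_needed(total_items, batch_size=500):
    """Number of batched requests required to fetch total_items."""
    return math.ceil(total_items / batch_size)


print(requests_needed(7412))      # 15 requests at the 500-item maximum
print(requests_needed(7412, 51))  # 146 requests at the current batch size of 51
```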


What's not available:

  • Reverted edits count
  • Bot edits count
  • Total tags, editors, edits, IP count

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Jul 23 2019, 6:48 PM
JMinor renamed this task from API spike to Investigate API and client architecture requirements for edit history, diffs and related user actions. Jul 23 2019, 9:27 PM
JMinor triaged this task as High priority.

I'd caution you that tools hosted on Cloud VPS, like XTools, generally don't have any service-level agreement. At any time they could go away permanently and without warning.

I don't know whether that's true specifically for XTools. It's a long-standing and well-loved project, linked from enwiki history pages, and with strong support both in CommTech and in the community. We should find out from @MusikAnimal whether there's been any discussion with the Cloud Services team about a specific SLA for XTools, given its importance. I've seen some suggestion on Wikitech that SLAs may be granted to Cloud VPS projects on a project-specific basis.

I'd also be interested in hearing from @MusikAnimal whether there's been any discussion of promoting any of the XTools stats that you mentioned above into the Action API.

Also, I should point out that some of the things you're looking for from XTools are available from the Action API, albeit not necessarily as user-friendly in presentation. For example, this query:

https://en.wikipedia.org/w/api.php?action=query&formatversion=2&titles=Ching_Shih&prop=contributors|revisions&pclimit=500&rvdir=newer&rvlimit=1

gives you the full list of editors plus the anon editor count, as well as the creation date and other info about the initial revision. (If you were interested, you could also get the total number of watchers from prop=info&inprop=watchers.)
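
As a rough sketch of how a client might build that query and read the result, assuming the formatversion=2 response puts `contributors`, `anoncontributors` and `revisions` on the first page object. The helper names and the sample payload are hypothetical, and `format=json` is added here although it is not in the URL above:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def history_summary_url(title):
    """Build the contributors + first-revision query for a page title."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": title,
        "prop": "contributors|revisions",
        "pclimit": "500",
        "rvdir": "newer",  # oldest revision first
        "rvlimit": "1",    # only the initial revision
    }
    return API + "?" + urlencode(params)


def summarize(response):
    """Pull editor counts and the creation timestamp from one response."""
    page = response["query"]["pages"][0]
    first_revision = page["revisions"][0]
    return {
        # Only the contributors in this batch; more may follow via continuation.
        "registered_editors_in_batch": len(page.get("contributors", [])),
        "anon_editors": page.get("anoncontributors", 0),
        "created": first_revision["timestamp"],
    }


# Illustrative payload only (assumed shape, made-up values).
sample = {
    "query": {
        "pages": [
            {
                "title": "Example",
                "anoncontributors": 3,
                "contributors": [{"userid": 1, "name": "A"}, {"userid": 2, "name": "B"}],
                "revisions": [{"revid": 10, "user": "A", "timestamp": "2005-01-01T00:00:00Z"}],
            }
        ]
    }
}

print(history_summary_url("Ching_Shih"))
print(summarize(sample))
```

For pages with more than 500 registered contributors, the client would still have to follow the API's continuation to count them all.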

Total edits (and total minor/IP/maybe bot edits) seems like a small change to add.

For total edit counts per user, do you only need the logged-in user's edit count, or for an arbitrary user? And for the local wiki user, or the global user account? I suspect that https://en.wikipedia.org/w/api.php?action=help&modules=query%2Buserinfo or https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bglobaluserinfo could give you what you need.
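
For the global-account case, a request URL might be built like this. It is only a sketch: it assumes the `globaluserinfo` module accepts `guiuser` and `guiprop=editcount`, which should be checked against the help page linked above, and the function name is illustrative:

```python
from urllib.parse import urlencode


def global_edit_count_url(username, api="https://en.wikipedia.org/w/api.php"):
    """Build a globaluserinfo query for an arbitrary user's global edit count."""
    params = {
        "action": "query",
        "format": "json",
        "meta": "globaluserinfo",
        "guiuser": username,     # assumed parameter name, see the module help
        "guiprop": "editcount",  # assumed property name, see the module help
    }
    return api + "?" + urlencode(params)


print(global_edit_count_url("Example"))
```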

Thank you for all the information @Mholloway, that's super helpful!

So it sounds like, without XTools, the endpoints you mentioned would cover:

  1. Count of editors for a given page
  2. Count of edits for a given user (total would have to be calculated on the client side)
  3. Article creation date for the sparkline

Then on the client side, we'd have to calculate:

  1. Count of minor edits for a given page (we could calculate it on the client side from the revisions response, checking for the existence of minor flag)
  2. Count of anon edits for a given page (we could calculate it on the client side from the revisions response, checking for the existence of anon flag)
  3. Count of IP edits for a given page (IP edits are anonymous edits, so this is the same count as above, checking for the existence of anon flag)

The problem with calculating these on the client side is that we'd have to get the entire history all at once. Right now, we only get 1 batch of 51 entries at first and load more as the user scrolls down.
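
One way to live with batching is a running tally that is updated as each batch arrives. The sketch below assumes each revision dict carries truthy `minor` and `anon` flags when they apply (as with formatversion=2 and the flags and user properties requested); `tally_batch` and the sample batches are hypothetical. The totals only cover the revisions loaded so far, so they stay partial until the whole history has been fetched:

```python
from collections import Counter


def tally_batch(totals, revisions):
    """Add one batch of revisions to the running totals and return them."""
    for rev in revisions:
        totals["edits"] += 1
        if rev.get("minor"):
            totals["minor"] += 1
        if rev.get("anon"):
            totals["anon"] += 1
    return totals


totals = Counter()

# Illustrative batches only (made-up revisions).
first_batch = [
    {"revid": 3, "user": "A", "minor": True},
    {"revid": 2, "user": "192.0.2.1", "minor": False, "anon": True},
]
second_batch = [{"revid": 1, "user": "B", "minor": False}]

tally_batch(totals, first_batch)
tally_batch(totals, second_batch)
print(dict(totals))
```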

What's missing:

  1. Count of reverted edits - do you know if there's any endpoint that serves that information, @Mholloway?

Thanks @Mholloway! We have consolidated our preferred endpoint requirements for this and API questions we have. I realize some of this might already be accessible but wanted to get it out there to get a feel for what is and isn't doable and to document what alternatives we have now that XTools may not be usable. Also tagging @WDoranWMF @Eevans and @Jhernandez, not to bypass any particular person but just to get the contacts I have in one place for thoughts/consideration.

To answer your last question - we need edit counts for an arbitrary user, and those counts would be for their global account, not local wiki counts.

API Questions:

Q1. Other than XTools API (https://xtools.readthedocs.io/en/stable/tools/index.html), is there an endpoint that returns

  1. Count of edits for a given page:
    • total number of edits for a given page
    • total number of editors for a given page
    • count of minor, IP, bot, reverted edits for a given page
  2. Count of edits for a given user

Q2. Using the revisions endpoint (https://www.mediawiki.org/wiki/API:Revisions), is there a way to include the following in the response:

  • information about whether the edit was an IP edit
  • information about whether the edit was a bot edit
  • information about whether the edit was reverted

Wow, this is not the first time the mobile team asked about using XTools APIs! I will say the same thing as before -- while XTools typically has very good uptime (upwards of 99.9% out of the year, according to my data), it probably should not be relied upon by something like the production Wikipedia iOS app. Our APIs are more for research purposes, user scripts, other Cloud Services tools, etc., not for applications that receive significant traffic. I can guarantee however that XTools is not going anywhere, provided it's within our control :)

We should find out from @MusikAnimal whether there's been any discussion with the Cloud Services team about a specific SLA for XTools, given its importance. I've seen some suggestion on Wikitech that SLAs may be granted to Cloud VPS projects on a project-specific basis.

I'm not sure what an SLA entails, but sounds interesting! I'd love to learn more.

I'd also be interested in hearing from @MusikAnimal whether there's been any discussion of promoting any of the XTools stats that you mentioned above into the Action API.

That would be amazing! I was under the assumption most of our queries would be too heavy for production MediaWiki, or would attract an occasional complaint from the DBAs. XTools was meant to fill in the gaps of what MediaWiki doesn't (or shouldn't) do, with the understanding that the client is willing to wait. Queries can and do time out, especially when there is maintenance on the database replicas.

The slowest part for the ArticleInfo endpoint, in particular, seems to be fetching the total number of revisions to a page. As far as I know this information is not retrievable via Wikimedia APIs, though it is on the action=info page.

I should point out that some of the things you're looking for from XTools are available from the Action API, albeit not necessarily as user-friendly in presentation.

Indeed, this is advertised at the top of our documentation: https://xtools.readthedocs.io/en/stable/api/index.html. Production Wikimedia APIs should be used where possible.

For example, this query:

https://en.wikipedia.org/w/api.php?action=query&formatversion=2&titles=Ching_Shih&prop=contributors|revisions&pclimit=500&rvdir=newer&rvlimit=1

gives you the full list of editors plus the anon editor count, as well as the creation date and other info about the initial revision.

...

Total edits (and total minor/IP/maybe bot edits) seems like a small change to add.

For the purposes of XTools, we'd need the total count of unique editors instead of a list of usernames, since there can be many thousands. Bot edits would need to include former bots, too.

What's missing:
Count of reverted edits - do you know if there's any endpoint that serves that information?

Revert detection is practically a science, so I'm going to guess the answer is no. XTools has a complicated system of checking for reverts, comparing the SHAs of revisions, and going by edit summaries and tags of known (semi-)automated tools (including Rollback, etc.). This still gives you a very rough picture of the number of reverts. It also requires scanning the entire history of a page, which is why we don't offer this information through our APIs (too slow).
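
To illustrate only the SHA-comparison part, here is a simplified sketch of "identity revert" detection: when a revision's content hash matches an earlier revision's, the edits in between are treated as reverted. This is not XTools' actual implementation. It ignores edit summaries and tags, assumes the full history is available oldest-first with a `sha1` field per revision, and the function name is illustrative:

```python
def count_identity_reverted(revisions):
    """Count edits undone by a later revision restoring earlier content.

    revisions: oldest-first list of dicts with 'revid' and 'sha1' keys.
    """
    last_seen = {}    # sha1 -> index of the most recent revision with that content
    reverted = set()  # revids of edits that were later undone
    for i, rev in enumerate(revisions):
        sha = rev["sha1"]
        earlier = last_seen.get(sha)
        # A match with a non-adjacent earlier revision undoes everything between.
        if earlier is not None and earlier < i - 1:
            for undone in revisions[earlier + 1 : i]:
                reverted.add(undone["revid"])
        last_seen[sha] = i
    return len(reverted)


# Illustrative history only: revision 4 restores the content of revision 1.
history = [
    {"revid": 1, "sha1": "a"},
    {"revid": 2, "sha1": "b"},
    {"revid": 3, "sha1": "c"},
    {"revid": 4, "sha1": "a"},
    {"revid": 5, "sha1": "d"},
]

print(count_identity_reverted(history))
```

Even this simple version needs every revision of the page, which matches the point above about why the count is too slow to serve on demand.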


Hope this helps, and also thank you for the kind words :)

Thanks for all the clarifications @MusikAnimal, it's very helpful! Moving this spike to Blocked & Waiting now so that the iOS team can reassess. Related development work continues in T228783.

@Milimetric might have some ideas about good ways to achieve this using the analytics infrastructure.

I believe he is out on leave for a while, so I'm pinging @Ottomata instead in case he has some ideas as well.

All the data requested here is available in the mediawiki history dataset. We have not had any requests to query this data from our user-facing interface. It's certainly possible, but not trivial: it's just too much data to allow arbitrary querying without putting it on a *monster* cluster. But, for example, if we just had to answer the questions listed here, we could probably do it much more efficiently. We just need someone to stand up and say "this is important". Also, right now this data is computed monthly. If we need more frequent updates, the same principle applies. It's very hard to update everything incrementally, but for a limited set of questions and queries, we could update near real-time.

In short, reverts, bots, tags, anon, all these things are well understood and we have good data. But it's a lot of data and standing up a public API requires a good collaboration across our teams.