
MW REST API Historical Data Endpoint Needs
Open, High, Public

Description

Based on a suggestion from @Milimetric, I've listed the details of the data we need in order to support some historical data endpoints in the MW REST API.

Background
CPT is building a REST API for MW. As part of this project, we developed a series of endpoints to support the work of the iOS team. These endpoints were built specifically to retrieve historical data directly from the DBs.

Using the DBs for this purpose had a number of limitations: we had to add significant limits to the data returned, and certain data was not accessible in a cost-effective way.

Historical Data Endpoints

For each endpoint below, I have included the expected return data and a rough question that the returned data is meant to answer. I will ask @eprodromou, the PM for the project, to review the Questions and the Liveness to ensure these are correct. I'll update once that has been done.

  • Edit count:
    • Question: What is the maturity of a page's content?
    • Freshness/Liveness: 1 week
    • Returns: a count of all edits for a given page
  • Editor count:
    • Question: What is the diversity of contribution to this page?
    • Freshness/Liveness: 1 week
    • Returns: a count of the unique editors for a given page
  • Reverted edit count:
    • Question: What is the level of vandalism on this page?
    • Freshness/Liveness: 1 week
    • Returns: a count of all reverted edits for a given page. Specifically, this counts the edits that were reverted, not the tagged "reverting" revisions.
  • Anonymous edit count:
    • Question: What is the provenance of the information and how much is tied to the track record of the contributors?
    • Freshness/Liveness: 1 week
    • Returns: a count of edits to a page by unauthenticated contributors
  • Bot edit count:
    • Question: What is the level of automation present in the page's content?
    • Freshness/Liveness: 1 week
    • Returns: a count of all the edits made by bots
  • Minor edit count:
    • Question: What is the level of stability of the page's content relative to number of edits?
    • Freshness/Liveness: 1 week
    • Returns: a count of all edits flagged as minor
  • Reverted edit history:
    • Freshness/Liveness: <1 week
    • Returns: a list of all reverted edits
  • Anonymous edit history:
    • Freshness/Liveness: <1 week
    • Returns: a list of all edits made by unauthenticated contributors
  • Bot edit history:
    • Freshness/Liveness: <1 week
    • Returns: a list of all edits made by bots
  • Contributors to a Page:
    • Question: Who are the contributors to a page, so I can give them attribution for their work?
    • Freshness/Liveness: <1 week
    • Returns: a list of all those who have contributed content for a given page
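
To make the count definitions above concrete, here is a minimal sketch that computes them over an in-memory list of revisions. The `Revision` record and its field names are hypothetical and only illustrate the semantics; they are not the MediaWiki schema. The sketch also assumes that the editor count includes every distinct contributor, anonymous or not, which the real endpoint may handle differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """Hypothetical, simplified revision record (not the MediaWiki schema)."""
    rev_id: int
    user: str                   # user name, or IP address for anonymous edits
    is_anonymous: bool = False  # edit made by an unauthenticated contributor
    is_bot: bool = False
    is_minor: bool = False
    was_reverted: bool = False  # the edit that was undone, not the reverting edit


def page_history_counts(revisions):
    """Return the six per-page counts described in the list above."""
    return {
        "edits": len(revisions),
        # Assumption: every distinct contributor counts, including anonymous ones.
        "editors": len({r.user for r in revisions}),
        "reverted": sum(r.was_reverted for r in revisions),
        "anonymous": sum(r.is_anonymous for r in revisions),
        "bot": sum(r.is_bot for r in revisions),
        "minor": sum(r.is_minor for r in revisions),
    }


# Example history: revision 2 is vandalism that revision 3 reverts.
history = [
    Revision(1, "Alice"),
    Revision(2, "192.0.2.1", is_anonymous=True, was_reverted=True),
    Revision(3, "Alice", is_minor=True),
    Revision(4, "CleanupBot", is_bot=True, is_minor=True),
]
counts = page_history_counts(history)
```

Note that revision 3, the reverting edit, does not add to the reverted count; only revision 2 does.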

With the exception of Reverted Edits Count/History and Contributors to a Page, these endpoints have been implemented as part of the MW REST API in PageHistoryCountHandler and PageHistoryHandler. However, there are significant limits on the data returned in order to protect DB performance and keep load down.
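
As a rough illustration of how a client might call the implemented count endpoints and cope with those limits, here is a small sketch. The URL pattern, the count type names, and the `count`/`limit` response fields are assumptions made for this example; they are not taken from this task, so check them against the handlers before relying on them.

```python
from urllib.parse import quote

# Assumed type names for the count endpoints that are already implemented.
# Reverted counts are left out because they are not implemented yet.
COUNT_TYPES = {"edits", "editors", "anonymous", "bot", "minor"}


def history_count_url(base_url, title, count_type):
    """Build a count URL using an assumed path pattern."""
    if count_type not in COUNT_TYPES:
        raise ValueError(f"unsupported count type: {count_type}")
    encoded_title = quote(title, safe="")
    return f"{base_url.rstrip('/')}/v1/page/{encoded_title}/history/counts/{count_type}"


def describe_count(payload):
    """Render a count, marking it as a lower bound if the server capped it.

    Assumes a payload shaped like {"count": 500, "limit": True}, where a true
    "limit" means the real total is at least "count".
    """
    count = payload["count"]
    capped = payload.get("limit", False)
    return f"{count}+" if capped else str(count)


url = history_count_url("https://example.org/w/rest.php", "Main Page", "bot")
label_capped = describe_count({"count": 500, "limit": True})
label_exact = describe_count({"count": 12, "limit": False})
```

Treating a capped count as a lower bound is the main design point here: a consumer can still display something useful without the DB having to produce an exact total.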

The User Stories for the project are listed on our wiki. Let us know if anything is unclear or if information needs to be expanded.

Event Timeline

WDoranWMF triaged this task as Medium priority. Dec 10 2019, 10:03 PM
WDoranWMF updated the task description.
WDoranWMF added a subscriber: Milimetric.
fdans raised the priority of this task from Medium to High. Dec 23 2019, 5:14 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Ping to @WDoranWMF: in the upcoming quarter the team will start working on T258532: [SPIKE] Prototype of incremental updates for mediawiki history for simplewiki, including reverts using apache hudi. That work is a foundational block for being able to serve up-to-date data to external consumers. Let's talk more about this when there is some availability.