[Spike: 3 Hours] Investigate page_summary RESTBase API
Closed, ResolvedPublic

Description

We want to be able to display math images within page previews. We currently deliver plain-text summaries so that parentheticals can be stripped from them, which breaks summaries that contain HTML (math expressions, most notably). We want to see whether the page_summary RESTBase API can help.

Acceptance criteria

Provide answers to:

  1. What does page_summary do?
  2. Can it let us strip parentheticals but also render Math (or other HTML)?
  3. How could we change it to strip parentheticals, leave <b> elements, and support Math?
  4. Schedule a meeting to talk about #3
  5. Should we just use TextExtracts (it can leave a limited subset of HTML in the extract)?

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 18 2016, 1:39 PM
ovasileva triaged this task as High priority. Nov 18 2016, 1:40 PM
ovasileva moved this task from To Triage to Triaged but Future on the Readers-Web-Backlog board.
ovasileva moved this task from Triaged but Future to Upcoming on the Readers-Web-Backlog board.
phuedx updated the task description. Nov 21 2016, 5:26 PM
phuedx updated the task description.
• jhobs renamed this task from [Spike]: Investigate page_summary RESTBase API to [Spike: 3 Hours] Investigate page_summary RESTBase API. Nov 21 2016, 5:29 PM
phuedx renamed this task from [Spike: 3 Hours] Investigate page_summary RESTBase API to [Spike]: Investigate page_summary RESTBase API. Nov 21 2016, 5:31 PM
phuedx updated the task description.
• jhobs renamed this task from [Spike]: Investigate page_summary RESTBase API to [Spike: 3 Hours] Investigate page_summary RESTBase API. Nov 21 2016, 5:41 PM
pmiazga claimed this task. Nov 23 2016, 11:27 PM
pmiazga moved this task from To Do to Doing on the Reading-Web-Sprint-86-🔪🦃 board.

@ovasileva could you schedule a meeting to talk about supporting HTML?

Summary

The RESTBase /page/summary/{title} service sends two API requests and stores the response in its cache. For the page summary, the underlying call to the MediaWiki API for the text extract (TextExtracts API) requests plain text.
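For reference, a minimal sketch of how a client might build the URL for this endpoint. The helper name `summaryUrl` is hypothetical, and the `/api/rest_v1` prefix is an assumption based on how the REST API is exposed on Wikimedia wikis; this task only names the `/page/summary/{title}` route.

```javascript
// Hypothetical helper, not part of RESTBase or Hovercards.
// Assumes the REST API is served under /api/rest_v1 on the wiki's domain.
function summaryUrl(title, domain = 'en.wikipedia.org') {
  // MediaWiki titles use underscores in place of spaces.
  const dbKey = title.trim().replace(/ /g, '_');
  return `https://${domain}/api/rest_v1/page/summary/${encodeURIComponent(dbKey)}`;
}

console.log(summaryUrl('San Francisco'));
```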

The TextExtracts API has an option called explaintext, which RESTBase sets to true. This parameter tells the TextExtracts extension to return a plain-text summary.
Currently there is no other way to retrieve HTML extracts via RESTBase. Changing this would require all mobile clients to evaluate requesting the HTML extract instead of the plain-text one.

If we want to pursue the RESTBase API, there are 3 possible scenarios:

  1. remove the explaintext option, which requires a technical discussion across teams
  2. add a switch in RESTBase to turn off the explaintext option
  3. create a new RESTBase endpoint just for Hovercards
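The second scenario could look roughly like the sketch below. It is an illustration only, not RESTBase code: `buildExtractQuery` and its `html` option are hypothetical names, and the request fields are copied from the second sample request in the technical details. One detail worth noting is that the MediaWiki API treats a boolean parameter as true whenever it is present, so turning explaintext off means omitting it rather than sending `false`.

```javascript
// Hypothetical helper (not part of RESTBase): builds the body of the
// extract request, with an opt-in switch for HTML extracts.
function buildExtractQuery(title, { html = false } = {}) {
  const body = {
    action: 'query',
    format: 'json',
    formatversion: 1,
    prop: 'info|extracts|pageimages|revisions|pageterms',
    exsentences: 5,
    piprop: 'thumbnail',
    pithumbsize: 320,
    rvprop: 'timestamp|ids',
    titles: title,
    wbptterms: 'description',
  };
  // Boolean API parameters count as true when present, regardless of value,
  // so the plain-text flag is only added when HTML was not requested.
  if (!html) {
    body.explaintext = true;
  }
  return body;
}

console.log(buildExtractQuery('San_Francisco'));
console.log(buildExtractQuery('San_Francisco', { html: true }));
```

Defaulting `html` to false keeps the current plain-text behaviour for existing clients, so only a caller that opts in (Hovercards, in this discussion) would receive HTML.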

Technical details

Sample requests for San Francisco:

{ 
  uri: 'https://en.wikipedia.org/w/api.php',
  method: 'post',
  headers: { host: 'en.wikipedia.org' },
  body: { 
    format: 'json',
    action: 'query',
    prop: 'info|revisions',
    continue: '',
    rvprop: 'ids|timestamp|user|userid|size|sha1|contentmodel|comment|tags',
    titles: 'San_Francisco',
    formatversion: 1,
    meta: undefined 
  } 
}

and

{ 
  uri: 'https://en.wikipedia.org/w/api.php',
  method: 'post',
  headers: { host: 'en.wikipedia.org' },
  body:  { 
    prop: 'info|extracts|pageimages|revisions|pageterms',
    exsentences: 5,
    explaintext: true,
    piprop: 'thumbnail',
    pithumbsize: 320,
    rvprop: 'timestamp|ids',
    titles: 'San_Francisco',
    wbptterms: 'description',
    action: 'query',
    format: 'json',
    formatversion: 1,
    meta: undefined 
  } 
}

RESTBase response for San Francisco Page summary query:

{
  "title":"San Francisco",
  "extract":"San Francisco (SF) (/sæn frənˈsɪskoʊ/), officially the City and County of San Francisco, is the cultural, commercial, and financial center of Northern California and the only consolidated city-county in California. San Francisco is about 46.9 square miles (121 km2) in area. It is located on the north end of the San Francisco Peninsula. It is the smallest county in the state. It has a density of about 18,451 people per square mile (7,124 people per km2), making it the most densely settled large city (population greater than 200,000) in the state of California and the second-most densely populated major city in the United States after New York City.",
  "thumbnail":{
    "source":"https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg/320px-Golden_Gate_Bridge%2C_SF_%28cropped%29.jpg",
    "width":320,
    "height":228
  },
  "lang":"en",
  "dir":"ltr",
  "timestamp":"2016-11-29T04:27:58Z",
  "description":"consolidated city-county in California, United States"
}
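To make the parenthetical problem concrete, here is a minimal sketch of stripping parentheticals from a plain-text extract such as the one above. It is an assumption about one possible approach, not the code Hovercards actually uses, and `stripParentheticals` is a hypothetical name. It only makes sense on plain text: in HTML, parentheses can also occur inside attributes or markup, which is part of why the extract is requested as plain text today.

```javascript
// Hypothetical sketch: removes (possibly nested) parentheticals from plain text.
// Text after an unclosed "(" is dropped; a stray ")" is kept as-is.
function stripParentheticals(text) {
  let out = '';
  let depth = 0;
  for (const ch of text) {
    if (ch === '(') {
      // Drop the whitespace that led into the outermost parenthetical.
      if (depth === 0) out = out.replace(/\s+$/, '');
      depth += 1;
    } else if (ch === ')' && depth > 0) {
      depth -= 1;
    } else if (depth === 0) {
      out += ch;
    }
  }
  return out;
}

console.log(stripParentheticals(
  'San Francisco (SF) (/sæn frənˈsɪskoʊ/), officially the City and County of San Francisco'
));
// prints "San Francisco, officially the City and County of San Francisco"
```

This naive version removes every parenthetical, including useful ones such as the unit conversion "(121 km2)", so a real implementation would likely need to be more selective.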

Links:

Issues caused by extracts being plain text:

For example, changing the explaintext property would break this change: https://phabricator.wikimedia.org/rEPOP37ddb1997f4bf816030dc3bd1ae2cf579068b029

The TextExtracts extension strips these elements from the HTML:

	"table",
	"div",
	"ul.gallery",
	".mw-editsection",
	"sup.reference",
	"ol.references",
	".error",
	".nomobile",
	".noprint",
	".noexcerpt"

Source: https://github.com/wikimedia/mediawiki-extensions-TextExtracts/blob/master/extension.json#L38-L48

@pmiazga, what are your answers to questions 4 and 5 from the acceptance criteria?

pmiazga added subscribers: phuedx, • jhobs. Edited Dec 2 2016, 8:55 PM
  1. We had a meeting with @phuedx, @ovasileva, and @jhobs.
  2. We decided not to use TextExtracts because it has many open bugs, and the current RESTBase implementation can only return plain text, not HTML. We will create a custom service that provides data for Hovercards: T113094
bmansurov closed this task as Resolved. Dec 2 2016, 9:23 PM

Thank you.