
Extract JSON API from MCS Page Content API
Closed, Invalid (Public)


Follow up from the offsite:

Both the new and old page content APIs output JSON directly. While this is the end goal, it has a few downsides:

  1. It does not separate the concerns of finding content within the HTML and formatting it into JSON; i.e., if you chop up the HTML into distinct JSON fields, the original representation and its flexibility are lost.
  2. Because changes are not output as HTML, any cleanup performed by the API is not easily upstreamed to Parsoid if it is found to be general purpose.

To address this, a new API will be written that only extracts content from the output of the new HTML API (T162179) using known selectors and then outputs that content as JSON. This API will be the basis for all other platform-specific JSON APIs, so that none of them need to perform any extraction from HTML themselves.

NOTE: If any more content needs to be extracted for new APIs, the extraction should be done here so that we do not have HTML extraction code in different APIs.
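
To make the intended split concrete, here is a minimal sketch of "extract by known selectors, then emit JSON". It is not the actual service code: the `id` values in `KNOWN_SELECTORS`, the JSON field names, and the `extract_json` helper are all hypothetical, and it collects plain text only, whereas the real API may well keep HTML fragments.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping: id attribute of a known element -> JSON field name.
KNOWN_SELECTORS = {
    "lead-section": "lead",
    "references": "references",
}

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SelectorExtractor(HTMLParser):
    """Collects the text inside elements whose id is in KNOWN_SELECTORS."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._field = None  # JSON field currently being filled, if any
        self._depth = 0     # nesting depth inside the matched element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        field = KNOWN_SELECTORS.get(dict(attrs).get("id"))
        if field is not None:
            self._field = field
            self._depth = 1
            self.fields[field] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None  # left the matched element

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field] += data


def extract_json(html):
    """Return a JSON string with one field per known selector found."""
    parser = SelectorExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps({k: v.strip() for k, v in parser.fields.items()},
                      sort_keys=True)
```

With all selector knowledge kept in one place like this, a platform-specific JSON API would only reshape the returned fields and would never need to parse HTML itself.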


In particular, this API should extract and return all content identified by the Reading platform teams here (from:

Open questions

Do we include the following? How?

  • Lead section?
  • Title pronunciation URL? Content Service example.
  • Page spoken recording URLs?
  • Whether the page was featured?
  • Whether the page is watched by the logged-in user?
  • All links? Or only interlanguage links? Redlinks (please!!!!)?

Do we include the HTML blobs for the following in this response, or serve them from separate HTML APIs?

  • Article Content
  • References

Event Timeline

Adding compatibility layer as a dependency… Some of the information in the JSON API will be extracted from the HTML Compatibility layer

@Fjalapeno @bearND If the apps are supposed to use the html content service + the metadata JSON service, who would use the JSON content service?

Or is this epic about the JSON metadata service? T177428: Develop Metadata JSON API