Follow up from the offsite:
Both the new and old page content APIs output JSON directly. While this is the end goal, it has a few downsides:
1. It does not separate the concerns of finding content within the HTML and formatting it into JSON; i.e., once the HTML is chopped up into distinct JSON fields, the original representation and its flexibility are lost.
2. Because changes are not output as HTML, any cleanup performed by the API is not easily upstreamed to Parsoid if it is found to be general-purpose.
To address this, a new API will be written that only extracts content from the new HTML API (T162179) using known selectors and then outputs that content as JSON. This API will be the basis for all other platform-specific JSON APIs, so that none of them need to perform any extraction from HTML themselves.
NOTE: If any more content needs to be extracted for new APIs, the extraction should be done here so that we do not have HTML extraction code in different APIs.
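The selector-based extraction step described above could be sketched roughly as follows. This is purely illustrative: the selector table, field names, and function are hypothetical, and a real implementation would parse the Parsoid HTML with a proper DOM library (e.g. domino or cheerio) rather than the regexes used here to keep the sketch dependency-free.

```javascript
// Hypothetical sketch of centralized, selector-driven extraction.
// One table maps output JSON fields to patterns in the Parsoid HTML,
// so no other API ever touches the HTML directly. All names here are
// illustrative assumptions, not the actual API.
const SELECTORS = {
  displayTitle: /<h1 id="firstHeading">([^<]*)<\/h1>/,
  wikidataDescription: /<p class="wikidata-description">([^<]*)<\/p>/,
};

// Extract every known field from the HTML into a flat JSON object.
function extractContent(html) {
  const result = {};
  for (const [field, pattern] of Object.entries(SELECTORS)) {
    const m = html.match(pattern);
    if (m) result[field] = m[1];
  }
  return result;
}

const sample = '<h1 id="firstHeading">Earth</h1>' +
  '<p class="wikidata-description">Third planet from the Sun</p>';
console.log(extractContent(sample));
```

Keeping the selector table in one place is the point of the design: adding a new extracted field means adding one entry here, and every downstream JSON API picks it up.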
## Content
In particular, this API should extract and return all content identified by the Reading platform teams (from https://etherpad.wikimedia.org/p/web-code):
- page id https://www.mediawiki.org/wiki/Manual:Page_table#page_id
- revision id https://www.mediawiki.org/wiki/Manual:Revision_table#rev_id
- tid
- Wikibase item (Wikidata Q number)
- Title
- Display Title
- Wikidata description
- Section List (with mapping to section ids)
- Lead image
- Lat/Long (geolocation)
- Gallery / Image List (licences, owners, image info)
- References List
- Whether the page response was from a redirect
- Language Variants List
- Last modified date / user (logged in or anonymous)
- Namespace
- Localized namespace name
- Page Protection
- ContentModel - https://www.mediawiki.org/wiki/Manual:ContentHandler
- Whether the page is "Main Page" or not
- Whether the page is a disambiguation page
- Whether the page is editable
- Page issues
- Info box (the html)
- Link to the spoken version
- Hatnotes (disambig links etc)
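A possible response shape covering a subset of the fields above, purely as a strawman — the field names, types, and nesting are assumptions and have not been decided:

```json
{
  "id": 42,
  "revision": 123456789,
  "wikibase_item": "Q42",
  "title": "Example",
  "display_title": "Example",
  "description": "Wikidata description of the page",
  "redirected": false,
  "namespace": { "id": 0, "localized_name": "" },
  "content_model": "wikitext",
  "editable": true,
  "is_main_page": false,
  "is_disambiguation": false,
  "sections": [
    { "id": 0, "anchor": "" },
    { "id": 1, "anchor": "History" }
  ],
  "lead_image": { "file": "Example.jpg" },
  "geo": { "latitude": 0.0, "longitude": 0.0 }
}
```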
## Open questions
### Do we include the following? How?
- Lead section?
- Title pronunciation URL? [[ https://phabricator.wikimedia.org/diffusion/GMOA/browse/master/lib/parseProperty.js;0742f792d05a5fc093ff0b95fea323db033e3660$26 | Content Service example ]]
- Page spoken recording URLs?
- Whether the page was featured?
- Whether the page is watched by the logged-in user?
- All links? Or only interlanguage links? Redlinks (please!!!!)?
### Do we include the HTML blobs for the following in this response or separate HTML APIs?
- Article Content
- References