Background
T113094 ([EPIC] The Page Summary API needs to provide useful content for the majority of articles) identified many issues with previews not rendering appropriate text for articles. Most of these issues are listed below. Any issue not listed here will be pushed to a future iteration.
Proposal
The proposal has the following goals:
- Keep the extracts API query module as generic as possible.
- Contain Page Previews knowledge within its extension (see above).
- Support third-party MediaWiki installations.
- Unless the PO/TL decide we're not going to do this and let folk
We will create a query module within the Page Previews extension, pp_extract, that produces an extract satisfying the requirements above. It:
- Will defer to ExtractFormatter, provided by the TextExtracts extension, to do extract selection and content filtering.
- i.e. it'll filter out everything that doesn't match AC #1, #2, #4, and #5.
- Will not accept any parameters.
- Will only operate on one page – the Page Previews API request is only ever for one page.
Tying this into RESTBase should be as trivial as changing 'extracts' to 'pp_extract' and removing the ex-prefix query parameters.
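To make the size of that change concrete, here is a minimal sketch of the parameter rewrite, written in Python purely for illustration (RESTBase itself is not written in Python). The function name `to_pp_extract_params`, the example query, and the list of TextExtracts parameters in `EXTRACTS_PARAMS` are assumptions made for this sketch, not taken from the RESTBase configuration.

```python
# Parameters accepted by prop=extracts (TextExtracts). This list is an
# assumption for the sketch; check it against the installed version.
EXTRACTS_PARAMS = {
    "exchars", "exsentences", "exlimit", "exintro",
    "explaintext", "exsectionformat", "excontinue",
}


def to_pp_extract_params(params):
    """Return a copy of an Action API query that uses pp_extract."""
    converted = {}
    for key, value in params.items():
        # pp_extract accepts no parameters, so the extracts options go away.
        if key in EXTRACTS_PARAMS:
            continue
        if key == "prop":
            # prop may list several pipe-separated modules; swap only 'extracts'.
            modules = ["pp_extract" if m == "extracts" else m
                       for m in value.split("|")]
            value = "|".join(modules)
        converted[key] = value
    return converted


# Hypothetical "before" query, for illustration only.
before = {
    "action": "query",
    "format": "json",
    "prop": "extracts|pageimages",
    "titles": "Example",
    "exintro": "1",
    "explaintext": "1",
    "exsentences": "5",
}
after = to_pp_extract_params(before)
print(after)
```

The sketch matches an explicit list of parameter names rather than any key starting with "ex", because other query parameters (for example `export`) share that prefix.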
Plan (YMMV)
- Extract ApiQueryExtracts#getFirstSection( $text, $isPlainText ) to TextExtracts\Extractor#getFirstSection( $html ).
- It might be worth extracting ExtractFormatter::getFirstChars and ::getFirstSentences too…
- Create PagePreviews\HtmlElementFilter, which uses ExtractFormatter with the HTML element whitelist that satisfies the AC above.
- Create PagePreviews\ParentheticalFilter, which satisfies AC3.
- Create PagePreviews\ApiQueryPPExtract, which ties the above into the API.
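The plan above amounts to a three-stage pipeline: take the first section, drop elements that are not whitelisted, then drop parentheticals. The following is a rough Python sketch of that flow, not the PHP implementation. Every name in it (`first_section`, `filter_elements`, `strip_parentheticals`, `build_extract`, `ALLOWED_TAGS`) is a hypothetical stand-in, the whitelist contents are invented for the example, and it assumes AC3 means "remove parenthesised text", which this plan does not spell out.

```python
import re
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}
ALLOWED_TAGS = {"p", "b", "i"}  # illustrative whitelist, not the real AC list


def first_section(html):
    """Return the HTML that precedes the first heading element."""
    return re.split(r"<h[1-6]\b", html, maxsplit=1)[0]


class _WhitelistFilter(HTMLParser):
    """Drops non-whitelisted elements together with their content.

    Assumes well-formed input: every non-void element has a closing tag.
    """

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag not in self.allowed:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and tag in self.allowed:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.allowed:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def filter_elements(html, allowed=ALLOWED_TAGS):
    parser = _WhitelistFilter(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def strip_parentheticals(text):
    """Remove balanced (possibly nested) parenthesised spans.

    An unmatched "(" drops the rest of the text, and parentheses inside
    attribute values are treated like any others; a real filter would
    need to handle both cases.
    """
    out, depth = [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return re.sub(r" {2,}", " ", "".join(out))


def build_extract(html):
    return strip_parentheticals(filter_elements(first_section(html)))


page = ('<p><b>Foo</b> (<span class="ipa">/fu:/</span>; born 1900) '
        'is a thing.<sup>[1]</sup></p><h2>History</h2><p>More.</p>')
extract = build_extract(page)
print(extract)  # <p><b>Foo</b> is a thing.</p>
```

Keeping each stage as a separate function mirrors the split into Extractor, HtmlElementFilter, and ParentheticalFilter, so each can be tested on its own before ApiQueryPPExtract wires them together.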