Up until now, we've mostly gotten away with using the `prop=extracts` MediaWiki API behind RESTBase, which has allowed us to scale out Page Previews to a couple of large Wikipedias without issue. However, as the definition of a page summary becomes more complicated (in the wake of the simple implementation of HTML previews in {T165018}), and given the complexity of generating extracts in the TextExtracts extension, it's becoming clear(er) that the extension shouldn't be the place where we house the notion of what a page summary is. Forcing this separation has the added benefit of no longer conflating TextExtracts and Page Previews: we (Reading Web) readily admit that we don't know who's using the TextExtracts API or how they're using it.
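For context, this is roughly the shape of the action API request we've been leaning on. A minimal sketch that only builds the URL (the title is illustrative; `exintro`, `explaintext`, and `exsentences` are TextExtracts options):

```python
from urllib.parse import urlencode

# Build (but don't send) the kind of action API request Page Previews
# has relied on so far via prop=extracts.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": 1,        # only content before the first section heading
    "explaintext": 1,    # plain text rather than limited HTML
    "exsentences": 5,    # cap the extract length
    "titles": "San Francisco",  # illustrative title
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
```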
We now have [[ https://www.mediawiki.org/wiki/User_talk:Phuedx_(WMF)/Reading/Web/Page_Preview_API | a spec for the Page Summary API ]]. The review of the spec is tracked at {T169761}.
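Assuming the endpoint shape in the spec holds up through review, a client would address summaries per title roughly like this (sketch only; `summary_url` is a hypothetical helper, and titles follow the usual MediaWiki convention of underscores for spaces plus percent-encoding):

```python
from urllib.parse import quote

def summary_url(domain: str, title: str) -> str:
    """Build a RESTBase page summary URL for a page title.

    Spaces become underscores and the title is percent-encoded,
    per the usual MediaWiki title rules.
    """
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{domain}/api/rest_v1/page/summary/{encoded}"
```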
== Plan (YMMV)
[x] Create the new Page Summary API (T168848).
[x] Move parenthetical stripping from the client-side to the Page Summary API.
- Related discussion about whether to remove all parentheticals or only conditionally remove some: T91344.
- Fix remaining issues with parentheticals, e.g. T162219.
[ ] Check that T181314 and T181316 are resolved.
[ ] Add support for disambiguation pages via the Disambiguator extension (T168392).
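At its core, the server-side parenthetical stripping mentioned above is a balanced-bracket scan over the plain-text extract. A naive Python sketch (hypothetical function name; the real behaviour needs the conditional cases discussed in T91344, e.g. keeping parentheticals the reader actually wants):

```python
def strip_parentheticals(text: str) -> str:
    """Remove balanced (possibly nested) parenthetical spans from plain text."""
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    # Collapse the double spaces left behind by removed spans.
    return " ".join("".join(out).split())
```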
---
There are many bugs open against TextExtracts that cause unexpected issues with the page summary we display to users. We either need to write a bunch of tests and fix up TextExtracts, or build a new API specifically for the purpose of Page Previews.
Among the open issues:
* We may want to render inline images (see T99793)
* Some HTML tags make sense, e.g. `sub` and `sup` (T112137)
* Parentheticals are sometimes useful and sometimes not - we need some semantic way to distinguish them (T91344, T164100, T162219)
* Links should get annotated with the title of the page to avoid issues with non-links showing hover cards (T75936)
* The extract should not include `<noinclude>` content (T109869)
* The HTML extract is not always well formed since the extract does not use a DOM parsing library (T166272)
…
* See subtasks.
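On the malformed-HTML point (T166272): a character-based truncation can cut the extract mid-element and leave tags dangling. A minimal repair sketch using Python's stdlib `HTMLParser` to close whatever a truncated fragment leaves open (hypothetical helper names; a production service would want a proper DOM library rather than this):

```python
from html.parser import HTMLParser

class TagBalancer(HTMLParser):
    """Track which non-void tags are still open in an HTML fragment."""

    VOID = {"br", "img", "hr", "wbr", "source"}

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

def close_dangling_tags(fragment: str) -> str:
    """Append closing tags for anything a truncated fragment left open."""
    balancer = TagBalancer()
    balancer.feed(fragment)
    return fragment + "".join(f"</{t}>" for t in reversed(balancer.stack))
```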