Until now, we've mostly gotten away with using the prop=extracts MediaWiki API behind RESTBase, which has let us scale Page Previews out to a couple of large Wikipedias without issue. However, the definition of a page summary is becoming more complicated – in the wake of the simple implementation of HTML previews in T165018: Page previews can consume new summary-HTML endpoint – and generating extracts in the TextExtracts extension is itself complex. It is becoming clear(er) that the extension shouldn't be the place where we house the notion of what a page summary is. Forcing this separation has the added benefit of preventing us from conflating TextExtracts and Page Previews: we (Reading Web) readily admit that we don't know who's using the TextExtracts API or how they are using it.
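For reference, a prop=extracts request of the kind Page Previews depends on looks roughly like the sketch below. The parameter values are illustrative, not the exact set the extension sends:

```python
from urllib.parse import urlencode

# Build a prop=extracts query against the MediaWiki action API.
# The endpoint and parameter values here are illustrative; the exact
# request Page Previews issues (via RESTBase) may differ.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "prop": "extracts",       # provided by the TextExtracts extension
    "titles": "Ada Lovelace",
    "exintro": 1,             # only the lead section
    "format": "json",
    "formatversion": 2,
}

url = API_ENDPOINT + "?" + urlencode(params)
print(url)
```

The response carries the extract as a single HTML (or, with explaintext, plain-text) string per page, which is exactly the blob whose shape we currently have no contract for.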
- Create the new Page Summary API (T168848).
- Move parenthetical stripping from the client-side to the Page Summary API.
- Check that T181314 and T181316 are resolved.
- Add support for disambiguation pages via the Disambiguator extension (T168392).
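Moving parenthetical stripping server-side amounts to something like the following sketch. The helper name is invented for illustration, and the real implementation would also have to cope with parentheses inside HTML attributes, unbalanced input, and the semantic cases discussed in T91344:

```python
def strip_parentheticals(text: str) -> str:
    """Remove balanced (...) spans, including nested ones, from text.

    A simplified sketch of the stripping Page Previews currently does
    client-side; not the actual Page Summary API code.
    """
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    # Collapse the double spaces left behind by removed spans.
    return " ".join("".join(out).split())

print(strip_parentheticals("Ada Lovelace (1815–1852) was a mathematician."))
# → Ada Lovelace was a mathematician.
```

Doing this once, server-side, means every consumer of the new API gets the same cleaned-up lead text rather than reimplementing the stripping per client.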
There are many open bugs against TextExtracts that cause unexpected issues with the page summary we display to users. We either need to write a comprehensive set of tests and fix up TextExtracts, or build a new API specifically for the purposes of Page Previews.
There are a number of issues that a replacement API would need to address:
- We may want to render inline images (see T99793)
- Some HTML tags make sense to keep, e.g. <sub> and <sup> (T112137)
- Parentheticals are sometimes useful and sometimes not – we need some semantic way to distinguish the two (T164100, T162219). We discussed this to a conclusion in T91344 (which we kept open, but stalled, for further discussion)
- Links should be annotated with the title of the page they point to, to avoid issues with non-links showing hover cards (T75936)
- <noinclude> content should not appear in the extract (T109869)
- The HTML extract is not always well formed, since the extract is not generated with a DOM parsing library (T166272)
- See subtasks.
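The malformed-HTML issue (T166272) is easy to reproduce: truncating raw HTML by character count leaves unclosed tags, which a DOM-aware pass would catch. A minimal illustration using Python's stdlib parser (the class and function names here are invented for this sketch):

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Track open tags to detect the unclosed-tag output T166272 describes."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # self-closing tags

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def unclosed_tags(html: str):
    """Return the tags left open after parsing the fragment."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.stack


# Naive character-based truncation of raw HTML can cut an element off
# mid-contents, leaving its tags unclosed:
truncated = "<p>The <b>quick</b> <i>brown fox"
print(unclosed_tags(truncated))  # → ['p', 'i']
```

A summary API built on a real DOM library would never emit such fragments, because truncation would happen on the parsed tree rather than on the raw string.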