[EPIC] The Page Summary API needs to provide useful content for the majority of articles
Closed, Resolved · Public

Assigned To

Authored By

	Jdlrobson
	Sep 18 2015, 8:33 PM

Description

Until now, we've mostly gotten away with using the prop=extracts MediaWiki API behind RESTBase, which let us scale Page Previews out to a couple of large Wikipedias without issue. However, the definition of a page summary is becoming more complicated in the wake of the simple implementation of HTML previews in T165018: Page previews can consume new summary-HTML endpoint, and generating extracts in the TextExtracts extension is complex. It is therefore becoming clear(er) that the extension shouldn't be where we house the notion of what a page summary is. Forcing this separation has the added benefit of preventing us from conflating TextExtracts and Page Previews. We (Reading Web) readily admit that we don't know who is using the API or how they are using it.
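For orientation, here is a minimal sketch of the two kinds of request being contrasted: the action API with prop=extracts, and a REST-style summary endpoint. The helper names are hypothetical, and the `/w/api.php` and `/api/rest_v1/page/summary/{title}` paths are assumptions based on how Wikimedia sites are commonly configured; they are not stated in this task. The code only builds URLs and makes no network calls.

```python
from urllib.parse import quote, urlencode


def extracts_url(host: str, title: str) -> str:
    """Build an action API URL asking TextExtracts for the intro extract."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # limit the extract to the lead section
        "titles": title,
        "format": "json",
    }
    # Assumption: the wiki exposes the action API at /w/api.php.
    return f"https://{host}/w/api.php?{urlencode(params)}"


def summary_url(host: str, title: str) -> str:
    """Build a REST summary URL for a title (spaces become underscores)."""
    # Assumption: the summary endpoint lives under /api/rest_v1/page/summary/.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/api/rest_v1/page/summary/{encoded}"


print(extracts_url("en.wikipedia.org", "San Francisco"))
print(summary_url("en.wikipedia.org", "San Francisco"))
```

The point of the separation is that a client only needs the second URL and does not have to know which extension produced the summary.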

We now have a spec for the Page Summary API. The review of the spec is tracked at T169761: Review Summary 2.0 Spec.

Plan (YMMV)

Create the new Page Summary API (T168848).
Move parenthetical stripping from the client-side to the Page Summary API.
- Related discussion about whether to remove parentheticals or conditionally remove some: T91344.
- Fix remaining issues with parentheticals, e.g. T162219.
Check that T181314 and T181316 are resolved.
Add support for disambiguation pages via the Disambiguator extension (T168392)
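To make the parenthetical-stripping step concrete, here is a rough sketch of one possible approach. It is not the code used by the Page Summary API. The function names are hypothetical, and the heuristic of only stripping a parenthetical that follows whitespace (so that something like a chemical formula is left alone) is an assumption made for this illustration. It handles nested brackets, leaves unbalanced brackets untouched, and tidies the spacing and trailing punctuation that stripping can leave behind.

```python
import re


def _matching_close(text: str, start: int) -> int:
    """Return the index of the ')' matching the '(' at start, or -1."""
    depth = 0
    for j in range(start, len(text)):
        if text[j] == "(":
            depth += 1
        elif text[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    return -1  # unbalanced: no matching close found


def strip_parentheticals(text: str) -> str:
    """Drop balanced parentheticals that start after whitespace."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Assumption: "Ca(OH)2" style brackets (no preceding space) are kept.
        if ch == "(" and (i == 0 or text[i - 1].isspace()):
            end = _matching_close(text, i)
            if end != -1:
                i = end + 1  # skip the whole parenthetical, nesting included
                continue
        out.append(ch)
        i += 1
    result = "".join(out)
    result = re.sub(r"\s+([,.;:])", r"\1", result)  # no space before punctuation
    result = re.sub(r"\s{2,}", " ", result)         # collapse doubled spaces
    return result.strip()


print(strip_parentheticals("Paris (French: Paris (listen)) is the capital of France."))
print(strip_parentheticals("Calcium hydroxide is Ca(OH)2."))
```

A purely textual approach like this cannot tell a useful parenthetical from a useless one, which is the semantic problem discussed in T91344.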

There are many bugs open against TextExtracts that cause unexpected issues with the page summary we display to users. We either need to write a bunch of tests and fix up TextExtracts or build a new API specifically for the purpose of Page Previews.

There are a number of issues that need to be addressed:

We may want to render inline images (see T99793)
Some HTML tags make sense, e.g. sub and sup (T112137)
Parentheses are sometimes useful and sometimes not; we need some semantic way to distinguish between the two (T164100, T162219). We discussed this to a conclusion in T91344, although that task was kept open but stalled for further discussion.
Links should get annotated with the title of the page to avoid issues with non-links showing hover cards (T75936)
Should not show <noinclude> content in the extract (T109869)
The HTML extract is not always well formed since the extract does not use a DOM parsing library (T166272)
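Two of the points above (keeping a few meaningful tags and always emitting well-formed HTML) can be illustrated with a parser-based filter. This is only a sketch using Python's standard `html.parser` and is not how TextExtracts or the summary service work. The class and function names are hypothetical, and the allowed tag set beyond sub and sup is an assumption.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: only sub and sup come from the task; b and i are illustrative.
ALLOWED_TAGS = {"b", "i", "sub", "sup"}


class SummarySanitizer(HTMLParser):
    """Keep allowed tags, drop all others, and close anything left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are dropped
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close inner tags first so the output stays properly nested.
            while self.open_tags:
                top = self.open_tags.pop()
                self.parts.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))

    def result(self) -> str:
        self.close()  # flush any buffered text
        while self.open_tags:
            self.parts.append(f"</{self.open_tags.pop()}>")
        return "".join(self.parts)


def sanitize_summary(html_text: str) -> str:
    parser = SummarySanitizer()
    parser.feed(html_text)
    return parser.result()


print(sanitize_summary('<p>H<sub>2</sub>O is <span class="x">water</span> <b>and <i>wet</p>'))
```

Because the output is rebuilt from parser events rather than cut from a string, a truncated or unclosed input still yields balanced tags.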

…

See subtasks.

Related Objects

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	None	T177425 Develop General Layer of PCS
Resolved	• Jhernandez	T177426 Develop structured JSON APIs for general consumption
Resolved	• Mholloway	T177431 Develop a Summary JSON API
Resolved	Dereckson	T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki)
Resolved	Jdlrobson	T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature
Resolved	ovasileva	T154635 [EPIC] Deploy page previews to English and German Wikipedia
Resolved	ovasileva	T192622 [EPIC] Page previews post-deploy cleanup
Resolved	Jdlrobson	T173952 Remove A/B testing instrumentation code
Duplicate	None	T167433 Switch all projects to the new (and yet to be built) summary-html endpoint for page previews
Duplicate	None	T167429 Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint
Resolved	ovasileva	T165018 Page previews can consume new summary-HTML endpoint
Declined	Jdlrobson	T111329 [GOAL] Page previews on mobileweb
Resolved	Jdlrobson	T164010 [EPIC] Strengthen the APIs we provide in reading web maintained extensions
Resolved	ovasileva	T113094 [EPIC] The Page Summary API needs to provide useful content for the majority of articles
Resolved	phuedx	T113633 Spike: Alternative to TextExtracts for Popups, Gather, Read more
Resolved	phuedx	T109867 Infobox template parameters showing in Hovercard
Duplicate	None	T109869 Hovercards should not show <noinclude> content in the extract
Resolved	Jdlrobson	T74629 Reference lists appearing in extracts for some articles
Resolved	Jdlrobson	T73023 Invalid HTML markup standalone LI element when using references on main page
Duplicate	None	T112137 Hovercards loses manual superscript formatting by requesting plain text
Duplicate	ovasileva	T152414 Previews must display useful text for the majority of articles
Resolved	ovasileva	T159388 Text Extracts used in Page Previews shows strange characters from template
Duplicate	None	T162219 Parentheticals: Words incorrectly concatenated due to too simple removing of spaces
Declined	None	T159065 Fix formulas in HTML extracts
Duplicate	Jdlrobson	T163442 Math formulas are not rendering in preview page dialogs
Resolved	Mhurd	T155573 [Regression] exclude pronunciation guides from article extracts
Declined	• JMinor	T164100 Consider excluding pronunciation guides from TextExtracts
Resolved	phuedx	T112226 Text extracts unable to summarise content contains {{dts}} templates
Resolved	• Pchelolo	T165017 Setup RESTBase HTML extract endpoint
Resolved	Jdlrobson	T165619 Spike: Specify changes to page-summary endpoint
Declined	• Pchelolo	T166163 Create a summary content migration filter
Resolved	phuedx	T167852 "<translate>" and <tvar|*> tag visible in Page Previews: HtmlFormatter only flattens spans
Declined	Jdlrobson	T168329 exsentences does not work correctly when HTML output used
Duplicate	None	T168625 Make the Page Summary API return an "intro" for a page
Resolved	• Fjalapeno	T167022 Inventory requirements for summary 2.0 endpoint across Reading teams
Resolved	phuedx	T168400 [EPIC] Add notion of emptiness to page summary API
Declined	pmiazga	T168328 Extracts for Wikimedia List articles display partial previews
Declined	None	T168391 Page previews should display generic preview for disambiguation pages
Resolved	phuedx	T114418 Disable TextExtracts on file pages
Declined	None	T168418 Page Previews should accept the service's notion of emptiness
Resolved	• Pchelolo	T164291 Make title-related properties consistent
Resolved	• bearND	T169761 Review Summary 2.0 Spec
Resolved	• Mholloway	T170692 Return common URLs in summary API so clients do not have to perform bug prone string manipulation
Resolved	MSantos	T177619 Return variant URLs and titles in the metadata response
Resolved	• Mholloway	T178446 Expose display titles for a page in all available language variants through the action API
Invalid	None	T69232 Hovercards: Redirect titles should be bolded in the extract
Resolved	Jdlrobson	T170617 Adjust expectations for API consumers when using the TextExtracts API
Invalid	None	T185472 API problems with double spaces in wiki sections
Duplicate	None	T171053 Complete entire page previews test plan for page summary api
Resolved	Jdlrobson	T171052 Add disambiguation page handling in Page Summary API
Resolved	Jdlrobson	T168848 Bootstrap an initial version of the Page Summary API in MCS
Resolved	Jdlrobson	T172021 It shouldn't be possible for coordinates to be the lead paragraph
Resolved	Jdlrobson	T174698 Parenthetical stripping is too aggressive
Resolved	phuedx	T171065 Expose disambiguation property (ppprop=disambiguation) in mobileview api
Resolved	• bearND	T175286 Do side by side comparison of old summary endpoint against new summary endpoint
Resolved	• Mholloway	T176974 Mime vulnerability blocking merges to MCS
Duplicate	None	T176063 Page preview shows only the first two words for a specific article
Resolved	Jdlrobson	T176517 Pages that are redirects give empty summary objects
Resolved	Jdlrobson	T176519 Old templates can lead to sup elements inside summary
Resolved	phuedx	T176521 data-mw attributes should be stripped from summary before scrubbing parentheticals
Resolved	Jdlrobson	T176522 Intro property incorrectly identified in formatted endpoint
Resolved	Jdlrobson	T176525 Parenthetical: New edge case with nested brackets
Resolved	• bearND	T177007 Should we flatten spans in summary output
Resolved	Jdlrobson	T178125 Metadata (coordinates) is sometimes surfaced as lead intro/summary
Resolved	• Mholloway	T178333 Move RESTBase page summary logic to MCS
Resolved	• Mholloway	T178420 What is a "content namespace" for purposes of the summary 2.0 endpoint?
Resolved	Jdlrobson	T183833 [Bug report] Removing parentheses breaks chemical formulas
Resolved	Jdlrobson	T185050 Run comparison of html extracts again
Resolved	• bearND	T185161 Stripping parentheticals can result in trailing punctuation

Event Timeline


Why has this been closed? Spam?

Jdlrobson reopened this task as Open. Jul 31 2017, 11:08 PM

Quiddity removed • Zhuanru001 as the assignee of this task. Aug 1 2017, 12:08 AM

Quiddity added a subscriber: • Zhuanru001.

In T113094#3487901, @mobrovac wrote:

Why has this been closed? Spam?

Spam.

Jdlrobson mentioned this in T173639: Hovercard text extract is broken for `* ` sequence in parenthesis. Aug 21 2017, 6:58 PM

Jdlrobson mentioned this in T173640: Hovercard text extract is broken for academic titles before and after names of person. Aug 21 2017, 7:13 PM

Jdlrobson merged a task: T173641: Hovercard article topic bolding fails if the page title has some qualifier. Aug 21 2017, 7:24 PM

Jdlrobson added subscribers: Dvorapa, JAnD, matej_suchanek, Vachovec1.

Jdlrobson removed a project: Patch-For-Review. Aug 22 2017, 6:41 PM

Jdlrobson removed a subtask: T165018: Page previews can consume new summary-HTML endpoint. Aug 22 2017, 7:02 PM

Jdlrobson removed subtasks: T167433: Switch all projects to the new (and yet to be built) summary-html endpoint for page previews, T167429: Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint.

Jdlrobson added parent tasks: T167429: Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint, T167433: Switch all projects to the new (and yet to be built) summary-html endpoint for page previews, T165018: Page previews can consume new summary-HTML endpoint.

Jdlrobson moved this task from Needs Prioritization to Epics/Goals on the Web-Team-Backlog board. Aug 24 2017, 5:24 PM

Jdlrobson removed a subtask: T74546: Strip <br> tags from extracts.

Dvorapa unsubscribed. Aug 25 2017, 8:42 AM

Restricted Application added a subscriber: jeblad. Aug 25 2017, 8:42 AM

ovasileva reopened subtask T91344: Review exclude all approach to parenthetical elements in summary endpoint as Stalled. Aug 28 2017, 10:17 AM

Jdlrobson created subtask T175286: Do side by side comparison of old summary endpoint against new summary endpoint. Sep 7 2017, 4:14 PM

Jdlrobson closed subtask T168848: Bootstrap an initial version of the Page Summary API in MCS as Resolved.

Quiddity unsubscribed. Sep 7 2017, 9:49 PM

pmiazga added a subtask: T176063: Page preview shows only the first two words for a specific article. Sep 20 2017, 12:17 AM

Jdlrobson created subtask T176517: Pages that are redirects give empty summary objects. Sep 22 2017, 8:35 PM

Jdlrobson created subtask T176519: Old templates can lead to sup elements inside summary. Sep 22 2017, 8:48 PM

Jdlrobson created subtask T176521: data-mw attributes should be stripped from summary before scrubbing parentheticals. Sep 22 2017, 9:10 PM

Jdlrobson created subtask T176522: Intro property incorrectly identified in formatted endpoint. Sep 22 2017, 9:18 PM

Jdlrobson created subtask T176525: Parenthetical: New edge case with nested brackets. Sep 22 2017, 9:36 PM

matej_suchanek unsubscribed. Sep 23 2017, 8:08 AM

phuedx added a subtask: T170692: Return common URLs in summary API so clients do not have to perform bug prone string manipulation. Sep 26 2017, 3:05 PM

Jdlrobson added a project: User-Jdlrobson. Sep 27 2017, 8:42 PM

Jdlrobson moved this task from Inbox to Tracking on the User-Jdlrobson board. Sep 27 2017, 8:58 PM

Jdlrobson created subtask T177007: Should we flatten spans in summary output. Sep 28 2017, 7:07 PM

Jdlrobson closed subtask T175286: Do side by side comparison of old summary endpoint against new summary endpoint as Resolved. Sep 28 2017, 7:22 PM

nshahquinn-wmf unsubscribed. Oct 3 2017, 7:55 PM

• Fjalapeno added a parent task: T177431: Develop a Summary JSON API. Oct 4 2017, 5:55 PM

• Fjalapeno added a project: Page Content Service. Oct 6 2017, 4:03 PM

• Fjalapeno closed subtask T167022: Inventory requirements for summary 2.0 endpoint across Reading teams as Resolved. Oct 6 2017, 4:06 PM

Jdlrobson closed subtask T176525: Parenthetical: New edge case with nested brackets as Resolved. Oct 10 2017, 6:53 PM

Jdlrobson closed subtask T176522: Intro property incorrectly identified in formatted endpoint as Resolved. Oct 11 2017, 11:14 PM

Jdlrobson created subtask T178125: Metadata (coordinates) is sometimes surfaced as lead intro/summary. Oct 12 2017, 11:00 PM

Jdlrobson closed subtask T178125: Metadata (coordinates) is sometimes surfaced as lead intro/summary as Resolved. Oct 13 2017, 7:56 PM

• Mholloway created subtask T178333: Move RESTBase page summary logic to MCS. Oct 16 2017, 8:00 PM

• Mholloway created subtask T178420: What is a "content namespace" for purposes of the summary 2.0 endpoint? Oct 17 2017, 6:10 PM

Jdlrobson closed subtask T176519: Old templates can lead to sup elements inside summary as Resolved. Oct 17 2017, 6:26 PM

phuedx closed subtask T176521: data-mw attributes should be stripped from summary before scrubbing parentheticals as Resolved. Oct 17 2017, 6:26 PM

Jdlrobson closed subtask T176517: Pages that are redirects give empty summary objects as Resolved. Oct 17 2017, 6:26 PM

• Mholloway closed subtask T178420: What is a "content namespace" for purposes of the summary 2.0 endpoint? as Resolved. Oct 18 2017, 9:54 PM

• bearND closed subtask T170692: Return common URLs in summary API so clients do not have to perform bug prone string manipulation as Resolved. Nov 9 2017, 3:15 AM

• bearND closed subtask T178333: Move RESTBase page summary logic to MCS as Resolved.

Jdlrobson updated the task description. Nov 27 2017, 6:00 PM

Jdlrobson mentioned this in T181314: Post-nominal letters are not properly removed in link previews.

Jdlrobson mentioned this in T181316: IPA is not properly removed in link previews.

Jdlrobson updated the task description.

Jdlrobson added a subtask: T183833: [Bug report] Removing parentheses breaks chemical formulas. Jan 2 2018, 11:05 PM

Jdlrobson mentioned this in T183833: [Bug report] Removing parentheses breaks chemical formulas. Jan 2 2018, 11:08 PM

Jdlrobson merged a task: T185135: Summary truncated after 3 chars ("Dr."). Jan 17 2018, 10:56 PM

Jdlrobson added a subscriber: • Esanders.

Jdlrobson closed subtask T185050: Run comparison of html extracts again as Resolved. Jan 17 2018, 11:30 PM

Jdlrobson added a subtask: T185161: Stripping parentheticals can result in trailing punctuation.

• bearND closed subtask T177007: Should we flatten spans in summary output as Resolved. Jan 18 2018, 11:26 PM

Jdlrobson mentioned this in T185472: API problems with double spaces in wiki sections. Jan 22 2018, 11:04 PM

tagging kanban board for goals tracking

ovasileva moved this task from To Do to Quarterly Goals on the Readers-Web-Kanbanana-Board-Old board. Feb 15 2018, 4:01 PM

Jdlrobson closed subtask T183833: [Bug report] Removing parentheses breaks chemical formulas as Resolved. Feb 22 2018, 10:06 PM

Jdlrobson changed the status of subtask T170617: Adjust expectations for API consumers when using the TextExtracts API from Stalled to Open.

Jdlrobson closed subtask T185161: Stripping parentheticals can result in trailing punctuation as Resolved. Feb 22 2018, 10:09 PM

Jdlrobson updated the task description. Feb 22 2018, 10:11 PM

Jdlrobson closed subtask T171052: Add disambiguation page handling in Page Summary API as Resolved. Feb 22 2018, 10:13 PM

Jdlrobson updated the task description.

Vachovec1 reopened subtask T183833: [Bug report] Removing parentheses breaks chemical formulas as Open. Feb 22 2018, 10:16 PM

Jdlrobson closed subtask T183833: [Bug report] Removing parentheses breaks chemical formulas as Resolved. Mar 6 2018, 7:06 PM

Jdlrobson removed a subtask: T91344: Review exclude all approach to parenthetical elements in summary endpoint. Mar 8 2018, 7:49 PM

Jdlrobson updated the task description.

I believe this can be resolved now @ovasileva
Remaining issues can be addressed via bug fixing.
Note there is one single open sub task: T170617 which would be nice to get done sooner rather than later, but we don't need to track this work under the epic.

ovasileva moved this task from Quarterly Goals to Ready for Signoff on the Readers-Web-Kanbanana-Board-Old board. Mar 11 2018, 5:54 PM

RandomDSdevel awarded a token. Mar 15 2018, 9:14 PM

All looks good, changes have been documented and communicated, and subtasks are resolved. Closing this. Good job everyone.

MBinder_WMF awarded a token. Mar 16 2018, 4:03 PM

@ovasileva: I tried to find documentation about this new API, but wasn't successful. Searching for "Page Summary API" on MediaWiki.org led me to:

https://www.mediawiki.org/wiki/Extension:PageSummaries, which seems to be an unrelated extension
https://www.mediawiki.org/wiki/Page_Previews/API_Specification, which seems to be the design specification for the API, not documentation for using the finished API

I also looked at Page Previews, Extension:TextExtracts, and the main MediaWiki API documentation, but didn't see anything useful. Am I just overlooking it?