
[EPIC] Add notion of emptiness to page summary API
Closed, Resolved · Public

Description

There are a couple of instances where we'd like to signal that a page doesn't have a summary or that the summary is empty:

  • When the page is a disambiguation page (T168391)

We can and should use the Disambiguator extension's API to detect whether or not the page is a disambiguation page.

  • When the last sentence from the page's introductory paragraph ends with a ":" (T168328)

Currently, we test whether the extract is empty after some processing is done on the client. This check can and should be done in the page summary API instead (see the sketch below).
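A minimal sketch of how the two conditions could be checked server-side (the type and helper names below are hypothetical, not the actual API code):

```typescript
// Sketch only: hypothetical names, not the actual page summary API code.
// A summary is treated as empty when the page is a disambiguation page
// (per the Disambiguator extension) or when the intro's last sentence
// ends with a ":".

interface PageInfo {
  extract: string;            // plain-text introductory extract
  isDisambiguation: boolean;  // e.g. derived from the "disambiguation" page prop
}

function summaryIsEmpty(page: PageInfo): boolean {
  if (page.isDisambiguation) {
    return true;
  }
  const trimmed = page.extract.trim();
  return trimmed === '' || trimmed.endsWith(':');
}
```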

Caveats

We've deferred the decision to pick either the MediaWiki API or RESTBase as the place that we build the page summary API. We need to decide where we're going to move this logic and own that architectural decision.

Event Timeline

mobrovac subscribed.

We've deferred the decision to pick either the MediaWiki API or RESTBase as the place that we build the page summary API. We need to decide where we're going to move this logic and own that architectural decision.

This is a point that needs to be determined a priori, IMHO. I think the question is whether TextExtracts should have the notion of what a page summary is or not. If yes, then this logic belongs there. If not, it should be transferred to RESTBase. Alas, I am not convinced that RESTBase is a good place for it, because RESTBase has no notion of disambiguation or list pages (just as it doesn't have the notion of user pages, talk pages, etc.), and given its purpose it probably shouldn't. Furthermore, placing that logic in RESTBase would mean that the Disambiguator extension becomes a required component on which RESTBase depends. Conversely, if done in TextExtracts, it could check whether the Disambiguator extension is installed before consulting its API.

If we want to return an empty extract for disambiguation pages, TextExtracts is definitely the place for that.

If we want to add an is_disambiguation bool to the summary JSON, then it's RESTBase.

Personally I think the former is better.
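For illustration, the two response shapes under discussion would look roughly like this (apart from is_disambiguation, the field names are only examples):

```typescript
// Illustration only: field names other than is_disambiguation are hypothetical.

// Option 1: TextExtracts returns an empty extract for disambiguation pages.
const optionOne = {
  title: 'Mercury',
  extract: '',
};

// Option 2: RESTBase keeps the extract and adds a flag to the summary JSON.
const optionTwo = {
  title: 'Mercury',
  extract: 'Mercury may refer to:',
  is_disambiguation: true,
};
```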

Jdlrobson triaged this task as Medium priority. Jun 21 2017, 9:12 PM
Jdlrobson moved this task from Incoming to Epics/Goals on the Web-Team-Backlog board.

I think the question is whether TextExtracts should have the notion of what a page summary is or not. If yes, then this logic belongs there. If not, it should be transferred to RESTBase.

The definition of what a page summary is or isn't shouldn't belong in TextExtracts. TextExtracts is a general-purpose API for getting the first N characters/sentences of a page, with an unknown number of third-party clients. We should consider TextExtracts and page summaries distinct.

Alas, I am not convinced that RESTBase is a good place for it, because RESTBase has no notion of disambiguation or list pages (just as it doesn't have the notion of user pages, talk pages, etc.), and given its purpose it probably shouldn't. Furthermore, placing that logic in RESTBase would mean that the Disambiguator extension becomes a required component on which RESTBase depends. Conversely, if done in TextExtracts, it could check whether the Disambiguator extension is installed before consulting its API.

The Disambiguator extension provides a MediaWiki API query module that we can use as part of the internal queries that RESTBase makes when generating page summaries. If the extension isn't installed, you'll see a warning in the warnings property of the MediaWiki API response. Since this is a soft failure, RESTBase would only be loosely coupled to the Disambiguator extension.
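For example, the internal check could look something like this (a sketch only, not RESTBase's actual code; the disambiguation page prop queried here is the one set by the Disambiguator extension):

```typescript
// Sketch only: shows the general shape of the internal MediaWiki API call,
// not RESTBase's actual implementation.
async function isDisambiguationPage(apiUrl: string, title: string): Promise<boolean> {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'pageprops',
    ppprop: 'disambiguation',
    titles: title,
    format: 'json',
    formatversion: '2',
  });
  const res = await fetch(`${apiUrl}?${params}`);
  const body = await res.json();

  // Soft failure: if the Disambiguator extension isn't installed the prop is
  // simply absent (and the response may carry warnings), so fall back to
  // treating the page as a regular, non-disambiguation page.
  if (body.warnings) {
    return false;
  }
  const page = body.query?.pages?.[0];
  return Boolean(page?.pageprops && 'disambiguation' in page.pageprops);
}
```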

As for list pages… neither the MediaWiki API nor the MediaWiki internals have any notion of list pages, AFAIK. This is a special case that we're going to have to resolve via a hack, and that hack could live anywhere.

Now, how should the Reading Web team be using RESTBase? Can we move business logic into services, like Mobile-Content-Service, for example, or should we only be thinking of RESTBase as a distributed cache that acts RESTfully? If it's the latter, then the Reading Web team will undertake creating a new MediaWiki API query module, which will both generate page summaries and be our definition of what a page summary is. We'll then use RESTBase to help us achieve high-volume access.

We are doing pretty well on the performance side of things: more than 95% of external requests (most of which are PP nowadays) never even reach RESTBase; rather, they are served directly from Varnish. This means that we have the flexibility to decide where and how to do it. I see several options:

  1. Implement the logic in TextExtracts by supplying it an extra parameter that activates the business logic
  2. Implement a new MW API module (possibly in a new extension)
  3. Do the business logic in RESTBase
  4. Implement it in MCS

Out of these, the last one is the most flexible, as it allows you to freely tweak the results, fix any issues and get them out quickly. Moreover, MCS is fully WM-centred, which conceptually makes it the canonical place for implementing WP-specific logic. Option 1 feels like a hack because, as you point out, TextExtracts should be generic and work for third parties as well. While the second option is viable too, I personally am not in favour of it, as I don't think that the proliferation of extensions is a good thing. Also of note is that choosing either of the first two options means that you are dependent on the train and SWAT slots to get your changes out. Finally, implementing it in RESTBase might be the quickest fix now, but this comes with caveats for the future. Most notably, it means depending on the services team for changes/reviews/merges/deploys, which is not sustainable in the long run if you are to own the endpoint's output. We want to give you the flexibility to iterate and make changes as you see fit. We also don't want to replace one monolith (MW) with another (RB) :P

I'd opt for 4-ish: build a new service, so that we get the benefits of 4 while not depending on Services (seriously though, y'all are awesome).

Indeed, I knew I had 5 options in mind when I started typing my reply, but forgot one along the way :) That is also an attractive option, but it would be good to check with the rest of the Reading Services team, as they are planning on renaming/repurposing MCS, cf. T157059: Rename Mobile Content Service?

phuedx claimed this task.

These requirements were folded into the Page Preview API spec some time ago, the review of which is tracked in T169761: Review Summary 2.0 Spec.

For the record, we've also opted to implement this and a lightweight version of TextExtracts as routes in Mobile-Content-Service.