Page MenuHomePhabricator

Bootstrap an initial version of the Page Summary API in MCS
Closed, ResolvedPublic5 Estimated Story Points

Description

Per T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles, you'll find the spec for the Page Summary API here: https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API

Plan (YMMV)

  1. Add a route to the MobileApps service (MCS) that accepts a title and returns some JSON response.
  2. Get the lead section of the page.
  3. Get the intro from the lead section of the page (see https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API#Intros)
  4. Strip all HTML elements from the intro.
  5. Add the HTML elements back in case by case, e.g.
    • span
    • a
    • b
    • i
    • em
    • sup
    • sub
  6. Flatten a tags (see https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API#Flattening_inline_elements)
  7. Strip balanced parentheticals.
NOTE: Disambiguation pages and Wikidata entities can and should be done separately.

AC

Notes

  1. The Page Content API is a superset of the Page Summary API. If the Page Content Service repository has been created prior to working on this, then add this endpoint there.

Related Objects

StatusSubtypeAssignedTask
ResolvedNone
ResolvedNone
Resolved Jhernandez
Resolved Mholloway
DuplicateNone
DeclinedNone
ResolvedDereckson
ResolvedJdlrobson
Resolvedovasileva
DuplicateNone
DeclinedNone
Resolved Nirzar
Resolvedovasileva
ResolvedJdlrobson
DuplicateNone
DuplicateNone
Resolvedovasileva
DeclinedJdlrobson
ResolvedJdlrobson
Resolvedovasileva
ResolvedJdlrobson
DeclinedNone
Resolvedphuedx
DeclinedNone
ResolvedJdlrobson
ResolvedJdlrobson
ResolvedJdlrobson

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Jdlrobson set the point value for this task to 5.Jul 18 2017, 4:19 PM

Change 364896 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Use mobile content service instead of TextExtracts to generate summaries

https://gerrit.wikimedia.org/r/364896

Change 364906 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/services/mobileapps@master] Strip any content in brackets and IPA

https://gerrit.wikimedia.org/r/364906

Most of the service is now built with the exception of...

Strip balanced parentheticals.

@ovasileva it would be good to have lots of examples for doing this part. Can you provide some?

Example would look like this:

Summary: <b>Albert Einstein</b> was a German-born theoretical physicist.

I'll make sure to add any examples that we do come up with to the spec.

@ovasileva: Could you add the list of parenthesis here or here.

Most of the work for this is done, with the exception of parenthetical for which I'm finding lots of edge cases (please join the conversation https://www.mediawiki.org/wiki/Topic:Tvv8ff8h44bt74t8).

The following patches need review:

Once that's done it's over to services to change the wiring of https://it.wikipedia.org/api/rest_v1/page/summary/Lombardia so that it sources the summary fields from the new endpoint...

Change 364906 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Strip any content in brackets (with a few exceptions)

https://gerrit.wikimedia.org/r/364906

Remaining work is:

I'll pick that up when this work becomes a priority again (right now distracted by other things)

Jdlrobson added a subscriber: Pchelolo.

Summary from services sync today:
Update from JR: In my opinion the current service is better than the existing service, but has not been “QAed”. Currently returns HTML but proposal is to update to return json https://gerrit.wikimedia.org/r/370694 - not sure how to wire this up. Not committed to working on this right now as distracted by OCG.
First step: Only switch textextracts to MCS; full summary response generation in service will come later.
Action Item: Corey to ask Joe to test JRs plain text patch on iOS.
Action item: Petr to update and deploy text extracts swap on beta cluster and provide feedback to JR about patch if any problems.
Action Item: Bernd run 1000 item comparison on new summary

@Pchelolo https://gerrit.wikimedia.org/r/370694 is the patch to
@bearND when running the 1000 item comparison could you be sure to make sure https://gerrit.wikimedia.org/r/375056 is merged first?

Step 0 - merge the patch and deploy MCS at least in beta so that I could run tests while making an RB PR

@bearND when running the 1000 item comparison could you be sure to make sure https://gerrit.wikimedia.org/r/375056 is merged first?

@Jdlrobson Sure. If not merged yet I can also create the commit for the new large test for summaries to come after your patches, so that whatever you have there is covered, too. Of course, getting them merged first is better.

@Pchelolo @bearND everything appears to be merged. Over to you do comparisons of two endpoints and get this up on beta cluster. let me know if I need to make any further adjustments!

@Jdlrobson I'm going to push a couple of changes soon.

  1. I think we should get rid of the -html in the name since the response is now a JSON object. Why no call it just summary?
  2. Update Content-Type to application/json.
  3. Add spec.yaml entries.

Question for you are you (or we) planning to also supply the other metadata as part of the MCS response later? Right now I only see type, revision, extract, and extract_html.

@Jdlrobson One more question. In the code I only see one type of standard. Shouldn't there be more than that? I thought there was a different one for disambig pages.

@bearND correct. Just standard for the time being. The rest will come later as they need more discussion and focus. However I'm confident the endpoint as it stands will be far superior than current.

Am fine with changing the name.

Change 376137 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Rename endpoint preview-html to summary

https://gerrit.wikimedia.org/r/376137

Change 376137 merged by Ppchelko:
[mediawiki/services/mobileapps@master] Rename endpoint preview-html to summary

https://gerrit.wikimedia.org/r/376137

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-09-07T07:39:16Z] <mobrovac@tin> Started deploy [restbase/deploy@b51c58c]: Use MCS for the summary endpoint - T168848

Mentioned in SAL (#wikimedia-operations) [2017-09-07T07:45:52Z] <mobrovac@tin> Finished deploy [restbase/deploy@b51c58c]: Use MCS for the summary endpoint - T168848 (duration: 06m 36s)

The RESTBase side of things has been deployed, so that it now contacts MCS for getting the info needed for the summary end point. Note, however, that due to the fact that the summary end point is being used by Page Previews (high volume), I haven't dropped the old data from storage. This means that the new format will replace the old one gradually as pages need to be re-rendered.

@mobrovac: When was making that switch discussed? @ovasileva: Were you aware of this?

@mobrovac: When was making that switch discussed? @ovasileva: Were you aware of this?

The formats are fully compatible (the new format only adds two more fields to the output). As for the contents of the extract and extract_html fields, I trusted you guys (Reading) made the necessary precautions to satisfy all of the requirements from all sides, so went ahead and deployed it.

What is your exact concern?

What is your exact concern?

That this:

<snip>the contents of the extract and extract_html fields, I trusted you guys (Reading) made the necessary precautions to satisfy all of the requirements from all sides, so went ahead and deployed it.</snip>

Hasn't been done by any other member of the Reading Web team AFAICT.

If there are no differences between the extract field in the new MCS endpoint and the old RESTBase endpoint, then this deploy is a NOP from the perspective of the Page Previews client as no production wikis are consuming the extract_html field yet. @Jdlrobson: Is this the case or are there differences that we should be aware of?

Per https://phabricator.wikimedia.org/T168848#3571149 I was expecting this to be done on beta cluster only and a side by side comparison of old summaries vs new summaries. We haven't done that yet to my knowledge?

With the deploy to production of the MCS part I assumed that the tests had been carried out in Beta (seeing that you don't need RESTBase for that there in order to compare the outputs). I can undeploy this from the RB side from production if needed.

Not sure why the RB parts were deployed already either. I did see @Pchelolo 's PR, and thought we would test it in beta cluster first. I also wanted to run some tests comparing the output of the top 1000 pages in a few wikis.

Mentioned in SAL (#wikimedia-operations) [2017-09-07T15:25:56Z] <mobrovac@tin> Started deploy [restbase/deploy@77961d0]: Revert using MCS for the summary end point - T168848

Mentioned in SAL (#wikimedia-operations) [2017-09-07T15:32:14Z] <mobrovac@tin> Finished deploy [restbase/deploy@77961d0]: Revert using MCS for the summary end point - T168848 (duration: 06m 18s)

Jdlrobson claimed this task.

Thanks! Im not sure if a deploy only on beta cluster is useful now - we've shown this can be done when needed.
I'd suggest next step is for @bearND to do the side by side comparison: T175286