Bootstrap an initial version of the Page Summary API in MCS
Closed, ResolvedPublic5 Story Points

Description

Per T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles, you'll find the spec for the Page Summary API here: https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API

Plan (YMMV)

  1. Add a route to the MobileApps service (MCS) that accepts a title and returns some JSON response.
  2. Get the lead section of the page.
  3. Get the intro from the lead section of the page (see https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API#Intros)
  4. Strip all HTML elements from the intro.
  5. Add the HTML elements back in case by case, e.g.
    • span
    • a
    • b
    • i
    • em
    • sup
    • sub
  6. Flatten a tags (see https://www.mediawiki.org/wiki/User:Phuedx_(WMF)/Reading/Web/Page_Preview_API#Flattening_inline_elements)
  7. Strip balanced parentheticals.
NOTE: Disambiguation pages and Wikidata entities can and should be done separately.

AC

Notes

  1. The Page Content API is a superset of the Page Summary API. If the Page Content Service repository has been created prior to working on this, then add this endpoint there.

Related Objects

StatusAssignedTask
OpenNone
OpenNone
OpenNone
OpenNone
StalledNone
OpenNone
OpenNone
Resolvedovasileva
OpenNone
StalledNone
StalledJdlrobson
OpenNone
Stalledovasileva
OpenNone
OpenNone
DeclinedNone
Openphuedx
StalledNone
ResolvedJdlrobson
ResolvedJdlrobson
ResolvedJdlrobson
There are a very large number of changes, so older changes are hidden. Show Older Changes
phuedx updated the task description. (Show Details)Jul 18 2017, 4:17 PM
Jdlrobson set the point value for this task to 5.Jul 18 2017, 4:19 PM
ovasileva updated the task description. (Show Details)Jul 19 2017, 12:53 PM
phuedx updated the task description. (Show Details)Jul 19 2017, 12:56 PM

Change 364896 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Use mobile content service instead of TextExtracts to generate summaries

https://gerrit.wikimedia.org/r/364896

Change 364906 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/services/mobileapps@master] Strip any content in brackets and IPA

https://gerrit.wikimedia.org/r/364906

Most of the service is now built with the exception of...

Strip balanced parentheticals.

@ovasileva it would be good to have lots of examples for doing this part. Can you provide some?

Example would look like this:

Summary: <b>Albert Einstein</b> was a German-born theoretical physicist.

phuedx added a comment.EditedJul 21 2017, 9:53 AM

I'll make sure to add any examples that we do come up with to the spec.

Restricted Application added a subscriber: PokestarFan. · View Herald TranscriptJul 25 2017, 5:28 PM
phuedx added a comment.EditedJul 27 2017, 5:14 PM

@ovasileva: Could you add the list of parenthesis here or here.

Most of the work for this is done, with the exception of parenthetical for which I'm finding lots of edge cases (please join the conversation https://www.mediawiki.org/wiki/Topic:Tvv8ff8h44bt74t8).

The following patches need review:

Once that's done it's over to services to change the wiring of https://it.wikipedia.org/api/rest_v1/page/summary/Lombardia so that it sources the summary fields from the new endpoint...

Change 364906 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Strip any content in brackets (with a few exceptions)

https://gerrit.wikimedia.org/r/364906

Remaining work is:

I'll pick that up when this work becomes a priority again (right now distracted by other things)

kaldari removed a subscriber: kaldari.Aug 24 2017, 6:27 PM
Jdlrobson added a subscriber: Pchelolo.

Summary from services sync today:
Update from JR: In my opinion the current service is better than the existing service, but has not been “QAed”. Currently returns HTML but proposal is to update to return json https://gerrit.wikimedia.org/r/370694 - not sure how to wire this up. Not committed to working on this right now as distracted by OCG.
First step: Only switch textextracts to MCS; full summary response generation in service will come later.
Action Item: Corey to ask Joe to test JRs plain text patch on iOS.
Action item: Petr to update and deploy text extracts swap on beta cluster and provide feedback to JR about patch if any problems.
Action Item: Bernd run 1000 item comparison on new summary

@Pchelolo https://gerrit.wikimedia.org/r/370694 is the patch to
@bearND when running the 1000 item comparison could you be sure to make sure https://gerrit.wikimedia.org/r/375056 is merged first?

Step 0 - merge the patch and deploy MCS at least in beta so that I could run tests while making an RB PR

bearND added a comment.Sep 1 2017, 1:49 AM

@bearND when running the 1000 item comparison could you be sure to make sure https://gerrit.wikimedia.org/r/375056 is merged first?

@Jdlrobson Sure. If not merged yet I can also create the commit for the new large test for summaries to come after your patches, so that whatever you have there is covered, too. Of course, getting them merged first is better.

@Pchelolo @bearND everything appears to be merged. Over to you do comparisons of two endpoints and get this up on beta cluster. let me know if I need to make any further adjustments!

bearND added a comment.Sep 5 2017, 8:58 PM

@Jdlrobson I'm going to push a couple of changes soon.

  1. I think we should get rid of the -html in the name since the response is now a JSON object. Why no call it just summary?
  2. Update Content-Type to application/json.
  3. Add spec.yaml entries.

Question for you are you (or we) planning to also supply the other metadata as part of the MCS response later? Right now I only see type, revision, extract, and extract_html.

bearND added a comment.Sep 5 2017, 9:21 PM

@Jdlrobson One more question. In the code I only see one type of standard. Shouldn't there be more than that? I thought there was a different one for disambig pages.

@bearND correct. Just standard for the time being. The rest will come later as they need more discussion and focus. However I'm confident the endpoint as it stands will be far superior than current.

Am fine with changing the name.

Change 376137 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Rename endpoint preview-html to summary

https://gerrit.wikimedia.org/r/376137

Change 376137 merged by Ppchelko:
[mediawiki/services/mobileapps@master] Rename endpoint preview-html to summary

https://gerrit.wikimedia.org/r/376137

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)

Nirzar removed a subscriber: Nirzar.Sep 6 2017, 10:13 PM

Mentioned in SAL (#wikimedia-operations) [2017-09-07T07:39:16Z] <mobrovac@tin> Started deploy [restbase/deploy@b51c58c]: Use MCS for the summary endpoint - T168848

Mentioned in SAL (#wikimedia-operations) [2017-09-07T07:45:52Z] <mobrovac@tin> Finished deploy [restbase/deploy@b51c58c]: Use MCS for the summary endpoint - T168848 (duration: 06m 36s)

Kaartic removed a subscriber: Kaartic.Sep 7 2017, 8:24 AM

The RESTBase side of things has been deployed, so that it now contacts MCS for getting the info needed for the summary end point. Note, however, that due to the fact that the summary end point is being used by Page Previews (high volume), I haven't dropped the old data from storage. This means that the new format will replace the old one gradually as pages need to be re-rendered.

phuedx added a comment.Sep 7 2017, 9:27 AM

@mobrovac: When was making that switch discussed? @ovasileva: Were you aware of this?

@mobrovac: When was making that switch discussed? @ovasileva: Were you aware of this?

The formats are fully compatible (the new format only adds two more fields to the output). As for the contents of the extract and extract_html fields, I trusted you guys (Reading) made the necessary precautions to satisfy all of the requirements from all sides, so went ahead and deployed it.

What is your exact concern?

phuedx added a comment.EditedSep 7 2017, 10:09 AM

What is your exact concern?

That this:

<snip>the contents of the extract and extract_html fields, I trusted you guys (Reading) made the necessary precautions to satisfy all of the requirements from all sides, so went ahead and deployed it.</snip>

Hasn't been done by any other member of the Reading Web team AFAICT.

If there are no differences between the extract field in the new MCS endpoint and the old RESTBase endpoint, then this deploy is a NOP from the perspective of the Page Previews client as no production wikis are consuming the extract_html field yet. @Jdlrobson: Is this the case or are there differences that we should be aware of?

Per https://phabricator.wikimedia.org/T168848#3571149 I was expecting this to be done on beta cluster only and a side by side comparison of old summaries vs new summaries. We haven't done that yet to my knowledge?

^ cc @bearND @Pchelolo any idea what's happened here?

With the deploy to production of the MCS part I assumed that the tests had been carried out in Beta (seeing that you don't need RESTBase for that there in order to compare the outputs). I can undeploy this from the RB side from production if needed.

bearND added a comment.Sep 7 2017, 3:01 PM

Not sure why the RB parts were deployed already either. I did see @Pchelolo 's PR, and thought we would test it in beta cluster first. I also wanted to run some tests comparing the output of the top 1000 pages in a few wikis.

Mentioned in SAL (#wikimedia-operations) [2017-09-07T15:25:56Z] <mobrovac@tin> Started deploy [restbase/deploy@77961d0]: Revert using MCS for the summary end point - T168848

Mentioned in SAL (#wikimedia-operations) [2017-09-07T15:32:14Z] <mobrovac@tin> Finished deploy [restbase/deploy@77961d0]: Revert using MCS for the summary end point - T168848 (duration: 06m 18s)

RB PR has been undeployed.

Jdlrobson closed this task as Resolved.
Jdlrobson claimed this task.

Thanks! Im not sure if a deploy only on beta cluster is useful now - we've shown this can be done when needed.
I'd suggest next step is for @bearND to do the side by side comparison: T175286