Page MenuHomePhabricator

Use cached page leads when creating page summaries to reduce MCS load
Closed, DeclinedPublic

Description

The summary endpoint works by creating a page lead before running additional summary-specific transformations on it.

To reduce production load, we should use existing, cached page leads rather than creating them anew on each summary request.

Outstanding issues

  • RESTBase does not expose tid as a path variable in the mobile-sections endpoints. If we move to using cached mobile-sections-lead responses from RESTBase we'll need to agree either on adding tid support in RESTBase for mobile-sections or updating the MCS summary endpoint not to use or advertise tid.
  • T187098: Extra span at beginning of mobile-sections-lead (probably caused at least in some cases by our own lead paragraph shifting transform)

Event Timeline

Mholloway removed Mholloway as the assignee of this task.

I think for this one we need some input from the Services team, to tell us which ChangeProp events we should emit, and how to do that. IIRC @mobrovac said mobile-sections would need to emit an event to ChangeProp, so we can have an event chain:
/page/html -> /page/mobile-sections -> /page/summary

@bearND You shouldn't emit any ChangeProp events in MCS, it will be done in RESTBase, let's try to avoid making MCS contact EventBus.

Indeed, we will need to make changes to ChangeProp to achieve this. Currently, we assume that summary and mobile-sections content is not interdependent, so we rerender them in parallel. If you start using mobile-sections-lead in summary it will become dependent.

Luckily we already emit events in RESTBase when mobile-sections are changed in order to purge caches, so changes to RESTBase are not needed.

PR for our side of things coming soon. We can deploy it independently from MCS change, but prior to MCS deployment.

Pchelolo edited projects, added Services (doing); removed Services.Jan 19 2018, 8:35 PM

While working on this I've noticed that we only prerender MCS content for wikipedia domains, not for other projects, cause the apps only use it for wikipedia, however summary is pre-rendered for all of the projects.

@bearND how are we going to accommodate this? Does that mean that we need to pregenerate MCS content for all projects now? Or we should have a special set of rules for wikipedias and a separate set of rules for other projects?

@ovasileva @phuedx @Jdlrobson I think eventually we need the summaries pre-rendered on all projects. But just for now, do we need to re-render the summaries for non-Wikipedia projects as well? Is Page Preview available for any of these? I've seen only the Navigation popups gadget on the other WMF wikis I tried, and I think that doesn't use the /page/summary endpoint.

@Pchelolo If we do need to prerender the summaries right now I think the latter option (special rules for Wikipedia) makes the most sense to me. Hopefully the separation would make things simpler to follow. I hope it's not too much maintenance overhead. Maybe like this?

  • Wikipedias: pre-render mobile-sections + summary
  • English Wiktionary: It looks like we already pre-render definitions.
  • Wikidata: I know there are plans for the summary endpoint to add some Wikidata specific behavior later.
  • Everything else: ?
Jdlrobson added a comment.EditedJan 19 2018, 11:40 PM

Page summaries previews [EDIT: sorry got confused] is only available on Wikipedia.

Page summaries are only available on Wikipedia.

That's not entirely correct, for example for wikivoyage we have quite meaningful summaries: https://en.wikivoyage.org/api/rest_v1/page/summary/San_Francisco

The question is whether somebody actually uses these or not and whether we want to use them.

Does MCS support generating summaries for non-wikipedia projects at all?

Sorry my bad. Meant "Page previews" not "Page summaries" :) Yes I'm using the summaries on a side project for Wikivoyage so they should be good and there's no reason not to provide them. Long term we'll want to enable page previews on Wikidata and Wiktionary and improve the summary generation for those projects too, but not now...

@Jdlrobson so is my understanding correct that after we switch to fetching mobile-section-lead for summaries from RESTBase we will essentially enable pregeneration for mobile content for non-wikipedia projects since summaries will be pregenerated for them and as they require mobile content it will also be pregenerated? If that's correct, we can just get rid of the special-case for wikipedias for mobile content. I don't think the additional load will be too high, vast majority of edits are done on wikipedias anyway..

Does MCS support generating summaries for non-wikipedia projects at all?

Currently, while it's still using Parsoid HTML directly, it should work. Once we change the code to use mobile-sections-lead this will require more testing of both mobile-sections-lead (if the summary also uses that for non-WP) and the summary itself.

@Pchelolo Wiktionary will still be different since it has the definitions endpoint. Wikidata might be a special case, too. (https://www.wikidata.org/api/rest_v1/page/html/Q42 doesn't look very user-friendly.)

@Jdlrobson so is my understanding correct that after we switch to fetching mobile-section-lead for summaries from RESTBase we will essentially enable pregeneration for mobile content for non-wikipedia projects since summaries will be pregenerated for them and as they require mobile content it will also be pregenerated? If that's correct, we can just get rid of the special-case for wikipedias for mobile content. I don't think the additional load will be too high, vast majority of edits are done on wikipedias anyway..

+1. Moreover, since we provide the REST API for all the other domains, they ought to be up to date.

@Pchelolo Wiktionary will still be different since it has the definitions endpoint.

Yes, and it gets regularly updated using MCS.

Wikidata might be a special case, too. (https://www.wikidata.org/api/rest_v1/page/html/Q42 doesn't look very user-friendly.)

We don't update WD end points because there is not much sense in doing so (most pages are JSONs).

Change 406058 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/deploy@master] Only rerender summary after mobile-sections were rerendered.

https://gerrit.wikimedia.org/r/406058

Mholloway claimed this task.Feb 2 2018, 3:57 PM

@ovasileva @phuedx @Jdlrobson I think eventually we need the summaries pre-rendered on all projects. But just for now, do we need to re-render the summaries for non-Wikipedia projects as well? Is Page Preview available for any of these? I've seen only the Navigation popups gadget on the other WMF wikis I tried, and I think that doesn't use the /page/summary endpoint.

Page previews should be available (in either beta mode or full deployment) on all projects but: wikibooks, wikisource, wikiversity, wikidata, wikinews, commons. (T164740: Disable page previews on projects where implementation isn't optimal)

bearND added a comment.Feb 2 2018, 5:26 PM

@ovasileva Just to make sure we're on the same page, is the following list complete or is anything missing?
Wikipedia, Wiktionary, Wikiquote, Wikispecies, Wikivoyage, MediaWiki, Meta-Wiki

Any other project in this list besides Wikipedia we want to pre-generate the summary content for?

@Pchelolo Since we have not tested the non-WP ones, we should probably move those to the new summary implementation later.

@bearND - this should be the list minus Wiktionary. The initial plan was for Wiktionary (T164739: Allow page previews to display in wiktionary) and Wikidata ({https://phabricator.wikimedia.org/T111231}) to have separate treatments (both in terms of the summaries themselves as well as the UI). The list would be: Wikipedia, Wikiquote, Wikispecies, Wikivoyage, MediaWiki, Meta-Wiki, although for all non-Wikipedia projects, we can wait until we test more thoroughly.

Mholloway added a comment.EditedFeb 7 2018, 4:30 PM

Notes on the implementation side:

There's a question here about how we want to get the revision-specific timestamp. In the current implementation we're getting that from the dc:modified meta element in the raw Parsoid HTML. Of course, that's no longer available when we're dealing with pregenerated MCS page leads.

We have a couple of options:

  • We could add a modified field to page leads with the timestamp for the requested revision. This is probably my preferred solution. Catches:
    • The page lead already has a lastmodified field that (for whatever reason) shows the modified timestamp for the most recent revision, regardless of the revision requested; adding a separate, revision-specific modified field might be a little weird for clients.
    • We'd need to regenerate all page leads with this new modified field before deploying the summary code that depends on it. (Is this going to happen soon anyway?)
  • Alternatively, we could get the revision-specific timestamp from the MediaWiki action API (by providing an rvstartid value with our metadata request).
  • Or I suppose we could do the first option with a fallback to the second while waiting for old page leads to get updated.

There's also a side issue in which we're not getting etags from RESTBase for mobile-sections-lead responses with a revision parameter specified, and so not able to provide a tid value in those cases. That's a bug and should be fixed soon. I filed subtask T186720 about it.

bearND added a comment.Feb 7 2018, 4:59 PM

Maybe we should run a quick prototype (without the timestamp and whatever we cannot easily get) to see how much of a difference it would actually make?

I've actually got two different branches going locally for this, the main one of which uses the rvstartid option. I was more blocked on the mysterious missing etag issue (which it turns out was caused by a typo in RB (see https://github.com/wikimedia/restbase/pull/950) and should be fixed after the next RB deployment).

bearND added a comment.Feb 7 2018, 5:22 PM

@Mholloway Assuming RB side gets fixed soon, are you still blocked on this?

Shouldn't be.

bearND added a comment.Feb 8 2018, 3:53 PM

@Mholloway What is it blocked on?

It's blocked on T186720 getting deployed.

Change 409112 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] WIP: Use RB-cached page lead objects for creating page summaries

https://gerrit.wikimedia.org/r/409112

Pushed the WIP patch just for feedback on the code. It's expected for tests to fail until T186720 is deployed.

One point to bear in mind is that the summary extraction code was written based on formatted-lead objects, but with this update it will use mobile-sections-lead objects since those are what RESTBase has cached (since the formatted-lead endpoint has not yet been exposed in production). Hence the selector update in extractLeadIntroduction.js. I am hoping the end results will be the same in practice, but we'll need to run compare-extracts.js to be sure.

Mholloway updated the task description. (Show Details)Feb 12 2018, 6:38 PM

Change 409112 abandoned by Mholloway:
Use RB-cached page lead objects for creating page summaries

Reason:
Just getting this out of the review queue pending discussion on the blocking issues on the ticket. To be revived if and when we're ready to move forward.

https://gerrit.wikimedia.org/r/409112

Mholloway changed the task status from Open to Stalled.Feb 21 2018, 8:03 PM
Krinkle moved this task from Backlog to Summary on the Page Content Service board.Jun 6 2018, 2:11 AM
238482n375 triaged this task as Lowest priority.Jun 15 2018, 8:04 AM
238482n375 removed Mholloway as the assignee of this task.
238482n375 moved this task from Next Up to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDIwMTgtMDYtMTQtMTEpCk9wZW4sIE5lZWRzIFRyaWFnZVB1YmxpYwoKICAgIEVkaXQgVGFzawogICAgRWRpdCBSZWxhdGVkIFRhc2tzLi4uCiAgICBFZGl0IFJlbGF0ZWQgT2JqZWN0cy4uLgogICAgUHJvdGVjdCBhcyBzZWN1cml0eSBpc3N1ZQoKICAgIE11dGUgTm90aWZpY2F0aW9ucwogICAgQXdhcmQgVG9rZW4KICAgIEZsYWcgRm9yIExhdGVyCgpUYWdzCgogICAgQW5hbHl0aWNzLUthbmJhbiAoSW4gUHJvZ3Jlc3MpCgpTdWJzY3JpYmVycwpBa2xhcHBlciwgSkFsbGVtYW5kb3UKQXNzaWduZWQgVG8KSkFsbGVtYW5kb3UKQXV0aG9yZWQgQnkKSkFsbGVtYW5kb3UsIEZyaSwgSnVuIDE1CkRlc2NyaXB0aW9uCgpPb3ppZSBqb2JzIGhhdmUgYmVlbiBmYWlsaW5nIGF0IGxlYXN0IGEgZmV3IHRpbWVzIGVhY2guIE1vcmUgaW52ZXN0aWdhdGlvbiBuZWVkZWQuCkpBbGxlbWFuZG91IGNyZWF0ZWQgdGhpcyB0YXNrLkZyaSwgSnVuIDE1LCA3OjIxIEFNCkhlcmFsZCBhZGRlZCBhIHN1YnNjcmliZXI6IEFrbGFwcGVyLiC3IFZpZXcgSGVyYWxkIFRyYW5zY3JpcHRGcmksIEp1biAxNSwgNzoyMSBBTQpKQWxsZW1hbmRvdSBjbGFpbWVkIHRoaXMgdGFzay5GcmksIEp1biAxNSwgNzoyMiBBTQpKQWxsZW1hbmRvdSB1cGRhdGVkIHRoZSB0YXNrIGRlc2NyaXB0aW9uLiAoU2hvdyBEZXRhaWxzKQpKQWxsZW1hbmRvdSBhZGRlZCBhIHByb2plY3Q6IEFuYWx5dGljcy1LYW5iYW4uCkpBbGxlbWFuZG91IG1vdmVkIHRoaXMgdGFzayBmcm9tIE5leHQgVXAgdG8gSW4gUHJvZ3Jlc3Mgb24gdGhlIEFuYWx5dGljcy1LYW5iYW4gYm9hcmQuCkNoYW5nZSBTdWJzY3JpYmVycwpDaGFuZ2UgUHJpb3JpdHkKQXNzaWduIC8gQ2xhaW0KTW92ZSBvbiBXb3JrYm9hcmQKQ2hhbmdlIFByb2plY3QgVGFncwpBbmFseXRpY3MtS2FuYmFuCtcKU2VjdXJpdHkK1wpXaWtpbWVkaWEtVkUtQ2FtcGFpZ25zIChTMi0yMDE4KQrXClNjYXAK1wpTY2FwIChTY2FwMy1BZG9wdGlvbi1QaGFzZTIpCtcKQWJ1c2VGaWx0ZXIK1wpEYXRhLXJlbGVhc2UK1wpIYXNodGFncwrXCkxhYnNEQi1BdWRpdG9yCtcKTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2kK1wpMYW5ndWFnZS0yMDE4LUFwci1KdW5lCtcKTGFuZ3VhZ2UtMjAxOC1KYW4tTWFyCtcKSEhWTQrXCkhBV2VsY29tZQrXCkJvbGQKSXRhbGljcwpNb25vc3BhY2VkCkxpbmsKQnVsbGV0ZWQgTGlzdApOdW1iZXJlZCBMaXN0CkNvZGUgQmxvY2sKUXVvdGUKVGFibGUKVXBsb2FkIEZpbGUKTWVtZQpQcmV2aWV3CkhlbHAKRnVsbHNjcmVlbiBNb2RlClBpbiBGb3JtIE9uIFNjcmVlbgoyMzg0ODJuMzc1IGFkZGVkIHByb2plY3RzOiBTZWN1cml0eSwgV2lraW1lZGlhLVZFLUNhbXBhaWducyAoUzItMjAxOCksIFNjYXAgKFNjYXAzLUFkb3B0aW9uLVBoYXNlMiksIEFidXNlRmlsdGVyLCBEYXRhLXJlbGVhc2UsIEhhc2h0YWdzLCBMYWJzREItQXVkaXRvciwgTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2ksIExhbmd1YWdlLTIwMTgtQXByLUp1bmUsIExhbmd1YWdlLTIwMTgtSmFuLU1hciwgSEhWTSwgSEFXZWxjb21lLlBSRVZJRVcKMjM4NDgybjM3NSBtb3ZlZCB0aGlzIHRhc2sgZnJvbSBJbiBQcm9ncmVzcyB0byBJbiBDb2RlIFJldmlldyBvbiB0aGUgQW5hbHl0aWNzLUthbmJhbiBib2FyZC4KMjM4NDgybjM3NSByZW1vdmVkIEpBbGxlbWFuZG91IGFzIHRoZSBhc3NpZ25lZSBvZiB0aGlzIHRhc2suCjIzODQ4Mm4zNzUgdHJpYWdlZCB0aGlzIHRhc2sgYXMgTG93ZXN0IHByaW9yaXR5LgoyMzg0ODJuMzc1IHJlbW92ZWQgc3Vic2NyaWJlcnM6IEFrbGFwcGVyLCBKQWxsZW1hbmRvdS4KQ29udGVudCBsaWNlbnNlZCB1bmRlciBDcmVhdGl2ZSBDb21tb25zIEF0dHJpYnV0aW9uLVNoYXJlQWxpa2UgMy4wIChDQy1CWS1TQSkgdW5sZXNzIG90aGVyd2lzZSBub3RlZDsgY29kZSBsaWNlbnNlZCB1bmRlciBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSAoR1BMKSBvciBvdGhlciBvcGVuIHNvdXJjZSBsaWNlbnNlcy4gQnkgdXNpbmcgdGhpcyBzaXRlLCB5b3UgYWdyZWUgdG8gdGhlIFRlcm1zIG9mIFVzZSwgUHJpdmFjeSBQb2xpY3ksIGFuZCBDb2RlIG9mIENvbmR1Y3QuILcgV2lraW1lZGlhIEZvdW5kYXRpb24gtyBQcml2YWN5IFBvbGljeSC3IENvZGUgb2YgQ29uZHVjdCC3IFRlcm1zIG9mIFVzZSC3IERpc2NsYWltZXIgtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu

Aklapper raised the priority of this task from Lowest to Needs Triage.Jun 15 2018, 11:32 AM
Aklapper assigned this task to Mholloway.
Mholloway removed Mholloway as the assignee of this task.Jun 19 2018, 12:01 AM
Jhernandez triaged this task as Low priority.Jul 6 2018, 11:12 AM
Jhernandez moved this task from Upcoming to Backlog on the Reading-Infrastructure-Team-Backlog board.
Jhernandez added a subscriber: Jhernandez.

@Mholloway @bearND Should we pursue this? It's been months and it seems we are doing fine without it?

What is left for it? Services has it on "blocked" but I'm not sure what is missing.

The issue is that the page leads cached in mobile-sections-lead get different preprocessing than the page leads used when creating page summaries. IIRC updating summaries to use the cached page leads would take a substantial amount of rework. It's a good idea in principle, but I'm sorry to say that it probably best reflects reality to decline this.

bearND added a comment.Jul 6 2018, 2:40 PM

The other reason is that we don't want to create an unnecessary dependency to the mobile-sections endpoint if it's going to be deprecated in favor of the new PCS endpoints.

Jhernandez closed this task as Declined.Jul 7 2018, 12:19 PM

Cool, I'm declining it then. We can revisit if it is an issue in the future.