Page MenuHomePhabricator

Update RESTBase to get summary content from MCS Summary 1.3 endpoint when development is complete
Closed, ResolvedPublic

Description

When work is done on the MCS 1.3.0 endpoint in MCS, RESTBase will need to be updated to get summary endpoint content from MCS.
(While we informally called this 2.0 before it's actually 1.3.0 since it's backwards compatible to 1.2.0)

Plan

Phase 1 / Day 1

  • MCS to deploy remaining changes for summary implementation.
  • RB to remove the ensure_content_type filter, so that the summary only gets re-rendered right away when a page gets edited or purged.
  • RB to start the switch-over for summaries, again mainly triggered by new edits on the page.

Phase 2 (1-2 days)

  • Reading to verify that the new content looks good
  • No issues found by Services or Reading-Infrastructure team.

Phase 3 (2 days)

  • RB to start dump for cswiki, which would make sure all summaries are re-rendered with the latest version of MCS
  • RB to start dump for top five WP projects. RB can monitor SCB load and control that roll-out.

Phase 4

  • RB to reapply the ensure_content_type filter, so that reads of summaries on all wikis are ensured they get the latest version.

Contingency plan

In case we need to roll back we could set the expected content type version to 1.2.0 again and depending on the severity reapply the ensure_content_type filter.

What else?

We want to get good confidence about {T184557}.
We hope to also get T184753: Use stored page leads when creating page summaries to reduce MCS load and T184751: Benchmark the new page summary API in.

Timing

We started on Tuesday Feb 20, 2018.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Hey @bearND, @Fjalapeno and I were discussing whether it's worth switching this over just to the Beta Cluster for a while to give clients time to test, or if it should go straight to production as with our usual deployments. I'm on the fence; the changes are fairly significant, but on the other hand, we've kept the output backward compatible, and if clients are sending an accept header with a content format version (as they should be) then they can switch over whenever they're ready.

What do you think?

In general I support pushing this out to beta cluster. I think this is mostly a question for the Services team. I'm not sure if there would be a lot of testing done on beta cluster, though.

My understanding was that we would not switch summary implementations based on content format version since we deem the new one backwards-compatible. Or are we?

mobrovac subscribed.

In general I support pushing this out to beta cluster. I think this is mostly a question for the Services team. I'm not sure if there would be a lot of testing done on beta cluster, though.

I agree that testing in BetaCluster will not get you much. However, I'd prefer tests to be conducted there before we turn it on in production, both for native apps and page previews.

My understanding was that we would not switch summary implementations based on content format version since we deem the new one backwards-compatible. Or are we?

I think we should bump the content version, but in a semver-backwards-compatible way. We are serving now v1.2.0, so bumping it to v1.3.0 would do the trick.

What is your time-line for rolling this out? I'm thinking that it would be smart to roll it out to BC, keep it there for 3, 4 days to give you time to test to the extent possible and then move to production.

What is your time-line for rolling this out? I'm thinking that it would be smart to roll it out to BC, keep it there for 3, 4 days to give you time to test to the extent possible and then move to production.

We had hoped to roll out by the Thanksgiving holiday, but part of the output depends on a siteinfo addition that was to roll out with wmf.8 (which I believe has now been rolled back twice from the group 2 wikis).

The part we're waiting on isn't essential, so we could do the BC switchover today anyway if that would be useful to the product teams. I guess it depends on how much testing they realistically expect to get done over the holiday break. Seeing as @Fjalapeno and @bearND are both out, I'll make a call and say it's probably fine to wait until next week, too.

A few of reading are working over Thanksgiving so it might be useful to be able to test on the beta cluster as early as today. @ovasileva would be the right person to confirm with about that.

Change 392686 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Change summary format version to 1.3.0

https://gerrit.wikimedia.org/r/392686

Change 392686 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Change summary format version to 1.3.0

https://gerrit.wikimedia.org/r/392686

OK, updated the summary content format version to 1.3.0 and deployed to BC and production. @mobrovac Sorry for the late notice, but is it too late to make the switchover on the Beta Cluster this week?

Update: wmf.8 is live, so I deployed the last piece, and the output should be as shown on T177431.

OK, updated the summary content format version to 1.3.0 and deployed to BC and production. @mobrovac Sorry for the late notice, but is it too late to make the switchover on the Beta Cluster this week?

Unfortunately, yes, because we yet have to make the change in RESTBase's code base and, more importantly, today is this week's last working day, which also makes me think that the switch could not be properly tested by all involved parties any way. We will have to schedule this for some time next week.

Change 395731 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Use MCS for summary in Beta

https://gerrit.wikimedia.org/r/395731

Change 395731 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Use MCS for summary in Beta

https://gerrit.wikimedia.org/r/395731

RESTBase in beta uses MCS for generating the summaries as of now. Please test extensively and then we can lay down a plan for deployment to production.

Change 397824 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Use MCS for summaries in all environments

https://gerrit.wikimedia.org/r/397824

Change 397824 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Use MCS for summaries in all environments

https://gerrit.wikimedia.org/r/397824

This has been reverted due to problems during the deploy. There are still on-going issues reagrding MCS<->Parsoid compatibility, so until these are resolved, we cannot move forward with this.

Mholloway changed the task status from Open to Stalled.Dec 13 2017, 11:36 PM
Mholloway triaged this task as High priority.
mobrovac changed the task status from Stalled to Open.Jan 8 2018, 12:29 PM
mobrovac claimed this task.

According to @bearND and @Mholloway we should be ready to do the switch this week. To that end, PR #938 rewrites the URI protocol of links to https.

Change 402818 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Define the protocol for links contained in page summary

https://gerrit.wikimedia.org/r/402818

Change 402818 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Define the protocol for links contained in page summary

https://gerrit.wikimedia.org/r/402818

Change 402825 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Switch summary to use MCS in all environments

https://gerrit.wikimedia.org/r/402825

@mobrovac: Could you give me a heads up when you're going to deploy that change ^? Thanks!

Repeating my last comment (read in a friendly – not obnoxious! – voice):

@mobrovac, @bearND: Could you give me a heads up when you're going to deploy that change ^? Thanks!

@phuedx @Jdlrobson sorry it's taken so long. I'm going to discuss the schedule with @Pchelolo in a bit.

No that's me who should be blamed it's taken so long, we've had an offsite and didn't do any real work.

bearND renamed this task from Update RESTBase to get summary content from MCS Summary 2.0 endpoint when development is complete to Update RESTBase to get summary content from MCS Summary 1.3 endpoint when development is complete.Feb 1 2018, 7:07 PM
bearND updated the task description. (Show Details)

@phuedx @Jdlrobson Ok, @Pchelolo discussed this with me and I've added our plans for the summary switch-over to the task description.
We probably start on Monday Feb 05, 2018 but roll this out gradually, so it'll take some time.

Edit: moved the meat of the info of this comment to the task description.

I think we might want to delay this. We'll have to make some changes to mobile-sections we need to deploy on Monday for the Android team (T185427). Maybe move it to Tuesday or Wednesday?

I think we might want to delay this. We'll have to make some changes to mobile-sections we need to deploy on Monday for the Android team (T185427). Maybe move it to Tuesday or Wednesday?

Tuesday/Wednesday WFM but whatever you do, don't rush it! Let's get it done cleanly.

Sorry, we found another issue, see new subtask, which delayed the switchover to Monday morning.

Due to T186933 we should postpone the summary switchover until a proper solution has been agreed and deployed.

Update: since the main guys for this in the Services team are out today and tomorrow the earliest would be Tuesday, Feb 20.

I've ran another compare run between the old 1.2.0 (TextExtracts) version and the latest MCS summary change b0be98c. Here are the results. You can also download a zip file (78MB) with all the results.

I have not found any blockers so we plan to do this today starting at 17:30 UTC.

Change 402825 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Config: Switch summary to use MCS in all environments

https://gerrit.wikimedia.org/r/402825

Mentioned in SAL (#wikimedia-operations) [2018-02-20T17:47:14Z] <ppchelko@tin> Started deploy [restbase/deploy@dca0290]: Switch summary implementation to MCS T179875

Mentioned in SAL (#wikimedia-operations) [2018-02-20T18:03:14Z] <ppchelko@tin> Finished deploy [restbase/deploy@dca0290]: Switch summary implementation to MCS T179875 (duration: 16m 01s)

Mentioned in SAL (#wikimedia-operations) [2018-02-20T18:43:24Z] <ppchelko@tin> Started deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875

Mentioned in SAL (#wikimedia-operations) [2018-02-20T18:57:25Z] <ppchelko@tin> Finished deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 (duration: 14m 02s)

Mentioned in SAL (#wikimedia-operations) [2018-02-20T18:57:57Z] <ppchelko@tin> Started deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 take 2

Mentioned in SAL (#wikimedia-operations) [2018-02-20T19:00:43Z] <ppchelko@tin> Finished deploy [restbase/deploy@e9bef90]: Do not return the response for summaery right away, store first T179875 take 2 (duration: 02m 47s)

I guess we're in an unplanned, super-slow-rollout mode right now. While it works for manual purges, new pages showing up in recent changes don't get updated to 1.3.0 yet.

I just purged one page manually and see the updates: https://da.wikipedia.org/api/rest_v1/page/summary/Respiration

curl -sI "https://da.wikipedia.org/api/rest_v1/page/summary/Respiration" | grep Summary
content-type: application/json; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/Summary/1.3.0"

@Pchelolo suspects that something's wrong with ChangeProp for summary.

Looks like the backlog of summary rerenders is worked off now, and I can see pages from the Recent Changes list getting the new MCS summary treatment automatically.

summary-definition-rerender-resource-change_delay in https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=10&fullscreen&orgId=1&refresh=1m&from=now-12h&to=now was nearly at 4 days several hours ago, now measures in milliseconds. Not sure where it came from or why it was so high. Looks like it worked the backlog off rather quickly.

Screen Shot 2018-02-20 at 10.00.38 PM.png (853×1 px, 155 KB)

The big 5 wikis have been fully dumped for the latest version of the summary end point (v.1.3.4) and we have activated the content type filter in RESTBase which will force a re-render for all smaller wikis, so I think we can declare victory on this one \o/. Thank you all for bringing this into production :)