PCS should use parsoid endpoints in MediaWiki, not RESTbase
Closed, Resolved, Public

Assigned To

Authored By

	daniel
	Jun 19 2023, 4:20 PM

Description

PCS should call the /page/{title}/html endpoints in MediaWiki (through the service mesh), instead of calling the /page/html/{title} endpoints on RESTbase.
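A minimal sketch of the switch described above, assuming a boolean feature flag selects the upstream. The base URLs, the flag name and the helper name are hypothetical; only the two path shapes come from this task.

```python
from urllib.parse import quote

# Hypothetical base URLs; the real service-mesh addresses are not given in this task.
RESTBASE_BASE = "http://restbase.example/en.wikipedia.org/v1"
MW_CORE_BASE = "http://mw-api-int.example/w/rest.php/v1"


def page_html_url(title: str, use_core_parsoid: bool) -> str:
    """Return the upstream URL PCS would request Parsoid HTML from."""
    # Titles use underscores for spaces; percent-encode everything else,
    # including "/", so subpage titles stay a single path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    if use_core_parsoid:
        # MediaWiki core shape: /page/{title}/html
        return f"{MW_CORE_BASE}/page/{encoded}/html"
    # RESTBase shape: /page/html/{title}
    return f"{RESTBASE_BASE}/page/html/{encoded}"
```

Keeping the choice behind one flag means a rollback is a config change rather than a code change.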

Roll-out strategy

Staging for sanity tests
Execute sanity tests and test the new routed endpoint
Eqiad and Codfw cluster in production

Open Questions

How does that affect ParserCache load? Is this a blocker?

Details

Subject	Repo	Branch	Lines +/-
mobileapps: Switchover outgoing parsoid traffic	operations/deployment-charts	master	+2 -4
mw-api-int: Increase replicas to 240 total	operations/deployment-charts	master	+2 -2
Fix feature flag for outgoing parsoid traffic	mediawiki/services/mobileapps	master	+18 -7
mobileapps: Switchover PCS to core page HTML	operations/deployment-charts	master	+1 -2
mobileapps: Fix MW core request template name	operations/deployment-charts	master	+2 -2
mobileapps: Enable trace logs for debugging	operations/deployment-charts	master	+7 -2
mobileapps: Add missing template for MW parsoid reqs	operations/deployment-charts	master	+7 -1
mobileapps: Use core /page/html output in all envs	operations/deployment-charts	master	+1 -2
Core page html: Add content-language headers	mediawiki/services/mobileapps	master	+28 -2
mobileapps: Configure core page html req template	operations/deployment-charts	master	+6 -0
mobileapps: Use core page html on staging	operations/deployment-charts	master	+1 -1
mobileapps: Add core parsoid HTML support config	operations/deployment-charts	master	+5 -1

Related Objects

Status	Subtype	Assigned	Task
Stalled		None	T324931 Clean up open RESTBase related tickets
In Progress		None	T262315 <CORE TECHNOLOGY> API Migration & RESTbase Sunset
Open		None	T344944 Move Parsoid endpoints out of RESTbase
In Progress		None	T328559 Replace usage of RESTbase parsoid endpoints
Resolved		Jgiannelos	T339865 PCS should use parsoid endpoints in MediaWiki, not RESTbase
Resolved	PRODUCTION ERROR	cscott	T356368 Revision endpoint: InvalidArgumentException: ParserOutput does not have a render ID
Resolved	PRODUCTION ERROR	Jgiannelos	T356369 Missing content-language header on PCS responses
Resolved		Clement_Goubert	T356497 Raise mw-api-int replicas for increased load from mobileapps

Event Timeline

daniel created this task. Jun 19 2023, 4:20 PM

Restricted Application added a project: Product-Infrastructure-Team-Backlog-Deprecated. Jun 19 2023, 4:20 PM

daniel updated the task description. Jun 19 2023, 4:21 PM

daniel mentioned this in T339867: RESTbase: Turn off pre-generation and caching for parsoid endpoints. Jun 19 2023, 4:29 PM

The functionality is already available and requires a feature flag to be enabled in deployment-charts, how should we proceed with the switchover? cc/ @daniel

In T339865#8956035, @MSantos wrote:

The functionality is already available and requires a feature flag to be enabled in deployment-charts, how should we proceed with the switchover? cc/ @daniel

@Jgiannelos IIRC you had some concerns about just doing it... What's missing?

Title resolution isn't an issue (yet) because that would have already happened in RESTbase before hitting PCS, right?

Once we route to PCS directly, redirects coming from MediaWiki will have to be properly processed though.

Jgiannelos claimed this task. Jul 14 2023, 9:17 AM

Jgiannelos moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Change 939292 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Add core parsoid HTML support config

https://gerrit.wikimedia.org/r/939292

gerritbot added a project: Patch-For-Review. Jul 18 2023, 12:26 PM

Jgiannelos moved this task from In Progress to Code Review on the Content-Transform-Team-WIP board. Jul 18 2023, 12:26 PM

Change 939292 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Add core parsoid HTML support config

https://gerrit.wikimedia.org/r/939292

Maintenance_bot removed a project: Patch-For-Review. Jul 31 2023, 12:10 PM

MSantos moved this task from Code Review to To Deploy on the Content-Transform-Team-WIP board. Jul 31 2023, 3:14 PM

MSantos moved this task from Unsorted to PCS Service Pile on the RESTBase Sunsetting board. Aug 18 2023, 3:11 PM

MSantos triaged this task as Medium priority. Aug 21 2023, 3:51 PM

MSantos updated the task description. Aug 21 2023, 3:54 PM

MSantos updated the task description.

MSantos moved this task from To Deploy to Blocked on the Content-Transform-Team-WIP board. Oct 10 2023, 2:45 PM

Jgiannelos moved this task from Blocked to In Progress on the Content-Transform-Team-WIP board. Jan 12 2024, 2:03 PM

Change 991787 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Use core page html on staging

https://gerrit.wikimedia.org/r/991787

gerritbot added a project: Patch-For-Review. Jan 19 2024, 3:00 PM

Change 991787 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Use core page html on staging

https://gerrit.wikimedia.org/r/991787

Change 992130 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Configure core page html req template

https://gerrit.wikimedia.org/r/992130

Change 992130 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Configure core page html req template

https://gerrit.wikimedia.org/r/992130

Maintenance_bot removed a project: Patch-For-Review. Jan 22 2024, 12:31 PM

Jgiannelos moved this task from In Progress to Code Review on the Content-Transform-Team-WIP board. Jan 22 2024, 2:43 PM

Jgiannelos updated the task description. Jan 23 2024, 11:02 AM

Change 992412 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Core page html: Add content-language headers

https://gerrit.wikimedia.org/r/992412

gerritbot added a project: Patch-For-Review. Jan 23 2024, 1:20 PM

Change 992412 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Core page html: Add content-language headers

https://gerrit.wikimedia.org/r/992412

@Ladsgroup As part of double checking things before switching over outgoing traffic of PCS from RESTBase (/page/html) to MW (rest.php/v1/page/<article>/with_html) we discussed the topic of ParserCache capacity.
Do you have any concerns about potential problems we might cause by putting the PCS traffic load on ParserCache instead of having RESTBase do the heavy lifting with Cassandra as storage?

Hi, I need numbers and estimates to tell you whether it'd work or not. We increased its capacity recently so it should be easier now but I still need numbers!

Maintenance_bot removed a project: Patch-For-Review. Jan 23 2024, 3:31 PM

Which numbers/metrics would be useful to prepare to evaluate if things are gonna work with ParserCache?

How many new entries will be added to PC (daily or in total) and how many reads will be done (I hope it's behind a WANcache, the general parsing for read is behind it). That's it.

In T339865#9481631, @Ladsgroup wrote:

How many new entries will be added to PC (daily or in total) and how many reads will be done (I hope it's behind a WANcache, the general parsing for read is behind it). That's it.

We are already doing active pre-generation on all changes to keep the parsoid cache in restbase updated. The requests from PCS will hit the same PC entries. So there should be no additional writes.

Thanks @daniel
In terms of more read traffic, it will increase by the same amount of read traffic parsoid on restbase is currently doing because of pregeneration.

As long as reads are cached by WAN, I think it should be fine. Just give a heads up before deploy so we could connect the dots easily in case something happens.

Change 992975 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Use core /page/html output in all envs

https://gerrit.wikimedia.org/r/992975

gerritbot added a project: Patch-For-Review. Jan 25 2024, 4:26 PM

MSantos moved this task from Code Review to To Deploy on the Content-Transform-Team-WIP board. Jan 29 2024, 4:12 PM

After running difftesting between staging and prod with a sample of ~40k requests, here are the findings:

The vast majority of inconsistencies are minor (eg. timestamp on head->meta)
p95 of all testcases had less than 2 lines of different content (mostly metadata)
p99 of all testcases had less than 8 lines of different content
p999 of all testcases had less than 55 lines of different content

I assume that some of the failures for the percentiles greater than p95 could be transient.
I am rerunning the testcases from diff > p95 to see how many of those were transient or actual issues.

Overall, I think I am confident about switching over the traffic in terms of compatibility.
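The tool used for the difftesting is not described in this task. As one possible way to produce numbers like the percentiles above, here is a sketch that counts differing lines per response pair and takes nearest-rank percentiles. Both function names are hypothetical.

```python
import difflib
import math


def changed_line_count(old: str, new: str) -> int:
    """Count lines that differ between two responses (added or removed)."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # "? " hint lines and unchanged lines are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def percentile(values: list[int], p: float) -> int:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With one `changed_line_count` result per test case, `percentile(counts, 95)` gives the kind of figure quoted in the findings. Note that a single modified line counts twice here (one removal, one addition).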

Failures re-run:

It looks like the numbers are roughly the same so not many transient failures.
After looking at the diffs most of them are improvements to the output. Trying a few page purges also fixed things. It looks like RESTbase had stale content.

After purging failures from previous runs and re-running the tests it looks like the root cause was stale restbase content and now diffs are minimal. I think we are good to switchover traffic from RESTBase to MW cc @Ladsgroup

Thanks. cc @Marostegui this impacts PC, a little bit more reads there but we should be fine

Change 994177 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Switchover PCS to core page HTML

https://gerrit.wikimedia.org/r/994177

Change 992975 abandoned by Jgiannelos:

[operations/deployment-charts@master] mobileapps: Use core /page/html output in all envs

Reason:

https://gerrit.wikimedia.org/r/992975

Change 994199 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Add missing template for MW parsoid reqs

https://gerrit.wikimedia.org/r/994199

Change 994199 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Add missing template for MW parsoid reqs

https://gerrit.wikimedia.org/r/994199

While testing page/summary I am getting timeouts on staging.

Change 994209 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Enable trace logs for debugging

https://gerrit.wikimedia.org/r/994209

Change 994209 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Enable trace logs for debugging

https://gerrit.wikimedia.org/r/994209

Change 994215 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Fix MW core request template name

https://gerrit.wikimedia.org/r/994215

Change 994215 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Fix MW core request template name

https://gerrit.wikimedia.org/r/994215

After running the same ~40k test requests on page/summary endpoints diffing output between staging and prod things look OK.

We did have some mismatches on metadata which is expected
Out of 40k reqs only ~10 had mismatches in content

A few more tests:

routes related to static assets (css/js/i18n) respond properly (also not affected by parsoid)
wikitext to mobile-html is not affected by parsoid
page/talk responses also look OK
page/media-list responses look OK

Change 994177 merged by Jgiannelos:

[operations/deployment-charts@master] mobileapps: Switchover PCS to core page HTML

https://gerrit.wikimedia.org/r/994177

This is now in production. In terms of error rate I don't see any increase in the metrics. We do have a severe increase in latency:
codfw:
https://grafana.wikimedia.org/goto/EuQQlbtSz?orgId=1
https://grafana.wikimedia.org/goto/f3FXlxtIk?orgId=1

eqiad:
https://grafana.wikimedia.org/goto/tTb9lbtSz?orgId=1
https://grafana.wikimedia.org/goto/JvV9_xpIz?orgId=1

Looking at the downstream/upstream latency changes, they correlate.

There is an increase in error rate on restbase level. Here are the logs:
https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.05?id=GHA0YI0BRtLP5wy6SsI8
https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.05?id=Lss0YI0BySCoT0gdrxsd

Reverted after: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/994712

Logs from the 2 spikes in errors:

In T339865#9502550, @Jgiannelos wrote:

This is now in production. In terms of error rate I don't see any increase in the metrics. We do have a severe increase in latency:
codfw:
https://grafana.wikimedia.org/goto/EuQQlbtSz?orgId=1
https://grafana.wikimedia.org/goto/f3FXlxtIk?orgId=1

eqiad:
https://grafana.wikimedia.org/goto/tTb9lbtSz?orgId=1
https://grafana.wikimedia.org/goto/JvV9_xpIz?orgId=1

Looking at the downstream/upstream latency changes, they correlate.

Some of that latency is probably due to mw-api-int being saturated: RPS almost doubled and PHP worker saturation shot up.
https://grafana.wikimedia.org/goto/XWvfk-pIk?orgId=1

We'll need to add some more replicas before the next try.
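A rough sizing sketch for the replica increase, assuming worker load scales linearly with request rate. The function name and all inputs are illustrative; none of the numbers used with it come from this task or from production metrics.

```python
import math


def replicas_needed(current_replicas: int, current_utilization: float,
                    rps_multiplier: float, target_utilization: float) -> int:
    """Estimate the replicas required to hold worker utilization at a target.

    Assumes every replica has the same capacity and that busy workers
    grow in proportion to the request rate (a simplification).
    """
    busy_replica_equivalents = current_replicas * current_utilization * rps_multiplier
    return math.ceil(busy_replica_equivalents / target_utilization)
```

For example, with made-up inputs, `replicas_needed(100, 0.4, 2.0, 0.5)` says a doubling of RPS on a pool that is 40% busy needs 160 replicas to stay at 50% utilization.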

I did some investigation on the etag compatibility between before/after the switchover and here is how PCS works:

The etags we use have this format: <page revision>/<tid>

If not page related we just use some sort of hashing:
- eg. for CSS the etag is <css_hash>/<timestamp now in tid format>
Most of the etag set operations on PCS use only the revision of the MW resource which is compatible before and after the switchover
The ones that also use the uuid of parsoid output which is only stable on RESTBase are:
- /page/media-list
- /page/talk
- /metadata
  - I don't think it's exposed in RESTBase
- mobile-sections
  - Only exposed to kiwix
  - Soon to be decommissioned

Overall with summary and mobile-html being the vast majority of the requests the etag incompatibilities should not create inconsistencies.
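To illustrate why revision-only ETags stay compatible, here is a sketch of building and comparing ETags in the `<page revision>/<tid>` format. The helper names are hypothetical, and the quoting and weak `W/` prefix handling follow general HTTP conventions rather than anything confirmed about PCS.

```python
def make_etag(revision: int, tid: str) -> str:
    """Build an ETag in the '<page revision>/<tid>' format (quoted, per HTTP)."""
    return f'"{revision}/{tid}"'


def parse_etag(etag: str) -> tuple[int, str]:
    """Split an ETag into (revision, tid), tolerating a weak 'W/' prefix and quotes."""
    value = etag.strip()
    if value.startswith("W/"):
        value = value[2:]
    value = value.strip('"')
    revision, _, tid = value.partition("/")
    return int(revision), tid


def same_revision(etag_a: str, etag_b: str) -> bool:
    """Compare only the revision part, ignoring the tid."""
    return parse_etag(etag_a)[0] == parse_etag(etag_b)[0]
```

A consumer that compares only the revision part sees the same result whether the tid came from RESTBase or from MediaWiki. One that compares the full string would treat the two as different.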

For example from turnilo over the last 7 days (in descending order):

Total requests
- summary: 35.6m
- mobile-html: 0.6m
- mobile-sections: 416k
- talk: 1.8k
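As a quick check of the "vast majority" claim, this sketch computes what share of the requests listed above go to endpoints whose ETag depends on the Parsoid tid. Only endpoints that appear in the totals are counted; /page/media-list and /metadata have no figures here.

```python
# Request totals over the last 7 days, as quoted above from turnilo.
totals = {
    "summary": 35_600_000,
    "mobile-html": 600_000,
    "mobile-sections": 416_000,
    "talk": 1_800,
}

# Endpoints from the list above whose ETag also uses the Parsoid tid.
tid_dependent = {"mobile-sections", "talk"}

grand_total = sum(totals.values())
affected = sum(count for name, count in totals.items() if name in tid_dependent)
affected_share = affected / grand_total

print(f"tid-dependent share: {affected_share:.2%}")
```

On these figures the tid-dependent endpoints account for a little over 1% of requests, with summary and mobile-html making up the rest.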

Change 997439 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Fix feature flag for outgoing parsoid traffic

https://gerrit.wikimedia.org/r/997439

Change 997439 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Fix feature flag for outgoing parsoid traffic

https://gerrit.wikimedia.org/r/997439

cscott closed subtask T356368: Revision endpoint: InvalidArgumentException: ParserOutput does not have a render ID as Resolved. Feb 21 2024, 6:12 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 21 2024, 6:30 PM

Change 1007317 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Switchover outgoing parsoid traffic

https://gerrit.wikimedia.org/r/1007317

gerritbot added a project: Patch-For-Review. Feb 28 2024, 12:09 PM

Change 1007584 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: Increase replicas to 240 total

https://gerrit.wikimedia.org/r/1007584

Change 1007584 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Increase replicas to 240 total

https://gerrit.wikimedia.org/r/1007584

Clement_Goubert mentioned this in T359114: Slow and failed deployments. Mar 5 2024, 10:33 AM

Change 1007317 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Switchover outgoing parsoid traffic

https://gerrit.wikimedia.org/r/1007317

Maintenance_bot removed a project: Patch-For-Review. Mar 5 2024, 1:30 PM

Jgiannelos moved this task from To Deploy to To Verify on the Content-Transform-Team-WIP board. Mar 6 2024, 11:31 AM

Jgiannelos closed subtask T356369: Missing content-language header on PCS responses as Resolved. Mar 6 2024, 3:22 PM

Jgiannelos closed this task as Resolved. Mar 6 2024, 3:26 PM

Clement_Goubert closed subtask T356497: Raise mw-api-int replicas for increased load from mobileapps as Resolved. Mar 7 2024, 12:55 PM
