PCS should use parsoid endpoints in MediaWiki, not RESTbase
Closed, Resolved, Public

Description

PCS should call the /page/{title}/html endpoints in MediaWiki (through the service mesh), instead of calling the page/html/{title} endpoints on RESTBase.

Roll-out strategy

  • Staging for sanity tests
  • Execute sanity tests and test the new routed endpoint
  • Eqiad and Codfw cluster in production

Open Questions

  • How does that affect ParserCache load? Is this a blocker?

Event Timeline

The functionality is already available and requires a feature flag to be enabled in deployment-charts, how should we proceed with the switchover? cc/ @daniel

@Jgiannelos IIRC you had some concerns about just doing it... What's missing?

Title resolution isn't an issue (yet) because that would have already happened in RESTbase before hitting PCS, right?

Once we route to PCS directly, redirects coming from MediaWiki will have to be properly processed though.

Change 939292 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Add core parsoid HTML support config

https://gerrit.wikimedia.org/r/939292

Change 939292 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Add core parsoid HTML support config

https://gerrit.wikimedia.org/r/939292

MSantos triaged this task as Medium priority. Aug 21 2023, 3:51 PM
MSantos updated the task description.

Change 991787 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Use core page html on staging

https://gerrit.wikimedia.org/r/991787

Change 991787 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Use core page html on staging

https://gerrit.wikimedia.org/r/991787

Change 992130 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Configure core page html req template

https://gerrit.wikimedia.org/r/992130

Change 992130 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Configure core page html req template

https://gerrit.wikimedia.org/r/992130

Change 992412 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Core page html: Add content-language headers

https://gerrit.wikimedia.org/r/992412

Change 992412 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Core page html: Add content-language headers

https://gerrit.wikimedia.org/r/992412

@Ladsgroup As part of double checking things before switching over outgoing traffic of PCS from RESTBase (/page/html) to MW (rest.php/v1/page/<article>/with_html) we discussed the topic of ParserCache capacity.
Do you have any concerns about potential problems we might cause by putting the PCS traffic load on ParserCache, instead of having RESTBase do the heavy lifting with Cassandra as storage?

Hi, I need numbers and estimates to tell you whether it'd work or not. We increased its capacity recently so it should be easier now but I still need numbers!

Which numbers/metrics would be useful to prepare to evaluate if things are gonna work with ParserCache?

How many new entries will be added to PC (daily or in total) and how many reads will be done (I hope it's behind a WANcache, the general parsing for read is behind it). That's it.

We are already doing active pre-generation on all changes to keep the parsoid cache in restbase updated. The requests from PCS will hit the same PC entries. So there should be no additional writes.

Thanks @daniel
In terms of more read traffic, it will increase by the same amount of read traffic parsoid on restbase is currently doing because of pregeneration.

As long as reads are cached by WAN, I think it should be fine. Just give a heads up before deploy so we could connect the dots easily in case something happens.

Change 992975 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Use core /page/html output in all envs

https://gerrit.wikimedia.org/r/992975

After running difftesting between staging and prod with a sample of ~40k requests, here are the findings:

  • The vast majority of inconsistencies are minor (eg. timestamp on head->meta)
  • p95 of all testcases had less than 2 lines of different content (mostly metadata)
  • p99 of all testcases had less than 8 lines of different content
  • p999 of all testcases had less than 55 lines of different content

I assume that some of the failures for the percentiles greater than p95 could be transient.
I am rerunning the testcases from diff > p95 to see how many of those were transient or actual issues.
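
For reference, the per-testcase metric above can be approximated as follows. This is a minimal sketch, not the actual difftesting tool: the helper names `count_changed_lines` and `percentile` are hypothetical, and it assumes each testcase yields two HTML bodies (one from staging, one from prod) that are compared line by line.

```python
import difflib
import math


def count_changed_lines(old: str, new: str) -> int:
    """Count lines that differ between two response bodies.

    A replaced block counts as the larger of its two sides, so a
    one-line timestamp change in <head> counts as 1.
    """
    a, b = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list (pct in (0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Feeding the per-testcase counts into `percentile(counts, 95)` gives a figure comparable to the p95 quoted above, under the assumption that the original numbers were also line-based.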

Overall, I think I am confident about switching over the traffic in terms of compatibility.

Failures re-run:

It looks like the numbers are roughly the same so not many transient failures.
After looking at the diffs most of them are improvements to the output. Trying a few page purges also fixed things. It looks like RESTbase had stale content.

After purging the failures from previous runs and re-running the tests, it looks like the root cause was stale RESTBase content, and the diffs are now minimal. I think we are good to switch over traffic from RESTBase to MW cc @Ladsgroup

Thanks. cc @Marostegui this impacts PC, a little bit more reads there but we should be fine

Change 994177 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Switchover PCS to core page HTML

https://gerrit.wikimedia.org/r/994177

Change 992975 abandoned by Jgiannelos:

[operations/deployment-charts@master] mobileapps: Use core /page/html output in all envs

Reason:

https://gerrit.wikimedia.org/r/992975

Change 994199 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Add missing template for MW parsoid reqs

https://gerrit.wikimedia.org/r/994199

Change 994199 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Add missing template for MW parsoid reqs

https://gerrit.wikimedia.org/r/994199

While testing page/summary I am getting timeouts on staging.

Change 994209 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Enable trace logs for debugging

https://gerrit.wikimedia.org/r/994209

Change 994209 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Enable trace logs for debugging

https://gerrit.wikimedia.org/r/994209

Change 994215 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Fix MW core request template name

https://gerrit.wikimedia.org/r/994215

Change 994215 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Fix MW core request template name

https://gerrit.wikimedia.org/r/994215

After running the same ~40k test requests on the page/summary endpoints and diffing the output between staging and prod, things look OK.

  • We did have some mismatches on metadata which is expected
  • Out of 40k reqs only ~10 had mismatches in content

A few more tests:

  • routes related to static assets (css/js/i18n) respond properly (also not affected by parsoid)
  • wikitext to mobile-html is not affected by parsoid
  • page/talk responses also look OK
  • page/media-list responses look OK

Change 994177 merged by Jgiannelos:

[operations/deployment-charts@master] mobileapps: Switchover PCS to core page HTML

https://gerrit.wikimedia.org/r/994177

This is now in production. In terms of error rate I don't see any increase in the metrics. We do have a severe increase in latency:
codfw:
https://grafana.wikimedia.org/goto/EuQQlbtSz?orgId=1
https://grafana.wikimedia.org/goto/f3FXlxtIk?orgId=1

eqiad:
https://grafana.wikimedia.org/goto/tTb9lbtSz?orgId=1
https://grafana.wikimedia.org/goto/JvV9_xpIz?orgId=1

The downstream and upstream latency changes correlate.

Some of that latency is probably mw-api-int being saturated, rps almost doubled and php worker saturation shot up.
https://grafana.wikimedia.org/goto/XWvfk-pIk?orgId=1

We'll need to add some more replicas before the next try.

I did some investigation on the etag compatibility between before/after the switchover and here is how PCS works:

The etags we use have this format: <page revision>/<tid>

  • If not page related we just use some sort of hashing:
    • eg. for CSS the etag is <css_hash>/<timestamp now in tid format>
  • Most of the etag set operations on PCS use only the revision of the MW resource which is compatible before and after the switchover
  • The ones that also use the uuid of parsoid output which is only stable on RESTBase are:
    • /page/media-list
    • /page/talk
    • /metadata
      • I don't think it's exposed in RESTBase
    • mobile-sections
      • Only exposed to kiwix
      • Soon to be decommissioned

Overall, with summary and mobile-html being the vast majority of the requests, the etag incompatibilities should not create inconsistencies.
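
To illustrate the compatibility check described above, here is a minimal sketch that splits an etag of the form <page revision>/<tid> and compares only the revision part. The function names and sample values are made up for illustration, and the handling of surrounding quotes and the `W/` weak prefix is an assumption about how the header arrives, not a description of the PCS code.

```python
from typing import NamedTuple, Optional


class PageEtag(NamedTuple):
    revision: str
    tid: Optional[str]


def parse_etag(header: str) -> PageEtag:
    """Split an etag header value into (revision, tid).

    Assumes an optional weak prefix (W/) and optional double quotes
    around a value shaped like <page revision>/<tid>.
    """
    value = header.strip()
    if value.startswith("W/"):
        value = value[2:]
    value = value.strip('"')
    revision, _, tid = value.partition("/")
    return PageEtag(revision, tid or None)


def same_revision(before: str, after: str) -> bool:
    """True when two etags point at the same page revision, ignoring the tid."""
    return parse_etag(before).revision == parse_etag(after).revision


# Hypothetical values: same revision, different tid before and after the switchover.
print(same_revision('"1234/aaaa-tid"', 'W/"1234/bbbb-tid"'))  # True
```

Routes that only set the revision part would pass this check before and after the switchover; routes that also depend on the tid (media-list, talk, metadata, mobile-sections) would need a full string comparison and could differ.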

For example from turnilo over the last 7 days (in descending order):

  • Total requests
    • summary: 35.6m
    • mobile-html: 0.6m
    • mobile-sections: 416k
    • talk: 1.8k
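
As a rough check of the numbers above, the share of traffic on routes whose etag depends on the parsoid tid can be computed directly. This sketch uses only the four figures listed; media-list and metadata do not appear in the list, so leaving them out is an assumption, and `traffic_share` is a hypothetical helper.

```python
# Request counts over 7 days, taken from the list above.
requests_7d = {
    "summary": 35_600_000,
    "mobile-html": 600_000,
    "mobile-sections": 416_000,
    "talk": 1_800,
}

# Routes from the list whose etag also uses the parsoid tid.
tid_dependent = {"mobile-sections", "talk"}


def traffic_share(counts: dict[str, int], routes: set[str]) -> float:
    """Fraction of all listed requests that hit the given routes."""
    total = sum(counts.values())
    return sum(counts[route] for route in routes) / total


affected = traffic_share(requests_7d, tid_dependent)
print(f"{affected:.2%}")  # prints 1.14%
```

Under these assumptions, a little over 1% of the listed requests go to routes with tid-dependent etags, and most of that is mobile-sections.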

Change 997439 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Fix feature flag for outgoing parsoid traffic

https://gerrit.wikimedia.org/r/997439

Change 997439 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Fix feature flag for outgoing parsoid traffic

https://gerrit.wikimedia.org/r/997439

Change 1007317 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Switchover outgoing parsoid traffic

https://gerrit.wikimedia.org/r/1007317

Change 1007584 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: Increase replicas to 240 total

https://gerrit.wikimedia.org/r/1007584

Change 1007584 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Increase replicas to 240 total

https://gerrit.wikimedia.org/r/1007584

Change 1007317 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Switchover outgoing parsoid traffic

https://gerrit.wikimedia.org/r/1007317