
Action API via rest-gateway production rollout
Closed, ResolvedPublic

Description

  • group0 at 10%
  • group0 at 50% (optional)
  • group0 at 100%
  • group1 at 10%
  • group1 at 50%
  • group1 at 100%
  • Pause to examine capacity and scale up if needed
  • all non-enwiki at 10%
  • all non-enwiki at 50%
  • all non-enwiki at 100%
  • capacity check
  • enwiki at 10%
  • enwiki at 50%
  • enwiki at 100%
  • Cleanup
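The percentage steps above imply a deterministic bucketing decision at the edge. As a hedged sketch (illustrative only, not the actual trafficserver/Lua implementation), a gradual rollout can hash a stable request key into 100 buckets and send the request to rest-gateway when its bucket falls below the current rollout percentage, so a given key's routing never flaps while the percentage is unchanged:

```python
import hashlib

def routes_to_rest_gateway(request_key: str, rollout_pct: int) -> bool:
    """Deterministically bucket a request into 0..99 and compare the
    bucket against the rollout percentage. The same key always gets
    the same decision for a given percentage, so traffic doesn't flap
    between backends mid-rollout."""
    digest = hashlib.sha256(request_key.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_pct

# At 0% nothing is routed; at 100% everything is.
assert not routes_to_rest_gateway("client-42", 0)
assert routes_to_rest_gateway("client-42", 100)

# At 10%, roughly one in ten distinct keys is routed.
sample = sum(routes_to_rest_gateway(f"client-{i}", 10) for i in range(10_000))
print(sample)  # roughly 1000
```

Because the buckets are monotone, every key routed at 10% stays routed at 50% and 100%, which keeps each increment a strict superset of the previous one.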

Event Timeline

Clement_Goubert triaged this task as High priority.

Change #1198929 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group0 10%

https://gerrit.wikimedia.org/r/1198929

Change #1198930 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group0 50%

https://gerrit.wikimedia.org/r/1198930

Change #1198931 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group0 100%

https://gerrit.wikimedia.org/r/1198931

Change #1198932 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group1 10%

https://gerrit.wikimedia.org/r/1198932

Change #1198933 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group1 50%

https://gerrit.wikimedia.org/r/1198933

Change #1198934 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group1 100%

https://gerrit.wikimedia.org/r/1198934

Change #1198935 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group2 10%

https://gerrit.wikimedia.org/r/1198935

Change #1198936 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group2 50%

https://gerrit.wikimedia.org/r/1198936

Change #1198937 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway group2 100%

https://gerrit.wikimedia.org/r/1198937

Change #1198938 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway enwiki 10%

https://gerrit.wikimedia.org/r/1198938

Change #1198939 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway enwiki 50%

https://gerrit.wikimedia.org/r/1198939

Change #1198940 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway enwiki 100%

https://gerrit.wikimedia.org/r/1198940

Change #1198941 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: action api to rest-gateway cleanup

https://gerrit.wikimedia.org/r/1198941

Change #1198929 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group0 10%

https://gerrit.wikimedia.org/r/1198929

Change #1198930 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group0 50%

https://gerrit.wikimedia.org/r/1198930

Change #1198931 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group0 100%

https://gerrit.wikimedia.org/r/1198931

Change #1198932 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group1 10%

https://gerrit.wikimedia.org/r/1198932

Clement_Goubert changed the task status from Open to In Progress. Nov 3 2025, 9:56 AM
Clement_Goubert updated the task description.

Change #1198933 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group1 50%

https://gerrit.wikimedia.org/r/1198933

Change #1198934 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group1 100%

https://gerrit.wikimedia.org/r/1198934

Change #1198935 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group2 10%

https://gerrit.wikimedia.org/r/1198935

Change #1198936 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group2 50%

https://gerrit.wikimedia.org/r/1198936

Change #1198937 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway group2 100%

https://gerrit.wikimedia.org/r/1198937

Change #1198938 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway enwiki 10%

https://gerrit.wikimedia.org/r/1198938

I was checking some graphs this evening while finalizing the plan for winding down PHP_ENGINE routing tomorrow, and I ran into something odd. Compare mw-api-ext traffic served by its next release over the last 3 days in eqiad vs. codfw (note: you can see the same effect on the set of releases that share an endpoints object with main, but it's much noisier).

Specifically, what shifted a bunch of mw-api-ext traffic from codfw to eqiad around 15:00 UTC on the 11th and 9:00 UTC on the 12th?

I then realized this has to be when https://gerrit.wikimedia.org/r/1198936 and https://gerrit.wikimedia.org/r/1198937 were applied across the CDN fleet.

Although we created rest-gateway-ro.discovery.wmnet and updated multi-dc.lua to ensure A/A routing to rest-gateway works, rest-gateway itself does not implement A/A routing. Stated differently, any mw-api-ext-bound traffic routed via rest-gateway will always hit the primary DC on the upstream side.
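The routing gap described here can be modeled in a few lines. This is a toy sketch under simplified assumptions (the DC names and the two routing steps are illustrative, not the real config): the CDN-side discovery record picks the DC-local gateway, but the gateway then forwards to the primary DC unconditionally, so the edge-local choice never survives the second hop.

```python
# Toy model of the A/A routing gap: the CDN hop is DC-local, but the
# rest-gateway hop always targets the primary DC. Names are illustrative.

PRIMARY_DC = "eqiad"  # assumed primary for this sketch

def cdn_pick_dc(edge_dc: str) -> str:
    """rest-gateway-ro.discovery.wmnet + multi-dc.lua send read-only
    traffic to the DC-local rest-gateway instance."""
    return edge_dc

def rest_gateway_pick_upstream(gateway_dc: str) -> str:
    """rest-gateway does not implement A/A routing: whichever DC it
    runs in, it forwards mw-api-ext traffic to the primary DC."""
    return PRIMARY_DC

def serving_dc(edge_dc: str) -> str:
    return rest_gateway_pick_upstream(cdn_pick_dc(edge_dc))

# Traffic entering at a non-primary edge still ends up served by the
# primary DC, which is why load shifted between sites as the rollout grew.
assert serving_dc("codfw") == PRIMARY_DC
assert serving_dc(PRIMARY_DC) == PRIMARY_DC
```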

[Edit: To quantify this a bit, over the past 2 weeks, mw-api-ext-bound traffic landing at rest-gateway in eqiad has grown by ~2.4k rps. While that's an upper bound on what could have potentially been served locally rather than traveling all the way to codfw, the sizable delta that's grown between sites over that time suggests the vast majority could be served locally.]

I'm wondering when the enwiki 50% increment is planned, and whether it would be possible to hold until I've shunted all next-destined traffic back to main (which I aim to do tomorrow). The reason I ask is that getting next out of the picture will vastly simplify any capacity changes necessary to accommodate the growing codfw bias.

Longer term, is there currently a plan to reintroduce A/A routing by implementing it on this new leg of the request path?

> I was checking some graphs this evening while finalizing the plan for winding down PHP_ENGINE routing tomorrow, and I ran into something odd. Compare mw-api-ext traffic served by its next release over the last 3 days in eqiad vs. codfw (note: you can see the same effect on the set of releases that share an endpoints object with main, but it's much noisier).
>
> Specifically, what shifted a bunch of mw-api-ext traffic from codfw to eqiad around 15:00 UTC on the 11th and 9:00 UTC on the 12th?
>
> I then realized this has to be when https://gerrit.wikimedia.org/r/1198936 and https://gerrit.wikimedia.org/r/1198937 were applied across the CDN fleet.
>
> Although we created rest-gateway-ro.discovery.wmnet and updated multi-dc.lua to ensure A/A routing to rest-gateway works, rest-gateway itself does not implement A/A routing. Stated differently, any mw-api-ext-bound traffic routed via rest-gateway will always hit the primary DC on the upstream side.

Aaaah yep that would do it.

> [Edit: To quantify this a bit, over the past 2 weeks, mw-api-ext-bound traffic landing at rest-gateway in eqiad has grown by ~2.4k rps. While that's an upper bound on what could have potentially been served locally rather than traveling all the way to codfw, the sizable delta that's grown between sites over that time suggests the vast majority could be served locally.]
>
> I'm wondering when the enwiki 50% increment is planned, and whether it would be possible to hold until I've shunted all next-destined traffic back to main (which I aim to do tomorrow). The reason I ask is that getting next out of the picture will vastly simplify any capacity changes necessary to accommodate the growing codfw bias.

Yep, sure, holding for now.

> Longer term, is there currently a plan to reintroduce A/A routing by implementing it on this new leg of the request path?

I don't think we've discussed that in depth, but we can probably figure something out.

Change #1204865 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] rest-gateway: Point to DC-local mw-api-ext deployment

https://gerrit.wikimedia.org/r/1204865

We received notifications from users that the search API, which is configured to allow 50s timeouts to support costly search requests, is now failing at 15s with an upstream request timeout (T410007). The user reported that the behavior started changing around Nov 11th, which is apparently when we started rolling out this new route on group2 wikis. I'm not 100% sure that this change is the cause, but IIUC, on all wikis except enwiki we now route api.php requests to the rest-gateway. If I'm not mistaken, the rest-gateway has a default timeout of 15s, which might explain this new behavior? Is there a way to vary this timeout based on the target action API?
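If the gateway does apply a flat upstream timeout, the failure mode is simple arithmetic: a request is cut off by the smallest timeout anywhere on its path, so a gateway route timeout must exceed the slowest backend allowance it fronts. A minimal sketch (only the 15s and 50s values come from the report above; the rest is illustration):

```python
def effective_timeout(*timeouts_s: float) -> float:
    """A request dies at the smallest timeout on its path,
    regardless of what downstream components would allow."""
    return min(timeouts_s)

BACKEND_SEARCH_TIMEOUT_S = 50   # action API search allowance (from the report)
GATEWAY_DEFAULT_TIMEOUT_S = 15  # suspected rest-gateway default

# With the 15s default in the path, long search requests fail at 15s
# even though the backend would have allowed 50s.
assert effective_timeout(GATEWAY_DEFAULT_TIMEOUT_S, BACKEND_SEARCH_TIMEOUT_S) == 15

# Raising the gateway's action-API timeout above the backend's, as the
# follow-up 53s patch does, makes the backend's own 50s limit the
# binding one again, restoring the old behavior.
assert effective_timeout(53, BACKEND_SEARCH_TIMEOUT_S) == 50
```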

Change #1206205 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] rest-gateway: set 53s timeout for action API

https://gerrit.wikimedia.org/r/1206205

Change #1206205 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: set 53s timeout for action API

https://gerrit.wikimedia.org/r/1206205

> We received notifications from users that the search API, which is configured to allow 50s timeouts to support costly search requests, is now failing at 15s with an upstream request timeout (T410007). The user reported that the behavior started changing around Nov 11th, which is apparently when we started rolling out this new route on group2 wikis. I'm not 100% sure that this change is the cause, but IIUC, on all wikis except enwiki we now route api.php requests to the rest-gateway. If I'm not mistaken, the rest-gateway has a default timeout of 15s, which might explain this new behavior? Is there a way to vary this timeout based on the target action API?

Thanks a lot for letting us know - we've increased the overall timeout for the action API to 53s in the gateway, so this shouldn't interfere with request flow in the future. It looks like previously failing queries are now working. I'll update the linked task to explain further.

Change #1204865 merged by jenkins-bot:

[operations/deployment-charts@master] rest-gateway: Point to DC-local mw-api-ext deployment

https://gerrit.wikimedia.org/r/1204865

Change #1198939 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway enwiki 50%

https://gerrit.wikimedia.org/r/1198939

Change #1198940 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway enwiki 100%

https://gerrit.wikimedia.org/r/1198940

Change #1198941 merged by Clément Goubert:

[operations/puppet@production] trafficserver: action api to rest-gateway cleanup

https://gerrit.wikimedia.org/r/1198941

Clement_Goubert updated the task description.