
Turn up PHP 8.1 Shellbox deployments
Closed, Resolved; Public

Description

Once 8.1-based service images are available (T374502), we need to turn up 8.1 deployments of Shellbox.

Notably, these deployments are not going to be directly addressable, and will instead route via the service associated with the existing 7.4-based "main" release (similar to how canary releases work in other use cases).

By trading replica counts between the two parallel releases, we can progressively migrate traffic over to 8.1.
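As a rough illustration of trading replica counts, here is a minimal sketch of how a target traffic fraction might map to replica counts on the two releases. The function name `split_replicas`, the total of 10 replicas, and the fractions used are all hypothetical. It also assumes traffic is spread roughly evenly across pods, which is an approximation.

```python
def split_replicas(total: int, migration_fraction: float) -> tuple[int, int]:
    """Return (main_replicas, migration_replicas) for a target traffic share.

    Hypothetical helper: assumes each pod receives a roughly equal share of
    traffic, so the replica ratio approximates the traffic ratio.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if not 0.0 <= migration_fraction <= 1.0:
        raise ValueError("migration_fraction must be within [0, 1]")

    migration = round(total * migration_fraction)
    # A nonzero target should yield at least one migration pod; otherwise the
    # new release would receive no traffic at all.
    if migration_fraction > 0 and migration == 0 and total > 0:
        migration = 1
    return total - migration, migration


if __name__ == "__main__":
    # Illustrative stages only; not a schedule taken from this task.
    for fraction in (0.0, 0.1, 0.5, 1.0):
        main, migration = split_replicas(10, fraction)
        print(f"{fraction:.0%}: main={main} migration={migration}")
```

Note that the achievable granularity is limited by the total replica count: with N replicas in total, the smallest nonzero share is about 1/N.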

Event Timeline

Change #1074494 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: add support for service.deployment: none

https://gerrit.wikimedia.org/r/1074494

Change #1074495 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: add migration release

https://gerrit.wikimedia.org/r/1074495

Summary of the current state: We'll have the first 8.1-based service images available soon, at which point we should be unblocked to start testing.

I would propose that we start with a shellbox service that (1) has a limited (Debian) package dependency surface (i.e., to avoid changing too many things at once, given the switch to bullseye) but (2) does not receive a terribly large amount of traffic.

Two that seem to satisfy #1 are syntaxhighlight and constraints (php-rpc image variant), and among those, the former serves significantly less traffic. Thus, I'd propose we start with that (see patches already linked to this task).

Change #1074494 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: add support for service.deployment: none

https://gerrit.wikimedia.org/r/1074494

Change #1074495 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: add migration release

https://gerrit.wikimedia.org/r/1074495

The changes to support routed_via: main appear to work as expected:

swfrench@deploy2002:~$ curl -v 'https://staging.svc.eqiad.wmnet:4014/healthz'
 ... snip ...
< HTTP/1.1 200 OK
< date: Thu, 03 Oct 2024 19:59:45 GMT
< server: wikimedia
< x-powered-by: PHP/8.1.30
< backend-timing: D=4489 t=1727985585191015
< content-type: application/json
< x-envoy-upstream-service-time: 4
< transfer-encoding: chunked
< 
{
    "__": "Shellbox running",
    "pid": 9
}
swfrench@deploy2002:~$ curl -v 'https://staging.svc.eqiad.wmnet:4014/healthz'
 ... snip ...
< HTTP/1.1 200 OK
< date: Thu, 03 Oct 2024 20:01:47 GMT
< server: wikimedia
< x-powered-by: PHP/7.4.33
< backend-timing: D=1174 t=1727985707125583
< content-type: application/json
< x-envoy-upstream-service-time: 1
< transfer-encoding: chunked
< 
{
    "__": "Shellbox running",
    "pid": 18
}
swfrench@deploy2002:~$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
shellbox-main-69d47549b4-jxdq4        5/5     Running   0          63d
shellbox-migration-85fb86497f-czzsx   5/5     Running   0          9m44s
swfrench@deploy2002:~$ kubectl get endpoints
NAME                        ENDPOINTS                             AGE
shellbox-main-tls-service   10.64.75.113:4014,10.64.75.185:4014   611d

That is, each request hits either main (7.4) or migration (8.1) with probability roughly proportional to the number of pods in each deployment. This glosses over some details (e.g., the selection is random per connection rather than per request), but those are not really relevant to the test we're doing here.
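To check the observed split over many probes rather than two, the `x-powered-by` values from repeated requests could be tallied. The sketch below is a hypothetical helper, not an existing tool: it takes header values that have already been collected (e.g., from repeated curl runs, each on a fresh connection) and reports the share per PHP major.minor version.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_php_versions(powered_by_values: Iterable[str]) -> Counter:
    """Count probes per PHP major.minor version, e.g. "PHP/8.1.30" -> "8.1"."""
    counts: Counter = Counter()
    for value in powered_by_values:
        name, _, version = value.partition("/")
        if name.strip().upper() != "PHP" or not version.strip():
            # Anything that does not look like "PHP/x.y.z" is kept visible
            # rather than silently dropped.
            counts["unknown"] += 1
            continue
        major_minor = ".".join(version.strip().split(".")[:2])
        counts[major_minor] += 1
    return counts


def shares(counts: Counter) -> Dict[str, float]:
    """Convert raw counts into fractions of all probes."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {version: n / total for version, n in counts.items()}


if __name__ == "__main__":
    # The two header values seen in the transcript above.
    observed = ["PHP/8.1.30", "PHP/7.4.33"]
    print(shares(tally_php_versions(observed)))
```

With one pod on each release, as in the staging output above, the shares would be expected to approach an even split as the number of probes grows.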

Next, I'll look into what's necessary to do some basic functional testing (e.g., whether there exists a tool to generate properly hmac-sha256 signed requests).
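For reference, computing an HMAC-SHA256 over a request body is straightforward with Python's standard library. This is only a generic sketch: the key, the body, and especially how Shellbox expects the signature to be conveyed (header name and format) are assumptions that would need to be checked against the Shellbox client code.

```python
import hashlib
import hmac


def sign_body(secret_key: bytes, body: bytes) -> str:
    """Return the hex-encoded HMAC-SHA256 of body under secret_key."""
    return hmac.new(secret_key, body, hashlib.sha256).hexdigest()


def verify_body(secret_key: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a hex signature using a constant-time comparison."""
    expected = sign_body(secret_key, body)
    return hmac.compare_digest(expected, signature_hex)


if __name__ == "__main__":
    # Placeholder key and body; a real test would use the service's
    # configured secret and a real request payload.
    key = b"example-secret"
    body = b'{"example": "payload"}'
    signature = sign_body(key, body)
    print(signature)
    print(verify_body(key, body, signature))
```

The signature would then be attached to the request in whatever form the server expects, which is deliberately left out here.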

From some very basic testing that emulates pygmentize commands issued by SyntaxHighlight, the migration release indeed seems to work as expected (i.e., produces the same results as main).

I think there are two key questions to resolve before moving ahead with our first production migration:

Schedule: We need to decide on a migration schedule - i.e., what fractions of capacity (traffic) to split between main and migration and on what schedule.

Here, I would propose we start with some relatively small fraction of traffic (10% or less) and let that sit for a couple of days. If that shakes out no issues, then we should move relatively quickly to 50% and then later 100% on successive days, ideally near the start of a work week.

Shellbox version: In short, we should make sure that the same shellbox application code is being run in the 7.4- and 8.1-based images.

This part is a little complicated on account of managing image-build configuration and shellbox application code in the same repo / branch. Between the live shellbox-syntaxhighlight main release image (2024-06-10-140015) and the first-available 8.1-based image live in migration (2024-10-01-174300), there are two unreleased changes of note:

If there are no significant concerns about deploying these, then we should catch up main to minimize the number of differences to rule out when debugging.

Change #1081266 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: add "migration" in prod

https://gerrit.wikimedia.org/r/1081266

Returning to this, there are two issues I'd like to resolve before I'd consider this ready-to-go:

  1. There are a number of unapplied diffs against various prod shellbox instances, mainly as a result of not being deployed in some months (most are trivial horizontal changes - e.g., envoy image bump).
  2. There remains the issue of unreleased code / dependency changes between the now-live 2024-06-10-140015 images and the 2024-10-15-214239 images (the first with 8.1 builds covering all variants).

The primary changes of note in #2 remain those in T375243#10203654. Also notably, shellbox-video is running a 2024-09-11-160805 image, which means it has already put some miles on those composer package bumps (the bug fix in https://gerrit.wikimedia.org/r/1067545 wasn't merged until 9/20).

While it's non-ideal to be deploying a version of the shellbox code that does not map to a released version (i.e., it's neither 4.0.2 nor the more recent 4.1.0), (a) that is the reality of what we've been doing for some time and (b) these seem to be fairly low-risk changes (i.e., in a relative sense, lower risk than advancing all the way to 4.1.0, which carries some significant new functionality).

So, before closing this out, I plan to:

  • Pin Shellbox at 2024-06-10-140015 for everything except shellbox-video (to be pinned to 2024-09-11-160805) and resolve all non-service-image diffs.
  • Move all Shellbox instances forward to 2024-10-15-214239 images.
  • Configure "migration" releases of all Shellbox instances, scaled to zero (i.e., serving no traffic).

Change #1082317 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: pin all instances at live image version

https://gerrit.wikimedia.org/r/1082317

Change #1082318 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox-syntaxhighlight: upgrade to 2024-10-15-214239

https://gerrit.wikimedia.org/r/1082318

Change #1082319 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: upgrade to 2024-10-15-214239 (all)

https://gerrit.wikimedia.org/r/1082319

jijiki triaged this task as Medium priority. Oct 23 2024, 12:10 PM

Change #1082572 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] shellbox: add migration release (all)

https://gerrit.wikimedia.org/r/1082572

Change #1082317 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: pin all instances at live image version

https://gerrit.wikimedia.org/r/1082317

Change #1082318 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: upgrade to 2024-10-15-214239

https://gerrit.wikimedia.org/r/1082318

Mentioned in SAL (#wikimedia-operations) [2024-10-29T18:37:44Z] <swfrench-wmf> shellbox-syntaxhighlight updated to shellbox 2024-10-15-214239 - T375243

While I don't anticipate any issues (and indeed have not observed any so far), I'm going to let syntaxhighlight soak for a bit before merging https://gerrit.wikimedia.org/r/1082319 and updating the remaining instances.

No issues observed after updating syntaxhighlight (no service errors or latency regressions, and no attributable errors / exceptions logged by mediawiki). I'll move ahead with the other instances this morning.

Change #1082319 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: upgrade to 2024-10-15-214239 (all)

https://gerrit.wikimedia.org/r/1082319

Mentioned in SAL (#wikimedia-operations) [2024-10-30T19:40:16Z] <swfrench-wmf> all shellbox instances updated to shellbox 2024-10-15-214239 - T375243

Change #1081266 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-syntaxhighlight: add "migration" in prod

https://gerrit.wikimedia.org/r/1081266

Change #1082572 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: add migration release (all)

https://gerrit.wikimedia.org/r/1082572

Mentioned in SAL (#wikimedia-operations) [2024-11-04T20:26:33Z] <swfrench-wmf> zero-replica "migration" releases created for all shellbox instances - T375243

All migration releases have been turned up. The traffic migration itself will be tracked in T377038.