Turn up PHP 8.1-flavored mw-debug k8s deployment
Closed, Resolved (Public)

Description

Once 8.1 base images (T372602) are available and multi-base-image "flavor" builds are available in scap (T370934), turn up a new 8.1-flavored deployment in the existing mw-debug namespace, along with LVS and DNS discovery services, etc.

In theory, once we fully migrate to 8.1, we can turn this all down again. Alternatively, we could keep it around, but scaled to zero, for later re-use when we migrate to PHP 8.3, etc.

If we go that route, we should give it a suitably generic name, e.g., "next" or "migration" or something.

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 15 2024, 9:24 PM

Allocating:

Mentioned in SAL (#wikimedia-operations) [2024-09-10T18:04:05Z] <swfrench-wmf> ran sre.dns.netbox after adding mwdebug-next LVS VIPs for T372604

Change #1071932 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/dns@master] wmnet: A and PTR records for mwdebug-next in svc

https://gerrit.wikimedia.org/r/1071932

Change #1071932 merged by Scott French:

[operations/dns@master] wmnet: A and PTR records for mwdebug-next in svc

https://gerrit.wikimedia.org/r/1071932

Change #1071933 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] service: add basic configuration for mwdebug-next

https://gerrit.wikimedia.org/r/1071933

Mentioned in SAL (#wikimedia-operations) [2024-09-10T18:38:24Z] <swfrench-wmf> ran authdns-update on dns1004 (18:25 UTC) for T372604

Change #1071945 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-debug: add initial "next" release

https://gerrit.wikimedia.org/r/1071945

Change #1071957 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki: parameterize PHP version via chart value

https://gerrit.wikimedia.org/r/1071957

Change #1071945 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: add initial "next" release

https://gerrit.wikimedia.org/r/1071945

Change #1072578 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "mw-debug: add initial "next" release"

https://gerrit.wikimedia.org/r/1072578

Change #1072578 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mw-debug: add initial "next" release"

https://gerrit.wikimedia.org/r/1072578

Alas, among the ways mw-debug is special, it configures an extra NodePort service that skips envoy (i.e., direct to apache) and in theory permits serving via HTTP for testing / benchmarking purposes:

service:
  deployment: production
  expose_http: true
  port:
    nodePort: 8444

See, e.g., https://gerrit.wikimedia.org/r/703739.

Having forgotten about this, I did not override the nodePort on the "next" release with a distinct value, so the release of course failed to apply:

Service "mediawiki-next" is invalid: spec.ports[0].nodePort: Invalid value: 8444: provided port is already allocated

Two options come to mind:

  1. I can simply re-spin https://gerrit.wikimedia.org/r/1071945 with a nodePort: 8453 override in values-next.yaml (port doesn't seem to be used for anything else per code-search and follows the same +4000 convention).
  2. I'm not 100% sure this is serving any useful purpose at the moment, given that any HTTP request directed at a wiki would receive a 302 to the canonical HTTPS URI. If that's the case, I can remove this service from mw-debug.

@akosiaris - Any preference among these options? Or @Joe if you have any context to share on current applicability of the original use case, that would be greatly appreciated as well.
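
For illustration, the collision and option 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the actual chart or helmfile logic: the function names and the `before`/`after` dictionaries are hypothetical, and the TLS port 4453 for "next" is assumed from the +4000 convention (8453 - 4000).

```python
from __future__ import annotations

from collections import defaultdict

# Offset between a release's TLS port and its plain-HTTP NodePort,
# following the "+4000 convention" mentioned above.
HTTP_NODEPORT_OFFSET = 4000


def http_node_port(tls_port: int) -> int:
    """Derive the plain-HTTP NodePort from the TLS port."""
    return tls_port + HTTP_NODEPORT_OFFSET


def find_node_port_conflicts(releases: dict[str, int]) -> dict[int, list[str]]:
    """Return every nodePort claimed by more than one release."""
    by_port: dict[int, list[str]] = defaultdict(list)
    for release, port in releases.items():
        by_port[port].append(release)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}


# Hypothetical effective values: without an override, "next" inherits 8444.
before = {"pinkunicorn": 8444, "next": 8444}
# With the override proposed in option 1.
after = {"pinkunicorn": 8444, "next": http_node_port(4453)}

print(find_node_port_conflicts(before))
print(find_node_port_conflicts(after))
```

A check along these lines could in principle run against rendered values before applying, since Kubernetes only rejects the duplicate nodePort at apply time.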

The http port is used for benchmarking as most HTTP benchmarking tools don't use persistent connections, and thus TLS negotiation becomes a major bottleneck for the benchmarking tools. So I'd say, personally, that it's useful to have it exposed as we would definitely want to run benchmarks on whatever -next represents in many circumstances.
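
As a rough illustration of why connection reuse matters here, the sketch below models the share of wall time spent on TLS handshakes. The function name and the latencies are placeholders chosen for the example, not measurements of mw-debug or of any benchmarking tool.

```python
def handshake_share(handshake_ms: float, request_ms: float, requests_per_conn: int) -> float:
    """Fraction of per-connection wall time spent on the TLS handshake."""
    if requests_per_conn < 1:
        raise ValueError("requests_per_conn must be at least 1")
    total_ms = handshake_ms + request_ms * requests_per_conn
    return handshake_ms / total_ms


# Placeholder latencies (NOT measurements): 5 ms handshake, 5 ms per request.
no_reuse = handshake_share(5.0, 5.0, requests_per_conn=1)      # new connection per request
with_reuse = handshake_share(5.0, 5.0, requests_per_conn=100)  # connection reused 100 times

print(f"handshake share without reuse: {no_reuse:.0%}")
print(f"handshake share with reuse:    {with_reuse:.0%}")
```

Under these made-up numbers, a tool that opens a new connection per request spends half its time negotiating TLS, which is the overhead the plain-HTTP port avoids.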

Thanks, @Joe. Great, so if the original use case is still applicable (and, agreed, being able to benchmark "next" is a desirable property), then it's straightforward to expose this on 8453. I'll post a re-spin of my patch with that shortly.

Change #1072764 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-debug: add initial "next" release (attempt 2)

https://gerrit.wikimedia.org/r/1072764

Change #1072794 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/dns@master] wmnet: add geoip discovery DYNA record for mw-debug-next

https://gerrit.wikimedia.org/r/1072794

Change #1072796 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] [DNM] service: move mwdebug-next to lvs_setup

https://gerrit.wikimedia.org/r/1072796

Change #1072798 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] [DNM] service: move mwdebug-next to production

https://gerrit.wikimedia.org/r/1072798

Summary of the current state: All of the necessary patches to turn up the "next" release are ready to go, but I'd like to wait until at least the "support multiple releases per namespace in scap" part of T370934 is complete (patch out for review).

That would allow us to turn up next with it temporarily referencing the same 7.4-based images as the existing pinkunicorn release and have it updated as part of normal scap deployments, with development continuing in parallel to support multi-base-image-flavor builds.

While we could turn up next now and point at the pinkunicorn helmfile values (see, e.g., https://gerrit.wikimedia.org/r/1072764), that adds a source of potential confusion in which next can only be updated with a manual helmfile deployment (and is thus likely to be running stale code at any given time).

Change #1077481 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: add mw-debug "next" release to mw_releases

https://gerrit.wikimedia.org/r/1077481

Change #1078007 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-debug: remove temporary release value override

https://gerrit.wikimedia.org/r/1078007

The scap functionality described in T372604#10189848 is now live and appears to work as expected (T370934#10200417).

On Monday, I plan to merge and apply https://gerrit.wikimedia.org/r/1072764, followed by https://gerrit.wikimedia.org/r/1077481 and a test scap deployment. Once that's done and working as expected, I'll remove the temporary override to track "pinkunicorn" (https://gerrit.wikimedia.org/r/1078007) and continue with the remaining turnup changes (LVS service, etc.).

Change #1072764 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: add initial "next" release (attempt 2)

https://gerrit.wikimedia.org/r/1072764

mw-debug next is now up in eqiad and codfw - appears healthy and successfully serves Special:BlankPage on port 4453

Change #1077481 merged by Scott French:

[operations/puppet@production] hieradata: add mw-debug "next" release to mw_releases

https://gerrit.wikimedia.org/r/1077481

Mentioned in SAL (#wikimedia-operations) [2024-10-07T17:26:26Z] <swfrench@deploy2002> Started scap sync-world: Testing scap after mw-debug next bring-up - T372604

Mentioned in SAL (#wikimedia-operations) [2024-10-07T17:29:11Z] <swfrench@deploy2002> Finished scap sync-world: Testing scap after mw-debug next bring-up - T372604 (duration: 02m 45s)

Change #1078007 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: remove temporary release value override

https://gerrit.wikimedia.org/r/1078007

Remaining steps for the initial turn-up:

Change #1071933 merged by Scott French:

[operations/puppet@production] service: add basic configuration for mwdebug-next

https://gerrit.wikimedia.org/r/1071933

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:03:58Z] <swfrench-wmf> ran disable-puppet on 'A:lvs and (A:eqiad or A:codfw)' - T372604

Change #1072796 merged by Scott French:

[operations/puppet@production] service: move mwdebug-next to lvs_setup

https://gerrit.wikimedia.org/r/1072796

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:09:05Z] <swfrench-wmf> ran and enabled puppet-agent on 'A:lvs and A:eqiad' - T372604

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:12:05Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:17:53Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:21:53Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:27:53Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:34:38Z] <swfrench-wmf> ran and enabled puppet-agent on 'A:lvs and A:codfw' - T372604

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:35:04Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:35:50Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:39:08Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T372604)

Mentioned in SAL (#wikimedia-operations) [2024-10-08T17:45:07Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T372604)

LVS setup is done and mwdebug-next.svc.(codfw|eqiad).wmnet work as expected.
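
For reference, the per-site service names above expand as in the following sketch. The helper names are hypothetical, and the HTTPS scheme, port 4453, and `/wiki/` path are assumptions carried over from the earlier Special:BlankPage check rather than confirmed details of the LVS configuration.

```python
# Sites named in the service hostnames above.
SITES = ("codfw", "eqiad")


def service_host(site: str) -> str:
    """Build the per-site LVS service hostname for mwdebug-next."""
    if site not in SITES:
        raise ValueError(f"unknown site: {site}")
    return f"mwdebug-next.svc.{site}.wmnet"


def blank_page_url(site: str, port: int = 4453) -> str:
    """Build a URL for a Special:BlankPage smoke check (port and path assumed)."""
    return f"https://{service_host(site)}:{port}/wiki/Special:BlankPage"


for site in SITES:
    print(blank_page_url(site))
```

An actual request would also need a Host header naming a wiki, which is omitted here.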

Change #1072798 merged by Scott French:

[operations/puppet@production] service: move mwdebug-next to production

https://gerrit.wikimedia.org/r/1072798

Mentioned in SAL (#wikimedia-operations) [2024-10-08T18:50:51Z] <swfrench@cumin2002> conftool action : set/pooled=true; selector: dnsdisc=mwdebug-next,name=codfw [reason: pooling mwdebug-next in codfw to match mwdebug - T372604]

Change #1072794 merged by Scott French:

[operations/dns@master] wmnet: add geoip discovery DYNA record for mwdebug-next

https://gerrit.wikimedia.org/r/1072794

Mentioned in SAL (#wikimedia-operations) [2024-10-08T18:54:56Z] <swfrench-wmf> ran authdns-update on dns1004 to pick up mwdebug-next record - T372604

Alright, with the exception of a couple of minor follow-ups, I think that's about as far as we can get for now without the 8.1-based images.

Change #1078736 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/cookbooks@master] sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES

https://gerrit.wikimedia.org/r/1078736

Scott_French changed the task status from Open to Stalled. Oct 24 2024, 4:01 PM
Scott_French triaged this task as Medium priority.

Change #1071957 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: parameterize PHP version via chart value

https://gerrit.wikimedia.org/r/1071957

Mentioned in SAL (#wikimedia-operations) [2024-10-31T17:11:08Z] <swfrench@deploy2002> Started scap sync-world: Deployment to pick up PHP version parameterization - T372604 T377040

Mentioned in SAL (#wikimedia-operations) [2024-10-31T17:13:00Z] <swfrench@deploy2002> Finished scap sync-world: Deployment to pick up PHP version parameterization - T372604 T377040 (duration: 01m 52s)

Change #1085491 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mediawiki: ensure default php.version is a string

https://gerrit.wikimedia.org/r/1085491

Change #1085491 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: ensure default php.version is a string

https://gerrit.wikimedia.org/r/1085491

Mentioned in SAL (#wikimedia-operations) [2024-10-31T23:35:38Z] <swfrench@deploy2002> Started scap sync-world: Deployment to clear noop chart diff from 1085491 - T372604 T377040

Mentioned in SAL (#wikimedia-operations) [2024-10-31T23:37:28Z] <swfrench@deploy2002> Finished scap sync-world: Deployment to clear noop chart diff from 1085491 - T372604 T377040 (duration: 01m 49s)

Change #1085494 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mwdebug-next: php.version to 8.1

https://gerrit.wikimedia.org/r/1085494

Change #1087983 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: switch mw-debug "next" to 8.1

https://gerrit.wikimedia.org/r/1087983

Change #1087984 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] scap: add mw-debug "next" testservers check

https://gerrit.wikimedia.org/r/1087984

Change #1085494 merged by jenkins-bot:

[operations/deployment-charts@master] mwdebug-next: php.version to 8.1

https://gerrit.wikimedia.org/r/1085494

Change #1087983 merged by Scott French:

[operations/puppet@production] hieradata: switch mw-debug "next" to 8.1

https://gerrit.wikimedia.org/r/1087983

Mentioned in SAL (#wikimedia-operations) [2024-11-13T18:48:17Z] <swfrench@deploy2002> Started scap sync-world: Deployment to switch mwdebug-next to publish-81 - T372604

Mentioned in SAL (#wikimedia-operations) [2024-11-13T18:50:11Z] <swfrench@deploy2002> Finished scap sync-world: Deployment to switch mwdebug-next to publish-81 - T372604 (duration: 01m 53s)

Scott_French changed the task status from Stalled to In Progress. Wed, Nov 13, 7:26 PM

The mwdebug-next deployments are now running 8.1 and pass the "standard" suite of httpbb checks that we use to validate deployments.

There are one or two straggler items I'd like to close out before resolving, but this is pretty close to done.

Change #1078736 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: add mwdebug-next to MEDIAWIKI_SERVICES

https://gerrit.wikimedia.org/r/1078736

Change #1092309 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-debug: remove replicas override on -next

https://gerrit.wikimedia.org/r/1092309

Change #1092309 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: remove replicas override on -next

https://gerrit.wikimedia.org/r/1092309

Change #1087984 merged by Scott French:

[operations/puppet@production] scap: add mw-debug "next" testservers check

https://gerrit.wikimedia.org/r/1087984

Mentioned in SAL (#wikimedia-operations) [2024-11-18T19:15:38Z] <swfrench@deploy2002> Started scap sync-world: Test deployment after adding mwdebug-next check command - T372604

Mentioned in SAL (#wikimedia-operations) [2024-11-18T19:17:10Z] <swfrench@deploy2002> Finished scap sync-world: Test deployment after adding mwdebug-next check command - T372604 (duration: 01m 31s)

Alright, I believe that's everything tracked here. The next and pinkunicorn deployments should be pretty much identical at this point, aside from the obvious difference in image.