In T365253#9829691, @MoritzMuehlenhoff wrote:

In T365253#9829677, @elukey wrote:

I checked the dragonfly repo and I have a question about building for bookworm (didn't find it in https://wikitech.wikimedia.org/wiki/Dragonfly) - since we are going to have two os versions, how should we manage the master branch's debian changelog? Namely, should I create a new branch from master for bookworm, or do you prefer another road?

I'd keep it simple and simply move master to bookworm, the legacy packages won't be updated any further and the dragonfly* super nodes will also need to be moved off buster soon.

Thu, Jun 13, 7:31 AM · Machine-Learning-Team, serviceops, Kubernetes

Wed, Jun 12

JMeybohm triaged T362408: Migration to containerd and away from docker as High priority.

Wed, Jun 12, 3:18 PM · Prod-Kubernetes, Kubernetes, serviceops

JMeybohm updated the task description for T362978: Update all helm modules and charts to be compatible with the restricted PSS.

Wed, Jun 12, 2:17 PM · Patch-For-Review, serviceops, Prod-Kubernetes

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.

Wed, Jun 12, 1:21 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops

JMeybohm closed T346638: Rename the envoy's uses_ingress option to sets_sni as Resolved.

Wed, Jun 12, 9:06 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Tue, Jun 11

JMeybohm added a comment to T321899: Create mw-videoscaler helmfile deployment.

Not sure if this is the source of it, but full CI runs do fail because the services/mw-videoscaler/staging fails to render

Tue, Jun 11, 4:00 PM · Release-Engineering-Team (Seen), serviceops, MW-on-K8s

JMeybohm created T367200: mw-script fails to render in CI.

Tue, Jun 11, 3:57 PM · MW-on-K8s, serviceops

JMeybohm closed T345274: Remove similar-users service from k8s as Resolved.

Finally gone...

Tue, Jun 11, 1:12 PM · Similarusers, serviceops

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .

Tue, Jun 11, 1:12 PM · Patch-For-Review, Machine-Learning-Team, serviceops

JMeybohm updated the task description for T345274: Remove similar-users service from k8s.

Tue, Jun 11, 1:01 PM · Similarusers, serviceops

JMeybohm added a comment to T335491: Provide better long-term storage for translation models.

I just had to deploy machinetranslation for T346638: Rename the envoy's uses_ingress option to sets_sni and noticed container startup times of around 5 minutes (thinking something went totally wrong). I'm still seeing data getting pulled from peopleweb - are the plans to improve this still ongoing?

Tue, Jun 11, 1:01 PM · Language-Team (Language-2024-April-June), SRE-swift-storage, MinT, CX-deployments

JMeybohm updated the task description for T345274: Remove similar-users service from k8s.

Tue, Jun 11, 11:49 AM · Similarusers, serviceops

JMeybohm updated the task description for T345274: Remove similar-users service from k8s.

Tue, Jun 11, 11:25 AM · Similarusers, serviceops

JMeybohm updated the task description for T345274: Remove similar-users service from k8s.

Tue, Jun 11, 11:24 AM · Similarusers, serviceops

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .

Tue, Jun 11, 10:48 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Fri, Jun 7

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.

Fri, Jun 7, 3:38 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops

JMeybohm updated subscribers of T362310: Implement global ratelimiting in our service mesh.

The ratelimit service has been deployed to staging and prod wikikube clusters.
What's left to be done is to configure cirrus-streaming-updater to use it (see https://wikitech.wikimedia.org/wiki/Ratelimit#Enable/opt_in_to_rate_limiting). From all the values files I'm not sure which components (all?) should be rate limited, so I'd like to leave that change to you @pfischer / @bking / @dcausse. Feel free to send it my way for review/sync with me for the deployment so we can verify everything works as expected.

Fri, Jun 7, 12:14 PM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

JMeybohm updated the task description for T362310: Implement global ratelimiting in our service mesh.

Fri, Jun 7, 9:53 AM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

Thu, Jun 6

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .

Thu, Jun 6, 4:39 PM · Patch-For-Review, Machine-Learning-Team, serviceops

JMeybohm added a comment to T366481: registry2004 sometimes reporting: too many open files problems.

I see that we run nginx with the default debian nginx.conf which has worker_connections 768; and no worker_rlimit_nofile set. The generic tlsproxy module in puppet uses worker_connections 131072 (no idea where that number comes from) and worker_rlimit_nofile 131072 * 2.

Thu, Jun 6, 9:45 AM · Patch-For-Review, serviceops, Wikimedia-production-error

Wed, Jun 5

JMeybohm triaged T366481: registry2004 sometimes reporting: too many open files problems as High priority.

Unfortunately this is actually nginx complaining:

Wed, Jun 5, 2:22 PM · Patch-For-Review, serviceops, Wikimedia-production-error

JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.

Wed, Jun 5, 1:00 PM · serviceops, Prod-Kubernetes, Kubernetes

JMeybohm edited P11638 smaller_gerritbot_comments.js.

Wed, Jun 5, 12:02 PM · JavaScript, Phabricator

JMeybohm edited P11638 smaller_gerritbot_comments.js.

Wed, Jun 5, 12:00 PM · JavaScript, Phabricator

Mon, Jun 3

JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

In T353464#9854759, @jijiki wrote:

In T353464#9854751, @CDanis wrote:

In T353464#9854420, @JMeybohm wrote:

In T353464#9845947, @jijiki wrote:

Current status:

As I see it we're currently also still running the ganeti etcd instances in codfw and eqiad which I think does limit the performance of the etcd cluster by quite a bit. Was it a deliberate decision to not remove them?

I think more that we ran out of time to make changes last week. Removing them from the etcd cluster ahead of time seems fine to me, at least.

Any objections to wait for T366204 and T366205 to be completed before we remove the ganeti VMs?

While I you re probably right, I think it is feels slightly easier to just get rid of the old stuff all at once, logistically

Mon, Jun 3, 2:29 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm created T366470: Support creating phab tasks in wmflib.phabricator.

Mon, Jun 3, 12:58 PM · Infrastructure-Foundations, SRE-tools

JMeybohm merged T366085: Relabel kubernetes2032 to wikikube-worker2002 into T365712: Relabel codfw Kubernetes hosts .

Mon, Jun 3, 12:45 PM · SRE, serviceops, ops-codfw, DC-Ops

JMeybohm merged task T366085: Relabel kubernetes2032 to wikikube-worker2002 into T365712: Relabel codfw Kubernetes hosts .

Mon, Jun 3, 12:45 PM · SRE, ops-codfw, DC-Ops

JMeybohm merged T366468: Relabel kubernetes2023 to wikikube-worker2001 into T365712: Relabel codfw Kubernetes hosts .

Mon, Jun 3, 12:43 PM · SRE, serviceops, ops-codfw, DC-Ops

JMeybohm merged task T366468: Relabel kubernetes2023 to wikikube-worker2001 into T365712: Relabel codfw Kubernetes hosts .

Mon, Jun 3, 12:43 PM · SRE, ops-codfw, DC-Ops

JMeybohm created T366468: Relabel kubernetes2023 to wikikube-worker2001.

Mon, Jun 3, 12:35 PM · SRE, ops-codfw, DC-Ops

JMeybohm added a comment to T366465: Extend puppet ipresolve() to support SRV records.

In T366465#9854525, @Joe wrote:

I also checked ferm's own @resolve function and it doesn't support SRV records although adding such support wouldn't be too hard either.

Mon, Jun 3, 12:19 PM · Puppet-Core, serviceops, Infrastructure-Foundations

JMeybohm added a comment to T366465: Extend puppet ipresolve() to support SRV records.

In T366465#9854495, @taavi wrote:

Does dnsquery::srv do what you need here?

Mon, Jun 3, 12:13 PM · Puppet-Core, serviceops, Infrastructure-Foundations

JMeybohm created T366465: Extend puppet ipresolve() to support SRV records.

Mon, Jun 3, 11:36 AM · Puppet-Core, serviceops, Infrastructure-Foundations

JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

In T353464#9845947, @jijiki wrote:

Current status:

As I see it we're currently also still running the ganeti etcd instances in codfw and eqiad which I think does limit the performance of the etcd cluster by quite a bit. Was it a deliberate decision to not remove them?

Mon, Jun 3, 11:20 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm added a comment to T365855: Stop hardcoding k8s master (k8s API) endpoint IP addresses.

This is basically T287491: Allow to address Kubernetes API servers from NetworkPolicy
IMHO the easiest and less intrusive way to do this with an upstream helm chart is to just add a calico networkpolicy template to the chart (the file could even be prefixed with wmf-) that just creates that one policy. The linked phab task should contain some examples for that.

Mon, Jun 3, 9:37 AM · Observability-Tracing

Thu, May 23

JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.

Thu, May 23, 11:28 AM · serviceops, Prod-Kubernetes, Kubernetes

JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.

Thu, May 23, 11:28 AM · serviceops, Prod-Kubernetes, Kubernetes

JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

kubernetes2023 is still cordoned and depooled for additional tests of the move v-lan process

Thu, May 23, 9:34 AM · Kubernetes, Prod-Kubernetes, serviceops

JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

After the reimage I needed to run the following for calico to start up properly:

Thu, May 23, 9:31 AM · Kubernetes, Prod-Kubernetes, serviceops

JMeybohm created T365687: Improve calico-typha firewall rules.

Thu, May 23, 9:30 AM · serviceops, Prod-Kubernetes, Kubernetes

Wed, May 22

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.

Wed, May 22, 4:29 PM · serviceops, MW-on-K8s

JMeybohm updated the task description for T365571: Rename wikikube worker nodes during OS reimage.

Wed, May 22, 3:01 PM · Kubernetes, Prod-Kubernetes, serviceops

JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

I've cleared out kubernetes2023 and kubernetes2032 for you to run tests. As the hosts are pooled=inactive and cordoned in k8s all you have to do is to downtime them (which the cookbooks probably do).

Wed, May 22, 2:41 PM · Kubernetes, Prod-Kubernetes, serviceops

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.

Wed, May 22, 1:41 PM · serviceops, MW-on-K8s

JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

@ayounsi I think we could test the rename cookbook together with T350152: Automation to change a server's vlan on the already cordoned kubernetes2023.codfw.wmnet, right?

Wed, May 22, 1:33 PM · Kubernetes, Prod-Kubernetes, serviceops

JMeybohm added a subtask for T336861: Fix naming confusion around main/wikikube kubernetes clusters: T365571: Rename wikikube worker nodes during OS reimage.

Wed, May 22, 11:35 AM · Patch-For-Review, serviceops, Prod-Kubernetes

JMeybohm added a parent task for T365571: Rename wikikube worker nodes during OS reimage: T336861: Fix naming confusion around main/wikikube kubernetes clusters.

Wed, May 22, 11:35 AM · Kubernetes, Prod-Kubernetes, serviceops

JMeybohm created T365571: Rename wikikube worker nodes during OS reimage.

Wed, May 22, 9:58 AM · Kubernetes, Prod-Kubernetes, serviceops

May 21 2024

JMeybohm added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I might be missing something obvious here, but I've two questions:

Why add the statsd deployment to the mediawiki chart instead of using a statsd chart, adding a statsd release to mediawiki helmfile.yaml's?
Why do we need to tunnel statsd trough the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?

May 21 2024, 8:41 AM · Patch-For-Review, MW-on-K8s, serviceops, SRE Observability (FY2023/2024-Q4), Observability-Metrics

May 17 2024

JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.

May 17 2024, 3:11 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm closed T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes as Resolved.

Both staging clusters have been migrated to stacked control-planes

May 17 2024, 2:23 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm closed T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, a subtask of T353464: Migrate wikikube control planes to hardware nodes, as Resolved.

May 17 2024, 2:20 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm added a comment to T365253: Allow Kubernetes workers to be deployed on Bookworm.

For T362408: Migration to containerd and away from docker we're planning to backport containerd from bookworm to bullseye. Maybe it would be feasible to backport runc as well (although this won't help you with T363191: Test if we can avoid ROCm debian packages on k8s nodes ofc.)?

May 17 2024, 1:58 PM · Machine-Learning-Team, serviceops, Kubernetes

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

May 17 2024, 1:53 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm triaged T365224: ipoid charts app.job module has out of band changes as High priority.

May 17 2024, 8:56 AM · serviceops

JMeybohm created T365224: ipoid charts app.job module has out of band changes.

May 17 2024, 8:56 AM · serviceops

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .

May 17 2024, 8:18 AM · Patch-For-Review, Machine-Learning-Team, serviceops

May 16 2024

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .

May 16 2024, 1:11 PM · Patch-For-Review, Machine-Learning-Team, serviceops

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

May 16 2024, 12:19 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm added a comment to T362310: Implement global ratelimiting in our service mesh.

Successfully published image docker-registry.discovery.wmnet/ratelimit:9.0.2-20240503.3fcc360, supporting hot reload of gRPC certs. This should unblock deploying the ratelimit service.

May 16 2024, 7:46 AM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

JMeybohm committed rOSERb835affd35dc: Vendor dependencies.

Vendor dependencies

May 16 2024, 7:14 AM

JMeybohm committed rOSERfc2df91ff321: Add CertProvider to hot reload TLS certs for gRPC service.

Add CertProvider to hot reload TLS certs for gRPC service

May 16 2024, 7:06 AM

May 15 2024

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

May 15 2024, 1:50 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

May 15 2024, 1:06 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

May 14 2024

JMeybohm added a comment to T362984: GPU errors in hf image in ml-staging.

You got me @elukey :-p
For reasons I did not try to understand yet, the mknod cgroup permission is the culprit. Without it, the access() call fails:

May 14 2024, 6:11 PM · Lift-Wing, Machine-Learning-Team

JMeybohm added a comment to T362984: GPU errors in hf image in ml-staging.

Two more data points that don't help at all:

jayme@ml-staging2001:~$ sudo docker exec -it --user 0 k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
jayme@ml-staging2001:~$ sudo docker exec -it --user 0 --privileged k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False

May 14 2024, 5:06 PM · Lift-Wing, Machine-Learning-Team

JMeybohm added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

Also worth noting that mediawiki already calls sessionstore via it's envoy sidecar, so we do have telemetry data from prod and we should be able to see the impact there pretty quickly as well: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=sessionstore

May 14 2024, 11:48 AM · Patch-For-Review, serviceops, Data-Persistence

JMeybohm closed T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.

May 14 2024, 9:55 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm closed T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver as Resolved.

May 14 2024, 9:55 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm closed T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.

May 14 2024, 8:31 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm closed T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver as Resolved.

It's not clear to me what happened here. The makevm call was unable to detect a puppet run, at that time sudo rules where not present on the machine and puppet did not ran. I've terminated makevm and ran a reimage with more or less the same result but at the time the cookbook was waiting for puppet, there sudo wasn't even installed. install_console was not accessible in both cases, manual root logins via ganeti console did not work as well.
Over night things seem to have cleared up. Even though the cookbook failed puppet did ran, I'm able to login and sudo works (as well as successive puppet runs)...

May 14 2024, 8:31 AM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

May 13 2024

JMeybohm added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

kubestagemaster2005 got stuck at:

May 13 2024, 4:28 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

JMeybohm added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

kubestagemaster2004 is done (I messed up the phab ID in the cumin command, so report ended up in https://phabricator.wikimedia.org/T363310#9790605)

May 13 2024, 2:51 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

JMeybohm created T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

May 13 2024, 2:18 PM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

JMeybohm created T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

May 13 2024, 1:16 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

JMeybohm added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

I tend to agree, also for sake of alignment of sessionstore with the rest of our services. Unfortunately this feels like the more involved change (A change to sessionstore was somewhat high risk ...) - but I think it's also true that it should not add much to latency.

May 13 2024, 7:41 AM · Patch-For-Review, serviceops, Data-Persistence

JMeybohm added a comment to T357353: Application Security Review Request : NetworkSession MediaWiki extension .

In T357353#9782050, @EBernhardson wrote:

In T357353#9748141, @sbassett wrote:

In T357353#9748131, @pfischer wrote:

@EBernhardson, according to @JMeybohm there is no way to limit the IP ranges of pod/service/namespace to associated them closely with an application (SUP).

Ok. I know it's hackier, but could that instead be managed via extension config?

I had a ponder, but I'm not sure how yet. With both our app and the mw app servers living in k8s there might be something related we can do with network policies, but I'm not certain.

To be precise here: If the service backing this as well as all consumers are running in the same k8s cluster we could implement network policies that will only allow access from certain workload in the cluster. But I would advice against relying on that because:

We still have mediawiki appservers in hardware and there will probably be some snowflakes for which I don't know the implications for this
We won't be able to use this service cross-dc (as we are with all other active/active services), e.g. depooling in an emergency etc. (which would make this a snowflake)

May 13 2024, 7:34 AM · NetworkSession, Discovery-Search (Current work), secscrum, Security, Application Security Reviews

May 8 2024

JMeybohm added a comment to T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment.

I've not looked in detail (and I probably will not be able to before end of next week) but immediately worrisome to me is the daemonset running with superpowers we decided to not do this with calico for example and instead we distribute the cni plugin via debian packages - would that potentially be an option as well or has it been considered?

May 8 2024, 4:49 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review

JMeybohm renamed T362310: Implement global ratelimiting in our service mesh from SUP rate-limit fetch to Implement global ratelimiting in our service mesh.

May 8 2024, 4:22 PM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

May 7 2024

JMeybohm updated subscribers of T287491: Allow to address Kubernetes API servers from NetworkPolicy.

I did deploy the cert-manager changes to aux, @brouberol did dse and @klausman will take care of ml clusters, thanks all

May 7 2024, 9:11 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

May 6 2024

JMeybohm closed T362938: Degraded RAID on mw2382 as Resolved.

In T362938#9774724, @Jhancock.wm wrote:

Forgot I left it there. All yours now!

May 6 2024, 4:38 PM · serviceops, SRE, ops-codfw

JMeybohm committed rOSERc6e5b3951b04: Add .gitreview (authored by QChris).

Add .gitreview

May 6 2024, 3:10 PM

JMeybohm committed rOSERd58d2f1ab9c3: Vendor dependencies.

Vendor dependencies

May 6 2024, 3:10 PM

JMeybohm closed T364148: Configure Gerrit permissions on operations/software/envoyproxy/ratelimiter for merging from upstream as Resolved.

Pushing both branches worked now, thanks!

May 6 2024, 3:09 PM · Release-Engineering-Team, Gerrit

JMeybohm updated subscribers of T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

May 6 2024, 12:29 PM · Patch-For-Review, serviceops, Data-Persistence

JMeybohm added a comment to T362938: Degraded RAID on mw2382.

In T362938#9764496, @Jhancock.wm wrote:

@JMeybohm papaul helped me identify the missing disk. I replaced it with a compatible drive. please let me know if that fixed the issue. Thanks.

May 6 2024, 8:06 AM · serviceops, SRE, ops-codfw

JMeybohm added a comment to T364148: Configure Gerrit permissions on operations/software/envoyproxy/ratelimiter for merging from upstream.

Thank you!, almost there. It now fails with:

May 6 2024, 7:54 AM · Release-Engineering-Team, Gerrit

May 3 2024

JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

In T345823#9766568, @cmooney wrote:

In T345823#9763845, @JMeybohm wrote:

We decided during migration of production to a bigger Pod IP space that this will not be necessary for staging and it actually is not. The issue there (as we figured later) that the IP space is split into /26 blocks, effectively limiting the cluster size to 4 nodes (including control-plane). The change to the IP block size was made to overcome this limitation without the need of changing the Pod IP space (and therefore having to reconfigure that in various places).

Ok cool, and yep makes perfect sense in staging we won't have a high number of pods. Plan sounds good so, I just wanted to make sure we weren't being too conservative with the allocations.

Regarding the current limit of 50 routes announced max from each host I think that is still ok? We're still slightly confused about how it tripped, seems like during the change the host briefly sent more than we expected? But should be ok in general?

May 3 2024, 9:39 AM · Prod-Kubernetes, Kubernetes, serviceops

May 2 2024

JMeybohm claimed T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

In T363996#9763646, @MoritzMuehlenhoff wrote:

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

May 2 2024, 1:53 PM · Patch-For-Review, serviceops, Data-Persistence

JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

In T345823#9763715, @cmooney wrote:

Not sure if it might be worth taking a step back and weighing up what's happening here?

As I understand it there is a /24 IPv4 allocation for POD IPs for this cluster, and with the current IP block size at /26 that only provides 4 blocks?

Without knowing the details there are probably two ways to deal with this:

Allocate a large block than a /24 for such use, providing more /26 blocks that can be used

Keep the /24 overall allocation as it is, but make the IP blocks smaller so there are more overall (/28, /29, /30 or whatever)

From a netops perspective we are relatively agnostic here, however given this is private IP space we have some flexibility. We definitely should try to avoid making any decisions that will potentially bite us down the road. Are we potentially putting too much of a limit on the number of potential PODs per host if we use a block size of /28 or less? Might it be better to keep those block allocations at /26 to allow for growth?

Should be fine either way, but just want to raise the question. We also need to size the 'prefix limit' on our network gear appropriately, current value of 50 should be ok for /28, but we may want to adjust up if using /30 or /32.

May 2 2024, 1:31 PM · Prod-Kubernetes, Kubernetes, serviceops

JMeybohm closed T345823: Wikikube staging clusters are out of IPv4 Pod IP's as Resolved.

May 2 2024, 8:51 AM · Prod-Kubernetes, Kubernetes, serviceops

JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

staging-eqiad has been migrated to /28 blocks as well

May 2 2024, 8:51 AM · Prod-Kubernetes, Kubernetes, serviceops

JMeybohm added a comment to T362938: Degraded RAID on mw2382.

In T362938#9761317, @jnuche wrote:

Scap failed to connect to this host today during the MediaWiki train while trying to preload the MW image:
15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

Would it be possibly to remove it temporarily from the list of K8s workers while work is done on it?

May 2 2024, 8:14 AM · serviceops, SRE, ops-codfw

JMeybohm
User

Projects (7)
View All

Calendar

Today

Tomorrow

Sunday

User Details

Recent Activity
View All

Wed, Jun 19

Tue, Jun 18

Mon, Jun 17

Thu, Jun 13

Wed, Jun 12

Tue, Jun 11

Fri, Jun 7

Thu, Jun 6

Wed, Jun 5

Mon, Jun 3

Thu, May 23

Wed, May 22

May 21 2024

May 17 2024

May 16 2024

May 15 2024

May 14 2024

May 13 2024

May 8 2024

May 7 2024

May 6 2024

May 3 2024

May 2 2024

JMeybohmUser

Projects (7)View All

Calendar

Today

Tomorrow

Sunday

User Details

Recent ActivityView All

Wed, Jun 19

Tue, Jun 18

Mon, Jun 17

Thu, Jun 13

Wed, Jun 12

Tue, Jun 11

Fri, Jun 7

Thu, Jun 6

Wed, Jun 5

Mon, Jun 3

Thu, May 23

Wed, May 22

May 21 2024

May 17 2024

May 16 2024

May 15 2024

May 14 2024

May 13 2024

May 8 2024

May 7 2024

May 6 2024

May 3 2024

May 2 2024

JMeybohm
User

Projects (7)
View All

Recent Activity
View All