JMeybohm
User

Projects (7)

User Details

User Since
Apr 2 2020, 9:01 AM (220 w, 20 h)
Availability
Available
IRC Nick
jayme
LDAP User
JMeybohm
MediaWiki User
JMeybohm (WMF) [ Global Accounts ]

Recent Activity

Wed, Jun 19

JMeybohm updated the task description for T367544: Cloud VPS "packaging" project Buster deprecation.
Wed, Jun 19, 2:53 PM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)

Tue, Jun 18

JMeybohm triaged T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30) as Low priority.
Tue, Jun 18, 2:01 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created T367880: Set AppArmor profile via SecurityContext rather than annotations (k8s >=1.30).
Tue, Jun 18, 2:01 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a project to T365742: Remove blubberoid LVS/k8s service: serviceops.
Tue, Jun 18, 11:06 AM · serviceops, Traffic, Release-Engineering-Team (Priority Backlog 📥), Release Pipeline (Blubber)

Mon, Jun 17

JMeybohm added a comment to T367544: Cloud VPS "packaging" project Buster deprecation.

Any objections to just removing the VM, since we moved to (re-)packaging upstream (https://wikitech.wikimedia.org/wiki/Envoy#Building_envoy_for_WMF)?

Mon, Jun 17, 12:47 PM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)

Thu, Jun 13

JMeybohm added a project to T365687: Improve calico-typha firewall rules: serviceops.
Thu, Jun 13, 8:19 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T365253: Allow Kubernetes workers to be deployed on Bookworm.

I checked the dragonfly repo and I have a question about building for bookworm (didn't find it in https://wikitech.wikimedia.org/wiki/Dragonfly) - since we are going to have two os versions, how should we manage the master branch's debian changelog? Namely, should I create a new branch from master for bookworm, or do you prefer another road?

I'd keep it simple and just move master to bookworm; the legacy packages won't be updated any further and the dragonfly* super nodes will also need to be moved off buster soon.

Thu, Jun 13, 7:31 AM · Machine-Learning-Team, serviceops, Kubernetes

Wed, Jun 12

JMeybohm triaged T362408: Migration to containerd and away from docker as High priority.
Wed, Jun 12, 3:18 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated the task description for T362978: Update all helm modules and charts to be compatible with the restricted PSS.
Wed, Jun 12, 2:17 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Wed, Jun 12, 1:21 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T346638: Rename the envoy's uses_ingress option to sets_sni as Resolved.
Wed, Jun 12, 9:06 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Tue, Jun 11

JMeybohm added a comment to T321899: Create mw-videoscaler helmfile deployment.

Not sure if this is the source of it, but full CI runs do fail because the services/mw-videoscaler/staging fails to render

Tue, Jun 11, 4:00 PM · Release-Engineering-Team (Seen), serviceops, MW-on-K8s
JMeybohm created T367200: mw-script fails to render in CI.
Tue, Jun 11, 3:57 PM · MW-on-K8s, serviceops
JMeybohm closed T345274: Remove similar-users service from k8s as Resolved.

Finally gone...

Tue, Jun 11, 1:12 PM · Similarusers, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Tue, Jun 11, 1:12 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm updated the task description for T345274: Remove similar-users service from k8s.
Tue, Jun 11, 1:01 PM · Similarusers, serviceops
JMeybohm added a comment to T335491: Provide better long-term storage for translation models.

I just had to deploy machinetranslation for T346638: Rename the envoy's uses_ingress option to sets_sni and noticed container startup times of around 5 minutes (thinking something went totally wrong). I'm still seeing data getting pulled from peopleweb - are the plans to improve this still ongoing?

Tue, Jun 11, 1:01 PM · Language-Team (Language-2024-April-June), SRE-swift-storage, MinT, CX-deployments
JMeybohm updated the task description for T345274: Remove similar-users service from k8s.
Tue, Jun 11, 11:49 AM · Similarusers, serviceops
JMeybohm updated the task description for T345274: Remove similar-users service from k8s.
Tue, Jun 11, 11:25 AM · Similarusers, serviceops
JMeybohm updated the task description for T345274: Remove similar-users service from k8s.
Tue, Jun 11, 11:24 AM · Similarusers, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Tue, Jun 11, 10:48 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Fri, Jun 7

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Fri, Jun 7, 3:38 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated subscribers of T362310: Implement global ratelimiting in our service mesh.

The ratelimit service has been deployed to staging and prod wikikube clusters.
What's left to be done is to configure cirrus-streaming-updater to use it (see https://wikitech.wikimedia.org/wiki/Ratelimit#Enable/opt_in_to_rate_limiting). From all the values files I'm not sure which components (all?) should be rate limited, so I'd like to leave that change to you @pfischer / @bking / @dcausse. Feel free to send it my way for review/sync with me for the deployment so we can verify everything works as expected.

Fri, Jun 7, 12:14 PM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch
JMeybohm updated the task description for T362310: Implement global ratelimiting in our service mesh.
Fri, Jun 7, 9:53 AM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

Thu, Jun 6

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Thu, Jun 6, 4:39 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm added a comment to T366481: registry2004 sometimes reporting: too many open files problems.

I see that we run nginx with the default debian nginx.conf which has worker_connections 768; and no worker_rlimit_nofile set. The generic tlsproxy module in puppet uses worker_connections 131072 (no idea where that number comes from) and worker_rlimit_nofile 131072 * 2.
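
As a rough illustration of the numbers above (a sketch under an assumption, not the logic of the puppet module): one plausible reading of the factor of two is that a proxying nginx worker holds one descriptor for the client socket and one for the upstream socket per connection. The helper name `min_rlimit_nofile` is hypothetical.

```python
def min_rlimit_nofile(worker_connections: int, fds_per_connection: int = 2) -> int:
    """Rough lower bound on worker_rlimit_nofile for a proxying nginx worker.

    Assumption: each proxied connection holds one client socket and one
    upstream socket. Listeners, log files and the like are ignored.
    """
    return worker_connections * fds_per_connection


# Values quoted in the comment: the tlsproxy module and the Debian default.
for connections in (131072, 768):
    print(connections, "->", min_rlimit_nofile(connections))
```

Under that assumption, the Debian default of 768 connections would already need more descriptors per worker than a typical soft limit of 1024 allows.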

Thu, Jun 6, 9:45 AM · Patch-For-Review, serviceops, Wikimedia-production-error

Wed, Jun 5

JMeybohm triaged T366481: registry2004 sometimes reporting: too many open files problems as High priority.

Unfortunately this is actually nginx complaining:

Wed, Jun 5, 2:22 PM · Patch-For-Review, serviceops, Wikimedia-production-error
JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.
Wed, Jun 5, 1:00 PM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Wed, Jun 5, 12:02 PM · JavaScript, Phabricator
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Wed, Jun 5, 12:00 PM · JavaScript, Phabricator

Mon, Jun 3

JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

Current status:

As I see it we're currently also still running the ganeti etcd instances in codfw and eqiad which I think does limit the performance of the etcd cluster by quite a bit. Was it a deliberate decision to not remove them?

I think more that we ran out of time to make changes last week. Removing them from the etcd cluster ahead of time seems fine to me, at least.

Any objections to waiting for T366204 and T366205 to be completed before we remove the ganeti VMs?

While you're probably right, I think it feels slightly easier to just get rid of the old stuff all at once, logistically.

Mon, Jun 3, 2:29 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm created T366470: Support creating phab tasks in wmflib.phabricator.
Mon, Jun 3, 12:58 PM · Infrastructure-Foundations, SRE-tools
JMeybohm merged T366085: Relabel kubernetes2032 to wikikube-worker2002 into T365712: Relabel codfw Kubernetes hosts .
Mon, Jun 3, 12:45 PM · SRE, serviceops, ops-codfw, DC-Ops
JMeybohm merged task T366085: Relabel kubernetes2032 to wikikube-worker2002 into T365712: Relabel codfw Kubernetes hosts .
Mon, Jun 3, 12:45 PM · SRE, ops-codfw, DC-Ops
JMeybohm merged T366468: Relabel kubernetes2023 to wikikube-worker2001 into T365712: Relabel codfw Kubernetes hosts .
Mon, Jun 3, 12:43 PM · SRE, serviceops, ops-codfw, DC-Ops
JMeybohm merged task T366468: Relabel kubernetes2023 to wikikube-worker2001 into T365712: Relabel codfw Kubernetes hosts .
Mon, Jun 3, 12:43 PM · SRE, ops-codfw, DC-Ops
JMeybohm created T366468: Relabel kubernetes2023 to wikikube-worker2001.
Mon, Jun 3, 12:35 PM · SRE, ops-codfw, DC-Ops
JMeybohm added a comment to T366465: Extend puppet ipresolve() to support SRV records.

I also checked ferm's own @resolve function and it doesn't support SRV records although adding such support wouldn't be too hard either.

Mon, Jun 3, 12:19 PM · Puppet-Core, serviceops, Infrastructure-Foundations
JMeybohm added a comment to T366465: Extend puppet ipresolve() to support SRV records.

Does dnsquery::srv do what you need here?

Mon, Jun 3, 12:13 PM · Puppet-Core, serviceops, Infrastructure-Foundations
JMeybohm created T366465: Extend puppet ipresolve() to support SRV records.
Mon, Jun 3, 11:36 AM · Puppet-Core, serviceops, Infrastructure-Foundations
JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

Current status:

As I see it we're currently also still running the ganeti etcd instances in codfw and eqiad which I think does limit the performance of the etcd cluster by quite a bit. Was it a deliberate decision to not remove them?

Mon, Jun 3, 11:20 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T365855: Stop hardcoding k8s master (k8s API) endpoint IP addresses.

This is basically T287491: Allow to address Kubernetes API servers from NetworkPolicy
IMHO the easiest and least intrusive way to do this with an upstream helm chart is to just add a calico networkpolicy template to the chart (the file could even be prefixed with wmf-) that just creates that one policy. The linked phab task should contain some examples for that.

Mon, Jun 3, 9:37 AM · Observability-Tracing

Thu, May 23

JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.
Thu, May 23, 11:28 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.
Thu, May 23, 11:28 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

kubernetes2023 is still cordoned and depooled for additional tests of the move v-lan process

Thu, May 23, 9:34 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

After the reimage I needed to run the following for calico to start up properly:

Thu, May 23, 9:31 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created T365687: Improve calico-typha firewall rules.
Thu, May 23, 9:30 AM · serviceops, Prod-Kubernetes, Kubernetes

Wed, May 22

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Wed, May 22, 4:29 PM · serviceops, MW-on-K8s
JMeybohm updated the task description for T365571: Rename wikikube worker nodes during OS reimage.
Wed, May 22, 3:01 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

I've cleared out kubernetes2023 and kubernetes2032 for you to run tests. As the hosts are pooled=inactive and cordoned in k8s all you have to do is to downtime them (which the cookbooks probably do).

Wed, May 22, 2:41 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Wed, May 22, 1:41 PM · serviceops, MW-on-K8s
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

@ayounsi I think we could test the rename cookbook together with T350152: Automation to change a server's vlan on the already cordoned kubernetes2023.codfw.wmnet, right?

Wed, May 22, 1:33 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T336861: Fix naming confusion around main/wikikube kubernetes clusters: T365571: Rename wikikube worker nodes during OS reimage.
Wed, May 22, 11:35 AM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a parent task for T365571: Rename wikikube worker nodes during OS reimage: T336861: Fix naming confusion around main/wikikube kubernetes clusters.
Wed, May 22, 11:35 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created T365571: Rename wikikube worker nodes during OS reimage.
Wed, May 22, 9:58 AM · Kubernetes, Prod-Kubernetes, serviceops

May 21 2024

JMeybohm added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I might be missing something obvious here, but I've two questions:

  • Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yaml files?
  • Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?
May 21 2024, 8:41 AM · Patch-For-Review, MW-on-K8s, serviceops, SRE Observability (FY2023/2024-Q4), Observability-Metrics

May 17 2024

JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
May 17 2024, 3:11 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes as Resolved.

Both staging clusters have been migrated to stacked control-planes

May 17 2024, 2:23 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, a subtask of T353464: Migrate wikikube control planes to hardware nodes, as Resolved.
May 17 2024, 2:20 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T365253: Allow Kubernetes workers to be deployed on Bookworm.

For T362408: Migration to containerd and away from docker we're planning to backport containerd from bookworm to bullseye. Maybe it would be feasible to backport runc as well (although this won't help you with T363191: Test if we can avoid ROCm debian packages on k8s nodes ofc.)?

May 17 2024, 1:58 PM · Machine-Learning-Team, serviceops, Kubernetes
JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
May 17 2024, 1:53 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm triaged T365224: ipoid charts app.job module has out of band changes as High priority.
May 17 2024, 8:56 AM · serviceops
JMeybohm created T365224: ipoid charts app.job module has out of band changes.
May 17 2024, 8:56 AM · serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
May 17 2024, 8:18 AM · Patch-For-Review, Machine-Learning-Team, serviceops

May 16 2024

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
May 16 2024, 1:11 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
May 16 2024, 12:19 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T362310: Implement global ratelimiting in our service mesh.

Successfully published image docker-registry.discovery.wmnet/ratelimit:9.0.2-20240503.3fcc360, supporting hot reload of gRPC certs. This should unblock deploying the ratelimit service.

May 16 2024, 7:46 AM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch
JMeybohm committed rOSERb835affd35dc: Vendor dependencies.
Vendor dependencies
May 16 2024, 7:14 AM
JMeybohm committed rOSERfc2df91ff321: Add CertProvider to hot reload TLS certs for gRPC service.
Add CertProvider to hot reload TLS certs for gRPC service
May 16 2024, 7:06 AM

May 15 2024

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
May 15 2024, 1:50 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
May 15 2024, 1:06 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

May 14 2024

JMeybohm added a comment to T362984: GPU errors in hf image in ml-staging.

You got me @elukey :-p
For reasons I did not try to understand yet, the mknod cgroup permission is the culprit. Without it, the access() call fails:

May 14 2024, 6:11 PM · Lift-Wing, Machine-Learning-Team
JMeybohm added a comment to T362984: GPU errors in hf image in ml-staging.

Two more data points that don't help at all:

jayme@ml-staging2001:~$ sudo docker exec -it --user 0 k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
jayme@ml-staging2001:~$ sudo docker exec -it --user 0 --privileged k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
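
For readability, the one-liner used in both invocations can be written out as a small script. This is only a restatement of the check shown above; the function name `probe` is hypothetical. Note that `os.access` checks with the real (not effective) UID/GID, which is why both pairs are printed.

```python
import os


def probe(path: str) -> str:
    """Report real/effective IDs and whether os.access() can see `path`."""
    return "real: %d:%d, effective: %d:%d, result: %s" % (
        os.getuid(),
        os.getgid(),
        os.geteuid(),
        os.getegid(),
        os.access(path, os.F_OK),  # F_OK only tests for existence
    )


if __name__ == "__main__":
    print(probe("/dev/dri/renderD128"))
```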
May 14 2024, 5:06 PM · Lift-Wing, Machine-Learning-Team
JMeybohm added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

Also worth noting that mediawiki already calls sessionstore via its envoy sidecar, so we do have telemetry data from prod and we should be able to see the impact there pretty quickly as well: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=sessionstore

May 14 2024, 11:48 AM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm closed T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.
May 14 2024, 9:55 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver as Resolved.
May 14 2024, 9:55 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.
May 14 2024, 8:31 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver as Resolved.

It's not clear to me what happened here. The makevm call was unable to detect a puppet run; at that time sudo rules were not present on the machine and puppet had not run. I terminated makevm and ran a reimage with more or less the same result, but at the time the cookbook was waiting for puppet, sudo wasn't even installed. install_console was not accessible in either case, and manual root logins via the ganeti console did not work either.
Overnight things seem to have cleared up. Even though the cookbook failed, puppet did run, I'm able to log in, and sudo works (as well as successive puppet runs)...

May 14 2024, 8:31 AM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

May 13 2024

JMeybohm added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

kubestagemaster2005 got stuck at:

May 13 2024, 4:28 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

kubestagemaster2004 is done (I messed up the phab ID in the cumin command, so report ended up in https://phabricator.wikimedia.org/T363310#9790605)

May 13 2024, 2:51 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
JMeybohm created T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.
May 13 2024, 2:18 PM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm created T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.
May 13 2024, 1:16 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

I tend to agree, also for the sake of aligning sessionstore with the rest of our services. Unfortunately this feels like the more involved change (A change to sessionstore was somewhat high risk ...) - but I think it's also true that it should not add much to latency.

May 13 2024, 7:41 AM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm added a comment to T357353: Application Security Review Request : NetworkSession MediaWiki extension .

@EBernhardson, according to @JMeybohm there is no way to limit the IP ranges of a pod/service/namespace to associate them closely with an application (SUP).

Ok. I know it's hackier, but could that instead be managed via extension config?

I had a ponder, but I'm not sure how yet. With both our app and the mw app servers living in k8s there might be something related we can do with network policies, but I'm not certain.

To be precise here: if the service backing this, as well as all consumers, are running in the same k8s cluster, we could implement network policies that only allow access from certain workloads in the cluster. But I would advise against relying on that because:

  • We still have mediawiki appservers in hardware and there will probably be some snowflakes for which I don't know the implications for this
  • We won't be able to use this service cross-dc (as we are with all other active/active services), e.g. depooling in an emergency etc. (which would make this a snowflake)
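
For reference, the kind of in-cluster restriction described above could look roughly like the following standard Kubernetes NetworkPolicy, built here as a Python dict. This is a sketch only: the policy name, namespaces and labels are hypothetical placeholders, not values from any real chart.

```python
import json


def allow_from_workload(name, namespace, target_labels, client_namespace, client_labels):
    """Build a NetworkPolicy that admits ingress only from one labelled workload."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy protects.
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{
                    # Both selectors in one entry: namespace AND pod must match.
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": client_namespace}
                    },
                    "podSelector": {"matchLabels": client_labels},
                }]
            }],
        },
    }


# Hypothetical names, for illustration only.
policy = allow_from_workload(
    "allow-example-client", "example-service",
    {"app": "example-service"}, "example-client", {"app": "example-client"},
)
print(json.dumps(policy, indent=2))
```

Such a policy only covers traffic that originates inside the same cluster, which is exactly the limitation the two points above describe.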
May 13 2024, 7:34 AM · NetworkSession, Discovery-Search (Current work), secscrum, Security, Application Security Reviews

May 8 2024

JMeybohm added a comment to T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment.

I've not looked in detail (and I probably will not be able to before the end of next week), but what immediately worries me is the daemonset running with superpowers. We decided not to do this with calico, for example, and instead distribute the CNI plugin via Debian packages. Would that potentially be an option here as well, or has it been considered?

May 8 2024, 4:49 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review
JMeybohm renamed T362310: Implement global ratelimiting in our service mesh from SUP rate-limit fetch to Implement global ratelimiting in our service mesh.
May 8 2024, 4:22 PM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

May 7 2024

JMeybohm updated subscribers of T287491: Allow to address Kubernetes API servers from NetworkPolicy.

I did deploy the cert-manager changes to aux, @brouberol did dse and @klausman will take care of ml clusters, thanks all

May 7 2024, 9:11 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

May 6 2024

JMeybohm closed T362938: Degraded RAID on mw2382 as Resolved.

Forgot I left it there. All yours now!

May 6 2024, 4:38 PM · serviceops, SRE, ops-codfw
JMeybohm committed rOSERc6e5b3951b04: Add .gitreview (authored by QChris).
Add .gitreview
May 6 2024, 3:10 PM
JMeybohm committed rOSERd58d2f1ab9c3: Vendor dependencies.
Vendor dependencies
May 6 2024, 3:10 PM
JMeybohm closed T364148: Configure Gerrit permissions on operations/software/envoyproxy/ratelimiter for merging from upstream as Resolved.

Pushing both branches worked now, thanks!

May 6 2024, 3:09 PM · Release-Engineering-Team, Gerrit
JMeybohm updated subscribers of T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

May 6 2024, 12:29 PM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

@JMeybohm papaul helped me identify the missing disk. I replaced it with a compatible drive. please let me know if that fixed the issue. Thanks.

May 6 2024, 8:06 AM · serviceops, SRE, ops-codfw
JMeybohm added a comment to T364148: Configure Gerrit permissions on operations/software/envoyproxy/ratelimiter for merging from upstream.

Thank you! Almost there. It now fails with:

May 6 2024, 7:54 AM · Release-Engineering-Team, Gerrit

May 3 2024

JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

We decided during the migration of production to a bigger Pod IP space that this would not be necessary for staging, and it actually is not. The issue there (as we figured out later) is that the IP space is split into /26 blocks, effectively limiting the cluster size to 4 nodes (including control-plane). The change to the IP block size was made to overcome this limitation without the need to change the Pod IP space (and therefore having to reconfigure it in various places).
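
The arithmetic behind the 4-node limit can be checked with Python's `ipaddress` module. This is a minimal sketch: the pool `10.0.0.0/24` is a placeholder rather than the real staging allocation, and it assumes every node needs at least one block of its own.

```python
import ipaddress


def block_count(pool: str, block_prefix: int) -> int:
    """Number of blocks of size /block_prefix that fit into the pod IP pool."""
    network = ipaddress.ip_network(pool)
    return sum(1 for _ in network.subnets(new_prefix=block_prefix))


pool = "10.0.0.0/24"  # placeholder pool, same size as the /24 discussed in this task
for prefix in (26, 28):
    print(f"/{prefix} blocks in {pool}: {block_count(pool, prefix)}")
```

With /26 blocks the pool yields 4 blocks, hence at most 4 nodes; with /28 it yields 16.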

Ok cool, and yep makes perfect sense in staging we won't have a high number of pods. Plan sounds good so, I just wanted to make sure we weren't being too conservative with the allocations.

Regarding the current limit of 50 routes announced max from each host I think that is still ok? We're still slightly confused about how it tripped, seems like during the change the host briefly sent more than we expected? But should be ok in general?

May 3 2024, 9:39 AM · Prod-Kubernetes, Kubernetes, serviceops

May 2 2024

JMeybohm claimed T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

May 2 2024, 1:53 PM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

Not sure if it might be worth taking a step back and weighing up what's happening here?

As I understand it there is a /24 IPv4 allocation for POD IPs for this cluster, and with the current IP block size at /26 that only provides 4 blocks?

Without knowing the details there are probably two ways to deal with this:

  1. Allocate a larger block than a /24 for such use, providing more /26 blocks that can be used
  2. Keep the /24 overall allocation as it is, but make the IP blocks smaller so there are more overall (/28, /29, /30 or whatever)

From a netops perspective we are relatively agnostic here, however given this is private IP space we have some flexibility. We definitely should try to avoid making any decisions that will potentially bite us down the road. Are we potentially putting too much of a limit on the number of potential PODs per host if we use a block size of /28 or less? Might it be better to keep those block allocations at /26 to allow for growth?

Should be fine either way, but just want to raise the question. We also need to size the 'prefix limit' on our network gear appropriately, current value of 50 should be ok for /28, but we may want to adjust up if using /30 or /32.
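
To make the trade-off concrete, the sketch below lists, for a /24 pool, how many blocks each block size yields and how many addresses each block holds, and compares the block count with the prefix limit of 50 mentioned above. It assumes the worst case in which a single host ends up announcing every block as a separate route; actual announcements depend on how blocks are distributed and aggregated.

```python
import ipaddress

PREFIX_LIMIT = 50  # current value quoted in the comment above


def block_stats(pool: str, block_prefix: int):
    """Return (number of blocks, addresses per block) for the given block size."""
    network = ipaddress.ip_network(pool)
    blocks = 2 ** (block_prefix - network.prefixlen)
    addresses = 2 ** (network.max_prefixlen - block_prefix)
    return blocks, addresses


pool = "10.0.0.0/24"  # placeholder pool of the size under discussion
for prefix in (26, 28, 30, 32):
    blocks, addresses = block_stats(pool, prefix)
    verdict = "within" if blocks <= PREFIX_LIMIT else "over"
    print(f"/{prefix}: {blocks} blocks, {addresses} addresses each, {verdict} the limit")
```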

May 2 2024, 1:31 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T345823: Wikikube staging clusters are out of IPv4 Pod IP's as Resolved.
May 2 2024, 8:51 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

staging-eqiad has been migrated to /28 blocks as well

May 2 2024, 8:51 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

Scap failed to connect to this host today during the MediaWiki train while trying to preload the MW image:
15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

Would it be possible to remove it temporarily from the list of K8s workers while work is done on it?

May 2 2024, 8:14 AM · serviceops, SRE, ops-codfw