Yesterday

JMeybohm updated the task description for T365687: Improve calico-typha firewall rules.
Thu, May 23, 11:28 AM · Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

kubernetes2023 is still cordoned and depooled for additional tests of the VLAN move process.

Thu, May 23, 9:34 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

After the reimage I needed to run the following for calico to start up properly:

Thu, May 23, 9:31 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created T365687: Improve calico-typha firewall rules.
Thu, May 23, 9:30 AM · Prod-Kubernetes, Kubernetes

Wed, May 22

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Wed, May 22, 4:29 PM · Patch-For-Review, serviceops, MW-on-K8s
JMeybohm updated the task description for T365571: Rename wikikube worker nodes during OS reimage.
Wed, May 22, 3:01 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

I've cleared out kubernetes2023 and kubernetes2032 for you to run tests. As the hosts are pooled=inactive and cordoned in k8s, all you have to do is downtime them (which the cookbooks probably do).

Wed, May 22, 2:41 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Wed, May 22, 1:41 PM · Patch-For-Review, serviceops, MW-on-K8s
JMeybohm added a comment to T365571: Rename wikikube worker nodes during OS reimage.

@ayounsi I think we could test the rename cookbook together with T350152: Automation to change a server's vlan on the already cordoned kubernetes2023.codfw.wmnet, right?

Wed, May 22, 1:33 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T336861: Fix naming confusion around main/wikikube kubernetes clusters: T365571: Rename wikikube worker nodes during OS reimage.
Wed, May 22, 11:35 AM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a parent task for T365571: Rename wikikube worker nodes during OS reimage: T336861: Fix naming confusion around main/wikikube kubernetes clusters.
Wed, May 22, 11:35 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created T365571: Rename wikikube worker nodes during OS reimage.
Wed, May 22, 9:58 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Tue, May 21

JMeybohm added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I might be missing something obvious here, but I've two questions:

  • Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yaml files?
  • Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?
Tue, May 21, 8:41 AM · Patch-For-Review, MW-on-K8s, serviceops, SRE Observability (FY2023/2024-Q4), Observability-Metrics

Fri, May 17

JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Fri, May 17, 3:11 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes as Resolved.

Both staging clusters have been migrated to stacked control-planes

Fri, May 17, 2:23 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, a subtask of T353464: Migrate wikikube control planes to hardware nodes, as Resolved.
Fri, May 17, 2:20 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T365253: Allow Kubernetes workers to be deployed on Bookworm.

For T362408: Migration to containerd and away from docker, we're planning to backport containerd from bookworm to bullseye. Maybe it would be feasible to backport runc as well (although this won't help you with T363191: Test if we can avoid ROCm debian packages on k8s nodes, of course)?

Fri, May 17, 1:58 PM · Machine-Learning-Team, serviceops, Kubernetes
JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
Fri, May 17, 1:53 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm triaged T365224: ipoid charts app.job module has out of band changes as High priority.
Fri, May 17, 8:56 AM · serviceops
JMeybohm created T365224: ipoid charts app.job module has out of band changes.
Fri, May 17, 8:56 AM · serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni.
Fri, May 17, 8:18 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Thu, May 16

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni.
Thu, May 16, 1:11 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
Thu, May 16, 12:19 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T362310: Implement global ratelimiting in our service mesh.

Successfully published image docker-registry.discovery.wmnet/ratelimit:9.0.2-20240503.3fcc360, supporting hot reload of gRPC certs. This should unblock deploying the ratelimit service.

Thu, May 16, 7:46 AM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch
JMeybohm committed rOSERb835affd35dc: Vendor dependencies.
Vendor dependencies
Thu, May 16, 7:14 AM
JMeybohm committed rOSERfc2df91ff321: Add CertProvider to hot reload TLS certs for gRPC service.
Add CertProvider to hot reload TLS certs for gRPC service
Thu, May 16, 7:06 AM

Wed, May 15

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
Wed, May 15, 1:50 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
Wed, May 15, 1:06 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Tue, May 14

JMeybohm added a comment to T362984: GPU errors in hf image in ml-staging.

You got me @elukey :-p
For reasons I have not yet tried to understand, the mknod cgroup permission is the culprit. Without it, the access() call fails:

Tue, May 14, 6:11 PM · Lift-Wing, Machine-Learning-Team
JMeybohm added a comment to T362984: GPU errors in hf image in ml-staging.

Two more data points that don't help at all:

jayme@ml-staging2001:~$ sudo docker exec -it --user 0 k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
jayme@ml-staging2001:~$ sudo docker exec -it --user 0 --privileged k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
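
The one-liner used in both commands is easier to follow when spelled out. A minimal sketch using only the Python standard library; the helper name `describe_access` is hypothetical and was not part of the original debugging session:

```python
import os


def describe_access(path: str) -> str:
    """Report real/effective IDs and whether `path` is visible to this process."""
    # os.F_OK only tests that the path exists; it does not test read or
    # write permission. os.access() evaluates the check with the real
    # (not the effective) uid/gid of the calling process.
    exists = os.access(path, os.F_OK)
    return "real: %d:%d, effective: %d:%d, result: %s" % (
        os.getuid(), os.getgid(), os.geteuid(), os.getegid(), exists,
    )


if __name__ == "__main__":
    # Device path taken from the commands above.
    print(describe_access("/dev/dri/renderD128"))
```

Running it inside the container should print the same kind of line as the two commands above, so the output can be compared with and without `--privileged`.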
Tue, May 14, 5:06 PM · Lift-Wing, Machine-Learning-Team
JMeybohm added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

Also worth noting that mediawiki already calls sessionstore via its envoy sidecar, so we do have telemetry data from prod and we should be able to see the impact there pretty quickly as well: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=sessionstore

Tue, May 14, 11:48 AM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm closed T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.
Tue, May 14, 9:55 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver as Resolved.
Tue, May 14, 9:55 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.
Tue, May 14, 8:31 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver as Resolved.

It's not clear to me what happened here. The makevm call was unable to detect a puppet run; at that time sudo rules were not present on the machine and puppet did not run. I terminated makevm and ran a reimage with more or less the same result, but at the time the cookbook was waiting for puppet, sudo wasn't even installed. install_console was not accessible in either case, and manual root logins via the ganeti console did not work either.
Overnight things seem to have cleared up. Even though the cookbook failed, puppet did run, I'm able to log in, and sudo works (as do successive puppet runs)...

Tue, May 14, 8:31 AM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

Mon, May 13

JMeybohm added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

kubestagemaster2005 got stuck at:

Mon, May 13, 4:28 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

kubestagemaster2004 is done (I messed up the phab ID in the cumin command, so the report ended up in https://phabricator.wikimedia.org/T363310#9790605)

Mon, May 13, 2:51 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
JMeybohm created T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.
Mon, May 13, 2:18 PM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm created T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.
Mon, May 13, 1:16 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

I tend to agree, also for the sake of aligning sessionstore with the rest of our services. Unfortunately this feels like the more involved change (A change to sessionstore was somewhat high risk ...) - but I think it's also true that it should not add much to latency.

Mon, May 13, 7:41 AM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm added a comment to T357353: Application Security Review Request : NetworkSession MediaWiki extension.

@EBernhardson, according to @JMeybohm there is no way to limit the IP ranges of a pod/service/namespace to associate them closely with an application (SUP).

Ok. I know it's hackier, but could that instead be managed via extension config?

I had a ponder, but I'm not sure how yet. With both our app and the mw app servers living in k8s there might be something related we can do with network policies, but I'm not certain.

To be precise here: If the service backing this as well as all consumers are running in the same k8s cluster, we could implement network policies that only allow access from certain workloads in the cluster. But I would advise against relying on that because:

  • We still have mediawiki appservers on hardware, and there will probably be some snowflakes whose implications for this I don't know
  • We won't be able to use this service cross-dc (as we do with all other active/active services), e.g. for depooling in an emergency etc. (which would make this a snowflake)
Mon, May 13, 7:34 AM · Discovery-Search (Current work), secscrum, Security, Application Security Reviews

Wed, May 8

JMeybohm added a comment to T364472: Assess the suitability of the upstream ceph-csi-rbd helm chart for deployment.

I've not looked in detail (and I probably will not be able to before the end of next week), but immediately worrisome to me is the daemonset running with superpowers. We decided not to do this with calico, for example, and instead we distribute the CNI plugin via Debian packages. Would that potentially be an option as well, or has it been considered?

Wed, May 8, 4:49 PM · Data-Platform-SRE (2024.05.27 - 2024.06.16), Patch-For-Review
JMeybohm renamed T362310: Implement global ratelimiting in our service mesh from SUP rate-limit fetch to Implement global ratelimiting in our service mesh.
Wed, May 8, 4:22 PM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch

Tue, May 7

JMeybohm updated subscribers of T287491: Allow to address Kubernetes API servers from NetworkPolicy.

I deployed the cert-manager changes to aux, @brouberol did dse, and @klausman will take care of the ml clusters. Thanks, all!

Tue, May 7, 9:11 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Mon, May 6

JMeybohm closed T362938: Degraded RAID on mw2382 as Resolved.

Forgot I left it there. All yours now!

Mon, May 6, 4:38 PM · serviceops, SRE, ops-codfw
JMeybohm committed rOSERc6e5b3951b04: Add .gitreview (authored by QChris).
Add .gitreview
Mon, May 6, 3:10 PM
JMeybohm committed rOSERd58d2f1ab9c3: Vendor dependencies.
Vendor dependencies
Mon, May 6, 3:10 PM
JMeybohm closed T364148: Configure Gerrit permissions on operations/software/envoyproxy/ratelimiter for merging from upstream as Resolved.

Pushing both branches worked now, thanks!

Mon, May 6, 3:09 PM · Release-Engineering-Team, Gerrit
JMeybohm updated subscribers of T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

Mon, May 6, 12:29 PM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

@JMeybohm papaul helped me identify the missing disk. I replaced it with a compatible drive. Please let me know if that fixed the issue. Thanks.

Mon, May 6, 8:06 AM · serviceops, SRE, ops-codfw
JMeybohm added a comment to T364148: Configure Gerrit permissions on operations/software/envoyproxy/ratelimiter for merging from upstream.

Thank you! Almost there. It now fails with:

Mon, May 6, 7:54 AM · Release-Engineering-Team, Gerrit

Fri, May 3

JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

We decided during the migration of production to a bigger Pod IP space that this would not be necessary for staging, and it actually is not. The issue there (as we figured out later) is that the IP space is split into /26 blocks, effectively limiting the cluster size to 4 nodes (including control-plane). The change to the IP block size was made to overcome this limitation without the need to change the Pod IP space (and therefore having to reconfigure that in various places).

Ok cool, and yep, that makes perfect sense: in staging we won't have a high number of pods. The plan sounds good then; I just wanted to make sure we weren't being too conservative with the allocations.

Regarding the current limit of at most 50 routes announced from each host, I think that is still ok? We're still slightly confused about how it tripped; it seems like during the change the host briefly sent more than we expected? But it should be ok in general?

Fri, May 3, 9:39 AM · Prod-Kubernetes, Kubernetes, serviceops

Thu, May 2

JMeybohm claimed T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

Thu, May 2, 1:53 PM · Patch-For-Review, serviceops, Data-Persistence
JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

Not sure if it might be worth taking a step back and weighing up what's happening here?

As I understand it there is a /24 IPv4 allocation for POD IPs for this cluster, and with the current IP block size at /26 that only provides 4 blocks?

Without knowing the details there are probably two ways to deal with this:

  1. Allocate a larger block than a /24 for such use, providing more /26 blocks that can be used
  2. Keep the /24 overall allocation as it is, but make the IP blocks smaller so there are more overall (/28, /29, /30 or whatever)

From a netops perspective we are relatively agnostic here, however given this is private IP space we have some flexibility. We definitely should try to avoid making any decisions that will potentially bite us down the road. Are we potentially putting too much of a limit on the number of potential PODs per host if we use a block size of /28 or less? Might it be better to keep those block allocations at /26 to allow for growth?

Should be fine either way, but just want to raise the question. We also need to size the 'prefix limit' on our network gear appropriately, current value of 50 should be ok for /28, but we may want to adjust up if using /30 or /32.
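
The block arithmetic discussed here can be checked with a short sketch using Python's standard `ipaddress` module. The network address below is a placeholder from the documentation range, not the real staging allocation, and the helper name `block_stats` is hypothetical:

```python
import ipaddress

# A /24 Pod IP allocation as discussed above. The concrete address is a
# placeholder (documentation range), not the actual staging network.
POD_NETWORK = ipaddress.ip_network("192.0.2.0/24")


def block_stats(network, block_prefix):
    """Return (number of blocks, addresses per block) for a given block size."""
    blocks = list(network.subnets(new_prefix=block_prefix))
    return len(blocks), blocks[0].num_addresses


if __name__ == "__main__":
    for prefix in (26, 28, 29, 30):
        count, size = block_stats(POD_NETWORK, prefix)
        print(f"/{prefix}: {count} blocks of {size} addresses")
```

A /24 holds only 4 blocks at /26, but 16 at /28 and 64 at /30. Smaller blocks let more nodes obtain a block, while each node may need several blocks for the same number of pods and therefore announce more prefixes.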

Thu, May 2, 1:31 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T345823: Wikikube staging clusters are out of IPv4 Pod IP's as Resolved.
Thu, May 2, 8:51 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

staging-eqiad has been migrated to /28 blocks as well

Thu, May 2, 8:51 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

Scap failed to connect to this host today during the MediaWiki train while trying to preload the MW image:
15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

Would it be possible to remove it temporarily from the list of K8s workers while work is done on it?

Thu, May 2, 8:14 AM · serviceops, SRE, ops-codfw
JMeybohm created T363971: scap should not run mediawiki-image-download on pooled=inactive servers.
Thu, May 2, 8:10 AM · Release-Engineering-Team, Scap

Tue, Apr 30

JMeybohm claimed T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

For the record: /30 blocks led to too many prefix announcements so the BGP sessions got blocked by the routers. As I wasn't sure about the actual limit there, I went with /28 which is still way better than /26 and allows additional nodes to join the cluster (and still get an ip block).

Tue, Apr 30, 4:13 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T345823: Wikikube staging clusters are out of IPv4 Pod IP's.

I've moved staging-codfw to /28 blocks using the process outlined in the calico docs. Instead of re-scheduling all pods twice, I just drained both nodes and left the pods in state Pending during the migration.
I had to delete the ipam block allocations and affinities manually as they were not freed automatically (or I was too impatient).

Tue, Apr 30, 3:56 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

Host is set pooled=inactive, cordoned in k8s, removed from BGP and shut down, so all yours

Tue, Apr 30, 11:01 AM · serviceops, SRE, ops-codfw
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

@Jhancock.wm I shut down the server for now. Could you please try to drain flea power and see if the controller comes back afterwards? If not, please open a case with Dell

Tue, Apr 30, 10:44 AM · serviceops, SRE, ops-codfw
JMeybohm added a comment to T362938: Degraded RAID on mw2382.

@Jhancock.wm I've tried power cycling the system and restarting iDRAC to see if the storage controller "comes back", but no luck. During boot I did see 2 SATA drives listed, though.
Of course, /dev/sdb is no longer discovered by mdadm, so it should be without I/O (if that helps with identifying it). Not really sure how to proceed here, as it seems odd that the storage controller fully disappeared from iDRAC

Tue, Apr 30, 10:16 AM · serviceops, SRE, ops-codfw
JMeybohm added a comment to T363407: Proper service names in trace data.

Ok, understood. The only thing I'm really worrying about is that metrics change/get less intuitive with this. For example in here it's pretty clear what the filter means (selecting "local_service"). I think we will lose clarity here if local_service changes to mw-web.eqiad.main. Maybe adding local as suffix/prefix would help here (and you could strip that out again in OTTL)?

OK fair enough. How about we add LOCAL_ as a prefix so it also sorts to the top in grafana?

Tue, Apr 30, 8:29 AM · Patch-For-Review, Observability-Tracing

Mon, Apr 29

JMeybohm added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

I've added the new, stacked control plane with some manual intervention, as etcd did not come up initially, which makes kube-apiserver-safe-restart wait forever to acquire a lock.

Mon, Apr 29, 4:59 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni.
Mon, Apr 29, 2:20 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm triaged T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes as High priority.
Mon, Apr 29, 11:41 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver, a subtask of T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes, as Resolved.
Mon, Apr 29, 9:15 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver as Resolved.
Mon, Apr 29, 9:15 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T363407: Proper service names in trace data.

Ok, understood. The only thing I'm really worrying about is that metrics change/get less intuitive with this. For example in here it's pretty clear what the filter means (selecting "local_service"). I think we will lose clarity here if local_service changes to mw-web.eqiad.main. Maybe adding local as suffix/prefix would help here (and you could strip that out again in OTTL)?

Mon, Apr 29, 8:21 AM · Patch-For-Review, Observability-Tracing

Fri, Apr 26

JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

I ran a couple of very basic benchmarks (commands in the attached file) against single node etcd instances running on:

Fri, Apr 26, 2:59 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Thu, Apr 25

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Thu, Apr 25, 9:34 AM · Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni.
Thu, Apr 25, 9:34 AM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm added a comment to T363407: Proper service names in trace data.

Thanks for the write-up!
What is not very clear to me is which part of the work would need to be done anyway (in case we had an envoy version >= 1.24). The reason I'm asking is that envoy 1.23 has been EOL for a year or so, so we need to look at an upgrade anyway.

Thu, Apr 25, 7:53 AM · Patch-For-Review, Observability-Tracing
JMeybohm added a comment to T362954: Fix rendering issue in modules.app.job when cronjobs are enabled and private values are defined.

@JMeybohm any objections?

Nope. Fine by me!

Thu, Apr 25, 7:27 AM · serviceops, Patch-For-Review, Kubernetes
JMeybohm added a comment to T348284: Handle sidecar containers in one-off Kubernetes jobs.

[...] The right thing to do is probably filter the watch call, so the controller never finds out about pods it doesn't care about, instead of filtering them out after being notified. That's an easy code change for monitoring a single namespace, but I have to dig a little to see if it's still easy for monitoring several (which we don't do now but will eventually).

Thu, Apr 25, 7:25 AM · MW-on-K8s, serviceops

Apr 24 2024

JMeybohm updated the task description for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
Apr 24 2024, 12:31 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.
Apr 24 2024, 10:43 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm created T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.
Apr 24 2024, 10:43 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm created T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.
Apr 24 2024, 10:25 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Apr 23 2024

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

No event logs from the past 24 hours from either wikikube eqiad or codfw are available in logstash.

Apr 23 2024, 11:48 AM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Apr 23 2024, 10:16 AM · Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops

Apr 19 2024

JMeybohm updated the task description for T362978: Update all helm modules and charts to be compatible with the restricted PSS.
Apr 19 2024, 1:22 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm lowered the priority of T355237: Update cache.mrouter modules in deployment-charts from High to Medium.
Apr 19 2024, 1:04 PM · serviceops
JMeybohm added a comment to T355237: Update cache.mrouter modules in deployment-charts.

I think I fixed all that because it was blocking me.

Apr 19 2024, 1:02 PM · serviceops
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 19 2024, 12:44 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm triaged T362978: Update all helm modules and charts to be compatible with the restricted PSS as High priority.
Apr 19 2024, 12:42 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm created T362978: Update all helm modules and charts to be compatible with the restricted PSS.
Apr 19 2024, 12:41 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 19 2024, 12:41 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm reopened T355237: Update cache.mrouter modules in deployment-charts as "Open".

This breaks in CI when actually enabled.

Apr 19 2024, 12:12 PM · serviceops
JMeybohm added a project to T362310: Implement global ratelimiting in our service mesh: serviceops-radar.
Apr 19 2024, 8:33 AM · serviceops, Patch-For-Review, Discovery-Search (Current work), CirrusSearch
JMeybohm added a comment to T362518: Deprecate buster-backports.

Successfully published image docker-registry.discovery.wmnet/httpd-fcgi:2.4.38-10-u5-20240415
Successfully published image docker-registry.discovery.wmnet/mediawiki-httpd:0.1.8-s2-20240415

Apr 19 2024, 8:29 AM · Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T362518: Deprecate buster-backports.

httpd-fcgi + dependent images seem not to have been rebuilt successfully on Monday. Checking.

Apr 19 2024, 8:25 AM · Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated subscribers of T362954: Fix rendering issue in modules.app.job when cronjobs are enabled and private values are defined.

@jijiki I recall you had dug up some things around jobs/cronjobs as well, maybe you can take a look?

Apr 19 2024, 8:03 AM · serviceops, Patch-For-Review, Kubernetes
JMeybohm added a project to T362954: Fix rendering issue in modules.app.job when cronjobs are enabled and private values are defined: serviceops.
Apr 19 2024, 8:02 AM · serviceops, Patch-For-Review, Kubernetes

Apr 18 2024

JMeybohm updated the task description for T362766: 2024-04-17 mw-on-k8s eqiad outage.
Apr 18 2024, 12:17 PM · serviceops, Sustainability (Incident Followup)

Apr 17 2024

JMeybohm added a comment to T362766: 2024-04-17 mw-on-k8s eqiad outage.

coredns related changes
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020778
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020765
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020789
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020774

Apr 17 2024, 11:33 AM · serviceops, Sustainability (Incident Followup)
JMeybohm added a comment to T350034: Have the function orchestrator emit application-level events to Prometheus for observability.

I was looking at the metrics as per our conversation yesterday and I do see the application responding with 404 to GET /metrics requests. Did you configure a different metrics path? If so, that must be provided to prometheus via the prometheus.io/path annotation of the Pods.
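
As a rough illustration of how such an annotation overrides the default scrape path, here is a minimal sketch. It assumes the common convention that `/metrics` is used when no `prometheus.io/path` annotation is set; the helper name `scrape_path` and the example path value are hypothetical and do not reflect the actual scrape configuration:

```python
# Assumed default when a Pod carries no path annotation.
DEFAULT_METRICS_PATH = "/metrics"


def scrape_path(annotations: dict) -> str:
    """Pick the metrics path for a Pod from its annotations."""
    return annotations.get("prometheus.io/path", DEFAULT_METRICS_PATH)


if __name__ == "__main__":
    # Hypothetical Pod annotations; "/custom-metrics" is an illustrative value.
    pod_annotations = {"prometheus.io/path": "/custom-metrics"}
    print(scrape_path(pod_annotations))
    print(scrape_path({}))
```

A Pod that serves metrics somewhere other than the default path and lacks the annotation would answer the default request with a 404, which matches the symptom described above.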

Apr 17 2024, 8:37 AM · function-orchestrator, Abstract Wikipedia team

Apr 16 2024

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 16 2024, 6:26 PM · Patch-For-Review, serviceops, Prod-Kubernetes