Apr 16 2024

JMeybohm created P60658 check-apparmor_seccomp.sh.
Apr 16 2024, 4:22 PM
JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Apr 16 2024, 1:28 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Apr 16 2024, 1:25 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 16 2024, 12:29 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 16 2024, 12:28 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Apr 16 2024, 11:18 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm claimed T353464: Migrate wikikube control planes to hardware nodes.
Apr 16 2024, 11:04 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T358936: Kubernetes apiserver probe failures on restart.

We had this happening again in eqiad today because of a (planned) apiserver safe restart. We'll prioritize T353464: Migrate wikikube control planes to hardware nodes to give more resources to wikikube apiservers.

Apr 16 2024, 11:03 AM · Prod-Kubernetes, serviceops, SRE
JMeybohm raised the priority of T287491: Allow to address Kubernetes API servers from NetworkPolicy from Low to High.

We should prioritize T353464: Migrate wikikube control planes to hardware nodes because of T358936: Kubernetes apiserver probe failures on restart. Would be nice to have this done to lower configuration overhead, so raising this as well.

Apr 16 2024, 11:02 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm raised the priority of T353464: Migrate wikikube control planes to hardware nodes from Medium to High.
Apr 16 2024, 11:00 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

Updating (restarting) rsyslog in wikikube codfw again led to quite a bump in events, followed by a "(LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-codfw" alert. Maybe restarting rsyslog cluster-wide after a week is a viable strategy to determine whether rsyslog is still "losing" messages or getting stuck in some error state. In my understanding, there should not be a bump in events after a restart if everything is fine.

Apr 16 2024, 8:58 AM · Patch-For-Review, Observability-Logging, serviceops

Apr 15 2024

JMeybohm updated the task description for T362518: Deprecate buster-backports.
Apr 15 2024, 4:49 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T315560: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters.

@JMeybohm Is this something still needed?

Apr 15 2024, 3:47 PM · Infrastructure-Foundations, SRE-tools, Spicerack
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Apr 15 2024, 2:32 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Apr 15 2024, 2:29 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T362518: Deprecate buster-backports.

Production images rebuild is done:

Apr 15 2024, 1:51 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Apr 15 2024, 1:50 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Apr 15 2024, 10:57 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T362408: Migration to containerd and away from docker.

@akosiaris could you please double check in your test environment that containerd will still enforce the default apparmor profile (see Remove apparmor.security.beta.kubernetes.io/defaultProfileName in T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21) like docker currently does?

Apr 15 2024, 8:03 AM · Prod-Kubernetes, Kubernetes, serviceops

Apr 12 2024

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 12 2024, 7:46 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 12 2024, 7:43 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 12 2024, 7:43 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 12 2024, 7:42 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm renamed T290020: Enable audit logging for kube-apiserver from Evaluate and enable audit logging for kube-apiserver to Enable audit logging for kube-apiserver.
Apr 12 2024, 7:39 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I've added a more comprehensive list of @elukey's tests at https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement#Violation_error_handling
Bottom line: with PSPs and VAPs we only get events; with PSS and kyverno we get additional user warnings (or even full rejections in the case of kyverno).

Apr 12 2024, 5:48 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

Elastic Integrations aren't available to us in an OpenSearch world. However, the mapping data from that link would be useful if we choose to transform these logs to ECS and not use a dedicated index.

Apr 12 2024, 3:40 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

I've merged the attached patch and the logs are ingested into the logstash-k8s- index (https://logstash.wikimedia.org/app/discover#/view/7f276c90-f8a0-11ee-be54-8fd74c07934f). Unfortunately the event dates are off as the date of ingestion is used instead of a timestamp from the actual data. I suppose this is something that will be fixed by using a dedicated index

Apr 12 2024, 7:55 AM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Apr 12 2024, 7:32 AM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

Thanks! Elastic does maintain an audit_log integration at https://docs.elastic.co/integrations/kubernetes/audit-logs / https://github.com/elastic/integrations/tree/main/packages/kubernetes/data_stream/audit_logs - does that make things easier?

Apr 12 2024, 7:26 AM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes

Apr 11 2024

JMeybohm updated the task description for T290020: Enable audit logging for kube-apiserver.
Apr 11 2024, 12:48 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a project to T290020: Enable audit logging for kube-apiserver: Observability-Logging.
Apr 11 2024, 12:44 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes

Apr 10 2024

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Apr 10 2024, 3:04 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T359423: Migrate charts to Calico Network Policies.

Grouped the todo list by chart. Some of those also need mesh.configuration updates due to T346638: Rename the envoy's uses_ingress option to sets_sni, which could be bundled with this.

Apr 10 2024, 12:46 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Apr 10 2024, 12:45 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T346638: Rename the envoy's uses_ingress option to sets_sni .

Unfortunately version 1.4.3 of mesh.configuration still uses uses_ingress in one if-block. So the initially assumed version requirement was not correct and there are still a bunch of charts to update. :/

Apr 10 2024, 12:37 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Apr 10 2024, 12:34 PM · Patch-For-Review, Machine-Learning-Team, serviceops

Apr 9 2024

JMeybohm updated the task description for T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator.
Apr 9 2024, 10:47 AM · Patch-For-Review, Data-Platform-SRE, Wikidata, Wikidata-Query-Service
JMeybohm added a comment to T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error.

Can someone clarify what the problem here is? From WBQC’s perspective, it’s totally expected that some of these regex checks will fail (though there’s some confusion about which shellbox errors we should or shouldn’t try to catch, see T304084 and especially T304084#8561863). But we might need to make some changes to keep the service mesh monitoring happy? (“exceeding retry limit” also sounds concerning – we don’t really want these requests to be retried, I think.)

Apr 9 2024, 9:36 AM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox

Apr 8 2024

JMeybohm renamed T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error from shellbox-constraints listener is constantly exceeding the retry limit to shellbox-constraints returning 500 on preg_match error.
Apr 8 2024, 4:03 PM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox
JMeybohm added a comment to T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error.
Apr 8 2024, 2:57 PM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox
JMeybohm created T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error.
Apr 8 2024, 2:15 PM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox
JMeybohm added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

We had a bunch of errors again today, all of them connection errors to eventgate-analytics and main, leading to 503 after exceeding the retry limit. There were "ERROR ferm input drop default policy not set, ferm might not have been started correctly" alerts during that time, but I'm not convinced those are related.

Apr 8 2024, 2:01 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
JMeybohm added a comment to T361706: 2024-04-03 calico/typha down.

Regarding the number of connections per typha: there is logic in typha that balances the connections between instances by disconnecting those exceeding a dynamic threshold, which is calculated from the number of typhas and the number of nodes (see https://github.com/projectcalico/calico/blob/v3.23.3/typha/pkg/k8s/rebalance.go#L35).

Apr 8 2024, 10:04 AM · Prod-Kubernetes, Wikimedia-Incident
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Apr 8 2024, 9:53 AM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Apr 8 2024, 8:39 AM · JavaScript, Phabricator
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Apr 8 2024, 8:33 AM · JavaScript, Phabricator
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Apr 8 2024, 8:32 AM · JavaScript, Phabricator
JMeybohm added a comment to T361706: 2024-04-03 calico/typha down.

The investigation from last Friday showed that the first failed probe for calico-typha-75d4649699-h7vgq was recorded at 13:10:31, which was probably a consequence of the process not being able to allocate additional memory (Apr 3 13:10:28 kubernetes1022 kernel: [18672041.401461] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=99). That typha instance had ~230 clients connected at that time (out of ~636), which then must have tried to re-establish connections with one of the remaining 2 typhas, pushing those over the memory limit as well.
After updating the grafana dashboard a bit, it was clear that the client distribution between the 3 typha instances is naturally pretty uneven, and so is the memory usage. As the most loaded instance already reaches ~500MiB (of its 600MiB limit), I will increase the limit to 1GiB per instance.
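
As a rough illustration of the reconnect effect described above, the following Python sketch computes the average number of clients per typha instance before and after one instance fails. It assumes the clients spread evenly across the surviving instances, which the comment notes is not the case in practice. The function name is hypothetical and the figures are the approximate ones quoted above.

```python
def clients_per_instance(total_clients: int, instances: int, failed: int = 0) -> float:
    """Average clients per surviving instance, assuming an even spread (an assumption)."""
    survivors = instances - failed
    if survivors <= 0:
        raise ValueError("no surviving instances")
    return total_clients / survivors


# Approximate figures from the comment: ~636 clients, 3 typha instances.
before = clients_per_instance(636, 3)            # all instances healthy
after = clients_per_instance(636, 3, failed=1)   # one instance OOM-killed
growth = after / before - 1                      # relative increase per survivor

print(f"before={before:.0f} after={after:.0f} growth={growth:.0%}")
```

Even under the even-spread assumption, each surviving instance ends up with about half again as many clients, which is consistent with the remaining instances being pushed past their memory limit.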

Apr 8 2024, 8:18 AM · Prod-Kubernetes, Wikimedia-Incident

Apr 5 2024

JMeybohm reopened T357616: Logs from containers sometimes not visible in logstash as "Open".

This is still happening. K8s event logs seem suspiciously empty on April 4th for eqiad for example.

Apr 5 2024, 1:26 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Apr 5 2024, 10:15 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Apr 4 2024

JMeybohm closed T357616: Logs from containers sometimes not visible in logstash as Resolved.

The restarts do work properly and we've not seen "Too many open files" errors since their implementation. I'm optimistically resolving this for now

Apr 4 2024, 11:30 AM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T361728: SwaggerProbeHasFailures for citoid since last deployment.

Weirdly it looks like that first downtime I see in the grafana was everyone? Since the swagger endpoint is unhappy for everyone?

That was because of T361706: 2024-04-03 calico/typha down

Apr 4 2024, 11:22 AM · serviceops-radar, Citoid

Apr 3 2024

JMeybohm added a project to T361728: SwaggerProbeHasFailures for citoid since last deployment: serviceops-radar.
Apr 3 2024, 5:50 PM · serviceops-radar, Citoid
JMeybohm created T361728: SwaggerProbeHasFailures for citoid since last deployment.
Apr 3 2024, 5:50 PM · serviceops-radar, Citoid
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

Observability-Logging could you maybe advise on if/how/where we could potentially store these audit logs to make them more accessible?
They come as JSON lines with the format specified in https://kubernetes.io/docs/reference/config-api/apiserver-audit.v1/#audit-k8s-io-v1-Event
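
For illustration, a minimal Python sketch of reading one such JSON line and extracting the event's own timestamp rather than relying on the time of ingestion. The field names requestReceivedTimestamp and stageTimestamp come from the audit.k8s.io/v1 Event format; the sample line, its values, and the function name are hypothetical, and the sketch assumes timestamps with microsecond precision and a trailing "Z".

```python
import json
from datetime import datetime, timezone

# Assumed timestamp layout (microsecond precision, UTC "Z" suffix).
AUDIT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def event_time(line: str) -> datetime:
    """Return the timestamp carried by one audit Event JSON line, in UTC."""
    event = json.loads(line)
    # Prefer the time the request was received; fall back to the stage time.
    raw = event.get("requestReceivedTimestamp") or event["stageTimestamp"]
    return datetime.strptime(raw, AUDIT_TIME_FORMAT).replace(tzinfo=timezone.utc)


# Hypothetical sample line, trimmed to a few fields of the Event type.
sample = (
    '{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata",'
    '"auditID":"hypothetical-id","stage":"ResponseComplete","verb":"list",'
    '"requestURI":"/api/v1/namespaces/default/pods",'
    '"requestReceivedTimestamp":"2024-04-12T07:55:01.123456Z",'
    '"stageTimestamp":"2024-04-12T07:55:01.130000Z"}'
)

print(event_time(sample).isoformat())
```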

Apr 3 2024, 12:50 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm updated the task description for T290020: Enable audit logging for kube-apiserver.
Apr 3 2024, 12:42 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes

Mar 28 2024

JMeybohm renamed T290020: Enable audit logging for kube-apiserver from Evaluate and enable audit logging for kubeapi-server to Evaluate and enable audit logging for kube-apiserver.
Mar 28 2024, 3:59 PM · Patch-For-Review, Observability-Logging, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm updated the task description for T343787: Find a replacement for the unmaintained eventrouter.
Mar 28 2024, 11:10 AM · Technical-Debt, serviceops, Prod-Kubernetes, Kubernetes

Mar 27 2024

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

A systemd timer has been deployed to all kubernetes nodes that checks every hour whether rsyslog has accumulated >10k fds to deleted files/folders and restarts it if that's the case... let's see.
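
The check described above could look roughly like the following Python sketch. This is not the deployed script: the function names are hypothetical, and it assumes a Linux /proc filesystem, where the link target of a file descriptor for an unlinked file ends in " (deleted)".

```python
import os

DELETED_SUFFIX = " (deleted)"  # appended by Linux to targets of unlinked files


def read_fd_targets(pid: int) -> list[str]:
    """Resolve every fd symlink under /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The fd may have been closed between listdir() and readlink().
            continue
    return targets


def count_deleted_fds(targets: list[str]) -> int:
    """Count fd targets that point at deleted files or folders."""
    return sum(1 for target in targets if target.endswith(DELETED_SUFFIX))


def needs_restart(targets: list[str], threshold: int = 10_000) -> bool:
    """True when more than `threshold` fds point at deleted paths."""
    return count_deleted_fds(targets) > threshold
```

A wrapper run from the timer would call read_fd_targets() with the rsyslog PID and restart the service when needs_restart() returns True. Keeping the counting logic separate from the /proc access makes it testable without a running rsyslog.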

Mar 27 2024, 4:37 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Mar 27 2024, 4:16 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

btw, in case this is relevant, https://logstash.wikimedia.org/goto/b060c8f0c137245fc0d63b9329583abe shows a spike of a bunch of requests with the same request ID, but those logs are for handling web requests. I can file a separate task for that if you think it's worth investigating further.

Mar 27 2024, 1:45 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

This seems to be an issue with how imfile implements inotify watches for symlinks (or symlinks to symlinks maybe?):

Mar 27 2024, 12:19 PM · Patch-For-Review, Observability-Logging, serviceops

Mar 26 2024

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Mar 26 2024, 4:03 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

[26.03.24 14:55] <jinxer-wm> (KubernetesRsyslogDown) firing: rsyslog on mw1483:9105 is missing kubernetes logs

Mar 26 2024, 3:24 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).

Deployed v0.0.3 of the chart incl. rdb to all wikikube and staging clusters as well as dse

Mar 26 2024, 2:27 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T360612: Add redis (rdb) instances to external-services as Resolved.
Mar 26 2024, 1:32 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T360612: Add redis (rdb) instances to external-services, a subtask of T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies), as Resolved.
Mar 26 2024, 1:30 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mar 25 2024

JMeybohm edited P11638 smaller_gerritbot_comments.js.
Mar 25 2024, 10:18 AM · JavaScript, Phabricator

Mar 22 2024

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

Unfortunately we did not gain any insights from the new metrics (dashboard at https://grafana-rw.wikimedia.org/d/KimNkFTIk/jayme-omkafka) as of now. We were also not able to spot another incarnation of this issue or reproduce it somehow.

Mar 22 2024, 2:14 PM · Patch-For-Review, Observability-Logging, serviceops
BTullis awarded T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) a Hungry Hippo token.
Mar 22 2024, 1:44 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mar 21 2024

JMeybohm added a comment to T360637: Bump memory for registry[12]00[34] VMs.

Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase. No need for extra steps

Mar 21 2024, 2:28 PM · Patch-For-Review, serviceops, Machine-Learning-Team
JMeybohm created T360612: Add redis (rdb) instances to external-services.
Mar 21 2024, 10:13 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Prod-Kubernetes, Kubernetes, serviceops

Mar 19 2024

JMeybohm updated subscribers of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I've summarized my findings at https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement @akosiaris, @elukey: I'd like you to take a look and ask questions if you find the time.

Mar 19 2024, 5:29 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).

I now think we might have misunderstood each other in T359334. I do think your proposal in https://phabricator.wikimedia.org/T359334#9606340 is the way to go, but what I initially meant was grouping the source data in global_config in a similar way. I am replying here because this relates to the external-services chart and how we generate data for it.

Mar 19 2024, 10:11 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mar 18 2024

JMeybohm added a comment to T359416: Add Dragonfly to the ML k8s clusters.

That's right. By default the supernode does act as a CDN in front of the docker-registry but I intentionally disabled that behavior as there's no benefit of that in our infra. I would assume that the supernodes do still query the registry for some sanity check data maybe but the seeding happens directly from the docker-registry instances.

Mar 18 2024, 3:17 PM · Machine-Learning-Team
JMeybohm removed projects from T358489: mw2420-mw2451 do have unnecessary raid controllers (configured): ops-codfw, DC-Ops.

@JMeybohm hello is there anything DC-ops need to do on this task?

Mar 18 2024, 1:25 PM · SRE, serviceops
JMeybohm added a comment to T359416: Add Dragonfly to the ML k8s clusters.

Afaics from the logs the clients were getting chunks of data every time from the registry (not the entire content at once), but I am wondering if all the docker images on ml-staging2002 should have been pulled entirely from ml-staging2001. Anything that I am missing @JMeybohm ?

Mar 18 2024, 8:44 AM · Machine-Learning-Team

Mar 12 2024

JMeybohm updated subscribers of T344478: Fix how we keep docker-pkg based images up to date.
Mar 12 2024, 1:15 PM · Release Pipeline (Blubber), docker-pkg, serviceops

Mar 8 2024

JMeybohm added a comment to T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images.

Are the Dragonfly supernodes sharable between clusters? We are interested in adding Dragonfly (it has been in our backlog for a long time), but I don't get from the docs if we should create a p2p network shared between clusters or not.

@JMeybohm Care to answer this one? ^. My impression is no, but I am not sure.

Yep, that's on the move already T359416: Add Dragonfly to the ML k8s clusters

Mar 8 2024, 2:32 PM · Machine-Learning-Team
JMeybohm added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Just to create the reference, I would assume this is a consequence of T290536: Serve production traffic via Kubernetes and friends.

Mar 8 2024, 2:25 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
JMeybohm lowered the priority of T256762: Fix nginx config and caching for docker registry from Medium to Low.
Mar 8 2024, 2:22 PM · serviceops, Kubernetes, SRE
JMeybohm added a comment to T256762: Fix nginx config and caching for docker registry .
  • Requests for the catalog are not cached
    • curl -I -XGET 'https://docker-registry.wikimedia.org/v2/_catalog'

catalog is now cached.

Mar 8 2024, 2:21 PM · serviceops, Kubernetes, SRE

Mar 7 2024

JMeybohm added a comment to T359416: Add Dragonfly to the ML k8s clusters.

I think it's fine to use the existing supernodes. They act as coordinators only, so there is not much load or network traffic even during mw-deployments.

Mar 7 2024, 10:51 AM · Machine-Learning-Team

Mar 6 2024

JMeybohm added a comment to T359334: Create a networkpolicy template allowing charts to define a Calico Network policy to external services.

Yes, exactly. I think we would benefit from breaking with the current structure and instead create a generic one for helm chart values files as well as for the values.yaml we generate via global_config.pp. We can keep the old global_config.pp structure (AIUI that's just for kafka and zookeeper) around until we've migrated all relevant charts which should not take that long.

Mar 6 2024, 1:29 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T359334: Create a networkpolicy template allowing charts to define a Calico Network policy to external services.

Thanks for the writeup! I pretty much agree with your plan/idea, although I would suggest adding a separate module (modules/base/external-services-networkplicy_1.0.0.yaml or alike) so we can create new versions independently from the base/networkpolicy module.

Mar 6 2024, 1:13 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mar 5 2024

JMeybohm added a comment to T348466: Rethink kubernetes etcd storage.

@Clement_Goubert are you okay with merging this into T353464: Migrate wikikube control planes to hardware nodes?

Mar 5 2024, 11:43 AM · Prod-Kubernetes, serviceops
JMeybohm added a subtask for T341984: Update Kubernetes clusters to >1.25: T348466: Rethink kubernetes etcd storage.
Mar 5 2024, 11:39 AM · Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a parent task for T348466: Rethink kubernetes etcd storage: T341984: Update Kubernetes clusters to >1.25.
Mar 5 2024, 11:39 AM · Prod-Kubernetes, serviceops
JMeybohm added a parent task for T353464: Migrate wikikube control planes to hardware nodes: T341984: Update Kubernetes clusters to >1.25.
Mar 5 2024, 11:38 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a subtask for T341984: Update Kubernetes clusters to >1.25: T353464: Migrate wikikube control planes to hardware nodes.
Mar 5 2024, 11:38 AM · Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a comment to T358936: Kubernetes apiserver probe failures on restart.

To clarify why this happened/happens:
kubemaster2001 refreshed the certs used by the apiserver in one puppet run at ~00:51:

Mar 5 2024, 11:27 AM · Prod-Kubernetes, serviceops, SRE

Mar 4 2024

JMeybohm added a comment to T341984: Update Kubernetes clusters to >1.25.

With the next k8s upgrade we already have the following dependency problems:

  • We need to migrate to containerd before moving to k8s >=1.24 (T269684)
  • containerd version (< 1.6) in bullseye is only supported in kubelet <=1.25 (see)
  • PSPs gone in >=1.25 (T273507)
  • VAPs available in >=1.26 (T273507)
Mar 4 2024, 8:52 AM · Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops

Feb 29 2024

JMeybohm added a project to T282926: Allow for users to specify Wikidata items in function evaluation requests: serviceops-radar.
Feb 29 2024, 8:22 AM · serviceops-radar, WikiLambda, function-orchestrator, Epic, Abstract Wikipedia team

Feb 28 2024

JMeybohm renamed T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) from mw2420-mw2451 do have unncecesarry raid controllers (configured) to mw2420-mw2451 do have unnecessary raid controllers (configured).
Feb 28 2024, 5:49 PM · SRE, serviceops
JMeybohm updated the task description for T355871: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw.
Feb 28 2024, 2:56 PM · DBA, ops-codfw, Infrastructure-Foundations, netops, SRE

Feb 27 2024

JMeybohm added a comment to T356303: Review wikitech:Search and write processes for k8s world.

I'm not very familiar with running Flink in general, so I really can't speak to that, "we want to run N related things" just sounded like what the session clusters idea is to me.

Feb 27 2024, 11:27 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Documentation, Discovery-Search (Current work)

Feb 26 2024

JMeybohm added a comment to T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).

If you do decide you might want to reprovision these nodes as non-RAID, there is a sre.swift.convert-disks cookbook that does most of the heavy lifting (though you'd probably need to relax the host restriction a bit).

We could verify that while T351074: Move servers from the appserver/api cluster to kubernetes (although most of the servers have already been moved to k8s unfortunately).

Feb 26 2024, 4:25 PM · SRE, serviceops
JMeybohm added a comment to T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).

If you do decide you might want to reprovision these nodes as non-RAID, there is a sre.swift.convert-disks cookbook that does most of the heavy lifting (though you'd probably need to relax the host restriction a bit).

Feb 26 2024, 3:12 PM · SRE, serviceops
JMeybohm renamed T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) from mw2420-mw2451 do have unncecesarry raid controllers (configured to mw2420-mw2451 do have unncecesarry raid controllers (configured).
Feb 26 2024, 1:58 PM · SRE, serviceops
JMeybohm closed T357380: Degraded RAID on mw2442 as Resolved.

T358489 as follow-up for the strange RAID config, resolving this one.

Feb 26 2024, 1:57 PM · serviceops, ops-codfw