JMeybohm
User

User Details

User Since
Apr 2 2020, 9:01 AM (211 w, 1 d)
Availability
Available
IRC Nick
jayme
LDAP User
JMeybohm
MediaWiki User
JMeybohm (WMF) [ Global Accounts ]

Recent Activity

Today

JMeybohm reopened T355237: Update cache.mrouter modules in deployment-charts as "Open".

This breaks in CI when actually enabled.

Fri, Apr 19, 12:12 PM · serviceops
JMeybohm added a project to T362310: SUP rate-limit fetch: serviceops-radar.
Fri, Apr 19, 8:33 AM · serviceops-radar, Discovery-Search (Current work), CirrusSearch
JMeybohm added a comment to T362518: Deprecate buster-backports.

Successfully published image docker-registry.discovery.wmnet/httpd-fcgi:2.4.38-10-u5-20240415
Successfully published image docker-registry.discovery.wmnet/mediawiki-httpd:0.1.8-s2-20240415

Fri, Apr 19, 8:29 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T362518: Deprecate buster-backports.

httpd-fcgi and its dependent images seem not to have been rebuilt successfully on Monday. Checking.

Fri, Apr 19, 8:25 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated subscribers of T362954: Fix rendering issue in modules.app.job when cronjobs are enabled and private values are defined.

@jijiki I recall you had dug up some things around jobs/cronjobs as well, maybe you can take a look?

Fri, Apr 19, 8:03 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), serviceops, Patch-For-Review, Kubernetes
JMeybohm added a project to T362954: Fix rendering issue in modules.app.job when cronjobs are enabled and private values are defined: serviceops.
Fri, Apr 19, 8:02 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), serviceops, Patch-For-Review, Kubernetes

Yesterday

JMeybohm updated the task description for T362766: 2024-04-17 mw-* went down in eqiad.
Thu, Apr 18, 12:17 PM · serviceops, Sustainability (Incident Followup)

Wed, Apr 17

JMeybohm added a comment to T362766: 2024-04-17 mw-* went down in eqiad.

coredns related changes
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020778
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020765
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020789
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020774

Wed, Apr 17, 11:33 AM · serviceops, Sustainability (Incident Followup)
JMeybohm added a comment to T350034: Have the function orchestrator emit application-level events to Prometheus for observability.

I was looking at the metrics as per our conversation yesterday and I do see the application responding with 404 to GET /metrics requests. Did you configure a different metrics path? If so, that must be provided to prometheus via the prometheus.io/path annotation of the Pods.

Wed, Apr 17, 8:37 AM · function-orchestrator, Abstract Wikipedia team
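
The annotation lookup described in the comment above can be sketched as follows. This is a minimal illustration, assuming the common prometheus.io/scrape, prometheus.io/port and prometheus.io/path annotation convention; the helper name scrape_url, the default port and the sample Pod are hypothetical, and whether a given Prometheus honours these annotations depends on its relabeling configuration.

```python
# Sketch: resolve the scrape target of a Pod from prometheus.io/* annotations.
# The helper name, default port and sample values are hypothetical.
DEFAULT_METRICS_PATH = "/metrics"


def scrape_url(pod_ip, annotations, default_port=9090):
    """Return the URL to scrape, or None if the Pod has not opted in."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Without an explicit path annotation the scraper falls back to /metrics,
    # which is where the 404 in the comment above would come from.
    path = annotations.get("prometheus.io/path", DEFAULT_METRICS_PATH)
    port = annotations.get("prometheus.io/port", str(default_port))
    return f"http://{pod_ip}:{port}{path}"


# Hypothetical Pod exposing metrics on a non-default path.
annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "8080",
    "prometheus.io/path": "/custom/metrics",
}
print(scrape_url("10.0.0.5", annotations))
```

If the application serves metrics somewhere other than /metrics, only the path annotation needs to change; the scrape side stays untouched.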

Tue, Apr 16

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Tue, Apr 16, 6:26 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm created P60658 check-apparmor_seccomp.sh.
Tue, Apr 16, 4:22 PM
JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Tue, Apr 16, 1:28 PM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Tue, Apr 16, 1:25 PM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Tue, Apr 16, 12:29 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Tue, Apr 16, 12:28 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Tue, Apr 16, 11:18 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm claimed T353464: Migrate wikikube control planes to hardware nodes.
Tue, Apr 16, 11:04 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T358936: Kubernetes apiserver probe failures on restart.

We had this happening again in eqiad today because of a (planned) apiserver safe restart. We'll prioritize T353464: Migrate wikikube control planes to hardware nodes to give more resources to wikikube apiservers.

Tue, Apr 16, 11:03 AM · Prod-Kubernetes, serviceops, SRE
JMeybohm raised the priority of T287491: Allow to address Kubernets API servers from NetworkPolicy from Low to High.

We should prioritize T353464: Migrate wikikube control planes to hardware nodes because of T358936: Kubernetes apiserver probe failures on restart. Would be nice to have this done to lower configuration overhead, so raising this as well.

Tue, Apr 16, 11:02 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm raised the priority of T353464: Migrate wikikube control planes to hardware nodes from Medium to High.
Tue, Apr 16, 11:00 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

Updating (restarting) rsyslog in wikikube codfw again led to quite a bump in events, followed by a "(LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-codfw" alert. Restarting rsyslog cluster-wide after a week may be a viable way to determine whether rsyslog is still "losing" messages or getting stuck in some error state: in my understanding, there should be no bump of events after a restart if everything is fine.

Tue, Apr 16, 8:58 AM · Patch-For-Review, Observability-Logging, serviceops

Mon, Apr 15

JMeybohm updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 4:49 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T315560: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters.

@JMeybohm Is this something still needed?

Mon, Apr 15, 3:47 PM · Infrastructure-Foundations, SRE-tools, Spicerack
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 2:32 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 2:29 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T362518: Deprecate buster-backports.

Production images rebuild is done:

Mon, Apr 15, 1:51 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 1:50 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 10:57 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
JMeybohm added a comment to T362408: Migration to containerd and away from docker.

@akosiaris could you please double check in your test environment that containerd will still enforce the default apparmor profile (see Remove apparmor.security.beta.kubernetes.io/defaultProfileName in T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21) like docker currently does?

Mon, Apr 15, 8:03 AM · Prod-Kubernetes, Kubernetes, serviceops

Fri, Apr 12

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Fri, Apr 12, 7:46 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Fri, Apr 12, 7:43 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Fri, Apr 12, 7:43 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Fri, Apr 12, 7:42 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm renamed T290020: Enable audit logging for kube-apiserver from Evaluate and enable audit logging for kube-apiserver to Enable audit logging for kube-apiserver.
Fri, Apr 12, 7:39 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I've added a more comprehensive list of @elukey's tests at https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement#Violation_error_handling
Bottom line: with PSPs and VAPs we only get events; with PSS and kyverno we get additional user warnings (or even full rejections in the case of kyverno).

Fri, Apr 12, 5:48 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

Elastic Integrations aren't available to us in an OpenSearch world. However, the mapping data from that link would be useful if we choose to transform these logs to ECS and not use a dedicated index.

Fri, Apr 12, 3:40 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

I've merged the attached patch and the logs are ingested into the logstash-k8s- index. Unfortunately the event dates are off as the date of ingestion is used instead of a timestamp from the actual data. I suppose this is something that will be fixed by using a dedicated index

Fri, Apr 12, 7:55 AM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
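
A minimal sketch of taking the event time from the audit data itself rather than from the time of ingestion. It assumes the audit.k8s.io/v1 Event format, which carries a requestReceivedTimestamp field with microsecond precision; the function name and the sample line are hypothetical, and this is not how the logstash pipeline is actually configured.

```python
import json
from datetime import datetime, timezone


def event_time(audit_line):
    """Parse one audit log JSON line and return its request timestamp (UTC)."""
    event = json.loads(audit_line)
    raw = event["requestReceivedTimestamp"]
    # Timestamps look like 2024-04-12T07:50:01.123456Z (always UTC).
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical, heavily shortened audit event.
sample = (
    '{"kind": "Event", "apiVersion": "audit.k8s.io/v1", "verb": "get", '
    '"requestReceivedTimestamp": "2024-04-12T07:50:01.123456Z"}'
)
print(event_time(sample).isoformat())
```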
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Fri, Apr 12, 7:32 AM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

Thanks! Elastic does maintain an audit_log integration at https://github.com/elastic/integrations/tree/main/packages/kubernetes/data_stream/audit_logs - does that make things easier?

Fri, Apr 12, 7:26 AM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes

Thu, Apr 11

JMeybohm updated the task description for T290020: Enable audit logging for kube-apiserver.
Thu, Apr 11, 12:48 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a project to T290020: Enable audit logging for kube-apiserver: Observability-Logging.
Thu, Apr 11, 12:44 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes

Wed, Apr 10

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Wed, Apr 10, 3:04 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T359423: Migrate charts to Calico Network Policies.

Grouped the todo list by chart, some of those also need mesh.configuration updates due to T346638: Rename the envoy's uses_ingress option to sets_sni which could be bundled with this

Wed, Apr 10, 12:46 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Wed, Apr 10, 12:45 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T346638: Rename the envoy's uses_ingress option to sets_sni .

Unfortunately version 1.4.3 of mesh.configuration still uses uses_ingress in one if-block. So the initially assumed version requirement was not correct and there are still a bunch of charts to update. :/

Wed, Apr 10, 12:37 PM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Wed, Apr 10, 12:34 PM · Patch-For-Review, Machine-Learning-Team, serviceops

Tue, Apr 9

JMeybohm updated the task description for T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator.
Tue, Apr 9, 10:47 AM · Patch-For-Review, Data-Platform-SRE, Wikidata, Wikidata-Query-Service
JMeybohm added a comment to T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error.

Can someone clarify what the problem here is? From WBQC’s perspective, it’s totally expected that some of these regex checks will fail (though there’s some confusion about which shellbox errors we should or shouldn’t try to catch, see T304084 and especially T304084#8561863). But we might need to make some changes to keep the service mesh monitoring happy? (“exceeding retry limit” also sounds concerning – we don’t really want these requests to be retried, I think.)

Tue, Apr 9, 9:36 AM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox

Mon, Apr 8

JMeybohm renamed T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error from shellbox-constraints listener is constantly exceeding the retry limit to shellbox-constraints returning 500 on preg_match error.
Mon, Apr 8, 4:03 PM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox
JMeybohm added a comment to T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error.
Mon, Apr 8, 2:57 PM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox
JMeybohm created T362084: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error.
Mon, Apr 8, 2:15 PM · Wikidata Dev Team, Patch-For-Review, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Wikibase-Quality-Constraints, serviceops, Shellbox
JMeybohm added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

We had a bunch of errors again today. All of them were connection errors to eventgate-analytics and eventgate-main, leading to 503s after exceeding the retry limit. There were "ERROR ferm input drop default policy not set, ferm might not have been started correctly" alerts during that time, but I'm not convinced those are related.

Mon, Apr 8, 2:01 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
JMeybohm added a comment to T361706: 2024-04-03 calico/typha down.

Regarding the number of connections per typha: There is a logic in typha that will balance the connections between instances by disconnecting the ones exceeding a dynamic threshold that is calculated based on the number of typhas and the number of nodes (see https://github.com/projectcalico/calico/blob/v3.23.3/typha/pkg/k8s/rebalance.go#L35).

Mon, Apr 8, 10:04 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
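
The rebalancing idea from the comment above can be illustrated roughly like this. Note that this is not Calico's actual formula (that lives in the linked rebalance.go); the headroom percentage, the function names and the sample numbers are assumptions made purely for illustration.

```python
# Sketch: derive a per-instance connection limit from the number of nodes and
# typha instances, then drop whatever exceeds it. NOT Calico's real formula;
# headroom_pct is a hypothetical parameter.
def max_conn_limit(num_nodes, num_typhas, headroom_pct=20):
    """Fair share per instance plus some headroom, rounded up."""
    if num_typhas < 1:
        raise ValueError("need at least one typha instance")
    # Integer ceiling division to avoid float rounding surprises.
    return -(-num_nodes * (100 + headroom_pct) // (num_typhas * 100))


def connections_to_drop(current_connections, limit):
    """How many clients an instance would disconnect to get back under limit."""
    return max(0, current_connections - limit)


limit = max_conn_limit(num_nodes=636, num_typhas=3)
print(limit, connections_to_drop(300, limit))
```

Because the limit is recomputed from the current instance and node counts, it rises automatically when an instance disappears, which lets the survivors accept the reconnecting clients.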
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Mon, Apr 8, 9:53 AM · Patch-For-Review, Machine-Learning-Team, serviceops
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Mon, Apr 8, 8:39 AM · JavaScript, Phabricator
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Mon, Apr 8, 8:33 AM · JavaScript, Phabricator
JMeybohm edited P11638 smaller_gerritbot_comments.js.
Mon, Apr 8, 8:32 AM · JavaScript, Phabricator
JMeybohm added a comment to T361706: 2024-04-03 calico/typha down.

The investigation from last Friday showed that the first failed probe for calico-typha-75d4649699-h7vgq was recorded at 13:10:31, which was probably a consequence of the process not being able to allocate additional memory (Apr 3 13:10:28 kubernetes1022 kernel: [18672041.401461] calico-typha invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=99). That typha instance had ~230 clients connected at that time (out of ~636), which then must have tried to re-establish connections with one of the remaining 2 typhas, pushing those over the memory limit as well.
After updating the grafana dashboard a bit, it became clear that the client distribution between the 3 typha instances is naturally pretty uneven, and so is the memory usage. As the most loaded instance already reaches ~500MiB (of its 600MiB limit), I will increase the limit again to 1GiB per instance.

Mon, Apr 8, 8:18 AM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
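
A back-of-the-envelope sketch of the failover load described above, using the client counts from the comment. It assumes, purely for illustration, that clients are spread evenly across the surviving instances, which the comment notes is not the case in practice; the function name is hypothetical.

```python
# Sketch: average client count per surviving typha before and after one
# instance fails. Assumes an even split across survivors (a simplification).
def failover_load(total_clients, failed_clients, survivors):
    remaining = total_clients - failed_clients
    before = remaining / survivors    # average per survivor before the failure
    after = total_clients / survivors  # average once all clients reconnected
    return before, after


before, after = failover_load(total_clients=636, failed_clients=230, survivors=2)
print(f"per surviving instance: ~{before:.0f} -> ~{after:.0f} clients "
      f"({after / before - 1:.0%} more)")
```

Even under the optimistic even-split assumption, each survivor ends up with roughly half as many clients again, which is consistent with the remaining instances being pushed over their memory limit.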

Fri, Apr 5

JMeybohm reopened T357616: Logs from containers sometimes not visible in logstash as "Open".

This is still happening. K8s event logs seem suspiciously empty on April 4th for eqiad for example.

Fri, Apr 5, 1:26 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Fri, Apr 5, 10:15 AM · Patch-For-Review, Machine-Learning-Team, serviceops

Thu, Apr 4

JMeybohm closed T357616: Logs from containers sometimes not visible in logstash as Resolved.

The restarts do work properly and we've not seen "Too many open files" errors since their implementation. I'm optimistically resolving this for now

Thu, Apr 4, 11:30 AM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T361728: SwaggerProbeHasFailures for citoid since last deployment.

Weirdly it looks like that first downtime I see in the grafana was everyone? Since the swagger endpoint is unhappy for everyone?

That was because of T361706: 2024-04-03 calico/typha down

Thu, Apr 4, 11:22 AM · Patch-For-Review, serviceops-radar, Citoid

Wed, Apr 3

JMeybohm added a project to T361728: SwaggerProbeHasFailures for citoid since last deployment: serviceops-radar.
Wed, Apr 3, 5:50 PM · Patch-For-Review, serviceops-radar, Citoid
JMeybohm created T361728: SwaggerProbeHasFailures for citoid since last deployment.
Wed, Apr 3, 5:50 PM · Patch-For-Review, serviceops-radar, Citoid
JMeybohm added a comment to T290020: Enable audit logging for kube-apiserver.

Observability-Logging, could you maybe advise on if/how/where we could potentially store these audit logs to make them more accessible?
They come as JSON lines in the format specified at https://kubernetes.io/docs/reference/config-api/apiserver-audit.v1/#audit-k8s-io-v1-Event

Wed, Apr 3, 12:50 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm updated the task description for T290020: Enable audit logging for kube-apiserver.
Wed, Apr 3, 12:42 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes

Thu, Mar 28

JMeybohm renamed T290020: Enable audit logging for kube-apiserver from Evaluate and enable audit logging for kubeapi-server to Evaluate and enable audit logging for kube-apiserver.
Thu, Mar 28, 3:59 PM · Observability-Logging, Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm updated the task description for T343787: Find a replacement for the unmaintained eventrouter.
Thu, Mar 28, 11:10 AM · Technical-Debt, serviceops, Prod-Kubernetes, Kubernetes

Wed, Mar 27

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

A systemd timer has been deployed to all kubernetes nodes that checks every hour whether rsyslog has accumulated >10k fds pointing to deleted files/folders and restarts it if that's the case... let's see.

Wed, Mar 27, 4:37 PM · Patch-For-Review, Observability-Logging, serviceops
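
The hourly check described above could look roughly like the following. This is only a sketch of the idea, not the script that was deployed: it assumes Linux, where /proc/<pid>/fd symlinks to removed files end in " (deleted)", and all function names and sample paths are hypothetical.

```python
import os

FD_THRESHOLD = 10_000  # from the comment above: restart above 10k such fds


def fd_targets(pid):
    """Yield the symlink target of every open fd of `pid` (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            yield os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()


def count_deleted(targets):
    """Count fd targets that point to files or directories that were deleted."""
    return sum(1 for target in targets if target.endswith(" (deleted)"))


def needs_restart(deleted_fds, threshold=FD_THRESHOLD):
    return deleted_fds > threshold


# Hypothetical fd targets; a real check would pass fd_targets(<rsyslog pid>)
# and, if needs_restart() is true, restart the rsyslog service.
sample_targets = [
    "/var/log/containers/a.log",
    "/var/log/pods/b/0.log (deleted)",
    "socket:[12345]",
]
deleted = count_deleted(sample_targets)
print(deleted, needs_restart(deleted))
```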
JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Wed, Mar 27, 4:16 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

btw, in case this is relevant, https://logstash.wikimedia.org/goto/b060c8f0c137245fc0d63b9329583abe shows a spike of a bunch of requests with the same request ID, but those logs are for handling web requests. I can file a separate task for that if you think it's worth investigating further.

Wed, Mar 27, 1:45 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

This seems to be an issue with how imfile implements inotify watches for symlinks (or symlinks to symlinks maybe?):

Wed, Mar 27, 12:19 PM · Patch-For-Review, Observability-Logging, serviceops

Tue, Mar 26

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Tue, Mar 26, 4:03 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

[26.03.24 14:55] <jinxer-wm> (KubernetesRsyslogDown) firing: rsyslog on mw1483:9105 is missing kubernetes logs

Tue, Mar 26, 3:24 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).

Deployed v0.0.3 of the chart incl. rdb to all wikikube and staging clusters as well as dse

Tue, Mar 26, 2:27 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T360612: Add redis (rdb) instances to external-services as Resolved.
Tue, Mar 26, 1:32 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T360612: Add redis (rdb) instances to external-services, a subtask of T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies), as Resolved.
Tue, Mar 26, 1:30 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mon, Mar 25

JMeybohm edited P11638 smaller_gerritbot_comments.js.
Mon, Mar 25, 10:18 AM · JavaScript, Phabricator

Fri, Mar 22

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

Unfortunately we have not gained any insights from the new metrics (dashboard at https://grafana-rw.wikimedia.org/d/KimNkFTIk/jayme-omkafka) so far. We were also not able to spot another occurrence of this issue or to reproduce it.

Fri, Mar 22, 2:14 PM · Patch-For-Review, Observability-Logging, serviceops
BTullis awarded T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) a Hungry Hippo token.
Fri, Mar 22, 1:44 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Thu, Mar 21

JMeybohm added a comment to T360637: Bump memory for registry[12]00[34] VMs.

Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase. No need for extra steps

Thu, Mar 21, 2:28 PM · Patch-For-Review, serviceops, Machine-Learning-Team
JMeybohm created T360612: Add redis (rdb) instances to external-services.
Thu, Mar 21, 10:13 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Prod-Kubernetes, Kubernetes, serviceops

Mar 19 2024

JMeybohm updated subscribers of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I've summarized my findings at https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement @akosiaris, @elukey: I'd like you to take a look and ask questions if you find the time.

Mar 19 2024, 5:29 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).

I now think we might have misunderstood each other in T359334. I do think your proposal in https://phabricator.wikimedia.org/T359334#9606340 is the way to go, but what I initially meant was grouping the source data in global_config in a similar way. I'm replying here because this relates to the external-services chart and how we generate data for it.

Mar 19 2024, 10:11 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mar 18 2024

JMeybohm added a comment to T359416: Add Dragonfly to the ML k8s clusters.

That's right. By default the supernode acts as a CDN in front of the docker-registry, but I intentionally disabled that behavior as there's no benefit to it in our infra. I would assume that the supernodes still query the registry for some sanity-check data, but the seeding happens directly from the docker-registry instances.

Mar 18 2024, 3:17 PM · Machine-Learning-Team
JMeybohm removed projects from T358489: mw2420-mw2451 do have unnecessary raid controllers (configured): ops-codfw, DC-Ops.

@JMeybohm hello is there anything DC-ops need to do on this task?

Mar 18 2024, 1:25 PM · SRE, serviceops
JMeybohm added a comment to T359416: Add Dragonfly to the ML k8s clusters.

Afaics from the logs, the clients were getting chunks of data from the registry every time (not the entire content at once), but I am wondering whether all the docker images on ml-staging2002 should have been pulled entirely from ml-staging2001. Anything that I am missing, @JMeybohm?

Mar 18 2024, 8:44 AM · Machine-Learning-Team

Mar 12 2024

JMeybohm updated subscribers of T344478: Fix how we keep docker-pkg based images up to date.
Mar 12 2024, 1:15 PM · Release Pipeline (Blubber), docker-pkg, serviceops

Mar 8 2024

JMeybohm added a comment to T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images.

Are Dragonfly's supernodes sharable between clusters? We are interested in adding Dragonfly (it has been in our backlog for a long time), but I can't tell from the docs whether we should create a p2p network shared between clusters or not.

@JMeybohm Care to answer this one? ^. My impression is no, but I am not sure.

Yep, that's on the move already T359416: Add Dragonfly to the ML k8s clusters

Mar 8 2024, 2:32 PM · Machine-Learning-Team
JMeybohm added a comment to T359633: Strategy for Envoy metrics and Prometheus.

Just to create the reference, I would assume this is a consequence of T290536: Serve production traffic via Kubernetes and friends.

Mar 8 2024, 2:25 PM · User-fgiunchedi, Patch-For-Review, Observability-Metrics, MW-on-K8s
JMeybohm lowered the priority of T256762: Fix nginx config and caching for docker registry from Medium to Low.
Mar 8 2024, 2:22 PM · serviceops, Kubernetes, SRE
JMeybohm added a comment to T256762: Fix nginx config and caching for docker registry .
  • Requests for the catalog are not cached
    • curl -I -XGET 'https://docker-registry.wikimedia.org/v2/_catalog'

catalog is now cached.

Mar 8 2024, 2:21 PM · serviceops, Kubernetes, SRE

Mar 7 2024

JMeybohm added a comment to T359416: Add Dragonfly to the ML k8s clusters.

I think it's fine to use the existing supernodes. They act as coordinators only, so there is not much load or network traffic even during mw-deployments.

Mar 7 2024, 10:51 AM · Machine-Learning-Team

Mar 6 2024

JMeybohm added a comment to T359334: Create a networkpolicy template allowing charts to define a Calico Network policy to external services.

Yes, exactly. I think we would benefit from breaking with the current structure and instead create a generic one for helm chart values files as well as for the values.yaml we generate via global_config.pp. We can keep the old global_config.pp structure (AIUI that's just for kafka and zookeeper) around until we've migrated all relevant charts which should not take that long.

Mar 6 2024, 1:29 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T359334: Create a networkpolicy template allowing charts to define a Calico Network policy to external services.

Thanks for the writeup! I pretty much agree with your plan/idea, although I would suggest adding a separate module (modules/base/external-services-networkplicy_1.0.0.yaml or similar) so we can create new versions independently from the base/networkpolicy module.

Mar 6 2024, 1:13 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mar 5 2024

JMeybohm added a comment to T348466: Rethink kubernetes etcd storage.

@Clement_Goubert are you okay with merging this into T353464: Migrate wikikube control planes to hardware nodes?

Mar 5 2024, 11:43 AM · Prod-Kubernetes, serviceops
JMeybohm added a subtask for T341984: Update Kubernetes clusters to >1.25: T348466: Rethink kubernetes etcd storage.
Mar 5 2024, 11:39 AM · Shared-Data-Infrastructure, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a parent task for T348466: Rethink kubernetes etcd storage: T341984: Update Kubernetes clusters to >1.25.
Mar 5 2024, 11:39 AM · Prod-Kubernetes, serviceops
JMeybohm added a parent task for T353464: Migrate wikikube control planes to hardware nodes: T341984: Update Kubernetes clusters to >1.25.
Mar 5 2024, 11:38 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a subtask for T341984: Update Kubernetes clusters to >1.25: T353464: Migrate wikikube control planes to hardware nodes.
Mar 5 2024, 11:38 AM · Shared-Data-Infrastructure, Kubernetes, Prod-Kubernetes, serviceops