JMeybohm
User

Projects (7)

User Details

User Since
Apr 2 2020, 9:01 AM (111 w, 2 d)
Availability
Available
IRC Nick
jayme
LDAP User
JMeybohm
MediaWiki User
JMeybohm (WMF) [ Global Accounts ]

Recent Activity

Thu, May 19

JMeybohm added a comment to T306165: Replace kubeyaml in deployment-charts CI.

Frankly I'd say we can keep this relatively simple. Building a Debian package doesn't really give us any advantage, as I doubt we'd use it anywhere but in the helm-linter image.

Instead, I would propose that you create a local repo on either GitLab or Gerrit, with a script that checks out just the parts we want from upstream; we then clone it inside the image.

Thu, May 19, 10:44 AM · Patch-For-Review, Kubernetes, serviceops
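
The "script that checks out just the parts we want from upstream" proposed in the comment above could look roughly like the following. This is a minimal sketch, assuming git 2.25 or later (for `--sparse` and `git sparse-checkout`); the upstream URL, target directory and path names are hypothetical placeholders, not the repository discussed in the task.

```python
def sparse_clone_commands(url, dest, paths):
    """Build git commands that fetch only `paths` from a large upstream repo."""
    # Shallow, blob-less, sparse clone: only top-level files are checked out.
    clone = ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse", url, dest]
    # Widen the sparse checkout to the directories we actually want.
    select = ["git", "-C", dest, "sparse-checkout", "set", *paths]
    return [clone, select]


# Hypothetical upstream URL and directory names, for illustration only.
commands = sparse_clone_commands(
    "https://example.org/upstream/schemas.git",
    "schemas",
    ["v1.23.0", "v1.16.0"],
)
for cmd in commands:
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` to actually perform the checkout.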

Wed, May 18

JMeybohm added a parent task for T306165: Replace kubeyaml in deployment-charts CI: T307943: Update Kubernetes clusters to v1.23.
Wed, May 18, 1:40 PM · Patch-For-Review, Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T306165: Replace kubeyaml in deployment-charts CI.
Wed, May 18, 1:40 PM · Kubernetes, Prod-Kubernetes, serviceops

Tue, May 17

JMeybohm added a comment to T306165: Replace kubeyaml in deployment-charts CI.

The kubeconform Debian package is ready as well (needs a Gerrit repo etc.), but I'm not sure about the best way to deal with the Kubernetes JSON schema (the repo is quite big, 14GB on disk).

Tue, May 17, 5:25 PM · Patch-For-Review, Kubernetes, serviceops
JMeybohm claimed T306165: Replace kubeyaml in deployment-charts CI.
Tue, May 17, 6:42 AM · Patch-For-Review, Kubernetes, serviceops
JMeybohm added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

Regarding the "fake nodes": I think that could be done by adding the leafs as a GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

Reading that, I am under the impression it won't work because it only applies to network policies. Can't hurt to try though.

Yeah. Some other docs suggest they can be used in selector fields in general, though.

Tue, May 17, 6:12 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops

Mon, May 16

JMeybohm updated the task description for T306165: Replace kubeyaml in deployment-charts CI.
Mon, May 16, 2:54 PM · Patch-For-Review, Kubernetes, serviceops
JMeybohm updated the task description for T306165: Replace kubeyaml in deployment-charts CI.
Mon, May 16, 2:51 PM · Patch-For-Review, Kubernetes, serviceops
JMeybohm added a comment to T306165: Replace kubeyaml in deployment-charts CI.

Looking at datree, it's very interesting that it can be installed as a helm plugin, so that it can be used on helm charts from helm itself; however, I don't think it can be integrated with helmfile, so we should use it as a standalone CLI tool on the generated YAML anyway.

Mon, May 16, 2:47 PM · Patch-For-Review, Kubernetes, serviceops
JMeybohm updated the task description for T306165: Replace kubeyaml in deployment-charts CI.
Mon, May 16, 10:19 AM · Patch-For-Review, Kubernetes, serviceops

Sun, May 15

JMeybohm added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

We should be aware that changing the topology annotations also changes scheduling behavior. As of now, the scheduler will try to schedule Pods of the same ReplicaSet across different rows (as zone contains just the row). Changing zone to include the rack will make each rack unique, so we lower the likelihood of a ReplicaSet's Pods spanning multiple rows. That might have unwanted implications in case of power or network issues in one row.

Sun, May 15, 1:49 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
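
The scheduling concern in the comment above can be illustrated with a toy model. This is a minimal sketch, assuming a hypothetical four-node cluster and a deliberately simplified spreading rule (one replica per zone, zones visited in sorted order); the real kube-scheduler scores nodes rather than following a fixed order, and all names here are made up for illustration.

```python
from collections import namedtuple

Node = namedtuple("Node", ["name", "row", "rack"])

# Hypothetical layout: two rows (A, B) with two racks each.
NODES = [
    Node("node1", "A", "1"),
    Node("node2", "A", "2"),
    Node("node3", "B", "1"),
    Node("node4", "B", "2"),
]


def zone_by_row(node):
    """Zone label contains just the row."""
    return node.row


def zone_by_rack(node):
    """Zone label contains row and rack, so every rack is its own zone."""
    return f"{node.row}{node.rack}"


def spread(nodes, replicas, zone_of):
    """Place one replica per zone, visiting zones in sorted order (simplified)."""
    by_zone = {}
    for node in nodes:
        by_zone.setdefault(zone_of(node), []).append(node)
    zones = sorted(by_zone)
    # Wrap around if there are more replicas than zones.
    return [by_zone[zones[i % len(zones)]][0] for i in range(replicas)]


def rows_used(placement):
    return {node.row for node in placement}


by_row = spread(NODES, 2, zone_by_row)
by_rack = spread(NODES, 2, zone_by_rack)
print(sorted(rows_used(by_row)))
print(sorted(rows_used(by_rack)))
```

In this model, spreading over row-level zones puts the two replicas in different rows, while spreading over rack-level zones can satisfy "different zones" with two racks of the same row.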

Fri, May 13

JMeybohm updated the task description for T306165: Replace kubeyaml in deployment-charts CI.
Fri, May 13, 2:43 PM · Patch-For-Review, Kubernetes, serviceops
JMeybohm updated the task description for T307943: Update Kubernetes clusters to v1.23.
Fri, May 13, 1:23 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

Do we really think we need a global/shared cache directory? AIUI it is used to:

Fri, May 13, 11:46 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops

Wed, May 11

JMeybohm updated the task description for T307943: Update Kubernetes clusters to v1.23.
Wed, May 11, 1:17 PM · Kubernetes, Prod-Kubernetes, serviceops

Tue, May 10

JMeybohm added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

Hi folks! In T307927 I am trying to figure out why ml-team deployers (not in the deployment group) are not able to use helmfile/helm when there are chart changes. Is it possible that this is due to the HELM_CACHE_HOME permission changes? The ml-team can join the deployment group without issues; we didn't need it since we only deploy to the ML team's k8s clusters.

Tue, May 10, 1:20 PM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
JMeybohm updated the task description for T307943: Update Kubernetes clusters to v1.23.
Tue, May 10, 11:35 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T299236: Move away from system:node RBAC role.
Tue, May 10, 11:32 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a parent task for T299236: Move away from system:node RBAC role: T307943: Update Kubernetes clusters to v1.23.
Tue, May 10, 11:32 AM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T290963: Drop the use of nonexisting groups in kubernetes infrastructure_users.
Tue, May 10, 11:31 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a parent task for T290963: Drop the use of nonexisting groups in kubernetes infrastructure_users: T307943: Update Kubernetes clusters to v1.23.
Tue, May 10, 11:31 AM · Prod-Kubernetes, Kubernetes

Mon, May 9

JMeybohm added a parent task for T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches: T307943: Update Kubernetes clusters to v1.23.
Mon, May 9, 5:21 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.
Mon, May 9, 5:21 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm renamed T244335: Upgrade kubernetes clusters to v1.16 from Upgrade kubernetes clusters to a security supported (LTS) version to Upgrade kubernetes clusters v1.16.
Mon, May 9, 5:20 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm renamed T244335: Upgrade kubernetes clusters to v1.16 from Upgrade kubernetes clusters v1.16 to Upgrade kubernetes clusters to v1.16.
Mon, May 9, 5:20 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T307943: Update Kubernetes clusters to v1.23.
Mon, May 9, 5:19 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm removed a subtask for T244335: Upgrade kubernetes clusters to v1.16: Unknown Object (Task).
Mon, May 9, 5:18 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: Unknown Object (Task).
Mon, May 9, 5:18 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm removed a subtask for T244335: Upgrade kubernetes clusters to v1.16: T300499: Migrate from command line flags to config files for kubernetes components.
Mon, May 9, 5:17 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T300499: Migrate from command line flags to config files for kubernetes components.
Mon, May 9, 5:17 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm edited parent tasks for T300499: Migrate from command line flags to config files for kubernetes components, added: T307943: Update Kubernetes clusters to v1.23; removed: T244335: Upgrade kubernetes clusters to v1.16.
Mon, May 9, 5:17 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm removed a subtask for T244335: Upgrade kubernetes clusters to v1.16: T270271: Target Sources (component/kubernetes-future/source/Sources) is configured multiple times.
Mon, May 9, 5:17 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm removed a subtask for T278329: Support multiple kubernetes versions with puppet: T270271: Target Sources (component/kubernetes-future/source/Sources) is configured multiple times.
Mon, May 9, 5:17 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T270271: Target Sources (component/kubernetes-future/source/Sources) is configured multiple times.
Mon, May 9, 5:17 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm edited parent tasks for T270271: Target Sources (component/kubernetes-future/source/Sources) is configured multiple times, added: T307943: Update Kubernetes clusters to v1.23; removed: T278329: Support multiple kubernetes versions with puppet, T244335: Upgrade kubernetes clusters to v1.16.
Mon, May 9, 5:17 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm removed a subtask for T244335: Upgrade kubernetes clusters to v1.16: T278329: Support multiple kubernetes versions with puppet.
Mon, May 9, 5:12 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T278329: Support multiple kubernetes versions with puppet.
Mon, May 9, 5:12 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm edited parent tasks for T278329: Support multiple kubernetes versions with puppet, added: T307943: Update Kubernetes clusters to v1.23; removed: T244335: Upgrade kubernetes clusters to v1.16.
Mon, May 9, 5:12 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm removed a subtask for T244335: Upgrade kubernetes clusters to v1.16: T270191: Add kubernetes 1.17+ topology annotations.
Mon, May 9, 5:12 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a subtask for T307943: Update Kubernetes clusters to v1.23: T270191: Add kubernetes 1.17+ topology annotations.
Mon, May 9, 5:12 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm edited parent tasks for T270191: Add kubernetes 1.17+ topology annotations, added: T307943: Update Kubernetes clusters to v1.23; removed: T244335: Upgrade kubernetes clusters to v1.16.
Mon, May 9, 5:12 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm triaged T307943: Update Kubernetes clusters to v1.23 as High priority.
Mon, May 9, 5:10 PM · Kubernetes, Prod-Kubernetes, serviceops

Fri, May 6

JMeybohm claimed T260661: Create a cookbook to perform a rolling reboot of a kubernetes cluster.
Fri, May 6, 12:52 PM · Patch-For-Review, Infrastructure-Foundations, Prod-Kubernetes, User-jijiki, SRE-tools, serviceops, SRE

Wed, May 4

JMeybohm updated the task description for T300879: Add a kubernetes module to spicerack.
Wed, May 4, 2:42 PM · Infrastructure-Foundations, serviceops, SRE-tools
JMeybohm added a comment to T303049: New Service Request: DataHub.

I understand that it's something to do with DNS Discovery - but it seems counter-intuitive to have to refer to a read-only DNS record if the service supports writing in both DCs. Have I misunderstood something, or is it just a quirk of the setup that I have to get used to?

Wed, May 4, 10:28 AM · Patch-For-Review, serviceops, Data-Catalog, Data-Engineering, Service-deployment-requests, Services, SRE

Tue, May 3

JMeybohm closed T290966: Implement POC for istio ingress, a subtask of T261277: Create a gateway in kubernetes for the execution of our "lambdas", as Resolved.
Tue, May 3, 3:08 PM · MW-on-K8s, serviceops, SRE
JMeybohm closed T290966: Implement POC for istio ingress as Resolved.

This is done with miscweb being the first full Ingress service and datahub following up.
Docs at https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress

Tue, May 3, 3:08 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T303049: New Service Request: DataHub.

I finally managed to verify and document the steps needed to put a service under Ingress. I did also update the general
https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service documentation (which contains a link to the Ingress specific part).
@BTullis: I'd very much like you to go over the new docs to verify those are useful to others. From what I remember datahub still needs:

  • most of the DNS CNAME records (currently only datahub-gms.discovery.wmnet exists)
  • service::catalog entries for datahub-frontend and datahub-gms
  • to make use of datahub-frontend.discovery.wmnet in hieradata/common/profile/trafficserver/backend.yaml
Tue, May 3, 9:10 AM · Patch-For-Review, serviceops, Data-Catalog, Data-Engineering, Service-deployment-requests, Services, SRE
JMeybohm closed T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress as Resolved.

This is now used by miscweb and documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service (more specifically https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress)

Tue, May 3, 9:03 AM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress, a subtask of T290966: Implement POC for istio ingress, as Resolved.
Tue, May 3, 9:03 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T306797: [Shared Event Platform] Investigate Event Service Platforms.

@JMeybohm I wonder if the main pain points were more around the fact that the WDQS Updater is stateful. If there is no state (other than Kafka consumer offsets, which are stored in Kafka), perhaps multi DC k8s deployment won't be as difficult.

I think the use cases we are targeting atm are stateless.

Tue, May 3, 9:00 AM · Epic, Generated Data Platform
JMeybohm added a comment to T307252: push-notifications: Validate APNS and FCM credentials on startup.

Happy to help with this however we can. So you know, this APNs key rotation isn't something we anticipate happening frequently, so we may not want to over-engineer an automated check for it. There is no expiry of the key. In this case, we wanted to rotate the key we had been using for dev before it goes to production, but from here out we don't expect to update it even once per year.

Tue, May 3, 8:40 AM · Push-Notification-Service, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, serviceops

Mon, May 2

JMeybohm renamed T307252: push-notifications: Validate APNS and FCM credentials on startup from push-notifications: follow-up task about APNS credentials to push-notifications: Validate APNS and FCM credentials on startup.
Mon, May 2, 9:13 AM · Push-Notification-Service, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, serviceops

Fri, Apr 29

JMeybohm closed T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users, a subtask of T302539: Deploy MediaWiki images for kubernetes from the deployment servers, as Resolved.
Fri, Apr 29, 10:02 AM · Release-Engineering-Team (Radar), serviceops, MW-on-K8s, Scap
JMeybohm closed T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users as Resolved.
Fri, Apr 29, 10:02 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops

Thu, Apr 28

JMeybohm added a comment to T307043: helm-linter started failing on operations/deployment-charts today.

I can reproduce that when running rake run_locally['default'] on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/787058 but not on master (a2857b739e8a1578e1b124c45d02e51cd940c363).

Thu, Apr 28, 9:44 AM · Release-Engineering-Team, serviceops
JMeybohm added a comment to T297140: New Service Request: developer-portal.

It says to test if everything is ok, when adding a new namespace, with:

kube_env $YOUR-SERVICE-NAME staging-codfw
kubectl get ns

When literally doing that I run into:

Error from server (Forbidden): namespaces is forbidden: User "image-suggestion" cannot list resource "namespaces" in API group "" at the cluster scope

I _know_ this happened to me back when I added miscweb and it was not actually an issue, but can you remind me what the correct fix to the docs is?

This is bad docs. The deployer Kubernetes accounts don't have (and don't need) permission to "get ns", that's why you get this error.
I've changed the docs to:

kube_env admin staging-codfw
kubectl describe ns $YOUR-SERVICE-NAME
Thu, Apr 28, 9:00 AM · Patch-For-Review, Goal, serviceops, Wikimedia-Developer-Portal, Service-deployment-requests

Wed, Apr 27

JMeybohm added a comment to T288546: Rotate APNS key before deploying Push Notifications to Production.

CCing @JMeybohm based on involvement in the original setup at T256973. Does that sound reasonable?

Wed, Apr 27, 10:56 AM · serviceops, iOS-app-v6.9-Carp-On-A-Zamboni, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog
JMeybohm closed T306827: Deploy Scap version 4.7.0 as Resolved.

Rolled out to canaries + deploy1002, still super slow as introduced with T305949

I created T306915 for further conversation on this topic.

Wed, Apr 27, 10:43 AM · User-brennen, Release-Engineering-Team, serviceops, Scap

Tue, Apr 26

JMeybohm added a comment to T303049: New Service Request: DataHub.

Please keep this open as it is absolutely in a hacky state currently (DNS + service::catalog wise)

Tue, Apr 26, 1:57 PM · Patch-For-Review, serviceops, Data-Catalog, Data-Engineering, Service-deployment-requests, Services, SRE
BTullis awarded T303049: New Service Request: DataHub a Mountain of Wealth token.
Tue, Apr 26, 1:57 PM · Patch-For-Review, serviceops, Data-Catalog, Data-Engineering, Service-deployment-requests, Services, SRE
JMeybohm added a project to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches: Prod-Kubernetes.
Tue, Apr 26, 7:22 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
JMeybohm added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.
  • WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mwdebug-deploy-eqiad.config is displayed by helmfile. There are times when it seems like it is upgrading the warning to an error, but it's unclear.

I don't think this ever gets elevated to an error (but I do get that it is annoying).

Tue, Apr 26, 7:13 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
JMeybohm added a comment to T306827: Deploy Scap version 4.7.0.

Rolled out to canaries + deploy1002, still super slow as introduced with T305949

Tue, Apr 26, 6:55 AM · User-brennen, Release-Engineering-Team, serviceops, Scap
JMeybohm claimed T306827: Deploy Scap version 4.7.0.
Tue, Apr 26, 6:37 AM · User-brennen, Release-Engineering-Team, serviceops, Scap

Apr 14 2022

JMeybohm added a comment to T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress.
  • The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored (T291946), see above.

I am not sure this is true. I see helm-charts not having an lvs: stanza and still having monitoring and icinga having those services just fine.

I think this is a lucky coincidence, as for both services the host config is created via a monitoring::service resource (in modules/profile/manifests/chartmuseum.pp and modules/profile/manifests/releases/common.pp). As there are no "real hosts" in the case of Ingress services, monitoring::service resources usually don't exist.

Apr 14 2022, 12:28 PM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated the task description for T306165: Replace kubeyaml in deployment-charts CI.
Apr 14 2022, 8:30 AM · Patch-For-Review, Kubernetes, serviceops
JMeybohm triaged T306165: Replace kubeyaml in deployment-charts CI as Low priority.
Apr 14 2022, 8:29 AM · Patch-For-Review, Kubernetes, serviceops
JMeybohm closed T304875: docker-report-releng failing on multiple image tags because of certificate validation error as Resolved.

I've removed the tags from the registry (https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images) and triggered docker-reporter-releng-images.service

Apr 14 2022, 8:12 AM · Release-Engineering-Team (Radar), Continuous-Integration-Infrastructure, serviceops
JMeybohm closed T305949: Deploy Scap version 4.6.1 as Resolved.
Apr 14 2022, 8:07 AM · Release-Engineering-Team (Radar), serviceops, Scap

Apr 13 2022

JMeybohm changed the status of T305949: Deploy Scap version 4.6.1 from Stalled to In Progress.

Would you be ok with going ahead with the deploy? We can take a deeper look in the future if the issue persists.

Apr 13 2022, 3:24 PM · Release-Engineering-Team (Radar), serviceops, Scap
JMeybohm added a comment to T305949: Deploy Scap version 4.6.1.

@JMeybohm I'll take a look. What host(s) did you run scap deploy --environment dev-cluster on?

Apr 13 2022, 9:19 AM · Release-Engineering-Team (Radar), serviceops, Scap
JMeybohm changed the status of T305949: Deploy Scap version 4.6.1 from Open to Stalled.
Apr 13 2022, 8:47 AM · Release-Engineering-Team (Radar), serviceops, Scap
JMeybohm added a comment to T305949: Deploy Scap version 4.6.1.

4.6.1 rolled out to canaries. scap pull works as usual; cd /srv/deployment/restbase/deploy/; scap deploy --environment dev-cluster feels way(!) slower than the last times I did this.

Apr 13 2022, 8:46 AM · Release-Engineering-Team (Radar), serviceops, Scap
JMeybohm claimed T305949: Deploy Scap version 4.6.1.
Apr 13 2022, 8:39 AM · Release-Engineering-Team (Radar), serviceops, Scap

Apr 11 2022

JMeybohm claimed T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

I'll prepare the needed patches.

Apr 11 2022, 2:26 PM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
JMeybohm renamed T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress from service:.catalog entries and dnsdisc for Kubernetes services under Ingress to service::catalog entries and dnsdisc for Kubernetes services under Ingress.
Apr 11 2022, 1:23 PM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm renamed T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress from service:.catalog entries and dnsdisc for Kubernetes sevrices under Ingress to service:.catalog entries and dnsdisc for Kubernetes services under Ingress.
Apr 11 2022, 8:29 AM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops

Apr 8 2022

JMeybohm updated the task description for T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress.
Apr 8 2022, 2:39 PM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T305435: Migrate kubernetes masters to bullseye as Resolved.
Apr 8 2022, 1:40 PM · Kubernetes, Prod-Kubernetes, serviceops

Apr 7 2022

JMeybohm updated the task description for T305435: Migrate kubernetes masters to bullseye.
Apr 7 2022, 1:39 PM · Kubernetes, Prod-Kubernetes, serviceops

Apr 5 2022

JMeybohm updated the task description for T305435: Migrate kubernetes masters to bullseye.
Apr 5 2022, 2:53 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T290966: Implement POC for istio ingress.
Apr 5 2022, 1:17 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm triaged T305435: Migrate kubernetes masters to bullseye as Medium priority.
Apr 5 2022, 8:55 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T305250: Deploy Scap version 4.6.0 as Resolved.

Rolled out everywhere

Apr 5 2022, 8:12 AM · serviceops, Release-Engineering-Team, Scap

Apr 4 2022

JMeybohm committed rLPRI7db74920701c: Move datahub secrets into the right subchart YAML structure (authored by JMeybohm).
Move datahub secrets into the right subchart YAML structure
Apr 4 2022, 10:49 AM
JMeybohm triaged T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress as High priority.
Apr 4 2022, 10:03 AM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm created T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress.
Apr 4 2022, 10:00 AM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T300740: Provide a convenient way to connect to services in kubernetes staging clusters as Resolved.

Something like curl -I https://miscweb.k8s-staging.discovery.wmnet:30443 now works by default.

Apr 4 2022, 9:51 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T300740: Provide a convenient way to connect to services in kubernetes staging clusters, a subtask of T290966: Implement POC for istio ingress, as Resolved.
Apr 4 2022, 9:50 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T305250: Deploy Scap version 4.6.0.

Basic checks on mwdebug and restbase deploy look fine. Will roll out fleet-wide tomorrow.

Apr 4 2022, 8:02 AM · serviceops, Release-Engineering-Team, Scap
JMeybohm claimed T305250: Deploy Scap version 4.6.0.
Apr 4 2022, 7:48 AM · serviceops, Release-Engineering-Team, Scap
JMeybohm closed T245272: Draft a plan for upgrading kubernetes machines to buster as Resolved.

We skipped buster with T300744

Apr 4 2022, 7:30 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T245272: Draft a plan for upgrading kubernetes machines to buster, a subtask of T247045: Migrate all of production metal and VMs to Buster or later, as Resolved.
Apr 4 2022, 7:30 AM · SRE, Epic

Apr 1 2022

JMeybohm updated the task description for T300740: Provide a convenient way to connect to services in kubernetes staging clusters.
Apr 1 2022, 8:25 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T304891: New Service Request Generated Datasets: Image Suggestions Service.

Mentioned in SAL (#wikimedia-operations) [2022-03-31T20:40:52Z] <mutante> reserving port 4017 for new k8s service request 'image-suggestions' T304891

Apr 1 2022, 7:55 AM · User-Eevans, Image-Suggestions, Patch-For-Review, serviceops, Generated Data Platform, Service-deployment-requests, Services, SRE
JMeybohm updated the task description for T305155: Blubber setup for Image Suggestions Service.
Apr 1 2022, 7:52 AM · Image-Suggestions, Patch-For-Review, serviceops, Generated Data Platform, Service-deployment-requests, Services, SRE

Mar 31 2022

JMeybohm added a comment to T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00).

The additional PODs won't be used, as a flink job does not scale automatically, so it would be a pure waste of resources (2.5G of reserved mem per additional POD). I guess it would help to improve redundancy in this scenario only if k8s assigns every POD to a distinct machine, in which case even with a single machine misbehaving flink would have enough redundancy to allocate the job to the spare POD. If k8s allocates randomly, or if there are not enough k8s worker nodes (1 spare POD in our case would mean spreading the PODs over 8 different machines), then it's probably not worth the waste of resources.

Mar 31 2022, 3:58 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
JMeybohm added a project to T212866: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker: Prod-Kubernetes.
Mar 31 2022, 2:40 PM · Prod-Kubernetes, Infrastructure-Foundations, Kubernetes, SRE, SRE-tools
JMeybohm added a comment to T301147: The WDQS streaming updater went unstable for several hours (2022-02-06T23:00:00 - 2022-02-07T06:20:00).

To be discussed with service ops:

  • Investigate and address the reasons why after a node failure k8s did not fulfill its promise of making sure that the rdf-streaming-updater deployment has 6 working replicas

The problem was more that the node did not really fail (to its full extent). It was heavily overloaded (for an unknown reason) and that's potentially why containers/processes running there seemed dead. But from the K8s perspective the Pods were still running, and a new Pod was scheduled as soon as I power cycled the node (i.e. K8s was able to detect a mismatch between desired and existing replicas).

Mar 31 2022, 1:40 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
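
The behaviour described in the comment above can be sketched as a toy reconciliation check. This is a minimal sketch under stated assumptions: the `pods_to_create` helper, the pod and node names, and the rule "only pods on nodes the control plane sees as ready count as existing" are hypothetical simplifications, not the actual Kubernetes controller logic.

```python
def pods_to_create(desired, pods, node_ready):
    """Return how many replacement pods a controller would create.

    pods: mapping of pod name -> node name
    node_ready: mapping of node name -> readiness as seen by the control plane
    """
    existing = sum(1 for node in pods.values() if node_ready[node])
    return max(desired - existing, 0)


# Six replicas, one per (hypothetical) node.
pods = {f"updater-{i}": f"node{i}" for i in range(6)}
node_ready = {f"node{i}": True for i in range(6)}

# node0 is overloaded but still reports ready: desired == existing, nothing happens.
overloaded = pods_to_create(6, pods, node_ready)

# After a power cycle node0 is seen as not ready, so its pod no longer counts.
node_ready["node0"] = False
after_power_cycle = pods_to_create(6, pods, node_ready)
print(overloaded, after_power_cycle)
```

As long as the overloaded node keeps reporting as ready, the desired and existing counts match and no replacement is scheduled; the mismatch only appears once the node is actually seen as down.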