
JMeybohm
User

User Details

User Since: Apr 2 2020, 9:01 AM (181 w, 1 d)
Availability: Available
IRC Nick: jayme
LDAP User: JMeybohm
MediaWiki User: JMeybohm (WMF)

Recent Activity

Fri, Sep 22

JMeybohm triaged T269684: [EPIC] Docker deprecation as a container runtime engine for kubernetes. as High priority.
Fri, Sep 22, 9:13 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm raised the priority of T341984: Update Kubernetes clusters to >1.25 from Medium to High.
Fri, Sep 22, 9:12 AM · Shared-Data-Infrastructure, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T345892: Reduction of Secret-based Service Account Tokens.
Fri, Sep 22, 9:10 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm merged T212866: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker into T260661: Create a cookbook to perform a rolling reboot of a kubernetes cluster.
Fri, Sep 22, 9:07 AM · Spicerack, Infrastructure-Foundations, Prod-Kubernetes, User-jijiki, SRE-tools, serviceops, SRE
JMeybohm merged task T212866: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker into T260661: Create a cookbook to perform a rolling reboot of a kubernetes cluster.
Fri, Sep 22, 9:07 AM · Spicerack, Prod-Kubernetes, Infrastructure-Foundations, Kubernetes, SRE, SRE-tools
JMeybohm added a project to T249929: Integrate kube-metrics-server into our infrastructure: Prod-Kubernetes.
Fri, Sep 22, 9:01 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a project to T316347: Helm chart packaging should update dependencies: Prod-Kubernetes.
Fri, Sep 22, 9:01 AM · Prod-Kubernetes, Patch-For-Review, serviceops, Kubernetes
JMeybohm added a project to T343801: Create kube-state-metrics docker image: Prod-Kubernetes.
Fri, Sep 22, 9:01 AM · Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a project to T343529: Prometheus doesn't reload or alert on expired client certificates: Prod-Kubernetes.
Fri, Sep 22, 9:00 AM · Prod-Kubernetes, SRE Observability (FY2023/2024-Q1), Observability-Metrics, User-fgiunchedi, Kubernetes, serviceops-radar
JMeybohm added a project to T337928: cfssl-issuer: Generate Kubernetes Events: Prod-Kubernetes.
Fri, Sep 22, 9:00 AM · Prod-Kubernetes, serviceops, Kubernetes
JMeybohm closed T318707: Don't scrape every containerPort for metrics as Resolved.
Fri, Sep 22, 8:59 AM · Machine-Learning-Team, Kubernetes, Observability-Metrics, serviceops
JMeybohm closed T289639: Document how k8s logging works as Resolved.
Fri, Sep 22, 8:59 AM · Kubernetes, serviceops
JMeybohm added a project to T256256: Raise an alarm on container restarts/OOMs in kubernetes: Prod-Kubernetes.
Fri, Sep 22, 8:57 AM · Prod-Kubernetes, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), serviceops, Kubernetes, ChangeProp
JMeybohm triaged T346971: Uncaught ConfigException: Failed to load configuration from etcd as Low priority.

I've checked the logs from September (https://logstash.wikimedia.org/goto/37581ef39fe3ed2251e9cf0e13d12445) where this happened 132 times as of now, usually in batches of around 13 messages coming from one single pod. Cross referencing with k8s events showed that this "sometimes" happens during the startup of a new mediawiki pod. I would assume there is some kind of race condition at play.

Fri, Sep 22, 8:09 AM · Patch-For-Review, MediaWiki-Platform-Team (Radar), MW-on-K8s, serviceops, MediaWiki-Configuration, User-brennen, Wikimedia-production-error

Thu, Sep 21

JMeybohm closed T324959: Scrape controller-manager and scheduler metrics, a subtask of T307943: Update Kubernetes clusters to v1.23, as Resolved.
Thu, Sep 21, 2:40 PM · Foundational Technology Requests, Shared-Data-Infrastructure, Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T324959: Scrape controller-manager and scheduler metrics as Resolved.

I've added two dashboards:

Thu, Sep 21, 2:40 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T324959: Scrape controller-manager and scheduler metrics, a subtask of T341984: Update Kubernetes clusters to >1.25, as Resolved.
Thu, Sep 21, 2:40 PM · Shared-Data-Infrastructure, Kubernetes, Prod-Kubernetes, serviceops

Wed, Sep 20

JMeybohm closed T212123: Kubernetes clusters roadmap as Resolved.

I'm going to resolve this one as we no longer use it

Wed, Sep 20, 5:06 PM · User-fsero, serviceops, Prod-Kubernetes
JMeybohm added a comment to T345839: Audit charts drift between staging and production.

Linking to T265979: Alert on unapplied changes in deployment-charts repo, as this is somewhat similar but not identical.

Wed, Sep 20, 5:03 PM · Prod-Kubernetes, serviceops, Kubernetes
JMeybohm closed T260964: Add templates for puppet_ca consumption as Declined.

Containers should have the wmf-certificates package installed which contains the puppet ca as well.

Wed, Sep 20, 5:02 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T275026: Use a separate key for service account token issuer as Resolved.

This has been resolved with the move to PKI in T307943: Update Kubernetes clusters to v1.23

Wed, Sep 20, 5:00 PM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a subtask for T341984: Update Kubernetes clusters to >1.25: T334234: Migrate to node-role.kubernetes.io/control-plane label/taint.
Wed, Sep 20, 4:57 PM · Shared-Data-Infrastructure, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm added a parent task for T334234: Migrate to node-role.kubernetes.io/control-plane label/taint: T341984: Update Kubernetes clusters to >1.25.
Wed, Sep 20, 4:57 PM · Prod-Kubernetes, Kubernetes
JMeybohm added a project to T264625: Deploy kube-state-metrics: Prod-Kubernetes.
Wed, Sep 20, 4:56 PM · Prod-Kubernetes, serviceops, User-jijiki, Kubernetes
JMeybohm added a project to T344154: Allow parallel image pulls in k8s: Prod-Kubernetes.
Wed, Sep 20, 4:56 PM · Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a project to T345839: Audit charts drift between staging and production: Prod-Kubernetes.
Wed, Sep 20, 4:55 PM · Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a project to T345823: Wikikube staging clusters are out of IPv4 Pod IP's: Prod-Kubernetes.
Wed, Sep 20, 4:55 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a project to T345892: Reduction of Secret-based Service Account Tokens: Prod-Kubernetes.
Wed, Sep 20, 4:54 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a project to T346915: Refactor discovery of calico-felix targets in prometheus: Prod-Kubernetes.
Wed, Sep 20, 4:54 PM · Patch-For-Review, Prod-Kubernetes, Observability-Metrics, serviceops, Kubernetes
JMeybohm added a project to T344171: Reverse DNS for k8s pods IPs: Prod-Kubernetes.
Wed, Sep 20, 4:54 PM · Prod-Kubernetes, Kubernetes
JMeybohm moved T346915: Refactor discovery of calico-felix targets in prometheus from Incoming 🐫 to ⎈Kubernetes on the serviceops board.
Wed, Sep 20, 3:18 PM · Patch-For-Review, Prod-Kubernetes, Observability-Metrics, serviceops, Kubernetes
JMeybohm triaged T346915: Refactor discovery of calico-felix targets in prometheus as Low priority.
Wed, Sep 20, 3:18 PM · Patch-For-Review, Prod-Kubernetes, Observability-Metrics, serviceops, Kubernetes
JMeybohm claimed T324959: Scrape controller-manager and scheduler metrics.
Wed, Sep 20, 2:39 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T324959: Scrape controller-manager and scheduler metrics.
Wed, Sep 20, 10:20 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T324959: Scrape controller-manager and scheduler metrics.
Wed, Sep 20, 10:17 AM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Wed, Sep 20, 9:01 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Tue, Sep 19

JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Tue, Sep 19, 10:17 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T328291: Post Kubernetes v1.23 cleanup as Resolved.
Tue, Sep 19, 10:02 AM · Patch-For-Review, Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T328291: Post Kubernetes v1.23 cleanup, a subtask of T307943: Update Kubernetes clusters to v1.23, as Resolved.
Tue, Sep 19, 10:02 AM · Foundational Technology Requests, Shared-Data-Infrastructure, Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) as Resolved.

Removed all the certs with [puppet-private] (23d9433a) and ran puppet on all masters without issue. Wikitech has been updated as well to remove all mentions of cergen.

Tue, Sep 19, 10:02 AM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm committed rLPRI4b98794014df: Drop kubernetes cergen certs (authored by JMeybohm).
Drop kubernetes cergen certs
Tue, Sep 19, 10:02 AM
JMeybohm closed T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen), a subtask of T307943: Update Kubernetes clusters to v1.23, as Resolved.
Tue, Sep 19, 10:01 AM · Foundational Technology Requests, Shared-Data-Infrastructure, Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen), a subtask of T328291: Post Kubernetes v1.23 cleanup, as Resolved.
Tue, Sep 19, 10:01 AM · Patch-For-Review, Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).
Tue, Sep 19, 9:59 AM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).
Tue, Sep 19, 9:03 AM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops

Mon, Sep 18

JMeybohm updated the task description for T346638: Rename the envoy's uses_ingress option to sets_sni .
Mon, Sep 18, 2:58 PM · Machine-Learning-Team, serviceops
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Mon, Sep 18, 2:46 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Mon, Sep 18, 2:01 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Mon, Sep 18, 1:20 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Mon, Sep 18, 12:42 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Mon, Sep 18, 12:32 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T345892: Reduction of Secret-based Service Account Tokens.
Mon, Sep 18, 12:00 PM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm updated the task description for T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).
Mon, Sep 18, 9:13 AM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops

Fri, Sep 15

JMeybohm updated the task description for T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).
Fri, Sep 15, 12:42 PM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm closed T332010: Migrate conf2* hosts to bullseye as Resolved.

Updated etcd-mirror package has been rolled out, resolving this again

Fri, Sep 15, 11:56 AM · serviceops, SRE
JMeybohm closed T332010: Migrate conf2* hosts to bullseye, a subtask of T291916: Tracking task for Bullseye migrations in production, as Resolved.
Fri, Sep 15, 11:55 AM · Epic, Infrastructure-Foundations, SRE
JMeybohm reopened T332010: Migrate conf2* hosts to bullseye as "Open".

SRE was paged due to EtcdReplicationDown. It turns out the etcdmirror web interface does not work with Python 3 on bullseye.

Fri, Sep 15, 7:09 AM · serviceops, SRE
JMeybohm reopened T332010: Migrate conf2* hosts to bullseye, a subtask of T291916: Tracking task for Bullseye migrations in production, as Open.
Fri, Sep 15, 7:09 AM · Epic, Infrastructure-Foundations, SRE

Thu, Sep 14

JMeybohm added a comment to T345738: etcd in codfw burned all latency SLO error budget.

conf2 nodes are on bullseye now and the metrics look better, as expected.

Thu, Sep 14, 4:10 PM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops
JMeybohm closed T332010: Migrate conf2* hosts to bullseye as Resolved.

This is done and clients (confd/pybal) are back on the cluster.

Thu, Sep 14, 4:00 PM · serviceops, SRE
JMeybohm closed T332010: Migrate conf2* hosts to bullseye, a subtask of T291916: Tracking task for Bullseye migrations in production, as Resolved.
Thu, Sep 14, 3:59 PM · Epic, Infrastructure-Foundations, SRE

Wed, Sep 13

JMeybohm renamed T346264: Improve wikifunctions logging from Improcve wikifunctions logging to Improve wikifunctions logging.
Wed, Sep 13, 4:59 PM · WikiLambda, Abstract Wikipedia team
JMeybohm created T346264: Improve wikifunctions logging.
Wed, Sep 13, 4:59 PM · WikiLambda, Abstract Wikipedia team
JMeybohm moved T345892: Reduction of Secret-based Service Account Tokens from Incoming 🐫 to ⎈Kubernetes on the serviceops board.
Wed, Sep 13, 11:14 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm claimed T332010: Migrate conf2* hosts to bullseye.
Wed, Sep 13, 8:56 AM · serviceops, SRE

Mon, Sep 11

JMeybohm updated the task description for T345892: Reduction of Secret-based Service Account Tokens.
Mon, Sep 11, 8:23 AM · Prod-Kubernetes, Kubernetes, serviceops

Fri, Sep 8

JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Fri, Sep 8, 2:03 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm created T345894: Improve jaeger ingress deployment .
Fri, Sep 8, 7:40 AM · User-fgiunchedi, Observability-Tracing
JMeybohm updated the task description for T345892: Reduction of Secret-based Service Account Tokens.
Fri, Sep 8, 7:20 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm triaged T345892: Reduction of Secret-based Service Account Tokens as Low priority.
Fri, Sep 8, 7:15 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm closed T341042: Unmanaged envoyproxy installation on wdqs1009 and wdqs1010 as Resolved.

@JMeybohm these hosts have been reimaged, are you still seeing their envoy proxy as unmanaged?
Envoy configuration is included here if it helps.

Fri, Sep 8, 6:27 AM · Discovery-Search (Current work), Data-Platform-SRE

Thu, Sep 7

JMeybohm committed rOSHR8c2217f54b9c: Update to kubernetes client-go 0.23.14 (authored by JMeybohm).
Update to kubernetes client-go 0.23.14
Thu, Sep 7, 7:45 PM
JMeybohm triaged T345823: Wikikube staging clusters are out of IPv4 Pod IP's as High priority.
Thu, Sep 7, 11:46 AM · Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T345738: etcd in codfw burned all latency SLO error budget.

Hi

TL;DR

cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bump their kernel version.

I'm ok to disable cadvisor there, though I gotta ask what are the plans for conf2* upgrades and/or reboot ?

Thu, Sep 7, 8:59 AM · Patch-For-Review, SRE, Infrastructure-Foundations, serviceops

Wed, Sep 6

JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Wed, Sep 6, 9:57 AM · serviceops, Prod-Kubernetes
JMeybohm triaged T345709: Setup kubernetes20[25-53] as Medium priority.
Wed, Sep 6, 9:39 AM · serviceops
JMeybohm moved T345709: Setup kubernetes20[25-53] from Incoming 🐫 to ⎈Kubernetes on the serviceops board.
Wed, Sep 6, 9:39 AM · serviceops
JMeybohm created T345709: Setup kubernetes20[25-53].
Wed, Sep 6, 9:39 AM · serviceops

Tue, Sep 5

JMeybohm added a comment to T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).

I put together a small Go tool to validate some or all tokens against the provided certificates (one or many). I did not see any other way of checking which token is signed by which key, and we need to make sure all tokens are signed by a PKI key before we remove the cergen cert from the list for validation. For anybody interested, the public certs of all clusters (cergen and PKI) can be found at deploy1002:/home/jayme/kube-apiserver-sa/certs/ (a compiled version of the below is at deploy1002:/home/jayme/kube-apiserver-sa/k8s-jwt-validator).

package main

import (
	"bufio"
	"context"
	"flag"
	"fmt"
	"log"
	"os"

	"crypto/x509"
	"encoding/pem"

	jwt "github.com/golang-jwt/jwt/v5"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

var (
	kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file (optional)")
)

// loadX509Cert reads the first PEM block from the given file and parses it
// as an X.509 certificate.
func loadX509Cert(x509CertificatePath string) (*x509.Certificate, error) {
	certData, err := os.ReadFile(x509CertificatePath)
	if err != nil {
		return nil, err
	}

	block, _ := pem.Decode(certData)
	if block == nil {
		return nil, fmt.Errorf("failed to parse certificate PEM")
	}

	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return nil, err
	}

	return cert, nil
}

// parseJWT parses the token and verifies its signature against publicKey.
func parseJWT(jwtToken string, publicKey any) (*jwt.Token, error) {
	token, err := jwt.Parse(jwtToken, func(token *jwt.Token) (any, error) {
		return publicKey, nil
	})
	if err != nil {
		return nil, err
	}

	return token, nil
}

func main() {
	// Maps a display name ("namespace/secret" or "stdin") to the raw token.
	tokensToValidate := make(map[string]string)

	flag.Parse()
	// All positional arguments are paths to certificates to validate against.
	x509CertPaths := flag.Args()
	if len(x509CertPaths) == 0 {
		log.Fatalln("No x.509 certificate for validation provided")
	}
	x509Certs := make([]*x509.Certificate, len(x509CertPaths))
	for idx, certPath := range x509CertPaths {
		cert, err := loadX509Cert(certPath)
		if err != nil {
			log.Fatalf("Error loading X.509 certificate %s: %v", certPath, err)
		}
		x509Certs[idx] = cert
	}

	if *kubeconfig != "" {
		// Fetch secrets from k8s if kubeconfig is provided
		config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
		if err != nil {
			panic(err)
		}
		clientset, err := kubernetes.NewForConfig(config)
		if err != nil {
			panic(err)
		}
		// List all secrets of type "kubernetes.io/service-account-token"
		secrets, err := clientset.CoreV1().Secrets("").List(context.Background(), metav1.ListOptions{
			FieldSelector: "type=kubernetes.io/service-account-token",
		})
		if err != nil {
			log.Fatalf("Error listing secrets: %v", err)
		}

		for _, secret := range secrets.Items {
			name := fmt.Sprintf("%s/%s", secret.Namespace, secret.Name)
			token, found := secret.Data["token"]
			if !found {
				log.Fatalf("Token field not found in the secret %s", name)
			}
			tokensToValidate[name] = string(token)
		}
	} else {
		// Without kubeconfig, expect a token from stdin
		stdinScanner := bufio.NewScanner(os.Stdin)
		stdinScanner.Scan()
		tokensToValidate["stdin"] = stdinScanner.Text()
	}

	// Try every certificate for every token and stop at the first match.
	// validationError holds the result of the last attempt for the token.
	var validationError error
	var anyValidationFailed bool
	for tokenName, tokenString := range tokensToValidate {
		for certIdx, cert := range x509Certs {
			_, validationError = parseJWT(tokenString, cert.PublicKey)
			if validationError == nil {
				fmt.Printf("%s validated with: %s\n", tokenName, x509CertPaths[certIdx])
				break
			}
		}
		if validationError != nil {
			anyValidationFailed = true
			fmt.Printf("%s validation FAILED: %v\n", tokenName, validationError)
		}
	}

	// Exit non-zero if at least one token matched none of the certificates.
	if anyValidationFailed {
		os.Exit(1)
	}
}

Tue, Sep 5, 10:44 AM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created P52248 k8s-jwt-validator.go.
Tue, Sep 5, 10:36 AM

Mon, Sep 4

JMeybohm added a comment to T342534: Q1:rack/setup/install kubernetes20[25-53].

[...]
so 2025 and 2026 nodes had 2 roles, insetup and kubernetes::worker roles

Mon, Sep 4, 11:03 AM · SRE, ops-codfw, serviceops, DC-Ops
JMeybohm closed T341669: Allow for multiple confd instances in puppet as Resolved.

Thanks to @jbond's refactor this is now resolved (again).

Mon, Sep 4, 10:32 AM · serviceops
JMeybohm closed T341669: Allow for multiple confd instances in puppet, a subtask of T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen), as Resolved.
Mon, Sep 4, 10:32 AM · Foundational Technology Requests, Kubernetes, Prod-Kubernetes, serviceops

Fri, Sep 1

JMeybohm closed T344253: jaeger is configured to receive traces from production as Resolved.
Fri, Sep 1, 2:32 PM · serviceops, User-fgiunchedi, Observability-Tracing
JMeybohm closed T344253: jaeger is configured to receive traces from production, a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
Fri, Sep 1, 2:32 PM · Epic, Observability-Tracing
JMeybohm updated the task description for T344253: jaeger is configured to receive traces from production.
Fri, Sep 1, 1:20 PM · serviceops, User-fgiunchedi, Observability-Tracing

Thu, Aug 31

JMeybohm added a project to T344230: Get aux-k8s cluster row-redundant and with more workers: Infrastructure-Foundations.
Thu, Aug 31, 1:15 PM · Infrastructure-Foundations, Observability-Tracing, Epic
JMeybohm updated subscribers of T344230: Get aux-k8s cluster row-redundant and with more workers.
Thu, Aug 31, 1:14 PM · Infrastructure-Foundations, Observability-Tracing, Epic
JMeybohm updated the task description for T344253: jaeger is configured to receive traces from production.
Thu, Aug 31, 12:51 PM · serviceops, User-fgiunchedi, Observability-Tracing
JMeybohm updated the task description for T344253: jaeger is configured to receive traces from production.
Thu, Aug 31, 12:49 PM · serviceops, User-fgiunchedi, Observability-Tracing
JMeybohm closed T325178: Add ingress to aux-k8s as Resolved.

Did the LVS dance, curl -4 --resolve jaeger-query.discovery.wmnet:30443:$(dig +short k8s-ingress-aux.svc.eqiad.wmnet) https://jaeger-query.discovery.wmnet:30443 does now work as expected and probes come through.

Thu, Aug 31, 12:49 PM · serviceops, Observability-Tracing
JMeybohm closed T325178: Add ingress to aux-k8s, a subtask of T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services, as Resolved.
Thu, Aug 31, 12:48 PM · Observability-Tracing
JMeybohm updated the task description for T325178: Add ingress to aux-k8s.
Thu, Aug 31, 12:45 PM · serviceops, Observability-Tracing
JMeybohm claimed T325178: Add ingress to aux-k8s.
Thu, Aug 31, 11:51 AM · serviceops, Observability-Tracing
JMeybohm raised the priority of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21 from Low to High.
Thu, Aug 31, 8:57 AM · serviceops, Prod-Kubernetes

Wed, Aug 30

JMeybohm added a comment to T344253: jaeger is configured to receive traces from production.

Before we can complete T343302: otel collector is configured to send traces to jaeger we need to get jaeger collector TCP ports (4317 for grpc and 4318 for http) exposed on the production network.

Wed, Aug 30, 1:00 PM · serviceops, User-fgiunchedi, Observability-Tracing

Mon, Aug 28

JMeybohm added a comment to T344998: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http'.

Thanks @elukey for stepping in!

I did deploy those changes last week and the tests described in https://wikitech.wikimedia.org/wiki/Wikifunctions/Runbook were okay, so I was confident that I did not break anything.

Yeah, sorry about this. In this case the problem was not apparent when making curl requests from the deployment server, but was when coming from the MW cluster; presumably there are extra rights bundled along with those somehow?

No. I think the code is doing something different when called from mediawiki instead of curl. The issue clearly is that with the curl command the orchestrator does not call back to mw-api.

Mon, Aug 28, 2:25 PM · Patch-For-Review, serviceops, Wikimedia-production-error, Abstract Wikipedia team, Wikifunctions
JMeybohm lowered the priority of T344998: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http' from Unbreak Now! to High.

Thanks @elukey for stepping in!

Mon, Aug 28, 8:22 AM · Patch-For-Review, serviceops, Wikimedia-production-error, Abstract Wikipedia team, Wikifunctions
JMeybohm removed a project from T344998: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http': SRE.
Mon, Aug 28, 8:20 AM · Patch-For-Review, serviceops, Wikimedia-production-error, Abstract Wikipedia team, Wikifunctions

Fri, Aug 25

JMeybohm committed rLPRI8b0812182de5: PKI: Rename aux key to match the naming scheme of everything else (authored by JMeybohm).
PKI: Rename aux key to match the naming scheme of everything else
Fri, Aug 25, 8:34 AM