
Feb 26 2024

JMeybohm created T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).
Feb 26 2024, 1:56 PM · SRE, serviceops
JMeybohm added a comment to T357380: Degraded RAID on mw2442.

After the reboot, you could still have made the new virtual drive with the last of those lines:

megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
Feb 26 2024, 11:20 AM · serviceops, ops-codfw
JMeybohm updated subscribers of T357380: Degraded RAID on mw2442.

@MatthewVernon pointed out (thanks) that this could have helped (if done before the reboot obviously):

Feb 26 2024, 11:16 AM · serviceops, ops-codfw
JMeybohm added a comment to T357380: Degraded RAID on mw2442.

The new disk was not detected by the host, even after scsi scan (maybe that's not a thing anymore? ;))
Anyhow. I rebooted the node and it did not come back up. Power-cycling again with console attached showed the following prompt:

Feb 26 2024, 10:58 AM · serviceops, ops-codfw

Feb 23 2024

JMeybohm added a comment to T335177: docker-pkg fails to upload big Docker images to the registry.

reference T288198: Pushes to docker-registry fail for images with compressed layers of size >1GB for posterity

Feb 23 2024, 4:16 PM · Machine-Learning-Team, serviceops
JMeybohm updated subscribers of T356303: Review wikitech:Search and write processes for k8s world.

Yesterday on IRC the question was raised:

Feb 23 2024, 9:52 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Documentation, Discovery-Search (Current work)

Feb 22 2024

JMeybohm created T358189: aux-k8s cluster prometheus setup is incomplete.
Feb 22 2024, 9:24 AM · Infrastructure-Foundations, Observability-Tracing

Feb 20 2024

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

The rdk:broker2005 references one of the threads (I do see two) handling the connection to kafka-logging2005 (according to https://github.com/confluentinc/librdkafka/issues/2489#issuecomment-523824657)

Feb 20 2024, 3:09 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

I've created https://github.com/prometheus-community/rsyslog_exporter/pull/12 so we can collect kafka stats from rsyslogd as everything points into that direction currently

Feb 20 2024, 3:04 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

Another case of missing container logs from mw2434. @Clement_Goubert restarted rsyslogd, which was probably in a bad state:

Feb 20 10:55:10 mw2434 rsyslogd: action 'fwd_centrallog2002.codfw.wmnet:6514' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2208.0 try https://www.rsyslog.com/e/2007 ]
Feb 20 10:55:10 mw2434 rsyslogd: nsd_ossl: TLS Connection initiated with remote syslog server. [v8.2208.0]
Feb 20 10:55:10 mw2434 rsyslogd: nsd_ossl: Information, no shared curve between syslog client and server [v8.2208.0]
Feb 20 10:55:10 mw2434 rsyslogd: action 'fwd_centrallog2002.codfw.wmnet:6514' resumed (module 'builtin:omfwd') [v8.2208.0 try https://www.rsyslog.com/e/2359 ]
Feb 20 12:24:57 mw2434 rsyslogd: [origin software="rsyslogd" swVersion="8.2208.0" x-pid="3023245" x-info="https://www.rsyslog.com"] exiting on signal 15.
Feb 20 12:26:27 mw2434 systemd[1]: rsyslog.service: State 'stop-sigterm' timed out. Killing.
Feb 20 12:26:27 mw2434 systemd[1]: rsyslog.service: Killing process 3023245 (rsyslogd) with signal SIGKILL.
Feb 20 12:26:27 mw2434 systemd[1]: rsyslog.service: Killing process 2429765 (rdk:broker2005) with signal SIGKILL.
Feb 20 12:26:27 mw2434 systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
Feb 20 12:26:27 mw2434 systemd[1]: rsyslog.service: Failed with result 'timeout'.
Feb 20 12:26:27 mw2434 systemd[1]: rsyslog.service: Consumed 53min 41.497s CPU time.
Feb 20 12:26:27 mw2434 rsyslogd: lookup table 'output_lookup' loaded from file '/etc/rsyslog.lookup.d/lookup_table_output.json' [v8.2208.0 try https://www.rsyslog.com/e/0 ]
Feb 20 12:26:27 mw2434 rsyslogd: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd.  [v8.2208.0]
Feb 20 12:26:27 mw2434 rsyslogd: [origin software="rsyslogd" swVersion="8.2208.0" x-pid="1326944" x-info="https://www.rsyslog.com"] start
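As a rough aid for spotting forwarding actions that stay suspended, lines like the ones above could be scanned for `suspended`/`resumed` events. This is only a sketch: the regular expression is an assumption derived from the excerpt shown here, and the function names are made up for illustration.

```python
import re

# Matches rsyslogd "action '<name>' suspended/resumed" messages as they appear
# in the excerpt above. The pattern is an assumption based on those lines only.
ACTION_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) rsyslogd: "
    r"action '(?P<action>[^']+)' (?P<state>suspended|resumed)\b"
)


def action_events(lines):
    """Return (timestamp, host, action, state) tuples for matching lines."""
    events = []
    for line in lines:
        m = ACTION_RE.match(line)
        if m:
            events.append((m["ts"], m["host"], m["action"], m["state"]))
    return events


def last_state(events):
    """Map (host, action) to the most recent state, in input order."""
    state = {}
    for _ts, host, action, st in events:
        state[(host, action)] = st
    return state


# Shortened copies of three lines from the excerpt above.
sample = [
    "Feb 20 10:55:10 mw2434 rsyslogd: action 'fwd_centrallog2002.codfw.wmnet:6514' suspended (module 'builtin:omfwd'), retry 0.",
    "Feb 20 10:55:10 mw2434 rsyslogd: nsd_ossl: TLS Connection initiated with remote syslog server. [v8.2208.0]",
    "Feb 20 10:55:10 mw2434 rsyslogd: action 'fwd_centrallog2002.codfw.wmnet:6514' resumed (module 'builtin:omfwd') [v8.2208.0]",
]
events = action_events(sample)
print(last_state(events))
```

An action whose last recorded state is `suspended` would be a candidate for a closer look. It says nothing about the kafka output, which the excerpt does not log in this form.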
Feb 20 2024, 2:47 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

@colewhite @JMeybohm @Clement_Goubert I think we could mark this resolved, unless you want to investigate it further.

Feb 20 2024, 2:09 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm renamed T357616: Logs from containers sometimes not visible in logstash from Logs from ipoid-production-daily-updates-28465260-qbbz9 are not visible in Logstash after Feb 14, 2024 @ 23:40:08.183 to Logs from containers sometimes not visible in logstash.
Feb 20 2024, 2:07 PM · Patch-For-Review, Observability-Logging, serviceops

Feb 18 2024

JMeybohm added a comment to T357380: Degraded RAID on mw2442.

Sorry for the late reply. I'm not sure what you're asking tbh. As I understand it the disk most likely broke, so the "replace the PD" option would be the way to go here.

Feb 18 2024, 11:15 AM · serviceops, ops-codfw
JMeybohm edited projects for T357380: Degraded RAID on mw2442, added: serviceops; removed SRE.
Feb 18 2024, 11:15 AM · serviceops, ops-codfw

Feb 16 2024

JMeybohm added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I'm not super convinced of 4... writing ValidatingAdmissionPolicies is quite complex and there are so many corner cases. I tried implementing the first restrictions from the baseline profile (I think we need around 10) and it's already huge:

Feb 16 2024, 3:24 PM · Patch-For-Review, serviceops, Prod-Kubernetes

Feb 15 2024

JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

There were two logs found in the dead-letter queue with "ipoid" in the name, but they were destined for the legacy indexes with "cert-manager" in the namespace.

Yeah, those two are from cert-manager itself and only related to issuing certificates for ipoid, that's why that name comes up. So not related to the missing ipoid logs.

Feb 15 2024, 4:06 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a comment to T357616: Logs from containers sometimes not visible in logstash.

Yeah, although those are from a different index (logstash-* vs. ecs-*). Not sure if that makes any difference

Feb 15 2024, 3:11 PM · Patch-For-Review, Observability-Logging, serviceops
JMeybohm added a project to T357616: Logs from containers sometimes not visible in logstash: Observability-Logging.

I tried to find a culprit but could not find anything obvious. The only thing I see are, probably unrelated, rsyslogd errors from omfwd plugin fleet wide:

Feb 15 11:35:15 mw1460 rsyslogd: nsd_ossl: Information, no shared curve between syslog client and server [v8.2208.0]
Feb 15 11:35:15 mw1460 rsyslogd: nsd_ossl: TLS Connection initiated with remote syslog server. [v8.2208.0]
Feb 15 11:35:15 mw1460 rsyslogd: nsd_ossl: TLS session terminated successfully to remote syslog server 'centrallog1002.eqiad.wmnet' with SSL Error '-1': End Session [v8.2208.0]
Feb 15 11:35:15 mw1460 rsyslogd: omfwd: TCPSendBuf error -1, destruct TCP Connection to centrallog1002.eqiad.wmnet:6514 [v8.2208.0]
Feb 15 11:35:15 mw1460 rsyslogd: SSL_ERROR_SYSCALL Error in 'Send': 'error:00000005:lib(0):func(0):DH lib(5)' with ret=-1, errno=32, sslapi='SSL_write'  [v8.2208.0]
Feb 15 12:15:17 mw1460 rsyslogd: action 'fwd_centrallog1002.eqiad.wmnet:6514' resumed (module 'builtin:omfwd') [v8.2208.0 try https://www.rsyslog.com/e/2359 ]
Feb 15 12:15:17 mw1460 rsyslogd: action 'fwd_centrallog1002.eqiad.wmnet:6514' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2208.0 try https://www.rsyslog.com/e/2007 ]
Feb 15 2024, 2:28 PM · Patch-For-Review, Observability-Logging, serviceops

Feb 9 2024

JMeybohm added a project to T357145: Consider moving to haproxy ingress for Thumbor workers: Kubernetes.

I would actually love if we could try to reproduce what we do with haproxy with istio ingressgateway before introducing another ingress controller (but tbh I did not check at all if that is feasible)

Feb 9 2024, 4:06 PM · Kubernetes, serviceops, Thumbor

Feb 7 2024

JMeybohm added a comment to T356787: The label named state on node_systemd_service_restart_total metrics was changed to name.

It's probably also worth noting here that the crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. So even if the SystemdUnitCrashLoop alert does not work on buster hosts we won't be losing functionality compared to the nrpe check.

Feb 7 2024, 11:08 AM · User-fgiunchedi, Observability-Alerting

Feb 6 2024

JMeybohm created T356787: The label named state on node_systemd_service_restart_total metrics was changed to name.
Feb 6 2024, 4:32 PM · User-fgiunchedi, Observability-Alerting

Jan 15 2024

JMeybohm added a comment to T354049: Requesting access to <restricted> for Arthur Taylor.

How / where was this account created? ldapsearch -xxx cn="Arthur Taylor" says cn and sn are Arthur taylor instead (different capitalization).

Jan 15 2024, 12:20 PM · User-ItamarWMDE, SRE, SRE-Access-Requests
JMeybohm added a comment to T354276: Grant Access to wmde, nda for Dima Koushha.

Hi all, please provide Dima Koushha's WMDE email address to kfrancis@wikimedia.org and I'll prepare the NDA. Thanks!

Jan 15 2024, 9:24 AM · SRE, LDAP-Access-Requests

Jan 12 2024

JMeybohm claimed T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I would personally try to spend some time understanding the ValidatingAdmissionPolicy feature before starting the big job of moving all our clusters to OPA Gatekeeper.

Jan 12 2024, 5:15 PM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm added a comment to T354532: Limit the concurrency of envoy in service mesh.

Another metric where we saw a significant change with different limits/concurrency is envoy_cluster_upstream_cx_connect_ms (connection establishment), all while constantly serving ~100-200 req/s:

Screenshot_20240112_092336.png (397×677 px, 103 KB)

Jan 12 2024, 8:32 AM · Kubernetes, Prod-Kubernetes, serviceops

Jan 11 2024

JMeybohm updated the task description for T354853: Service mesh envoy does not treat incoming connections as local.
Jan 11 2024, 12:53 PM · serviceops
JMeybohm updated subscribers of T354853: Service mesh envoy does not treat incoming connections as local.
Jan 11 2024, 12:50 PM · serviceops
JMeybohm created T354853: Service mesh envoy does not treat incoming connections as local.
Jan 11 2024, 12:50 PM · serviceops
JMeybohm closed T354604: Investigate prometheus@k8s metric/label cardinality reduction as Resolved.

With the fixed patch, head series were reduced and id is no longer the top cardinality label. I think we can resolve this.

Jan 11 2024, 12:27 PM · Kubernetes, Observability-Metrics
JMeybohm closed T354604: Investigate prometheus@k8s metric/label cardinality reduction, a subtask of T354399: Prometheus @ k8s OOM loop, as Resolved.
Jan 11 2024, 12:26 PM · User-fgiunchedi, Observability-Metrics

Jan 10 2024

JMeybohm added a project to T352893: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup: Kubernetes.
Jan 10 2024, 3:57 PM · Kubernetes, Prod-Kubernetes, serviceops, netops, Infrastructure-Foundations, SRE

Jan 9 2024

JMeybohm added a comment to T354604: Investigate prometheus@k8s metric/label cardinality reduction.

While this did cut the cardinality for id in half it unfortunately did not really make any difference in terms of memory usage or appended samples per second (which I had expected). OTOH I would have also expected the cardinality to drop sharply, as there are only two other metrics (apart from the cadvisor stuff) that use the "id" label:
https://prometheus-eqiad.wikimedia.org/k8s/classic/graph?g0.range_input=1h&g0.expr=group%20(%7Bid!%3D%22%22%2C%20job!%3D%22k8s-node-cadvisor%22%7D)%20by%20(__name__%2C%20job)&g0.tab=1

Jan 9 2024, 3:14 PM · Kubernetes, Observability-Metrics
JMeybohm added a comment to T354604: Investigate prometheus@k8s metric/label cardinality reduction.

Highest cardinality label is id which is used heavily by cadvisor and contains the slice id of the container (like: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod143a77d4_c340_47e0_966c_ae0c06b977b4.slice/docker-b1d6065dd088bb283a27edc9dd9478d0d61bac1ad7846e740f9b56ee80c4b7d5.scope). I've skimmed grafana and I don't think we use that label anywhere in k8s context. Metrics are usually matched by name (container) and pod_name
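To illustrate what "highest cardinality label" means here, the following sketch counts distinct values per label name over a handful of series. The series are made up for illustration (shortened, hypothetical `id` values) and the helper name is an assumption; this is not how Prometheus itself reports cardinality.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name across a list of label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical series, invented for this example only.
series = [
    {"__name__": "container_cpu_usage_seconds_total", "container": "app",
     "pod_name": "a", "id": "/kubepods.slice/x/docker-1.scope"},
    {"__name__": "container_cpu_usage_seconds_total", "container": "app",
     "pod_name": "b", "id": "/kubepods.slice/y/docker-2.scope"},
    {"__name__": "container_memory_rss", "container": "app",
     "pod_name": "a", "id": "/kubepods.slice/x/docker-1.scope"},
    {"__name__": "container_memory_rss", "container": "sidecar",
     "pod_name": "a", "id": "/kubepods.slice/x/docker-3.scope"},
]

card = label_cardinality(series)
top = max(card, key=card.get)  # label name with the most distinct values
print(top, card[top])
```

Because every container gets its own slice id, `id` tends to accumulate more distinct values than labels such as the container name or `pod_name`, which is why it comes out on top in this toy data as well.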

Jan 9 2024, 9:50 AM · Kubernetes, Observability-Metrics

Jan 8 2024

JMeybohm added a comment to T354532: Limit the concurrency of envoy in service mesh.

It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?

Jan 8 2024, 3:12 PM · Kubernetes, Prod-Kubernetes, serviceops
JMeybohm created T354532: Limit the concurrency of envoy in service mesh.
Jan 8 2024, 2:47 PM · Kubernetes, Prod-Kubernetes, serviceops

Jan 5 2024

JMeybohm added a comment to T353460: The consumer job of the SUP does not achieve its expected throughput.

@pfischer sent me here with the results from your consumer-devnull tests. We have not done extensive testing with this but it might help a lot to reduce the concurrency for envoy (which defaults to the number of CPUs on the node; https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#envoy).

Jan 5 2024, 4:46 PM · Discovery-Search (Current work), CirrusSearch
JMeybohm closed T353314: kube-apiserver and kubelet HTTPS certificates have the default validity (672h) in staging as Resolved.

This is now fixed by approach #1 and certs with 72h expiry have been issued on kubestagemaster1001.eqiad.wmnet, all other staging masters will follow throughout January.

Jan 5 2024, 1:48 PM · serviceops, Kubernetes

Dec 21 2023

JMeybohm added a comment to T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).

We want charts to explicitly define the services/endpoints/datastores they want to connect to, so a GlobalNetworkPolicy would be too broad.

Dec 21 2023, 2:34 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm added a comment to T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies).

Cool. I think we could/should deploy this via admin_ng to have deployment restricted to root users and also not have to mess with multiple different credentials for each namespace.

Dec 21 2023, 1:21 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Dec 19 2023

JMeybohm closed T353463: Alert on calico components being down as Resolved.
Dec 19 2023, 2:23 PM · serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T353463: Alert on calico components being down, a subtask of T353233: Outage of wikikube codfw apiservers, as Resolved.
Dec 19 2023, 2:23 PM · Kubernetes, serviceops
JMeybohm added a comment to T287491: Allow to address Kubernetes API servers from NetworkPolicy.

kube-state-metrics successfully introduced the pattern of using a calico networkpolicy with service selector to match masters in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/974158

Dec 19 2023, 11:53 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

Forgive me for the drive-by comment, but would it be possible to create high IOPS tiers for Ganeti (RAID-0?) I'd recommend deploying in conjunction with non-DRBD VMs for services that have their own HA (such as Kubernetes control plane). I bring it up as I feel like Ganeti is an underused resource, and using it helps to avoid some of the management overhead associated with physical machines.

Dec 19 2023, 11:43 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T353233: Outage of wikikube codfw apiservers as Resolved.

Resolving this as the immediate problem is resolved and remaining follow-ups have their own tasks

Dec 19 2023, 11:40 AM · Kubernetes, serviceops
JMeybohm added a comment to T348284: Handle sidecar containers in one-off Kubernetes jobs.

Oh, I misunderstood what you meant by "enable the controller on a per namespace level" above! I thought deploying one instance per namespace was what you had in mind.

Dec 19 2023, 10:51 AM · MW-on-K8s, serviceops

Dec 18 2023

JMeybohm added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

Monday, January 1st, New Year's Day (Americas)
Monday, April 22nd, Earth Day (Americas)

Dec 18 2023, 4:32 PM · SRE Observability (FY2023/2024-Q4)
JMeybohm added a comment to T353464: Migrate wikikube control planes to hardware nodes.

I am not so sure we actually do scratch that memory limit now. Looking at kubemaster2001 last week

Dec 18 2023, 1:58 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a comment to T348284: Handle sidecar containers in one-off Kubernetes jobs.

Bummer...the change you proposed would require us to deploy one sidecar-controller per namespace (probably this is the yak you are looking for :-)) - which I don't think is ideal in terms of resource usage and deployment complexity.

Dec 18 2023, 10:51 AM · MW-on-K8s, serviceops

Dec 15 2023

JMeybohm added a subtask for T353233: Outage of wikikube codfw apiservers: T353464: Migrate wikikube control planes to hardware nodes.
Dec 15 2023, 2:59 PM · Kubernetes, serviceops
JMeybohm added a parent task for T353464: Migrate wikikube control planes to hardware nodes: T353233: Outage of wikikube codfw apiservers.
Dec 15 2023, 2:59 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm added a subtask for T353233: Outage of wikikube codfw apiservers: T353463: Alert on calico components being down.
Dec 15 2023, 2:59 PM · Kubernetes, serviceops
JMeybohm added a parent task for T353463: Alert on calico components being down: T353233: Outage of wikikube codfw apiservers.
Dec 15 2023, 2:59 PM · serviceops, Prod-Kubernetes, Kubernetes

Dec 14 2023

JMeybohm renamed T353464: Migrate wikikube control planes to hardware nodes from Migtate wikikube control planes to hardware nodes to Migrate wikikube control planes to hardware nodes.
Dec 14 2023, 4:11 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm triaged T353464: Migrate wikikube control planes to hardware nodes as Medium priority.
Dec 14 2023, 4:09 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm triaged T353463: Alert on calico components being down as High priority.
Dec 14 2023, 3:59 PM · serviceops, Prod-Kubernetes, Kubernetes

Dec 13 2023

JMeybohm updated the task description for T353314: kube-apiserver and kubelet HTTPS certificates have the default validity (672h) in staging.
Dec 13 2023, 3:57 PM · serviceops, Kubernetes
JMeybohm renamed T353314: kube-apiserver and kubelet HTTPS certificates have the default validity (672h) in staging from kube-apiserver ad kubelet certificates have the default validity (672h) in staging to kube-apiserver and kubelet HTTPS certificates have the default validity (672h) in staging.
Dec 13 2023, 10:58 AM · serviceops, Kubernetes
JMeybohm triaged T353314: kube-apiserver and kubelet HTTPS certificates have the default validity (672h) in staging as Medium priority.
Dec 13 2023, 10:58 AM · serviceops, Kubernetes

Dec 12 2023

JMeybohm removed a project from T353224: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad: serviceops.

Sweet! Untagging us

Dec 12 2023, 5:26 PM · Discovery-Search (Current work), CirrusSearch
JMeybohm added a comment to T353224: Envoy telemetry not available for cirrus-streaming-updater@staging-eqiad.

I'm not sure why it works for the other two. Prometheus does have established tcp connections to pods from mw-p-c-c-e but I can't create new ones because there is no networkpolicy that allows ingress traffic on port 1667. Maybe this is because of a recent networkpolicy change (existing connections are not affected by policy changes).

Dec 12 2023, 4:32 PM · Discovery-Search (Current work), CirrusSearch
JMeybohm claimed T353233: Outage of wikikube codfw apiservers.
Dec 12 2023, 2:22 PM · Kubernetes, serviceops
JMeybohm updated the task description for T353233: Outage of wikikube codfw apiservers.
Dec 12 2023, 1:23 PM · Kubernetes, serviceops
JMeybohm renamed T353233: Outage of wikikube codfw apiservers from kubernetes2047 lost all pods (unhealthy) to Outage of wikikube codfw apiservers.
Dec 12 2023, 11:50 AM · Kubernetes, serviceops
JMeybohm renamed T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies) from Make more use of Calico network policy features to Improve how we address outside k8s infrastructure from within charts (e.g. network policies).
Dec 12 2023, 9:59 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Dec 11 2023

JMeybohm closed T300033: Use cert-manager for service-proxy certificate creation as Resolved.
Dec 11 2023, 5:39 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Dec 11 2023, 10:18 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm committed rLPRI99e1562863b7: kubernetes: Remove cergen certs from kubernetes secrets.
Dec 11 2023, 9:08 AM

Dec 9 2023

JMeybohm added a project to T353045: Validate the impact of a k8s upgrade on our Flink deployment: serviceops-radar.
Dec 9 2023, 11:05 AM · serviceops-radar, Data-Platform-SRE

Dec 8 2023

JMeybohm added a comment to T348284: Handle sidecar containers in one-off Kubernetes jobs.

The error arises because we don't allow regular deployers to create RBAC objects in the cluster. The solution depends a bit on whether we want to limit the access scope of the controller to particular namespaces or not. Personally I would say that we should come up with a limited access approach, e.g. we would need to be able to enable the controller on a per namespace level. For that to work you need to:

Dec 8 2023, 8:40 AM · MW-on-K8s, serviceops

Dec 7 2023

JMeybohm added a subtask for T349796: Move MediaWiki jobs to mw-on-k8s: T352906: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error.
Dec 7 2023, 9:20 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
JMeybohm added a parent task for T352906: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error: T349796: Move MediaWiki jobs to mw-on-k8s.
Dec 7 2023, 9:20 AM · serviceops, Discovery-Search (Current work), MW-on-K8s

Dec 6 2023

JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Dec 6 2023, 1:05 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Dec 5 2023

JMeybohm added a comment to T259875: spicerack.dnsdisc.Discovery should expose TTL.

I don't exactly recall tbh, but I would imagine I wanted something like this in one of the pool/depool/service-route cookbooks to store the TTL, lower it, change whatever and then reset the TTL to the value it had before. It's not ultimately required, just a nice to have I would say.
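The store/lower/change/reset pattern described above could look roughly like the sketch below. Everything in it is hypothetical: `FakeDiscoveryRecord` and `lowered_ttl` are invented for illustration and are not part of spicerack.dnsdisc.Discovery, which (per this task) does not expose the TTL.

```python
from contextlib import contextmanager


class FakeDiscoveryRecord:
    """Hypothetical stand-in for a discovery record; not the spicerack API."""

    def __init__(self, ttl):
        self.ttl = ttl


@contextmanager
def lowered_ttl(record, temporary_ttl):
    """Lower the TTL for the duration of the block, then restore it."""
    original = record.ttl          # store the current TTL
    record.ttl = temporary_ttl     # lower it before making changes
    try:
        yield record
    finally:
        record.ttl = original      # reset even if the change fails


record = FakeDiscoveryRecord(ttl=300)
with lowered_ttl(record, 10):
    # Pool/depool or route changes would happen here.
    ttl_during_change = record.ttl
ttl_after = record.ttl
print(ttl_during_change, ttl_after)
```

Using a context manager means the original TTL is restored even when the change inside the block raises, which is the main reason to read the value first instead of hard-coding it.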

Dec 5 2023, 9:47 AM · Infrastructure-Foundations, SRE-tools

Nov 30 2023

JMeybohm updated the task description for T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator.
Nov 30 2023, 9:52 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Wikidata, Wikidata-Query-Service

Nov 29 2023

JMeybohm closed T345853: Fail event on /dev/md/0:kubernetes2028 as Resolved.

This LGTM now

Nov 29 2023, 2:55 PM · serviceops, SRE

Nov 28 2023

JMeybohm added a comment to T311050: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() .

Sorry, I must have missed the message. Yes, IIRC that is the correct interpretation.

Nov 28 2023, 9:54 AM · SRE-tools, Infrastructure-Foundations, Spicerack

Nov 27 2023

JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Nov 27 2023, 12:53 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Nov 23 2023

JMeybohm added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

I/F called out that we might want to do T327938: Codfw row A/B top-of-rack switch refresh in the process of reimaging

Nov 23 2023, 5:40 PM · serviceops, MW-on-K8s
JMeybohm changed the status of T300033: Use cert-manager for service-proxy certificate creation from Stalled to In Progress.
Nov 23 2023, 10:13 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm updated the task description for T300033: Use cert-manager for service-proxy certificate creation.
Nov 23 2023, 10:13 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
JMeybohm closed T324130: Update API gateway to newer version of Envoy, a subtask of T300324: Upgrade Envoy to supported version, as Resolved.
Nov 23 2023, 10:13 AM · SRE, Patch-For-Review, Traffic, serviceops, envoy
JMeybohm closed T324130: Update API gateway to newer version of Envoy as Resolved.

this looks like it's done now

Nov 23 2023, 10:12 AM · serviceops, Platform Team Workboards (Platform Engineering Reliability), Core Platform Team Initiatives (API Gateway)
JMeybohm closed T324130: Update API gateway to newer version of Envoy, a subtask of T306043: <API Platform> API Gateway MVP , as Resolved.
Nov 23 2023, 10:12 AM · API Platform (API Gateway Roadmap), Epic, Foundational Technology Requests

Nov 22 2023

JMeybohm added a comment to T264625: Deploy kube-state-metrics.

KSM in staging-eqiad was in a half installed state (probably due to prematurely terminated helmfile/helm command): [...]

Huh, strange, thank you.

Other than that, it has been running in staging and main for a week and it looks good. OK to deploy it in the other clusters (ml-*, dse, aux) too?

Nov 22 2023, 12:32 PM · Prod-Kubernetes, serviceops, User-jijiki, Kubernetes

Nov 21 2023

JMeybohm edited P53668 (An Untitled Masterwork).
Nov 21 2023, 11:53 AM
JMeybohm updated Other Assignee for T351704: kubernetes2041.codfw.wmnet NotReady, removed: Papaul.
Nov 21 2023, 11:32 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm added a comment to T264625: Deploy kube-state-metrics.

KSM in staging-eqiad was in a half installed state (probably due to prematurely terminated helmfile/helm command):

Nov 21 2023, 11:23 AM · Prod-Kubernetes, serviceops, User-jijiki, Kubernetes
JMeybohm reassigned T351704: kubernetes2041.codfw.wmnet NotReady from JMeybohm to Papaul.
Nov 21 2023, 11:18 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm updated Other Assignee for T351704: kubernetes2041.codfw.wmnet NotReady, added: Papaul.

Hey DCOps, this looks suspiciously like a cable might have been pulled. Could you please take a look?

Nov 21 2023, 11:18 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T351704: kubernetes2041.codfw.wmnet NotReady.
Nov 21 2023, 11:11 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T351704: kubernetes2041.codfw.wmnet NotReady.
Nov 21 2023, 10:54 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T351704: kubernetes2041.codfw.wmnet NotReady.
Nov 21 2023, 10:52 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm triaged T351704: kubernetes2041.codfw.wmnet NotReady as High priority.
Nov 21 2023, 10:51 AM · SRE, ops-codfw, Prod-Kubernetes, serviceops
JMeybohm created P53668 (An Untitled Masterwork).
Nov 21 2023, 10:00 AM

Nov 20 2023

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Nov 20 2023, 10:15 AM · serviceops, MW-on-K8s
JMeybohm edited P53484 add_k8s_nodes.py.
Nov 20 2023, 10:15 AM · serviceops

Nov 17 2023

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Nov 17 2023, 2:39 PM · serviceops, MW-on-K8s
JMeybohm edited P53484 add_k8s_nodes.py.
Nov 17 2023, 10:16 AM · serviceops
JMeybohm edited P53484 add_k8s_nodes.py.
Nov 17 2023, 10:04 AM · serviceops

Nov 16 2023

JMeybohm updated the task description for T351074: Move servers from the appserver/api cluster to kubernetes.
Nov 16 2023, 2:27 PM · serviceops, MW-on-K8s