
BTullis (Ben)
Staff SRE

User Details

User Since
Jun 29 2021, 9:56 AM (250 w, 13 h)
Availability
Available
IRC Nick
btullis
LDAP User
Btullis
MediaWiki User
BTullis (WMF) [ Global Accounts ]

Recent Activity

Today

BTullis added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

@elukey - I think that you can consider cephosd[2001-2003].codfw.wmnet as lower risk for this work, so you can bring it forward.

Tue, Apr 14, 2:14 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
BTullis moved T423243: Upgrade Airflow to 2.11.2 from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Apr 14, 1:40 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis edited projects for T423243: Upgrade Airflow to 2.11.2, added: Data-Platform-SRE (2026-03-27 - 2026-04-17); removed Data-Platform-SRE.
Tue, Apr 14, 1:40 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis created T423248: Upgrade to Airflow 3.2.x.
Tue, Apr 14, 11:22 AM · Epic, Data-Platform-SRE
BTullis created T423243: Upgrade Airflow to 2.11.2.
Tue, Apr 14, 10:04 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)

Yesterday

BTullis updated the task description for T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Mon, Apr 13, 2:18 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis changed the status of T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform, a subtask of T408586: ☂️ OpenSearch on K8s: Ensure that our first tenant workload is ready for production ☂️, from Stalled to Open.
Mon, Apr 13, 1:54 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), OKR-Work
BTullis changed the status of T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform from Stalled to Open.
Mon, Apr 13, 1:54 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis claimed T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform.

Reassigning this back to myself.
There was a little confusion, but I'm confident that we have the metrics we need to proceed.

Mon, Apr 13, 1:47 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)

Thu, Apr 2

BTullis reassigned T418175: Create SLO for the opensearch-ipoid cluster that runs on our OpenSearch on K8s platform from BTullis to bking.

I'm reassigning this to you for now @bking - as I'll be out next week and you've been making solid progress on the telemetry work. Hope that's OK.

Thu, Apr 2, 4:47 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis closed T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team as Resolved.

Thanks. I've added all of those 7 users to the Airflow-DAGs project in GitLab, now.

Thu, Apr 2, 4:24 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

4 Kerberos principals created and welcome emails sent.

btullis@krb1002:~$ sudo manage_principals.py create wmf-ldlulisa --email_address=ldlulisa@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to ldlulisa@wikimedia.org
Thu, Apr 2, 2:51 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

You should also now be able to start configuring and testing your SSH access to production, as outlined here:
https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config

Thu, Apr 2, 2:44 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

I have now modified the airflow-platform-eng-ops group.

btullis@ldap-maint1001:~$ sudo modify-ldap-group airflow-platform-eng-ops
Searching in: dc=wikimedia,dc=org
      1 entry read
Searching in: ou=groups
      1 entry read
Search failed: No such object
No search results.
add: 0, rename: 0, modify: 1, delete: 0
Action? [yYqQvVebB*rsf+?] y
Done.

Checked that the changes were applied.

btullis@ldap-maint1001:~$ ldapsearch -x cn=airflow-platform-eng-ops|egrep '(wmf-ldlulisa|kmontalva-wmf|renilthomas|hshaikh|eenabulele|ptiwary|sg912)'
member: uid=sg912,ou=people,dc=wikimedia,dc=org
member: uid=eenabulele,ou=people,dc=wikimedia,dc=org
member: uid=hshaikh,ou=people,dc=wikimedia,dc=org
member: uid=kmontalva-wmf,ou=people,dc=wikimedia,dc=org
member: uid=ptiwary,ou=people,dc=wikimedia,dc=org
member: uid=renilthomas,ou=people,dc=wikimedia,dc=org
member: uid=wmf-ldlulisa,ou=people,dc=wikimedia,dc=org
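
The egrep check above prints the matching member lines, but an account that was not added would simply be absent from the output. Below is a minimal Python sketch of a stricter check, assuming the ldapsearch output has been captured as text. The function name missing_members is hypothetical and is not part of any existing tooling.

```python
import re

# Matches lines such as "member: uid=sg912,ou=people,dc=wikimedia,dc=org"
# and captures the uid value.
MEMBER_RE = re.compile(r"^member:\s*uid=([^,]+),", re.MULTILINE)


def missing_members(ldapsearch_output, expected_uids):
    """Return the expected uids that do not appear as group members (sorted)."""
    found = set(MEMBER_RE.findall(ldapsearch_output))
    return sorted(set(expected_uids) - found)


# Sample text copied from the ldapsearch output above.
sample = """\
member: uid=sg912,ou=people,dc=wikimedia,dc=org
member: uid=eenabulele,ou=people,dc=wikimedia,dc=org
member: uid=hshaikh,ou=people,dc=wikimedia,dc=org
member: uid=kmontalva-wmf,ou=people,dc=wikimedia,dc=org
member: uid=ptiwary,ou=people,dc=wikimedia,dc=org
member: uid=renilthomas,ou=people,dc=wikimedia,dc=org
member: uid=wmf-ldlulisa,ou=people,dc=wikimedia,dc=org
"""

expected = [
    "wmf-ldlulisa", "kmontalva-wmf", "renilthomas",
    "hshaikh", "eenabulele", "ptiwary", "sg912",
]

# An empty list means every expected account is a member of the group.
print(missing_members(sample, expected))
```

Returning the missing names, instead of printing the matches, makes a partial update obvious without counting lines by eye.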

You should now all have ops-level access on https://airflow-platform-eng.wikimedia.org

Thu, Apr 2, 2:33 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Thu, Apr 2, 2:19 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

This patch for the POSIX groups is ready to go, I believe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1267031

Thu, Apr 2, 1:29 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

I have run cross-validate-accounts for all new production access requests, with no issues detected.

btullis@ldap-maint1001:~$ cross-validate-accounts --username wmf-ldlulisa --uid 46469 --email ldlulisa@wikimedia.org --real-name "Luvo Dlulisa" --ssh-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHYyrHgfVH5J5ahLjEzuGEbP7Yq0afDvZUDNuKEuYf9J luvodlulisa@wmf3275" --kerberos
btullis@ldap-maint1001:~$ echo $?
0
Thu, Apr 2, 12:41 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T422066: Request for +2 rights on Deployment-chart Repository for Snwachukwu.

Great! Thanks @taavi

Thu, Apr 2, 12:25 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Gerrit-Privilege-Requests
BTullis added a comment to T422066: Request for +2 rights on Deployment-chart Repository for Snwachukwu.

Although, that being said, I fully support Sandra's having these rights (or I wouldn't have added her in the first place).

Thu, Apr 2, 11:53 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Gerrit-Privilege-Requests
BTullis added a comment to T422066: Request for +2 rights on Deployment-chart Repository for Snwachukwu.

Apologies. I had granted access for @Snwachukwu without following due process.
Namely: https://www.mediawiki.org/wiki/Gerrit/Privilege_policy

Thu, Apr 2, 11:52 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Gerrit-Privilege-Requests
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

I have validated all SSH keys via out-of-band communication channels.

Thu, Apr 2, 11:37 AM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Thu, Apr 2, 11:27 AM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise

Wed, Apr 1

BTullis reopened T420812: Degraded RAID on an-worker1213 as "Open".

Hi @VRiley-WMF - Apologies for the delay in getting back to you. We haven't had a chance to do the fiddly bit with this yet, so I will reopen it and assign it to myself.

Wed, Apr 1, 4:10 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE, DC-Ops, ops-eqiad
BTullis added a comment to T420053: Requesting access to analytics-privatedata-users for AWesterinen.

I have created the kerberos principal.

btullis@krb1002:~$ sudo manage_principals.py create andreawest --email_address=arwesterinen@gmail.com
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to arwesterinen@gmail.com

The email should contain the temporary password, along with instructions on how to reset it on first use.

Wed, Apr 1, 2:21 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE, SRE-Access-Requests
BTullis added a comment to T420053: Requesting access to analytics-privatedata-users for AWesterinen.

I believe that the problem is my two different accounts (I am unsure how I ended up with two). I use AWesterinen to log into Phabricator, but the analytics-privatedata-users is assuming andreawest. So, I am getting the error "Service access denied due to missing privileges."

Wed, Apr 1, 2:17 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE, SRE-Access-Requests
BTullis added a comment to T405509: Provide an access to MaxMind GeoIP in DSE K8S pods.

I believe that the GeoIP files may now be mounted by Airflow task pods.

Wed, Apr 1, 1:40 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Essential-Work, Data-Engineering
BTullis closed T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad, a subtask of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21, as Resolved.
Wed, Apr 1, 1:07 PM · ServiceOps new, Patch-For-Review, Prod-Kubernetes
BTullis closed T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad as Resolved.

I believe that this is now applying the correct pod security standards, so this can be closed.

btullis@deploy1003:~$ kubectl get namespaces mediawiki-dumps-legacy -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    meta.helm.sh/release-name: namespaces
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-12-19T17:18:06Z"
  labels:
    app: raw
    app.kubernetes.io/managed-by: Helm
    chart: raw-0.3.0
    heritage: Helm
    istio-injection: disabled
    kubernetes.io/metadata.name: mediawiki-dumps-legacy
    pod-security.wmf.org/allow-hostpath-geoip: include
    pod-security.wmf.org/disallow-capabilities-adding-capabilities: exclude
    pod-security.wmf.org/disallow-capabilities-except-ptrace: include
    pod-security.wmf.org/disallow-capabilities-strict-adding-capabilities-strict: exclude
    pod-security.wmf.org/disallow-host-path: exclude
    pod-security.wmf.org/profile: restricted
    pod-security.wmf.org/restrict-volume-types-restricted-volumes: exclude
    release: namespaces
  name: mediawiki-dumps-legacy
  resourceVersion: "1282738711"
  uid: ddb55dfc-e4b9-443e-85f0-4e504793b9aa
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
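
A hedged sketch of how the pod-security labels shown above could be checked programmatically. It assumes the namespace is fetched as JSON (kubectl get namespace mediawiki-dumps-legacy -o json) instead of YAML, so that only the Python standard library is needed. The function name pod_security_labels and the trimmed sample document are illustrative.

```python
import json

# Label prefix used by the pod-security labels in the namespace output above.
PREFIX = "pod-security.wmf.org/"


def pod_security_labels(namespace_json):
    """Return the pod-security labels of a Namespace, with the prefix removed."""
    labels = json.loads(namespace_json).get("metadata", {}).get("labels", {})
    return {
        key[len(PREFIX):]: value
        for key, value in labels.items()
        if key.startswith(PREFIX)
    }


# Trimmed sample built from the labels in the output above.
sample = json.dumps({
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "mediawiki-dumps-legacy",
        "labels": {
            "istio-injection": "disabled",
            "pod-security.wmf.org/allow-hostpath-geoip": "include",
            "pod-security.wmf.org/disallow-host-path": "exclude",
            "pod-security.wmf.org/profile": "restricted",
        },
    },
})

labels = pod_security_labels(sample)
for name, value in sorted(labels.items()):
    print(f"{name}: {value}")
```

Filtering on the label prefix keeps unrelated labels, such as istio-injection, out of the comparison.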

Please feel free to reopen if there is anything still amiss.

Wed, Apr 1, 1:07 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Wed, Apr 1, 12:54 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Wed, Apr 1, 12:53 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a member for WMF-NDA: RThomas-WMF.
Wed, Apr 1, 12:46 PM
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

I will manually add @RThomas-WMF to the wmf LDAP group. This is usually performed as self-service, as noted here: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access but on this occasion I will add his account manually, using the method described here: https://wikitech.wikimedia.org/wiki/SRE/LDAP#Method_1

btullis@ldap-maint1001:~$ sudo modify-ldap-group wmf
Searching in: dc=wikimedia,dc=org
      1 entry read
Searching in: ou=groups
      1 entry read
Search failed: No such object
No search results.
add: 0, rename: 0, modify: 1, delete: 0
Action? [yYqQvVebB*rsf+?] y
Done.
btullis@ldap-maint1001:~$ ldapsearch -x cn=wmf|grep renilthomas
member: uid=renilthomas,ou=people,dc=wikimedia,dc=org
btullis@ldap-maint1001:~$

I'll follow up with a puppet patch to track this, even before switching this to a full shell account.

Wed, Apr 1, 12:46 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

In the meantime, we will need an SSH key for those users who don't already have one for production shell access.
That's the following users:

Wed, Apr 1, 12:34 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated subscribers of T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

In terms of manager approvals, @HShaikh is the manager for all 7 of the other users listed, whereas @HShaikh reports to @lanebecker.

Wed, Apr 1, 12:14 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Wed, Apr 1, 12:03 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Wed, Apr 1, 11:28 AM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Wed, Apr 1, 11:21 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis added a project to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team: LDAP-Access-Requests.

OK, thanks for all of the input so far. I've got some clarification on the procedures from the SRE team and I have a way forward.
So far, we have only discussed the requirement to have ops-level access to https://airflow-platform-eng.wikimedia.org so that they can trigger DAG runs and re-run any failed tasks etc.
I can add the users to the required LDAP group to grant this privilege.

Wed, Apr 1, 11:14 AM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Wed, Apr 1, 8:52 AM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise

Tue, Mar 31

BTullis updated subscribers of T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.

I can also take the druid[1012-1013].eqiad.wmnet hosts.

Tue, Mar 31, 5:42 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis added a comment to T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.

dbstore1007 is shortly to be refreshed by dbstore1010 when T417948: Q3:rack/setup/install dbstore1010 is complete.

Tue, Mar 31, 5:41 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis updated the task description for T421714: Data platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Tue, Mar 31, 5:40 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Tue, Mar 31, 2:50 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

Agreed. I'm happy to decom this server. As per the original description, this drive bay keeps connecting and dropping. It costs us time each time that we try to re-add a data volume to Hadoop.

Tue, Mar 31, 1:30 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
BTullis renamed T421860: Requesting shell access and membership of the ops group for atsuko from Requesting shell access for atsuko to Requesting shell access and membership of the ops group for atsuko.
Tue, Mar 31, 12:40 PM · SRE, SRE-Access-Requests
BTullis updated subscribers of T421860: Requesting shell access and membership of the ops group for atsuko.

I approve membership of analytics-admin

Tue, Mar 31, 11:08 AM · SRE, SRE-Access-Requests
BTullis updated the task description for T421860: Requesting shell access and membership of the ops group for atsuko.
Tue, Mar 31, 11:04 AM · SRE, SRE-Access-Requests
BTullis updated the task description for T421860: Requesting shell access and membership of the ops group for atsuko.
Tue, Mar 31, 11:03 AM · SRE, SRE-Access-Requests
BTullis updated the task description for T421860: Requesting shell access and membership of the ops group for atsuko.
Tue, Mar 31, 10:47 AM · SRE, SRE-Access-Requests
BTullis closed T420053: Requesting access to analytics-privatedata-users for AWesterinen as Resolved.

I believe that this is all ready to go now. I'll resolve the ticket, but please feel free to let me know if you have any problems with your access @AWesterinen.

Tue, Mar 31, 10:30 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE, SRE-Access-Requests

Mon, Mar 30

BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

@LDlulisa-WMF, @RThomas-WMF, @E.Enabulele - I think that the next step in this will probably be for you to request membership of the wmf group in LDAP.
It seems strange that you weren't already added to this group during your onboarding, but perhaps this is related to the fact that you are primarily working on WME, so the authentication system will be a little different.

Mon, Mar 30, 4:45 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Mon, Mar 30, 3:23 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis updated the task description for T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Mon, Mar 30, 2:39 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis renamed T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team from Requesting Access to Data Engineering Airflow Instance for the WME team to Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Mon, Mar 30, 2:29 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis added a comment to T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.

I can pick this up and work with you on the details, to make sure that we get the level of access correct for your needs.
There is also a Slack thread around this request, which might be useful for reference.

Mon, Mar 30, 2:01 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis moved T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Mon, Mar 30, 12:05 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis claimed T421214: Requesting Ops level access to the 'platform_eng' Airflow Instance for the WME team.
Mon, Mar 30, 12:04 PM · LDAP-Access-Requests, Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests, Wikimedia Enterprise
BTullis closed T421241: an-druid1007 fails to compile in PCC as Resolved.

I believe that this should be fixed now. I noticed that there were quite a few out-of-date dummy keytabs for old hosts in the labs-private repo, so I carried out a one-off sync of the directory structure and filenames from the private repo to the labs-private repo.

Mon, Mar 30, 12:01 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis moved T421241: an-druid1007 fails to compile in PCC from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Mon, Mar 30, 11:52 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis claimed T421241: an-druid1007 fails to compile in PCC.
Mon, Mar 30, 11:52 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
BTullis edited projects for T421361: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values, added: Data-Platform-SRE (2026-03-27 - 2026-04-17); removed Data-Platform-SRE.
Mon, Mar 30, 11:00 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
BTullis merged task T421361: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values into T421362: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values.
Mon, Mar 30, 11:00 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
BTullis merged T421361: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values into T421362: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values.
Mon, Mar 30, 11:00 AM · Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
BTullis moved T337052: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager from Backlog - project to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Mon, Mar 30, 10:58 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering, Observability-Alerting
BTullis closed T337052: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager as Resolved.

I propose to resolve this ticket now.
The Data-Engineering team is still working on T415941: [EPIC] Move SystemD timer based jobs to Airflow to migrate the remaining systemd timer jobs to Airflow, but a large proportion of them have already moved, and the alert noise associated with those that remain is very low.

Mon, Mar 30, 10:58 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering, Observability-Alerting
BTullis closed T337052: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager, a subtask of T346438: [Epic] Review alerting strategy for Data Platform SRE, as Resolved.
Mon, Mar 30, 10:58 AM · Epic, Data-Platform-SRE, observability
BTullis added a parent task for T398073: Ensure DPE SRE can receive alerts for applications hosted in wikikube: T346438: [Epic] Review alerting strategy for Data Platform SRE.
Mon, Mar 30, 10:54 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Kubernetes, ServiceOps new, Essential-Work, SRE Observability (FY2025/2026-Q1)
BTullis added a parent task for T420264: Data Platform SRE paging alerts and on-call SRE response: T346438: [Epic] Review alerting strategy for Data Platform SRE.
Mon, Mar 30, 10:54 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE
BTullis added subtasks for T346438: [Epic] Review alerting strategy for Data Platform SRE: T398073: Ensure DPE SRE can receive alerts for applications hosted in wikikube, T420264: Data Platform SRE paging alerts and on-call SRE response.
Mon, Mar 30, 10:54 AM · Epic, Data-Platform-SRE, observability
BTullis closed T414970: Alert in need of triage: KubernetesAPIErrorRate as Resolved.

I'm resolving this ticket, since it is historical.

Mon, Mar 30, 9:18 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), sre-alert-triage

Thu, Mar 26

BTullis added a comment to T419820: Requesting access to analytics-admins for Jerrywang.

@BTullis - I see this is assigned to you. Do you need any assistance from Clinic Duty? (I see this is still waiting on an SSH public key, etc.)

Thu, Mar 26, 4:03 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE-Access-Requests
BTullis moved T405509: Provide an access to MaxMind GeoIP in DSE K8S pods from Blocked/Waiting to In Progress on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Thu, Mar 26, 3:10 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Essential-Work, Data-Engineering
BTullis updated the task description for T421362: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values.
Thu, Mar 26, 12:34 PM · Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
BTullis created P89939 (An Untitled Masterwork).
Thu, Mar 26, 12:32 PM
BTullis created T421362: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values.
Thu, Mar 26, 12:32 PM · Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
BTullis created T421361: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values.
Thu, Mar 26, 12:32 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes

Wed, Mar 25

BTullis added a comment to T398800: Dumps on Airflow are not using the dbstore servers due to etcd timeout.

How does this look, now that T416345 is finished and all of the dse-k8s-worker hosts are on 10 Gbps? Do you think that we can close it?

Wed, Mar 25, 11:50 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Essential-Work
BTullis moved T405509: Provide an access to MaxMind GeoIP in DSE K8S pods from Needs Review to Blocked/Waiting on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.

This is blocked on T414484: Upgrade DSE clusters to kubernetes 1.31 - since we need to use the ValidatingAdmissionPolicy for this.

Wed, Mar 25, 11:37 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Essential-Work, Data-Engineering

Tue, Mar 24

BTullis added a project to T420812: Degraded RAID on an-worker1213: Data-Platform-SRE (2026-03-06 - 2026-03-27).
Tue, Mar 24, 10:13 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE, DC-Ops, ops-eqiad
BTullis reassigned T419289: Ensure OpenSearch on k8s clusters can safely use envoy TLS termination from BTullis to bking.
Tue, Mar 24, 12:22 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review
BTullis updated the task description for T419289: Ensure OpenSearch on k8s clusters can safely use envoy TLS termination.
Tue, Mar 24, 12:21 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Patch-For-Review
BTullis closed T421048: request: make kafka-jumbo and kafka-test brokers reachable from wdqs eqiad test nodes, a subtask of T414782: Hypothesis WE2.5.1: setup triple stores in eqiad, as Resolved.
Tue, Mar 24, 12:08 PM · Wikidata Platform Team (Sprint 03 (2026/03/03)), OKR-Work, Wikidata, Epic, Wikidata-Query-Service
BTullis closed T421048: request: make kafka-jumbo and kafka-test brokers reachable from wdqs eqiad test nodes as Resolved.

I believe that this should be working now. Please let me know if it doesn't behave as you expect.

Tue, Mar 24, 12:08 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikidata Platform Team
BTullis added a subtask for T414782: Hypothesis WE2.5.1: setup triple stores in eqiad: T421048: request: make kafka-jumbo and kafka-test brokers reachable from wdqs eqiad test nodes.
Tue, Mar 24, 10:57 AM · Wikidata Platform Team (Sprint 03 (2026/03/03)), OKR-Work, Wikidata, Epic, Wikidata-Query-Service
BTullis added a parent task for T421048: request: make kafka-jumbo and kafka-test brokers reachable from wdqs eqiad test nodes: T414782: Hypothesis WE2.5.1: setup triple stores in eqiad.
Tue, Mar 24, 10:57 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikidata Platform Team
BTullis moved T421048: request: make kafka-jumbo and kafka-test brokers reachable from wdqs eqiad test nodes from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Tue, Mar 24, 10:48 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikidata Platform Team
BTullis claimed T421048: request: make kafka-jumbo and kafka-test brokers reachable from wdqs eqiad test nodes.
Tue, Mar 24, 10:48 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), Wikidata Platform Team

Mon, Mar 23

BTullis closed T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet as Resolved.

I believe that this is now fixed. Thanks @Jclark-ctr.

Mon, Mar 23, 4:54 PM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
BTullis triaged T420437: Migrate DSE k8s apiserver and services to IPIP as Medium priority.
Mon, Mar 23, 2:45 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Prod-Kubernetes, Kubernetes, Liberica, Traffic
BTullis moved T420437: Migrate DSE k8s apiserver and services to IPIP from In Progress to Blocked/Waiting on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.

We are going to wait until the dust has settled slightly on T414484: Upgrade DSE clusters to kubernetes 1.31 before implementing this.
Technically, the changes should be orthogonal, but it is probably best to keep the number of concurrent changes down as much as possible.

Mon, Mar 23, 2:41 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Prod-Kubernetes, Kubernetes, Liberica, Traffic
BTullis claimed T414484: Upgrade DSE clusters to kubernetes 1.31.

We are planning to upgrade the dse-k8s-eqiad cluster on Thursday March 26th.
I'll send out the communications today and start preparing patches.

Mon, Mar 23, 1:38 PM · ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work, Kubernetes, Prod-Kubernetes

Fri, Mar 20

BTullis moved T420264: Data Platform SRE paging alerts and on-call SRE response from Backlog - operations to In Progress on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 20, 9:38 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE
BTullis claimed T420264: Data Platform SRE paging alerts and on-call SRE response.
Fri, Mar 20, 9:37 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), SRE
BTullis added a comment to T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet.

The first reimage failed because of a partman issue.

I'll put the host into insetup mode to carry out the reimage.

Fri, Mar 20, 9:34 AM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)
BTullis reassigned T419041: Enable custom readahead settings for Ceph block devices serving workload on the dse-k8s clusters from BTullis to bking.
Fri, Mar 20, 9:21 AM · Patch-For-Review, Discovery-Search (2026.03.03 - 2026.04.03), Data-Platform-SRE (2026-03-06 - 2026-03-27)
BTullis claimed T420416: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet.
Fri, Mar 20, 9:15 AM · SRE, Essential-Work, ops-eqiad, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27)

Thu, Mar 19

BTullis closed T418582: Add dse-k8s-worker102[4-8] to the dse-k8s-eqiad cluster, a subtask of T414948: Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers, as Resolved.
Thu, Mar 19, 2:41 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2026-02-13 - 2026-03-06)
BTullis closed T418582: Add dse-k8s-worker102[4-8] to the dse-k8s-eqiad cluster, a subtask of T418398: Two hosts are failing to do DHCP based PXE booting after renaming and moving vlan, as Resolved.
Thu, Mar 19, 2:41 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), Infrastructure-Foundations
BTullis closed T418582: Add dse-k8s-worker102[4-8] to the dse-k8s-eqiad cluster as Resolved.

These have all now been added to the cluster.
The transient networking issues are described here: T419992: Alert if calico BGP sessions are not established on any kubernetes worker

Thu, Mar 19, 2:41 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)
BTullis moved T416824: Label dse-k8s-nodes with 1G NIC from Backlog - project to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Thu, Mar 19, 2:37 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27)