
brouberol (Balthazar Rouberol)
Data Platform SRE

User Details

User Since
Sep 5 2023, 11:23 AM (135 w, 3 d)
Availability
Available
IRC Nick
brouberol
LDAP User
Brouberol
MediaWiki User
BRouberol-WMF

Recent Activity

Today

brouberol moved T420730: Allow certain DAGs to be ignored when creating an airflow development environment from Backlog - operations to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 10, 12:19 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T420781: Update GrowthBook to new version that includes recent PRs from Needs Review to To Be Deployed on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 10, 12:19 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen
brouberol added a comment to T420781: Update GrowthBook to new version that includes recent PRs.

We've built the image

docker-registry.discovery.wmnet/repos/data-engineering/growthbook/next:2026-04-10-115921-a565c5295af07c0a141223fc363eb36aeea3fbb5@sha256:81a933393356a56bd9d5d5a8090eaad0da65b055721f928a85d63ba3ac6b1ef2

from the merge commit of https://github.com/growthbook/growthbook/pull/5520.

Fri, Apr 10, 12:15 PM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen
brouberol moved T420781: Update GrowthBook to new version that includes recent PRs from Blocked/Waiting to Needs Review on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 10, 10:21 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen
brouberol changed the status of T420781: Update GrowthBook to new version that includes recent PRs from Open to In Progress.
Fri, Apr 10, 10:21 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen
brouberol claimed T420781: Update GrowthBook to new version that includes recent PRs.
Fri, Apr 10, 8:42 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen
brouberol moved T421361: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values from Backlog - project to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 10, 8:11 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
brouberol moved T422030: Surge in webrequest validation check from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 10, 8:11 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q4 FS25/26 April 1st - June 30st), Traffic

Wed, Apr 8

brouberol assigned T416820: Check home/HDFS leftovers of nettrom to BTullis.
Wed, Apr 8, 9:01 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)

Fri, Apr 3

brouberol added a comment to T421783: Requesting Kerberos access for matmarex.

Sure thing!

brouberol@krb1002:~$ sudo manage_principals.py reset-password  matmarex --email=bdziewonski@wikimedia.org
Password reset successfully.
Successfully sent email to bdziewonski@wikimedia.org
Fri, Apr 3, 3:43 PM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol reassigned T420696: API keys for GrowthBook from brouberol to RKemper.
Fri, Apr 3, 11:18 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17), Test Kitchen, OKR-Work, Epic
brouberol closed T421783: Requesting Kerberos access for matmarex as Resolved.
Fri, Apr 3, 11:18 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T421783: Requesting Kerberos access for matmarex from In Progress to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 3, 11:18 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a comment to T421783: Requesting Kerberos access for matmarex.

It appears that you already have a Kerberos principal, @matmarex.

Fri, Apr 3, 11:17 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a comment to T421783: Requesting Kerberos access for matmarex.
brouberol@krb1002:~$ sudo manage_principals.py create  matmarex --email=bdziewonski@wikimedia.org
Principal already created (or an error occurred with kadmin), skipping.
brouberol@krb1002:~$ sudo kadmin.local listprincs | grep matmarex
matmarex@WIKIMEDIA
Fri, Apr 3, 11:17 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol changed the status of T421783: Requesting Kerberos access for matmarex from Open to In Progress.
Fri, Apr 3, 11:16 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T421783: Requesting Kerberos access for matmarex from Quick Wins to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 3, 11:16 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol claimed T421783: Requesting Kerberos access for matmarex.
Fri, Apr 3, 11:16 AM · Data-Engineering, Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T421362: Unusual CI failure for aux-k8s when changing dse-k8s cert-manager values from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 3, 11:15 AM · Infrastructure-Foundations, SRE, Data-Platform-SRE (2026-03-27 - 2026-04-17), ci-test-error, Kubernetes
brouberol moved T422066: Request for +2 rights on Deployment-chart Repository for Snwachukwu from Quick Wins to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Apr 3, 11:15 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Gerrit-Privilege-Requests
brouberol added a comment to T360794: Event stream with latest revision HTML & parent revision HTML diff.

@Ottomata Assuming you mean 290GB and not 290TB, we should be all good :)

Fri, Apr 3, 6:42 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Research, Event-Platform

Wed, Apr 1

brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

@Jclark-ctr an-worker1148 is now in decommissioning status (https://netbox.wikimedia.org/dcim/devices/3661/). Over to you, with many thanks!

Wed, Apr 1, 2:37 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops

Tue, Mar 31

brouberol updated the task description for T421860: Requesting shell access and membership of the ops group for atsuko.
Tue, Mar 31, 2:26 PM · SRE, SRE-Access-Requests
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

I'm seeing

Fault detected on drive 1 in disk drive bay 1. 	Tue Mar 31 2026 12:56:39

in the iDRAC UI, roughly 1 minute after I mounted the disk.

Tue, Mar 31, 1:28 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

Puppet failed with

Error: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d
Error: /Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/d]/File[/var/lib/hadoop/data/d]/group: change from 'root' to 'hdfs' failed: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d (corrective)

The mountpoint was mounted read-only (ro), and the disk started reporting errors in the iDRAC again.

Screenshot 2026-03-31 at 15.21.41.png (777×1 px, 111 KB)

Tue, Mar 31, 1:22 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

I'm going to follow https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk to configure the missing disk

brouberol@an-worker1148:~$ sudo megacli -PDList -aAll | grep Firm
Firmware state: Online, Spun Up
Device Firmware Level: LA0B
Firmware state: Unconfigured(good), Spun Up. <---
Device Firmware Level: LA0C 
Firmware state: Online, Spun Up
...
brouberol@an-worker1148:~$ sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state"
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
---
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Unconfigured(good), Spun Up
---
Enclosure Device ID: 32
Slot Number: 2
...
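
For illustration only, the filtered `megacli -PDList -aAll` output above can be scanned for drives that are not in the `Online` state. This is a minimal sketch, not part of the documented procedure: the function name `non_online_slots` is hypothetical, and it assumes the output keeps the `Enclosure Device ID` / `Slot Number` / `Firmware state` ordering shown in the transcript.

```python
from __future__ import annotations

import re


def non_online_slots(pdlist_output: str) -> list[tuple[int, int, str]]:
    """Return (enclosure, slot, firmware_state) for each drive that is not Online.

    Hypothetical helper: it only understands the three filtered fields
    shown in the transcript above.
    """
    results = []
    enclosure = slot = None
    for raw in pdlist_output.splitlines():
        line = raw.strip()
        if m := re.match(r"Enclosure Device ID:\s*(\d+)", line):
            enclosure = int(m.group(1))
        elif m := re.match(r"Slot Number:\s*(\d+)", line):
            slot = int(m.group(1))
        elif m := re.match(r"Firmware state:\s*(.+)", line):
            state = m.group(1)
            # Report the drive only once we know which slot it belongs to.
            if enclosure is not None and slot is not None and not state.startswith("Online"):
                results.append((enclosure, slot, state))
    return results


# Sample copied from the transcript above (truncated to two drives).
SAMPLE = """\
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
---
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Unconfigured(good), Spun Up
"""

print(non_online_slots(SAMPLE))
```

On the sample, this singles out enclosure 32, slot 1, the drive in the `Unconfigured(good)` state.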
Tue, Mar 31, 12:59 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

The fstab seems to be correct though.

brouberol@an-worker1148:~$ cat /etc/fstab  | grep LABEL=hadoop | grep -v '#'
LABEL=hadoop-e	/var/lib/hadoop/data/e	ext4	defaults,noatime	0	2
LABEL=hadoop-f	/var/lib/hadoop/data/f	ext4	defaults,noatime	0	2
LABEL=hadoop-g	/var/lib/hadoop/data/g	ext4	defaults,noatime	0	2
LABEL=hadoop-h	/var/lib/hadoop/data/h	ext4	defaults,noatime	0	2
LABEL=hadoop-i	/var/lib/hadoop/data/i	ext4	defaults,noatime	0	2
LABEL=hadoop-j	/var/lib/hadoop/data/j	ext4	defaults,noatime	0	2
LABEL=hadoop-k	/var/lib/hadoop/data/k	ext4	defaults,noatime	0	2
LABEL=hadoop-l	/var/lib/hadoop/data/l	ext4	defaults,noatime	0	2
LABEL=hadoop-m	/var/lib/hadoop/data/m	ext4	defaults,noatime	0	2
Tue, Mar 31, 12:02 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

Oh and something I overlooked in https://phabricator.wikimedia.org/T411919#11772073: we're back to having the device names and the mount points jumbled up.

Tue, Mar 31, 12:00 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

Screenshot 2026-03-31 at 13.55.43.png (460×1 px, 105 KB)
Screenshot 2026-03-31 at 13.56.13.png (609×1 px, 93 KB)
Seems like all disks are healthy, but one of them isn't online.

Tue, Mar 31, 11:57 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

All disks are reported healthy by SMART:

brouberol@an-worker1148:~$ sudo smart-data-dump --debug 2>&1 | grep healthy
...
# HELP device_smart_healthy SMART health
# TYPE device_smart_healthy gauge
device_smart_healthy{device="sat+megaraid,0"} 1.0
device_smart_healthy{device="sat+megaraid,1"} 1.0
device_smart_healthy{device="sat+megaraid,2"} 1.0
device_smart_healthy{device="sat+megaraid,3"} 1.0
device_smart_healthy{device="sat+megaraid,4"} 1.0
device_smart_healthy{device="sat+megaraid,5"} 1.0
device_smart_healthy{device="sat+megaraid,6"} 1.0
device_smart_healthy{device="sat+megaraid,7"} 1.0
device_smart_healthy{device="sat+megaraid,8"} 1.0
device_smart_healthy{device="sat+megaraid,9"} 1.0
device_smart_healthy{device="sat+megaraid,10"} 1.0
device_smart_healthy{device="sat+megaraid,11"} 1.0
device_smart_healthy{device="sat+megaraid,12"} 1.0
device_smart_healthy{device="sat+megaraid,13"} 1.0
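
As a hedged sketch, the `device_smart_healthy` gauge lines above can be checked programmatically rather than by eye. The function name `unhealthy_devices` is hypothetical, and the parser only handles the single-label metric lines shown in this output.

```python
import re

# Matches lines such as: device_smart_healthy{device="sat+megaraid,3"} 1.0
METRIC = re.compile(r'^device_smart_healthy\{device="([^"]+)"\}\s+([0-9.]+)$')


def unhealthy_devices(metrics_text):
    """Return the device labels whose SMART health gauge is not 1.0."""
    bad = []
    for raw in metrics_text.splitlines():
        m = METRIC.match(raw.strip())
        if m and float(m.group(2)) != 1.0:
            bad.append(m.group(1))
    return bad


# Two lines copied from the output above, plus the HELP/TYPE comments.
SAMPLE = """\
# HELP device_smart_healthy SMART health
# TYPE device_smart_healthy gauge
device_smart_healthy{device="sat+megaraid,0"} 1.0
device_smart_healthy{device="sat+megaraid,1"} 1.0
"""

print(unhealthy_devices(SAMPLE))
```

An empty list here only means SMART reports every drive as healthy. It does not by itself say whether each drive is online in the RAID controller.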
Tue, Mar 31, 11:49 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

Seems like /dev/sdk is having some issues:

brouberol@an-worker1148:~$ sudo dmesg | grep sdk
[    9.359370] sd 0:2:11:0: [sdk] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[    9.359372] sd 0:2:11:0: [sdk] 4096-byte physical blocks
[    9.359399] sd 0:2:11:0: [sdk] Write Protect is off
[    9.359401] sd 0:2:11:0: [sdk] Mode Sense: 1f 00 00 08
[    9.359469] sd 0:2:11:0: [sdk] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    9.359545] sdk: detected capacity change from 0 to 8000987201536
[    9.752614] sdk: detected capacity change from 0 to 8000987201536
[    9.754638]  sdk: sdk1
[    9.815601] sdk: detected capacity change from 0 to 8000987201536
[    9.827790] sd 0:2:11:0: [sdk] Attached SCSI disk
[   18.420794] EXT4-fs (sdk1): mounted filesystem with ordered data mode. Opts: (null)
[699069.233415] sd 0:2:11:0: [sdk] tag#7 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[699069.233431] sd 0:2:11:0: [sdk] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.233435] sd 0:2:11:0: [sdk] tag#7 Sense Key : Medium Error [current]
[699069.233439] sd 0:2:11:0: [sdk] tag#7 Add. Sense: No additional sense information
[699069.233445] sd 0:2:11:0: [sdk] tag#7 CDB: Read(16) 88 00 00 00 00 00 19 d0 ba 00 00 00 02 00 00 00
[699069.233450] blk_update_request: I/O error, dev sdk, sector 433109504 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 0
[699069.628615] sd 0:2:11:0: [sdk] tag#746 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.645265] sd 0:2:11:0: [sdk] tag#748 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.661232] sd 0:2:11:0: [sdk] tag#749 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.677270] sd 0:2:11:0: [sdk] tag#750 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.693266] sd 0:2:11:0: [sdk] tag#753 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.709346] sd 0:2:11:0: [sdk] tag#754 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.709367] sd 0:2:11:0: [sdk] tag#754 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.709380] sd 0:2:11:0: [sdk] tag#754 Sense Key : Medium Error [current]
[699069.709393] sd 0:2:11:0: [sdk] tag#754 Add. Sense: No additional sense information
[699069.709407] sd 0:2:11:0: [sdk] tag#754 CDB: Read(16) 88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00
[699069.709422] blk_update_request: I/O error, dev sdk, sector 433109904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[699069.720970] sd 0:2:11:0: [sdk] tag#755 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.737263] sd 0:2:11:0: [sdk] tag#756 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.753261] sd 0:2:11:0: [sdk] tag#757 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.769257] sd 0:2:11:0: [sdk] tag#758 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.785342] sd 0:2:11:0: [sdk] tag#759 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.801228] sd 0:2:11:0: [sdk] tag#760 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.801249] sd 0:2:11:0: [sdk] tag#760 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.801269] sd 0:2:11:0: [sdk] tag#760 Sense Key : Medium Error [current]
[699069.801275] sd 0:2:11:0: [sdk] tag#760 Add. Sense: No additional sense information
[699069.801280] sd 0:2:11:0: [sdk] tag#760 CDB: Read(16) 88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00
[699069.801285] blk_update_request: I/O error, dev sdk, sector 433109904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
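
As a cross-check on the log above, the `CDB: Read(16)` bytes can be decoded by hand. In a READ(16) command, byte 0 is the opcode `0x88`, bytes 2 to 9 hold the big-endian logical block address, and bytes 10 to 13 hold the transfer length in blocks. The sketch below assumes the 512-byte logical block size reported by the kernel for `sdk`; the function name `decode_read16` is hypothetical.

```python
def decode_read16(cdb_hex, logical_block_size=512):
    """Decode a SCSI READ(16) CDB into (lba, block_count, byte_count)."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between hex bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return lba, blocks, blocks * logical_block_size


# CDBs copied from the dmesg output above (tag#7 and tag#754).
first = decode_read16("88 00 00 00 00 00 19 d0 ba 00 00 00 02 00 00 00")
second = decode_read16("88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00")

print(first)   # (433109504, 512, 262144)
print(second)  # (433109904, 8, 4096)
```

The decoded addresses match the sectors in the `blk_update_request` lines (433109504 and 433109904), and the byte counts match the requested data lengths (`0x40000` = 262144 and `0x1000` = 4096).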
Tue, Mar 31, 11:42 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol added a comment to T411919: hw troubleshooting: PERC1 battery failure for an-worker1148.

@RKemper I can't seem to run Puppet on this host:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Number of datanode mountpoints (9) below threshold: 10, please check. (file: /srv/puppet_code/environments/production/modules/profile/manifests/hadoop/common.pp, line: 418, column: 9) on node an-worker1148.eqiad.wmnet
Tue, Mar 31, 11:40 AM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), SRE, ops-eqiad, DC-Ops
brouberol assigned T420730: Allow certain DAGs to be ignored when creating an airflow development environment to atsuko.
Tue, Mar 31, 11:23 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a member for acl*sre-team: atsuko.
Tue, Mar 31, 8:51 AM
brouberol added a member for WMF-NDA: atsuko.
Tue, Mar 31, 8:51 AM
brouberol closed T416113: Deploy turnilo to dse-k8s-eqiad as Resolved.
Tue, Mar 31, 8:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol moved T416113: Deploy turnilo to dse-k8s-eqiad from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Mar 31, 8:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416123: Define the turnilo global config, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Tue, Mar 31, 8:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol moved T416123: Define the turnilo global config from Backlog - project to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Mar 31, 8:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416123: Define the turnilo global config as Resolved.
Tue, Mar 31, 8:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a comment to T416123: Define the turnilo global config.

Yep, the config is now fully defined in Kubernetes/deployment-charts. There's nothing else to do there.

Tue, Mar 31, 8:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416127: Decommission an-tool1007, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Tue, Mar 31, 8:21 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416127: Decommission an-tool1007 as Resolved.
Tue, Mar 31, 8:21 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416127: Decommission an-tool1007 from In Progress to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Mar 31, 8:21 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416126: Cleanup turnilo resources from puppet, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Tue, Mar 31, 8:19 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416126: Cleanup turnilo resources from puppet as Resolved.
Tue, Mar 31, 8:19 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416126: Cleanup turnilo resources from puppet from In Progress to Done on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Mar 31, 8:19 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol claimed T416127: Decommission an-tool1007.
Tue, Mar 31, 7:45 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol claimed T416126: Cleanup turnilo resources from puppet.
Tue, Mar 31, 7:45 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol changed the status of T416127: Decommission an-tool1007, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, from Open to In Progress.
Tue, Mar 31, 7:45 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol changed the status of T416127: Decommission an-tool1007 from Open to In Progress.
Tue, Mar 31, 7:44 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol changed the status of T416126: Cleanup turnilo resources from puppet, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, from Open to In Progress.
Tue, Mar 31, 7:44 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol moved T416127: Decommission an-tool1007 from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Mar 31, 7:44 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol changed the status of T416126: Cleanup turnilo resources from puppet from Open to In Progress.
Tue, Mar 31, 7:44 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416126: Cleanup turnilo resources from puppet from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Tue, Mar 31, 7:44 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)

Fri, Mar 27

brouberol updated Other Assignee for T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad, added: brouberol.
Fri, Mar 27, 4:36 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
brouberol reassigned T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad from brouberol to BTullis.
Fri, Mar 27, 4:36 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
brouberol moved T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad from In Progress to Needs Review on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Mar 27, 4:32 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
brouberol changed the status of T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad, a subtask of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21, from Open to In Progress.
Fri, Mar 27, 4:26 PM · ServiceOps new, Patch-For-Review, Prod-Kubernetes
brouberol changed the status of T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad from Open to In Progress.
Fri, Mar 27, 4:26 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
brouberol moved T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad from Backlog - project to In Progress on the Data-Platform-SRE (2026-03-27 - 2026-04-17) board.
Fri, Mar 27, 4:26 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
brouberol claimed T419259: mediawiki-dumps-legacy is running without security policy on dse-k8s-eqiad.
Fri, Mar 27, 4:26 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering-Radar, Data-Engineering, Dumps-Generation, ServiceOps new, Prod-Kubernetes
brouberol closed T417159: Add a monitor to check that the hive-metastore is actually functioning as Declined.

Closing as duplicate of https://phabricator.wikimedia.org/T417158

Fri, Mar 27, 10:30 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416146: Document our Turnilo-on-k8s deployment, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Fri, Mar 27, 9:30 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416146: Document our Turnilo-on-k8s deployment as Resolved.
Fri, Mar 27, 9:30 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416146: Document our Turnilo-on-k8s deployment from In Progress to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 27, 9:30 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a comment to T416146: Document our Turnilo-on-k8s deployment.

I removed the legacy documentation from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Turnilo and added our readiness checklist for the production k8s service.

Fri, Mar 27, 9:30 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416125: Migrate turnilo.wikimedia.org to the kubernetes deployment, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Fri, Mar 27, 9:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol moved T416125: Migrate turnilo.wikimedia.org to the kubernetes deployment from To Be Deployed to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 27, 9:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416125: Migrate turnilo.wikimedia.org to the kubernetes deployment as Resolved.
Fri, Mar 27, 9:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a comment to T416125: Migrate turnilo.wikimedia.org to the kubernetes deployment.

image.png (1×1 px, 150 KB)

Fri, Mar 27, 9:22 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416125: Migrate turnilo.wikimedia.org to the kubernetes deployment from Backlog - project to To Be Deployed on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 27, 8:50 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol added a comment to T416125: Migrate turnilo.wikimedia.org to the kubernetes deployment.

Turnilo is now deployed in kubernetes:

brouberol@deploy2002:~$ curl https://turnilo.discovery.wmnet:30443
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://idp.wikimedia.org/login?service=https%3a%2f%2fturnilo.wikimedia.org%2f">here</a>.</p>
</body></html>
brouberol@deploy2002:~$

I'm now merging the patch that redirects https://turnilo.wikimedia.org to the kubernetes pod.

Fri, Mar 27, 8:50 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416121: Define the turnilo helmfiles, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Fri, Mar 27, 8:47 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416120: Define the turnilo namespaces, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416121: Define the turnilo helmfiles as Resolved.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416121: Define the turnilo helmfiles from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416119: Define the turnilo kubeconfigs, a subtask of T416113: Deploy turnilo to dse-k8s-eqiad, as Resolved.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
brouberol closed T416120: Define the turnilo namespaces as Resolved.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416120: Define the turnilo namespaces from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol closed T416119: Define the turnilo kubeconfigs as Resolved.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)
brouberol moved T416119: Define the turnilo kubeconfigs from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Fri, Mar 27, 8:46 AM · Data-Platform-SRE (2026-03-27 - 2026-04-17)

Thu, Mar 26

brouberol closed T417213: Create FR Tech Airflow instance as Resolved.
Thu, Mar 26, 3:45 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), Data-Engineering-Radar, Data-Engineering, FR-Tech-Analytics
brouberol moved T417213: Create FR Tech Airflow instance from To Be Deployed to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Thu, Mar 26, 3:45 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), Data-Engineering-Radar, Data-Engineering, FR-Tech-Analytics
brouberol closed T417213: Create FR Tech Airflow instance, a subtask of T416457: Enable greater integration between the DPE and FR-tech analytics stacks, as Resolved.
Thu, Mar 26, 3:45 PM · Epic, Data-Engineering, FR-Tech-Analytics, Data-Platform-SRE
brouberol added a comment to T417213: Create FR Tech Airflow instance.

Screenshot 2026-03-26 at 16.44.34.png (567×1 px, 121 KB)
All good and deployed!

Thu, Mar 26, 3:44 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), Data-Engineering-Radar, Data-Engineering, FR-Tech-Analytics
brouberol added a comment to T417213: Create FR Tech Airflow instance.
btullis@seaborgium:~$ ldapadd -f airflow-fr-tech-ops.ldif -D "cn=admin,dc=wikimedia,dc=org" -x -W -H "ldap://ldap-rw.eqiad.wikimedia.org:389"
Enter LDAP Password: 
adding new entry "cn=airflow-fr-tech-ops,ou=groups,dc=wikimedia,dc=org"

The LDAP group was created.

Thu, Mar 26, 3:44 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), Data-Engineering-Radar, Data-Engineering, FR-Tech-Analytics
brouberol moved T417213: Create FR Tech Airflow instance from In Progress to To Be Deployed on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Thu, Mar 26, 2:55 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), Data-Engineering-Radar, Data-Engineering, FR-Tech-Analytics
brouberol closed T414484: Upgrade DSE clusters to kubernetes 1.31, a subtask of T341984: Update Kubernetes clusters to 1.31, as Resolved.
Thu, Mar 26, 2:49 PM · Data-Platform-SRE (2026.01.05 - 2026.01.23), Epic, ServiceOps new, Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes
brouberol closed T414484: Upgrade DSE clusters to kubernetes 1.31 as Resolved.
Thu, Mar 26, 2:49 PM · ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work, Kubernetes, Prod-Kubernetes
brouberol added a comment to T414484: Upgrade DSE clusters to kubernetes 1.31.
root@deploy2002:/srv/deployment-charts/custom_deploy.d/istio# istioctl-1.24.2 manifest apply -f ./dse-k8s/config.yaml
        |\
        | \
        |  \
        |   \
      /||    \
     / ||     \
    /  ||      \
   /   ||       \
  /    ||        \
 /     ||         \
/______||__________\
____________________
  \__       _____/
     \_____/
Thu, Mar 26, 2:49 PM · ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work, Kubernetes, Prod-Kubernetes
brouberol added a comment to T383553: Set cert-manager leader election namespace to cert-manager.

This has now been done for dse-k8s-eqiad.

Thu, Mar 26, 1:09 PM · Machine-Learning-Team (Q4 FY2025-26), Infrastructure-Foundations, ServiceOps new, Data-Platform-SRE, Kubernetes, Prod-Kubernetes
brouberol added a comment to T414484: Upgrade DSE clusters to kubernetes 1.31.

Our upgrade notes can be found at https://docs.google.com/document/d/1q7Amw_XSN_Lfb7fCnaSprpW8Z43iMyD4NOD3Lbq2hR4/edit?tab=t.0

Thu, Mar 26, 1:07 PM · ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work, Kubernetes, Prod-Kubernetes
brouberol added a comment to T414484: Upgrade DSE clusters to kubernetes 1.31.

All nodes now run Kubernetes 1.31:

root@deploy2002:~# kubectl get nodes
NAME                             STATUS   ROLES           AGE     VERSION
dse-k8s-ctrl1001.eqiad.wmnet     Ready    control-plane   3y31d   v1.31.4
dse-k8s-ctrl1002.eqiad.wmnet     Ready    control-plane   3y31d   v1.31.4
dse-k8s-worker1001.eqiad.wmnet   Ready    <none>          3y31d   v1.31.4
dse-k8s-worker1002.eqiad.wmnet   Ready    <none>          3y31d   v1.31.4
dse-k8s-worker1003.eqiad.wmnet   Ready    <none>          3y31d   v1.31.4
dse-k8s-worker1004.eqiad.wmnet   Ready    <none>          3y31d   v1.31.4
dse-k8s-worker1005.eqiad.wmnet   Ready    <none>          3y27d   v1.31.4
dse-k8s-worker1006.eqiad.wmnet   Ready    <none>          3y27d   v1.31.4
dse-k8s-worker1007.eqiad.wmnet   Ready    <none>          3y27d   v1.31.4
dse-k8s-worker1008.eqiad.wmnet   Ready    <none>          3y27d   v1.31.4
dse-k8s-worker1009.eqiad.wmnet   Ready    <none>          572d    v1.31.4
dse-k8s-worker1010.eqiad.wmnet   Ready    <none>          310d    v1.31.4
dse-k8s-worker1011.eqiad.wmnet   Ready    <none>          310d    v1.31.4
dse-k8s-worker1012.eqiad.wmnet   Ready    <none>          286d    v1.31.4
dse-k8s-worker1013.eqiad.wmnet   Ready    <none>          286d    v1.31.4
dse-k8s-worker1014.eqiad.wmnet   Ready    <none>          185d    v1.31.4
dse-k8s-worker1015.eqiad.wmnet   Ready    <none>          231d    v1.31.4
dse-k8s-worker1016.eqiad.wmnet   Ready    <none>          231d    v1.31.4
dse-k8s-worker1017.eqiad.wmnet   Ready    <none>          231d    v1.31.4
dse-k8s-worker1018.eqiad.wmnet   Ready    <none>          231d    v1.31.4
dse-k8s-worker1019.eqiad.wmnet   Ready    <none>          231d    v1.31.4
dse-k8s-worker1024.eqiad.wmnet   Ready    <none>          24d     v1.31.4
dse-k8s-worker1025.eqiad.wmnet   Ready    <none>          23d     v1.31.4
dse-k8s-worker1026.eqiad.wmnet   Ready    <none>          8d      v1.31.4
dse-k8s-worker1027.eqiad.wmnet   Ready    <none>          7d21h   v1.31.4
dse-k8s-worker1028.eqiad.wmnet   Ready    <none>          23d     v1.31.4
Thu, Mar 26, 1:07 PM · ServiceOps new, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work, Kubernetes, Prod-Kubernetes
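
The node listing above can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming the default `kubectl get nodes` column layout shown in the comment; the function name `nodes_not_on_version` and the `SAMPLE` excerpt are illustrative and not part of the original comment.

```python
def nodes_not_on_version(kubectl_output: str, expected_prefix: str) -> list[str]:
    """Return the names of nodes whose VERSION column does not start with expected_prefix.

    Assumes whitespace-separated columns with a header row containing
    NAME and VERSION, as printed by `kubectl get nodes`.
    """
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    version_idx = header.index("VERSION")

    mismatched = []
    for line in lines[1:]:
        cols = line.split()
        if not cols[version_idx].startswith(expected_prefix):
            mismatched.append(cols[name_idx])
    return mismatched


# Excerpt of the listing above (illustrative subset, not the full cluster).
SAMPLE = """
NAME                             STATUS   ROLES           AGE     VERSION
dse-k8s-ctrl1001.eqiad.wmnet     Ready    control-plane   3y31d   v1.31.4
dse-k8s-worker1001.eqiad.wmnet   Ready    <none>          3y31d   v1.31.4
dse-k8s-worker1027.eqiad.wmnet   Ready    <none>          7d21h   v1.31.4
"""

print(nodes_not_on_version(SAMPLE, "v1.31."))  # → []
```

An empty list means every listed node reports a 1.31.x kubelet; any node still on an older version would be returned by name.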

Wed, Mar 25

brouberol closed T417407: Evacuate all kafka-mirrormaker instances to Kubernetes, a subtask of T416669: Upgrade Kafka to version 3.x, as Resolved.
Wed, Mar 25, 10:37 AM · ServiceOps-Datastores, ServiceOps new, Infrastructure-Foundations, SRE
brouberol closed T417407: Evacuate all kafka-mirrormaker instances to Kubernetes as Resolved.
Wed, Mar 25, 10:37 AM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps-Datastores, ServiceOps new
brouberol moved T417407: Evacuate all kafka-mirrormaker instances to Kubernetes from Blocked/Waiting to Done on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Wed, Mar 25, 10:37 AM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps-Datastores, ServiceOps new
brouberol added a comment to T417407: Evacuate all kafka-mirrormaker instances to Kubernetes.
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/aux-k8s-services/kafka-mirrormaker$ kubectl get pod
NAME                                                              READY   STATUS    RESTARTS   AGE
kafka-mirrormaker-jumbo-eqiad-to-test-eqiad-766ccfd75b-6gk2t      1/1     Running   0          5d17h
kafka-mirrormaker-logging-codfw-to-jumbo-eqiad-585b57d898-96ddm   1/1     Running   0          5d1h
kafka-mirrormaker-logging-eqiad-to-jumbo-eqiad-76b47659fd-p4gkx   1/1     Running   0          5d1h
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-dj2ml        1/1     Running   0          52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-lzpn2        1/1     Running   0          52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-mx5hq        1/1     Running   0          52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-qnnwk        1/1     Running   0          52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-sf5tn        1/1     Running   0          52m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-4tt9g      1/1     Running   0          39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-8qg72      1/1     Running   0          39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-9trnl      1/1     Running   0          39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-sz8mp      1/1     Running   0          39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-txc9v      1/1     Running   0          39m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-g9v2j       1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-hw62l       1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-jt85b       1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-k6lgz       1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-rnvqv       1/1     Running   0          45m
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/aux-k8s-services/kafka-mirrormaker$ kube-env kafka-mirrormaker aux-k8s-codfw
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/aux-k8s-services/kafka-mirrormaker$ kubectl get pod
NAME                                                          READY   STATUS    RESTARTS   AGE
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-k85d2    1/1     Running   0          51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-nlhr4    1/1     Running   0          51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-qc9q2    1/1     Running   0          51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-r7l8q    1/1     Running   0          51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-x24fk    1/1     Running   0          51m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-447ln   1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-gdwkl   1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-t84wk   1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-tp28b   1/1     Running   0          45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-zslrc   1/1     Running   0          45m
Wed, Mar 25, 10:37 AM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps-Datastores, ServiceOps new
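
The pod listings above can be summarized as a count of running replicas per mirrormaker instance. This is a rough sketch, assuming the pods are managed by Deployments and so follow the usual `<deployment>-<replicaset-hash>-<pod-suffix>` naming; the function name `running_replicas` and the `SAMPLE` excerpt are illustrative only.

```python
from collections import Counter


def running_replicas(kubectl_output: str) -> dict[str, int]:
    """Count pods in the Running state, grouped by deployment name.

    Assumes `kubectl get pod` default columns (NAME READY STATUS ...) and
    pod names of the form <deployment>-<replicaset-hash>-<pod-suffix>.
    """
    counts: Counter[str] = Counter()
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 3:
            continue
        name, status = cols[0], cols[2]
        deployment = name.rsplit("-", 2)[0]  # drop the two generated suffixes
        if status == "Running":
            counts[deployment] += 1
    return dict(counts)


# Excerpt of the first listing above (illustrative subset).
SAMPLE = """
NAME                                                              READY   STATUS    RESTARTS   AGE
kafka-mirrormaker-jumbo-eqiad-to-test-eqiad-766ccfd75b-6gk2t      1/1     Running   0          5d17h
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-dj2ml        1/1     Running   0          52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-lzpn2        1/1     Running   0          52m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-4tt9g      1/1     Running   0          39m
"""

for deployment, count in sorted(running_replicas(SAMPLE).items()):
    print(f"{deployment}: {count}")
```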

Mon, Mar 23

brouberol moved T417407: Evacuate all kafka-mirrormaker instances to Kubernetes from To Be Deployed to Blocked/Waiting on the Data-Platform-SRE (2026-03-06 - 2026-03-27) board.
Mon, Mar 23, 1:34 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps-Datastores, ServiceOps new
brouberol updated subscribers of T417407: Evacuate all kafka-mirrormaker instances to Kubernetes.

I attempted to deploy the kafka-mirror-main-codfw_to_main-eqiad instance this morning, but was blocked because the aux clusters are very low on resources. I then realized that we have 6 pending hosts in each DC, each with 48 CPUs and 128 GB of RAM (cf. T393053 and T393054). @elukey is working on reimaging them to trixie so that we can start adding them to the cluster. After that, we should finally be ready to deploy MM to k8s.

Mon, Mar 23, 1:33 PM · Patch-For-Review, Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps-Datastores, ServiceOps new
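
For a sense of scale, the raw capacity those pending hosts would add per DC can be worked out from the figures in the comment above. This is only back-of-the-envelope arithmetic on the stated hardware specs; the capacity actually allocatable to pods will be lower once system and kubelet reservations are taken into account, and the function name is illustrative.

```python
# Figures taken from the comment above.
HOSTS_PER_DC = 6
CPUS_PER_HOST = 48
RAM_GB_PER_HOST = 128


def added_raw_capacity(hosts: int, cpus_per_host: int, ram_gb_per_host: int) -> tuple[int, int]:
    """Return (total CPUs, total RAM in GB) contributed by the given hosts."""
    return hosts * cpus_per_host, hosts * ram_gb_per_host


cpus, ram_gb = added_raw_capacity(HOSTS_PER_DC, CPUS_PER_HOST, RAM_GB_PER_HOST)
print(f"Per DC: {cpus} CPUs, {ram_gb} GB of RAM")  # → Per DC: 288 CPUs, 768 GB of RAM
```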