User Details
- User Since: Sep 5 2023, 11:23 AM (135 w, 3 d)
- Availability: Available
- IRC Nick: brouberol
- LDAP User: Brouberol
- MediaWiki User: BRouberol-WMF [ Global Accounts ]
Today
We've built
docker-registry.discovery.wmnet/repos/data-engineering/growthbook/next:2026-04-10-115921-a565c5295af07c0a141223fc363eb36aeea3fbb5@sha256:81a933393356a56bd9d5d5a8090eaad0da65b055721f928a85d63ba3ac6b1ef2
which contains growthbook built from the merge commit of https://github.com/growthbook/growthbook/pull/5520.
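For anyone who wants to poke at it, pulling the image by digest should work (a hedged sketch, assuming pull access to the internal registry):
$ docker pull docker-registry.discovery.wmnet/repos/data-engineering/growthbook/next@sha256:81a933393356a56bd9d5d5a8090eaad0da65b055721f928a85d63ba3ac6b1ef2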
Wed, Apr 8
Fri, Apr 3
Sure thing!
brouberol@krb1002:~$ sudo manage_principals.py reset-password matmarex --email=bdziewonski@wikimedia.org
Password reset successfully.
Successfully sent email to bdziewonski@wikimedia.org
It appears you already have a Kerberos principal created, @matmarex.
brouberol@krb1002:~$ sudo manage_principals.py create matmarex --email=bdziewonski@wikimedia.org
Principal already created (or an error occurred with kadmin), skipping.
brouberol@krb1002:~$ sudo kadmin.local listprincs | grep matmarex
matmarex@WIKIMEDIA
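To double-check that the reset password works, something along these lines from a Kerberos-enabled client host should do (a hedged sketch, not part of the paste above):
$ kinit matmarex   # prompts for the newly reset password
$ klist            # should then list a valid ticket-granting ticket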
@Ottomata Assuming you mean 290GB and not 290TB, we should be all good :)
Wed, Apr 1
@Jclark-ctr an-worker1148 is now in decommissioning status (https://netbox.wikimedia.org/dcim/devices/3661/). Over to you, with many thanks!
Tue, Mar 31
I'm seeing
Fault detected on drive 1 in disk drive bay 1. Tue Mar 31 2026 12:56:39
in the iDRAC UI, which corresponds to about one minute after I mounted the disk.
Puppet failed with
Error: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d
Error: /Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/d]/File[/var/lib/hadoop/data/d]/group: change from 'root' to 'hdfs' failed: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d (corrective)
The mountpoint was mounted read-only and the disk started showing errors in the iDRAC again.
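A quick way to confirm the read-only remount from the shell (a hedged sketch, using the path from the Puppet error above):
$ findmnt -o TARGET,OPTIONS /var/lib/hadoop/data/d   # the options column should include 'ro' if the kernel remounted it read-only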
I'm going to follow https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk to configure the missing disk.
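For the record, the rough shape of that procedure (a hedged sketch only; the wiki page is authoritative, and the enclosure:slot, device name and label below are placeholders, not the actual values):
# identify the replaced drive, which should show up as Unconfigured(good)
$ sudo megacli -PDList -aAll | egrep "Enclosure Device ID|Slot Number|Firmware state"
# create a single-drive RAID0 virtual drive on it (enclosure E, slot S, adapter 0)
$ sudo megacli -CfgLdAdd -r0 [E:S] -a0
# partition and format it with the label expected by fstab, then mount
$ sudo parted /dev/sdX --script mklabel gpt mkpart primary ext4 0% 100%
$ sudo mkfs.ext4 -L hadoop-<letter> /dev/sdX1
$ sudo mount -a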
brouberol@an-worker1148:~$ sudo megacli -PDList -aAll | grep Firm
Firmware state: Online, Spun Up
Device Firmware Level: LA0B
Firmware state: Unconfigured(good), Spun Up. <---
Device Firmware Level: LA0C
Firmware state: Online, Spun Up
...
brouberol@an-worker1148:~$ sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state"
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
---
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Unconfigured(good), Spun Up
---
Enclosure Device ID: 32
Slot Number: 2
...
The fstab seems to be correct though.
brouberol@an-worker1148:~$ cat /etc/fstab | grep LABEL=hadoop | grep -v '#'
LABEL=hadoop-e /var/lib/hadoop/data/e ext4 defaults,noatime 0 2
LABEL=hadoop-f /var/lib/hadoop/data/f ext4 defaults,noatime 0 2
LABEL=hadoop-g /var/lib/hadoop/data/g ext4 defaults,noatime 0 2
LABEL=hadoop-h /var/lib/hadoop/data/h ext4 defaults,noatime 0 2
LABEL=hadoop-i /var/lib/hadoop/data/i ext4 defaults,noatime 0 2
LABEL=hadoop-j /var/lib/hadoop/data/j ext4 defaults,noatime 0 2
LABEL=hadoop-k /var/lib/hadoop/data/k ext4 defaults,noatime 0 2
LABEL=hadoop-l /var/lib/hadoop/data/l ext4 defaults,noatime 0 2
LABEL=hadoop-m /var/lib/hadoop/data/m ext4 defaults,noatime 0 2
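Since these entries are LABEL-based, the kernel device names don't matter for mounting; to see which device currently carries a given label (hedged example, label taken from the fstab above):
$ sudo blkid -L hadoop-k   # prints the device node currently holding that label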
Oh and something I overlooked in https://phabricator.wikimedia.org/T411919#11772073: we're back to having the device names and the mount points jumbled up.
All disks are reported healthy by SMART:
brouberol@an-worker1148:~$ sudo smart-data-dump --debug 2>&1 | grep healthy
...
# HELP device_smart_healthy SMART health
# TYPE device_smart_healthy gauge
device_smart_healthy{device="sat+megaraid,0"} 1.0
device_smart_healthy{device="sat+megaraid,1"} 1.0
device_smart_healthy{device="sat+megaraid,2"} 1.0
device_smart_healthy{device="sat+megaraid,3"} 1.0
device_smart_healthy{device="sat+megaraid,4"} 1.0
device_smart_healthy{device="sat+megaraid,5"} 1.0
device_smart_healthy{device="sat+megaraid,6"} 1.0
device_smart_healthy{device="sat+megaraid,7"} 1.0
device_smart_healthy{device="sat+megaraid,8"} 1.0
device_smart_healthy{device="sat+megaraid,9"} 1.0
device_smart_healthy{device="sat+megaraid,10"} 1.0
device_smart_healthy{device="sat+megaraid,11"} 1.0
device_smart_healthy{device="sat+megaraid,12"} 1.0
device_smart_healthy{device="sat+megaraid,13"} 1.0
Seems like /dev/sdk is having some issues:
brouberol@an-worker1148:~$ sudo dmesg | grep sdk
[ 9.359370] sd 0:2:11:0: [sdk] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[ 9.359372] sd 0:2:11:0: [sdk] 4096-byte physical blocks
[ 9.359399] sd 0:2:11:0: [sdk] Write Protect is off
[ 9.359401] sd 0:2:11:0: [sdk] Mode Sense: 1f 00 00 08
[ 9.359469] sd 0:2:11:0: [sdk] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 9.359545] sdk: detected capacity change from 0 to 8000987201536
[ 9.752614] sdk: detected capacity change from 0 to 8000987201536
[ 9.754638] sdk: sdk1
[ 9.815601] sdk: detected capacity change from 0 to 8000987201536
[ 9.827790] sd 0:2:11:0: [sdk] Attached SCSI disk
[ 18.420794] EXT4-fs (sdk1): mounted filesystem with ordered data mode. Opts: (null)
[699069.233415] sd 0:2:11:0: [sdk] tag#7 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[699069.233431] sd 0:2:11:0: [sdk] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.233435] sd 0:2:11:0: [sdk] tag#7 Sense Key : Medium Error [current]
[699069.233439] sd 0:2:11:0: [sdk] tag#7 Add. Sense: No additional sense information
[699069.233445] sd 0:2:11:0: [sdk] tag#7 CDB: Read(16) 88 00 00 00 00 00 19 d0 ba 00 00 00 02 00 00 00
[699069.233450] blk_update_request: I/O error, dev sdk, sector 433109504 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 0
[699069.628615] sd 0:2:11:0: [sdk] tag#746 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.645265] sd 0:2:11:0: [sdk] tag#748 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.661232] sd 0:2:11:0: [sdk] tag#749 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.677270] sd 0:2:11:0: [sdk] tag#750 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.693266] sd 0:2:11:0: [sdk] tag#753 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.709346] sd 0:2:11:0: [sdk] tag#754 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.709367] sd 0:2:11:0: [sdk] tag#754 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.709380] sd 0:2:11:0: [sdk] tag#754 Sense Key : Medium Error [current]
[699069.709393] sd 0:2:11:0: [sdk] tag#754 Add. Sense: No additional sense information
[699069.709407] sd 0:2:11:0: [sdk] tag#754 CDB: Read(16) 88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00
[699069.709422] blk_update_request: I/O error, dev sdk, sector 433109904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[699069.720970] sd 0:2:11:0: [sdk] tag#755 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.737263] sd 0:2:11:0: [sdk] tag#756 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.753261] sd 0:2:11:0: [sdk] tag#757 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.769257] sd 0:2:11:0: [sdk] tag#758 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.785342] sd 0:2:11:0: [sdk] tag#759 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.801228] sd 0:2:11:0: [sdk] tag#760 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.801249] sd 0:2:11:0: [sdk] tag#760 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.801269] sd 0:2:11:0: [sdk] tag#760 Sense Key : Medium Error [current]
[699069.801275] sd 0:2:11:0: [sdk] tag#760 Add. Sense: No additional sense information
[699069.801280] sd 0:2:11:0: [sdk] tag#760 CDB: Read(16) 88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00
[699069.801285] blk_update_request: I/O error, dev sdk, sector 433109904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
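To pull the raw SMART data for that drive behind the exporter metric (a hedged sketch; which sat+megaraid device number maps to sdk is an assumption and needs to be cross-checked against megacli):
$ sudo smartctl -a -d sat+megaraid,N /dev/sdk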
@RKemper I can't seem to run puppet on this host:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Number of datanode mountpoints (9) below threshold: 10, please check. (file: /srv/puppet_code/environments/production/modules/profile/manifests/hadoop/common.pp, line: 418, column: 9) on node an-worker1148.eqiad.wmnet
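A quick local approximation of the value that Puppet check is counting (hedged sketch; the check above expects at least 10):
$ grep -c ' /var/lib/hadoop/data/' /proc/mounts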
Yep, the config is now fully defined in Kubernetes/deployment-charts. There's nothing else to do there.
Fri, Mar 27
Closing as duplicate of https://phabricator.wikimedia.org/T417158
I removed the legacy documentation from https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Turnilo and added our readiness checklist for the production k8s service.
Turnilo is now deployed in kubernetes:
brouberol@deploy2002:~$ curl https://turnilo.discovery.wmnet:30443
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://idp.wikimedia.org/login?service=https%3a%2f%2fturnilo.wikimedia.org%2f">here</a>.</p>
</body></html>
brouberol@deploy2002:~$
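For just the redirect target, a lighter check should also work (hedged):
$ curl -sI https://turnilo.discovery.wmnet:30443 | grep -i '^location'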
I'm now merging the patch that redirects https://turnilo.wikimedia.org to the kubernetes pod.
Thu, Mar 26
btullis@seaborgium:~$ ldapadd -f airflow-fr-tech-ops.ldif -D "cn=admin,dc=wikimedia,dc=org" -x -W -H "ldap://ldap-rw.eqiad.wikimedia.org:389"
Enter LDAP Password:
adding new entry "cn=airflow-fr-tech-ops,ou=groups,dc=wikimedia,dc=org"
The LDAP group was created.
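For reference, the airflow-fr-tech-ops.ldif passed to ldapadd above would contain something along these lines (a hedged sketch only: the objectClasses, attributes and member entry are assumptions, not the real file):
# hypothetical sketch -- not the actual LDIF
dn: cn=airflow-fr-tech-ops,ou=groups,dc=wikimedia,dc=org
objectClass: groupOfNames
cn: airflow-fr-tech-ops
member: uid=exampleuser,ou=people,dc=wikimedia,dc=org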
root@deploy2002:/srv/deployment-charts/custom_deploy.d/istio# istioctl-1.24.2 manifest apply -f ./dse-k8s/config.yaml
This has now been done for dse-k8s-eqiad.
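For context, the config.yaml handed to istioctl is an IstioOperator manifest; a minimal sketch (hedged, not the actual dse-k8s configuration) looks like:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: minimal          # assumption: the real dse-k8s profile and overrides differ
  meshConfig:
    accessLogFile: /dev/stdout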
Our upgrade notes can be found at https://docs.google.com/document/d/1q7Amw_XSN_Lfb7fCnaSprpW8Z43iMyD4NOD3Lbq2hR4/edit?tab=t.0
All nodes now run Kubernetes 1.31:
root@deploy2002:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
dse-k8s-ctrl1001.eqiad.wmnet Ready control-plane 3y31d v1.31.4
dse-k8s-ctrl1002.eqiad.wmnet Ready control-plane 3y31d v1.31.4
dse-k8s-worker1001.eqiad.wmnet Ready <none> 3y31d v1.31.4
dse-k8s-worker1002.eqiad.wmnet Ready <none> 3y31d v1.31.4
dse-k8s-worker1003.eqiad.wmnet Ready <none> 3y31d v1.31.4
dse-k8s-worker1004.eqiad.wmnet Ready <none> 3y31d v1.31.4
dse-k8s-worker1005.eqiad.wmnet Ready <none> 3y27d v1.31.4
dse-k8s-worker1006.eqiad.wmnet Ready <none> 3y27d v1.31.4
dse-k8s-worker1007.eqiad.wmnet Ready <none> 3y27d v1.31.4
dse-k8s-worker1008.eqiad.wmnet Ready <none> 3y27d v1.31.4
dse-k8s-worker1009.eqiad.wmnet Ready <none> 572d v1.31.4
dse-k8s-worker1010.eqiad.wmnet Ready <none> 310d v1.31.4
dse-k8s-worker1011.eqiad.wmnet Ready <none> 310d v1.31.4
dse-k8s-worker1012.eqiad.wmnet Ready <none> 286d v1.31.4
dse-k8s-worker1013.eqiad.wmnet Ready <none> 286d v1.31.4
dse-k8s-worker1014.eqiad.wmnet Ready <none> 185d v1.31.4
dse-k8s-worker1015.eqiad.wmnet Ready <none> 231d v1.31.4
dse-k8s-worker1016.eqiad.wmnet Ready <none> 231d v1.31.4
dse-k8s-worker1017.eqiad.wmnet Ready <none> 231d v1.31.4
dse-k8s-worker1018.eqiad.wmnet Ready <none> 231d v1.31.4
dse-k8s-worker1019.eqiad.wmnet Ready <none> 231d v1.31.4
dse-k8s-worker1024.eqiad.wmnet Ready <none> 24d v1.31.4
dse-k8s-worker1025.eqiad.wmnet Ready <none> 23d v1.31.4
dse-k8s-worker1026.eqiad.wmnet Ready <none> 8d v1.31.4
dse-k8s-worker1027.eqiad.wmnet Ready <none> 7d21h v1.31.4
dse-k8s-worker1028.eqiad.wmnet Ready <none> 23d v1.31.4
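A compact way to double-check that no node is left on an older kubelet (hedged sketch):
$ kubectl get nodes --no-headers | awk '{print $5}' | sort | uniq -c   # should report a single version, v1.31.4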
Wed, Mar 25
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/aux-k8s-services/kafka-mirrormaker$ kubectl get pod
NAME READY STATUS RESTARTS AGE
kafka-mirrormaker-jumbo-eqiad-to-test-eqiad-766ccfd75b-6gk2t 1/1 Running 0 5d17h
kafka-mirrormaker-logging-codfw-to-jumbo-eqiad-585b57d898-96ddm 1/1 Running 0 5d1h
kafka-mirrormaker-logging-eqiad-to-jumbo-eqiad-76b47659fd-p4gkx 1/1 Running 0 5d1h
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-dj2ml 1/1 Running 0 52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-lzpn2 1/1 Running 0 52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-mx5hq 1/1 Running 0 52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-qnnwk 1/1 Running 0 52m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-sf5tn 1/1 Running 0 52m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-4tt9g 1/1 Running 0 39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-8qg72 1/1 Running 0 39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-9trnl 1/1 Running 0 39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-sz8mp 1/1 Running 0 39m
kafka-mirrormaker-main-eqiad-to-jumbo-eqiad-845b4db85b-txc9v 1/1 Running 0 39m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-g9v2j 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-hw62l 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-jt85b 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-k6lgz 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-rnvqv 1/1 Running 0 45m
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/aux-k8s-services/kafka-mirrormaker$ kube-env kafka-mirrormaker aux-k8s-codfw
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/aux-k8s-services/kafka-mirrormaker$ kubectl get pod
NAME READY STATUS RESTARTS AGE
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-k85d2 1/1 Running 0 51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-nlhr4 1/1 Running 0 51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-qc9q2 1/1 Running 0 51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-r7l8q 1/1 Running 0 51m
kafka-mirrormaker-main-codfw-to-main-eqiad-899b8f95f-x24fk 1/1 Running 0 51m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-447ln 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-gdwkl 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-t84wk 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-tp28b 1/1 Running 0 45m
kafka-mirrormaker-main-eqiad-to-main-codfw-68c4bf8d76-zslrc 1/1 Running 0 45m
Mon, Mar 23
I attempted to deploy the kafka-mirror-main-codfw_to_main-eqiad instance this morning, but got blocked by the fact that the aux clusters are very low on resources. I then realized that we have 6 pending hosts in each DC, each with 48 CPUs and 128GB of RAM (cf T393053 and T393054). @elukey is working on reimaging them to trixie so that we can start adding them to the cluster, after which we should finally be ready to deploy MM to k8s.
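For scale, that pending hardware works out to 6 × 48 = 288 CPUs and 6 × 128 GB = 768 GB of RAM of additional capacity per DC, before accounting for any system reservations.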




