Today

elukey added a comment to T363829: Move cloud's PKI infrastructure to Bullseye/Bookworm.

Moved pki-test01 to Bullseye, I didn't know that dist-upgrade.sh was present in the puppet repo so I've done it manually.

Fri, May 17, 3:37 PM · Infrastructure-Foundations
elukey added a comment to T365253: Allow Kubernetes workers to be deployed on Bookworm.

ML would be very happy to test the 6.x kernel since the GPU drivers are shipped directly with it, so we'd get a nice bump to those as well. I forgot about containerd, right; I'll wait for Alex's approval before doing anything.

Fri, May 17, 2:03 PM · Machine-Learning-Team, serviceops, Kubernetes
elukey updated the task description for T365253: Allow Kubernetes workers to be deployed on Bookworm.
Fri, May 17, 1:50 PM · Machine-Learning-Team, serviceops, Kubernetes
elukey created T365253: Allow Kubernetes workers to be deployed on Bookworm.
Fri, May 17, 1:49 PM · Machine-Learning-Team, serviceops, Kubernetes
elukey claimed T362984: GPU errors in hf image in ml-staging.
Fri, May 17, 1:28 PM · Lift-Wing, Machine-Learning-Team
elukey claimed T363191: Test if we can avoid ROCm debian packages on k8s nodes.
Fri, May 17, 1:28 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T360111: Set automatically libomp's num threads when using Pytorch.

The new endpoint has been rolled out as part of the migration to the mw-int-ro endpoint, task done!

Fri, May 17, 1:28 PM · Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Upgrading to Bookworm is not straightforward since multiple packages need to be built etc., so I filed a bug report with Debian while we wait:

Fri, May 17, 1:11 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T363191: Test if we can avoid ROCm debian packages on k8s nodes.

In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.

So far:

  • amd-k8s-device-plugin was copied to bookworm
  • kubelet is present for bookworm (another version though)
  • rsyslog-kubernetes is not present in bookworm-wikimedia, so we'll need to build it.
Fri, May 17, 12:34 PM · Patch-For-Review, Machine-Learning-Team

Yesterday

elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Finally we found the issue, see https://github.com/ROCm/k8s-device-plugin/issues/65#issuecomment-2115414637

Thu, May 16, 3:41 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T363191: Test if we can avoid ROCm debian packages on k8s nodes.

In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.

Thu, May 16, 3:31 PM · Patch-For-Review, Machine-Learning-Team

Wed, May 15

elukey changed the status of T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends from Open to Stalled.

We rolled out PKI to thanos-fe1001 as a test node, and we observed an increase in CPU usage on Tegola (as anticipated). We are going to work on T344324 before proceeding any further.

Wed, May 15, 2:42 PM · SRE, SRE-swift-storage
elukey changed the status of T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) from Stalled to Open.
Wed, May 15, 2:42 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident
elukey changed the status of T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends, a subtask of T357750: Phase out cergen, from Open to Stalled.
Wed, May 15, 2:42 PM · Patch-For-Review, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
elukey changed the status of T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023), a subtask of T343987: Switch thanos-fe to cfssl, from Stalled to Open.
Wed, May 15, 2:42 PM · Observability-Metrics
elukey changed the status of T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends, a subtask of T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023), from Open to Stalled.
Wed, May 15, 2:42 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident
elukey added a comment to T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023).

@Jgiannelos ahhh I got fooled by the repo, I didn't see that we use one branch for each release, and the git tags fooled me as well.

Wed, May 15, 2:38 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident
elukey added a comment to T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023).

We rolled out the CFSSL/PKI cert to thanos-fe1001, one of 4 eqiad nodes, and from the CPU graphs the usage seems to have gone up by roughly +50ms. There is no constant throttling and the app seems to be working fine. The main issue is what happens with all 4 nodes on PKI, and whether we'll need more in the future.

Wed, May 15, 2:19 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Opened https://github.com/ROCm/k8s-device-plugin/issues/65

Wed, May 15, 10:34 AM · Lift-Wing, Machine-Learning-Team

Tue, May 14

elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Even better:

Tue, May 14, 4:55 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Following advice from Janis, I tried on ml-staging2001:

Tue, May 14, 4:54 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Janis from ServiceOps suggested that maybe seccomp or apparmor are playing a role in this.

Tue, May 14, 3:37 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023).

I took the time to re-read the whole task, and one thing that I missed was the fact that after a lot of upgrades we may be in a different position with the current version of Tegola (namely, the problem may not be present anymore, or in a different form).

Tue, May 14, 1:02 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident
elukey added a comment to T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023).

I also found this interesting project that explains the issue very well: https://github.com/Kriechi/aws-s3-reverse-proxy/blob/master/README.md

Tue, May 14, 12:49 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident

Mon, May 13

elukey added a comment to T362984: GPU errors in hf image in ml-staging.

This is totally strange:

Mon, May 13, 4:18 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Ok finally something that is consistent: NLLB with pytorch 2.2.1 and ROCm 5.7 shows:

Mon, May 13, 3:46 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T352647: Move Cassandra clusters to PKI.

Dropped all cassandra-ca old certs from puppet private.

Mon, May 13, 3:36 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

It seems not to be related to the OS, since nllb-gpu on Bookworm ran fine on ml-staging2001 (with the GPU).

Mon, May 13, 2:54 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023).

To keep the archives happy: I tried to set up a local sidecar in staging (I think it was attempted before, but I didn't find task entries, sorry) and I got:

Mon, May 13, 1:43 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident

Fri, May 10

elukey updated the task description for T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.
Fri, May 10, 4:22 PM · serviceops, Data-Persistence
elukey added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

I'd vote to add the mesh support/configuration at this point, it seems less risky and error prone than allowing kask to reload TLS certs. The only concern would be the extra latency involved, but in theory it shouldn't be heavy (it adds an extra hop/TCP connection to localhost; we can measure the impact in staging and decide).

Fri, May 10, 4:21 PM · serviceops, Data-Persistence
elukey committed rMLIS96dcea8dede1: llm: update to Bookworm.
llm: update to Bookworm
Fri, May 10, 12:37 PM
elukey committed rLPRI35e9aafd7118: Delete the Cassandra directory in secrets.
Delete the Cassandra directory in secrets
Fri, May 10, 11:59 AM
elukey committed rLPRI7125f3af894a: Add fake TLS keystore password for Cassandra clusters.
Add fake TLS keystore password for Cassandra clusters
Fri, May 10, 11:59 AM
elukey committed rADMK9af003412300: Release new version.
Release new version
Fri, May 10, 10:48 AM

Thu, May 9

elukey added a comment to T362984: GPU errors in hf image in ml-staging.

On ml-staging2001 I checked the pod's details (via docker inspect) and found:

Thu, May 9, 5:03 PM · Lift-Wing, Machine-Learning-Team
Eevans awarded T362181: Encrypt Airflow connections to AQS Cassandra a Cookie token.
Thu, May 9, 3:19 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra
elukey added a comment to T352647: Move Cassandra clusters to PKI.

The doc in https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates seems to be already up to date, I added a note about the deprecation of the cassandra ca-manager way of getting TLS certs.

Thu, May 9, 1:58 PM · Patch-For-Review, Data-Persistence, Cassandra

Wed, May 8

elukey committed rADMKc9fc3459dd19: pristine-tar data for amd-k8s-device-plugin_1.25.2.8.orig.tar.gz.
pristine-tar data for amd-k8s-device-plugin_1.25.2.8.orig.tar.gz
Wed, May 8, 4:34 PM
elukey committed rADMK21a50b7cb4eb: pristine-tar data for amd-k8s-device-plugin_1.25.2.8.orig.tar.gz.
pristine-tar data for amd-k8s-device-plugin_1.25.2.8.orig.tar.gz
Wed, May 8, 4:34 PM
elukey committed rADMKe8fd9ae2a1d6: Merge tag 'upstream/1.25.2.8'.
Merge tag 'upstream/1.25.2.8'
Wed, May 8, 4:34 PM
elukey committed rADMK6266b42105a5: New upstream version 1.25.2.8.
New upstream version 1.25.2.8
Wed, May 8, 4:34 PM
elukey added a comment to T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023).

Stalled until T356412 is picked up by Data-Persistence

Wed, May 8, 1:57 PM · Patch-For-Review, Content-Transform-Team, serviceops, Essential-Work, Wikimedia-Incident
elukey added a comment to T360414: Phase out cergen for Observability services.

Also cc T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends and @elukey since the thanos-fe work here will help with that task too

Wed, May 8, 10:18 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Tue, May 7

elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Tried https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU#Reset_the_GPU_state and killed/restarted the mistral pod, just as a test to see if anything was in a weird state, but same error.

Tue, May 7, 4:16 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

A lot of useful info in https://en.wikipedia.org/wiki/Direct_Rendering_Manager, which also mentions DRM-Auth and what it does.

Tue, May 7, 4:08 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

ms-fe1009's envoy migrated to PKI! We'll wait a couple of days before proceeding with either eqiad or codfw.

Tue, May 7, 3:14 PM · SRE, SRE-swift-storage
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Still seeing the old issue with ROCm 5.6:

Tue, May 7, 2:45 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362181: Encrypt Airflow connections to AQS Cassandra.

Niceee thanks a lot for all the work!

Tue, May 7, 2:36 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra
elukey added a comment to T362663: Add slow-logs for ML isvcs.

All the revscoring Docker images running in production now log the request id (associated with the related x-request-id header). This turned out to be sufficient to figure out how to reproduce traffic logged in the kserve access logs.
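
A minimal sketch of the idea described above, not the actual revscoring code: read the request id from the x-request-id header and log it next to the inputs, so a line in the access logs can be matched to the payload that produced it. The function names, the logger name and the UUID fallback for a missing header are assumptions made for illustration.

```python
import logging
import uuid

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("model_server")


def get_request_id(headers):
    """Return the x-request-id header value, matching the name case-insensitively.

    Falls back to a freshly generated UUID when the header is absent
    (an assumption of this sketch, not something the task states).
    """
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    return str(uuid.uuid4())


def log_inputs(headers, inputs):
    """Log the request id together with the request inputs and return the message."""
    request_id = get_request_id(headers)
    message = f"request_id={request_id} inputs={inputs!r}"
    logger.info(message)
    return message
```

With something like this in place, grepping the service logs for a request id seen in the access logs would give back the inputs needed to replay that request.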

Tue, May 7, 1:56 PM · Machine-Learning-Team, ORES
elukey committed rMLIS74989e177ece: huggingface: upgrade base image.
huggingface: upgrade base image
Tue, May 7, 1:50 PM

Mon, May 6

elukey claimed T363829: Move cloud's PKI infrastructure to Bullseye/Bookworm.
Mon, May 6, 2:26 PM · Infrastructure-Foundations
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Ah wow my bad! I inspected the docker image and it contains a ton of Nvidia binaries. I'll review the install procedure again, really sneaky.

Mon, May 6, 1:50 PM · Lift-Wing, Machine-Learning-Team
elukey moved T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Mon, May 6, 1:32 PM · Machine-Learning-Team
elukey added a comment to T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.

The changes have been successfully deployed on all Lift Wing clusters.

Mon, May 6, 1:31 PM · Machine-Learning-Team
elukey changed the point value for T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing from 1 to 10.
Mon, May 6, 1:31 PM · Machine-Learning-Team
elukey moved T362316: Migrate ml-services to mw-api-int from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Mon, May 6, 1:31 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey set the point value for T362316: Migrate ml-services to mw-api-int to 5.
Mon, May 6, 1:30 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T362316: Migrate ml-services to mw-api-int.

And eqiad migrated as well, all done :)

Mon, May 6, 1:30 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T362984: GPU errors in hf image in ml-staging.
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1
Mon, May 6, 10:38 AM · Lift-Wing, Machine-Learning-Team
elukey updated subscribers of T363449: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons.

Hi Kevin! So https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_ng/values/ml-serve.yaml#L340 is the point to add the new config, I'd say commons.wikimedia.org should suffice. The endpoint is served by MediaWiki Appservers, so in my opinion we can just expand the list of available/allowed Host headers safely.

Mon, May 6, 9:36 AM · Patch-For-Review, Machine-Learning-Team

Fri, May 3

elukey added a comment to T362984: GPU errors in hf image in ml-staging.
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1
Fri, May 3, 7:05 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

This should be the diff between the libdrm bullseye (2.4.104) and bookworm (2.4.114) versions:

Fri, May 3, 4:17 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T363191: Test if we can avoid ROCm debian packages on k8s nodes.

After a chat with Tobias, we are going to test this:

Fri, May 3, 2:42 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T363191: Test if we can avoid ROCm debian packages on k8s nodes.

https://packages.debian.org/bookworm/rocm-smi
https://packages.debian.org/source/bookworm/rocm-smi-lib

Fri, May 3, 2:40 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T363191: Test if we can avoid ROCm debian packages on k8s nodes.
elukey@stat1010:~$ dpkg -S rocm-smi 
rocm-smi-lib: /opt/rocm-5.4.0/bin/rocm-smi
Fri, May 3, 2:26 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

I have also reviewed the non-cpXXXX IPs found in netstat on ms-fe nodes, they all seem to belong to the thumbor pods, which should be using the mesh k8s module to contact swift (so already configured with the Root PKI CA bundle etc.). This means that moving ms-fe nodes to PKI should not cause any TLS validation failure on the thumbor pod front. I think we can safely assume that the same applies to the Apache Traffic Server on cpXXXX nodes, but we can double check with traffic just to be sure (adding them to the code reviews).

Fri, May 3, 1:46 PM · SRE, SRE-swift-storage
elukey added a comment to T362316: Migrate ml-services to mw-api-int.

Status: Lift Wing codfw has been migrated successfully, we are going to do eqiad on Monday 6th.

Fri, May 3, 1:36 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T363449: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons.

Hi Kevin! You have two options:

Fri, May 3, 12:32 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

Hi! Trying to answer inline, Chris can chime in if I miss anything and/or if I write something totally off :)

Fri, May 3, 12:22 PM · SRE, SRE-swift-storage

Thu, May 2

elukey updated subscribers of T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

To summarize previous discussions: we are currently relying on a TLS cert emitted by the puppet CA via cergen, a tool that we are trying to deprecate (see T357750).

Thu, May 2, 4:13 PM · SRE, SRE-swift-storage
elukey added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@Eevans IIUC kask terminates TLS by itself for session store, is that right? Would it be a problem to move to the mesh k8s module, namely to use the envoy sidecar that terminates TLS and proxies the request (in this case, kask would listen on a plaintext localhost:port combination)? I am asking since we could move to PKI directly if kask's chart used the mesh module.

Thu, May 2, 3:50 PM · serviceops, Data-Persistence
elukey added a comment to T352647: Move Cassandra clusters to PKI.

All clusters on PKI!

Thu, May 2, 3:42 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey updated the task description for T352647: Move Cassandra clusters to PKI.
Thu, May 2, 3:41 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

The cert is here:

Thu, May 2, 2:25 PM · serviceops, Data-Persistence
elukey updated the task description for T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.
Thu, May 2, 1:25 PM · serviceops, Data-Persistence
elukey created T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.
Thu, May 2, 11:49 AM · serviceops, Data-Persistence

Tue, Apr 30

elukey edited P61497 (An Untitled Masterwork).
Tue, Apr 30, 4:04 PM
elukey created P61497 (An Untitled Masterwork).
Tue, Apr 30, 4:04 PM
elukey moved T362316: Migrate ml-services to mw-api-int from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Apr 30, 2:14 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey claimed T362316: Migrate ml-services to mw-api-int.
Tue, Apr 30, 2:13 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T362316: Migrate ml-services to mw-api-int.

All changes rebased and ready to go (for prod). The main idea is the following:

Tue, Apr 30, 1:52 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey created T363829: Move cloud's PKI infrastructure to Bullseye/Bookworm.
Tue, Apr 30, 1:26 PM · Infrastructure-Foundations
elukey added a comment to T352647: Move Cassandra clusters to PKI.

Restbase done!

Tue, Apr 30, 1:14 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey updated the task description for T352647: Move Cassandra clusters to PKI.
Tue, Apr 30, 12:58 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey committed rMSCP0de70bcb10fc: test: remove not needed JSON.parse causing failures.
test: remove not needed JSON.parse causing failures
Tue, Apr 30, 12:21 AM

Mon, Apr 29

elukey added a comment to T362316: Migrate ml-services to mw-api-int.

After a lot of tests and config changes, we are almost ready to proceed with prod. Hopefully we'll get to it on April 2nd.

Mon, Apr 29, 4:02 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.

Opened T363725 for the redirects, as it can be tackled separately.

Mon, Apr 29, 4:00 PM · Machine-Learning-Team
elukey created T363725: Patch Location headers of HTTP redirects coming from the MW API in Lift Wing services.
Mon, Apr 29, 3:59 PM · Machine-Learning-Team
elukey added a comment to T363046: changeprop ORES tests failing.

Untagging ML since this is an issue with the nodejs code, not ORES etc. Filed a patch to fix it, lemme know :)

Mon, Apr 29, 2:43 PM · ORES, ChangeProp
elukey removed a project from T363046: changeprop ORES tests failing: Machine-Learning-Team.
Mon, Apr 29, 2:42 PM · ORES, ChangeProp

Fri, Apr 26

elukey added a comment to T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.

Test in staging has been done, and it was successful! All the revscoring services are now running without WIKI_URL set explicitly.

Fri, Apr 26, 4:39 PM · Machine-Learning-Team
elukey added a comment to T352647: Move Cassandra clusters to PKI.

Current status:

Fri, Apr 26, 4:03 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey committed rMLISe2696604f37a: outlink_topic_model: fix formatting of README.md to please CI.
outlink_topic_model: fix formatting of README.md to please CI
Fri, Apr 26, 1:36 PM

Wed, Apr 24

elukey committed rMLISaeaee66b05b3: revscoring_model: log request_id alongside with inputs.
revscoring_model: log request_id alongside with inputs
Wed, Apr 24, 3:38 PM

Tue, Apr 23

elukey added a comment to T363191: Test if we can avoid ROCm debian packages on k8s nodes.

The only issue that I see from puppet is that prometheus::node_amd_rocm uses rocm-smi to get info about which GPU to monitor.

Tue, Apr 23, 4:57 PM · Patch-For-Review, Machine-Learning-Team
elukey created T363191: Test if we can avoid ROCm debian packages on k8s nodes.
Tue, Apr 23, 4:51 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T362984: GPU errors in hf image in ml-staging.

Quick clarification - there are currently two places where we use ROCm-specific libs:

Tue, Apr 23, 4:48 PM · Lift-Wing, Machine-Learning-Team
elukey added a comment to T362181: Encrypt Airflow connections to AQS Cassandra.

Also I confirm that AQS Cassandra now runs with PKI TLS certs, so we can start encrypting connections with TLS anytime.

Tue, Apr 23, 3:24 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra
elukey added a comment to T362181: Encrypt Airflow connections to AQS Cassandra.

Filed a change for the stat nodes, the hadoop worker nodes already have the truststore!

Tue, Apr 23, 3:24 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Data-Engineering, Data-Persistence, Cassandra