Moved pki-test01 to Bullseye. I didn't know that dist-upgrade.sh was present in the puppet repo, so I did it manually.
Today
ML would be very happy to test the 6.x kernel since the GPU drivers are shipped directly with it, so we'd get a nice bump to those as well. You're right, I forgot about containerd; I'll wait for Alex's approval before doing anything.
The new endpoint has been rolled out as part of the migration to the mw-int-ro endpoint, task done!
Upgrading to Bookworm is not straightforward since multiple packages need to be built etc., so I filed a bug report with Debian while we wait:
In T363191#9805400, @elukey wrote: In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.
So far:
- amd-k8s-device-plugin was copied to bookworm
- kubelet is present for bookworm (a different version, though)
- rsyslog-kubernetes is not present in bookworm-wikimedia, so we'll need to build it (a quick way to check per-suite availability is sketched below).
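A quick way to double check what each suite actually ships, assuming a Bookworm host that already has the bookworm-wikimedia component in its apt sources (package names as listed above):
$ apt-cache madison kubelet              # every available version and the suite/repo it comes from
$ apt-cache policy rsyslog-kubernetes    # candidate version, or an "unable to locate" note if the package is missing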
Yesterday
Finally we found the issue, see https://github.com/ROCm/k8s-device-plugin/issues/65#issuecomment-2115414637
In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.
Wed, May 15
We rolled out PKI to thanos-fe1001 as a test node, and we observed an increase in CPU usage on Tegola (as anticipated). We are going to work on T344324 before proceeding any further.
@Jgiannelos ahhh I got fooled by the repo, I didn't see that we use one branch for each release... and the git tags fooled me as well.
We rolled out the CFSSL/PKI cert to thanos-fe1001, one of the 4 eqiad nodes, and from the CPU graphs the usage seems to have gone up by roughly +50ms. No constant throttling, and the app seems to be working fine. The main question is what happens once all 4 nodes are on PKI, and whether we'll need more in the future.
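For the record, a quick way to confirm which CA issued the cert that the frontend is now serving (the FQDN and port are my assumption of the TLS endpoint, adjust as needed):
$ echo | openssl s_client -connect thanos-fe1001.eqiad.wmnet:443 2>/dev/null | openssl x509 -noout -issuer -enddate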
Tue, May 14
Even better:
Following a suggestion from Janis, I tried on ml-staging2001:
Janis from ServiceOps suggested that maybe seccomp or apparmor are playing a role in this.
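To check that hypothesis, one way is to look at the seccomp/apparmor settings the container is actually running with (the container id is just a placeholder):
$ docker inspect --format '{{.HostConfig.SecurityOpt}}' <container-id>   # seccomp/apparmor options set at container creation
$ docker inspect --format '{{.AppArmorProfile}}' <container-id>          # apparmor profile applied to the container
$ grep Seccomp: /proc/$(docker inspect --format '{{.State.Pid}}' <container-id>)/status   # 0 = disabled, 2 = filter mode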
I took the time to re-read the whole task, and one thing that I had missed is that after so many upgrades we may be in a different position with the current version of Tegola (namely, the problem may not be present anymore, or it may show up in a different form).
I also found this interesting project that explains the issue very well: https://github.com/Kriechi/aws-s3-reverse-proxy/blob/master/README.md
Mon, May 13
This is totally strange:
Ok finally something that is consistent: NLLB with pytorch 2.2.1 and ROCm 5.7 shows:
Dropped all the old cassandra-ca certs from puppet private.
It seems not to be related to the OS, since nllb-gpu on Bookworm ran fine on ml-staging2001 (with the GPU).
To keep the archives happy: I tried to set up a local sidecar in staging (I think it was attempted before, but I couldn't find task entries, sorry) and I got:
Fri, May 10
I'd vote to add the mesh support/configuration at this point; it seems less risky and error-prone than allowing kask to reload TLS certs. The only concern would be the extra latency involved, but in theory it shouldn't be heavy (it adds an extra hop/TCP connection to localhost; we can measure the impact in staging and decide).
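As a rough sketch of how the extra hop could be measured in staging (ports and the health endpoint are placeholders, one for the app's plaintext listener and one for the envoy sidecar):
$ curl -s -o /dev/null -w 'app direct: %{time_total}s\n' http://localhost:8081/healthz
$ curl -sk -o /dev/null -w 'via envoy:  %{time_total}s\n' https://localhost:8443/healthz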
Thu, May 9
On ml-staging2001 I checked the pod's details (via docker inspect) and found:
The doc at https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates already seems up to date; I added a note about the deprecation of the cassandra ca-manager way of getting TLS certs.
Wed, May 8
In T344324#9507114, @jijiki wrote: Stalled until T356412 is picked up by Data-Persistence
In T360414#9779961, @fgiunchedi wrote: Also cc T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends and @elukey since the thanos-fe work here will help with that task too
Tue, May 7
Tried https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU#Reset_the_GPU_state and killed/restarted the mistral pod, just as a test to see if anything was in a weird state, but same error.
A lot of useful info in https://en.wikipedia.org/wiki/Direct_Rendering_Manager; it also mentions DRM-Auth and what it does.
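Side note: on a GPU host the DRM device nodes can be inspected directly; render nodes were introduced precisely so that clients can use the GPU without going through DRM-Auth (device names and the user below are placeholders, they may differ per host):
$ ls -l /dev/dri/   # card* are the DRM master nodes, renderD* the render nodes
$ id <user>         # check whether the user is in the render/video groups that own those nodes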
ms-fe1009's envoy migrated to PKI! We'll wait a couple of days before proceeding with either eqiad or codfw.
Still seeing the old issue with ROCm 5.6:
Niceee thanks a lot for all the work!
All the revscoring Docker images running in production now log the request id (associated with the related x-request-id header). This turned out to be sufficient to figure out how to reproduce traffic logged in the kserve access logs.
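For example, once an interesting request id shows up in the access logs, the same call can be replayed against the service with that id attached (endpoint, Host header, model name and payload below are placeholders):
$ curl -s https://<lift-wing-endpoint>/v1/models/<model>:predict \
    -H 'Host: <model-host-header>' \
    -H 'x-request-id: <id-from-the-access-logs>' \
    -H 'Content-Type: application/json' \
    -d '{"rev_id": 12345}'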
Mon, May 6
Ah wow, my bad! I inspected the docker image and it contains a ton of Nvidia binaries. I'll review the install procedure again, really sneaky.
The changes have been successfully deployed on all Lift Wing clusters.
And eqiad migrated as well, all done :)
In T362984#9768972, @elukey wrote: == Step 2: publishing == Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1
Hi Kevin! So https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_ng/values/ml-serve.yaml#L340 is the place to add the new config; I'd say commons.wikimedia.org should suffice. The endpoint is served by the MediaWiki appservers, so in my opinion we can safely expand the list of available/allowed Host headers.
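Once the extra Host header is allowed, a quick sanity check could be something like this (the endpoint below is a placeholder for the internal MW API entry point used by the mesh):
$ curl -s 'https://<mw-api-int-ro-endpoint>/w/api.php?action=query&meta=siteinfo&format=json' \
    -H 'Host: commons.wikimedia.org' | head -c 200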
Fri, May 3
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1
This should be the diff between the libdrm bullseye (2.4.104) and bookworm (2.4.114) versions:
After a chat with Tobias, we are going to test this:
elukey@stat1010:~$ dpkg -S rocm-smi
rocm-smi-lib: /opt/rocm-5.4.0/bin/rocm-smi
I have also reviewed the non-cpXXXX IPs found in netstat on the ms-fe nodes; they all seem to belong to the thumbor pods, which should be using the mesh k8s module to contact swift (so already configured with the Root PKI CA bundle etc.). This means that moving the ms-fe nodes to PKI shouldn't cause any TLS validation failure on the thumbor pod front. I think we can safely assume that the same applies to the Apache Traffic Server on the cpXXXX nodes, but we can double check with Traffic just to be sure (adding them to the code reviews).
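For extra peace of mind, the trust chain can also be verified by hand from a client host against the CA bundle the clients are configured with (bundle path and port are placeholders):
$ echo | openssl s_client -connect ms-fe1009.eqiad.wmnet:443 \
    -CAfile /path/to/root-pki-ca-bundle.pem 2>/dev/null | grep 'Verify return code'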
Status: Lift Wing codfw has been migrated successfully; we are going to do eqiad on Monday the 6th.
Hi Kevin! You have two options:
Hi! Trying to answer inline; Chris can chime in if I miss anything and/or if I write something totally off :)
Thu, May 2
To summarize previous discussions: we are currently relying on a TLS cert emitted by the puppet CA via cergen, a tool that we are trying to deprecate (see T357750).
@Eevans IIUC kask terminates TLS by itself for the session store, is that right? Would it be a problem to move to the mesh k8s module, namely to use the envoy sidecar that terminates TLS and proxies the request (in this case, kask would listen on a plaintext localhost:port combination)? I am asking since we could move to PKI directly if kask's chart used the mesh module.
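To illustrate the idea (ports and the health endpoint are made up): with the mesh module the app only answers in plaintext on localhost, while the envoy sidecar owns the TLS port that other services reach, so from inside the pod the two listeners would look roughly like:
$ curl -s http://127.0.0.1:8081/healthz     # kask itself, plaintext, localhost only
$ curl -sk https://127.0.0.1:8443/healthz   # envoy sidecar, TLS, the port exposed to clients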
All clusters on PKI!
The cert is here:
Tue, Apr 30
All changes rebased and ready to go (for prod). The main idea is the following:
Restbase done!
Mon, Apr 29
After a lot of tests and config changes, we are almost ready to proceed with prod. Hopefully we'll get to it on May 2nd.
Opened T363725 for the redirects, as it can be tackled separately.
Untagging ML since this is an issue with the nodejs code, not ORES etc. Filed a patch to fix it, lemme know :)
Fri, Apr 26
Test in staging has been done, and it was successful! All the revscoring services are now running without WIKI_URL set explicitly.
Current status:
Wed, Apr 24
Tue, Apr 23
The only issue that I see from puppet is that prometheus::node_amd_rocm uses rocm-smi to get info about which GPU to monitor.
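For reference, this is roughly the kind of query involved (path taken from the stat1010 output above; the exact flags depend on the installed ROCm version, so treat it as a sketch):
$ /opt/rocm-5.4.0/bin/rocm-smi                     # summary table, one row per detected GPU
$ /opt/rocm-5.4.0/bin/rocm-smi --showproductname   # map each GPU id to its product name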
Quick clarification - there are currently two places where we use ROCm-specific libs:
Also, I confirm that AQS Cassandra now runs with PKI TLS certs, so we can start encrypting TLS connections anytime.
Filed a change for the stat nodes; the hadoop worker nodes already have the truststore!
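If needed, the truststore content can be double checked directly on a node (the path is a placeholder, the real one comes from puppet; keytool will prompt for the store password):
$ keytool -list -keystore /path/to/truststore.jks   # lists the trusted CA entries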