Moved pki-test01 to Bullseye. I didn't know that dist-upgrade.sh was present in the puppet repo, so I did it manually.
Today
ML would be very happy to test the 6.x kernel since the GPU drivers are shipped directly with it, so we'd get a nice bump to those as well. You're right, I forgot about containerd; I'll wait for Alex's approval before doing anything.
The new endpoint has been rolled out as part of the migration to the mw-int-ro endpoint, task done!
Upgrading to Bookworm is not straightforward since multiple packages need to be built etc., so I filed a bug report with Debian while we wait:
In T363191#9805400, @elukey wrote: In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.
So far:
- amd-k8s-device-plugin was copied to bookworm
- kubelet is present for bookworm (a different version, though)
- rsyslog-kubernetes is not present in bookworm-wikimedia, so we'll need to build it (a quick way to check per-suite availability is sketched below).
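A quick way to double check what each suite actually ships, assuming a Bookworm host that already has the bookworm-wikimedia component in its apt sources (package names as listed above):
$ apt-cache madison kubelet              # every available version and the suite/repo it comes from
$ apt-cache policy rsyslog-kubernetes    # candidate version, or an "unable to locate" note if the package is missing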
Yesterday
Finally we found the issue, see https://github.com/ROCm/k8s-device-plugin/issues/65#issuecomment-2115414637
In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.
Wed, May 15
We rolled out PKI to thanos-fe1001 as a test node, and we observed an increase in CPU usage on Tegola (as anticipated). We are going to work on T344324 before proceeding any further.
@Jgiannelos ahhh I got fooled by the repo, I didn't see that we use one branch for each release... and the git tags fooled me as well.
We rolled out the CFSSL/PKI cert to thanos-fe1001, one of the 4 eqiad nodes, and from the CPU graphs the usage seems to have gone up by roughly +50ms. No constant throttling, and the app seems to be working fine. The main question is what happens once all 4 nodes are on PKI, and whether we'll need more in the future.
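For the record, a quick way to confirm which CA issued the cert that the frontend is now serving (the FQDN and port are my assumption of the TLS endpoint, adjust as needed):
$ echo | openssl s_client -connect thanos-fe1001.eqiad.wmnet:443 2>/dev/null | openssl x509 -noout -issuer -enddate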
Tue, May 14
Even better:
Following a suggestion from Janis, I tried on ml-staging2001:
Janis from ServiceOps suggested that maybe seccomp or apparmor are playing a role in this.
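To check that hypothesis, one way is to look at the seccomp/apparmor settings the container is actually running with (the container id is just a placeholder):
$ docker inspect --format '{{.HostConfig.SecurityOpt}}' <container-id>   # seccomp/apparmor options set at container creation
$ docker inspect --format '{{.AppArmorProfile}}' <container-id>          # apparmor profile applied to the container
$ grep Seccomp: /proc/$(docker inspect --format '{{.State.Pid}}' <container-id>)/status   # 0 = disabled, 2 = filter mode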
I took the time to re-read the whole task, and one thing that I had missed is that after so many upgrades we may be in a different position with the current version of Tegola (namely, the problem may not be present anymore, or it may show up in a different form).
I also found this interesting project that explains the issue very well: https://github.com/Kriechi/aws-s3-reverse-proxy/blob/master/README.md
Mon, May 13
This is totally strange:
Ok finally something that is consistent: NLLB with pytorch 2.2.1 and ROCm 5.7 shows:
Dropped all the old cassandra-ca certs from puppet private.
It seems not to be related to the OS, since nllb-gpu on Bookworm ran fine on ml-staging2001 (with the GPU).
To keep the archives happy: I tried to set up a local sidecar in staging (I think it was attempted before, but I couldn't find task entries, sorry) and I got:
Fri, May 10
I'd vote to add the mesh support/configuration at this point; it seems less risky and error-prone than allowing kask to reload TLS certs. The only concern would be the extra latency involved, but in theory it shouldn't be heavy (it adds an extra hop/TCP connection to localhost; we can measure the impact in staging and decide).
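As a rough sketch of how the extra hop could be measured in staging (ports and the health endpoint are placeholders, one for the app's plaintext listener and one for the envoy sidecar):
$ curl -s -o /dev/null -w 'app direct: %{time_total}s\n' http://localhost:8081/healthz
$ curl -sk -o /dev/null -w 'via envoy:  %{time_total}s\n' https://localhost:8443/healthz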
Thu, May 9
On ml-staging2001 I checked the pod's details (via docker inspect) and found:
The doc at https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates already seems up to date; I added a note about the deprecation of the cassandra ca-manager way of getting TLS certs.
Wed, May 8
In T344324#9507114, @jijiki wrote: Stalled until T356412 is picked up by Data-Persistence
In T360414#9779961, @fgiunchedi wrote: Also cc T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends and @elukey since the thanos-fe work here will help with that task too
Tue, May 7
Tried https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU#Reset_the_GPU_state and killed/restarted the mistral pod, just as a test to see if anything was in a weird state, but same error.
A lot of useful info in https://en.wikipedia.org/wiki/Direct_Rendering_Manager; it also mentions DRM-Auth and what it does.
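Side note: on a GPU host the DRM device nodes can be inspected directly; render nodes were introduced precisely so that clients can use the GPU without going through DRM-Auth (device names and the user below are placeholders, they may differ per host):
$ ls -l /dev/dri/   # card* are the DRM master nodes, renderD* the render nodes
$ id <user>         # check whether the user is in the render/video groups that own those nodes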
ms-fe1009's envoy migrated to PKI! We'll wait a couple of days before proceeding with either eqiad or codfw.
Still seeing the old issue with ROCm 5.6:
Niceee thanks a lot for all the work!
All the revscoring Docker images running in production now log the request id (associated with the related x-request-id header). This turned out to be sufficient to figure out how to reproduce traffic logged in the kserve access logs.
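For example, once an interesting request id shows up in the access logs, the same call can be replayed against the service with that id attached (endpoint, Host header, model name and payload below are placeholders):
$ curl -s https://<lift-wing-endpoint>/v1/models/<model>:predict \
    -H 'Host: <model-host-header>' \
    -H 'x-request-id: <id-from-the-access-logs>' \
    -H 'Content-Type: application/json' \
    -d '{"rev_id": 12345}'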
Mon, May 6
Ah wow, my bad! I inspected the docker image and it contains a ton of Nvidia binaries. I'll review the install procedure again, really sneaky.
The changes have been successfully deployed on all Lift Wing clusters.
And eqiad migrated as well, all done :)
In T362984#9768972, @elukey wrote: == Step 2: publishing == Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1
Hi Kevin! So https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_ng/values/ml-serve.yaml#L340 is the place to add the new config; I'd say commons.wikimedia.org should suffice. The endpoint is served by the MediaWiki appservers, so in my opinion we can safely expand the list of available/allowed Host headers.
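Once the extra Host header is allowed, a quick sanity check could be something like this (the endpoint below is a placeholder for the internal MW API entry point used by the mesh):
$ curl -s 'https://<mw-api-int-ro-endpoint>/w/api.php?action=query&meta=siteinfo&format=json' \
    -H 'Host: commons.wikimedia.org' | head -c 200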
Fri, May 3
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1
This should be the diff between the libdrm bullseye (2.4.104) and bookworm (2.4.114) versions:
After a chat with Tobias, we are going to test this:
elukey@stat1010:~$ dpkg -S rocm-smi
rocm-smi-lib: /opt/rocm-5.4.0/bin/rocm-smi
I have also reviewed the non-cpXXXX IPs found in netstat on the ms-fe nodes; they all seem to belong to the thumbor pods, which should be using the mesh k8s module to contact swift (so already configured with the Root PKI CA bundle etc.). This means that moving the ms-fe nodes to PKI shouldn't cause any TLS validation failure on the thumbor pod front. I think we can safely assume that the same applies to the Apache Traffic Server on the cpXXXX nodes, but we can double check with Traffic just to be sure (adding them to the code reviews).
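For extra peace of mind, the trust chain can also be verified by hand from a client host against the CA bundle the clients are configured with (bundle path and port are placeholders):
$ echo | openssl s_client -connect ms-fe1009.eqiad.wmnet:443 \
    -CAfile /path/to/root-pki-ca-bundle.pem 2>/dev/null | grep 'Verify return code'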
Status: Lift Wing codfw has been migrated successfully; we are going to do eqiad on Monday the 6th.
Hi Kevin! You have two options:
Hi! Trying to answer inline; Chris can chime in if I miss anything and/or if I write something totally off :)
Thu, May 2
To summarize previous discussions: we are currently relying on a TLS cert emitted by the puppet CA via cergen, a tool that we are trying to deprecate (see T357750).
@Eevans IIUC kask terminates TLS by itself for the session store, is that right? Would it be a problem to move to the mesh k8s module, namely to use the envoy sidecar that terminates TLS and proxies the request (in this case, kask would listen on a plaintext localhost:port combination)? I am asking since we could move to PKI directly if kask's chart used the mesh module.
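To illustrate the idea (ports and the health endpoint are made up): with the mesh module the app only answers in plaintext on localhost, while the envoy sidecar owns the TLS port that other services reach, so from inside the pod the two listeners would look roughly like:
$ curl -s http://127.0.0.1:8081/healthz     # kask itself, plaintext, localhost only
$ curl -sk https://127.0.0.1:8443/healthz   # envoy sidecar, TLS, the port exposed to clients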
All clusters on PKI!
The cert is here:
Tue, Apr 30
All changes rebased and ready to go (for prod). The main idea is the following:
Restbase done!
Mon, Apr 29
After a lot of tests and config changes, we are almost ready to proceed with prod. Hopefully we'll get to it on May 2nd.
Opened T363725 for the redirects, as it can be tackled separately.
Untagging ML since this is an issue with the nodejs code, not ORES etc. Filed a patch to fix it, lemme know :)
Fri, Apr 26
Test in staging has been done, and it was successful! All the revscoring services are now running without WIKI_URL set explicitly.
Current status:
Wed, Apr 24
Tue, Apr 23
The only issue that I see from puppet is that prometheus::node_amd_rocm uses rocm-smi to get info about which GPU to monitor.
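For reference, this is roughly the kind of query involved (path taken from the stat1010 output above; the exact flags depend on the installed ROCm version, so treat it as a sketch):
$ /opt/rocm-5.4.0/bin/rocm-smi                     # summary table, one row per detected GPU
$ /opt/rocm-5.4.0/bin/rocm-smi --showproductname   # map each GPU id to its product name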
Quick clarification - there are currently two places where we use ROCm-specific libs:
Also, I confirm that AQS Cassandra now runs with PKI TLS certs, so we can start encrypting TLS connections anytime.
Filed a change for the stat nodes; the hadoop worker nodes already have the truststore!
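If needed, the truststore content can be double checked directly on a node (the path is a placeholder, the real one comes from puppet; keytool will prompt for the store password):
$ keytool -list -keystore /path/to/truststore.jks   # lists the trusted CA entries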