Wed, Jun 19
For anyone who wants to build the above binary from the Debian sources:
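This is roughly the standard Debian source-build workflow (a sketch only; the package name below is a placeholder for whatever source package the binary comes from):

$ apt-get source <package>           # fetch the sources (requires deb-src lines in sources.list)
$ sudo apt-get build-dep <package>   # install the build dependencies
$ cd <package>-<version>/
$ dpkg-buildpackage -us -uc -b       # build unsigned binary packages into the parent directory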
Machine is imaged and running. The PXE boot was "fixed" by an ugly hack mentioned in T304483#9906962. While the firmware problem remains, at least we are unblocked on this host.
So today I wanted to install ml-staging2003. This is a new SMC hardware type, and it hits this problem.
Tue, Jun 18
It looks like the primary interface can't see the network (the console shows "media test failure, check cable").
Machine is drained and off, so you're free to reseat memory etc. Let me know when it's back (and what we might be able to do if the memory remains problematic).
Current state:
Mon, Jun 17
Tuesday sounds good. I'll drain and shut down the machine on Tuesday at 17:00 CEST / 15:00 UTC / 10:00 CDT. Does that work for you?
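For reference, the depool half is just the inverse of the repool command quoted further down in this feed (a sketch; the hostname is a placeholder):

$ sudo confctl select 'name=<host>.codfw.wmnet' set/pooled=no   # stop sending traffic to the node
$ sudo shutdown -h now                                          # power off once traffic has drained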
Wed, Jun 12
One note: since the default OS has changed (Bullseye -> Bookworm), I updated the ticket description accordingly; we definitely want Bookworm.
Tue, Jun 11
I repooled the machine just now, as I don't want to fly this close to the capacity ceiling for prolonged periods.
Wed, Jun 5
I can't upload the ASR since it's too large. Is there anywhere I should upload it to?
Tue, May 28
- we had another instance of high latency (eswiki)
- logs show that fetching features is slow (extract_cache)
- we have a report that should help with root-causing the matter
- Mistral is crashlooping; startup checks usually take 5m, so we bumped them to 10m, but it didn't help (see the sketch after this list)
- the Bert model works, so this is likely a Mistral-specific issue
- the kubelet partition increase for the install phase is in review
- ml-staging1001 is now on Bookworm; dragonfly (distributed downloading of S3 artifacts) needs to be bumped
- with Bookworm, GPU drivers are no longer on the base node (besides Debian kernel support); the driver/library code lives in the Docker images instead
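A quick way to inspect what the startup checks are doing on the crashlooping Mistral pod (a sketch; namespace and pod name are placeholders):

$ kubectl -n <namespace> get pods | grep -i mistral                        # find the crashlooping pod
$ kubectl -n <namespace> describe pod <mistral-pod> | grep -i -A3 startup  # show probe settings and recent failures
$ kubectl -n <namespace> logs <mistral-pod> --previous                     # logs from the last crashed container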
May 22 2024
# build-production-images --select '*pytorch23*'
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/istio ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/cert-manager ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
#
May 21 2024
Repooled the machine:
$ sudo confctl select 'name=ml-serve2002.codfw.wmnet' set/pooled=yes
codfw/ml_serve/kubesvc/ml-serve2002.codfw.wmnet: pooled changed no => yes
WARNING:conftool.announce:conftool action : set/pooled=yes; selector: name=ml-serve2002.codfw.wmnet
Since this has not re-occurred, I am closing the task for now. If it happens again, we can always re-open.
May 14 2024
- Connections from isvc namespaces on staging to the Cassandra machines now work, including TLS certs and SNI (spot-check sketch after this list)
- Next step: have an actual inference service talk to the cache, likely with code from https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/995001
- Still need to figure out long-term maintenance of the Cassandra server-side config (users, passwords, namespaces, schemas); we may hand off/soft-donate the machines to the Data Persistence Team
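One low-level way to spot-check the TLS-with-SNI part from a client host (a sketch; the hostname is a placeholder, and 9042 is Cassandra's usual native-protocol port):

$ # -servername sets the SNI name; the output shows the certificate chain the server presents
$ openssl s_client -connect <cassandra-host>:9042 -servername <cassandra-host> </dev/null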
All the machinery is now in place to make connections to Cassandra from isvcs on staging (in the experimental NS):
Apr 24 2024
This has been implemented in change 1020194.
Apr 19 2024
This is a bug in keras: it tries to open the file with mode r+b (read/write, binary), but since the file is owned by another user (nobody vs. somebody), the call fails. Why keras would need write access to the file, I don't know.
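A minimal reproduction of the failure mode outside of keras (path and ownership are hypothetical):

$ ls -l /srv/model/weights.h5                       # readable by all, writable only by its owner
-rw-r--r-- 1 nobody nogroup 123456 Apr 19 10:00 /srv/model/weights.h5
$ python3 -c 'open("/srv/model/weights.h5", "rb")'  # read-only open succeeds
$ python3 -c 'open("/srv/model/weights.h5", "r+b")' # update mode needs write permission
Traceback (most recent call last):
  File "<string>", line 1, in <module>
PermissionError: [Errno 13] Permission denied: '/srv/model/weights.h5'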
Apr 17 2024
I've experimented a bit on Thanos, and arrived at this query:
Apr 16 2024
Probably something like:
Apr 15 2024
Timeline (times in UTC):
We have restarted the associated service, and its logs show no more errors. It's not fully root-caused yet, but the functionality should be back in working order now. I have confirmed this for ruwiki.