In T354970#9922418, @Eevans wrote: @klausman would you like me to upgrade the ml-cache cluster as well, or save that for you?
Yesterday
Tue, Jun 25
klausman set the point value for T368273: Roll out kserve-inference securityContext change to ML isvcs to 3.
Mon, Jun 24
klausman added a comment to T366688: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet.
All of those work for me, with a preference for Thursday, so I'll drain&power-off the machine Thu before 8am your time (1300 UTC/1500 CEST). If there's a need to move to Wed or Fri, just lmk and I'll schedule then.
Thu, Jun 20
Wed, Jun 19
For anyone who wants to build the above binary from the Debian sources:
Machine is imaged and running. The PXE boot was "fixed" by an ugly hack mentioned in T304483#9906962. While the firmware problem remains, at least we are unblocked on this host.
klausman renamed T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet from hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet to hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
So today I wanted to install ml-staging2003. This is a new SMC hardware type and it hits this problem.
In T357415#9905563, @Papaul wrote: Information2
The server has only the SFT-OOB-LIC license, which is the Supermicro out-of-band (OOB) license allowing you to update only the BIOS and BMC/IPMI. This license will not allow you to upgrade the add-on 10G and 25G network interface I mentioned in (1). This is the reason we are getting the error "ERROR:Not licensed to perform this request. The following licenses DCMS were needed Click here to return". DCMS, Supermicro's Data Center Management Suite license, unlocks features such as the Supermicro Update Manager, which will then allow you to manage firmware, and Redfish support, which is more for @Volans and @elukey for automation.
Hope all this information was helpful. Please let me know if you have any questions.
Tue, Jun 18
It looks like the primary interface can't see the network device (the console shows "media test failure, check cable").
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
Machine is drained and off, so you're free to reseat memory etc. Let me know when it's back (and what we might be able to do if the memory remains problematic).
klausman added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
Current state:
Mon, Jun 17
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
Tuesday sounds good. I'll drain and shut down the machine on Tuesday at 17:00 CEST/15:00 UTC/10:00 CDT. Does that work for you?
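The three times quoted in that comment can be cross-checked with fixed UTC offsets. This is only a minimal sketch: it assumes the Tuesday in question is 18 June 2024 (inferred from the feed dates, not stated in the comment) and uses the summer offsets CEST = UTC+2 and CDT = UTC-5.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer-time offsets for the zones named in the comment.
CEST = timezone(timedelta(hours=2), "CEST")
CDT = timezone(timedelta(hours=-5), "CDT")

# Assumed date: Tuesday, 18 June 2024, inferred from the surrounding feed entries.
drain_time = datetime(2024, 6, 18, 17, 0, tzinfo=CEST)

# Convert the same instant into the other two zones.
as_utc = drain_time.astimezone(timezone.utc)
as_cdt = drain_time.astimezone(CDT)

print(as_utc.strftime("%H:%M %Z"))  # 15:00 UTC
print(as_cdt.strftime("%H:%M %Z"))  # 10:00 CDT
```

Fixed offsets keep the sketch independent of the system's time zone database; for dates near a daylight-saving switch, a named zone would be the safer choice.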
Wed, Jun 12
One note: since the default OS has changed (Bullseye->Bookworm), I updated the ticket desc accordingly --- we definitely want Bookworm.
Tue, Jun 11
klausman moved T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet from Unsorted to Watching on the Machine-Learning-Team board.
klausman moved T366688: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet from Unsorted to Watching on the Machine-Learning-Team board.
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
I repooled the machine just now, as I don't want to fly this close to capacity ceiling for prolonged periods.
Wed, Jun 5
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
ml-serve2001_racadm_getsel_output.txt (17 KB)
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
Can't upload the ASR since it's too large. Anywhere that I should upload it to?
Tue, Jun 4
Thu, May 30
klausman added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..
In T343123#9813736, @Dbrant wrote: I'm checking out the Lift Wing API to generate article descriptions, and I see the URL path contains a colon : character. The problem is that our native network stack always urlencodes colons, but when I hit the same API path with an encoded colon (%3A) it returns 404.
Is this expected? Can we make the API accept a urlencoded path?
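The difference between the two spellings of the path can be illustrated with Python's standard library. This is a minimal sketch, not the Lift Wing routing code: the path below is hypothetical, and `normalize` is an illustrative helper showing one way a server could treat both forms as the same route.

```python
from urllib.parse import quote, unquote

# Hypothetical path for illustration only; the real API path is not given in the thread.
path = "/v1/models/article-descriptions:predict"

# A client that always percent-encodes colons sends this form.
encoded_strict = quote(path, safe="/")

# A client that leaves colons alone sends the path unchanged.
encoded_lenient = quote(path, safe="/:")


def normalize(request_path: str) -> str:
    """Decode percent-escapes so both spellings map to the same route (illustrative helper)."""
    return unquote(request_path)


print(encoded_strict)   # /v1/models/article-descriptions%3Apredict
print(encoded_lenient)  # /v1/models/article-descriptions:predict
print(normalize(encoded_strict) == normalize(encoded_lenient))  # True
```

A router that compares the raw request path would see these as two different strings, which would be consistent with the 404 described; whether that is what actually happens in the Lift Wing stack is not established here.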
klausman moved T365971: Tweak partman recipe for ML k8s workers from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
May 28 2024
klausman added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
- we had another instance of high lat (eswiki)
- logs show fetch features being slow (extract_cache)
- we have a repro that should help with root-causing the matter
klausman added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- Mistral crashlooping, startup checks usually 5m, so we bumped to 10m, but it didn't help
- Bert model works, so likely Mistral issue
- the kubelet partition increase for the install phase is in review
- ml-staging1001 is now on Bookworm, dragonfly (distributed downloading of S3 stuff) needs to be bumped
- with bookworm, there no longer are GPU drivers on the base node (besides Debian kernel support), but driver/library code lives in the Docker images
klausman set the point value for T365554: Run load tests for the rec-api-ng and update production resources to meet expected load to 3.
klausman moved T365581: Use multilingual revert risk model in Automoderator on supported wikis from Unsorted to Watching on the Machine-Learning-Team board.
klausman moved T365701: Enable Revert Risk RecentChanges filter on id.wiki from Unsorted to Watching on the Machine-Learning-Team board.
klausman moved T365834: Append wikitech link and contact info to revscoring model servers from Unsorted to Ready To Go on the Machine-Learning-Team board.
klausman set the point value for T365834: Append wikitech link and contact info to revscoring model servers to 1.
klausman set the point value for T365842: Allow setting huggingfaceserver cmd args from deployment-charts to 2.
klausman moved T365842: Allow setting huggingfaceserver cmd args from deployment-charts from Unsorted to In Progress on the Machine-Learning-Team board.
klausman moved T365971: Tweak partman recipe for ML k8s workers from Unsorted to In Progress on the Machine-Learning-Team board.
klausman moved T366015: Add pydantic validation to revertrisk model in liftwing-python package from Unsorted to Ready To Go on the Machine-Learning-Team board.
May 27 2024
May 22 2024
# build-production-images --select '*pytorch23*'
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/istio ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/cert-manager ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
#
May 21 2024
Repooled the machine:
$ sudo confctl select 'name=ml-serve2002.codfw.wmnet' set/pooled=yes
codfw/ml_serve/kubesvc/ml-serve2002.codfw.wmnet: pooled changed no => yes
WARNING:conftool.announce:conftool action : set/pooled=yes; selector: name=ml-serve2002.codfw.wmnet
klausman set the point value for T365479: Update kserve and knative-serving charts for new-style Calico network policies to 5.
klausman closed T360894: Investigate temporary high latency in revscoring service for wikidata as Resolved.
Since this has not re-occurred, I am closing the task for now. If it happens again, we can always re-open.
May 16 2024
In T287491#9803609, @jijiki wrote: @klausman you can go ahead with kserve and knative-serving when you have some time, ping me for reviews etc
May 15 2024
klausman added a subtask for T362503: ORES doesn't work (at least for ru- and ukwiki): Unknown Object (Task).
May 14 2024
klausman added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Connections from isvc namespaces on staging to the Cassandra machines now work, including TLS certs and SNI
- Next step: have an actual inference service actually talk to the cache, likely with code from https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/995001
- Still need to figure out long-term maintenance of Cassandra server-side config (users, passwords, namespaces, schemas); may hand off/soft-donate the machines to Data Persistence Team
klausman added a comment to T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines.
All the machinery is now in place to make connections to Cassandra from isvcs on staging (in the experimental NS):
May 13 2024
May 10 2024
In T362984#9785902, @isarantopoulos wrote: I got an error when trying llm image locally with bullseye-torch2.3.0-rocm5.7 (related patch):
Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 9, in <module>
    import torch
  File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import * # noqa: F403
ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument
Haven't found any related open issue so I'm currently testing different pytorch versions to see if the issue still exists
May 7 2024
klausman added a comment to T363449: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons.
In T363449#9773855, @elukey wrote: @klausman leaving the decision to you :) You can file path anytime, today we'll roll out the transparent proxy changes for eqiad and then we'll be ok to proceed with the new commons host header. Before proceeding I'd suggest to check if calls to the MW API can accept commons host header and URI paths, I don't think any rewrite is happening in upper layers but better safe than sorry!
klausman set the point value for T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` to 2.
klausman closed T362661: Create basic alerts for isvcs to catch outages, a subtask of T362503: ORES doesn't work (at least for ru- and ukwiki), as Resolved.
Apr 30 2024
klausman moved T362661: Create basic alerts for isvcs to catch outages from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Apr 25 2024
Makefile: cleanup and slight reorganization
Apr 24 2024
klausman added a comment to T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config`.
This has been implemented in change 1020194.
Apr 23 2024
klausman changed the point value for T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines from 1 to 5.
klausman moved T362661: Create basic alerts for isvcs to catch outages from Unsorted to In Progress on the Machine-Learning-Team board.
Apr 19 2024
This is a bug in keras: it tries to open the file with mode r+b (read and write, binary), but since the file is owned by another user (nobody vs. somebody), the call fails. Why keras would need to be able to write to the file, I don't know.
In T362749#9727438, @kevinbazira wrote: PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'
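Opening a file in r+b mode requests write access in addition to read access, so it can fail on a file that is perfectly readable. The following is a minimal sketch of that behaviour, not the keras code path: it emulates the situation by removing the write bits from a temporary file instead of changing its owner, and `can_open` is an illustrative helper.

```python
import os
import stat
import tempfile


def can_open(path: str, mode: str) -> bool:
    """Return True if open(path, mode) succeeds, False on PermissionError."""
    try:
        with open(path, mode):
            return True
    except PermissionError:
        return False


# Create a stand-in model file (the name and content are placeholders).
with tempfile.NamedTemporaryFile(suffix=".keras", delete=False) as tmp:
    tmp.write(b"placeholder")
    model_path = tmp.name

# Make the file read-only for everyone, similar in effect to a file
# owned by another user that we may read but not modify.
os.chmod(model_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

writable = os.access(model_path, os.W_OK)     # what the OS grants this process
read_only_ok = can_open(model_path, "rb")     # plain read succeeds
read_write_ok = can_open(model_path, "r+b")   # needs write permission as well

print(read_only_ok, read_write_ok)

# Clean up: restore write permission, then delete the file.
os.chmod(model_path, stat.S_IRUSR | stat.S_IWUSR)
os.remove(model_path)
```

For an unprivileged user the second open is expected to fail with PermissionError, matching the error quoted above. When run as root, permission bits are bypassed and both opens succeed.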
Apr 18 2024
Apr 17 2024
I've experimented a bit on Thanos, and arrived at this query:
Apr 16 2024
Probably something like:
klausman updated the task description for T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config`.
gitignore: Ignore my_venv/ and models/ directories
Apr 15 2024
Timeline (times in UTC):