User Details
- User Since: Jun 25 2020, 6:43 PM (208 w, 2 d)
- Availability: Available
- IRC Nick: chrisalbon
- LDAP User: Calbon
- MediaWiki User: CAlbon (WMF)
Tue, Jun 18
calbon moved T366528: Deployment of model updates from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
calbon moved T367537: Cloud VPS "machine-learning" project Buster deprecation from Unsorted to Backlog/SRE on the Machine-Learning-Team board.
calbon moved T367562: Cloud VPS "wikilabels" project Buster deprecation from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T367875: Reimage all ml-serve machines with Bookworm from Unsorted to Backlog/SRE on the Machine-Learning-Team board.
May 21 2024
calbon added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.
People can now pip install the package and use models. Right now we only have a few models; the number should increase over time.
calbon moved T363505: Pass the maximum number of uploads to the logo detection service from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T364089: Have problem with migrating to LiftWing from ores from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T365166: Update Pytorch base image to 2.3.0 from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon set the point value for T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) to 1.
calbon moved T365253: Allow Kubernetes workers to be deployed on Bookworm from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T365291: ml-serve2002 memory errors on DIMM_B1 from Unsorted to Watching on the Machine-Learning-Team board.
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
- Calico improvements make the whole workflow more streamlined
- Improve our incident response procedure
- Investigate CPU spikes
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- Still can't use the GPU with ROCm, but we figured out what the bug is: if the control version is upgraded to Bookworm, it will be fixed.
- Next step is to upgrade ml-staging to Bookworm then test.
- Working on upgrading HF to newer versions with ROCm 6.0. We tested them and they work, and will be posting a patch.
- Goal is to utilize GPU so we can deploy models from HuggingFace.
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Trying to fix up a Calico networking issue in Kubernetes
- After credentials, will send patched revert risk server to ml-staging
May 7 2024
calbon placed T360455: Add Article Quality Model to LiftWing up for grabs.
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
- Narrowed down the cause of the CPU usage spikes to feature extraction in the revscoring isvc. They might be triggered by some specific revids.
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- Waiting for the vendor (Supermicro) to finalize the order of 2x for ml-staging.
- Chris's guess is that ml-staging will be installed at the end of the quarter.
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Working on the plumbing on staging; should be done within a week
- Feeling good about it
Apr 30 2024
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Logging queries, and logging when things are slow, is the short-term goal. Knowing WHY a query takes a long time is a question for the future.
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
We have a theory that the ROCm drivers in the Debian package are not required.
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Decision point: Do we upgrade ROCm drivers?
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Update: No update
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Rebased code after prototype.
- Waiting for the Istio change needed to create a new service, which is imminent
- Need to add a new virtual service that is TCP
Apr 25 2024
calbon moved T360455: Add Article Quality Model to LiftWing from Watching to Unsorted on the Machine-Learning-Team board.
Apr 23 2024
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- The GPU order for the first GPU 2x chassis is close to complete. There are some supply issues with the chassis, so the question is whether we want to use an upgraded chassis for the ml-staging server.
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Merged Puppet machinery that allows network policies to be generated for assorted clusters, so we can automatically generate the network policy without the 60 lines of Istio config.
- Will merge the network policy change that allows Istio to talk to Cassandra.
Apr 16 2024
calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from 2024 Q4: Lift Wing Python Package to 2024 Q4: Users can "pip install liftwing" and access 20% of models.
calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from Q4: Lift Wing Python Package to 2024 Q4: Lift Wing Python Package.
calbon renamed T348153: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. from Goal: Lift Wing users can request multiple predictions using a single request. to Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request..
calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-language-agnostic to Q3 2024 Goal: Implement caching for revertrisk-language-agnostic.
calbon renamed T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models from Goal: Inference Optimization for Hugging face/Pytorch models to Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models.
calbon renamed T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production from Goal: Expand Lift Wing Cluster and add GPU capacity to production to Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production .
calbon renamed T353814: Q3 2024 Goal: A plan for a training infrastructure from Goal: A plan for a training infrastructure to Q3 2024 Goal: A plan for a training infrastructure .
calbon moved T362671: ------ from Current Quarter Goals to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Mar 26 2024
calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-multilingual to Goal: Implement caching for revertrisk-language-agnostic.
calbon added a comment to T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production .
At risk because we don't have a GPU in the data centers yet.
calbon moved T360455: Add Article Quality Model to LiftWing from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T360593: Create an examples directory in the repository and add a basic README.md from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360637: Bump memory for registry[12]00[34] VMs from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360638: Create a Pytorch base image from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360990: drafttopic has two issue trackers from Unsorted to Backlog/Revscoring on the Machine-Learning-Team board.
Mar 19 2024
calbon moved T359879: SLO dashboards for Lift Wing showing unexpected values from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360111: Set automatically libomp's num threads when using Pytorch from Unsorted to In Progress on the Machine-Learning-Team board.
calbon moved T360120: Run unit tests for the inference-services repo in CI from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
calbon set the point value for T360177: Support building and running of articletopic-outlink model-server via Makefile to 3.
calbon set the point value for T360212: Add pyopencl requirements to images that use resource_utils to 2.
calbon moved T360406: Error handling in Batch Predictions for RevertRisk Models from Unsorted to In Progress on the Machine-Learning-Team board.
calbon updated Other Assignee for T360406: Error handling in Batch Predictions for RevertRisk Models, added: achou.
calbon moved T360423: Deploy RevertRisk language-agnostic with knowledge integrity v0.6.0 from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Mar 5 2024
calbon moved T358676: Host a logo detection model for Commons images from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T358744: Deploy RR-language-agnostic batch version to prod from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
calbon moved T358748: Prep work for (re)training workflow sprint from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon closed T358842: Investigate why WikiGPT returns an Internal Server Error, a subtask of T328494: WikiGPT Experiment, as Resolved.
calbon moved T358842: Investigate why WikiGPT returns an Internal Server Error from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board.
calbon updated Other Assignee for T358953: Inconsistent data type for articlequality score predictions on ptwiki, added: isarantopoulos.
calbon moved T358953: Inconsistent data type for articlequality score predictions on ptwiki from Unsorted to Ready To Go on the Machine-Learning-Team board.