User Details
- User Since
- Jun 25 2020, 6:43 PM (220 w, 4 d)
- Availability
- Available
- IRC Nick
- chrisalbon
- LDAP User
- Calbon
- MediaWiki User
- CAlbon (WMF)
Tue, Aug 27
calbon added a comment to T371398: Goal 4: Support product teams in deploying production models..
- Recommendation API is live and in production
- Recently been supporting the Structured Content team in using logo detection in Lift Wing production.
- Updated the readability model
- Pre-saved context for revert risk: https://phabricator.wikimedia.org/T356102, https://phabricator.wikimedia.org/T364705
calbon added a comment to T371397: Goal 3: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services..
- Slow revscoring: started logging queries on the pod side, so the logs are gone when the pod is killed.
- Answer "Is there a reason we are not logging the query into logstash?"
calbon added a comment to T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU..
- Machines are racked but not set up. Will set up one first to figure out the disk layout, then the other one, and then release them to the research team.
calbon added a comment to T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..
- GPU hosts are racked but not set up yet
- Software side slower
Aug 13 2024
calbon added a comment to T371398: Goal 4: Support product teams in deploying production models..
Update
- Modernized recommendation API has been deployed to production
- API gateway setup underway
- Article quality LA: Ready on staging and want to bring it into production. Should we group models into common namespaces? Suggestion: create namespaces per area where the model is used: articles, revisions, images, etc.
calbon added a comment to T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU..
Update:
- Waiting for ml-lab machines to be delivered to the eqiad data center.
calbon renamed T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production. from Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that uses an inference optimization engine in production. to Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..
calbon added a comment to T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..
Infra
- Setting up the puppet roles
- Can't commit puppet roles until the machines are there
- Reached out to vendor
Jul 31 2024
calbon added a project to T371398: Goal 4: Support product teams in deploying production models.: Goal.
Jul 30 2024
calbon moved T369712: Request to update Readability model on Lift Wing from Unsorted to Ready To Go on the Machine-Learning-Team board.
Jun 18 2024
calbon moved T366528: Deployment of model updates from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
calbon moved T367537: Cloud VPS "machine-learning" project Buster deprecation from Unsorted to Backlog/SRE on the Machine-Learning-Team board.
calbon moved T367562: Cloud VPS "wikilabels" project Buster deprecation from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T367875: Reimage all ml-serve machines with Bookworm from Unsorted to Backlog/SRE on the Machine-Learning-Team board.
May 21 2024
calbon added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.
People can now pip install and use models. Right now we only have a few models - the number of models should increase over time.
calbon moved T363505: Pass the maximum number of uploads to the logo detection service from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T364089: Have problem with migrating to LiftWing from ores from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T365166: Update Pytorch base image to 2.3.0 from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon set the point value for T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) to 1.
calbon moved T365253: Allow Kubernetes workers to be deployed on Bookworm from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T365291: ml-serve2002 memory errors on DIMM_B1 from Unsorted to Watching on the Machine-Learning-Team board.
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
- Calico improvements make the whole workflow more streamlined
- Improve our incident response procedure
- Investigate CPU spikes
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- Still can't use GPU with ROCm. But we figured out what the bug is - if the control version is upgraded to Bookworm it will be fixed.
- Next step is to upgrade ml-staging to Bookworm then test.
- Working on upgrading HF to newer versions with ROCm 6.0. Tested them and they work; will be posting a patch.
- Goal is to utilize GPU so we can deploy models from HuggingFace.
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Trying to fix up a Calico networking issue in Kubernetes
- After credentials, will send patched revert risk server to ml-staging
May 7 2024
calbon placed T360455: Add Article Quality Model to LiftWing up for grabs.
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
- Narrowed down the cause of the CPU usage spikes to feature extraction in the revscoring isvc. Might be caused by some specific revids.
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- Wait for vendor (Supermicro) to finalize order of 2x for ml-staging.
- Chris's guess is that ml-staging will be installed at the end of the quarter
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Working on plumbing on staging, should be done within a week
- Feeling good about it
Apr 30 2024
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Logging queries, and logging when things are slow, is the short-term goal. Knowing WHY a query takes a long time is a future question.
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
We have a theory that the ROCm drivers from the Debian package are not required.
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Decision point: Do we upgrade ROCm drivers?
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Update: No update
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Rebased code after prototype.
- Waiting for istio change for making a new service, which is imminent
- Need to add a new virtual service that is TCP
Apr 25 2024
calbon moved T360455: Add Article Quality Model to LiftWing from Watching to Unsorted on the Machine-Learning-Team board.
Apr 23 2024
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
- GPU order for the first GPU 2x chassis is close to complete. There are some supply issues with the chassis, so the question is going to be if we want to use an upgraded chassis for the ml-staging server.
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
- Merged Puppet machinery to allow network policies to be generated for assorted clusters, so we can automatically generate the network policy without the 60 lines of Istio config.
- Will merge change to network policy to allow Istio to talk to Cassandra.
Apr 16 2024
calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from 2024 Q4: Lift Wing Python Package to 2024 Q4: Users can "pip install liftwing" and access 20% of models.
calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from Q4: Lift Wing Python Package to 2024 Q4: Lift Wing Python Package.
calbon renamed T348153: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. from Goal: Lift Wing users can request multiple predictions using a single request. to Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request..
calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-language-agnostic to Q3 2024 Goal: Implement caching for revertrisk-language-agnostic.
calbon renamed T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models from Goal: Inference Optimization for Hugging face/Pytorch models to Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models.
calbon renamed T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production from Goal: Expand Lift Wing Cluster and add GPU capacity to production to Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production .
calbon renamed T353814: Q3 2024 Goal: A plan for a training infrastructure from Goal: A plan for a training infrastructure to Q3 2024 Goal: A plan for a training infrastructure .
calbon moved T362671: ------ from 2023-2024 Q4 Quarter Goals to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Mar 26 2024
calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-multilingual to Goal: Implement caching for revertrisk-language-agnostic.
calbon added a comment to T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production .
At risk because we don't have a GPU in the data centers yet.
calbon moved T360455: Add Article Quality Model to LiftWing from Unsorted to Watching on the Machine-Learning-Team board.
calbon moved T360593: Create an examples directory in the repository and add a basic README.md from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360637: Bump memory for registry[12]00[34] VMs from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360638: Create a Pytorch base image from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360990: drafttopic has two issue trackers from Unsorted to Backlog/Revscoring on the Machine-Learning-Team board.
Mar 19 2024
calbon moved T359879: SLO dashboards for Lift Wing showing unexpected values from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved T360111: Set automatically libomp's num threads when using Pytorch from Unsorted to In Progress on the Machine-Learning-Team board.
calbon moved T360120: Run unit tests for the inference-services repo in CI from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
calbon set the point value for T360177: Support building and running of articletopic-outlink model-server via Makefile to 3.
calbon set the point value for T360212: Add pyopencl requirements to images that use resource_utils to 2.
calbon moved T360406: Error handling in Batch Predictions for RevertRisk Models from Unsorted to In Progress on the Machine-Learning-Team board.
calbon updated Other Assignee for T360406: Error handling in Batch Predictions for RevertRisk Models, added: achou.