
calbon (Chris Albon)
Director of Machine Learning

User Details

User Since
Jun 25 2020, 6:43 PM (208 w, 2 d)
Availability
Available
IRC Nick
chrisalbon
LDAP User
Calbon
MediaWiki User
CAlbon (WMF)

Recent Activity

Tue, Jun 18

calbon moved T366528: Deployment of model updates from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Tue, Jun 18, 2:55 PM · Research-engineering, Machine-Learning-Team, Research
calbon assigned T366772: Solve revscoring models increased latencies for big revision sizes to AikoChou.
Tue, Jun 18, 2:55 PM · Machine-Learning-Team
calbon reassigned T367293: Update blubber version in docker images from klausman to isarantopoulos.
Tue, Jun 18, 2:54 PM · Machine-Learning-Team
calbon assigned T367293: Update blubber version in docker images to klausman.
Tue, Jun 18, 2:53 PM · Machine-Learning-Team
calbon assigned T367537: Cloud VPS "machine-learning" project Buster deprecation to klausman.
Tue, Jun 18, 2:50 PM · Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation)
calbon moved T367537: Cloud VPS "machine-learning" project Buster deprecation from Unsorted to Backlog/SRE on the Machine-Learning-Team board.
Tue, Jun 18, 2:50 PM · Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation)
calbon moved T367562: Cloud VPS "wikilabels" project Buster deprecation from Unsorted to Watching on the Machine-Learning-Team board.
Tue, Jun 18, 2:49 PM · Machine-Learning-Team, Wikilabels, Cloud-VPS (Debian Buster Deprecation)
calbon moved T367875: Reimage all ml-serve machines with Bookworm from Unsorted to Backlog/SRE on the Machine-Learning-Team board.
Tue, Jun 18, 2:46 PM · Machine-Learning-Team

May 21 2024

calbon added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

People can now pip install the package and use models. Right now only a few models are available; the number should increase over time.

May 21 2024, 2:49 PM · Goal, Machine-Learning-Team
calbon moved T363505: Pass the maximum number of uploads to the logo detection service from Unsorted to Watching on the Machine-Learning-Team board.
May 21 2024, 2:48 PM · Machine-Learning-Team, Structured-Data-Backlog
calbon moved T364089: Have problem with migrating to LiftWing from ores from Unsorted to Watching on the Machine-Learning-Team board.
May 21 2024, 2:48 PM · Machine-Learning-Team
calbon assigned T363505: Pass the maximum number of uploads to the logo detection service to kevinbazira.
May 21 2024, 2:47 PM · Machine-Learning-Team, Structured-Data-Backlog
calbon assigned T364089: Have problem with migrating to LiftWing from ores to isarantopoulos.
May 21 2024, 2:46 PM · Machine-Learning-Team
calbon moved T365226: Investigate a way to return other 2xx status code from predict in kserve from Unsorted to Backlog/Other on the Machine-Learning-Team board.
May 21 2024, 2:45 PM · Machine-Learning-Team
calbon assigned T365226: Investigate a way to return other 2xx status code from predict in kserve to achou.
May 21 2024, 2:44 PM · Machine-Learning-Team
calbon moved T365166: Update Pytorch base image to 2.3.0 from Unsorted to Ready To Go on the Machine-Learning-Team board.
May 21 2024, 2:34 PM · Machine-Learning-Team
calbon moved T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) from Unsorted to Ready To Go on the Machine-Learning-Team board.
May 21 2024, 2:34 PM · Machine-Learning-Team
calbon set the point value for T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) to 1.
May 21 2024, 2:33 PM · Machine-Learning-Team
calbon set the point value for T365166: Update Pytorch base image to 2.3.0 to 1.
May 21 2024, 2:33 PM · Machine-Learning-Team
calbon assigned T365253: Allow Kubernetes workers to be deployed on Bookworm to elukey.
May 21 2024, 2:32 PM · Machine-Learning-Team, serviceops, Kubernetes
calbon moved T365253: Allow Kubernetes workers to be deployed on Bookworm from Unsorted to Ready To Go on the Machine-Learning-Team board.
May 21 2024, 2:32 PM · Machine-Learning-Team, serviceops, Kubernetes
calbon set the point value for T365253: Allow Kubernetes workers to be deployed on Bookworm to 3.
May 21 2024, 2:31 PM · Machine-Learning-Team, serviceops, Kubernetes
calbon moved T365291: ml-serve2002 memory errors on DIMM_B1 from Unsorted to Watching on the Machine-Learning-Team board.
May 21 2024, 2:29 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
calbon moved T365439: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL from Unsorted to Watching on the Machine-Learning-Team board.
May 21 2024, 2:25 PM · Machine-Learning-Team
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
  • Calico improvements make the whole workflow more streamlined
  • Improve our incident response procedure
  • Investigate CPU spikes
May 21 2024, 2:18 PM · Goal, Machine-Learning-Team
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
  • Still can't use GPU with ROCm. But we figured out what the bug is - if the control version is upgraded to Bookworm it will be fixed.
  • Next step is to upgrade ml-staging to Bookworm then test.
  • Working on upgrading HF to newer versions with ROCm 6.0. Tested them and they work; will be posting a patch.
  • Goal is to utilize GPU so we can deploy models from HuggingFace.
May 21 2024, 2:16 PM · Goal, Machine-Learning-Team
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
  • Trying to fix up a Calico networking issue in Kubernetes
    • After credentials, will send patched revert risk server to ml-staging
May 21 2024, 2:07 PM · Goal, Machine-Learning-Team

May 7 2024

calbon placed T360455: Add Article Quality Model to LiftWing up for grabs.
May 7 2024, 2:24 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
calbon assigned T360455: Add Article Quality Model to LiftWing to kevinbazira.
May 7 2024, 2:23 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
  • Narrowed down the cause of the CPU usage spikes to feature extraction in the revscoring isvc. They might be triggered by some specific revids.
May 7 2024, 2:19 PM · Goal, Machine-Learning-Team
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
  • Wait for vendor (Supermicro) to finalize order of 2x for ml-staging.
    • Chris's guess is ml-staging installed at end of quarter
May 7 2024, 2:10 PM · Goal, Machine-Learning-Team
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
  • Working on plumbing on staging; should be done within a week
    • Feeling good about it
May 7 2024, 2:08 PM · Goal, Machine-Learning-Team

Apr 30 2024

calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.

Logging queries, and logging when things are slow, is the short-term goal. Knowing WHY a query takes a long time is a future question.

Apr 30 2024, 2:22 PM · Goal, Machine-Learning-Team
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

We have a theory that the ROCm drivers in the Debian package are not required.

Apr 30 2024, 2:19 PM · Goal, Machine-Learning-Team
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Decision point: Do we upgrade ROCm drivers?

Apr 30 2024, 2:15 PM · Goal, Machine-Learning-Team
calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Update: No update

Apr 30 2024, 2:14 PM · Goal, Machine-Learning-Team
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
  • Rebased code after prototype.
    • Waiting for the Istio change for making a new service, which is imminent
    • Need to add a new virtual service that is TCP
Apr 30 2024, 2:13 PM · Goal, Machine-Learning-Team

Apr 25 2024

calbon moved T360455: Add Article Quality Model to LiftWing from Watching to Unsorted on the Machine-Learning-Team board.
Apr 25 2024, 5:07 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team

Apr 23 2024

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
  • GPU order for the first GPU 2x chassis is close to complete. There are some supply issues with the chassis, so the question is going to be if we want to use an upgraded chassis for the ml-staging server.
Apr 23 2024, 2:25 PM · Goal, Machine-Learning-Team
calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
  • Merged Puppet machinery to allow network policies to be generated for assorted clusters, so we can automatically generate the network policy without the 60 lines of Istio config.
  • Will merge change to network policy to allow Istio to talk to Cassandra.
Apr 23 2024, 2:18 PM · Goal, Machine-Learning-Team

Apr 16 2024

calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from 2024 Q4: Lift Wing Python Package to 2024 Q4: Users can "pip install liftwing" and access 20% of models.
Apr 16 2024, 2:59 PM · Goal, Machine-Learning-Team
calbon added a project to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services: Goal.
Apr 16 2024, 2:58 PM · Goal, Machine-Learning-Team
calbon created T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Apr 16 2024, 2:57 PM · Goal, Machine-Learning-Team
calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from Q4: Lift Wing Python Package to 2024 Q4: Lift Wing Python Package.
Apr 16 2024, 2:57 PM · Goal, Machine-Learning-Team
calbon moved T348153: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. from Current Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.
Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team
calbon moved T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Current Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.
Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team
calbon moved T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models from Current Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.
Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team
calbon moved T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production from Current Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.
Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team
calbon moved T353814: Q3 2024 Goal: A plan for a training infrastructure from Current Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.
Apr 16 2024, 2:52 PM · Goal, Machine-Learning-Team
calbon renamed T348153: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. from Goal: Lift Wing users can request multiple predictions using a single request. to Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request..
Apr 16 2024, 2:52 PM · Goal, Machine-Learning-Team
calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-language-agnostic to Q3 2024 Goal: Implement caching for revertrisk-language-agnostic.
Apr 16 2024, 2:52 PM · Goal, Machine-Learning-Team
calbon renamed T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models from Goal: Inference Optimization for Hugging face/Pytorch models to Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models.
Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team
calbon renamed T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production from Goal: Expand Lift Wing Cluster and add GPU capacity to production to Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production.
Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team
calbon renamed T353814: Q3 2024 Goal: A plan for a training infrastructure from Goal: A plan for a training infrastructure to Q3 2024 Goal: A plan for a training infrastructure.
Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team
calbon created T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team
calbon moved T362671: ------ from Current Quarter Goals to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Apr 16 2024, 2:46 PM · Machine-Learning-Team
calbon closed T362671: ------ as Declined.
Apr 16 2024, 2:45 PM · Machine-Learning-Team
calbon created T362671: ------.
Apr 16 2024, 2:45 PM · Machine-Learning-Team
calbon created T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Apr 16 2024, 2:45 PM · Goal, Machine-Learning-Team

Mar 26 2024

calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-multilingual to Goal: Implement caching for revertrisk-language-agnostic.
Mar 26 2024, 2:41 PM · Goal, Machine-Learning-Team
calbon added a comment to T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production.

At risk because we don't have a GPU in the data centers yet.

Mar 26 2024, 2:40 PM · Goal, Machine-Learning-Team
calbon moved T360455: Add Article Quality Model to LiftWing from Unsorted to Watching on the Machine-Learning-Team board.
Mar 26 2024, 2:35 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
calbon moved T360593: Create an examples directory in the repository and add a basic README.md from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 26 2024, 2:31 PM · Machine-Learning-Team
calbon moved T360637: Bump memory for registry[12]00[34] VMs from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 26 2024, 2:27 PM · Patch-For-Review, serviceops, Machine-Learning-Team
calbon moved T360638: Create a Pytorch base image from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 26 2024, 2:27 PM · Patch-For-Review, Machine-Learning-Team
calbon set the point value for T360638: Create a Pytorch base image to 3.
Mar 26 2024, 2:23 PM · Patch-For-Review, Machine-Learning-Team
calbon assigned T360894: Investigate temporary high latency in revscoring service for wikidata to klausman.
Mar 26 2024, 2:16 PM · Machine-Learning-Team
calbon moved T360990: drafttopic has two issue trackers from Unsorted to Backlog/Revscoring on the Machine-Learning-Team board.
Mar 26 2024, 2:15 PM · drafttopic-modeling, Machine-Learning-Team
calbon assigned T360990: drafttopic has two issue trackers to isarantopoulos.
Mar 26 2024, 2:14 PM · drafttopic-modeling, Machine-Learning-Team

Mar 19 2024

calbon moved T359879: SLO dashboards for Lift Wing showing unexpected values from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 19 2024, 2:55 PM · Machine-Learning-Team, Observability-Metrics
calbon moved T360111: Set automatically libomp's num threads when using Pytorch from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 19 2024, 2:55 PM · Machine-Learning-Team
calbon assigned T359879: SLO dashboards for Lift Wing showing unexpected values to elukey.
Mar 19 2024, 2:55 PM · Machine-Learning-Team, Observability-Metrics
calbon assigned T360111: Set automatically libomp's num threads when using Pytorch to elukey.
Mar 19 2024, 2:54 PM · Machine-Learning-Team
calbon assigned T360120: Run unit tests for the inference-services repo in CI to elukey.
Mar 19 2024, 2:50 PM · Machine-Learning-Team
calbon moved T360120: Run unit tests for the inference-services repo in CI from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Mar 19 2024, 2:49 PM · Machine-Learning-Team
calbon moved T360177: Support building and running of articletopic-outlink model-server via Makefile from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 19 2024, 2:49 PM · Machine-Learning-Team
calbon set the point value for T360177: Support building and running of articletopic-outlink model-server via Makefile to 3.
Mar 19 2024, 2:48 PM · Machine-Learning-Team
calbon set the point value for T360212: Add pyopencl requirements to images that use resource_utils to 2.
Mar 19 2024, 2:47 PM · Machine-Learning-Team
calbon assigned T360212: Add pyopencl requirements to images that use resource_utils to isarantopoulos.
Mar 19 2024, 2:47 PM · Machine-Learning-Team
calbon moved T356566: Entries on Special:Version page not alphabetically sorted (as ORES extension is listed as "Machine Learning Platform") from Unsorted to Watching on the Machine-Learning-Team board.
Mar 19 2024, 2:40 PM · MediaWiki-extensions-ORES, Machine-Learning-Team, MediaWiki-Special-pages
calbon moved T360406: Error handling in Batch Predictions for RevertRisk Models from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 19 2024, 2:38 PM · Patch-For-Review, Machine-Learning-Team
calbon updated Other Assignee for T360406: Error handling in Batch Predictions for RevertRisk Models, added: achou.
Mar 19 2024, 2:38 PM · Patch-For-Review, Machine-Learning-Team
calbon moved T360423: Deploy RevertRisk language-agnostic with knowledge integrity v0.6.0 from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Mar 19 2024, 2:38 PM · Machine-Learning-Team

Mar 5 2024

calbon moved T358676: Host a logo detection model for Commons images from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 5 2024, 3:59 PM · Structured-Data-Backlog (Current Work), Machine-Learning-Team
calbon assigned T358676: Host a logo detection model for Commons images to kevinbazira.
Mar 5 2024, 3:56 PM · Structured-Data-Backlog (Current Work), Machine-Learning-Team
calbon set the point value for T358744: Deploy RR-language-agnostic batch version to prod to 3.
Mar 5 2024, 3:49 PM · Machine-Learning-Team
calbon moved T358744: Deploy RR-language-agnostic batch version to prod from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Mar 5 2024, 3:49 PM · Machine-Learning-Team
calbon moved T358748: Prep work for (re)training workflow sprint from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 5 2024, 3:47 PM · Machine-Learning-Team
calbon set the point value for T358748: Prep work for (re)training workflow sprint to 2.
Mar 5 2024, 3:47 PM · Machine-Learning-Team
calbon moved T358831: Migrate usage of Database::delete, insert, update and upsert to QueryBuilder in ORES from Unsorted to Watching on the Machine-Learning-Team board.
Mar 5 2024, 3:45 PM · MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), Machine-Learning-Team, MediaWiki-extensions-ORES, Technical-Debt
calbon closed T358842: Investigate why WikiGPT returns an Internal Server Error, a subtask of T328494: WikiGPT Experiment, as Resolved.
Mar 5 2024, 3:44 PM · Epic, Machine-Learning-Team
calbon closed T358842: Investigate why WikiGPT returns an Internal Server Error as Resolved.
Mar 5 2024, 3:44 PM · Machine-Learning-Team
calbon moved T358842: Investigate why WikiGPT returns an Internal Server Error from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 5 2024, 3:43 PM · Machine-Learning-Team
calbon updated Other Assignee for T358953: Inconsistent data type for articlequality score predictions on ptwiki, added: isarantopoulos.
Mar 5 2024, 3:43 PM · Machine-Learning-Team, ORES
calbon moved T358953: Inconsistent data type for articlequality score predictions on ptwiki from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 5 2024, 3:43 PM · Machine-Learning-Team, ORES
calbon closed T337213: Update to KServe 0.11 as Resolved.
Mar 5 2024, 3:38 PM · Machine-Learning-Team
calbon moved T359066: Add Licensing and Open Source requirement/strong preference to Lift Wing model deployment documentations from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 5 2024, 3:34 PM · Documentation, Software-Licensing, Machine-Learning-Team
calbon moved T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 5 2024, 3:30 PM · Machine-Learning-Team
calbon assigned T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images to elukey.
Mar 5 2024, 3:30 PM · Machine-Learning-Team
calbon set the point value for T359066: Add Licensing and Open Source requirement/strong preference to Lift Wing model deployment documentations to 1.
Mar 5 2024, 3:26 PM · Documentation, Software-Licensing, Machine-Learning-Team