
klausman (Tobias Klausmann)
User

User Details

User Since
Aug 31 2020, 9:52 AM (198 w, 5 d)
Availability
Available
LDAP User
Klausman
MediaWiki User
TKlausmann (WMF) [ Global Accounts ]

Recent Activity

Thu, Jun 20

klausman created P65229 (An Untitled Masterwork).
Thu, Jun 20, 1:23 PM
klausman created P65224 (An Untitled Masterwork).
Thu, Jun 20, 1:04 PM

Wed, Jun 19

klausman added a comment to T304483: PXE boot NIC firmware regression.

For anyone who wants to build the above binary from the Debian sources:

Wed, Jun 19, 3:22 PM · Infrastructure-Foundations, DC-Ops
klausman added a comment to T357415: Q3:rack/setup/install ml-staging2003.

Machine is imaged and running. The PXE boot was "fixed" by an ugly hack mentioned in T304483#9906962. While the firmware problem remains, at least we are unblocked on this host.

Wed, Jun 19, 3:12 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman updated the task description for T357415: Q3:rack/setup/install ml-staging2003.
Wed, Jun 19, 3:11 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman renamed T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet from hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet to hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
Wed, Jun 19, 2:23 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman added a comment to T304483: PXE boot NIC firmware regression.

So today I wanted to install ml-staging2003. This is a new SMC hardware type and it hits this problem.

Wed, Jun 19, 1:16 PM · Infrastructure-Foundations, DC-Ops
klausman added a comment to T357415: Q3:rack/setup/install ml-staging2003.

Information2
The server has only the SFT-OOB-LIC license, which is the Supermicro out-of-band (OOB) license allowing you to update only the BIOS and BMC/IPMI. This license will not allow you to upgrade the add-on 10G and 25G network interface I mentioned in (1). This is the reason we are getting the error "ERROR:Not licensed to perform this request. The following licenses DCMS were needed Click here to return".

The DCMS license, which is Supermicro's Data Center Management Suite license, will unlock some features such as the Supermicro Update Manager, which will then allow you to manage firmware, and the Redfish support feature, which is more for @Volans and @elukey for automation.

Hope all this information was helpful; please let me know if you have any questions.

Wed, Jun 19, 7:46 AM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops

Tue, Jun 18

klausman added a comment to T357415: Q3:rack/setup/install ml-staging2003.

It looks like the primary interface can't see the network device (the console shows "media test failure, check cable").

Tue, Jun 18, 3:26 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.

Machine is drained and off, so you're free to reseat memory etc. Let me know when it's back (and what we might be able to do if the memory remains problematic).

Tue, Jun 18, 2:38 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.

Current state:

Tue, Jun 18, 2:30 PM · Goal, Machine-Learning-Team
klausman created T367875: Reimage all ml-serve machines with Bookworm.
Tue, Jun 18, 1:22 PM · Machine-Learning-Team

Mon, Jun 17

klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.

Tuesday sounds good. I'll drain and shut down the machine on Tuesday at 17:00 CEST/15:00 UTC/10:00 CDT. Does that work for you?

Mon, Jun 17, 8:57 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
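
The three times quoted in the comment above can be cross-checked with a short sketch. It uses fixed UTC offsets for CEST (+2) and CDT (-5) rather than a time zone database, and it assumes that "Tuesday" means 2024-06-18, the day after the comment was posted; the names `drain_start` and `as_hhmm` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for the two summer-time zones named in the comment.
CEST = timezone(timedelta(hours=2), "CEST")
CDT = timezone(timedelta(hours=-5), "CDT")

# Assumption: "Tuesday" is 2024-06-18, the day after the comment was posted.
drain_start = datetime(2024, 6, 18, 17, 0, tzinfo=CEST)


def as_hhmm(moment, tz):
    """Return the wall-clock time of `moment` in time zone `tz` as HH:MM."""
    return moment.astimezone(tz).strftime("%H:%M")


print("UTC:", as_hhmm(drain_start, timezone.utc))
print("CDT:", as_hhmm(drain_start, CDT))
```

Under those assumptions the UTC and CDT values come out as 15:00 and 10:00, which matches the comment.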

Wed, Jun 12

klausman added a comment to T357415: Q3:rack/setup/install ml-staging2003.

One note: since the default OS has changed (Bullseye->Bookworm), I updated the ticket desc accordingly --- we definitely want Bookworm.

Wed, Jun 12, 12:31 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman updated the task description for T357415: Q3:rack/setup/install ml-staging2003.
Wed, Jun 12, 12:31 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops

Tue, Jun 11

klausman moved T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet from Unsorted to Watching on the Machine-Learning-Team board.
Tue, Jun 11, 2:46 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman moved T366688: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet from Unsorted to Watching on the Machine-Learning-Team board.
Tue, Jun 11, 2:46 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman moved T365842: Allow setting huggingfaceserver cmd args from deployment-charts from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Tue, Jun 11, 2:16 PM · Machine-Learning-Team
klausman moved T365479: Update kserve and knative-serving charts for new-style Calico network policies from In Progress to Ready To Go on the Machine-Learning-Team board.
Tue, Jun 11, 9:49 AM · Machine-Learning-Team, Patch-For-Review
klausman closed T365971: Tweak partman recipe for ML k8s workers as Resolved.
Tue, Jun 11, 9:45 AM · Machine-Learning-Team
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.

I repooled the machine just now, as I don't want to fly this close to capacity ceiling for prolonged periods.

Tue, Jun 11, 9:37 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops

Wed, Jun 5

klausman created T366688: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet.
Wed, Jun 5, 10:32 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.

Wed, Jun 5, 9:23 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.

Can't upload the ASR since it's too large. Anywhere that I should upload it to?

Wed, Jun 5, 8:59 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman added a comment to T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
Wed, Jun 5, 8:47 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman created T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet.
Wed, Jun 5, 8:45 AM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops

Tue, Jun 4

klausman closed T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines as Resolved.
Tue, Jun 4, 2:47 PM · Patch-For-Review, Machine-Learning-Team
klausman closed T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines, a subtask of T356256: Epic: Implement prototype inference service that uses Cassandra for request caching, as Resolved.
Tue, Jun 4, 2:47 PM · Epic, Machine-Learning-Team

Thu, May 30

klausman moved T365439: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL from Watching to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Thu, May 30, 2:05 PM · Machine-Learning-Team
klausman added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

I'm checking out the Lift Wing API to generate article descriptions, and I see the URL path contains a colon (:) character. The problem is that our native network stack always urlencodes colons, but when I hit the same API path with an encoded colon (%3A), it returns 404.
Is this expected? Can we make the API accept a urlencoded path?

Thu, May 30, 12:10 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team
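
The encoded-colon question in the T343123 comment above can be illustrated with Python's standard library. This is only a sketch: the path below is a hypothetical placeholder, not the real Lift Wing route, and `same_route` shows one way a server could treat both spellings as equal (decode before comparing), not how Lift Wing actually routes requests.

```python
from urllib.parse import quote, unquote

# Hypothetical path with a literal colon; the real API path is not given in the source.
raw_path = "/v1/models/example-model:predict"

# A client that percent-encodes everything except "/" turns ":" into "%3A".
encoded_path = quote(raw_path, safe="/")
print(encoded_path)


def same_route(path_a, path_b):
    """Compare two request paths after percent-decoding both."""
    return unquote(path_a) == unquote(path_b)


# A literal string comparison sees two different paths ...
print(raw_path == encoded_path)
# ... while a comparison after decoding sees the same route.
print(same_route(raw_path, encoded_path))
```

A router that matches on the undecoded string would return 404 for the encoded form, which is consistent with the behaviour reported in the comment.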
klausman moved T365971: Tweak partman recipe for ML k8s workers from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Thu, May 30, 9:34 AM · Machine-Learning-Team

Tue, May 28

klausman added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
  • we had another instance of high lat (eswiki)
  • logs show fetch features being slow (extract_cache)
  • we have a repor that should help with root-causing the matter
Tue, May 28, 3:05 PM · Goal, Machine-Learning-Team
klausman added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
  • Mistral crashlooping, startup checks usually 5m, so we bumped to 10m, but it didn't help
  • Bert model works, so likely Mistral issue
  • the kubelet partition increase for the install phase is in review
  • ml-staging1001 is now on Bookworm, dragonfly (distributed downloading of S3 stuff) needs to be bumped
  • with Bookworm, there are no longer GPU drivers on the base node (besides Debian kernel support); driver/library code lives in the Docker images instead
Tue, May 28, 3:04 PM · Goal, Machine-Learning-Team
klausman moved T365554: Run load tests for the rec-api-ng and update production resources to meet expected load from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, May 28, 2:28 PM · Machine-Learning-Team
klausman set the point value for T365554: Run load tests for the rec-api-ng and update production resources to meet expected load to 3.
Tue, May 28, 2:27 PM · Machine-Learning-Team
klausman moved T365581: Use multilingual revert risk model in Automoderator on supported wikis from Unsorted to Watching on the Machine-Learning-Team board.
Tue, May 28, 2:26 PM · Machine-Learning-Team, Automoderator, Moderator-Tools-Team
klausman moved T365701: Enable Revert Risk RecentChanges filter on id.wiki from Unsorted to Watching on the Machine-Learning-Team board.
Tue, May 28, 2:22 PM · Machine-Learning-Team, Growth-Team, Wikipedia-Android-App-Backlog, MediaWiki-Recent-changes, MediaWiki-extensions-ORES, Moderator-Tools-Team
klausman moved T365834: Append wikitech link and contact info to revscoring model servers from Unsorted to Ready To Go on the Machine-Learning-Team board.
Tue, May 28, 2:20 PM · Machine-Learning-Team
klausman set the point value for T365834: Append wikitech link and contact info to revscoring model servers to 1.
Tue, May 28, 2:20 PM · Machine-Learning-Team
klausman set the point value for T365842: Allow setting huggingfaceserver cmd args from deployment-charts to 2.
Tue, May 28, 2:17 PM · Machine-Learning-Team
klausman moved T365842: Allow setting huggingfaceserver cmd args from deployment-charts from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, May 28, 2:17 PM · Machine-Learning-Team
klausman set the point value for T365971: Tweak partman recipe for ML k8s workers to 2.
Tue, May 28, 2:16 PM · Machine-Learning-Team
klausman moved T365971: Tweak partman recipe for ML k8s workers from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, May 28, 2:16 PM · Machine-Learning-Team
klausman moved T366015: Add pydantic validation to revertrisk model in liftwing-python package from Unsorted to Ready To Go on the Machine-Learning-Team board.
Tue, May 28, 2:16 PM · Machine-Learning-Team

Mon, May 27

klausman added a subtask for T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services: T365971: Tweak partman recipe for ML k8s workers.
Mon, May 27, 9:13 AM · Goal, Machine-Learning-Team
klausman added a parent task for T365971: Tweak partman recipe for ML k8s workers: T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Mon, May 27, 9:13 AM · Machine-Learning-Team
klausman created T365971: Tweak partman recipe for ML k8s workers.
Mon, May 27, 9:12 AM · Machine-Learning-Team

May 22 2024

klausman added a comment to T365166: Update Pytorch base image to 2.3.0.
# build-production-images --select '*pytorch23*'
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/istio ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/cert-manager ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
#
May 22 2024, 4:50 PM · Machine-Learning-Team

May 21 2024

klausman added a comment to T365291: ml-serve2002 memory errors on DIMM_B1.

Repooled the machine:

$ sudo confctl select 'name=ml-serve2002.codfw.wmnet' set/pooled=yes
codfw/ml_serve/kubesvc/ml-serve2002.codfw.wmnet: pooled changed no => yes
WARNING:conftool.announce:conftool action : set/pooled=yes; selector: name=ml-serve2002.codfw.wmnet
May 21 2024, 2:38 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman set the point value for T365479: Update kserve and knative-serving charts for new-style Calico network policies to 5.
May 21 2024, 2:23 PM · Machine-Learning-Team, Patch-For-Review
klausman moved T365479: Update kserve and knative-serving charts for new-style Calico network policies from Unsorted to In Progress on the Machine-Learning-Team board.
May 21 2024, 2:22 PM · Machine-Learning-Team, Patch-For-Review
klausman closed T360894: Investigate temporary high latency in revscoring service for wikidata as Resolved.

Since this has not re-occurred, I am closing the task for now. If it happens again, we can always re-open.

May 21 2024, 2:22 PM · Machine-Learning-Team
klausman moved T360894: Investigate temporary high latency in revscoring service for wikidata from Ready To Go to 2023-2024 Q4 Done on the Machine-Learning-Team board.
May 21 2024, 2:22 PM · Machine-Learning-Team
klausman added a project to T365479: Update kserve and knative-serving charts for new-style Calico network policies: Machine-Learning-Team.
May 21 2024, 2:09 PM · Machine-Learning-Team, Patch-For-Review
klausman added a subtask for T287491: Allow to address Kubernetes API servers from NetworkPolicy: T365479: Update kserve and knative-serving charts for new-style Calico network policies.
May 21 2024, 2:05 PM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
klausman added a parent task for T365479: Update kserve and knative-serving charts for new-style Calico network policies: T287491: Allow to address Kubernetes API servers from NetworkPolicy.
May 21 2024, 2:05 PM · Machine-Learning-Team, Patch-For-Review
klausman created T365479: Update kserve and knative-serving charts for new-style Calico network policies.
May 21 2024, 2:05 PM · Machine-Learning-Team, Patch-For-Review

May 16 2024

klausman added a comment to T287491: Allow to address Kubernetes API servers from NetworkPolicy.

@klausman you can go ahead with kserve and knative-serving when you have some time, ping me for reviews etc

May 16 2024, 9:31 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

May 15 2024

klausman added a subtask for T362503: ORES doesn't work (at least for ru- and ukwiki): Unknown Object (Task).
May 15 2024, 2:22 PM · Patch-For-Review, Machine-Learning-Team, ORES

May 14 2024

klausman moved T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
May 14 2024, 2:22 PM · Patch-For-Review, Machine-Learning-Team
klausman added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
  • Connections from isvc namespaces on staging to the Cassandra machines now work, including TLS certs and SNI
  • Next step: have an actual inference service actually talk to the cache, likely with code from https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/995001
  • Still need to figure out long-term maintenance of Cassandra server-side config (users, passwords, namespaces, schemas); may hand off/soft-donate the machines to Data Persistence Team
May 14 2024, 2:20 PM · Goal, Machine-Learning-Team
klausman added a comment to T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines.

All the machinery is now in place to make connections to Cassandra from isvcs on staging (in the experimental NS):

May 14 2024, 2:09 PM · Patch-For-Review, Machine-Learning-Team
klausman created P62376 (An Untitled Masterwork).
May 14 2024, 9:07 AM

May 13 2024

klausman added a comment to T362984: GPU errors in hf image in ml-staging.

From https://man7.org/linux/man-pages/man2/access.2.html

May 13 2024, 4:18 PM · Lift-Wing, Machine-Learning-Team

May 10 2024

klausman added a comment to T362984: GPU errors in hf image in ml-staging.

I got an error when trying llm image locally with bullseye-torch2.3.0-rocm5.7 (related patch):

Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 9, in <module>
    import torch
  File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument

Haven't found any related open issue so I'm currently testing different pytorch versions to see if the issue still exists

May 10 2024, 1:35 PM · Lift-Wing, Machine-Learning-Team

May 7 2024

klausman added a comment to T363449: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons.

@klausman leaving the decision to you :) You can file path anytime, today we'll roll out the transparent proxy changes for eqiad and then we'll be ok to proceed with the new commons host header. Before proceeding I'd suggest to check if calls to the MW API can accept commons host header and URI paths, I don't think any rewrite is happening in upper layers but better safe than sorry!

May 7 2024, 2:32 PM · Patch-For-Review, Machine-Learning-Team
klausman set the point value for T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` to 2.
May 7 2024, 2:24 PM · Machine-Learning-Team, Cassandra
klausman closed T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` as Resolved.
May 7 2024, 2:23 PM · Machine-Learning-Team, Cassandra
klausman closed T362661: Create basic alerts for isvcs to catch outages as Resolved.
May 7 2024, 2:22 PM · Machine-Learning-Team, ORES
klausman closed T362661: Create basic alerts for isvcs to catch outages, a subtask of T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services, as Resolved.
May 7 2024, 2:22 PM · Goal, Machine-Learning-Team
klausman closed T362661: Create basic alerts for isvcs to catch outages, a subtask of T362503: ORES doesn't work (at least for ru- and ukwiki), as Resolved.
May 7 2024, 2:22 PM · Patch-For-Review, Machine-Learning-Team, ORES

Apr 30 2024

klausman added a parent task for T362663: Add slow-logs for ML isvcs: T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Apr 30 2024, 2:24 PM · Machine-Learning-Team, ORES
klausman added a subtask for T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services: T362663: Add slow-logs for ML isvcs.
Apr 30 2024, 2:24 PM · Goal, Machine-Learning-Team
klausman moved T362661: Create basic alerts for isvcs to catch outages from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Apr 30 2024, 9:11 AM · Machine-Learning-Team, ORES
klausman moved T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config` from Unsorted to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Apr 30 2024, 9:10 AM · Machine-Learning-Team, Cassandra

Apr 25 2024

klausman committed rMLIScee803ccb824: Makefile: cleanup and slight reorganization.
Apr 25 2024, 4:10 PM

Apr 24 2024

klausman added a project to T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config`: Machine-Learning-Team.
Apr 24 2024, 8:39 AM · Machine-Learning-Team, Cassandra
klausman added a comment to T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config`.

This has been implemented in change 1020194.

Apr 24 2024, 8:39 AM · Machine-Learning-Team, Cassandra

Apr 23 2024

klausman changed the point value for T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines from 1 to 5.
Apr 23 2024, 2:46 PM · Patch-For-Review, Machine-Learning-Team
klausman added a parent task for T362661: Create basic alerts for isvcs to catch outages: T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Apr 23 2024, 2:41 PM · Machine-Learning-Team, ORES
klausman added a subtask for T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services: T362661: Create basic alerts for isvcs to catch outages.
Apr 23 2024, 2:41 PM · Goal, Machine-Learning-Team
klausman moved T362661: Create basic alerts for isvcs to catch outages from Unsorted to In Progress on the Machine-Learning-Team board.
Apr 23 2024, 2:15 PM · Machine-Learning-Team, ORES
klausman set the point value for T362661: Create basic alerts for isvcs to catch outages to 1.
Apr 23 2024, 2:07 PM · Machine-Learning-Team, ORES

Apr 19 2024

klausman added a comment to T362749: Deploy logo-detection model-server to LiftWing staging.

This is a bug in keras: it tries to open the file with mode r+b (read and write, binary), but since the file is owned by another user (nobody vs. somebody), the call fails. Why keras would need to be able to write to the file, I don't know.

Apr 19 2024, 9:37 AM · Machine-Learning-Team
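
The failure mode described in the comment above can be reproduced without keras. Mode `r+b` asks the operating system for write access as well as read access, so it fails on a file the process may only read. The sketch below is an approximation: it simulates the "owned by another user" situation by making a temporary file read-only, and the names `can_open` and `demo` are illustrative, not part of keras or the model server. When run as root, the permission check is bypassed and both modes succeed.

```python
import os
import tempfile


def can_open(path, mode):
    """Return True if `path` can be opened with `mode`, False on PermissionError."""
    try:
        with open(path, mode):
            return True
    except PermissionError:
        return False


def demo():
    """Try read-only and read/write opens on a file that is not writable."""
    with tempfile.TemporaryDirectory() as tmp:
        model_path = os.path.join(tmp, "model.keras")  # placeholder file name
        with open(model_path, "wb") as handle:
            handle.write(b"placeholder")
        os.chmod(model_path, 0o444)  # read-only, standing in for a foreign-owned file
        result = {mode: can_open(model_path, mode) for mode in ("rb", "r+b")}
        os.chmod(model_path, 0o644)  # restore permissions before cleanup
        return result


print(demo())
```

Opening with `rb` is enough to load a file that is only read, which is why the `r+b` open looks unnecessary here.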
klausman added a comment to T362749: Deploy logo-detection model-server to LiftWing staging.

PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'

Apr 19 2024, 7:27 AM · Machine-Learning-Team

Apr 18 2024

klausman claimed T362661: Create basic alerts for isvcs to catch outages.
Apr 18 2024, 10:03 AM · Machine-Learning-Team, ORES

Apr 17 2024

klausman created P60791 (An Untitled Masterwork).
Apr 17 2024, 3:11 PM
klausman added a comment to T362661: Create basic alerts for isvcs to catch outages.

I've experimented a bit on Thanos, and arrived at this query:

Apr 17 2024, 9:26 AM · Machine-Learning-Team, ORES

Apr 16 2024

klausman claimed T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.
Apr 16 2024, 2:57 PM · Goal, Machine-Learning-Team
klausman added a comment to T362661: Create basic alerts for isvcs to catch outages.

Probably something like:

Apr 16 2024, 2:04 PM · Machine-Learning-Team, ORES
klausman updated the task description for T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config`.
Apr 16 2024, 1:27 PM · Machine-Learning-Team, Cassandra
klausman created T362649: Figure out a way to query Cassandra node IPs from `profile::kubernetes::deployment_server::global_config`.
Apr 16 2024, 12:28 PM · Machine-Learning-Team, Cassandra
klausman committed rMLIS42f6c02713aa: gitignore: Ignore my_venv/ and models/ directories.
Apr 16 2024, 11:02 AM

Apr 15 2024

klausman added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

Timeline (times in UTC):

Apr 15 2024, 1:10 PM · Patch-For-Review, Machine-Learning-Team, ORES
klausman added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

We have restarted an associated service and its logs show no more errors. It's not quite root-caused yet, but the functionality should be back to working order now. I have confirmed this for ruwiki.

Apr 15 2024, 8:37 AM · Patch-For-Review, Machine-Learning-Team, ORES

Mar 26 2024

klausman closed T359569: Investigate if it is possible to reduce torch's package size, a subtask of T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images, as Resolved.
Mar 26 2024, 2:47 PM · Machine-Learning-Team
klausman closed T359569: Investigate if it is possible to reduce torch's package size as Resolved.
Mar 26 2024, 2:47 PM · Machine-Learning-Team
klausman moved T359569: Investigate if it is possible to reduce torch's package size from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 26 2024, 2:46 PM · Machine-Learning-Team
klausman moved T360894: Investigate temporary high latency in revscoring service for wikidata from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 26 2024, 2:17 PM · Machine-Learning-Team
klausman set the point value for T360894: Investigate temporary high latency in revscoring service for wikidata to 3.
Mar 26 2024, 2:16 PM · Machine-Learning-Team