
Nov 14 2023

klausman updated the task description for T349619: Migrate roles to puppet7.
Nov 14 2023, 10:25 AM · Patch-For-Review, Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
klausman added a comment to T351114: Transient error while running lift wing topic model.

Luca has raised a few questions that may reveal relevant information:

Nov 14 2023, 10:15 AM · Product-Analytics, User-Iflorez, Machine-Learning-Team
klausman created P53382 (An Untitled Masterwork).
Nov 14 2023, 9:47 AM

Oct 24 2023

klausman created T349632: Add deprecation warnings to ORES-related repositories on Github.
Oct 24 2023, 2:54 PM · Patch-For-Review, ORES, Machine-Learning-Team
klausman updated the task description for T337213: Update to KServe 0.11.
Oct 24 2023, 10:14 AM · Machine-Learning-Team

Oct 18 2023

klausman added a parent task for T349180: Discuss caching strategies for Lift Wing: T348155: Goal: Decide on an optional Lift Wing caching strategy for model servers.
Oct 18 2023, 10:35 AM · Machine-Learning-Team, Lift-Wing
klausman added a subtask for T348155: Goal: Decide on an optional Lift Wing caching strategy for model servers: T349180: Discuss caching strategies for Lift Wing.
Oct 18 2023, 10:35 AM · Goal, Machine-Learning-Team
klausman created T349180: Discuss caching strategies for Lift Wing.
Oct 18 2023, 10:34 AM · Machine-Learning-Team, Lift-Wing

Oct 12 2023

klausman closed T339231: Expand the Lift Wing workers' kubelet partition as Resolved.
Oct 12 2023, 10:51 AM · Machine-Learning-Team

Oct 10 2023

klausman renamed T348298: Add revertrisk-language-agnostic to RecentChanges filters from Add revertrisk-multilingual to RecentChanges filters to Add revertrisk-language-agnostic to RecentChanges filters.
Oct 10 2023, 2:58 PM · MW-1.43-notes (1.43.0-wmf.2; 2024-04-23), MW-1.42-notes (1.42.0-wmf.16; 2024-01-30), Wikipedia-Android-App-Backlog, Growth-Team, MediaWiki-extensions-ORES, Machine-Learning-Team
klausman merged T348515: decommission ores100*.eqiad.wmnet into T348144: decommission ores{1001..1009}.eqiad.wmnet.
Oct 10 2023, 10:47 AM · SRE, ops-eqiad, Machine-Learning-Team, decommission-hardware
klausman merged task T348515: decommission ores100*.eqiad.wmnet into T348144: decommission ores{1001..1009}.eqiad.wmnet.
Oct 10 2023, 10:46 AM · SRE, ops-eqiad, decommission-hardware
klausman merged T348514: decommission ores200*.codfw.wmnet into T348462: decommission ores{2001..2009}.codfw.wmnet.
Oct 10 2023, 10:46 AM · SRE, ops-codfw, Machine-Learning-Team, decommission-hardware
klausman merged task T348514: decommission ores200*.codfw.wmnet into T348462: decommission ores{2001..2009}.codfw.wmnet.
Oct 10 2023, 10:46 AM · SRE, ops-codfw, decommission-hardware
klausman updated the task description for T348515: decommission ores100*.eqiad.wmnet.
Oct 10 2023, 10:43 AM · SRE, ops-eqiad, decommission-hardware
klausman updated the task description for T348514: decommission ores200*.codfw.wmnet.
Oct 10 2023, 10:43 AM · SRE, ops-codfw, decommission-hardware
klausman created T348515: decommission ores100*.eqiad.wmnet.
Oct 10 2023, 10:43 AM · SRE, ops-eqiad, decommission-hardware
klausman created T348514: decommission ores200*.codfw.wmnet.
Oct 10 2023, 10:42 AM · SRE, ops-codfw, decommission-hardware

Oct 5 2023

klausman updated the task description for T348144: decommission ores{1001..1009}.eqiad.wmnet.
Oct 5 2023, 10:00 AM · SRE, ops-eqiad, Machine-Learning-Team, decommission-hardware

Oct 4 2023

klausman updated the task description for T348144: decommission ores{1001..1009}.eqiad.wmnet.
Oct 4 2023, 2:46 PM · SRE, ops-eqiad, Machine-Learning-Team, decommission-hardware
klausman added a comment to T347278: Decommission ORES configurations and servers.

Filed T348144 for decomming.

Oct 4 2023, 1:41 PM · Patch-For-Review, Machine-Learning-Team
klausman created T348144: decommission ores{1001..1009}.eqiad.wmnet.
Oct 4 2023, 1:40 PM · SRE, ops-eqiad, Machine-Learning-Team, decommission-hardware
klausman added a comment to T347278: Decommission ORES configurations and servers.

After discussion on IRC, I have also shut down 1001 and 2001.

Oct 4 2023, 1:23 PM · Patch-For-Review, Machine-Learning-Team
klausman added a comment to T347278: Decommission ORES configurations and servers.

The machines ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet have been shut down (1001 and 2001 are still running in case we need files from them).

Oct 4 2023, 1:04 PM · Patch-For-Review, Machine-Learning-Team

Oct 2 2023

klausman added a comment to T347838: Add sha512 checksum files to all the ML's models in the public dir.

Do you think it would be useful to also keep the checksums in a different place, with permissions independent of the backing store behind the published/ directory?

Oct 2 2023, 11:15 AM · Machine-Learning-Team
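
The idea raised in the T347838 comment above (checksum files next to each model, plus optionally a second copy stored elsewhere) could be sketched roughly as follows. This is only an illustration: the function names, the `.sha512` suffix, and the optional `checksum_dir` location are assumptions, not the team's actual tooling or layout.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(model: Path, checksum_dir: Optional[Path] = None) -> Path:
    """Write a sha512sum-style line ('<digest>  <name>') for one model file.

    checksum_dir is a hypothetical second location with its own
    permissions; by default the file is written next to the model.
    """
    target_dir = checksum_dir if checksum_dir is not None else model.parent
    target_dir.mkdir(parents=True, exist_ok=True)
    out = target_dir / (model.name + ".sha512")
    out.write_text(f"{sha512_of(model)}  {model.name}\n")
    return out
```

Calling `write_checksum(model)` once without and once with `checksum_dir` would produce both the published copy and the independent copy the comment asks about.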
klausman added a comment to T334182: Deploy multilingual readability model to LiftWing.

SLO dashboard now available at: https://grafana-rw.wikimedia.org/d/slo-Lift_Wing_Readability/lift-wing-readability-slo-s?orgId=1

Oct 2 2023, 8:57 AM · Research, Machine-Learning-Team

Sep 29 2023

klausman moved T334182: Deploy multilingual readability model to LiftWing from In Progress to Complete Q3 2022/23 on the Machine-Learning-Team board.
Sep 29 2023, 2:58 PM · Research, Machine-Learning-Team

Sep 28 2023

klausman updated the task description for T347278: Decommission ORES configurations and servers.
Sep 28 2023, 3:03 PM · Patch-For-Review, Machine-Learning-Team

Sep 25 2023

klausman claimed T347278: Decommission ORES configurations and servers.
Sep 25 2023, 3:04 PM · Patch-For-Review, Machine-Learning-Team

Sep 20 2023

klausman added a comment to T339231: Expand the Lift Wing workers' kubelet partition.

I've done 1002-1008 today, and everything went smoothly. All done!

Sep 20 2023, 9:39 AM · Machine-Learning-Team
klausman moved T339231: Expand the Lift Wing workers' kubelet partition from Backlog/SRE to Complete Q3 2022/23 on the Machine-Learning-Team board.
Sep 20 2023, 9:39 AM · Machine-Learning-Team
klausman claimed T339231: Expand the Lift Wing workers' kubelet partition.
Sep 20 2023, 9:37 AM · Machine-Learning-Team

Sep 19 2023

klausman added a comment to T346144: Hardcode the SLO time windows in Grafana dashboards generated via Grizzly.

I am very much in favor of this scheme.

Sep 19 2023, 3:02 PM · SRE Observability (FY2023/2024-Q1), serviceops, observability

Sep 13 2023

klausman added a comment to T334182: Deploy multilingual readability model to LiftWing.

The service has been moved from the experimental namespace to readability in staging-codfw, and newly deployed to the same namespace in serve-codfw and -eqiad.

Sep 13 2023, 10:34 AM · Research, Machine-Learning-Team

Sep 12 2023

klausman added a comment to T346032: Elevate LiftWing access to WME tier for development and production environment.

One thing of note: after elevating the tier like Luca did yesterday, the token has to be re-issued using the webui to have the new limit baked into it.

Sep 12 2023, 3:01 PM · Wikimedia Enterprise, Machine-Learning-Team
klausman claimed T341693: Defined and measured SLO for every production service - COMPLETE.
Sep 12 2023, 2:19 PM · Goal, Machine-Learning-Team

Aug 22 2023

klausman added a comment to T339231: Expand the Lift Wing workers' kubelet partition.

2007 and 2008 are now also done, again without problems.

Aug 22 2023, 1:32 PM · Machine-Learning-Team
klausman committed rLPRIc63835e6367b: deployment_server: Add fake secrets fir LW readability isvc.
Aug 22 2023, 12:38 PM

Aug 21 2023

klausman added a comment to T339231: Expand the Lift Wing workers' kubelet partition.

Machines ml-serve2001-2006 are now done. Zero errors or irregularities. Will do 7 and 8 later this week.

Aug 21 2023, 8:22 AM · Machine-Learning-Team

Aug 15 2023

klausman added a comment to T339231: Expand the Lift Wing workers' kubelet partition.

(copied from T343900, this ticket is more appropriate for this info)

Aug 15 2023, 10:50 AM · Machine-Learning-Team
klausman added a comment to T343900: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99.

The problem is only really relevant for LLMs (Large Language Models), since they need more local disk space. Or at least the specific ones we tried did. We have plenty of disk space on our workers so far, so having a bigger kubelet partition/fs is quite feasible.

Aug 15 2023, 10:43 AM · Machine-Learning-Team, sre-alert-triage
klausman added a comment to T343900: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99.

I have done ml2002 and ml2003 today (two machines to force some pods back onto 2002, to see it works properly). So far, everything seems fine.

Aug 15 2023, 9:17 AM · Machine-Learning-Team, sre-alert-triage

Aug 14 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

Should be all clean now:

Aug 14 2023, 1:33 PM · Machine-Learning-Team

Aug 11 2023

klausman added a comment to T344051: Caching strategies for scores in Lift Wing.

Upsides of local-ish to LW caching (e.g. Cassandra):

Aug 11 2023, 11:33 AM · Machine-Learning-Team

Aug 10 2023

klausman added a comment to T327620: Define SLI/SLO for Lift Wing.

While making/experimenting with the SLO dashboard it became clear that the label cardinality of our input metrics is so high (>10k) that direct computation from the input metrics is not feasible: the result set is so large that we get at best partial results.

Aug 10 2023, 2:04 PM · Machine-Learning-Team
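
A common way to work around the cardinality problem described in the T327620 comment above is to pre-aggregate the raw series down to the few labels an SLO needs, and compute ratios from the reduced set. The excerpt does not say which approach was eventually taken, so the sketch below only illustrates the general technique; the label names (`service`, `pod`, `code`) and the sample data shape are hypothetical.

```python
from collections import defaultdict


def preaggregate(samples):
    """Collapse per-series request counts to (service, ok) totals.

    samples: iterable of (labels_dict, request_count). Every label
    except the hypothetical 'service' and 'code' is dropped, which is
    what removes the high-cardinality dimensions (pods, revisions, ...).
    """
    totals = defaultdict(float)
    for labels, value in samples:
        ok = labels["code"] == "200"
        totals[(labels["service"], ok)] += value
    return dict(totals)


def availability(totals, service):
    """Fraction of requests answered with 200, or None without traffic."""
    good = totals.get((service, True), 0.0)
    bad = totals.get((service, False), 0.0)
    total = good + bad
    return good / total if total else None
```

The ratio is then computed over two values per service instead of thousands of raw series, which keeps the result set small.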

Aug 8 2023

klausman moved T340822: Revert Risk multi-lingual model performance and reliability may need a review from In Progress to Watching on the Machine-Learning-Team board.
Aug 8 2023, 2:21 PM · Machine-Learning-Team

Aug 3 2023

klausman created T343446: Investigate high API latency on LW k8s.
Aug 3 2023, 1:01 PM · Machine-Learning-Team

Jul 26 2023

klausman created T342765: ML-Team will soon stop using LFS on Gerrit (for ORES deployment).
Jul 26 2023, 1:22 PM · git-lfs, Release-Engineering-Team (Radar), Gerrit, Machine-Learning-Team
klausman created T342735: Design/Feature discussion: return codes for LW services to signal "the revision doesn't exist".
Jul 26 2023, 9:50 AM · Machine-Learning-Team

Jul 25 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

I've also run the same test tool against RR-ML and got much worse latency overall (though no errors, which is great):

Jul 25 2023, 10:26 AM · Machine-Learning-Team

Jul 24 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

Luca and I have gotten to the bottom of where the 503s come from. It was ultimately caused by the autoscaling activators being underprovisioned (both replica-count and memory quota). After patches 940391 and 940391, the 503s are gone, even on somewhat stressful tests.

Jul 24 2023, 2:29 PM · Machine-Learning-Team

Jul 20 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

In the ingress gateway, these 503s look like this:

Jul 20 2023, 1:53 PM · Machine-Learning-Team
klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

I am still seeing (rare) 503s, and in the queue proxy pod this is logged:

Jul 20 2023, 1:18 PM · Machine-Learning-Team
klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

With 400(!) workers, for 5m:

Jul 20 2023, 12:14 PM · Machine-Learning-Team
klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

With the fixed version deployed to all clusters, I ran a load test again. Note that the throughput would likely be higher with more workers (i.e. it's limited by the client, not the inference service).

Jul 20 2023, 11:11 AM · Machine-Learning-Team
klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

I have done some testing today:

Jul 20 2023, 9:51 AM · Machine-Learning-Team

Jul 18 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

In an effort to solve the practical problem (getting good RR inference without too many errors and timeouts), I'll do some testing on the other RR model (agnostic) to see if it's suitable as an alternative (maybe temporarily).

Jul 18 2023, 3:03 PM · Machine-Learning-Team

Jul 17 2023

klausman added a comment to T327620: Define SLI/SLO for Lift Wing.

I've now also managed to add some latency bucketing stuff. Not 100% sure yet if it is what we want, but in any case, it's progress.

Jul 17 2023, 4:38 PM · Machine-Learning-Team
klausman added a comment to T327620: Define SLI/SLO for Lift Wing.

I made some progress on the experimental dashboard (https://grafana.wikimedia.org/goto/VSolQfj4k?orgId=1). I now have a better mental model of request count (and 200 vs non-200). The current setup is not ML-specific, but rather Istio-specific. So it may not apply to all k8s setups, but only those where Istio is used.

Jul 17 2023, 1:46 PM · Machine-Learning-Team
klausman added a comment to T327620: Define SLI/SLO for Lift Wing.

I am working on Grafana/Thanos directly for now because it's a shorter change-try loop to find the right metrics than doing it with Grizzly directly. Even with templating, we still need specific metrics to put into slo_definitions.libsonnet. Arguably, that's the most difficult part due to the sheer number of metrics surfaced even just by our own Prom instance.

Jul 17 2023, 9:15 AM · Machine-Learning-Team

Jul 14 2023

klausman added a comment to T327620: Define SLI/SLO for Lift Wing.

https://grafana.wikimedia.org/goto/x7S0HpjVk?orgId=1 I've started an SLO dashboard here. It only has one metric (Latency) so far, but it's a start.

Jul 14 2023, 3:30 PM · Machine-Learning-Team

Jul 13 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

Digging into the logging of the kserve and istio containers a bit, I have found a few things:

Jul 13 2023, 11:38 AM · Machine-Learning-Team

Jul 12 2023

klausman placed T341657: hw troubleshooting: iDrac stuck for ores2003.codfw.wmnet up for grabs.
Jul 12 2023, 2:23 PM · Machine-Learning-Team, DC-Ops
klausman reopened T341657: hw troubleshooting: iDrac stuck for ores2003.codfw.wmnet as "Open".

Correction: the iDrac is still down.

Jul 12 2023, 1:05 PM · Machine-Learning-Team, DC-Ops
klausman closed T341657: hw troubleshooting: iDrac stuck for ores2003.codfw.wmnet as Resolved.

The iDrac is reachable again, so it likely was a different issue.

Jul 12 2023, 8:53 AM · Machine-Learning-Team, DC-Ops
klausman created T341657: hw troubleshooting: iDrac stuck for ores2003.codfw.wmnet.
Jul 12 2023, 8:00 AM · Machine-Learning-Team, DC-Ops

Jul 3 2023

klausman added a comment to T340822: Revert Risk multi-lingual model performance and reliability may need a review.

shows URX error state, aka:

Jul 3 2023, 1:16 PM · Machine-Learning-Team

Jun 21 2023

klausman added a comment to T327620: Define SLI/SLO for Lift Wing.

Luca and I had a longer discussion via mail and IRC, about whether the backend-induced latency of an Inference Service should count towards the SLO budget or not.

Jun 21 2023, 1:05 PM · Machine-Learning-Team

Jun 20 2023

klausman created P49459 (An Untitled Masterwork).
Jun 20 2023, 3:26 PM
klausman created P49458 (An Untitled Masterwork).
Jun 20 2023, 3:05 PM
klausman created P49457 (An Untitled Masterwork).
Jun 20 2023, 2:56 PM
klausman added a comment to T328899: Add a new outlink topic stream for EventGate main.

Change 930610 has been pushed to prod, so now we get the full feed from changeprop.

Jun 20 2023, 12:45 PM · Data Engineering and Event Platform Team, Data-Engineering, Event-Platform, Machine-Learning-Team

Jun 15 2023

klausman added a comment to T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME.

I discussed the above questions with Luca today, and I think for now we can proceed with telling WME to start exploring the documentation we have (and tell us where there are gaps), and start testing against LiftWing/APIGW. This should surface any issues that might still be there, even if in the future the actual implementation of access to LW and rate limiting changes.

Jun 15 2023, 3:16 PM · API Platform, Machine-Learning-Team
klausman added a comment to T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME.

After some experimenting, this seems to be how rate limits are currently applied across API tokens, the API gateway and Lift Wing:

Jun 15 2023, 3:13 PM · API Platform, Machine-Learning-Team
klausman updated the language for P49439 (An Untitled Masterwork) from autodetect to diff.
Jun 15 2023, 9:55 AM
klausman created P49439 (An Untitled Masterwork).
Jun 15 2023, 9:55 AM

Jun 12 2023

klausman added a comment to T335480: Test KServe inference batching.

A few thoughts:

Jun 12 2023, 3:30 PM · Machine-Learning-Team, Epic

Jun 9 2023

klausman closed T338623: Can't delete images from docker registry (from build2001 using docker-registryctl) as Invalid.

This was caused by me using the wrong host.

Jun 9 2023, 3:21 PM · Machine-Learning-Team, serviceops
klausman created T338623: Can't delete images from docker registry (from build2001 using docker-registryctl).
Jun 9 2023, 3:12 PM · Machine-Learning-Team, serviceops

Jun 5 2023

klausman updated the task description for T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME.
Jun 5 2023, 8:38 AM · API Platform, Machine-Learning-Team
klausman created T338121: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME.
Jun 5 2023, 8:36 AM · API Platform, Machine-Learning-Team

May 31 2023

klausman changed the visibility for P48672 (An Untitled Masterwork).
May 31 2023, 3:13 PM
klausman removed a project from P48672 (An Untitled Masterwork): WMF-NDA.
May 31 2023, 3:12 PM
klausman created P48672 (An Untitled Masterwork).
May 31 2023, 2:52 PM
klausman added a subtask for T108027: Collect per-cgroup cpu/mem and other system level metrics: T337836: Cadvisor may be breaking Kubernetes worker nodes.
May 31 2023, 10:22 AM · SRE Observability (FY2023/2024-Q1), Patch-For-Review, Observability-Metrics, User-fgiunchedi, SRE
klausman added a parent task for T337836: Cadvisor may be breaking Kubernetes worker nodes: T108027: Collect per-cgroup cpu/mem and other system level metrics.
May 31 2023, 10:21 AM · serviceops-radar, Prod-Kubernetes, Kubernetes
klausman created T337836: Cadvisor may be breaking Kubernetes worker nodes.
May 31 2023, 10:21 AM · serviceops-radar, Prod-Kubernetes, Kubernetes

May 30 2023

klausman closed T337369: Shut down and deconfigure NLLB setup on AWS as Resolved.
May 30 2023, 2:59 PM · MinT, Machine-Learning-Team
klausman closed T337378: Fix Regular Expression in API GW config for revert risk as Resolved.
May 30 2023, 2:59 PM · Machine-Learning-Team

May 26 2023

klausman moved T337369: Shut down and deconfigure NLLB setup on AWS from In Progress to Complete Q3 2022/23 on the Machine-Learning-Team board.
May 26 2023, 9:02 AM · MinT, Machine-Learning-Team
klausman added a comment to T337369: Shut down and deconfigure NLLB setup on AWS.

After examining the setup some more, I figure I can delete the images on S3 as well, since they are easy enough to reproduce with the docs I have. So this has been completed.

May 26 2023, 9:02 AM · MinT, Machine-Learning-Team

May 25 2023

klausman moved T337213: Update to KServe 0.11 from Unsorted to Blocked on the Machine-Learning-Team board.
May 25 2023, 10:57 AM · Machine-Learning-Team
klausman moved T337369: Shut down and deconfigure NLLB setup on AWS from Unsorted to In Progress on the Machine-Learning-Team board.
May 25 2023, 10:57 AM · MinT, Machine-Learning-Team
klausman moved T337378: Fix Regular Expression in API GW config for revert risk from Unsorted to Complete Q3 2022/23 on the Machine-Learning-Team board.
May 25 2023, 10:56 AM · Machine-Learning-Team
klausman added a comment to T337378: Fix Regular Expression in API GW config for revert risk.

Changes have been merged and deployed. Both eqiad and codfw (and staging) sections of the API GW work fine (tested from within clusters), as well as remote (from my home machine).

May 25 2023, 10:56 AM · Machine-Learning-Team

May 24 2023

klausman created T337378: Fix Regular Expression in API GW config for revert risk.
May 24 2023, 9:48 AM · Machine-Learning-Team
klausman claimed T327241: Move the kserve custom helm chart to the upstream one.
May 24 2023, 9:10 AM · Machine-Learning-Team
klausman added a comment to T327241: Move the kserve custom helm chart to the upstream one.

This should probably wait until we have updated to kserve 0.11 (T337213); I will tackle this after that.

May 24 2023, 9:10 AM · Machine-Learning-Team
klausman added a comment to T337369: Shut down and deconfigure NLLB setup on AWS.

Instances have been shut down and S3 has been cleared of all but the latest checkpoint etc.

May 24 2023, 9:05 AM · MinT, Machine-Learning-Team
klausman changed the status of T337369: Shut down and deconfigure NLLB setup on AWS from Open to In Progress.
May 24 2023, 8:58 AM · MinT, Machine-Learning-Team