Apr 15 2024

klausman added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

We have restarted an associated service, and its logs show no more errors. The problem is not quite root-caused yet, but the functionality should be back to working order now. I have confirmed this for ruwiki.

Apr 15 2024, 8:37 AM · Patch-For-Review, Machine-Learning-Team, ORES

Mar 26 2024

klausman closed T359569: Investigate if it is possible to reduce torch's package size, a subtask of T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images, as Resolved.
Mar 26 2024, 2:47 PM · Machine-Learning-Team
klausman closed T359569: Investigate if it is possible to reduce torch's package size as Resolved.
Mar 26 2024, 2:47 PM · Machine-Learning-Team
klausman moved T359569: Investigate if it is possible to reduce torch's package size from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 26 2024, 2:46 PM · Machine-Learning-Team
klausman moved T360894: Investigate temporary high latency in revscoring service for wikidata from Unsorted to Ready To Go on the Machine-Learning-Team board.
Mar 26 2024, 2:17 PM · Machine-Learning-Team
klausman set the point value for T360894: Investigate temporary high latency in revscoring service for wikidata to 3.
Mar 26 2024, 2:16 PM · Machine-Learning-Team

Mar 25 2024

klausman created T360894: Investigate temporary high latency in revscoring service for wikidata.
Mar 25 2024, 2:22 PM · Machine-Learning-Team
klausman added a comment to T360446: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent).

Found the drive as absent in iDRAC. Physically, the drive is there but is not blinking like the other drives.

For this one, the recommended remedy is to reseat the drive. Is that safe to do at this time?

Mar 25 2024, 1:07 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops

Mar 22 2024

klausman added a comment to T359569: Investigate if it is possible to reduce torch's package size.

During some experimentation with various approaches of generating the Docker images differently, and stripping out unneeded information, I have tried the following things:

Mar 22 2024, 1:40 PM · Machine-Learning-Team

Mar 19 2024

klausman moved T360446: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) from Unsorted to Watching on the Machine-Learning-Team board.
Mar 19 2024, 3:37 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman created T360446: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent).
Mar 19 2024, 3:34 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
klausman moved T358655: Set SLO for the article-descriptions isvc hosted on LiftWing from In Progress to Ready To Go on the Machine-Learning-Team board.
Mar 19 2024, 2:48 PM · Machine-Learning-Team
klausman removed a project from T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines: Epic.
Mar 19 2024, 2:37 PM · Patch-For-Review, Machine-Learning-Team
klausman moved T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 19 2024, 2:35 PM · Patch-For-Review, Machine-Learning-Team
klausman set the point value for T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines to 1.
Mar 19 2024, 2:35 PM · Patch-For-Review, Machine-Learning-Team
klausman created T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines.
Mar 19 2024, 1:12 PM · Patch-For-Review, Machine-Learning-Team

Mar 7 2024

klausman closed T340822: Revert Risk multi-lingual model performance and reliability may need a review, a subtask of T333453: Lift Wing improvements to get out of MVP state, as Resolved.
Mar 7 2024, 10:25 AM · Epic, Machine-Learning-Team
klausman closed T340822: Revert Risk multi-lingual model performance and reliability may need a review as Resolved.
Mar 7 2024, 10:25 AM · Machine-Learning-Team
klausman moved T340822: Revert Risk multi-lingual model performance and reliability may need a review from Watching to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 7 2024, 10:25 AM · Machine-Learning-Team

Mar 5 2024

klausman closed T358654: Create external endpoint for article-descriptions isvc hosted on LiftWing as Resolved.
Mar 5 2024, 3:59 PM · Patch-For-Review, Machine-Learning-Team
klausman moved T358655: Set SLO for the article-descriptions isvc hosted on LiftWing from Unsorted to In Progress on the Machine-Learning-Team board.
Mar 5 2024, 3:59 PM · Machine-Learning-Team
klausman moved T358654: Create external endpoint for article-descriptions isvc hosted on LiftWing from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 5 2024, 3:59 PM · Patch-For-Review, Machine-Learning-Team
klausman closed T358654: Create external endpoint for article-descriptions isvc hosted on LiftWing, a subtask of T358467: Move the article-descriptions model server from staging to production, as Resolved.
Mar 5 2024, 3:59 PM · Machine-Learning-Team
klausman moved T358742: Investigate InfServiceHighMemoryUsage for article-descriptions from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Mar 5 2024, 3:52 PM · Machine-Learning-Team
klausman closed T358742: Investigate InfServiceHighMemoryUsage for article-descriptions as Resolved.
Mar 5 2024, 3:51 PM · Machine-Learning-Team
klausman claimed T358742: Investigate InfServiceHighMemoryUsage for article-descriptions.
Mar 5 2024, 3:17 PM · Machine-Learning-Team
klausman claimed T358655: Set SLO for the article-descriptions isvc hosted on LiftWing.
Mar 5 2024, 3:17 PM · Machine-Learning-Team
klausman added a comment to T358467: Move the article-descriptions model server from staging to production.

The article-descriptions model server was firing InfServiceHighMemoryUsage alerts. This usually happens when an isvc uses >90% of its memory limit for 5 minutes. I have increased the memory limit for this model server from 4Gi to 5Gi so that prod can handle more isvc requests without running out of memory.
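
The alert condition described above (usage above 90% of the limit, sustained for 5 minutes) can be sketched roughly as follows. This is only an illustration of the idea, not the actual alert rule; the function name `alert_fires` and the sample format are assumptions made for this sketch.

```python
GIB = 1024 ** 3


def alert_fires(samples, limit_bytes, threshold=0.90, hold_seconds=300):
    """Return True if usage stayed above threshold * limit for hold_seconds.

    samples: iterable of (timestamp_seconds, used_bytes) pairs (hypothetical format).
    """
    breach_start = None
    for timestamp, used_bytes in sorted(samples):
        if used_bytes > threshold * limit_bytes:
            # Remember when the current breach began.
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= hold_seconds:
                return True
        else:
            # Usage dropped below the threshold, so the breach window resets.
            breach_start = None
    return False
```

With the same (hypothetical) usage samples, raising `limit_bytes` from `4 * GIB` to `5 * GIB` lowers the usage fraction, which is why a higher limit can stop the alert from firing.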

Mar 5 2024, 2:05 PM · Machine-Learning-Team
klausman added a comment to T358742: Investigate InfServiceHighMemoryUsage for article-descriptions.

This was indeed caused by using the wrong metric. We have chosen to move to using the existing k8s alerts.

Mar 5 2024, 2:03 PM · Machine-Learning-Team
klausman added a comment to T358654: Create external endpoint for article-descriptions isvc hosted on LiftWing.

And the external endpoint is live:

Mar 5 2024, 12:01 PM · Patch-For-Review, Machine-Learning-Team
klausman closed T354516: Requesting write access to ml-staging-codfw for ML team, a subtask of T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models, as Resolved.
Mar 5 2024, 11:51 AM · Goal, Machine-Learning-Team
klausman closed T354516: Requesting write access to ml-staging-codfw for ML team as Resolved.
Mar 5 2024, 11:51 AM · Patch-For-Review, SRE, Machine-Learning-Team
klausman closed T347262: Set SLO for the recommendation-api-ng service hosted on LiftWing, a subtask of T347015: Deploy the recommendation-api-ng on LiftWing, as Resolved.
Mar 5 2024, 11:50 AM · Machine-Learning-Team
klausman closed T347262: Set SLO for the recommendation-api-ng service hosted on LiftWing as Resolved.
Mar 5 2024, 11:50 AM · Machine-Learning-Team
klausman closed T349180: Discuss caching strategies for Lift Wing, a subtask of T348155: Goal: Decide on an optional Lift Wing caching strategy for model servers, as Resolved.
Mar 5 2024, 11:50 AM · Goal, Machine-Learning-Team
klausman closed T349180: Discuss caching strategies for Lift Wing as Resolved.
Mar 5 2024, 11:50 AM · Machine-Learning-Team, Lift-Wing

Feb 29 2024

klausman added a comment to T358742: Investigate InfServiceHighMemoryUsage for article-descriptions.

Hypothesis for why the other services never alerted: their base usage (container_memory_working_set_bytes) is much lower than the limit, and they don't do enough disk I/O to fill the page cache to the point where the combined metric (container_memory_usage_bytes) gets close to the limit.
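
To illustrate the difference between the two metrics: as far as I understand cAdvisor's definition, the working set is the total usage minus inactive file-backed (page cache) memory. The sketch below uses that relationship; the helper names and any numbers passed to them are hypothetical, not measurements from the service.

```python
MIB = 1024 ** 2


def working_set_bytes(usage_bytes, inactive_file_bytes):
    """Approximate the working set: usage minus inactive page cache, never below 0."""
    return max(usage_bytes - inactive_file_bytes, 0)


def over_threshold(value_bytes, limit_bytes, threshold=0.90):
    """True if value_bytes exceeds the given fraction of the limit."""
    return value_bytes > threshold * limit_bytes


def compare_metrics(usage_bytes, inactive_file_bytes, limit_bytes):
    """Show whether each metric would cross the threshold for the same container."""
    ws = working_set_bytes(usage_bytes, inactive_file_bytes)
    return {
        "usage_over": over_threshold(usage_bytes, limit_bytes),
        "working_set_over": over_threshold(ws, limit_bytes),
    }
```

A container with a lot of page cache can cross the threshold on the combined metric while its working set stays well below it, which is consistent with the hypothesis above.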

Feb 29 2024, 9:38 AM · Machine-Learning-Team
klausman added a comment to T358742: Investigate InfServiceHighMemoryUsage for article-descriptions.

I have found this:

Feb 29 2024, 8:58 AM · Machine-Learning-Team
klausman created T358742: Investigate InfServiceHighMemoryUsage for article-descriptions.
Feb 29 2024, 7:55 AM · Machine-Learning-Team

Feb 28 2024

klausman added a comment to T356256: Epic: Implement prototype inference service that uses Cassandra for request caching.
  • What is the schema selected for the data stored in Cassandra? We should document it here so people can find it, and probably discuss the replication strategy etc. (for example, do we want to eventually be able to replicate a write to eqiad in codfw and vice versa? Cassandra does a lot of things automatically, but they need to be stated).
Feb 28 2024, 1:44 PM · Epic, Machine-Learning-Team
klausman claimed T358654: Create external endpoint for article-descriptions isvc hosted on LiftWing.
Feb 28 2024, 12:14 PM · Patch-For-Review, Machine-Learning-Team

Feb 27 2024

klausman updated the task description for T357415: Q3:rack/setup/install ml-staging2003.
Feb 27 2024, 2:14 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman added a comment to T357415: Q3:rack/setup/install ml-staging2003.

I've updated the partman lines. I will update modules/profile/data/profile/installserver/preseed.yaml to include the new host in a moment, so standard imaging should pick the right recipe for the host.

Feb 27 2024, 2:06 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman updated the task description for T357415: Q3:rack/setup/install ml-staging2003.
Feb 27 2024, 2:05 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops
klausman added a comment to T358467: Move the article-descriptions model server from staging to production.

One addendum to the 'None has no attribute "shape"': this happened only once; the same request seconds later (and before!) worked just fine.

Feb 27 2024, 10:53 AM · Machine-Learning-Team
klausman added a comment to T358467: Move the article-descriptions model server from staging to production.

I just got an error when querying the service:

Feb 27 2024, 10:40 AM · Machine-Learning-Team

Feb 26 2024

klausman added a comment to T358467: Move the article-descriptions model server from staging to production.

I had missed pushing the admin_ng change. That is fixed, so pushing the model server config should work now.

Feb 26 2024, 2:49 PM · Machine-Learning-Team
klausman committed rLPRI7e5cc835fc6d: k8s: Add faux secrest for article-descriptions on Lift Wing.
k8s: Add faux secrest for article-descriptions on Lift Wing
Feb 26 2024, 1:44 PM

Feb 23 2024

klausman added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.
  • the model takes into account articles (the first paragraphs in our case) and short descriptions in all languages where the article is available.
Feb 23 2024, 3:32 PM · Wikipedia-Android-App-Backlog, Machine-Learning-Team

Feb 22 2024

klausman added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

Another option is using something like https://mobileapps.discovery.wmnet:4102/es.wikipedia.org/v1/page/summary/Madrid, so there would be neither RESTbase nor the REST API in the path, but I am seeing similar latencies there.

Feb 22 2024, 12:13 PM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
klausman added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

Note the wide variety of latencies, spanning from 118ms for "Coal" to more than 10x that for "Poetry". This indicates to me that any rigorous latency testing has to use a wide dataset of pages that the summaries are requested for.
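
A minimal sketch of what such a test harness could look like, assuming a caller-supplied `fetch(title)` function that requests the summary for one page. The names `measure` and `spread`, and the injectable `clock`, are assumptions made for this sketch and not part of any existing tooling.

```python
import statistics
import time


def measure(fetch, titles, repeats=3, clock=time.perf_counter):
    """Return the median latency in seconds per title.

    fetch: hypothetical callable that requests the summary for one title.
    """
    results = {}
    for title in titles:
        timings = []
        for _ in range(repeats):
            start = clock()
            fetch(title)
            timings.append(clock() - start)
        # The median is less sensitive to a single slow outlier than the mean.
        results[title] = statistics.median(timings)
    return results


def spread(per_title):
    """Ratio between the slowest and fastest median latency."""
    values = list(per_title.values())
    fastest = min(values)
    return float("inf") if fastest == 0 else max(values) / fastest
```

Running `measure` over a broad list of titles and looking at `spread` gives a quick sense of how much the result depends on which pages are sampled.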

Feb 22 2024, 11:11 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
klausman added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

And with a variety of pages requested:

Feb 22 2024, 11:10 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
klausman added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.
Feb 22 2024, 11:07 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
klausman added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

This is run from within the container article-descriptions-predictor-default-00025-deployment-5czmjql currently running on staging:

Feb 22 2024, 11:03 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team

Feb 14 2024

klausman updated the task description for T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw.
Feb 14 2024, 10:24 AM · DBA, ops-codfw, netops, Infrastructure-Foundations, SRE

Feb 13 2024

klausman updated the language for P56705 (An Untitled Masterwork) from autodetect to text.
Feb 13 2024, 4:00 PM
klausman created P56705 (An Untitled Masterwork).
Feb 13 2024, 4:00 PM
klausman closed T355757: Drain & shutdown ml-serve2005.codfw.wmnet for physical move, a subtask of T355437: Relocating servers out of A1 in codfw, as Resolved.
Feb 13 2024, 3:59 PM · Data-Persistence, SRE, ops-codfw
klausman closed T355757: Drain & shutdown ml-serve2005.codfw.wmnet for physical move as Resolved.
Feb 13 2024, 3:59 PM · Machine-Learning-Team
klausman closed T355759: Drain and silence ml-serve2002.codfw.wmnet, a subtask of T355544: Migrate hosts from codfw row A/B ASW to new LSW devices, as Resolved.
Feb 13 2024, 3:58 PM · ops-codfw, Infrastructure-Foundations, netops, SRE
klausman closed T355759: Drain and silence ml-serve2002.codfw.wmnet as Resolved.
Feb 13 2024, 3:58 PM · Machine-Learning-Team
klausman closed T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw, a subtask of T356661: Cross fleet runc upgrades, as Resolved.
Feb 13 2024, 3:58 PM · serviceops
klausman closed T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw as Resolved.
Feb 13 2024, 3:58 PM · Machine-Learning-Team
klausman updated the task description for T356256: Epic: Implement prototype inference service that uses Cassandra for request caching.
Feb 13 2024, 3:19 PM · Epic, Machine-Learning-Team
klausman updated the task description for T356256: Epic: Implement prototype inference service that uses Cassandra for request caching.
Feb 13 2024, 3:19 PM · Epic, Machine-Learning-Team

Feb 12 2024

klausman created P56687 (An Untitled Masterwork).
Feb 12 2024, 5:22 PM

Feb 8 2024

klausman moved T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Feb 8 2024, 2:20 PM · Machine-Learning-Team
klausman added a comment to T356661: Cross fleet runc upgrades.

ml-serve in codfw also done, so all done for ML team

Feb 8 2024, 2:20 PM · serviceops

Feb 7 2024

klausman closed T356873: Downtime ml-cache2001 for network link move as Resolved.

Downtime has been added.

Feb 7 2024, 2:45 PM · Machine-Learning-Team
klausman closed T356873: Downtime ml-cache2001 for network link move, a subtask of T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw, as Resolved.
Feb 7 2024, 2:45 PM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
klausman added a subtask for T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw: T356873: Downtime ml-cache2001 for network link move.
Feb 7 2024, 2:43 PM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
klausman added a parent task for T356873: Downtime ml-cache2001 for network link move: T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw.
Feb 7 2024, 2:43 PM · Machine-Learning-Team
klausman created T356873: Downtime ml-cache2001 for network link move.
Feb 7 2024, 2:42 PM · Machine-Learning-Team
klausman added a subtask for T356661: Cross fleet runc upgrades: T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw.
Feb 7 2024, 2:35 PM · serviceops
klausman added a parent task for T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw: T356661: Cross fleet runc upgrades.
Feb 7 2024, 2:35 PM · Machine-Learning-Team
klausman added a comment to T356661: Cross fleet runc upgrades.

ml-serve1xxx are all done.

Feb 7 2024, 2:33 PM · serviceops
klausman set the point value for T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw to 1.
Feb 7 2024, 2:32 PM · Machine-Learning-Team
klausman moved T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw from Unsorted to In Progress on the Machine-Learning-Team board.
Feb 7 2024, 2:19 PM · Machine-Learning-Team
klausman claimed T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw.
Feb 7 2024, 2:15 PM · Machine-Learning-Team
klausman created T356867: Do a rolling drain/undrain of LW k8s nodes in eqiad and codfw.
Feb 7 2024, 2:14 PM · Machine-Learning-Team
klausman moved T354516: Requesting write access to ml-staging-codfw for ML team from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Feb 7 2024, 2:00 PM · Patch-For-Review, SRE, Machine-Learning-Team
klausman added a comment to T354516: Requesting write access to ml-staging-codfw for ML team.

After dropping the version specifiers (/v...) at the end of the apiGroups directives, this is now working properly.
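
For context, Kubernetes RBAC rules expect bare API group names in apiGroups (for example "apps", or "" for the core group), not group/version strings. The sketch below shows the kind of normalization described above; the function name and the example group names in the comments are illustrative assumptions, not the values from the actual patch.

```python
import re


def strip_version(api_group):
    """Drop a trailing '/v...' version specifier from an API group string."""
    # e.g. "example.io/v1beta1" -> "example.io" (hypothetical group name)
    return re.sub(r"/v[^/]*$", "", api_group)


def normalize_rule(rule):
    """Return a copy of an RBAC rule dict with version-free apiGroups."""
    fixed = dict(rule)
    fixed["apiGroups"] = [strip_version(g) for g in rule.get("apiGroups", [])]
    return fixed
```

Groups that already lack a version suffix, including the empty string for the core group, pass through unchanged.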

Feb 7 2024, 1:59 PM · Patch-For-Review, SRE, Machine-Learning-Team
klausman added a comment to T356661: Cross fleet runc upgrades.

Roll-restart of the staging ML cluster is done; the eqiad and codfw prod clusters will follow today and tomorrow.

Feb 7 2024, 1:52 PM · serviceops

Feb 6 2024

klausman created P56346 (An Untitled Masterwork).
Feb 6 2024, 4:10 PM
klausman moved T356256: Epic: Implement prototype inference service that uses Cassandra for request caching from Unsorted to In Progress on the Machine-Learning-Team board.
Feb 6 2024, 3:32 PM · Epic, Machine-Learning-Team

Jan 31 2024

klausman updated the language for P55958 (An Untitled Masterwork) from autodetect to json.
Jan 31 2024, 1:09 PM
klausman created P55958 (An Untitled Masterwork).
Jan 31 2024, 1:09 PM
klausman created T356256: Epic: Implement prototype inference service that uses Cassandra for request caching.
Jan 31 2024, 10:17 AM · Epic, Machine-Learning-Team

Jan 30 2024

klausman moved T349180: Discuss caching strategies for Lift Wing from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Jan 30 2024, 3:44 PM · Machine-Learning-Team, Lift-Wing
klausman moved T347262: Set SLO for the recommendation-api-ng service hosted on LiftWing from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Jan 30 2024, 3:44 PM · Machine-Learning-Team
klausman added a comment to T356158: Emit revertrisk scores to statsd and plot in Grafana.

One question for clarification: what piece of software would be talking to statsd? RR as it runs on LW cannot access any statsd atm, since it is mostly isolated.

Jan 30 2024, 3:12 PM · Patch-For-Review, Machine-Learning-Team, User-kostajh, MediaWiki-extensions-WikimediaEvents, ORES
klausman updated the task description for T355866: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw.
Jan 30 2024, 9:49 AM · DBA, ops-codfw, netops, Infrastructure-Foundations, SRE

Jan 25 2024

klausman moved T355759: Drain and silence ml-serve2002.codfw.wmnet from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Jan 25 2024, 4:59 PM · Machine-Learning-Team
klausman added a comment to T355759: Drain and silence ml-serve2002.codfw.wmnet.

Downtime done and machine is back in service.

Jan 25 2024, 4:59 PM · Machine-Learning-Team
klausman added a comment to T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw.

Nice work. On our machine (ml-serve2002), it was but four seconds:

Jan 25 2024, 4:57 PM · Data-Persistence, ops-codfw, netops, Infrastructure-Foundations, SRE
klausman moved T355757: Drain & shutdown ml-serve2005.codfw.wmnet for physical move from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board.
Jan 25 2024, 3:46 PM · Machine-Learning-Team
klausman moved T355759: Drain and silence ml-serve2002.codfw.wmnet from Blocked to In Progress on the Machine-Learning-Team board.
Jan 25 2024, 3:46 PM · Machine-Learning-Team
klausman added a comment to T355757: Drain & shutdown ml-serve2005.codfw.wmnet for physical move.

Move complete, machine undrained.

Jan 25 2024, 3:45 PM · Machine-Learning-Team

Jan 24 2024

klausman added a comment to T355437: Relocating servers out of A1 in codfw.

ml-serve2005 is back up and working fine

Jan 24 2024, 5:07 PM · Data-Persistence, SRE, ops-codfw
klausman moved T355759: Drain and silence ml-serve2002.codfw.wmnet from Unsorted to Blocked on the Machine-Learning-Team board.
Jan 24 2024, 3:21 PM · Machine-Learning-Team