BWojtowicz-WMF (bwojtowicz)
User

User Details

User Since
May 6 2025, 11:26 AM (31 w, 5 d)
Availability
Available
LDAP User
Bartosz Wójtowicz
MediaWiki User
BWojtowicz-WMF [ Global Accounts ]

Recent Activity

Wed, Dec 10

BWojtowicz-WMF added a comment to T411758: Explore optimizations/scaling for Revise Tone Task Generator in LiftWing.

I'm coming with a small update from early experimentation results.

Wed, Dec 10, 1:28 PM · Machine-Learning-Team

Thu, Dec 4

BWojtowicz-WMF moved T408538: Create a Revise Tone Task Generator in LiftWing from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Thu, Dec 4, 10:10 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF created T411758: Explore optimizations/scaling for Revise Tone Task Generator in LiftWing.
Thu, Dec 4, 10:09 AM · Machine-Learning-Team

Fri, Nov 28

BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

After some development time, the Revise Tone Task Generator service is happily running on LiftWing and is processing all edits on enwiki, ptwiki, frwiki and arwiki matching our topic criteria!
Looking at the Istio Grafana dashboard, we can see we're processing 1-2 requests per second with a median response time of ~200ms and a p95 response time of 1s. This includes ingesting data into Cassandra and sending the weighted tag update event.

Fri, Nov 28, 1:43 PM · Patch-For-Review, Machine-Learning-Team

Wed, Nov 26

BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

@elukey I think you might be right that it was the specificity of the Python code I've been using.
When sending the request in Python (via the requests library), I've been setting the headers to 'Content-Type': 'application/json'. This _probably_ means that it did not infer any other headers and used only the ones I defined. If I don't define any headers, it will probably infer both Content-Type and Host correctly. Will check this! :D

Wed, Nov 26, 2:59 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

@elukey The domains below are resolvable to the same IP, but when sending requests they all produced the same 502 error:

Wed, Nov 26, 2:35 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Thank you for all of your help investigating and finding the solution to enable the pod-to-pod communication!
I'm very happy to confirm that the solution Luca suggested works and is already integrated in our production service. We use a combination of http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict as URL and outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local as Host header to communicate with the service.
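
The URL and Host header combination described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the production code: the function name is hypothetical, and the payload fields (page_title, lang) are assumed from the model parameters mentioned elsewhere in this feed.

```python
import json
import urllib.request

# URL and Host header values taken from the comment above.
PREDICT_URL = (
    "http://outlink-topic-model.articletopic-outlink"
    "/v1/models/outlink-topic-model:predict"
)
HOST_HEADER = "outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local"


def build_predict_request(page_title: str, lang: str) -> urllib.request.Request:
    """Build (but do not send) a POST request with an explicit Host header."""
    # Payload shape is an assumption based on the page_title/lang parameters.
    body = json.dumps({"page_title": page_title, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        PREDICT_URL,
        data=body,
        headers={"Host": HOST_HEADER, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_predict_request("Toni Morrison", "en")
    # Sending the request only works from inside the cluster, e.g.:
    # with urllib.request.urlopen(req, timeout=5) as resp: print(resp.read())
    print(req.full_url, req.get_header("Host"))
```

Building the request separately from sending it makes it easy to inspect exactly which headers will go out, which is the part that mattered in this investigation.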

Wed, Nov 26, 11:23 AM · Patch-For-Review, Machine-Learning-Team

Fri, Nov 21

BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Notes on connection issues discovered during development.

Fri, Nov 21, 7:20 AM · Patch-For-Review, Machine-Learning-Team

Nov 14 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Update / Task on pause

Nov 14 2025, 11:13 AM · OKR-Work, Goal, Machine-Learning-Team

Nov 12 2025

BWojtowicz-WMF added a comment to T409850: Cassandra role & grants for Lift Wing isvc integration.

When the service starts, Lift Wing will validate whether the target table exists, so we'll need SELECT as well. @BWojtowicz-WMF, is it correct?

Nov 12 2025, 1:32 PM · Data-Persistence, Machine-Learning-Team

Nov 6 2025

BWojtowicz-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

for local workflows it might be good to have it in a docker compose

Nov 6 2025, 1:08 PM · Machine-Learning-Team

Nov 4 2025

BWojtowicz-WMF claimed T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..
Nov 4 2025, 10:45 AM · Lift-Wing, Machine-Learning-Team
BWojtowicz-WMF moved T401778: Evaluate adding caching mechanism for article topic model to make data available at scale from In Progress to Blocked on the Machine-Learning-Team board.
Nov 4 2025, 10:44 AM · Machine-Learning-Team
BWojtowicz-WMF claimed T408538: Create a Revise Tone Task Generator in LiftWing.
Nov 4 2025, 10:43 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF moved T404294: Merge articletopic outlink model transformer and predictor pods from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Nov 4 2025, 10:42 AM · Goal, Machine-Learning-Team

Oct 24 2025

BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

Thank you for helping and sharing all the logs!

Oct 24 2025, 9:25 AM · Patch-For-Review, Machine-Learning-Team

Oct 23 2025

BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

@jsn.sherman
Hmm this is very interesting, I could not reproduce it on my Mac machine yet. Can you share the exact commands that you are running?

Oct 23 2025, 2:24 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

I think I found the culprit - the issue stems from our base docker image, which contains the old version of typing_extensions preinstalled in /opt/lib/python/site-packages/typing_extensions.py. However, just adding the pin to typing_extensions==4.15.0 in requirements.txt does not solve the issue as I shared in https://phabricator.wikimedia.org/T408068#11301601.
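
A quick way to confirm which copy of a module Python would actually import is to ask the import system for its origin. The helper below is a hypothetical diagnostic sketch, not something taken from the task:

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # In an image with a preinstalled copy, this shows whether that copy or
    # the one installed from requirements.txt wins on sys.path.
    print(module_origin("typing_extensions"))
```

Running it inside the container and comparing the printed path against /opt/lib/python/site-packages would show whether the preinstalled file shadows the pinned version.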

Oct 23 2025, 8:40 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

Looking into it! I can reproduce this issue on my machine. I’ve also confirmed that we luckily don’t encounter this issue on LiftWing, which is interesting.

Oct 23 2025, 8:26 AM · Patch-For-Review, Machine-Learning-Team

Oct 21 2025

BWojtowicz-WMF created T407843: Introduce re-try mechanisms for MW API requests in LiftWing models.
Oct 21 2025, 9:26 AM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF updated subscribers of T407784: LiftWing fiwiki-damaging model returning 500.

I've looked through our Logstash hunting for 500 errors for fiwiki-damaging in the last month. Indeed, we had 13 days where those errors occurred, ranging from 4 to 72 occurrences on those days. All of those are caused by LiftWing failing to fetch data from MW API due to 503 Service Unavailable error:
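
A retry wrapper of the kind proposed in T407843 could look roughly like the sketch below. It is only an illustration of retrying with exponential backoff: the exception class, function name and default values are hypothetical and do not come from the LiftWing code base.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientUpstreamError(Exception):
    """Hypothetical error type standing in for a 503 from the MW API."""


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientUpstreamError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientUpstreamError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

The sleep function is injectable so the backoff schedule can be checked in unit tests without actually waiting.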

Oct 21 2025, 9:19 AM · Lift-Wing

Oct 14 2025

BWojtowicz-WMF closed T394301: Reimplement the model-upload script to take into consideration new use cases as Resolved.
Oct 14 2025, 1:40 PM · Essential-Work, Machine-Learning-Team

Oct 13 2025

BWojtowicz-WMF closed T407102: Update unit test assertion in article topic model as Resolved.
Oct 13 2025, 10:32 AM · Machine-Learning-Team
BWojtowicz-WMF created T407102: Update unit test assertion in article topic model.
Oct 13 2025, 9:50 AM · Machine-Learning-Team

Oct 10 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report

Oct 10 2025, 12:51 PM · OKR-Work, Goal, Machine-Learning-Team

Oct 2 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report
Sharing a day earlier as I'm OOO on 3rd of October.

Oct 2 2025, 1:49 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

@Eevans
Thank you very much for elaborating on the history and differences between those two. I was curious what kind of optimizations could be done there like the RAID10 storage and higher density, it's very interesting!
I agree that even if there are no major differences, we should still deploy our Cache in the RESTBase cluster, which is meant for this type of processing.

Oct 2 2025, 7:44 AM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Oct 1 2025

BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In this case I also agree that querying directly without Data Gateway would be the best option for us as well as deploying on RESTBase.

Oct 1 2025, 2:01 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence
BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

On a somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better fit for the RESTBase cluster (RESTBase, like AQS, is a misnomer here, both are multi-tenant clusters). The AQS cluster is (or at least has been) geared more toward materialized representations, analytics, etc. The things persisting data there mostly follow an ETL pattern (even though we've talked about using event streams, and a more Lambda architecture). Most of what is there is time-series, or versioned, where data is written but not updated. The RESTBase cluster has primarily been for caching (and a bit of application state). Primarily caching alternate representations of content, but caching nonetheless. Those caches have been maintained by changeprop jobs, jobs that hit a service with a no-cache header, which then writes through to Cassandra... which sounds familiar?

Oct 1 2025, 7:27 AM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Sep 30 2025

BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

@Ottomata @isarantopoulos
Thank you for the suggestion and discussion about using the wiki_id. The article model does not currently work for other Wikis, but I very much like the idea of standardizing our DB schemas across different models to use page_id and wiki_id for indices.
To not alter the current API parameters to the model, which expects lang parameter, I've created a static lang->wiki_id mapping for each Wikipedia language, which will be used internally by our application code to translate between lang and wiki_id when interacting with cache.
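
The static lang->wiki_id translation described above can be pictured as a plain dictionary lookup. The sketch below is an assumption about its shape, not the actual application code; the mapping shows only an illustrative subset, and the function name is hypothetical.

```python
# Illustrative subset only; the real mapping covers every Wikipedia language.
LANG_TO_WIKI_ID = {
    "en": "enwiki",
    "pt": "ptwiki",
    "fr": "frwiki",
    "ar": "arwiki",
}


def lang_to_wiki_id(lang: str) -> str:
    """Translate the API's lang parameter into the wiki_id used by the cache."""
    try:
        return LANG_TO_WIKI_ID[lang]
    except KeyError:
        raise ValueError(f"unsupported lang: {lang!r}") from None
```

Keeping the translation internal means callers continue to send lang, while the cache schema can use wiki_id consistently with other models.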

Sep 30 2025, 8:37 AM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Sep 26 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report

Sep 26 2025, 12:25 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

@isarantopoulos I agree, I initially got scared when I saw the new response times on my local machine, but I underestimated how much faster the requests are inside our cluster :D

Sep 26 2025, 9:10 AM · Lift-Wing, Machine-Learning-Team

Sep 25 2025

BWojtowicz-WMF added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

I've done a small analysis on performance implications of introducing the page_id parameter.
I've run the experiments on the statbox machines to more closely reflect the real time of communication with Wikipedia servers; however, it might still not perfectly resemble the query performance when deployed on LiftWing.

Sep 25 2025, 1:58 PM · Lift-Wing, Machine-Learning-Team

Sep 24 2025

BWojtowicz-WMF renamed T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. from [articletopic-outlink] fetch data from mwapi using revid instead of article title to [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..
Sep 24 2025, 11:59 AM · Lift-Wing, Machine-Learning-Team

Sep 23 2025

BWojtowicz-WMF added a comment to T404294: Merge articletopic outlink model transformer and predictor pods .

The merged architecture has been deployed on both staging and production clusters. It's also been tested by sending requests manually and verifying the responses are correct.

Sep 23 2025, 8:09 AM · Goal, Machine-Learning-Team

Sep 22 2025

BWojtowicz-WMF added a comment to T404294: Merge articletopic outlink model transformer and predictor pods .

In https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1187739, we've combined the transformer and predictor logic into a single pod. Now, the full processing is done by a single predictor pod.

Sep 22 2025, 8:50 AM · Goal, Machine-Learning-Team

Sep 19 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report

Sep 19 2025, 11:31 AM · OKR-Work, Goal, Machine-Learning-Team

Sep 18 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

could we agree on using the page_id parameter for the requests done in relation to Year in Review?

Understood, and yes, that sounds reasonable!

Sep 18 2025, 9:21 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

Yes, I would keep this task open until the documentation has been updated.

Sep 18 2025, 9:16 AM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

Why do we need Cache

Sep 18 2025, 9:02 AM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Sep 17 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title parameter? If so, that would be the best of both worlds, since I could envision scenarios where either variation would be useful.

Sep 17 2025, 7:00 AM · OKR-Work, Goal, Machine-Learning-Team

Sep 16 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

We have 1 technical question about the way Apps side will query our LiftWing model to retrieve the article topics. Currently, our LiftWing model expects users to pass page_title and lang parameters in POST requests to our model. ML team is also considering adding a page_id parameter that could be used instead of page_title.

Sep 16 2025, 10:10 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

I've tested the option to use page_id in the model and found out that it's straightforward to modify the current outlinks query by using pageids=... instead of titles=.... This means we can easily allow both page_id and page_title options in our model via different POST arguments and use either pageids or titles in our query.
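
The switch between pageids and titles can be sketched as a small parameter builder. This is a hedged illustration: pageids and titles are real MediaWiki Action API parameters, but the function name and the remaining query parameters are assumptions, and the production outlinks query may be shaped differently.

```python
from typing import Dict, Optional


def build_outlinks_params(
    page_title: Optional[str] = None, page_id: Optional[int] = None
) -> Dict[str, str]:
    """Build MediaWiki Action API query parameters for one page's outlinks."""
    # Exactly one identifier must be given.
    if (page_title is None) == (page_id is None):
        raise ValueError("provide exactly one of page_title or page_id")
    params = {
        "action": "query",
        "format": "json",
        "prop": "links",  # assumed; the production query may differ
        "pllimit": "max",
    }
    if page_id is not None:
        params["pageids"] = str(page_id)
    else:
        params["titles"] = page_title
    return params
```

Rejecting requests that pass both or neither identifier keeps the two POST arguments unambiguous.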

Sep 16 2025, 9:52 AM · Lift-Wing, Machine-Learning-Team

Sep 15 2025

BWojtowicz-WMF created T404559: Revertrisk CPU utilization spike / unavailable replicas.
Sep 15 2025, 8:31 AM · Machine-Learning-Team

Sep 12 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Summary of progress:

Sep 12 2025, 11:57 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF moved T404294: Merge articletopic outlink model transformer and predictor pods from Unsorted to In Progress on the Machine-Learning-Team board.
Sep 12 2025, 8:30 AM · Goal, Machine-Learning-Team
BWojtowicz-WMF changed the status of T404294: Merge articletopic outlink model transformer and predictor pods , a subtask of T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review, from Open to In Progress.
Sep 12 2025, 8:20 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF changed the status of T404294: Merge articletopic outlink model transformer and predictor pods from Open to In Progress.
Sep 12 2025, 8:20 AM · Goal, Machine-Learning-Team

Sep 10 2025

BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Thank you for the discussion @Ottomata and @Eevans!

I think I'm leaning more towards storing all predictions under the key of wiki + page_title + model_version and omitting the threshold altogether from the Cache, leaving the prediction filtering to the application level. This indeed sounds to me like a much more flexible approach in the long term, and it also makes the data stored in the Cache easier to understand than using a binning strategy with different threshold or confidence_probability values.

The table schema would look like this:

CREATE TABLE articletopic_cache (
  page_title    text,
  wiki          text,
  model_version text,
  predictions   map<text, float>,  -- Maps 64 topics to predicted probability score
  last_updated  timestamp,
  PRIMARY KEY((wiki, page_title), model_version)
);
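
With every topic score stored in the predictions map, the threshold becomes a simple filter applied when a row is read. A minimal sketch of that application-level step, with a hypothetical function name and made-up scores:

```python
from typing import Dict


def filter_predictions(predictions: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep only topics whose cached score is at or above the threshold."""
    return {topic: score for topic, score in predictions.items() if score >= threshold}


if __name__ == "__main__":
    # Made-up scores for illustration; a real row would hold all 64 topics.
    cached = {"Culture.Literature": 0.91, "STEM.Physics": 0.07}
    print(filter_predictions(cached, threshold=0.5))
```

Because the cache keeps the full map, different consumers can apply different thresholds without the cache needing separate entries per threshold.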

My only worry for storing all 64 prediction topic+scores per row was the storage, but I might be unaware of some potential compression possibilities here.
My estimate was that a single entry storing 64 prediction topics+scores would be ~4.5 kilobytes based on the total number of characters. Scaling this out to 65 million articles would sum up to ~270GB of storage with no compression.
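
The arithmetic behind this estimate can be checked in a few lines. The sketch assumes 1 kilobyte = 1,000 bytes and ignores compression and per-row storage overhead, both of which are simplifications. Under those assumptions the total is 292.5 GB in decimal units, or roughly 272 GiB, which is in line with the ~270GB figure.

```python
BYTES_PER_ROW = 4_500        # ~4.5 kilobytes per cached entry (estimate above)
ARTICLE_COUNT = 65_000_000   # ~65 million articles

total_bytes = BYTES_PER_ROW * ARTICLE_COUNT
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

if __name__ == "__main__":
    print(f"{total_gb:.1f} GB, {total_gib:.1f} GiB")
```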

We have the headroom now (meaning: this isn't something we'd have to fast-track a purchase for or anything), so I think this is more or less a question of value. Whatever the actual utilization ends up being once we have a fully seasoned dataset, what will matter most is that we can rationalize the cost. In other words: is the value provided equal to or greater than the cost of the storage? If it's worth doing at all, my sense is that it's probably worth the cost of a few hundred gigs of storage. And optimizing utilization later seems like it'd be fairly trivial (i.e. regenerating the cache, overwriting existing entries with a smaller map of predictions).

Sep 10 2025, 9:49 AM · Machine-Learning-Team
BWojtowicz-WMF closed T383119: Update revertrisk to kserve 0.15.2, a subtask of T367048: Update kserve to 0.15.2, as Resolved.
Sep 10 2025, 7:18 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF closed T383119: Update revertrisk to kserve 0.15.2 as Resolved.
Sep 10 2025, 7:17 AM · Essential-Work, Patch-For-Review, Lift-Wing, Machine-Learning-Team

Sep 5 2025

BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Thank you for the discussion @Ottomata and @Eevans!

Sep 5 2025, 8:23 AM · Machine-Learning-Team

Sep 4 2025

BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Why is it more versatile?

@Eevans

Sep 4 2025, 9:34 AM · Machine-Learning-Team

Aug 29 2025

BWojtowicz-WMF created T403254: Article topic cache backfilling using article_topic hive table.
Aug 29 2025, 11:48 AM · Machine-Learning-Team
BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

I'm adding a high-level diagram of the Cache design including the backfilling process, interactions with LiftWing and its users.

Aug 29 2025, 9:29 AM · Machine-Learning-Team
BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Q: IIUC this is meant to be a 'query cache', rather than a more general purpose prediction cache, yes?

Aug 29 2025, 9:12 AM · Machine-Learning-Team

Aug 22 2025

BWojtowicz-WMF updated subscribers of T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

We've had a discussion meeting yesterday with @Eevans, @AikoChou and @klausman, thank you all for attending!
I'm sharing the notes below:

Aug 22 2025, 7:30 AM · Machine-Learning-Team

Aug 20 2025

BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Thank you for the explanations @Eevans! I see that I have some confusion around the existing Cassandra deployments, I'm sorry for that, but I'm happy that we can clear it up :)

Aug 20 2025, 8:59 AM · Machine-Learning-Team

Aug 19 2025

BWojtowicz-WMF updated subscribers of T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Thank you for the quick answers @Eevans! I'll schedule a call for us, where I will share the larger context, but I also think it'll be useful to continue the discussion in this ticket.

Aug 19 2025, 10:24 AM · Machine-Learning-Team

Aug 18 2025

BWojtowicz-WMF updated subscribers of T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Hello @Eevans @Marostegui! In relation to the work described in this ticket, we'd like to use the existing Cassandra deployment on the staging ML cluster to validate our design for the caching mechanism. In order to do that, we would need to create the needed keyspace/table and users in the Cassandra deployment. Once we've run tests and validated the idea in the staging environment, we would like to create a similar deployment in the production cluster.

Aug 18 2025, 10:14 AM · Machine-Learning-Team

Aug 15 2025

BWojtowicz-WMF updated the task description for T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.
Aug 15 2025, 5:56 AM · Machine-Learning-Team

Aug 14 2025

BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

I'm sharing my notes on the Cache design. Those are not final yet and feedback is hugely welcome on any of the points below!

Aug 14 2025, 2:32 PM · Machine-Learning-Team
BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Based on the estimates given in the message above and the discussion during the ML Technical Team Meeting, we've decided to go with adding a cache service to LiftWing and populating it with all ~65 million article topics. Later, each article topic request will hit this cache, allowing us to achieve much higher throughput, although the total processing time for Year in Review will still be in the span of weeks.

Aug 14 2025, 7:00 AM · Machine-Learning-Team
BWojtowicz-WMF updated the task description for T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.
Aug 14 2025, 6:29 AM · Machine-Learning-Team

Aug 13 2025

BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

High level questions to answer:

  1. What is the setup that could solve this problem?
     a) Add LiftWing caching with all articles - There are many unknowns, but it can potentially solve the problem.
     b) Use Data Gateway - Potentially solves part of the backfilling, but it is not a standard mode of operation; we cannot integrate it into LiftWing.
  2. How do we populate the data at scale?
  3. How do we serve the data at scale?
Aug 13 2025, 2:01 PM · Machine-Learning-Team
BWojtowicz-WMF added a comment to T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.

Caching POC

Aug 13 2025, 7:20 AM · Machine-Learning-Team
BWojtowicz-WMF updated the task description for T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.
Aug 13 2025, 6:48 AM · Machine-Learning-Team
BWojtowicz-WMF created T401778: Evaluate adding caching mechanism for article topic model to make data available at scale.
Aug 13 2025, 6:26 AM · Machine-Learning-Team

Aug 7 2025

BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

@elukey I think it sounds like a good compromise. I'll go this way, thank you!

Aug 7 2025, 1:12 PM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

@elukey I think the idea was that we'd provide a Makefile such that everybody could run just make model-upload ..., which would include installing the venv and running the script. However, this makefile would also need to be in PATH so that everyone can call this from anywhere.

Aug 7 2025, 12:28 PM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T400352: Upgrade readability model server from debian bullseye to bookworm.

@OKarakaya-WMF
I agree, let's put this in blocked until we update the catboost version in the upstream repository.

Aug 7 2025, 12:09 PM · Patch-For-Review, Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T356256: Epic: Implement prototype inference service that uses Cassandra for request caching.

I've started working towards the goal of making article topic data available at scale. One of the tasks in this goal is introducing a caching mechanism for the article topic model.

Aug 7 2025, 12:05 PM · Patch-For-Review, Epic, Machine-Learning-Team

Aug 1 2025

BWojtowicz-WMF closed T400351: Upgrade article-descriptions model servers from debian bullseye to bookworm, a subtask of T400144: Upgrade remaining model servers from debian bullseye to bookworm, as Resolved.
Aug 1 2025, 8:53 AM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF closed T400351: Upgrade article-descriptions model servers from debian bullseye to bookworm as Resolved.
Aug 1 2025, 8:53 AM · Essential-Work, Machine-Learning-Team

Jul 31 2025

BWojtowicz-WMF added a comment to T400606: Investigate `edit-check` returning empty responses.

We realized that the original issue happened by querying model via API Gateway at https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict, but previous experiments were being performed by querying the service from our VPC via https://inference.svc.eqiad.wmnet:30443/v1/models/edit-check:predict.

Jul 31 2025, 12:28 PM · Editing-team (Tracking), EditCheck, Machine-Learning-Team

Jul 30 2025

BWojtowicz-WMF added a comment to T400606: Investigate `edit-check` returning empty responses.

Small update: The connection reset errors shown above are "successful" requests, but they are not returning 200 status codes - they return 502 codes as expected. This is due to our service reaching its max capacity of around 40 requests per second during load tests. Thus, it's most likely unrelated to the reported bug.

Jul 30 2025, 11:29 AM · Editing-team (Tracking), EditCheck, Machine-Learning-Team

Jul 29 2025

BWojtowicz-WMF added a comment to T400606: Investigate `edit-check` returning empty responses.

I've updated the staging deployment of edit-check to be able to autoscale up to 3 replicas. I've re-run the load-testing script with the statistics shown below:

Jul 29 2025, 8:56 AM · Editing-team (Tracking), EditCheck, Machine-Learning-Team
BWojtowicz-WMF added a comment to T400606: Investigate `edit-check` returning empty responses.

I've run a load-test on the staging cluster with 10000 requests, each of which returned a proper non-empty response. The statistics are shown below.

Jul 29 2025, 7:15 AM · Editing-team (Tracking), EditCheck, Machine-Learning-Team
BWojtowicz-WMF added a comment to T400606: Investigate `edit-check` returning empty responses.

Unfortunately, it seems that we won't be able to retrieve the exact timestamps nor the number of failed requests as they were not logged in our general client-error-logging.

Jul 29 2025, 6:34 AM · Editing-team (Tracking), EditCheck, Machine-Learning-Team

Jul 28 2025

BWojtowicz-WMF created T400606: Investigate `edit-check` returning empty responses.
Jul 28 2025, 11:46 AM · Editing-team (Tracking), EditCheck, Machine-Learning-Team
BWojtowicz-WMF created T400602: Investigate reference-need persistently unavailable replicas alert.
Jul 28 2025, 10:22 AM · Essential-Work, Machine-Learning-Team

Jul 24 2025

BWojtowicz-WMF added a comment to T399437: revertrisk model servers should return a 400 response for non canonical language names.

@kevinbazira As you said, it looks like we're failing due to missing NumPy dependency. It's indeed not defined in our inference-services dependencies in src/models/revert_risk_model/model_server/multilingual/requirements.txt and also it's not defined in the knowledge-integrity dependencies.

Jul 24 2025, 1:45 PM · Lift-Wing, Machine-Learning-Team

Jul 22 2025

BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

We've discussed the points above in our ML Team Meeting, which resulted in the following plan:

Jul 22 2025, 3:02 PM · Essential-Work, Machine-Learning-Team

Jul 21 2025

BWojtowicz-WMF updated subscribers of T380722: Update kserve to v0.15.2* on ML clusters.

I've managed to spin up a local cluster with minikube, following our documentation. The documentation is a little outdated, thus I'll be updating it this week with the discovered improvements.
On my local cluster, I've installed new kserve version directly from kserve github charts and I could successfully deploy our services, which means there should be no dependency conflicts between new kserve version and our current setup.

Jul 21 2025, 11:12 AM · Essential-Work, Machine-Learning-Team, Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

As of this time, I've reimplemented the model-upload script in Python, tested its functionality and merged it into the puppet repository. However, it's not currently functional as I made one major mistake - I've implemented and tested it with the boto3==1.26.27 version, which is distributed via .deb package on Debian Bookworm, whereas our stat machines are based on Debian Bullseye, which distributes the older boto3==1.13.14 version.

Jul 21 2025, 9:06 AM · Essential-Work, Machine-Learning-Team

Jul 3 2025

BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

I've made the Python script for model-upload work with just urllib3 and boto3 as external dependencies, both of which are available as debian packages python3-boto3 and python3-urllib3. I also successfully tested the use-cases we want to support.

Jul 3 2025, 8:27 AM · Essential-Work, Machine-Learning-Team

Jun 30 2025

BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

Thank you both for the answers, this helps a lot! @elukey @gkyziridis

Jun 30 2025, 9:01 AM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

I've started work on this ticket and I've reimplemented the bash script in Python, where I take advantage of boto3 to handle connection to Swift/s3.
However, I'm facing the following questions at the moment:

Jun 30 2025, 7:44 AM · Essential-Work, Machine-Learning-Team

Jun 24 2025

BWojtowicz-WMF added a comment to T393865: Simplify pre-commit hooks within inference-services repository..

All of the work that has been planned for this task has been completed and merged 🎉

Jun 24 2025, 1:27 PM · Machine-Learning-Team

Jun 20 2025

BWojtowicz-WMF added a comment to T383119: Update revertrisk to kserve 0.15.2.

I've created an MR in the knowledge_integrity repository, which resolves dependency conflicts with the newest version of kserve: https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/54. I've tested the changes by updating the knowledge_integrity dependency in revert_risk to target this branch and updating kserve to 0.15.2. With those changes I can build the model locally and querying it works.

Jun 20 2025, 1:56 PM · Essential-Work, Patch-For-Review, Lift-Wing, Machine-Learning-Team

Jun 11 2025

BWojtowicz-WMF updated the task description for T393865: Simplify pre-commit hooks within inference-services repository..
Jun 11 2025, 8:44 AM · Machine-Learning-Team

May 21 2025

BWojtowicz-WMF added a comment to T393865: Simplify pre-commit hooks within inference-services repository..

Following up on the message above:

May 21 2025, 2:20 PM · Machine-Learning-Team

May 20 2025

BWojtowicz-WMF added a comment to T393865: Simplify pre-commit hooks within inference-services repository..

So far we've merged 2 patches:

  1. Removing isort, black and pyupgrade in favour of ruff.
  2. Enabling import sorting in the repository.
May 20 2025, 3:07 PM · Machine-Learning-Team

May 19 2025

BWojtowicz-WMF updated subscribers of T393865: Simplify pre-commit hooks within inference-services repository..

Second patch enabling import sorting within inference-services repo is ready for review.

May 19 2025, 9:38 AM · Machine-Learning-Team

May 16 2025

BWojtowicz-WMF added a comment to T393595: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz .

Thank you @BCornwall for the help!
I have all needed SSH access now, however I'm not sure about Kerberos - I did not receive any email with temporary password yet. Is there anything else I need to request besides the Kerberos identity?

May 16 2025, 9:41 AM · LDAP-Access-Requests, Machine-Learning-Team, SRE, SRE-Access-Requests

May 15 2025

BWojtowicz-WMF added a comment to T393865: Simplify pre-commit hooks within inference-services repository..

This task focuses on simplifying our pre-commit setup within inference-services repo. The plan is to:

  1. Remove isort, black and pyupgrade in favor of using ruff for all formatting, linting, upgrading syntax and import sorting. Reproduce current behavior as closely as possible.
  2. Update ruff to newer version.
  3. Remove unused dependencies.
  4. Enable import sorting in the repository.
  5. Evaluate current rules and change them to desirable.
May 15 2025, 1:40 PM · Machine-Learning-Team

May 12 2025

BWojtowicz-WMF created T393865: Simplify pre-commit hooks within inference-services repository..
May 12 2025, 7:47 AM · Machine-Learning-Team

May 9 2025

BWojtowicz-WMF added a comment to T393595: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz .

Hello @Eevans, I've temporarily added my public SSH key for prod on my user page User:BWojtowicz-WMF.

May 9 2025, 6:48 AM · LDAP-Access-Requests, Machine-Learning-Team, SRE, SRE-Access-Requests