BWojtowicz-WMF (bwojtowicz)
User

User Details

User Since
May 6 2025, 11:26 AM (49 w, 1 d)
Availability
Available
LDAP User
Bartosz Wójtowicz
MediaWiki User
BWojtowicz-WMF [ Global Accounts ]

Recent Activity

Today

BWojtowicz-WMF added a comment to T421903: Investigate enabling gRPC in LiftWing model servers.

On enabling gRPC for ISVCs: our plan would be to use KServe's V2 Inference Protocol, which supports both gRPC and HTTP/REST interfaces. Currently, all our services are built with the V1 protocol in mind, which only supports an HTTP/REST interface.
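
As a rough illustration of the difference between the two protocols, the sketch below builds the HTTP request shape for each: V1 posts an `instances` list to `/v1/models/<name>:predict`, while V2 posts named, typed tensors to `/v2/models/<name>/infer`. The helper names, the tensor name `input-0` and the example values are assumptions made for illustration; they are not taken from the LiftWing services.

```python
import json


def v1_request(model_name, instances):
    """Build (path, body) for a KServe V1 predict call over HTTP/REST."""
    path = f"/v1/models/{model_name}:predict"
    return path, json.dumps({"instances": instances})


def v2_request(model_name, tensor_name, values, datatype="BYTES"):
    """Build (path, body) for a V2 (Open Inference Protocol) infer call."""
    path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": tensor_name,       # hypothetical tensor name
                "shape": [len(values)],    # one-dimensional input
                "datatype": datatype,
                "data": values,
            }
        ]
    }
    return path, json.dumps(body)


if __name__ == "__main__":
    print(v1_request("outlink-topic-model", [{"lang": "en", "page_id": 1}]))
    print(v2_request("outlink-topic-model", "input-0", ["en", "1"]))
```

The V2 body is more verbose because every input carries its name, shape and datatype; the same fields are what the gRPC variant of the protocol carries in its inference request message.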

Wed, Apr 15, 7:33 AM · Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Fri, Mar 27

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Fri, Mar 27, 1:22 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

What are you doing with the threshold argument? Are you late filtering the response from the inference service, or invoking the service with a threshold as the constraint? If the latter, is there any reason you couldn't late filter a cached response (i.e. is the cached response somehow constrained to a limited set of thresholds)?
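
The late-filtering option raised in this question can be sketched as follows: the cache stores the full, unfiltered prediction list once, and the threshold is applied when the response is read. The field names `topic` and `score`, the sample topics and the helper name are hypothetical; the real response schema is not shown in this thread.

```python
def late_filter(cached_predictions, threshold):
    """Return only the cached predictions whose score meets the threshold."""
    return [p for p in cached_predictions if p["score"] >= threshold]


# Hypothetical cached entry for one page, stored without any threshold applied.
cached = [
    {"topic": "Culture.Sports", "score": 0.91},
    {"topic": "Geography.Regions.Europe", "score": 0.42},
    {"topic": "STEM.Physics", "score": 0.07},
]

if __name__ == "__main__":
    print(late_filter(cached, 0.5))
    print(late_filter(cached, 0.1))
```

This only works if the cached entry was computed with a threshold at or below every threshold a caller may ask for; a cache populated with an already-filtered response cannot serve a lower threshold.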

Fri, Mar 27, 1:11 PM · User-Eevans, Data-Persistence

Thu, Mar 26

BWojtowicz-WMF added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

Okay, I've done a few not too technical sketches trying to visualize the issue we're facing.

Thu, Mar 26, 2:14 PM · User-Eevans, Data-Persistence

Wed, Mar 25

BWojtowicz-WMF added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

@Joe Hoarde's HTTP API only exposes wiki_id/page_id/revision_id parameters, which would cover the use-case for the Mobile Apps team. However, our service also exposes additional parameters (e.g. page_title, threshold) that some users rely on. On top of that, exposing HTTP API is extremely useful for us for development/debugging.
I think those would not be as problematic if we were building a new service with Hoarde in mind from the beginning; however, we're trying to integrate caching into existing services.

Wed, Mar 25, 3:02 PM · User-Eevans, Data-Persistence
BWojtowicz-WMF added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

I want to share a small update from our side on where we are.

Wed, Mar 25, 12:38 PM · User-Eevans, Data-Persistence

Tue, Mar 24

BWojtowicz-WMF added a comment to T420931: Load test current state of the Article Topic service.

I see the regime with >10s p99 latencies; however, it happened during the night and not while those tests were running. It seems to me that the Grafana numbers align well with the latencies reported above, see:

  1. page_id + lang requests: https://grafana.wikimedia.org/goto/cfgzhd4aveg3kf?orgId=1
  2. page_title + lang requests: https://grafana.wikimedia.org/goto/ffgzhfegn63uoc?orgId=1
  3. page_id + lang + revision_id requests: https://grafana.wikimedia.org/goto/bfgzhglvqmpdse?orgId=1
Tue, Mar 24, 1:06 PM · OKR-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T416475: Unify and improve load testing strategy for inference services.

When investigating T420931, I found that my custom async load test script achieves >300 RPS against the same service with 5 replicas, whereas the locust test against 1 replica reports only ~0.67 RPS. The discrepancy comes down to the Locust configuration:

Tue, Mar 24, 9:36 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work
BWojtowicz-WMF added a comment to T420931: Load test current state of the Article Topic service.

I'm sharing load test numbers measured against the production deployment on eqiad using the internal endpoint. I've made sure the responses return valid predictions, and I ran the load test after a few hours of cooldown to make sure the results are not skewed by caching on the MWAPI side.

Tue, Mar 24, 6:54 AM · OKR-Work, Machine-Learning-Team

Mon, Mar 23

BWojtowicz-WMF added a comment to T420931: Load test current state of the Article Topic service.

@Isaac The details of the cache, and how exactly it will be integrated with Article Topics, are still not fully decided. The approaches we have explored so far would work with page_id, whereas page_title requests would not go through the cache. This ticket does not take the cache into consideration; we're verifying how fast we can get without it. As a bonus, I can also check the page_title variant in this ticket so we'll have more context on it :)

Mon, Mar 23, 2:36 PM · OKR-Work, Machine-Learning-Team
BWojtowicz-WMF created T420931: Load test current state of the Article Topic service.
Mon, Mar 23, 2:06 PM · OKR-Work, Machine-Learning-Team

Tue, Mar 17

BWojtowicz-WMF added a comment to T418832: Deploy CoPE-A on LiftWing.

After lowering the maximum input token length to 4096, we seem to be able to process all incoming requests. I will figure out optimizations we could make to allow bigger input lengths, but the current 4096 token limit should already be good enough for testing our policies.

Tue, Mar 17, 9:06 AM · Patch-For-Review, Product Safety and Integrity, Machine-Learning-Team

Mar 16 2026

BWojtowicz-WMF added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

If I understand you correctly (and if I don't, please don't hesitate to correct me), you're arguing that we might have uses that can't be satisfied, which would force a product team to build an HTTP API to serve them, one that would otherwise have also worked as the lambda (while providing an example of a hypothetical use-case). Or put another way, that (a, above) we might have past use cases with extant HTTP APIs, and (b) we might have (unavoidable) future ones too.

Mar 16 2026, 3:11 PM · User-Eevans, Data-Persistence
BWojtowicz-WMF added a comment to T418832: Deploy CoPE-A on LiftWing.

After deployment, CoPE-A-9B model server was successfully processing small requests of less than 500 input tokens.

Mar 16 2026, 11:49 AM · Patch-For-Review, Product Safety and Integrity, Machine-Learning-Team
BWojtowicz-WMF added a comment to T418832: Deploy CoPE-A on LiftWing.

The CoPE-A-9B model is now deployed on LiftWing.

Mar 16 2026, 10:54 AM · Patch-For-Review, Product Safety and Integrity, Machine-Learning-Team

Mar 13 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Mar 13 2026, 2:37 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

That said, there's a practical concern on the ML Platform side and other non-ML services willing to integrate. The vast majority of our internal services communicate over HTTP, and our infra (mesh, ingress, routing) is built around that. To integrate with gRPC-only Hoarde, each service team would need to either deploy an adapter/proxy alongside their service, or deploy a gRPC-capable replica. I've prototyped the adapter approach and discussed the replica route with Luca - both are technically feasible. But as Hoarde would onboard more use-cases, I can imagine this becoming a pattern where every integrating service carries extra infrastructure just to bridge the protocol gap. So we’re shifting complexity and maintenance from Hoarde to its clients.

To me the above is the main concern, since almost every service at the foundation runs HTTP.

Is this because you a) envision the service being used to do caching for already implemented systems, b) as-yet-implemented systems that will invariably need an HTTP service anyway (and if so, why), or c) because using something other than HTTP generally imposes a burden that exceeds the benefits of using grpc (and if so, how)?

Mar 13 2026, 9:03 AM · User-Eevans, Data-Persistence

Mar 12 2026

BWojtowicz-WMF added a comment to T418832: Deploy CoPE-A on LiftWing.

Small update on the progress.

Mar 12 2026, 8:50 AM · Patch-For-Review, Product Safety and Integrity, Machine-Learning-Team

Mar 11 2026

BWojtowicz-WMF moved T400602: Investigate reference-need persistently unavailable replicas alert from Ready To Go to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Mar 11 2026, 12:51 PM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF closed T400602: Investigate reference-need persistently unavailable replicas alert as Resolved.
Mar 11 2026, 12:50 PM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T400602: Investigate reference-need persistently unavailable replicas alert.

Resolving this as it was a one-time incident, and the underlying concern about reference-need's high resource requests (22 CPUs, 6Gi memory) and its impact on cluster scheduling is now tracked as part of T414431, where we are optimizing resource utilization across all ISVCs.

Mar 11 2026, 12:50 PM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF moved T417860: Explore gpt-oss-safeguard-20b from Unsorted to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Mar 11 2026, 10:34 AM · Machine-Learning-Team
BWojtowicz-WMF closed T417860: Explore gpt-oss-safeguard-20b, a subtask of T418267: Q2 FY2025-26 Goal: Host a content policy evaluation model on LiftWing, as Resolved.
Mar 11 2026, 10:34 AM · Goal, Machine-Learning-Team
BWojtowicz-WMF closed T417860: Explore gpt-oss-safeguard-20b as Resolved.
Mar 11 2026, 10:34 AM · Machine-Learning-Team
BWojtowicz-WMF added a comment to T417860: Explore gpt-oss-safeguard-20b.

Closing this Task as exploration phase is complete. The key outcomes from this task:

Mar 11 2026, 10:34 AM · Machine-Learning-Team

Mar 9 2026

BWojtowicz-WMF updated subscribers of T414112: Deploy instance of hoarde as linked-artifacts(?) in k8s.

Hi all! Wanted to bring up a discussion that came up while working on the Article Topics integration with Hoarde. It's about the lambda interface protocol and whether gRPC should be the only supported method or whether we should consider HTTP. I've been discussing this with both Eric and Luca, and I think there are valid points on both sides so I wanted to open it up here.

Mar 9 2026, 9:40 AM · ServiceOps-Services-Oids, ServiceOps new, User-Eevans, Patch-For-Review, Data-Persistence

Mar 6 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Mar 6 2026, 11:33 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T418832: Deploy CoPE-A on LiftWing.

Update on quantization experiments

Mar 6 2026, 9:23 AM · Patch-For-Review, Product Safety and Integrity, Machine-Learning-Team

Mar 3 2026

BWojtowicz-WMF added a comment to T418832: Deploy CoPE-A on LiftWing.

I've managed to spin up the CoPE-A model on the ml-lab1002 machine on a single MI210 GPU and tested it with a sample request.

Mar 3 2026, 12:04 PM · Patch-For-Review, Product Safety and Integrity, Machine-Learning-Team

Feb 27 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Feb 27 2026, 1:50 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF updated the task description for T418350: Deploy gpt-oss-safeguard-20b on LiftWing.
Feb 27 2026, 12:39 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), OKR-Work
BWojtowicz-WMF updated the task description for T418350: Deploy gpt-oss-safeguard-20b on LiftWing.
Feb 27 2026, 12:38 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), OKR-Work

Feb 26 2026

BWojtowicz-WMF created T418493: Integrate Article Topic model with the new caching service.
Feb 26 2026, 3:02 PM · Patch-For-Review, OKR-Work, Machine-Learning-Team

Feb 25 2026

BWojtowicz-WMF moved T408068: Revertrisk multilingual fails locally when ran with docker compose from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Feb 25 2026, 11:51 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF created T418351: Optimize gpt-oss-safeguard-20b LiftWing deployment.
Feb 25 2026, 9:35 AM · Machine-Learning-Team (Q4 FY2025-26)
BWojtowicz-WMF created T418350: Deploy gpt-oss-safeguard-20b on LiftWing.
Feb 25 2026, 9:31 AM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), OKR-Work
BWojtowicz-WMF added a subtask for T418267: Q2 FY2025-26 Goal: Host a content policy evaluation model on LiftWing: T417860: Explore gpt-oss-safeguard-20b.
Feb 25 2026, 9:21 AM · Goal, Machine-Learning-Team
BWojtowicz-WMF added a parent task for T417860: Explore gpt-oss-safeguard-20b: T418267: Q2 FY2025-26 Goal: Host a content policy evaluation model on LiftWing.
Feb 25 2026, 9:21 AM · Machine-Learning-Team

Feb 19 2026

BWojtowicz-WMF closed T408068: Revertrisk multilingual fails locally when ran with docker compose as Resolved.
Feb 19 2026, 5:04 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T417860: Explore gpt-oss-safeguard-20b.

What have I done so far

Feb 19 2026, 8:42 AM · Machine-Learning-Team
BWojtowicz-WMF created T417860: Explore gpt-oss-safeguard-20b.
Feb 19 2026, 8:25 AM · Machine-Learning-Team

Feb 13 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Feb 13 2026, 8:54 AM · OKR-Work, Goal, Machine-Learning-Team

Feb 4 2026

BWojtowicz-WMF created T416475: Unify and improve load testing strategy for inference services.
Feb 4 2026, 1:38 PM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work

Jan 30 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Jan 30 2026, 10:11 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF moved T414431: Optimize resource utilization for InferenceServices on LiftWing cluster from Unsorted to Ready To Go on the Machine-Learning-Team board.
Jan 30 2026, 9:17 AM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF moved T414573: Extend Article Topics model to support `revision_id` parameter. from Unsorted to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Jan 30 2026, 9:11 AM · OKR-Work, Machine-Learning-Team
BWojtowicz-WMF moved T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. from Ready To Go to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Jan 30 2026, 9:11 AM · Lift-Wing, Machine-Learning-Team
BWojtowicz-WMF closed T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. as Resolved.
Jan 30 2026, 9:10 AM · Lift-Wing, Machine-Learning-Team
BWojtowicz-WMF closed T414573: Extend Article Topics model to support `revision_id` parameter., a subtask of T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review, as Resolved.
Jan 30 2026, 9:09 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF closed T414573: Extend Article Topics model to support `revision_id` parameter. as Resolved.
Jan 30 2026, 9:09 AM · OKR-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T414573: Extend Article Topics model to support `revision_id` parameter..

The new service supporting revision_id as an optional input parameter is live.
As expected, the queries using the revision_id parameter are ~4x slower due to the separate queries we need to make to catch QIDs linked to a specific revision of the page.
The API documentation is also updated now: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction.

Jan 30 2026, 9:09 AM · OKR-Work, Machine-Learning-Team

Jan 23 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Jan 23 2026, 12:42 PM · OKR-Work, Goal, Machine-Learning-Team

Jan 21 2026

BWojtowicz-WMF updated the task description for T414431: Optimize resource utilization for InferenceServices on LiftWing cluster.
Jan 21 2026, 1:05 PM · Essential-Work, Machine-Learning-Team

Jan 19 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Small Weekly Update

Jan 19 2026, 7:22 AM · OKR-Work, Goal, Machine-Learning-Team

Jan 14 2026

BWojtowicz-WMF created T414573: Extend Article Topics model to support `revision_id` parameter..
Jan 14 2026, 12:39 PM · OKR-Work, Machine-Learning-Team

Jan 13 2026

BWojtowicz-WMF added a comment to T414431: Optimize resource utilization for InferenceServices on LiftWing cluster.

I went through the utilization graphs of our InferenceServices and it seems there are a lot of CPU savings we could make, whereas memory is usually set quite reasonably with no major overcommitments.

Jan 13 2026, 1:14 PM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF created T414431: Optimize resource utilization for InferenceServices on LiftWing cluster.
Jan 13 2026, 11:21 AM · Essential-Work, Machine-Learning-Team

Jan 12 2026

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Update

Jan 12 2026, 8:02 AM · OKR-Work, Goal, Machine-Learning-Team

Dec 10 2025

BWojtowicz-WMF added a comment to T411758: Explore optimizations/scaling for Revise Tone Task Generator in LiftWing.

I'm coming with a small update from early experimentation results.

Dec 10 2025, 1:28 PM · Machine-Learning-Team

Dec 4 2025

BWojtowicz-WMF moved T408538: Create a Revise Tone Task Generator in LiftWing from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Dec 4 2025, 10:10 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF created T411758: Explore optimizations/scaling for Revise Tone Task Generator in LiftWing.
Dec 4 2025, 10:09 AM · Machine-Learning-Team

Nov 28 2025

BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

After some development time, the Revise Tone Task Generator service is happily running on LiftWing and is processing all edits on enwiki, ptwiki, frwiki and arwiki matching our topic criteria!
Looking at Istio Grafana Dashboard, we can see we're processing 1-2 requests per second with median response time of ~200ms and p95 response of 1s. This includes us ingesting data to Cassandra and sending the weighted tag update event.

Nov 28 2025, 1:43 PM · Patch-For-Review, Machine-Learning-Team

Nov 26 2025

BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

@elukey I think you might be right that it was the specificity of the Python code I've been using.
When sending the request in Python (via the requests library), I've been setting the header to 'Content-Type': 'application/json'. This _probably_ means it did not infer any other headers, but used only the ones I defined. If I don't define any headers, it will probably infer both Content-Type and Host correctly. Will check this! :D

Nov 26 2025, 2:59 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

@elukey The domains below are resolvable to the same IP, but when sending requests they all produced the same 502 error:

Nov 26 2025, 2:35 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Thank you for all of your help investigating and finding the solution to enable the pod-to-pod communication!
I'm very happy to confirm that the solution Luca suggested works and is already integrated in our production service. We use a combination of http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict as URL and outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local as Host header to communicate with the service.
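
A minimal sketch of that URL plus Host header combination, using the standard library's `urllib.request` instead of the requests library mentioned elsewhere in this task so the example has no third-party dependency. The helper name and the sample payload are assumptions; the request is only built here, not sent, since the address resolves only inside the cluster.

```python
import json
import urllib.request

URL = "http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict"
HOST = "outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local"


def build_request(payload):
    """Build a POST request that targets URL but sends an explicit Host header."""
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Host": HOST, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; the parameter names follow the ones discussed for this model.
req = build_request({"page_title": "Douglas_Adams", "lang": "en"})

if __name__ == "__main__":
    print(req.get_method(), req.full_url)
    print(req.header_items())
    # Inside the cluster the request could then be sent with:
    # urllib.request.urlopen(req, timeout=5)
```

Because the Host header is set explicitly, urllib does not derive it from the URL, which is the behaviour this workaround depends on.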

Nov 26 2025, 11:23 AM · Patch-For-Review, Machine-Learning-Team

Nov 21 2025

BWojtowicz-WMF added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Notes on connection issues discovered during development.

Nov 21 2025, 7:20 AM · Patch-For-Review, Machine-Learning-Team

Nov 14 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Update / Task on pause

Nov 14 2025, 11:13 AM · OKR-Work, Goal, Machine-Learning-Team

Nov 12 2025

BWojtowicz-WMF added a comment to T409850: Cassandra role & grants for Lift Wing isvc integration.

When the service starts, Lift Wing will validate whether the target table exists, so we'll need SELECT as well. @BWojtowicz-WMF, is it correct?

Nov 12 2025, 1:32 PM · Data-Persistence, Machine-Learning-Team

Nov 6 2025

BWojtowicz-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

for local workflows it might be good to have it in a docker compose

Nov 6 2025, 1:08 PM · Machine-Learning-Team

Nov 4 2025

BWojtowicz-WMF claimed T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..
Nov 4 2025, 10:45 AM · Lift-Wing, Machine-Learning-Team
BWojtowicz-WMF moved T401778: Evaluate adding caching mechanism for article topic model to make data available at scale from In Progress to Blocked on the Machine-Learning-Team board.
Nov 4 2025, 10:44 AM · Machine-Learning-Team
BWojtowicz-WMF claimed T408538: Create a Revise Tone Task Generator in LiftWing.
Nov 4 2025, 10:43 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF moved T404294: Merge articletopic outlink model transformer and predictor pods from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Nov 4 2025, 10:42 AM · Goal, Machine-Learning-Team

Oct 24 2025

BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

Thank you for helping and sharing all the logs!

Oct 24 2025, 9:25 AM · Patch-For-Review, Machine-Learning-Team

Oct 23 2025

BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

@jsn.sherman
Hmm this is very interesting, I could not reproduce it on my Mac machine yet. Can you share the exact commands that you are running?

Oct 23 2025, 2:24 PM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

I think I found the culprit - the issue stems from our base docker image, which contains the old version of typing_extensions preinstalled in /opt/lib/python/site-packages/typing_extensions.py. However, just adding the pin to typing_extensions==4.15.0 in requirements.txt does not solve the issue as I shared in https://phabricator.wikimedia.org/T408068#11301601.
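
One quick way to check which copy of a module Python would pick up is to ask the import system for the module's origin. This is a diagnostic sketch only; the helper name is made up, and the path it prints depends entirely on the image it runs in.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # In an affected image this might point at the preinstalled copy rather
    # than the version pinned in requirements.txt.
    print(module_origin("typing_extensions"))
```

If the printed path is the preinstalled file rather than the environment's site-packages, the pin in requirements.txt is being shadowed by import path order, not ignored by pip.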

Oct 23 2025, 8:40 AM · Patch-For-Review, Machine-Learning-Team
BWojtowicz-WMF added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

Looking into it! I can reproduce this issue on my machine. I’ve also confirmed that we luckily don’t encounter this issue on LiftWing, which is interesting.

Oct 23 2025, 8:26 AM · Patch-For-Review, Machine-Learning-Team

Oct 21 2025

BWojtowicz-WMF created T407843: Introduce re-try mechanisms for MW API requests in LiftWing models.
Oct 21 2025, 9:26 AM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work
BWojtowicz-WMF updated subscribers of T407784: LiftWing fiwiki-damaging model returning 500.

I've looked through our Logstash hunting for 500 errors for fiwiki-damaging in the last month. Indeed, in the last month we had 13 days where those errors occurred, ranging from 4 to 72 occurrences on those days. All of those are caused by LiftWing failing to fetch data from MW API due to 503 Service Unavailable error:

Oct 21 2025, 9:19 AM · Lift-Wing

Oct 14 2025

BWojtowicz-WMF closed T394301: Reimplement the model-upload script to take into consideration new use cases as Resolved.
Oct 14 2025, 1:40 PM · Essential-Work, Machine-Learning-Team

Oct 13 2025

BWojtowicz-WMF closed T407102: Update unit test assertion in article topic model as Resolved.
Oct 13 2025, 10:32 AM · Machine-Learning-Team
BWojtowicz-WMF created T407102: Update unit test assertion in article topic model.
Oct 13 2025, 9:50 AM · Machine-Learning-Team

Oct 10 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report

Oct 10 2025, 12:51 PM · OKR-Work, Goal, Machine-Learning-Team

Oct 2 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report
Sharing a day earlier as I'm OOO on 3rd of October.

Oct 2 2025, 1:49 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

@Eevans
Thank you very much for elaborating on the history and differences between those two. I was curious what kind of optimizations could be done there like the RAID10 storage and higher density, it's very interesting!
I agree that even if there are no major differences, we should still deploy our Cache in the RESTBase cluster, which is meant for this type of processing.

Oct 2 2025, 7:44 AM · User-Eevans, Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Oct 1 2025

BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

In this case I also agree that querying directly without Data Gateway would be the best option for us as well as deploying on RESTBase.

Oct 1 2025, 2:01 PM · User-Eevans, Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence
BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

On a somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better fit for the RESTBase cluster (RESTBase, like AQS, is a misnomer here, both are multi-tenant clusters). The AQS cluster is (or at least has been) geared more toward materialized representations, analytics, etc. The things persisting data there mostly follow an ETL pattern (even though we've talked about using event streams, and a more Lambda architecture). Most of what is there is time-series, or versioned, where data is written but not updated. The RESTBase cluster has primarily been for caching (and a bit of application state). Primarily caching alternate representations of content, but caching nonetheless. Those caches have been maintained by changeprop jobs, jobs that hit a service with a no-cache header, which then writes through to Cassandra... which sounds familiar?

Oct 1 2025, 7:27 AM · User-Eevans, Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Sep 30 2025

BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

@Ottomata @isarantopoulos
Thank you for the suggestion and discussion about using the wiki_id. The article model does not currently work for other wikis, but I very much like the idea of standardizing our DB schemas across different models to use page_id and wiki_id for indices.
To not alter the current API parameters to the model, which expects lang parameter, I've created a static lang->wiki_id mapping for each Wikipedia language, which will be used internally by our application code to translate between lang and wiki_id when interacting with cache.
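
A minimal sketch of such a static mapping, assuming wiki_id values in the usual database-name form (for example `enwiki`). The four entries and the helper name are illustrative only; the actual table covers every Wikipedia language.

```python
# Illustrative subset; the real mapping would list every supported language.
LANG_TO_WIKI_ID = {
    "en": "enwiki",
    "pt": "ptwiki",
    "fr": "frwiki",
    "ar": "arwiki",
}


def wiki_id_for(lang):
    """Translate the API's lang parameter into the wiki_id used as a cache key."""
    try:
        return LANG_TO_WIKI_ID[lang]
    except KeyError:
        raise ValueError(f"unsupported lang: {lang!r}") from None


if __name__ == "__main__":
    print(wiki_id_for("en"))
```

An explicit table is safer than appending "wiki" to the language code, since not every language code maps onto its database name that mechanically, and it makes unsupported languages fail loudly instead of producing a cache key that matches nothing.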

Sep 30 2025, 8:37 AM · User-Eevans, Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Sep 26 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report

Sep 26 2025, 12:25 PM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

@isarantopoulos I agree, I initially got scared when I saw the new response times on my local machine, but underestimated how much faster the requests are inside our cluster :D

Sep 26 2025, 9:10 AM · Lift-Wing, Machine-Learning-Team

Sep 25 2025

BWojtowicz-WMF added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

I've done a small analysis on performance implications of introducing the page_id parameter.
I've run the experiments on the statbox machines to more closely reflect the real communication time with Wikipedia servers; however, it might still not perfectly resemble the query performance when deployed on LiftWing.

Sep 25 2025, 1:58 PM · Lift-Wing, Machine-Learning-Team

Sep 24 2025

BWojtowicz-WMF renamed T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter. from [articletopic-outlink] fetch data from mwapi using revid instead of article title to [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..
Sep 24 2025, 11:59 AM · Lift-Wing, Machine-Learning-Team

Sep 23 2025

BWojtowicz-WMF added a comment to T404294: Merge articletopic outlink model transformer and predictor pods.

The merged architecture has been deployed on both staging and production clusters. It's also been tested by sending requests manually and verifying the responses are correct.

Sep 23 2025, 8:09 AM · Goal, Machine-Learning-Team

Sep 22 2025

BWojtowicz-WMF added a comment to T404294: Merge articletopic outlink model transformer and predictor pods.

In https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1187739, we've combined the transformer and predictor logic into a single pod. Now, the full processing is done by a single predictor pod.

Sep 22 2025, 8:50 AM · Goal, Machine-Learning-Team

Sep 19 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

Weekly Report

Sep 19 2025, 11:31 AM · OKR-Work, Goal, Machine-Learning-Team

Sep 18 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

could we agree on using the page_id parameter for the requests done in relation to Year in Review?

Understood, and yes, that sounds reasonable!

Sep 18 2025, 9:21 AM · OKR-Work, Goal, Machine-Learning-Team
BWojtowicz-WMF added a comment to T394301: Reimplement the model-upload script to take into consideration new use cases.

Yes, I would keep this task open until the documentation has been updated.

Sep 18 2025, 9:16 AM · Essential-Work, Machine-Learning-Team
BWojtowicz-WMF added a comment to T402984: Data Persistence Design Review: Article topic model caching.

Why do we need Cache

Sep 18 2025, 9:02 AM · User-Eevans, Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence

Sep 17 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title parameter? If so, that would be the best of both worlds, since I could envision scenarios where either variation would be useful.

Sep 17 2025, 7:00 AM · OKR-Work, Goal, Machine-Learning-Team

Sep 16 2025

BWojtowicz-WMF added a comment to T392833: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review.

We have one technical question about the way the Apps side will query our LiftWing model to retrieve the article topics. Currently, our LiftWing model expects users to pass page_title and lang parameters in POST requests to our model. The ML team is also considering adding a page_id parameter that could be used instead of page_title.
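
To illustrate the two request variants under discussion, here is a small sketch of a client-side payload builder. The parameter names come from this thread; the helper name and the rule that exactly one of page_title or page_id must be given are assumptions for the example, not confirmed service behaviour.

```python
import json


def build_payload(lang, page_title=None, page_id=None):
    """Build the JSON body for a topic request, identified by title or by page ID."""
    # Assumed rule for this sketch: exactly one identifier must be supplied.
    if (page_title is None) == (page_id is None):
        raise ValueError("pass exactly one of page_title or page_id")
    payload = {"lang": lang}
    if page_title is not None:
        payload["page_title"] = page_title
    else:
        payload["page_id"] = page_id
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_payload("en", page_title="Douglas_Adams"))
    print(build_payload("en", page_id=12345))
```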

Sep 16 2025, 10:10 AM · OKR-Work, Goal, Machine-Learning-Team