User Details
- User Since
- May 6 2025, 11:26 AM (49 w, 1 d)
- Availability
- Available
- LDAP User
- Bartosz Wójtowicz
- MediaWiki User
- BWojtowicz-WMF
Today
To speak to enabling gRPC for ISVCs, our plan would be to use KServe's V2 Inference Protocol, which supports both gRPC and HTTP/REST interfaces. Currently, all our services are built with the V1 protocol in mind, which only supports the HTTP/REST interface.
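For illustration, a V2 call over gRPC could look roughly like the sketch below. This uses the Triton gRPC client, which speaks the same V2 inference protocol; the endpoint, model name, and tensor shapes are assumptions for the example, not our actual services.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Hypothetical endpoint/model. Note that V2 gRPC uses typed, named tensors
# instead of the free-form JSON bodies our V1 services accept today.
client = grpcclient.InferenceServerClient(url="example-isvc.example.wmnet:8081")
inputs = [grpcclient.InferInput("INPUT0", [1, 4], "FP32")]
inputs[0].set_data_from_numpy(np.zeros((1, 4), dtype=np.float32))
result = client.infer(model_name="example-model", inputs=inputs)
print(result.as_numpy("OUTPUT0"))
```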
Fri, Mar 27
Weekly Update
What are you doing with the threshold argument? Are you late filtering the response from the inference service, or invoking the service with a threshold as the constraint? If the latter, is there any reason you couldn't late filter a cached response (i.e. is the cached response somehow constrained to a limited set of thresholds)?
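For clarity, "late filtering" here would mean something like the following sketch, assuming a cached response shaped as a list of scored predictions (field names are illustrative):

```python
def late_filter(cached_predictions: list[dict], threshold: float) -> list[dict]:
    # Apply the caller's threshold to an unfiltered cached response,
    # instead of constraining the cache to a fixed set of threshold values.
    return [p for p in cached_predictions if p["score"] >= threshold]
```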
Thu, Mar 26
Okay, I've done a few not-too-technical sketches to visualize the issue we're facing.
Wed, Mar 25
@Joe Hoarde's HTTP API only exposes the wiki_id/page_id/revision_id parameters, which would cover the use case for the Mobile Apps team. However, our service also exposes additional parameters (e.g. page_title, threshold) that some users rely on. On top of that, exposing an HTTP API is extremely useful for us for development/debugging.
I think those would not be as problematic if we were building a new service with Hoarde in mind from the beginning; however, we're trying to integrate caching into existing services.
I want to share a small update from our side on where we are.
Tue, Mar 24
I see the regime with >10s p99 latencies; however, it happened during the night and not while those tests were running. It seems to me that the Grafana numbers align well with the latencies reported above, see:
- page_id + lang requests: https://grafana.wikimedia.org/goto/cfgzhd4aveg3kf?orgId=1
- page_title + lang requests: https://grafana.wikimedia.org/goto/ffgzhfegn63uoc?orgId=1
- page_id + lang + revision_id requests: https://grafana.wikimedia.org/goto/bfgzhglvqmpdse?orgId=1
When investigating T420931, I found that my custom async load test script achieves >300 RPS against the same service with 5 replicas, whereas the Locust test against 1 replica reports only ~0.67 RPS. The discrepancy comes down to the Locust configuration:
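For context, the custom async script is roughly of the following shape (a sketch assuming aiohttp; the URL, payload, and concurrency values are illustrative stand-ins, not the actual test parameters):

```python
import asyncio
import time

import aiohttp

URL = "http://example-isvc.example.wmnet/v1/models/example:predict"  # stand-in
PAYLOAD = {"page_id": 12345, "lang": "en"}  # illustrative request body
CONCURRENCY = 50
TOTAL_REQUESTS = 3000

async def worker(session: aiohttp.ClientSession, n: int) -> None:
    # Each worker issues its share of requests sequentially; throughput
    # comes from running many workers concurrently.
    for _ in range(n):
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()

async def main() -> None:
    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        per_worker = TOTAL_REQUESTS // CONCURRENCY
        await asyncio.gather(*(worker(session, per_worker) for _ in range(CONCURRENCY)))
    elapsed = time.monotonic() - start
    print(f"{TOTAL_REQUESTS / elapsed:.1f} RPS over {elapsed:.1f}s")

asyncio.run(main())
```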
I'm sharing load test numbers measured against the production deployment on eqiad using the internal endpoint. I've made sure the responses return valid predictions, and I ran the load test after a few hours of cooldown to make sure results are not skewed by caching on the MWAPI side.
Mon, Mar 23
@Isaac The details of the cache and how exactly it will be implemented for Article Topics are still not fully decided. The current approaches we've explored would work with page_id, whereas page_title requests would not go through the cache. This ticket does not take the cache into consideration; instead, we're verifying how fast we can get without a cache. As a bonus, I can also check the page_title variant in this ticket so we'll have more context on it :)
Tue, Mar 17
After lowering the maximum input token length to 4096, we seem to be able to process all incoming requests. I will look into optimizations we could make to allow bigger input lengths, but the current 4096-token limit should already be good enough for testing our policies.
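As an illustration, the limit amounts to a guard of this shape (a sketch assuming a Hugging Face tokenizer; the model name is a placeholder, not the actual deployment):

```python
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 4096  # the current limit mentioned above

tokenizer = AutoTokenizer.from_pretrained("org/cope-a-9b")  # placeholder name

def check_input_length(text: str) -> None:
    # Reject requests whose tokenized length exceeds the configured maximum.
    n_tokens = len(tokenizer.encode(text))
    if n_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"input too long: {n_tokens} > {MAX_INPUT_TOKENS} tokens")
```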
Mar 16 2026
If I understand you correctly (and if I don't, please don't hesitate to correct me), you're arguing that we might have uses that can't be satisfied, which would force a product team to build an HTTP API to serve them, one that would otherwise also have worked as the lambda (and you gave a hypothetical use case as an example). Or, put another way: (a, above) we might have past use cases with extant HTTP APIs, and (b) we might have (unavoidable) future ones too.
After deployment, the CoPE-A-9B model server was successfully processing small requests of less than 500 input tokens.
The CoPE-A-9B model is now deployed on LiftWing.
Mar 13 2026
Weekly Update
Mar 12 2026
Small update on the progress.
Mar 11 2026
Resolving this, as it was a one-time incident. The underlying concern about reference-need's high resource requests (22 CPUs, 6Gi memory) and its impact on cluster scheduling is now tracked as part of T414431, where we are optimizing resource utilization across all ISVCs.
Closing this task as the exploration phase is complete. The key outcomes from this task:
Mar 9 2026
Hi all! I wanted to bring up a discussion that came up while working on the Article Topics integration with Hoarde. It's about the lambda interface protocol and whether gRPC should be the only supported method or whether we should also consider HTTP. I've been discussing this with both Eric and Luca, and I think there are valid points on both sides, so I wanted to open it up here.
Mar 6 2026
Weekly Update
Update on quantization experiments
Mar 3 2026
I've managed to spin up the CoPE-A model on the ml-lab1002 machine on a single MI210 GPU and tested it with a sample request.
Feb 27 2026
Weekly Update
Feb 26 2026
Feb 25 2026
Feb 19 2026
What have I done so far
Feb 13 2026
Weekly Update
Feb 4 2026
Jan 30 2026
Weekly Update
The new service supporting revision_id as an optional input parameter is live.
As expected, the queries using the revision_id parameter are ~4x slower due to the separate queries we need to make to fetch the QIDs linked to a specific revision of the page.
The API documentation is also updated now: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction.
Jan 23 2026
Weekly Update
Jan 21 2026
Jan 19 2026
Small Weekly Update
Jan 14 2026
Jan 13 2026
I went through the utilization graphs of our InferenceServices, and it seems there are a lot of CPU savings we could make, whereas memory is usually set quite reasonably, with no major overcommitment.
Jan 12 2026
Weekly Update
Dec 10 2025
Sharing a small update from early experimentation results.
Dec 4 2025
Nov 28 2025
After some development time, the Revise Tone Task Generator service is happily running on LiftWing and is processing all edits on enwiki, ptwiki, frwiki and arwiki matching our topic criteria!
Looking at the Istio Grafana dashboard, we can see we're processing 1-2 requests per second with a median response time of ~200ms and a p95 response time of 1s. This includes ingesting data into Cassandra and sending the weighted tag update event.
Nov 26 2025
@elukey I think you might be right that it was the specificity of the Python code I've been using.
When sending the request in Python (via the requests library), I've been setting the header to 'Content-Type': 'application/json'. This _probably_ means it did not infer any other headers and used only the ones I defined. If I don't define any headers, it will probably infer both Content-Type and Host correctly. Will check this! :D
@elukey The domains below are resolvable to the same IP, but when sending requests they all produced the same 502 error:
Thank you for all of your help investigating and finding the solution to enable the pod-to-pod communication!
I'm very happy to confirm that the solution Luca suggested works and is already integrated in our production service. We use a combination of http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict as the URL and outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local as the Host header to communicate with the service.
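In Python terms, the working call looks roughly like this (the URL and Host header are the actual values above; the payload is illustrative):

```python
import requests

url = "http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict"
headers = {
    # Route via the internal URL, while presenting the predictor's
    # cluster-local name as the Host header (the combination that worked here).
    "Host": "outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local",
    "Content-Type": "application/json",
}
payload = {"page_title": "Example", "lang": "en"}  # illustrative payload
resp = requests.post(url, json=payload, headers=headers, timeout=10)
print(resp.status_code, resp.json())
```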
Nov 21 2025
Notes on connection issues discovered during development.
Nov 14 2025
Update / Task on pause
Nov 12 2025
When the service starts, Lift Wing will validate whether the target table exists, so we'll need SELECT as well. @BWojtowicz-WMF, is that correct?
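A minimal sketch of what such a startup validation could look like with the Python Cassandra driver (host, keyspace, and table names are hypothetical):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra.example.wmnet"])  # hypothetical contact point
session = cluster.connect()

def validate_table(keyspace: str, table: str) -> None:
    # A probe like this needs SELECT permission on the target table,
    # in addition to whatever write grants the service already has.
    session.execute(f"SELECT * FROM {keyspace}.{table} LIMIT 1")

validate_table("article_topics", "predictions")  # hypothetical names
```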
Nov 6 2025
For local workflows it might be good to have it in a Docker Compose setup.
Nov 4 2025
Oct 24 2025
Thank you for helping and sharing all the logs!
Oct 23 2025
@jsn.sherman
Hmm, this is very interesting; I could not reproduce it on my Mac machine yet. Can you share the exact commands that you are running?
I think I found the culprit: the issue stems from our base Docker image, which contains an old version of typing_extensions preinstalled in /opt/lib/python/site-packages/typing_extensions.py. However, just adding a pin to typing_extensions==4.15.0 in requirements.txt does not solve the issue, as I shared in https://phabricator.wikimedia.org/T408068#11301601.
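A quick way to see the shadowing is to compare where the import actually resolves against what pip thinks is installed:

```python
import importlib.metadata

import typing_extensions

# If the imported file lives under /opt/lib/python/site-packages while pip
# reports the pinned 4.15.0, the preinstalled copy is shadowing the pin.
print(typing_extensions.__file__)
print(importlib.metadata.version("typing_extensions"))
```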
Looking into it! I can reproduce this issue on my machine. I’ve also confirmed that we luckily don’t encounter this issue on LiftWing, which is interesting.
Oct 21 2025
I've looked through our Logstash, hunting for 500 errors for fiwiki-damaging in the last month. Indeed, in the last month we had 13 days where those errors occurred, ranging from 4 to 72 occurrences per day. All of them are caused by LiftWing failing to fetch data from the MW API due to a 503 Service Unavailable error:
Oct 14 2025
Oct 13 2025
Oct 10 2025
Weekly Report
Oct 2 2025
Weekly Report
Sharing a day earlier as I'm OOO on the 3rd of October.
@Eevans
Thank you very much for elaborating on the history and differences between those two. I was curious what kind of optimizations could be done there, like the RAID10 storage and higher density; it's very interesting!
I agree that even if there are no major differences, we should still deploy our cache in the RESTBase cluster, which is meant for this type of processing.
Oct 1 2025
In this case, I also agree that querying directly without the Data Gateway would be the best option for us, as well as deploying on RESTBase.
On a somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better fit for the RESTBase cluster (RESTBase, like AQS, is a misnomer here; both are multi-tenant clusters). The AQS cluster is (or at least has been) geared more toward materialized representations, analytics, etc. The things persisting data there mostly follow an ETL pattern (even though we've talked about using event streams, and a more Lambda architecture). Most of what is there is time-series, or versioned, where data is written but not updated. The RESTBase cluster has primarily been for caching (and a bit of application state). Primarily caching alternate representations of content, but caching nonetheless. Those caches have been maintained by changeprop jobs, jobs that hit a service with a no-cache header, which then writes through to Cassandra... which sounds familiar?
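In code terms, that changeprop-style pattern is roughly the following write-through sketch (all names and the request shape are illustrative):

```python
def handle_request(page_id: int, headers: dict, cache, model) -> dict:
    # Write-through caching as described above: changeprop-style refresh jobs
    # send a no-cache header, forcing recomputation and a write back to the
    # Cassandra-backed cache; everyone else gets the cached value if present.
    if headers.get("Cache-Control") != "no-cache":
        cached = cache.get(page_id)
        if cached is not None:
            return cached
    prediction = model.predict(page_id)
    cache.set(page_id, prediction)  # write through to Cassandra
    return prediction
```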
Sep 30 2025
@Ottomata @isarantopoulos
Thank you for the suggestion and discussion about using wiki_id. The article model does not currently work for other wikis, but I very much like the idea of standardizing our DB schemas across different models to use page_id and wiki_id for indices.
To avoid altering the current API parameters of the model, which expects a lang parameter, I've created a static lang->wiki_id mapping for each Wikipedia language, which will be used internally by our application code to translate between lang and wiki_id when interacting with the cache.
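A sketch of the mapping's shape (the entries shown are examples; the real table covers every Wikipedia language):

```python
LANG_TO_WIKI_ID = {
    "en": "enwiki",
    "fr": "frwiki",
    "pt": "ptwiki",
    "ar": "arwiki",
    # ... one entry per Wikipedia language
}

def wiki_id_for(lang: str) -> str:
    # Translate the public `lang` API parameter to the internal wiki_id
    # used as a cache index, without changing the model's API surface.
    try:
        return LANG_TO_WIKI_ID[lang]
    except KeyError:
        raise ValueError(f"unsupported language: {lang}") from None
```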
Sep 26 2025
Weekly Report
@isarantopoulos I agree. I initially got scared when I saw the new response times on my local machine, but I underestimated how much faster the requests are inside our cluster :D
Sep 25 2025
I've done a small analysis of the performance implications of introducing the page_id parameter.
I ran the experiments on the statbox machines to more closely reflect the real communication time with the Wikipedia servers; however, the results might still not perfectly resemble the query performance when deployed on LiftWing.
Sep 24 2025
Sep 23 2025
The merged architecture has been deployed on both the staging and production clusters. It's also been tested by sending requests manually and verifying that the responses are correct.
Sep 22 2025
In https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1187739, we've combined the transformer and predictor logic into a single pod. Now, the full processing is done by a single predictor pod.
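Schematically, the merged server looks like a single KServe model class whose preprocess step absorbs the old transformer logic (a sketch; the class name and method bodies are illustrative, not the actual change):

```python
import kserve

class OutlinkTopicModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    async def preprocess(self, payload: dict, headers=None) -> dict:
        # Logic that previously ran in the separate transformer pod,
        # e.g. fetching the page's outlinks, now runs in-process.
        return payload

    async def predict(self, payload: dict, headers=None) -> dict:
        # The original predictor logic, operating on the preprocessed payload.
        return {"predictions": []}

if __name__ == "__main__":
    kserve.ModelServer().start([OutlinkTopicModel("outlink-topic-model")])
```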
Sep 19 2025
Weekly Report
Sep 18 2025
Yes, I would keep this task open until the documentation has been updated.
Why do we need a Cache
Sep 17 2025
When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title parameter? If so, that would be the best of both worlds, since I could envision scenarios where either variation would be useful.
Sep 16 2025
We have one technical question about the way the Apps side will query our LiftWing model to retrieve the article topics. Currently, our LiftWing model expects users to pass page_title and lang parameters in POST requests to our model. The ML team is also considering adding a page_id parameter that could be used instead of page_title.
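For reference, the two request shapes under discussion would look like this (a sketch: the endpoint path follows the Lift Wing API reference linked in an earlier update, auth headers are omitted, and the payload values are illustrative):

```python
import requests

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/outlink-topic-model:predict"

by_title = {"page_title": "Toni Morrison", "lang": "en"}  # current interface
by_id = {"page_id": 12345, "lang": "en"}  # proposed page_id variant

resp = requests.post(URL, json=by_title, timeout=10)
print(resp.json())
```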