User Details
- User Since
- May 6 2025, 11:26 AM (31 w, 5 d)
- Availability
- Available
- LDAP User
- Bartosz Wójtowicz
- MediaWiki User
- BWojtowicz-WMF [ Global Accounts ]
Wed, Dec 10
I'm coming with a small update from early experimentation results.
Thu, Dec 4
Fri, Nov 28
After some development time, the Revise Tone Task Generator service is happily running on LiftWing and is processing all edits on enwiki, ptwiki, frwiki and arwiki matching our topic criteria!
Looking at the Istio Grafana Dashboard, we can see we're processing 1-2 requests per second with a median response time of ~200ms and a p95 response time of 1s. This includes ingesting data into Cassandra and sending the weighted tag update event.
Wed, Nov 26
@elukey I think you might be right that it was the specificity of the Python code I've been using.
When sending the request in Python (via the requests library), I've been setting the header to 'Content-Type': 'application/json'. This _probably_ means it did not infer any other headers and used only the ones I defined. If I don't define any headers, it will probably infer both Content-Type and Host correctly. Will check this! :D
@elukey The domains below all resolve to the same IP, but when sending requests they all produced the same 502 error:
Thank you for all of your help investigating and finding the solution to enable the pod-to-pod communication!
I'm very happy to confirm that the solution Luca suggested works and is already integrated in our production service. We use a combination of http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict as URL and outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local as Host header to communicate with the service.
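For reference, the URL + Host header combination above can be sketched with the standard library; the payload fields here are illustrative assumptions, not the service's actual request schema.

```python
import json
import urllib.request

# Sketch of the pod-to-pod call described above. The payload fields are
# illustrative assumptions, not the service's actual request schema.
url = "http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict"
body = json.dumps({"page_title": "Warsaw", "lang": "en"}).encode()
req = urllib.request.Request(
    url,
    data=body,
    headers={
        # Overriding Host is what lets Istio route the request to the
        # right upstream even though the URL's hostname differs.
        "Host": "outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it from inside the cluster.
print(req.get_header("Host"))
```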
Fri, Nov 21
Notes on connection issues discovered during development.
Nov 14 2025
Update / Task on pause
Nov 12 2025
When the service starts, Lift Wing will validate whether the target table exists, so we'll need SELECT as well. @BWojtowicz-WMF, is that correct?
Nov 6 2025
for local workflows it might be good to have it in a docker compose
Nov 4 2025
Oct 24 2025
Thank you for helping and sharing all the logs!
Oct 23 2025
@jsn.sherman
Hmm this is very interesting, I could not reproduce it on my Mac machine yet. Can you share the exact commands that you are running?
I think I found the culprit - the issue stems from our base docker image, which contains the old version of typing_extensions preinstalled in /opt/lib/python/site-packages/typing_extensions.py. However, just adding the pin to typing_extensions==4.15.0 in requirements.txt does not solve the issue as I shared in https://phabricator.wikimedia.org/T408068#11301601.
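The shadowing mechanism behind this can be demonstrated with a stdlib-only sketch (the module name here is hypothetical): whichever copy appears earlier on sys.path wins, so pinning a newer version in requirements.txt alone can't beat a preinstalled copy on an earlier path entry.

```python
import sys
import tempfile
import pathlib

# Stdlib-only demonstration (the module name is hypothetical) of how a
# stale copy earlier on sys.path shadows a newer install later on it.
old_dir = pathlib.Path(tempfile.mkdtemp())
new_dir = pathlib.Path(tempfile.mkdtemp())
(old_dir / "shadow_demo.py").write_text("VERSION = 'old'\n")
(new_dir / "shadow_demo.py").write_text("VERSION = 'new'\n")

sys.path.insert(0, str(new_dir))
sys.path.insert(0, str(old_dir))  # plays the role of /opt/lib/python/site-packages

import shadow_demo
print(shadow_demo.VERSION)  # prints 'old': the earlier path entry wins
```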
Looking into it! I can reproduce this issue on my machine. I’ve also confirmed that we luckily don’t encounter this issue on LiftWing, which is interesting.
Oct 21 2025
I've searched our Logstash for 500 errors for fiwiki-damaging in the last month. Indeed, there were 13 days on which those errors occurred, ranging from 4 to 72 occurrences per day. All of them are caused by LiftWing failing to fetch data from the MW API due to a 503 Service Unavailable error:
Oct 14 2025
Oct 13 2025
Oct 10 2025
Weekly Report
Oct 2 2025
Weekly Report
Sharing a day earlier as I'm OOO on the 3rd of October.
@Eevans
Thank you very much for elaborating on the history and differences between those two. I was curious what kind of optimizations could be done there, like the RAID10 storage and higher density; it's very interesting!
I agree that even if there are no major differences, we should still deploy our Cache in the RESTBase cluster, which is meant for this type of processing.
Oct 1 2025
In this case I also agree that querying directly without Data Gateway would be the best option for us as well as deploying on RESTBase.
On a somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better fit for the RESTBase cluster (RESTBase, like AQS, is a misnomer here; both are multi-tenant clusters). The AQS cluster is (or at least has been) geared more toward materialized representations, analytics, etc. The things persisting data there mostly follow an ETL pattern (even though we've talked about using event streams, and a more Lambda architecture). Most of what is there is time-series, or versioned, where data is written but not updated. The RESTBase cluster has primarily been for caching (and a bit of application state). Primarily caching alternate representations of content, but caching nonetheless. Those caches have been maintained by changeprop jobs, jobs that hit a service with a no-cache header, which then writes through to Cassandra... which sounds familiar?
Sep 30 2025
@Ottomata @isarantopoulos
Thank you for the suggestion and discussion about using the wiki_id. The article model does not currently work for other wikis, but I very much like the idea of standardizing our DB schemas across different models to use page_id and wiki_id for indices.
To avoid altering the current API parameters of the model, which expects a lang parameter, I've created a static lang->wiki_id mapping for each Wikipedia language; it will be used internally by our application code to translate between lang and wiki_id when interacting with the cache.
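The translation layer can be sketched as below; the real mapping covers every Wikipedia language, and these four entries are just examples.

```python
# Illustrative sketch of the static lang -> wiki_id mapping described
# above; the real mapping covers every Wikipedia language.
LANG_TO_WIKI_ID = {
    "en": "enwiki",
    "pt": "ptwiki",
    "fr": "frwiki",
    "ar": "arwiki",
}
WIKI_ID_TO_LANG = {wiki_id: lang for lang, wiki_id in LANG_TO_WIKI_ID.items()}

def cache_key(lang, page_id):
    """Translate the API-facing lang into the wiki_id used to index the cache."""
    return (LANG_TO_WIKI_ID[lang], page_id)

print(cache_key("en", 12345))  # ('enwiki', 12345)
```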
Sep 26 2025
Weekly Report
@isarantopoulos I agree, I initially got scared when I saw the new response times on my local machine, but I underestimated how much faster the requests are inside our cluster :D
Sep 25 2025
I've done a small analysis on performance implications of introducing the page_id parameter.
I ran the experiments on the statbox machines to more closely reflect the real latency of communication with Wikipedia servers; however, this might still not perfectly resemble query performance when deployed on LiftWing.
Sep 24 2025
Sep 23 2025
The merged architecture has been deployed on both staging and production clusters. It's also been tested by sending requests manually and verifying the responses are correct.
Sep 22 2025
In https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1187739, we've combined the transformer and predictor logic into a single pod. Now, the full processing is done by a single predictor pod.
Sep 19 2025
Weekly Report
Sep 18 2025
Yes, I would keep this task open until the documentation has been updated.
Why do we need Cache
Sep 17 2025
When you say you'll "add" a page_id parameter, does this mean you'll keep the page_title parameter? If so, that would be the best of both worlds, since I could envision scenarios where either variation would be useful.
Sep 16 2025
We have one technical question about the way the Apps side will query our LiftWing model to retrieve the article topics. Currently, our LiftWing model expects users to pass page_title and lang parameters in POST requests to our model. The ML team is also considering adding a page_id parameter that could be used instead of page_title.
I've tested the option to use page_id in the model and found out that it's straightforward to modify the current outlinks query by using pageids=... instead of titles=.... This means we can easily allow both page_id and page_title options in our model via different POST arguments and use either pageids or titles in our query.
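The pageids=/titles= switch could look roughly like this; the exact prop and parameter set the model really uses is an assumption here.

```python
# Hedged sketch of how the outlinks query could accept either identifier,
# mirroring the pageids=/titles= switch described above. The prop and
# other parameters are illustrative assumptions.
def build_outlinks_query(lang, page_title=None, page_id=None):
    if (page_title is None) == (page_id is None):
        raise ValueError("pass exactly one of page_title or page_id")
    url = f"https://{lang}.wikipedia.org/w/api.php"
    params = {"action": "query", "prop": "links", "format": "json"}
    if page_id is not None:
        params["pageids"] = str(page_id)   # look up by page_id
    else:
        params["titles"] = page_title      # look up by page_title
    return url, params

print(build_outlinks_query("en", page_id=12345))
```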
Sep 15 2025
Sep 12 2025
Summary of progress:
Sep 10 2025
Sep 5 2025
Sep 4 2025
Why is it more versatile?
Aug 29 2025
I'm adding a high-level diagram of the Cache design including the backfilling process, interactions with LiftWing and its users.
Q: IIUC this is meant to be a 'query cache', rather than a more general purpose prediction cache, yes?
Aug 22 2025
Aug 20 2025
Thank you for the explanations @Eevans! I see that I had some confusion around the existing Cassandra deployments, sorry for that, but I'm happy that we can clear it up :)
Aug 19 2025
Thank you for the quick answers @Eevans! I'll schedule a call for us, where I will share the larger context, but I also think it'll be useful to continue the discussion in this ticket.
Aug 18 2025
Hello @Eevans @Marostegui! In relation to the work described in this ticket, we'd like to use the existing Cassandra deployment on the staging ML cluster to validate our design for the caching mechanism. In order to do that, we would need to create the needed keyspace/table and users in the Cassandra deployment. Once we've run tests and validated the idea in the staging environment, we would like to create a similar deployment in the production cluster.
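As a rough illustration of what we'd need, the DDL could look something like the sketch below; the keyspace, table, columns, replication settings, and user name are all placeholders, not the final schema.

```sql
-- Illustrative only: names, columns, replication, and grants are placeholders.
CREATE KEYSPACE IF NOT EXISTS article_topics_cache
  WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3};

CREATE TABLE IF NOT EXISTS article_topics_cache.predictions (
  wiki_id text,
  page_id bigint,
  topics  text,
  PRIMARY KEY ((wiki_id, page_id))
);

-- SELECT is needed because the service validates the table on startup.
GRANT SELECT, MODIFY ON KEYSPACE article_topics_cache TO ml_cache_user;
```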
Aug 15 2025
Aug 14 2025
I'm sharing my notes on the Cache design. Those are not final yet and feedback is hugely welcome on any of the points below!
Based on the estimates given in the message above and the discussion during the ML Technical Team Meeting, we've decided to go with adding a cache service to LiftWing and populating it with all ~65 million article topics. Each article topic request will then hit this cache, allowing us to achieve much higher throughput, although the total processing time for Year in Review will still span weeks.
Aug 13 2025
High level questions to answer:
- What is the setup that could solve this problem?
a) Add LiftWing caching with all articles - There are many unknowns, but it can potentially solve the problem.
b) Use Data Gateway - Potentially solves part of backfilling, but it's not a standard mode of operation and we cannot integrate it into LiftWing.
- How do we populate the data at scale?
- How do we serve the data at scale?
Caching POC
Aug 7 2025
@elukey I think it sounds like a good compromise. I'll go this way, thank you!
@elukey I think the idea was that we'd provide a Makefile such that everybody could run just make model-upload ..., which would include installing the venv and running the script. However, this Makefile would also need to be in PATH so that everyone can call it from anywhere.
@OKarakaya-WMF
I agree, let's put this in blocked until we update the catboost version in the upstream repository.
I've started working towards the goal of making article topic available at scale. One of the tasks in this goal is introducing a caching mechanism for the article topic model.
Aug 1 2025
Jul 31 2025
We realized that the original issue happened when querying the model via the API Gateway at https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict, whereas previous experiments were performed by querying the service from our VPC via https://inference.svc.eqiad.wmnet:30443/v1/models/edit-check:predict.
Jul 30 2025
Small update: the connection reset errors shown above are "successful" requests, but they are not returning 200 status codes - they return 502 codes as expected. This is due to our service reaching max capacity during load tests of around 40 requests per second. Thus, it's most likely unrelated to the reported bug.
Jul 29 2025
I've updated the staging deployment of edit-check so it can autoscale up to 3 replicas. I've re-run the load-testing script, with the statistics shown below:
I ran a load test on the staging cluster with 10000 requests; each of them returned a proper non-empty response. The statistics are shown below.
Unfortunately, it seems that we won't be able to retrieve the exact timestamps or the number of failed requests, as they were not logged in our general client-error logging.
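The autoscaling change to the staging deployment could look roughly like the fragment below; the metadata, namespace, and container details are placeholders, not the actual manifest.

```yaml
# Illustrative KServe fragment only; metadata and the container image
# are placeholders, not the real edit-check manifest.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: edit-check
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3  # allow scaling up to 3 replicas under load
    containers:
      - name: kserve-container
        image: example/edit-check:latest  # placeholder image
```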
Jul 28 2025
Jul 24 2025
@kevinbazira As you said, it looks like we're failing due to a missing NumPy dependency. It's indeed not defined in our inference-services dependencies in src/models/revert_risk_model/model_server/multilingual/requirements.txt, and it's also not defined in the knowledge-integrity dependencies.
Jul 22 2025
We've discussed the points above in our ML Team Meeting, which resulted in the following plan:
Jul 21 2025
I've managed to spin up a local cluster with minikube, following our documentation. The documentation is a little outdated, so I'll be updating it this week with the improvements I discovered.
On my local cluster, I've installed the new kserve version directly from the kserve GitHub charts and could successfully deploy our services, which means there should be no dependency conflicts between the new kserve version and our current setup.
As of this time, I've reimplemented the model-upload script in Python, tested its functionality, and merged it into the puppet repository. However, it's not currently functional as I made one major mistake - I implemented and tested it with boto3==1.26.27, which is distributed via a .deb package on Debian Bookworm, whereas our stat machines are based on Debian Bullseye, which distributes the older boto3==1.13.14.
Jul 3 2025
I've made the Python script for model-upload work with just urllib3 and boto3 as external dependencies, both of which are available as debian packages python3-boto3 and python3-urllib3. I also successfully tested the use-cases we want to support.
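The core of such an upload flow can be sketched as below: boto3 talks to Swift through its S3-compatible API. The bucket name, endpoint, and key layout here are assumptions for illustration, not the script's actual values.

```python
import pathlib

# Hedged sketch of a model-upload flow via boto3's S3-compatible client.
# Bucket name, endpoint, and key layout are assumptions for illustration.
def object_key(model_name, version, local_path):
    """Destination key under which the model binary would be stored."""
    return f"{model_name}/{version}/{pathlib.Path(local_path).name}"

def upload_model(local_path, model_name, version, endpoint_url,
                 bucket="wmf-ml-models"):
    import boto3  # available as the python3-boto3 Debian package
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    key = object_key(model_name, version, local_path)
    s3.upload_file(local_path, bucket, key)
    return key
```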
Jun 30 2025
Thank you both for the answers, this helps a lot! @elukey @gkyziridis
I've started work on this ticket and reimplemented the bash script in Python, taking advantage of boto3 to handle the connection to Swift/s3.
However, I'm facing the following questions at the moment:
Jun 24 2025
All of the work that has been planned for this task has been completed and merged 🎉
Jun 20 2025
I've created an MR in the knowledge_integrity repository, which solves dependency conflicts with the newest version of kserve: https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/54. I've tested the changes by updating the knowledge_integrity dependency in revert_risk to target this branch and updating kserve to 0.15.2. With those changes I can build the model locally, and querying it works.
Jun 11 2025
May 21 2025
Following up on the message above:
May 20 2025
So far we've merged 2 patches:
May 19 2025
The second patch, enabling import sorting within the inference-services repo, is ready for review.
May 16 2025
Thank you @BCornwall for the help!
I have all the needed SSH access now; however, I'm not sure about Kerberos - I haven't received any email with a temporary password yet. Is there anything else I need to request besides the Kerberos identity?
May 15 2025
This task focuses on simplifying our pre-commit setup within inference-services repo. The plan is to:
- Remove isort, black and pyupgrade in favor of using ruff for all formatting, linting, upgrading syntax and import sorting. Reproduce current behavior as closely as possible.
- Update ruff to newer version.
- Remove unused dependencies.
- Enable import sorting in the repository.
- Evaluate the current rules and adjust them as desired.
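The plan above could translate into a ruff configuration roughly like this; the rule selection and line length are placeholders, not the repo's final settings.

```toml
# Illustrative ruff configuration only; rule selection and line length
# are placeholders, not the repo's final settings.
[tool.ruff]
line-length = 88

[tool.ruff.lint]
# E/W/F cover flake8-style checks, I replaces isort, UP replaces pyupgrade
select = ["E", "W", "F", "I", "UP"]
```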
May 12 2025
May 9 2025
Hello @Eevans, I've temporarily added my public SSH key for prod on my user page User:BWojtowicz-WMF.