User Details
- User Since
- Nov 1 2022, 12:34 PM (162 w, 6 d)
- Availability
- Busy until Mar 5 2026.
- LDAP User
- Ilias Sarantopoulos
- MediaWiki User
- ISarantopoulos-WMF
Oct 29 2025
Oct 27 2025
Will you be using the same definition for the filters as described in T392148: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to? In that work a single filter was defined, and its threshold was configured to ensure fewer than 15% false positives.
Oct 24 2025
Thanks Sam! Resolving this then.
Oct 23 2025
After enabling the GPU in the above patch, the latencies have been stable and the SLO targets are now met, coming in close to 99% vs the 90% target. No further actions need to be taken at this point.
https://grafana.wikimedia.org/goto/BsgK3OgDR?orgId=1
https://slo.wikimedia.org/objectives?expr={__name__=%22tonecheck-latency-v1%22,%20revision=%221%22,%20service=%22tonecheck%22,%20team=%22ml%22}&grouping={}&from=now-1h&to=now
Let's also update the API GW documentation and then resolve this https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction
This issue was raised by @jsn.sherman on Slack.
The current model in production is https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/.
I'm not sure if this is the one that was tested, as there is also a newer model available at https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20250605145811/.
Ideally the local setup should work with both.
Oct 21 2025
Oct 10 2025
What we're doing here is actually a really common pattern (really common); we've faced these same trade-offs over and over again, and what I've proposed here (an MVCC, basically) is a solution that I think we should start standardizing on, and perhaps even build some tooling around. The only reason we haven't so far is that each time we reach a decision point like this, we opt for the quicker/easier route.
Very nicely put, Eric! I agree that this is exactly what we're doing here. The decision has not been finalized yet, but it is not that we're trying to just take the easier/quicker way out; we're taking the one that gets the job done. I agree that the event/stream-based approach is the one that guarantees correctness, but it also means that there is an additional stream to monitor and maintain. Having outdated recommendations is not a big issue as long as there are enough recommendations in the queue. There is, however, a need for an additional step on the serving side to invalidate such recommendations by checking whether the revision_id is the latest revision for a given page.
If possible I would be interested in decoupling the schema decision discussed in the task from the update/ingestion mechanism and its architecture; my understanding is that your latest recommendation allows us to achieve this.
More specifically, using PRIMARY KEY((wiki_id, page_id), model_version, revision_id) allows us to get the latest revision for a given page, and the computation can be done either in a batch way using an event table or via an event stream. In the batch scenario, the downstream application that uses it (here, GrowthExperiments) needs to invalidate the recommendation before showing it to the user.
Am I missing something?
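As a sketch, the schema described above could look like the following in CQL. Only the primary-key columns come from this thread; the table name, payload column, and clustering order are hypothetical illustrations:

```sql
-- Hypothetical CQL sketch; only the key columns come from the discussion above.
CREATE TABLE recommendations (
    wiki_id        text,
    page_id        bigint,
    model_version  text,
    revision_id    bigint,
    recommendation text,   -- illustrative payload column
    PRIMARY KEY ((wiki_id, page_id), model_version, revision_id)
) WITH CLUSTERING ORDER BY (model_version ASC, revision_id DESC);

-- Latest stored revision for a given page and model version:
SELECT revision_id, recommendation
FROM recommendations
WHERE wiki_id = 'enwiki' AND page_id = 12345 AND model_version = 'v1'
LIMIT 1;
```

With revision_id clustered in descending order, LIMIT 1 returns the newest stored revision; the serving side would still compare that revision_id against the page's current revision and discard the recommendation if it is stale.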
Oct 9 2025
@Eevans Aiko has suggested a way to query for page_id,revision_id & model_version in T401021#11190742
PRIMARY KEY((wiki, page_id, revision_id), model_version)
My understanding was that we only need to store a single row per page or article (representing the latest revision), and including revision_id as a regular column should be sufficient to indicate which revision the row corresponds to.
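A minimal CQL sketch of that single-row-per-page alternative, with revision_id as a regular column (names other than the key columns mentioned in the thread are hypothetical):

```sql
-- Hypothetical sketch: one row per (wiki, page) and model version.
-- Upserts with the same key overwrite the previous row, so the table
-- naturally keeps only the latest write per page and model.
CREATE TABLE recommendations_latest (
    wiki           text,
    page_id        bigint,
    model_version  text,
    revision_id    bigint,  -- regular column: which revision the row reflects
    recommendation text,    -- illustrative payload column
    PRIMARY KEY ((wiki, page_id), model_version)
);
```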
Oct 7 2025
Thanks for clearing that up! I assume data eng are the ones that deploy this but we could open the patch for it.
Can we verify that the tables exist before we resolve this? I ran a quick check and the table event_sanitized.mediawiki_page_outlink_topic_prediction_change_v1 doesn't seem to exist. IIUC from the documentation there is a cron job that runs every hour, but perhaps something else is needed the first time (?)
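A quick way for anyone to re-check, assuming access to a Hive/Spark SQL client on the analytics cluster:

```sql
-- Check whether the sanitized table exists yet
SHOW TABLES IN event_sanitized LIKE 'mediawiki_page_outlink_topic_prediction_change_v1';

-- For comparison, the unsanitized source table
SHOW TABLES IN event LIKE 'mediawiki_page_outlink_topic_prediction_change_v1';
```

An empty result for the first statement while the second returns a row would confirm that sanitization has not populated the table yet.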
Oct 6 2025
Update: the Growth team won't be doing the testwiki PoC this quarter, so we don't have an urgent timeline to ingest a one-off dataset to staging Cassandra.
I'm circling back on this to figure out if we can align on the timelines. We would like to have the instance by mid-October (the 15th) so we can work on ingesting data that would enable an A/B test. Is this possible?
What are the blocking decisions that we would need to make in order to proceed?
I have updated ownership and expiration date
@Eevans There has been a change of plans regarding the integration of this work with this year's Year in Review, so although we still need this Cassandra instance, the request we have filed for the improve tone structured task in T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task is of higher priority. I just wanted to mention this so you can handle your priorities and timelines accordingly.
Until we reach a proper way to build, push & update these images, can we have an image in the registry to unblock us from starting to deploy services and iterate on them? I'm talking about the image that has also been built on ml-lab and is described in this patch: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891
Oct 3 2025
@gkyziridis the error is quite self-explanatory: the file is just not there :)
We could remove the zhwiki deployment and free up some resources. In staging it makes sense to have one deployment for each family of models, or an additional one if there is something different about it that would be worth testing (e.g. an additional deployment for wikidata).
Oct 2 2025
Oct 1 2025
Since the events that are produced (prediction data) are ingested into the Hive table event.mediawiki_page_outlink_topic_prediction_change_v1, we can utilize that for analytics purposes. The data are available there for 90 days, and we are looking to increase the retention period in T405358: Add LiftWing streams data to event_sanitized (increase data retention).
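As an illustration, a sketch of an analytics query against that table. The partition columns (year, month, day) are assumed from the standard event table layout and should be verified with a DESCRIBE first:

```sql
-- Sketch: daily prediction counts for October 2025.
-- Partition columns (year, month, day) are assumed; verify with
-- DESCRIBE event.mediawiki_page_outlink_topic_prediction_change_v1;
SELECT year, month, day, COUNT(*) AS predictions
FROM event.mediawiki_page_outlink_topic_prediction_change_v1
WHERE year = 2025 AND month = 10
GROUP BY year, month, day
ORDER BY day;
```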
+1 on querying directly and not going through the data gateway
Sep 30 2025
Sep 26 2025
@BWojtowicz-WMF let's adopt wiki_id for the data field; we can continue to use lang to avoid altering the API parameters.
@Eevans Is there anything else required from the ML team for the design review? Is there an estimate of when this can be delivered, so that we can plan the appropriate integration with the service and the necessary backfill? Thanks!
Thanks for looking into that @BWojtowicz-WMF
Given that we use the same source (mwapi) in both request types, I don't think it is worth further investigating the difference in response times between page_title and page_id, as in any case this feature is an extension to the API and not a replacement.
I suggest we just move forward and add it so that we can use it for caching.
Sep 24 2025
Ok! So I'm pasting the modified queries for the availability and latency metrics using the last 21d.
The first one results in 99.99% availability:
(
sum by (destination_canonical_service) (
increase(istio_requests_total{prometheus=~"k8s-mlserve",
source_workload_namespace="istio-system",
app="istio-ingressgateway",
destination_service_namespace=~"edit-check",
destination_service_name=~"edit-check-predictor.*",response_code="200"}[21d])
)
)/(
sum by (destination_canonical_service) (
increase(istio_requests_total{prometheus=~"k8s-mlserve",
source_workload_namespace="istio-system",
app="istio-ingressgateway",
destination_service_namespace=~"edit-check",
destination_service_name=~"edit-check-predictor.*"}[21d])
)
)*100

We have faced some difficulties reporting on 6. and 7., which is related to how the Prometheus metrics and functions are defined. One can read more about the challenges on the Wikitech SLO page.
In T405338: Calculate tone check model service metrics for fixed calendar window we have managed to extract some results for these indicators for the previous 21 days that the experiment has been running, and although they are not 100% accurate due to the aforementioned issues, they still provide good insight into what has happened with the service during that period, and both are above the defined thresholds.
Thank you for the clarification! The above query responds to the availability SLI (1st item in task description). I tried to tackle #2 which is more difficult since we did face increased latencies.
I agree with all the points except the following:
This assumes that the base to compute the 200-rate against is _all_ status codes
Our base is indeed _all_ status codes, but the rate we want is non-5xx. So the above query can be transformed into the following:
(
sum by (destination_canonical_service) (
increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check", response_code!~"5.."}[21d])
)
)/(
sum by (destination_canonical_service) (
increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check"}[21d])
)
)*100

The result of the above is 100% (which is also what the SLO dashboards tell us).
Regarding #2: "The percentage of all successful requests (2xx) that complete within 1000 milliseconds"
What would the equivalent query be?
@isarantopoulos it is not that easy :)
I assumed so :)
Starting from the Istio Grafana dashboard that presents the p90 latency of the service, I came up with the following query, which gives us the percentage of requests that receive a 2xx response in 1000 ms or less.
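A minimal sketch of such a query, assuming Istio's standard request-duration histogram (istio_request_duration_milliseconds_bucket, whose default buckets include le="1000") carries the same labels as the availability query above; the exact label set should be verified against the dashboard:

```
sum(
  increase(istio_request_duration_milliseconds_bucket{prometheus="k8s-mlserve",
    destination_workload_namespace="edit-check",
    response_code=~"2..", le="1000"}[21d])
)
/
sum(
  increase(istio_request_duration_milliseconds_bucket{prometheus="k8s-mlserve",
    destination_workload_namespace="edit-check",
    response_code=~"2..", le="+Inf"}[21d])
) * 100
```

The le="+Inf" bucket counts all 2xx requests regardless of duration, so the ratio is the fraction of 2xx requests completing within 1000 ms.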
Sep 23 2025
Thanks for the suggestion Andrew! Indeed the above way is a really bad practice.
cc: @AikoChou