
isarantopoulos (Ilias Sarantopoulos)
Machine Learning/MLOps Engineer

User Details

User Since
Nov 1 2022, 12:34 PM (162 w, 6 d)
Availability
Busy until Mar 5 2026.
LDAP User
Ilias Sarantopoulos
MediaWiki User
ISarantopoulos-WMF [ Global Accounts ]

Recent Activity

Oct 29 2025

isarantopoulos created T408690: Move inference-services repo from Gerrit to GitLab.
Oct 29 2025, 1:59 PM · Machine-Learning-Team

Oct 27 2025

isarantopoulos added a comment to T408388: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models.

Will you be using the same definition for the filters as described in T392148: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to? In that work a single filter was defined, and its threshold was configured to ensure fewer than 15% false positives.

Oct 27 2025, 12:54 PM · OKR-Work (WE1 FY2025-26), Machine-Learning-Team, MediaWiki-extensions-ORES, PersonalDashboard, MediaWiki-Recent-changes, Moderator-Tools-Team

Oct 24 2025

isarantopoulos closed T407839: Add Dawid Pogorzelski to WMF GitHub organization as Resolved.

Thanks Sam! Resolving this then.

Oct 24 2025, 7:41 AM · Machine-Learning-Team, Wikimedia-GitHub

Oct 23 2025

isarantopoulos moved T403378: Review Tone Check Latency SLO and its targets from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Oct 23 2025, 9:27 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos closed T403378: Review Tone Check Latency SLO and its targets as Resolved.

Change #1186447 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: enable GPU for edit-check in prod

https://gerrit.wikimedia.org/r/1186447

After enabling the GPU in the above patch, latencies have been stable and the SLO target is now met: the measured value is close to 99% against the 90% target. No further action is needed at this point.
https://grafana.wikimedia.org/goto/BsgK3OgDR?orgId=1
https://slo.wikimedia.org/objectives?expr={__name__=%22tonecheck-latency-v1%22,%20revision=%221%22,%20service=%22tonecheck%22,%20team=%22ml%22}&grouping={}&from=now-1h&to=now

Oct 23 2025, 9:27 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T401968: Analyze samples of articles to see how many structured tasks we might be able to generate from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Oct 23 2025, 9:01 AM · Research, Revise-Tone-Structured-Task, Growth-Team, OKR-Work, Goal, Machine-Learning-Team
isarantopoulos added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

Let's also update the API GW documentation and then resolve this https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction

Oct 23 2025, 8:27 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T408068: Revertrisk multilingual fails locally when ran with docker compose.

This issue was raised by @jsn.sherman on Slack.
The current model in production is https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20230810110019/.
I'm not sure if this is the one that was tested as there is also a newer model available in https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/multilingual/20250605145811/
Ideally the local setup should work with both.

Oct 23 2025, 8:19 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos assigned T408068: Revertrisk multilingual fails locally when ran with docker compose to BWojtowicz-WMF.
Oct 23 2025, 8:17 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos created T408068: Revertrisk multilingual fails locally when ran with docker compose.
Oct 23 2025, 8:17 AM · Patch-For-Review, Machine-Learning-Team

Oct 21 2025

isarantopoulos moved T395823: [batch #2] Enable revertrisk filters in recent changes in multiple wikis from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:42 PM · User-notice-archive, Patch-For-Review, Wikimedia-Extension-setup, Wikimedia-Site-requests, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Machine-Learning-Team
isarantopoulos moved T396788: Persist historical Revertrisk Multilingual data for threshold analysis from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · MediaWiki-Recent-changes, Machine-Learning-Team, Moderator-Tools-Team
isarantopoulos moved T393865: Simplify pre-commit hooks within inference-services repository. from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team
isarantopoulos moved T391465: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team, sre-alert-triage
isarantopoulos moved T396466: Security Issue Access Request for Machine Learning team from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · SecTeam-Processed, Machine-Learning-Team, Security-Team, Security
isarantopoulos moved T394455: Ensure all ORES i18n messages are available for idwiki from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · I18n, Machine-Learning-Team, Moderator-Tools-Team (Kanban)
isarantopoulos moved T391103: DBA Review of Tables that ORES Extension will create from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Data-Persistence, Moderator-Tools-Team (Kanban), MediaWiki-Recent-changes, MediaWiki-extensions-ORES, Machine-Learning-Team
isarantopoulos moved T377609: Failure in PageTriage extension on CheckUser test GlobalBlockingHandlerWithDatabaseRowsTest::testRetroactiveAutoblockWhenLocalUserNotAttached from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · MW-1.43-notes (1.43.0-wmf.28; 2024-10-22), Trust and Safety Product Sprint (Sprint Cello (Oct 7 - 18)), Trust and Safety Product Team, ci-test-error (WMF-deployed Build Failure), ORES, CheckUser, Moderator-Tools-Team, PageTriage, Machine-Learning-Team
isarantopoulos moved T372298: [SPIKE]Perform a load test for Multilingual Revert Risk on LiftWing[4H] from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Moderator-Tools-Team (Kanban), Machine-Learning-Team, Automoderator
isarantopoulos moved T380258: Create an Airflow instance for ML from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Machine-Learning-Team, Data-Platform
isarantopoulos moved T374077: [SPIKE] Investigate how to install ORES in idwiki [8HRS] from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · WE4.2 Anti-abuse, Moderator-Tools-Team (Kanban), Machine-Learning-Team, ORES, Spike
isarantopoulos moved T362503: ORES doesn't work (at least for ru- and ukwiki) from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Patch-For-Review, Machine-Learning-Team, ORES
isarantopoulos moved T375280: PopulateDatabase errors out and stops processing revisions when any revertRiskLiftWingRequest request fails from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · MW-1.45-notes (1.45.0-wmf.2; 2025-05-20), Moderator-Tools-Team (Kanban), Machine-Learning-Team, MediaWiki-extensions-ORES
isarantopoulos moved T370759: [M] Create the logo detection model card from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team, Structured-Data-Backlog (Current Work)
isarantopoulos moved T377331: the error message from gapfinder service refers to a deleted rev from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · GapFinder, Language-Team, Machine-Learning-Team
isarantopoulos moved T393595: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Bartosz Wójtowicz from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · LDAP-Access-Requests, Machine-Learning-Team, SRE, SRE-Access-Requests
isarantopoulos moved T381569: [SPIKE] How could we add topic filtering to Recent Changes? [8H] from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Moderator-Tools-Team (Kanban), MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Edit-Review-Improvements-RC-Page, Machine-Learning-Team
isarantopoulos moved T393475: ML Services causing log spam from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team
isarantopoulos moved T396461: ORES Extension master branch is failing tests from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · ci-test-error (WMF-deployed Build Failure), ORES, MediaWiki-extensions-ORES, Machine-Learning-Team
isarantopoulos moved T382171: Install ORES extension on idwiki from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Wikimedia-Extension-setup, MediaWiki-extensions-ORES, Wikimedia-Site-requests, Machine-Learning-Team, Moderator-Tools-Team
isarantopoulos moved T395253: Improve ORES extension table backfill script from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · MW-1.45-notes (1.45.0-wmf.6; 2025-06-17), Patch-For-Review, MediaWiki-extensions-ORES, Machine-Learning-Team
isarantopoulos moved T395256: [Spike] Investigate why filtering wasn't working on testwiki from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · MW-1.45-notes (1.45.0-wmf.4; 2025-06-03), Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban)
isarantopoulos moved T391958: Create a new S3 bucket for MinT from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Language and Product Localization, Machine-Learning-Team
isarantopoulos moved T395074: ParserFunctionsTest::testIfexist failure by run of ORESFetchScoreJob in CI from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · MW-1.45-notes (1.45.0-wmf.3; 2025-05-27), ci-test-error (WMF-deployed Build Failure), PageTriage, ORES, Moderator-Tools-Team, ParserFunctions, Machine-Learning-Team
isarantopoulos moved T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team
isarantopoulos moved T392148: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Moderator-Tools-Team, Machine-Learning-Team, MediaWiki-Recent-changes
isarantopoulos moved T393154: Peacock detection model GPU deployment returns inconsistent results from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Editing-team (Tracking), Machine-Learning-Team
isarantopoulos moved T394779: Deploy tone check model to production from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team
isarantopoulos moved T393876: [Fix]: Documentation for ORES and MediaWiki Docker from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Documentation, Machine-Learning-Team
isarantopoulos moved T366670: hw troubleshooting: memory errors during boot for ml-serve2001.codfw.wmnet from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · SRE, ops-codfw, Machine-Learning-Team, DC-Ops
isarantopoulos moved T382343: [LLM] ML-lab benchmarking from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Machine-Learning-Team
isarantopoulos moved T326179: Emit revision revert risk scores as a stream and expose in EventStreams API from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:41 PM · Event-Platform, Data-Engineering, Machine-Learning-Team, Research
isarantopoulos moved T379052: Test the feasibility of deployment of Aya-expanse model in LiftWing from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Machine-Learning-Team
isarantopoulos moved T388805: Load test the language agnostic article-quality model from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Wikimedia Enterprise - Content Integrity, Lift-Wing, Machine-Learning-Team
isarantopoulos moved T392460: [FIX]: Edit-check peacock detection locust tests from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Machine-Learning-Team
isarantopoulos moved T370149: [LLM] Use vllm for ROCm in huggingface image from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos moved T391679: [onboarding] Improving language agnostic articlequality model + service from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
isarantopoulos moved T388817: Load test the peacock edit check service from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · EditCheck, Lift-Wing, Machine-Learning-Team
isarantopoulos moved T390855: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Machine-Learning-Team, SRE-Access-Requests, SRE, LDAP-Access-Requests
isarantopoulos moved T391229: Local peacock model server doesn't send CORS headers allowing all origins from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Machine-Learning-Team, EditCheck, Lift-Wing
isarantopoulos moved T389768: LiftWing model-servers log improper JSON in stderr from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:40 PM · Machine-Learning-Team, Lift-Wing
isarantopoulos moved T386100: Create and deploy peacock detection model server from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:39 PM · Patch-For-Review, EditCheck, Lift-Wing, Machine-Learning-Team
isarantopoulos moved T384172: Issues with Reference Need and Reference Risk models from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:39 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T384651: Evaluate efficacy of Tone Check model output (internal review) from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:39 PM · Editing-team (Tracking), Machine-Learning-Team, EditCheck, VisualEditor
isarantopoulos moved T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:39 PM · User-notice-archive, Patch-For-Review, Wikimedia-Extension-setup, Wikimedia-Site-requests, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Machine-Learning-Team
isarantopoulos moved T394910: Investigate null scores being returned by revertrisk language agnostic from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:39 PM · Machine-Learning-Team
isarantopoulos moved T387984: Use SHAP values to highlight peacock words from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:39 PM · Editing-team (Tracking), EditCheck, Machine-Learning-Team
isarantopoulos moved T388215: Verify cost of gathering peacock training/evaluation data for top 20 languages from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:38 PM · Editing-team (Tracking), EditCheck, Machine-Learning-Team
isarantopoulos moved T386645: Evaluate the existing peacock detection model from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:38 PM · EditCheck, Machine-Learning-Team
isarantopoulos moved T369493: Migrate ml-staging/ml-serve clusters off of Pod Security Policies from 2025-2026 Q2 Done to Task Archive on the Machine-Learning-Team board.
Oct 21 2025, 3:38 PM · Patch-For-Review, Machine-Learning-Team, Kubernetes
isarantopoulos moved T406958: Enable Airflow triggerer process for deferrable operators in airflow-ml and airflow-devenv from Unsorted to 2025-2026 Q1 Done on the Machine-Learning-Team board.
Oct 21 2025, 3:34 PM · Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work, Machine-Learning-Team
isarantopoulos created T407839: Add Dawid Pogorzelski to WMF GitHub organization.
Oct 21 2025, 8:52 AM · Machine-Learning-Team, Wikimedia-GitHub

Oct 10 2025

isarantopoulos added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

What we're doing here is actually a really common pattern (really common) —we've faced these same trade-offs over and over again— and what I've proposed here (an MVCC, basically), is a solution that I think we should start standardizing on, and perhaps even build some tooling around. The only reason we haven't so far, is that each time we reach a decision point like this, we opt for the quicker/easier route

Very nicely put, Eric! I agree that this is exactly what we're doing here. The decision has not been finalized yet, but we're not trying to take the easier/quicker way out, just the one that gets the job done. I agree that the event/stream-based approach is the one that guarantees correctness, but it also means there is an additional stream to monitor and maintain. Having outdated recommendations is not a big issue as long as there are enough recommendations in the queue. There is, however, a need for an additional step on the serving side to invalidate such recommendations by checking whether the revision_id is the latest revision for a given page.
If possible, I would like to decouple the schema decision discussed in the task from the update/ingestion mechanism and its architecture. My understanding is that your latest recommendation allows us to achieve this.
More specifically, using PRIMARY KEY((wiki_id, page_id), model_version, revision_id) allows us to get the latest revision for a given page, and the computation can be done either in batch using an event table or via an event stream. In the batch scenario, the downstream application that uses the data (here, GrowthExperiments) needs to invalidate stale recommendations before showing them to the user.
Am I missing something?
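
A minimal sketch of the serving-side check described above, assuming rows shaped like the proposed key. The class and function names are hypothetical, the in-memory list only stands in for the Cassandra table, and how the downstream application obtains a page's current revision is not covered here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Recommendation:
    """One stored row; field names mirror the key columns discussed above."""
    wiki_id: str
    page_id: int
    model_version: str
    revision_id: int


def latest_stored(rows: Iterable[Recommendation], wiki_id: str,
                  page_id: int, model_version: str) -> Optional[Recommendation]:
    """Return the row with the highest revision_id for one page and model version."""
    candidates = [
        r for r in rows
        if r.wiki_id == wiki_id
        and r.page_id == page_id
        and r.model_version == model_version
    ]
    return max(candidates, key=lambda r: r.revision_id, default=None)


def is_stale(rec: Recommendation, current_revision_id: int) -> bool:
    """A recommendation is stale when the page has moved past its revision."""
    return rec.revision_id != current_revision_id


# Hypothetical sample data: two stored revisions of the same page.
rows = [
    Recommendation("enwiki", 42, "v1", 100),
    Recommendation("enwiki", 42, "v1", 120),
    Recommendation("enwiki", 7, "v1", 55),
]

rec = latest_stored(rows, "enwiki", 42, "v1")
stale_if_page_at_130 = is_stale(rec, current_revision_id=130)
stale_if_page_at_120 = is_stale(rec, current_revision_id=120)
```

With this shape, a batch-computed row can lag behind the page without harm, as long as the consumer drops rows for which `is_stale` is true.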

Oct 10 2025, 11:12 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence

Oct 9 2025

isarantopoulos added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

@Eevans Aiko has suggested a way to query for page_id, revision_id and model_version in T401021#11190742:

PRIMARY KEY((wiki, page_id, revision_id), model_version)

My understanding was that we only need to store a single row per page or article (representing the latest revision), and including revision_id as a regular column should be sufficient to indicate which revision the row corresponds to.

Oct 9 2025, 9:48 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence
isarantopoulos moved T405891: Add support for K8s 1.23 on Trixie from Unsorted to 2025-2026 Q1 Done on the Machine-Learning-Team board.
Oct 9 2025, 8:53 AM · Machine-Learning-Team

Oct 7 2025

isarantopoulos reopened T405358: Add LiftWing streams data to event_sanitized (increase data retention) as "Open".

Thanks for clearing that up! I assume data eng are the ones that deploy this but we could open the patch for it.

Oct 7 2025, 10:33 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T405358: Add LiftWing streams data to event_sanitized (increase data retention).

Can we verify that the tables exist before we resolve this? I ran a quick check and the table event_sanitized.mediawiki_page_outlink_topic_prediction_change_v1 doesn't seem to exist. IIUC from the documentation there is a cron job that runs every hour, but perhaps something else is needed the first time (?)

Oct 7 2025, 8:16 AM · Lift-Wing, Machine-Learning-Team

Oct 6 2025

isarantopoulos added a comment to T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task.

Update: Growth team won't be doing the testwiki PoC this quarter, so we don't have an urgent timeline to ingest a one-off dataset to staging Cassandra

I'm circling back on this to figure out if we can align on the timelines. We would like to have the instance by Mid October (15th) so we can work on ingesting data that would enable an A/B test. Is this possible?
What are the blocking decisions that we would need to make in order to proceed?

Oct 6 2025, 3:50 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence
isarantopoulos added a comment to T402984: Data Persistence Design Review: Article topic model caching.

I have updated ownership and expiration date
@Eevans There has been a change of plans regarding the integration of this work with this year's Year In Review. We still need this Cassandra instance, but the request we filed for the improve tone structured task in T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task has higher priority. I just wanted to mention this so you can handle your priorities and timelines accordingly.

Oct 6 2025, 1:54 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence
isarantopoulos updated the task description for T402984: Data Persistence Design Review: Article topic model caching.
Oct 6 2025, 1:47 PM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence
isarantopoulos added a comment to T394778: Build and push images to the docker registry from ml-lab.

Until we settle on a proper way to build, push and update these images, can we have an image in the registry to unblock us from starting to deploy services and iterate on them? I'm talking about the image that has also been built on ml-lab and is described in this patch: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1146891

Oct 6 2025, 10:59 AM · Machine-Learning-Team

Oct 3 2025

isarantopoulos added a comment to T403236: Fix revscoring load tests to match staging deployments.

@gkyziridis the error is quite self-explanatory: the file is just not there :)

Oct 3 2025, 3:10 PM · Essential-Work, Machine-Learning-Team
isarantopoulos added a comment to T403236: Fix revscoring load tests to match staging deployments.

We could remove the zhwiki deployment and free up some resources. In staging it makes sense to have one deployment for each family of models, plus an additional one if there is something different about it that would be worth testing (e.g. an additional deployment for Wikidata).

Oct 3 2025, 11:25 AM · Essential-Work, Machine-Learning-Team

Oct 2 2025

isarantopoulos moved T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task from In Progress to Watching on the Machine-Learning-Team board.
Oct 2 2025, 3:16 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Persistence-Design-Review, Revise-Tone-Structured-Task, OKR-Work, Machine-Learning-Team, Growth-Team, Data-Persistence
isarantopoulos closed T405083: Remove old nsfw model from inference-services repo as Resolved.
Oct 2 2025, 3:16 PM · Lift-Wing, Machine-Learning-Team

Oct 1 2025

isarantopoulos updated the task description for T405358: Add LiftWing streams data to event_sanitized (increase data retention).
Oct 1 2025, 12:47 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos moved T405358: Add LiftWing streams data to event_sanitized (increase data retention) from Unsorted to In Progress on the Machine-Learning-Team board.
Oct 1 2025, 12:45 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T402984: Data Persistence Design Review: Article topic model caching.

Since the events that are produced (prediction data) are ingested in a hive table event.mediawiki_page_outlink_topic_prediction_change_v1 we can utilize that for analytics purposes. The data are available there for 90 days and we are looking to increase the retention period in T405358: Add LiftWing streams data to event_sanitized (increase data retention)
+1 on querying directly and not going through the data gateway

Oct 1 2025, 11:52 AM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence
isarantopoulos updated the task description for T405358: Add LiftWing streams data to event_sanitized (increase data retention).
Oct 1 2025, 11:49 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos assigned T405358: Add LiftWing streams data to event_sanitized (increase data retention) to gkyziridis.
Oct 1 2025, 8:55 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos updated the task description for T405358: Add LiftWing streams data to event_sanitized (increase data retention).
Oct 1 2025, 8:54 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos moved T403254: Article topic cache backfilling using article_topic hive table from Ready To Go to Blocked on the Machine-Learning-Team board.
Oct 1 2025, 8:37 AM · Machine-Learning-Team
isarantopoulos moved T405324: Create a notebook for revise tone structured task generation logic from Ready To Go to In Progress on the Machine-Learning-Team board.
Oct 1 2025, 8:37 AM · Machine-Learning-Team
isarantopoulos moved T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs from In Progress to 2025-2026 Q1 Done on the Machine-Learning-Team board.
Oct 1 2025, 8:36 AM · Patch-For-Review, Essential-Work, Machine-Learning-Team
isarantopoulos closed T405338: Calculate tone check model service metrics for fixed calendar window as Resolved.
Oct 1 2025, 8:35 AM · Lift-Wing, Machine-Learning-Team

Sep 30 2025

isarantopoulos closed T404717: Fix CI/CD on ml-pipelines repository as Resolved.
Sep 30 2025, 9:31 AM · Machine-Learning-Team

Sep 26 2025

isarantopoulos added a comment to T402984: Data Persistence Design Review: Article topic model caching.

@BWojtowicz-WMF let's adopt the wiki_id for the data field and we can continue to use lang to avoid altering the api parameters.
@Eevans Is there anything else required from the ML team for the design review? Is there an estimate of when this can be delivered, so that we can plan the appropriate integration with the service and the necessary backfill? Thanks!

Sep 26 2025, 11:36 AM · Data-Persistence-Design-Review, Machine-Learning-Team, Data-Persistence
isarantopoulos added a comment to T371021: [articletopic-outlink] Allow using `page_id` as alternative to the `page_title` parameter..

Thanks for looking into that @BWojtowicz-WMF
Given that we use the same source in both request types (mwapi) I don't think it is worth further investigating the difference in response times between page_title and page_id as in any case this feature is an extension to the API and not a replacement.
I suggest we just move forward and add it so that we can use it for caching.

Sep 26 2025, 8:13 AM · Lift-Wing, Machine-Learning-Team

Sep 24 2025

isarantopoulos updated the task description for T394463: [A/B Test] Report on Tone Check leading indicators.
Sep 24 2025, 3:42 PM · Editing-team (Kanban Board), OKR-Work, Goal, Product-Analytics (Kanban), EditCheck, VisualEditor
isarantopoulos added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

Ok! so I'm pasting the modified queries for the availability and latency metrics using the last 21d
The first one results in 99.99% availability:

(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*",response_code="200"}[21d])
	)
)/(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus=~"k8s-mlserve", 
			source_workload_namespace="istio-system",
			app="istio-ingressgateway",
			destination_service_namespace=~"edit-check",
			destination_service_name=~"edit-check-predictor.*"}[21d])
	)
)*100
Sep 24 2025, 3:41 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos updated the task description for T394463: [A/B Test] Report on Tone Check leading indicators.
Sep 24 2025, 3:41 PM · Editing-team (Kanban Board), OKR-Work, Goal, Product-Analytics (Kanban), EditCheck, VisualEditor
isarantopoulos added a comment to T394463: [A/B Test] Report on Tone Check leading indicators.

We have faced some difficulties reporting on 6. and 7., which are related to how the Prometheus metrics and functions are defined. One can read more about the challenges on the Wikitech SLO page.
In T405338: Calculate tone check model service metrics for fixed calendar window we have managed to extract results for these indicators for the previous 21 days that the experiment has been running. Although they are not 100% accurate due to the aforementioned issues, they still provide good insight into what has happened with the service during that period, and both are above the defined thresholds.

Sep 24 2025, 3:40 PM · Editing-team (Kanban Board), OKR-Work, Goal, Product-Analytics (Kanban), EditCheck, VisualEditor
isarantopoulos updated the task description for T394463: [A/B Test] Report on Tone Check leading indicators.
Sep 24 2025, 3:33 PM · Editing-team (Kanban Board), OKR-Work, Goal, Product-Analytics (Kanban), EditCheck, VisualEditor
isarantopoulos assigned T405185: Introduce case sensitivity to machine learning model for Add a Link to OKarakaya-WMF.
Sep 24 2025, 1:59 PM · Community Feedback (Growth), Machine-Learning-Team, Growth-Team, Add-Link-Structured-Task
isarantopoulos added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

Thank you for the clarification! The above query responds to the availability SLI (1st item in task description). I tried to tackle #2 which is more difficult since we did face increased latencies.
I agree with all the points except the following:

This assumes that the base to compute the 200-rate against is _all_ status codes

Our base is indeed _all_ status codes, but the rate we want is non-5xx. So the above query can be transformed to the following:

(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check", response_code!~"5.."}[21d])
	)
)/(
	sum by (destination_canonical_service) (
		increase(istio_requests_total{prometheus="k8s-mlserve",destination_workload_namespace="edit-check"}[21d])
	)
)*100

The result of the above is 100% (which is also what the SLO dashboards tell us).
Regarding #2 ("The percentage of all successful requests (2xx) that complete within 1000 milliseconds"):
What would the equivalent query be?
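
To make the difference between the two availability definitions in this thread concrete (200-only versus non-5xx, both over all status codes), here is a rough sketch of the arithmetic in Python. The per-code request counts are invented for illustration and are not the service's real numbers; the function names are hypothetical.

```python
import re
from typing import Callable, Dict


def availability(counts: Dict[str, float], success: Callable[[str], bool]) -> float:
    """Percentage of requests counted as successful, over all status codes."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests in the window")
    good = sum(n for code, n in counts.items() if success(code))
    return 100.0 * good / total


def only_200(code: str) -> bool:
    # Mirrors the selector response_code="200".
    return code == "200"


def non_5xx(code: str) -> bool:
    # Mirrors the selector response_code!~"5.." (PromQL regexes are fully anchored).
    return re.fullmatch(r"5..", code) is None


# Hypothetical request counts per response code over the window.
counts = {"200": 9990.0, "404": 8.0, "503": 2.0}

strict = availability(counts, only_200)   # 4xx responses count against the service
lenient = availability(counts, non_5xx)   # only server errors count against it
```

Client errors such as 404 lower the first figure but not the second, which is why the two queries can disagree on the same traffic.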

Sep 24 2025, 11:57 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos updated subscribers of T405338: Calculate tone check model service metrics for fixed calendar window.

@isarantopoulos it is not that easy :)

I assumed so :)

Sep 24 2025, 9:15 AM · Lift-Wing, Machine-Learning-Team
isarantopoulos added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

Starting from the Istio Grafana dashboard that presents the p90 latency of the service, I came up with the following query, which gives us the percentage of requests that receive a 2xx response in 1000 ms or less.
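
The query itself does not appear in this feed. Purely as an illustration of the arithmetic behind a latency indicator of this kind, and assuming the latency is recorded as a Prometheus histogram (cumulative buckets keyed by an upper bound `le`), a sketch in Python could look like the following. The bucket bounds and counts are made up and the function name is hypothetical.

```python
from typing import Dict

INF = float("inf")


def percent_within(buckets: Dict[float, float], threshold_ms: float) -> float:
    """Percentage of observations at or below threshold_ms.

    `buckets` maps each upper bound (the `le` label) to a cumulative count,
    so the +Inf bucket holds the total number of observations.
    """
    total = buckets[INF]
    if total == 0:
        raise ValueError("no observations in the window")
    if threshold_ms not in buckets:
        # Without a bucket boundary at the threshold the ratio can only be estimated.
        raise KeyError(f"no bucket with le={threshold_ms}")
    return 100.0 * buckets[threshold_ms] / total


# Hypothetical cumulative counts for successful (2xx) requests only.
buckets_2xx = {250.0: 700.0, 500.0: 850.0, 1000.0: 950.0, INF: 1000.0}

within_1s = percent_within(buckets_2xx, 1000.0)
```

The result is exact only when the threshold coincides with a bucket boundary, which is one reason such indicators can be hard to report precisely.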

Sep 24 2025, 7:56 AM · Lift-Wing, Machine-Learning-Team

Sep 23 2025

isarantopoulos created T405358: Add LiftWing streams data to event_sanitized (increase data retention).
Sep 23 2025, 2:01 PM · Lift-Wing, Machine-Learning-Team
isarantopoulos updated subscribers of T405067: prediction_classification_change stream schema change causes model server failures.

Thanks for the suggestion, Andrew! Indeed, the approach above is really bad practice.
cc: @AikoChou

Sep 23 2025, 1:52 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
isarantopoulos created T405338: Calculate tone check model service metrics for fixed calendar window.
Sep 23 2025, 10:22 AM · Lift-Wing, Machine-Learning-Team