
gkyziridis (George Kyziridis)
ML-Engineer

User Details

User Since
Jan 6 2025, 12:21 PM (74 w, 1 d)
Availability
Available
IRC Nick
georgekyz
LDAP User
Gkyziridis
MediaWiki User
GKyziridis-WMF

Recent Activity

Fri, Jun 5

gkyziridis added a comment to T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files..

@Clement_Goubert thanks for your comments; they were super informative and helpful.
All the necessary actions are already taken and the corresponding patches are already under review.
The plan is to start deployments Monday next week following this order:

  1. Merge and publish the liftwing-openapi-server image in the registry: patch-inference-services
  2. Kubeconfig files (plus the mesh configuration, which could be done later but is no problem to do at the same time): patch-puppet
  3. Deploy both the admin_ng part and the actual service: patch-deployment-charts
  4. Ingress configuration and DNS: patch-operations/DNS
  5. LVS in service_setup: configure the liftwing-openapi-server in "hieradata/common/service.yaml" in a separate puppet patch
Fri, Jun 5, 12:54 PM · Patch-For-Review, ServiceOps new, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Wed, Jun 3

gkyziridis added a comment to T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files..

Thank you both @isarantopoulos and @Clement_Goubert for your help.
So just to make sure that I've understood what you suggest:

  1. Create a repo in gitlab under: repos/sre/miscweb/liftwing-openapi-specs
  2. Add the /docs files in there and configure the pipeline similar to static-codereview repo.
  3. Create a new helmfile in helmfile.d/services/miscweb for the deployment.
Wed, Jun 3, 10:58 AM · Patch-For-Review, ServiceOps new, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis added a comment to T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files..

We decided to go with option 1: a dedicated endpoint, liftwing-openapi-server, which serves the umbrella yaml for all endpoints. I've tested it locally and it works fine with the RestSandbox.

Image: LiftWing configured on RestSandbox
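
As a rough illustration of what an "umbrella" document could look like, the sketch below merges several per-model OpenAPI documents (already parsed into dictionaries) into one. The function name `merge_specs`, the OpenAPI version string, and the example paths are assumptions made for this sketch; it is not the code used by liftwing-openapi-server.

```python
from copy import deepcopy
from typing import Any

Spec = dict[str, Any]


def merge_specs(specs: list[Spec], title: str, version: str = "1.0.0") -> Spec:
    """Combine per-model OpenAPI documents into one umbrella document."""
    umbrella: Spec = {
        "openapi": "3.0.3",  # assumed version, adjust to the real specs
        "info": {"title": title, "version": version},
        "paths": {},
        "components": {"schemas": {}},
    }
    for spec in specs:
        # Copy every path; refuse silent overwrites between models.
        for path, item in spec.get("paths", {}).items():
            if path in umbrella["paths"]:
                raise ValueError(f"duplicate path in specs: {path}")
            umbrella["paths"][path] = deepcopy(item)
        # Copy shared schemas with the same duplicate check.
        schemas = spec.get("components", {}).get("schemas", {})
        for name, schema in schemas.items():
            if name in umbrella["components"]["schemas"]:
                raise ValueError(f"duplicate schema in specs: {name}")
            umbrella["components"]["schemas"][name] = deepcopy(schema)
    return umbrella


# Hypothetical per-model documents, for illustration only.
model_a = {"paths": {"/v1/models/model-a:predict": {"post": {}}}}
model_b = {"paths": {"/v1/models/model-b:predict": {"post": {}}}}
umbrella = merge_specs([model_a, model_b], title="Lift Wing")
```

Failing on duplicate paths or schema names is a deliberate choice in this sketch: a collision between two model specs is more likely a mistake than something to resolve silently.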

Wed, Jun 3, 9:29 AM · Patch-For-Review, ServiceOps new, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Tue, Jun 2

gkyziridis added a comment to T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files..

Thanks for your comments @apaskulin!
I was investigating @Clement_Goubert's option for exposing the openapi specs on the Kserve level by overriding the standard specs with the custom ones.
This works fine: I tested it locally by configuring it on the RestSandbox.

Tue, Jun 2, 3:12 PM · Patch-For-Review, ServiceOps new, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis created T427902: Expose LiftWing API for serving the openapi-specs through the /docs yaml files..
Tue, Jun 2, 10:10 AM · Patch-For-Review, ServiceOps new, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis added a comment to T426081: Configure LiftWing's openAPI specs into mediawiki-config.
Ideal

The solution we should be working towards is to serve the spec files directly from the service. There are a few ways to approach it, like using the underlying Python FastAPI classes to serve these files, possibly using its openapi_url configuration option, or catching the /docs path in the model_server.py code and returning the file.
We can then point the RestSandbox configuration to the rest-gateway, possibly with a slight adjustment to the rules to accept the new /docs URLs.
I understand this can take time and effort, but it will result in more autonomy, less cross-talk between services, and a much lighter risk burden.
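
To make the "catch the /docs path and return the file" idea a bit more concrete, here is a minimal, framework-independent sketch of the path-mapping step only. The function name `resolve_spec_file`, the `/docs/` prefix handling, and the directory layout are assumptions for illustration; wiring it into a FastAPI route or into model_server.py is deliberately left out.

```python
from pathlib import Path
from typing import Optional


def resolve_spec_file(request_path: str, docs_root: Path) -> Optional[Path]:
    """Map a request path such as /docs/<name>.yaml to a file under docs_root.

    Returns None when the path is not a spec request, the file does not
    exist, or the path tries to escape docs_root (e.g. via "..").
    """
    prefix = "/docs/"
    if not request_path.startswith(prefix):
        return None
    name = request_path[len(prefix):]
    if not name.endswith((".yaml", ".yml")):
        return None
    root = docs_root.resolve()
    candidate = (root / name).resolve()
    # Reject anything that resolves outside the docs directory.
    if root not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None
```

A request handler could call this function and return the file contents on a match, or a 404 when it returns None.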

Tue, Jun 2, 9:25 AM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Mon, Jun 1

gkyziridis added a comment to T426081: Configure LiftWing's openAPI specs into mediawiki-config.

Thank you all for your advice and comments. It is much appreciated!
I would like to add some clarification to some specific points:

  1. We would like to deliver something that works and ship it fast; we have a hard deadline of June 10.
  2. We are not making changes to the services or to the models' schemas, which means that we will not need multiple deployments.
Mon, Jun 1, 1:47 PM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Thu, May 28

gkyziridis added a comment to T384584: Adding uv as a package manager on Lift Wing/blubber.

I would like to move some of the Python services (example) to uv-based workflows. @gkyziridis I see that the merge request is not closed yet. Are you planning to work on this?

Thu, May 28, 2:27 PM · Release Pipeline (Blubber), Patch-For-Review, Machine-Learning-Team
gkyziridis added a comment to T426081: Configure LiftWing's openAPI specs into mediawiki-config.
# Clone mediawiki-config and fetch the patch.
git clone "https://gerrit.wikimedia.org/r/operations/mediawiki-config"
cd mediawiki-config
git fetch origin refs/changes/88/1294988/1
git checkout FETCH_HEAD
cd ..
Thu, May 28, 1:55 PM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis added a comment to T419455: Generate OpenAPI descriptions for Lift Wing APIs.

Hi @apaskulin, thanks for your work on the openapi-specs; it is much appreciated.
Based on the discussion we had in this slack-thread and after investigating multiple options, we decided to add these /docs yaml files to the mediawiki-config repository under the /static/liftwing-openapi-specs/ directory. I filed this patch for doing that.
When this patch is merged and deployed, then we can add the rest specs for the models over there.

Thu, May 28, 1:16 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Fri, May 22

gkyziridis added a comment to T426081: Configure LiftWing's openAPI specs into mediawiki-config.

Hi @hashar, thank you very much for your comment.
Ideally, we would like to take the easiest route and use something that already exists, such as fetching the specs from the main repo, e.g. via github.raw (tested in this comment). This way we avoid implementing extra services.

Fri, May 22, 2:45 PM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Thu, May 21

gkyziridis added a comment to T419455: Generate OpenAPI descriptions for Lift Wing APIs.

Hey @apaskulin, thanks for your comments.
We are still figuring out how we will expose the openapi-specs docs.
We decided to start with the models that are currently configured in /docs and to leave out the revscoring/ORES models (multi-wiki pattern) for now. We will discuss them in a second iteration.

Thu, May 21, 1:59 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Wed, May 20

gkyziridis added a comment to T425680: Host Qwen 3.6-27B as an inference service.

Classic Kserve v1 API call:

curl -s -i https://inference.svc.eqiad.wmnet:30443/v1/models/qwen3-14b:predict \
-X POST \
-H "Content-Type: application/json" \
-H "Host: qwen3-14b.experimental.wikimedia.org" \
-d '{"prompt": "What is the capital of France?", "max_tokens": 50}'
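
For callers that prefer Python over curl, the same request could be assembled roughly as follows. This is only a sketch: the helper name `build_predict_request` is hypothetical, and the request is built but not sent, since the endpoint above is an internal host.

```python
import json
import urllib.request


def build_predict_request(
    base_url: str, model: str, host: str, payload: dict
) -> urllib.request.Request:
    """Build (but do not send) a KServe v1 predict request."""
    url = f"{base_url}/v1/models/{model}:predict"
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json", "Host": host}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_predict_request(
    "https://inference.svc.eqiad.wmnet:30443",
    "qwen3-14b",
    "qwen3-14b.experimental.wikimedia.org",
    {"prompt": "What is the capital of France?", "max_tokens": 50},
)
# Sending it would be: urllib.request.urlopen(request, timeout=30)
```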
Wed, May 20, 12:05 PM · ServiceOps new, Traffic, ServiceOps-SharedInfra, Patch-For-Review, Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
gkyziridis moved T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change from Q4 FY2025-26 to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Wed, May 20, 10:59 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis closed T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change as Resolved.
Wed, May 20, 10:58 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis closed T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change, a subtask of T374698: Quantitative analysis comparing anti-vandalism bots and Automoderator, as Resolved.
Wed, May 20, 10:58 AM · Product Safety and Integrity (Sprint Tulip (Apr 13 - May 1)), Temporary accounts (4.8 TA Patrolling), Product-Analytics, Moderator-Tools-Team, Automoderator

Tue, May 19

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I can query event_sanitized.mediawiki_page_revert_risk_multilingual_prediction_change_v1 and see results.
Thank you very much for your help @Ottomata !

Image: mediawiki_page_revert_risk_multilingual_prediction_change_v1

Tue, May 19, 3:41 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis created T426766: Upgrade production vLLM image to use vLLM version >= 0.19.
Tue, May 19, 3:26 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
gkyziridis added a comment to T426081: Configure LiftWing's openAPI specs into mediawiki-config.

The usual pattern for this is to create an API endpoint that serves the OpenAPI file. For example: https://api.wikimedia.org/service/lw/recommendation/api/openapi.json although YAML will also work

Tue, May 19, 10:06 AM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Mon, May 18

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I cannot see results in event_sanitized.mediawiki_page_revert_risk_multilingual_prediction_change_v1:

Image: event_sanitized query

Mon, May 18, 12:16 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Wed, May 13

gkyziridis added a comment to T425680: Host Qwen 3.6-27B as an inference service.

This is an initial report for the Qwen model deployment, produced with the optimize-model skill.
I found it pretty detailed, so I am pasting it here.

  1. Qwen36-27B Optimization Report
Wed, May 13, 2:40 PM · ServiceOps new, Traffic, ServiceOps-SharedInfra, Patch-For-Review, Lift-Wing, Machine-Learning-Team (Q4 FY2025-26)
gkyziridis added a comment to T426081: Configure LiftWing's openAPI specs into mediawiki-config.

You can pull the mediawiki-config patch and follow these steps to reproduce and test it locally.

Wed, May 13, 11:40 AM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Configuration of 46 wikis + testwiki on changeprop is merged and deployed.
We can see streams coming from multiple wikis in https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.page_revert_risk_multilingual_prediction_change.v1.

Wed, May 13, 7:46 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Tue, May 12

gkyziridis updated the task description for T426081: Configure LiftWing's openAPI specs into mediawiki-config.
Tue, May 12, 3:40 PM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis added a comment to T419455: Generate OpenAPI descriptions for Lift Wing APIs.

I created a subtask for configuring the openapi specs in mediawiki-config: https://phabricator.wikimedia.org/T426081
I will paste my findings there.

Tue, May 12, 3:35 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis created T426081: Configure LiftWing's openAPI specs into mediawiki-config.
Tue, May 12, 3:15 PM · collaboration-services, ServiceOps-Mediawiki, ServiceOps-SharedInfra, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Mon, May 11

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Thank you very much @Ottomata !

Mon, May 11, 2:15 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis added a comment to T419455: Generate OpenAPI descriptions for Lift Wing APIs.

I went ahead and tested a public, api.wikimedia.org Lift Wing POST endpoint in Swagger UI using the anonymous example from the docs, and the API request successfully returned the expected response. Unless I'm missing something, this means that a complete OpenAPI spec for the public Lift Wing endpoints will work in Swagger UI even though the endpoints are v1 and POST.

This was tested by @apaskulin and by me (in the comment above) using Swagger UI, so we can move forward and test how we can use these yaml files in the Rest Sandbox.

Mon, May 11, 2:10 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
gkyziridis added a comment to T332602: Investigate how to enable the swagger UI for InferenceService resources.

Update

Since Kserve does not allow the v1 POST endpoints to be tested in the default Swagger UI, we decided to retrieve the openapi specs from the LiftWing model server and make them available to the Rest Sandbox.
We can close this task.
More information can be found in T419455 .

Mon, May 11, 1:54 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I +1'd the event-sanitisation patch: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1283749
I think DE needs to deploy it? I do not have +2 permissions on the patch.

Mon, May 11, 9:50 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis added a comment to T419455: Generate OpenAPI descriptions for Lift Wing APIs.

Hey @apaskulin, thanks for your update.

Would this be one API spec in the repository or multiple specs, such as one under each model directory?

I think it would be better to have a /docs directory at the top level of the repo where we can add/configure examples and openapi specs for all of the models.

Mon, May 11, 9:42 AM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

May 6 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.
May 6 2026, 1:41 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

May 4 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

🎉 Indeed there are events in stream.wikimedia.org.
Maybe I'm not of much help, but judging from the error above (Cannot connect to host eventgate-main.discovery.wmnet:4480) and the fact that we have events in the streams, could these errors be transient or handled by a retry mechanism? What I mean is that if the errors to eventgate were persistent, we wouldn't see any events published at all.
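
For reference, a retry mechanism for transient failures of this kind usually looks something like the sketch below. It is not taken from the revert-risk service, and whether that service retries at all is not stated here. The helper name `retry_async`, the default attempt count, the delays, and the exception types are assumptions; real aiohttp code would pass the client library's own exception classes in `retry_on`.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple = (ConnectionError, asyncio.TimeoutError),
) -> T:
    """Run `operation`, retrying with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter avoids many clients retrying at the same instant.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

With a wrapper like this, a temporary DNS failure on the first attempt would not surface as an error as long as a later attempt succeeds, which would be consistent with events still arriving in the stream.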

May 4 2026, 11:42 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

The new stream is deployed on Eventstreams.
I can see streams at: stream.wikimedia.org.
I am still getting some errors:

RuntimeError: Unexpected error happened while the event was posted to EventGate, there is the possibility that it never reached it. Please contact the ML team if the issue persists.
2026-05-04 09:58:11.528 1 uvicorn.error ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
    await future
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/revert_risk_model/python/events.py", line 228, in send_event
    async with aio_http_client.post(
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 624, in _request
    with timer:
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1159, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/applications.py", line 90, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
    response, response_headers = await self.dataplane.infer(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
    response, res_headers = await model(request, headers=headers)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
    (await self.predict(payload, headers))
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
    await self.send_event(
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
    await events.send_event(
  File "/srv/revert_risk_model/python/events.py", line 268, in send_event
    raise RuntimeError(
RuntimeError: Unexpected error happened while the event was posted to EventGate, there is the possibility that it never reached it. Please contact the ML team if the issue persists.
ERROR:root:Connection error while sending an event to EventGate: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f8c89d934a0> [Temporary failure in name resolution]
2026-05-04 09:58:12.816 1 kserve ERROR [errors.py:generic_exception_handler():143] Exception:
Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
    await future
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1209, in _resolve_host_with_throttle
    addrs = await self._resolver.resolve(host, port, family=self._family)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/resolver.py", line 40, in resolve
    infos = await self._loop.getaddrinfo(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
    return await self.run_in_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/revert_risk_model/python/events.py", line 228, in send_event
    async with aio_http_client.post(
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1568, in _create_direct_connection
    raise ClientConnectorDNSError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorDNSError: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f8c89d934a0> [Temporary failure in name resolution]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
    response, response_headers = await self.dataplane.infer(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
    response, res_headers = await model(request, headers=headers)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
    (await self.predict(payload, headers))
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
    await self.send_event(
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
    await events.send_event(
  File "/srv/revert_risk_model/python/events.py", line 260, in send_event
    raise RuntimeError(
RuntimeError: Connection error while trying to post the event to EventGate. Please contact the ML team if the issue persists.
2026-05-04 09:58:12.817 1 uvicorn.error ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
    await future
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1209, in _resolve_host_with_throttle
    addrs = await self._resolver.resolve(host, port, family=self._family)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/resolver.py", line 40, in resolve
    infos = await self._loop.getaddrinfo(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
    return await self.run_in_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/revert_risk_model/python/events.py", line 228, in send_event
    async with aio_http_client.post(
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1568, in _create_direct_connection
    raise ClientConnectorDNSError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorDNSError: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f8c89d934a0> [Temporary failure in name resolution]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1159, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/applications.py", line 90, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
279 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
280 await route.handle(scope, receive, send)
281 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
282 await self.app(scope, receive, send)
283 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
284 await wrap_app_handling_exceptions(app, request)(scope, receive, send)
285 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
286 raise exc
287 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
288 await app(scope, receive, sender)
289 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
290 response = await f(request)
291 ^^^^^^^^^^^^^^^^
292 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
293 raw_response = await run_endpoint_function(
294 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
295 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
296 return await dependant.call(**values)
297 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
298 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
299 response, response_headers = await self.dataplane.infer(
300 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
301 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
302 response, res_headers = await model(request, headers=headers)
303 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
304 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
305 (await self.predict(payload, headers))
306 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
307 File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
308 await self.send_event(
309 File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
310 await events.send_event(
311 File "/srv/revert_risk_model/python/events.py", line 260, in send_event
312 raise RuntimeError(
313RuntimeError: Connection error while trying to post the event to EventGate. Please contact the ML team if the issue persists.
314ERROR:root:Connection error while sending an event to EventGate: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f8c4bee3ec0> [Temporary failure in name resolution]
3152026-05-04 09:58:12.819 1 kserve ERROR [errors.py:generic_exception_handler():143] Exception:
316Traceback (most recent call last):
317 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
318 hosts = await self._resolve_host(host, port, traces=traces)
319 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
320 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
321 await future
322 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1209, in _resolve_host_with_throttle
323 addrs = await self._resolver.resolve(host, port, family=self._family)
324 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
325 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/resolver.py", line 40, in resolve
326 infos = await self._loop.getaddrinfo(
327 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
328 File "/usr/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
329 return await self.run_in_executor(
330 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
331 File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
332 result = self.fn(*self.args, **self.kwargs)
333 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
334 File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
335 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
336 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
337socket.gaierror: [Errno -3] Temporary failure in name resolution
338
339The above exception was the direct cause of the following exception:
340
341Traceback (most recent call last):
342 File "/srv/revert_risk_model/python/events.py", line 228, in send_event
343 async with aio_http_client.post(
344 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
345 self._resp: _RetType = await self._coro
346 ^^^^^^^^^^^^^^^^
347 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
348 resp = await handler(req)
349 ^^^^^^^^^^^^^^^^^^
350 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
351 conn = await self._connector.connect(
352 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
353 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
354 proto = await self._create_connection(req, traces, timeout)
355 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
356 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
357 _, proto = await self._create_direct_connection(req, traces, timeout)
358 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
359 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1568, in _create_direct_connection
360 raise ClientConnectorDNSError(req.connection_key, exc) from exc
361aiohttp.client_exceptions.ClientConnectorDNSError: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f8c4bee3ec0> [Temporary failure in name resolution]
362
363During handling of the above exception, another exception occurred:
364
365Traceback (most recent call last):
366 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
367 await self.app(scope, receive, _send)
368 File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
369 await self.app(scope, receive, send_wrapper)
370 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
371 await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
372 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
373 raise exc
374 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
375 await app(scope, receive, sender)
376 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
377 await self.app(scope, receive, send)
378 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
379 await self.middleware_stack(scope, receive, send)
380 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
381 await route.handle(scope, receive, send)
382 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
383 await self.app(scope, receive, send)
384 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
385 await wrap_app_handling_exceptions(app, request)(scope, receive, send)
386 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
387 raise exc
388 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
389 await app(scope, receive, sender)
390 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
391 response = await f(request)
392 ^^^^^^^^^^^^^^^^
393 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
394 raw_response = await run_endpoint_function(
395 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
396 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
397 return await dependant.call(**values)
398 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
399 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
400 response, response_headers = await self.dataplane.infer(
401 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
402 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
403 response, res_headers = await model(request, headers=headers)
404 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
405 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
406 (await self.predict(payload, headers))
407 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
408 File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
409 await self.send_event(
410 File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
411 await events.send_event(
412 File "/srv/revert_risk_model/python/events.py", line 260, in send_event
413 raise RuntimeError(
414RuntimeError: Connection error while trying to post the event to EventGate. Please contact the ML team if the issue persists.
4152026-05-04 09:58:12.820 1 uvicorn.error ERROR: Exception in ASGI application
416Traceback (most recent call last):
417 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
418 hosts = await self._resolve_host(host, port, traces=traces)
419 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
420 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
421 await future
422 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1209, in _resolve_host_with_throttle
423 addrs = await self._resolver.resolve(host, port, family=self._family)
424 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
425 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/resolver.py", line 40, in resolve
426 infos = await self._loop.getaddrinfo(
427 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
428 File "/usr/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
429 return await self.run_in_executor(
430 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
431 File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
432 result = self.fn(*self.args, **self.kwargs)
433 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
434 File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
435 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
436 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
437socket.gaierror: [Errno -3] Temporary failure in name resolution
438
439The above exception was the direct cause of the following exception:
440
441Traceback (most recent call last):
442 File "/srv/revert_risk_model/python/events.py", line 228, in send_event
443 async with aio_http_client.post(
444 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
445 self._resp: _RetType = await self._coro
446 ^^^^^^^^^^^^^^^^
447 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
448 resp = await handler(req)
449 ^^^^^^^^^^^^^^^^^^
450 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
451 conn = await self._connector.connect(
452 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
453 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
454 proto = await self._create_connection(req, traces, timeout)
455 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
456 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
457 _, proto = await self._create_direct_connection(req, traces, timeout)
458 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
459 File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1568, in _create_direct_connection
460 raise ClientConnectorDNSError(req.connection_key, exc) from exc
461aiohttp.client_exceptions.ClientConnectorDNSError: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f8c4bee3ec0> [Temporary failure in name resolution]
462
463During handling of the above exception, another exception occurred:
464
465Traceback (most recent call last):
466 File "/opt/lib/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
467 result = await app( # type: ignore[func-returns-value]
468 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
469 File "/opt/lib/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
470 return await self.app(scope, receive, send)
471 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
472 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1159, in __call__
473 await super().__call__(scope, receive, send)
474 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/applications.py", line 90, in __call__
475 await self.middleware_stack(scope, receive, send)
476 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
477 raise exc
478 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
479 await self.app(scope, receive, _send)
480 File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
481 await self.app(scope, receive, send_wrapper)
482 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
483 await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
484 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
485 raise exc
486 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
487 await app(scope, receive, sender)
488 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
489 await self.app(scope, receive, send)
490 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
491 await self.middleware_stack(scope, receive, send)
492 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
493 await route.handle(scope, receive, send)
494 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
495 await self.app(scope, receive, send)
496 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
497 await wrap_app_handling_exceptions(app, request)(scope, receive, send)
498 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
499 raise exc
500 File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
501 await app(scope, receive, sender)
502 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
503 response = await f(request)
504 ^^^^^^^^^^^^^^^^
505 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
506 raw_response = await run_endpoint_function(
507 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
508 File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
509 return await dependant.call(**values)
510 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
511 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
512 response, response_headers = await self.dataplane.infer(
513 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
514 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
515 response, res_headers = await model(request, headers=headers)
516 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
517 File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
518 (await self.predict(payload, headers))
519 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
520 File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
521 await self.send_event(
522 File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
523 await events.send_event(
524 File "/srv/revert_risk_model/python/events.py", line 260, in send_event
525 raise RuntimeError(
526RuntimeError: Connection error while trying to post the event to EventGate. Please contact the ML team if the issue persists.
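
One possible mitigation for the `[Errno -3] Temporary failure in name resolution` failures above is to retry the post when the resolver reports a transient error. The sketch below is not what `events.py` does. `is_transient_dns_error` and `retry_on_transient_dns` are hypothetical helper names, and the sketch assumes the failure surfaces as a `socket.gaierror` carrying `EAI_AGAIN` somewhere in the exception chain, as it does in this traceback.

```python
import asyncio
import socket


def is_transient_dns_error(exc):
    """Return True if a gaierror with EAI_AGAIN appears in the exception chain."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if isinstance(exc, socket.gaierror) and exc.errno == socket.EAI_AGAIN:
            return True
        # Follow explicit chaining first, then implicit chaining.
        exc = exc.__cause__ or exc.__context__
    return False


async def retry_on_transient_dns(send, attempts=3, base_delay=0.1):
    """Await the zero-argument coroutine function `send`, retrying on EAI_AGAIN.

    Any other error, or the last failed attempt, is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await send()
        except Exception as exc:
            if attempt == attempts or not is_transient_dns_error(exc):
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Whether retrying is appropriate here depends on how long the resolver stays unavailable; a short backoff only helps with brief blips.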

May 4 2026, 10:15 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis created P92175 Errors RevertRisk-Multilingual.
May 4 2026, 10:14 AM

Apr 30 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I am seeing errors on the current deployment. Please check the logs in this paste:

ERROR:root:Connection error while sending an event to EventGate: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f5630172180> [Temporary failure in name resolution]
2026-04-30 15:27:23.530 1 kserve ERROR [errors.py:generic_exception_handler():143] Exception:
Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
    await future
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1209, in _resolve_host_with_throttle
    addrs = await self._resolver.resolve(host, port, family=self._family)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/resolver.py", line 40, in resolve
    infos = await self._loop.getaddrinfo(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
    return await self.run_in_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/revert_risk_model/python/events.py", line 228, in send_event
    async with aio_http_client.post(
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1568, in _create_direct_connection
    raise ClientConnectorDNSError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorDNSError: Cannot connect to host eventgate-main.discovery.wmnet:4480 ssl:<ssl.SSLContext object at 0x7f5630172180> [Temporary failure in name resolution]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
    response, response_headers = await self.dataplane.infer(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
    response, res_headers = await model(request, headers=headers)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
    (await self.predict(payload, headers))
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
    await self.send_event(
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
    await events.send_event(
  File "/srv/revert_risk_model/python/events.py", line 260, in send_event
    raise RuntimeError(
RuntimeError: Connection error while trying to post the event to EventGate. Please contact the ML team if the issue persists.

====================================================================================================
====================================================================================================


ERROR:root:Unexpected error while trying to send an event to EventGate:
2026-04-30 15:30:52.362 1 kserve ERROR [errors.py:generic_exception_handler():143] Exception:
Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 779, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 734, in _connect_and_send_request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 672, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1239, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1562, in _create_direct_connection
    hosts = await self._resolve_host(host, port, traces=traces)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/connector.py", line 1153, in _resolve_host
    await future
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/revert_risk_model/python/events.py", line 228, in send_event
    async with aio_http_client.post(
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1510, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/client.py", line 624, in _request
    with timer:
  File "/opt/lib/venv/lib/python3.11/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/venv/lib/python3.11/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 660, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in app
    await route.handle(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 134, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/opt/lib/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 120, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 674, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/fastapi/routing.py", line 328, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/rest/v1_endpoints.py", line 84, in predict
    response, response_headers = await self.dataplane.infer(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/protocol/dataplane.py", line 461, in infer
    response, res_headers = await model(request, headers=headers)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/kserve/model.py", line 285, in __call__
    (await self.predict(payload, headers))
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 126, in predict
    await self.send_event(
  File "/srv/revert_risk_model/model_server/gpu_model.py", line 158, in send_event
    await events.send_event(
  File "/srv/revert_risk_model/python/events.py", line 268, in send_event
    raise RuntimeError(
RuntimeError: Unexpected error happened while the event was posted to EventGate, there is the possibility that it never reached it. Please contact the ML team if the issue persists.
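
The two banners in these tracebacks come from Python's exception chaining: "direct cause" appears when code uses `raise ... from exc`, while "During handling" appears when a new exception is raised inside an `except` block without `from`. The standard-library sketch below reproduces the same shape with stand-in classes. `DnsLookupError`, `EventDeliveryError`, and `root_cause` are hypothetical names for illustration, not the service's or aiohttp's code.

```python
import socket


class DnsLookupError(OSError):
    """Hypothetical stand-in for aiohttp's ClientConnectorDNSError."""


class EventDeliveryError(RuntimeError):
    """Hypothetical stand-in for the RuntimeError raised in events.py."""


def resolve():
    raise socket.gaierror(socket.EAI_AGAIN, "Temporary failure in name resolution")


def connect():
    try:
        resolve()
    except socket.gaierror as exc:
        # Explicit chaining sets __cause__ (the "direct cause" banner).
        raise DnsLookupError("Cannot connect to host") from exc


def send_event():
    try:
        connect()
    except OSError:
        # Implicit chaining sets only __context__ (the "During handling" banner).
        raise EventDeliveryError("Connection error while posting the event")


def root_cause(exc):
    """Follow __cause__, then __context__, down to the first exception raised."""
    while True:
        nxt = exc.__cause__ or exc.__context__
        if nxt is None:
            return exc
        exc = nxt
```

Calling `root_cause` on the exception raised by `send_event()` returns the original `socket.gaierror`, which is the line to look for at the top of each chained traceback.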

Apr 30 2026, 4:06 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis created P92101 Fetching revesion from MediaWikiAPI error.
Apr 30 2026, 4:05 PM
gkyziridis created P92097 Errors from RevertRisk-Multilingual.
Apr 30 2026, 3:40 PM

Apr 29 2026

gkyziridis edited P91946 Locust Tests RevertRisk-Multilingual on staging.
Apr 29 2026, 2:47 PM
gkyziridis created P91946 Locust Tests RevertRisk-Multilingual on staging.
Apr 29 2026, 2:45 PM
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

36 POST /v1/models/revertrisk-multilingual:predict: LocustBadStatusCode(code=422)

Regarding the load test results, have you looked at these 422 errors? Although there aren't many, it would be great to figure out why they are popping up.

I think those errors are due to missing revisions, which are randomly selected from "data/revisions_lang_and_id.tsv" during the Locust test.
Basically, it is one specific revision_id that is missing; check the logs:

INFO:root:Model Server: RevertRiskMultilingualGPU
INFO:root:Successfully loaded 342 canonical wiki languages.
WARNING:root:CUDA is not available or PyTorch is CPU-only; using CPU instead.
2026-04-29 14:47:07.078 1 kserve INFO [model_server.py:register_model():402] Registering model: revertrisk-multilingual
2026-04-29 14:47:07.079 1 kserve INFO [model_server.py:setup_event_loop():282] Setting max asyncio worker threads as 32
2026-04-29 14:47:07.130 1 kserve INFO [server.py:_register_endpoints():110] OpenAI endpoints not registered
2026-04-29 14:47:07.130 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2026-04-29 14:47:07.144 1 uvicorn.error INFO:     Started server process [1]
2026-04-29 14:47:07.144 1 uvicorn.error INFO:     Waiting for application startup.
2026-04-29 14:47:07.148 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2026-04-29 14:47:07.149 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2026-04-29 14:47:07.149 1 uvicorn.error INFO:     Application startup complete.
2026-04-29 14:47:07.149 1 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:root:Opening a new Asyncio session for mwapi.
INFO:root:revision 1096365864 (en): revision_missing
INFO:root:revision 1096365864 (en): revision_missing
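
To confirm that the 422s come from a small set of revisions, the log lines above can be tallied. This is only a rough sketch: the regular expression is inferred from the two "revision_missing" lines shown here, and the function name count_missing_revisions is hypothetical, not something from inference-services.

```python
import re
from collections import Counter

# Pattern inferred from lines such as:
#   "INFO:root:revision 1096365864 (en): revision_missing"
MISSING_RE = re.compile(r"revision (\d+) \(([\w-]+)\): revision_missing")


def count_missing_revisions(log_lines):
    """Count revision_missing log lines, keyed by (lang, rev_id)."""
    counts = Counter()
    for line in log_lines:
        match = MISSING_RE.search(line)
        if match:
            rev_id, lang = match.groups()
            counts[(lang, int(rev_id))] += 1
    return counts


sample_logs = [
    "INFO:root:Opening a new Asyncio session for mwapi.",
    "INFO:root:revision 1096365864 (en): revision_missing",
    "INFO:root:revision 1096365864 (en): revision_missing",
]
missing = count_missing_revisions(sample_logs)
```

If the counter holds a single key, as it does for the sample lines, removing that revision from the test data should make the 422s go away.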
Apr 29 2026, 2:05 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Hey @achou, I think that this change is adding some extra latency as well.

Apr 29 2026, 8:49 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Apr 28 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Response time percentiles (approximated)
Type     Name                                            50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/revertrisk-multilingual:predict      820   1300   1800   2200   3200   4200   5400   5900   8200  10000  10000   1743
--------|--------------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                      820   1300   1800   2200   3200   4200   5400   5900   8200  10000  10000   1743
Apr 28 2026, 1:07 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Apr 21 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I am pasting some load test results here.
Locust Test results:

locust RevertriskMultilingual --headless   --users 35   --spawn-rate 5   --run-time 120s   --only-summary
[2026-04-21 15:17:11,442] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2026-04-21 15:17:11,442] stat1010/INFO/locust.runners: Ramping to 35 users at a rate of 5.00 per second
[2026-04-21 15:17:17,447] stat1010/INFO/locust.runners: All users spawned: {"RevertriskMultilingual": 35} (35 total users)
[2026-04-21 15:19:10,981] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/revertrisk-multilingual:predict                                       931     7(0.75%) |   1351      65   24093    430 |    7.81        0.06
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       931     7(0.75%) |   1351      65   24093    430 |    7.81        0.06
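
As a sanity check on the summary row, the failure percentage and the effective run duration can be recomputed from the request counts. A small sketch using only the numbers printed above; the helper names are hypothetical.

```python
def failure_rate_percent(requests, failures):
    """Share of failed requests, in percent."""
    return 100.0 * failures / requests


def approx_duration_seconds(requests, req_per_sec):
    """Run duration implied by the total request count and the average rate."""
    return requests / req_per_sec


# Values copied from the Locust summary row above.
rate = failure_rate_percent(931, 7)
duration = approx_duration_seconds(931, 7.81)
```

The recomputed rate rounds to the 0.75% that Locust reports, and the implied duration sits just under the 120 second run-time limit.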
Apr 21 2026, 3:26 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Apr 2 2026

gkyziridis claimed T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.
Apr 2 2026, 3:24 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Testing again on staging
I used this event for testing:

{"changelog_kind":"update","page_change_kind":"edit","dt":"2026-04-02T13:15:57Z","wiki_id":"enwiki","page":{"page_id":27045134,"page_title":"Indonesian_orthography","namespace_id":0,"is_redirect":false},"performer":{"user_text":"Ramkarlo82","groups":["extendedconfirmed","*","user","autoconfirmed"],"is_bot":false,"is_system":false,"is_temp":false,"user_id":48476947,"registration_dt":"2024-09-21T16:45:02Z","edit_count":1085,"user_central_id":76548874},"revision":{"rev_id":1346719934,"rev_dt":"2026-04-02T13:15:57Z","is_minor_edit":false,"rev_sha1":"b8mpd89jbz4r96ra9xwq7d5ez0cvz85","rev_size":12810,"rev_parent_id":1334816050,"comment":"ce + change bulleted list to prose","editor":{"user_text":"Ramkarlo82","groups":["extendedconfirmed","*","user","autoconfirmed"],"is_bot":false,"is_system":false,"is_temp":false,"user_id":48476947,"registration_dt":"2024-09-21T16:45:02Z","edit_count":1085,"user_central_id":76548874},"is_content_visible":true,"is_editor_visible":true,"is_comment_visible":true,"content_slots":{"main":{"slot_role":"main","content_model":"wikitext","content_sha1":"b8mpd89jbz4r96ra9xwq7d5ez0cvz85","content_size":12810,"content_format":"text/x-wiki","origin_rev_id":1346719934}}},"prior_state":{"revision":{"rev_id":1334816050,"rev_dt":"2026-01-25T20:24:48Z","is_minor_edit":false,"rev_sha1":"1pt2p9slm17bi989386sz2l9px3938y","rev_size":12859,"rev_parent_id":1332090077,"comment":"/* Q and X */","editor":{"user_text":"Isla","groups":["extendedconfirmed","reviewer","*","user","autoconfirmed"],"is_bot":false,"is_system":false,"is_temp":false,"user_id":43126470,"registration_dt":"2022-01-03T19:49:05Z","edit_count":3120,"user_central_id":68558834},"is_content_visible":true,"is_editor_visible":true,"is_comment_visible":true,"content_slots":{"main":{"slot_role":"main","content_model":"wikitext","content_sha1":"1pt2p9slm17bi989386sz2l9px3938y","content_size":12859,"content_format":"text/x-wiki","origin_rev_id":1334816050}}}},"$schema":"/mediawiki/page/change/1.3.0","meta":{"stream":"mediawiki.page_change.v1","uri":"https://en.wikipedia.org/wiki/Indonesian_orthography","id":"26fd274c-268d-4ab5-bd5f-7e8c9b6a81ba-TEST-2","request_id":"e00bcc98-ad4e-4c92-a893-bed1e974f07d","domain":"en.wikipedia.org","dt":"2026-04-02T13:16:00.761Z"}}
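
For reference, here is a rough sketch of how the fields a revert-risk request needs could be pulled out of such a page_change event. It assumes the model server takes a {"rev_id", "lang"} payload and that the language code is the wiki_id without its "wiki" suffix; the function name is hypothetical and this is not the actual inference-services code.

```python
import json


def predict_payload_from_event(event):
    """Build a {"rev_id", "lang"} payload from a page_change event (assumed mapping)."""
    wiki_id = event["wiki_id"]
    # Assumption: "enwiki" -> "en"; other wiki_id shapes are passed through unchanged.
    lang = wiki_id[: -len("wiki")] if wiki_id.endswith("wiki") else wiki_id
    return {"rev_id": event["revision"]["rev_id"], "lang": lang}


# Trimmed copy of the test event above; only the fields used here are kept.
raw_event = (
    '{"wiki_id": "enwiki", '
    '"revision": {"rev_id": 1346719934, "rev_parent_id": 1334816050}}'
)
payload = predict_payload_from_event(json.loads(raw_event))
```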

Apr 2 2026, 3:22 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis created P90240 prediction_change_event.
Apr 2 2026, 2:55 PM
gkyziridis updated the language for P90234 ChangeProp logs from autodetect to json.
Apr 2 2026, 2:37 PM
gkyziridis updated the language for P90235 Event test for rr-multilingual from autodetect to json.
Apr 2 2026, 2:34 PM
gkyziridis created P90235 Event test for rr-multilingual.
Apr 2 2026, 2:34 PM
gkyziridis created P90234 ChangeProp logs.
Apr 2 2026, 2:31 PM
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

The latest version of the revertrisk-multilingual model, which handles the stream, is deployed in production.

$ kube_env revertrisk ml-serve-eqiad
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-language-agnostic-predictor-00004-deployment-b477w7x   3/3     Running   0          11m
revertrisk-language-agnostic-predictor-00004-deployment-b4b6mnj   3/3     Running   0          9m36s
revertrisk-language-agnostic-predictor-00004-deployment-b4t4jlt   3/3     Running   0          11m
revertrisk-language-agnostic-predictor-00004-deployment-b4tfdlw   3/3     Running   0          9m36s
revertrisk-language-agnostic-predictor-00004-deployment-b4vz7cp   3/3     Running   0          11m
revertrisk-language-f284bff08aba54bd309680bad6316c0a-deplos7tj6   3/3     Running   0          11m
revertrisk-multilingual-pre-save-predictor-00003-deploymen8xsr4   3/3     Running   0          11m
revertrisk-multilingual-predictor-00004-deployment-5c575f98bbms   3/3     Running   0          11m
revertrisk-multilingual-predictor-00004-deployment-5c575f9flqt8   3/3     Running   0          11m
revertrisk-multilingual-predictor-00004-deployment-5c575f9n78wg   3/3     Running   0          11m
revertrisk-wikidata-predictor-00003-deployment-6558b4b65-k8wvg    3/3     Running   0          11m
revertrisk-wikidata-predictor-00003-deployment-6558b4b65-q4lhh    3/3     Running   0          11m
Apr 2 2026, 7:41 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Mar 31 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I recall we verified it works on staging. Is there anything left to do before we move it to production?

Hey @achou, yes indeed, I will work on this tomorrow. I do not think we are missing anything else before going to production.

Mar 31 2026, 8:50 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Mar 17 2026

gkyziridis created T420327: Improve logging on Liftwing.
Mar 17 2026, 10:23 AM · Essential-Work, Machine-Learning-Team (Q4 FY2025-26)
gkyziridis closed T419527: Increase batch size in edit-check service, a subtask of T413026: [MILESTONE] Offer Tone Check as default-on feature at partner wikis, as Resolved.
Mar 17 2026, 10:13 AM · Editing QA, OKR-Work, Editing-team (Editing-Q4-30Mar-10Apr-2026), Epic, EditCheck, VisualEditor
gkyziridis closed T419527: Increase batch size in edit-check service as Resolved.
Mar 17 2026, 10:13 AM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team
gkyziridis moved T419527: Increase batch size in edit-check service from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Mar 17 2026, 10:13 AM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team

Mar 12 2026

gkyziridis added a comment to T419527: Increase batch size in edit-check service.

Building on the above, what (if any) other metrics do you think we ought to be monitoring to evaluate the impact of increasing the max_batch_size from 100 to 300?

I do not think there are any other metrics we need to measure for this change. We will keep monitoring latency, throughput, and error rates as we already do.
If we see that we still have issues with big batches, we can raise that number.
I think 300 seems OK for now.

Mar 12 2026, 8:39 AM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team

Mar 11 2026

gkyziridis added a comment to T419527: Increase batch size in edit-check service.
gkyziridis@deploy2002:$ kube_env edit-check ml-serve-codfw
gkyziridis@deploy2002:$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
edit-check-predictor-00003-deployment-ff659c867-nhqm2   4/4     Running   0          7m58s
Mar 11 2026, 2:17 PM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team

Mar 10 2026

gkyziridis updated the task description for T419527: Increase batch size in edit-check service.
Mar 10 2026, 12:19 PM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team
gkyziridis added projects to T419527: Increase batch size in edit-check service: Lift-Wing, ml-model-requests.
Mar 10 2026, 12:17 PM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team
gkyziridis created T419527: Increase batch size in edit-check service.
Mar 10 2026, 12:16 PM · Editing-team (Tracking), OKR-Work (WE1 FY2025-26), ml-model-requests, Lift-Wing, Machine-Learning-Team

Feb 24 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Build Image:

docker build -f .pipeline/revertrisk/multilingual.yaml --target production --platform=linux/amd64 -t multilingual:events .
Feb 24 2026, 3:32 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Feb 18 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

I was experimenting with the option:

Another option: make a .v2 stream with a different/new or just new major version 2.0.0 schema that supports multiple model predictions per event, either via a array of them, or a map of them. The downside would be that evolving the items in the array or map would not be easily supported (it's complicated).

And I found it kind of complicated. I think we can go with the option of creating a separate (dedicated) stream for the rr-multilingual predictions, something like: EVENTGATE_STREAM=mediawiki.page_revert_risk_multilingual_prediction_change.v1. This will separate the streams, right?
This way we have two different streams pointing to the same schema, and in the deployment charts we set the corresponding EVENT_STREAM value for each of the rr models. We also set the correct values under changeprop so we maintain two different streams.
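
A minimal sketch of the idea, assuming each model server reads its target stream from an EVENTGATE_STREAM environment variable set per deployment. The resolve_stream helper and the deployments mapping are hypothetical, and the ".v1" suffix on the existing stream name is an assumption as well.

```python
def resolve_stream(model_name, environ):
    """Return the stream configured for a model server, failing loudly if unset."""
    stream = environ.get("EVENTGATE_STREAM")
    if not stream:
        raise ValueError(f"EVENTGATE_STREAM is not set for {model_name}")
    return stream


# Hypothetical per-deployment environments, standing in for the deployment charts.
deployments = {
    "revertrisk-language-agnostic": {
        "EVENTGATE_STREAM": "mediawiki.page_revert_risk_prediction_change.v1",
    },
    "revertrisk-multilingual": {
        "EVENTGATE_STREAM": "mediawiki.page_revert_risk_multilingual_prediction_change.v1",
    },
}

streams = {name: resolve_stream(name, env) for name, env in deployments.items()}
```

Both models would share the same event-building code; only the configured stream name differs.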

Feb 18 2026, 12:17 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Feb 11 2026

gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

@Ottomata thank you for the comments.
We are not yet in a state to deploy this to production. I just built it like this in order to understand the flow and test it on staging as well.
Currently many people from our team are absent, so we will make the final decisions when they are back.
For now I just implemented this and we can test things on staging.
I will experiment with the alternatives as well:

we may need to produce predictions to a separate stream instead of mediawiki.page_revert_risk_prediction_change.

Another option: make a .v2 stream with a different/new or just new major version 2.0.0 schema that supports multiple model predictions per event, either via a array of them, or a map of them. The downside would be that evolving the items in the array or map would not be easily supported (it's complicated).

Feb 11 2026, 2:51 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering
gkyziridis closed T405358: Add LiftWing streams data to event_sanitized (increase data retention) as Resolved.
Feb 11 2026, 11:58 AM · Lift-Wing, Machine-Learning-Team
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Finished the implementation of the event mechanism in inference-services for the rr-multilingual model.
This is the local testing on my machine:

Feb 11 2026, 9:38 AM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Feb 6 2026

gkyziridis added a comment to T396495: Build model training pipeline for tone check using WMF ML Airflow instance.

Update

Since task T406217 is finished, we have a first version of the end-to-end pipeline, including all the basic steps of an ML lifecycle: Data Generation -> Model Training -> Export model to an S3 bucket.
More info can be found here: https://phabricator.wikimedia.org/T398970

Feb 6 2026, 1:18 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Editing-team (Tracking), Machine-Learning-Team
gkyziridis added a comment to T398970: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model.

Generate Data (SparkSubmitOperator) -> Train/Validation/Test split (SparkSubmitOperator) -> Copy from HDFS to a PVC (WMFKubernetesPodOperator) -> Train model on GPU pod (WMFKubernetesPodOperator) -> Copy retrained model to S3 (PythonOperator)
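
The chain above is strictly linear, which can be sketched in plain Python without importing Airflow. The step and operator names are copied from the line above; the Step class and dependency_pairs helper are hypothetical and are not the DAG code itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    operator: str


# Steps and operators as listed in the pipeline description above.
PIPELINE = [
    Step("Generate Data", "SparkSubmitOperator"),
    Step("Train/Validation/Test split", "SparkSubmitOperator"),
    Step("Copy from HDFS to a PVC", "WMFKubernetesPodOperator"),
    Step("Train model on GPU pod", "WMFKubernetesPodOperator"),
    Step("Copy retrained model to S3", "PythonOperator"),
]


def dependency_pairs(steps):
    """Return (upstream, downstream) name pairs for a strictly linear pipeline."""
    return [(a.name, b.name) for a, b in zip(steps, steps[1:])]


edges = dependency_pairs(PIPELINE)
```

In an actual DAG each pair would become one upstream/downstream dependency between two tasks.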

Feb 6 2026, 1:13 PM · Goal, Machine-Learning-Team
gkyziridis closed T406217: Export retrained Tone-check model to an S3 bucket, a subtask of T398970: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model, as Resolved.
Feb 6 2026, 1:07 PM · Goal, Machine-Learning-Team
gkyziridis closed T406217: Export retrained Tone-check model to an S3 bucket as Resolved.
Feb 6 2026, 1:07 PM · Patch-For-Review, Machine-Learning-Team
gkyziridis closed T396495: Build model training pipeline for tone check using WMF ML Airflow instance, a subtask of T365301: Tone Check: Prompt people to revise promotional language, as Resolved.
Feb 6 2026, 1:06 PM · Epic, EditCheck, VisualEditor
gkyziridis closed T396495: Build model training pipeline for tone check using WMF ML Airflow instance, a subtask of T391940: FY2024-25 Q4 Goal: Productionize tone check model, as Resolved.
Feb 6 2026, 1:05 PM · Goal, Machine-Learning-Team
gkyziridis closed T396495: Build model training pipeline for tone check using WMF ML Airflow instance, a subtask of T398970: Q1 FY2025-26 Goal: Airflow training pipeline for Tone check model, as Resolved.
Feb 6 2026, 1:05 PM · Goal, Machine-Learning-Team
gkyziridis closed T396495: Build model training pipeline for tone check using WMF ML Airflow instance as Resolved.
Feb 6 2026, 1:05 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Editing-team (Tracking), Machine-Learning-Team
gkyziridis added a comment to T415892: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change.

Hey, I am working on this, I think that I have finished the implementation for publishing the predictions in events. I am now testing it locally.
Based on this: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams I think there are these steps:

  1. Implementation on inference-services side (this is what I am testing).
  2. Test it and deploy the new model server versions.
  3. Configure Changeprop.
  4. Configure the new changes in the mediawiki-config repo.
Feb 6 2026, 12:43 PM · Machine-Learning-Team (Q4 FY2025-26), Data-Engineering-Radar, Event-Platform, Data-Engineering

Feb 3 2026

gkyziridis closed T411786: ORES is not working on testwiki as Resolved.
Feb 3 2026, 10:31 AM · Automoderator, Moderator-Tools-Team, Machine-Learning-Team, ORES

Jan 30 2026

gkyziridis moved T406217: Export retrained Tone-check model to an S3 bucket from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Jan 30 2026, 12:20 PM · Patch-For-Review, Machine-Learning-Team
gkyziridis moved T396495: Build model training pipeline for tone check using WMF ML Airflow instance from Ready To Go to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Jan 30 2026, 12:20 PM · Data-Platform-SRE (2026.01.23 - 2026.02.13), Essential-Work, Editing-team (Tracking), Machine-Learning-Team

Jan 29 2026

gkyziridis added a comment to T412357: Install AMD GPU + torch version of ML Labs machines.

Hey @Isaac, this ticket is assigned to @klausman but he is currently on sabbatical. He will start working on this when he is back, I think around next month.
I am tagging @DPogorzelski-WMF here for visibility, maybe he has something more to add.

Jan 29 2026, 12:25 PM · Machine-Learning-Team
gkyziridis added a comment to T406217: Export retrained Tone-check model to an S3 bucket.

Update

The end-to-end tone-check retraining pipeline succeeded; we solved the Multi-Attach PVC issues.


The new version of the retrained tone-check model is successfully copied to the dedicated S3 bucket under: s3://wmf-ml-models/retrained-models/tone-check/, here are the logs of the export step:
tone-check-training-dag-move-model-to-s3-nv8wgsew
▶ Log message source details
[2026-01-28, 22:24:03 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2026-01-28, 22:24:04 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2026-01-28, 22:24:05 UTC] {tone_check_training_dag.py:101} INFO - [+] S3 client loaded !
[2026-01-28, 22:24:05 UTC] {tone_check_training_dag.py:103} INFO - Searching files in /mnt/model-training/tone_check/20260128T134152/output_model:
[2026-01-28, 22:24:05 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/config.json | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/config.json
[2026-01-28, 22:24:05 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/model.safetensors | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/model.safetensors
[2026-01-28, 22:24:12 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/special_tokens_map.json | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/special_tokens_map.json
[2026-01-28, 22:24:12 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/rng_state.pth | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/rng_state.pth
[2026-01-28, 22:24:12 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/tokenizer_config.json | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/tokenizer_config.json
[2026-01-28, 22:24:13 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/vocab.txt | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/vocab.txt
[2026-01-28, 22:24:13 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/tokenizer.json | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/tokenizer.json
[2026-01-28, 22:24:13 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/training_args.bin | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/training_args.bin
[2026-01-28, 22:24:14 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/scheduler.pt | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/scheduler.pt
[2026-01-28, 22:24:14 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/trainer_state.json | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/trainer_state.json
[2026-01-28, 22:24:14 UTC] {tone_check_training_dag.py:109} INFO - - File: /mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/optimizer.pt | wmf-ml-models | retrained-models/tone-check/checkpoint-21530/optimizer.pt
[2026-01-28, 22:24:29 UTC] {tone_check_training_dag.py:112} INFO - [+] Files uploded correctly at: s3://wmf-ml-models/retrained-models/tone-check//
[2026-01-28, 22:24:29 UTC] {python.py:240} INFO - Done. Returned value was: None
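
The "File: local path | bucket | key" lines in the log suggest that each object key is the file path relative to output_model, placed under the retrained-models/tone-check/ prefix. A rough sketch of that mapping, assuming this is how the keys are derived; the s3_key_for helper is hypothetical and is not taken from tone_check_training_dag.py.

```python
from pathlib import PurePosixPath


def s3_key_for(local_file, output_dir, prefix):
    """Map a file under output_dir to an object key under prefix."""
    relative = PurePosixPath(local_file).relative_to(output_dir)
    # Joining through PurePosixPath drops the trailing slash of the prefix,
    # so the resulting key never contains a doubled slash.
    return str(PurePosixPath(prefix) / relative)


key = s3_key_for(
    "/mnt/model-training/tone_check/20260128T134152/output_model/checkpoint-21530/config.json",
    "/mnt/model-training/tone_check/20260128T134152/output_model",
    "retrained-models/tone-check/",
)
```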

Here is the content of the S3 bucket:

$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/
2026-01-28 22:24   865   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/config.json
2026-01-28 22:24   678M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/model.safetensors
2026-01-28 22:24  1357M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/optimizer.pt
2026-01-28 22:24    13K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/rng_state.pth
2026-01-28 22:24  1064   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/scheduler.pt
2026-01-28 22:24   695   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/special_tokens_map.json
2026-01-28 22:24     2M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/tokenizer.json
2026-01-28 22:24  1330   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/tokenizer_config.json
2026-01-28 22:24     9K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/trainer_state.json
2026-01-28 22:24     5K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/training_args.bin
2026-01-28 22:24   972K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/vocab.txt
Jan 29 2026, 8:53 AM · Patch-For-Review, Machine-Learning-Team
gkyziridis created P88090 Logs of move_model_to_s3_task from Airflow.
Jan 29 2026, 8:50 AM · Machine-Learning-Team

Jan 28 2026

gkyziridis added a comment to T405358: Add LiftWing streams data to event_sanitized (increase data retention).

We currently do not store the predictions from the rr-multilingual model anywhere, so we cannot export them in the same way as we do for the rr-language-agnostic one.
If this is needed, I can open a new Phabricator task to start on the first step, which is saving the rr-multilingual predictions into the event stream. Then we can add them to the refinery and export them into event_sanitized as we do for rr-language-agnostic.

Jan 28 2026, 2:32 PM · Lift-Wing, Machine-Learning-Team
gkyziridis added a comment to T405358: Add LiftWing streams data to event_sanitized (increase data retention).

@gkyziridis I'm testing this out today but only seeing revertrisk-language-agnostic for an example revision on enwiki, is that expected?

spark-sql (default)> select predicted_classification from event.mediawiki_page_revert_risk_prediction_change_v1 where revision.rev_id = 1333904928;
predicted_classification
{"model_name":"revertrisk-language-agnostic","model_version":"3","predictions":["false"],"probabilities":{"false":0.7348057627677917,"true":0.26519423723220825}}
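
For reading such rows programmatically, here is a small sketch that parses the predicted_classification value shown above and picks the most probable label. It assumes the column is available as a JSON string in exactly this shape; the summarize_prediction helper is hypothetical.

```python
import json


def summarize_prediction(raw):
    """Return (model_name, top_label, top_probability) from one JSON record."""
    record = json.loads(raw)
    probabilities = record["probabilities"]
    top_label = max(probabilities, key=probabilities.get)
    return record["model_name"], top_label, probabilities[top_label]


# Row copied from the spark-sql output above.
raw = (
    '{"model_name":"revertrisk-language-agnostic","model_version":"3",'
    '"predictions":["false"],'
    '"probabilities":{"false":0.7348057627677917,"true":0.26519423723220825}}'
)
model_name, label, probability = summarize_prediction(raw)
```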
Jan 28 2026, 1:54 PM · Lift-Wing, Machine-Learning-Team
gkyziridis created P87991 Error during training pipeline.
Jan 28 2026, 11:27 AM

Jan 27 2026

gkyziridis added a comment to T406217: Export retrained Tone-check model to an S3 bucket.

I also checked the PVC using kubectl and I see that its access mode is "RWO" (ReadWriteOnce). I am not sure whether this is what causes the problem:

$ kube_env airflow-ml-deploy dse-k8s-eqiad
$ kubectl get pvc airflow-ml-model-training -n airflow-dev
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
airflow-ml-model-training   Bound    pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116   20Gi       RWO            ceph-rbd-ssd   151d
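
For context, ReadWriteOnce means the volume can be mounted read-write by a single node at a time, so pods scheduled on different nodes cannot share it. A small sketch that reads the access mode from the row above and checks whether it allows multi-node mounts; the helper names are hypothetical, and the column position is taken from the header printed by kubectl here.

```python
# Short names that kubectl prints for Kubernetes PVC access modes.
ACCESS_MODES = {
    "RWO": "ReadWriteOnce",
    "ROX": "ReadOnlyMany",
    "RWX": "ReadWriteMany",
    "RWOP": "ReadWriteOncePod",
}


def access_mode_from_row(row):
    """Extract the ACCESS MODES column from a 'kubectl get pvc' data row."""
    # Data columns: NAME STATUS VOLUME CAPACITY ACCESS-MODES STORAGECLASS AGE
    return row.split()[4]


def allows_multi_node_mount(short_mode):
    """True when pods on different nodes may mount the volume simultaneously."""
    return ACCESS_MODES[short_mode] in {"ReadOnlyMany", "ReadWriteMany"}


row = (
    "airflow-ml-model-training   Bound    "
    "pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116   20Gi       RWO            "
    "ceph-rbd-ssd   151d"
)
mode = access_mode_from_row(row)
shareable = allows_multi_node_mount(mode)
```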
Jan 27 2026, 8:56 AM · Patch-For-Review, Machine-Learning-Team

Jan 21 2026

gkyziridis added a comment to T406217: Export retrained Tone-check model to an S3 bucket.
$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H --recursive s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/
2026-01-20 13:33   865   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/config.json
2026-01-20 13:33   678M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/model.safetensors
2026-01-20 13:33  1357M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/optimizer.pt
2026-01-20 13:33    13K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/rng_state.pth
2026-01-20 13:33  1064   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/scheduler.pt
2026-01-20 13:33   695   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/special_tokens_map.json
2026-01-20 13:33     2M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/tokenizer.json
2026-01-20 13:33  1330   s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/tokenizer_config.json
2026-01-20 13:33    24K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/trainer_state.json
2026-01-20 13:33     5K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/training_args.bin
2026-01-20 13:33   972K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/vocab.txt
Jan 21 2026, 2:43 PM · Patch-For-Review, Machine-Learning-Team
gkyziridis created P87832 Airflow logs for move_model_to_s3_task.
Jan 21 2026, 2:42 PM

Jan 20 2026

gkyziridis moved T411786: ORES is not working on testwiki from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Jan 20 2026, 2:56 PM · Automoderator, Moderator-Tools-Team, Machine-Learning-Team, ORES
gkyziridis moved T406217: Export retrained Tone-check model to an S3 bucket from Ready To Go to In Progress on the Machine-Learning-Team board.
Jan 20 2026, 2:56 PM · Patch-For-Review, Machine-Learning-Team

Jan 15 2026

gkyziridis created P87559 Promptathon - LLM gemma3:4b model results.
Jan 15 2026, 3:52 PM

Jan 12 2026

gkyziridis created P87372 Error in isvc during rr-multilingual deployment.
Jan 12 2026, 3:36 PM
gkyziridis added a comment to T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.

Weekly Update:

  • The Wikimedia Enterprise team conducted load tests to simulate their traffic and shared results in T409388#11483570
  • We are working on optimizing the revertrisk-wikidata inference service to achieve the Enterprise team's latency target in T414060
Jan 12 2026, 3:28 PM · Patch-For-Review, OKR-Work, Goal, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise, Wikidata, Lift-Wing, Machine-Learning-Team

Jan 9 2026

gkyziridis added a comment to T411786: ORES is not working on testwiki.
curl -s -X \
POST "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict" \
-d '{"rev_id": 2, "lang": "test"}' \
-H "Host: revertrisk-language-agnostic.revertrisk.wikimedia.org"
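
The same request can be sketched with Python's standard library. The URL, Host header, and body are copied from the curl command above; the build_predict_request helper is hypothetical, and the snippet only builds the request without sending it.

```python
import json
import urllib.request

URL = "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict"
HOST = "revertrisk-language-agnostic.revertrisk.wikimedia.org"


def build_predict_request(rev_id, lang):
    """Build (but do not send) the POST request equivalent to the curl call."""
    body = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={"Host": HOST},
    )


request = build_predict_request(2, "test")
```

Passing the request to urllib.request.urlopen would only work from a host that can reach the internal inference endpoint.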
Jan 9 2026, 3:38 PM · Automoderator, Moderator-Tools-Team, Machine-Learning-Team, ORES

Jan 6 2026

gkyziridis added a comment to T411786: ORES is not working on testwiki.

Things we need to keep in mind:

  • Testwiki is not a canonical/normal wiki, so it is excluded from the canonical_wikis list.
  • Testwiki is not a supported wiki for the revertrisk model, so predictions will be completely inaccurate.
  • We treat testwiki as enwiki on the fly so that the revert-risk model server accepts API requests posting {"lang": "test"}.
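
A minimal sketch of the aliasing described in the last point, assuming the language is validated against a set of canonical codes. CANONICAL_LANGS, LANG_ALIASES, and resolve_lang are hypothetical names with made-up contents, not the actual model server code.

```python
# Hypothetical subset; the real canonical_wikis list is not shown in this thread.
CANONICAL_LANGS = {"en", "de", "fr"}

# "test" is not canonical, so it is mapped to "en" before validation.
LANG_ALIASES = {"test": "en"}


def resolve_lang(lang):
    """Return the language the model should use, applying aliases first."""
    effective = LANG_ALIASES.get(lang, lang)
    if effective not in CANONICAL_LANGS:
        raise ValueError(f"Unsupported wiki language: {lang}")
    return effective


resolved = resolve_lang("test")
```

Requests for testwiki are then scored as if they were enwiki revisions, which is why the predictions are not meaningful.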
Jan 6 2026, 11:20 AM · Automoderator, Moderator-Tools-Team, Machine-Learning-Team, ORES
gkyziridis edited P86768 Testing locally revertrisk base model treating testwiki' as 'enwiki' on the fly..
Jan 6 2026, 10:58 AM · Machine-Learning-Team, ml-model-requests
gkyziridis created P86768 Testing locally revertrisk base model treating testwiki' as 'enwiki' on the fly..
Jan 6 2026, 10:57 AM · Machine-Learning-Team, ml-model-requests
gkyziridis moved T407155: [SPIKE] Define process for validating Tone Check model eval data for languages staff members do not speak from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Jan 6 2026, 9:40 AM · Machine-Learning-Team, EditCheck, VisualEditor

Dec 18 2025

gkyziridis moved T411786: ORES is not working on testwiki from 2025-2026 Q2 Done to In Progress on the Machine-Learning-Team board.
Dec 18 2025, 10:21 AM · Automoderator, Moderator-Tools-Team, Machine-Learning-Team, ORES
gkyziridis moved T411786: ORES is not working on testwiki from Unsorted to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Dec 18 2025, 10:21 AM · Automoderator, Moderator-Tools-Team, Machine-Learning-Team, ORES