We found out about this problem in T346175
The current symptoms are:
- Pods show up in kubectl with a container readiness count of 2/3.
- Example: eswiki-damaging-predictor-default-00011-deployment-754dcd86j6kb 2/3 Running 0 2d22h
- The queue-proxy container seems to fail its health checks, logging errors like: aggressive probe error (failed 72 times): dial tcp 127.0.0.1:8080: i/o timeout
- The kserve container doesn't log anything suspicious.
- Changeprop and other clients hang when calling the pod, which blackholes the traffic; they eventually time out (client side) and give up.
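The queue-proxy "aggressive probe" in the log line above is essentially a repeated TCP dial against the user container's port with a short timeout, failing when the dial doesn't complete in time. A minimal sketch of that kind of check (this is an illustration, not KServe/Knative code; the function names and parameters are invented):

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 0.1) -> bool:
    """Attempt a single TCP dial; True if the port accepts a
    connection within `timeout`, False on timeout or refusal."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers both i/o timeout and connection refused
        return False


def probe_until_ready(host: str, port: int, attempts: int = 5) -> int:
    """Dial repeatedly, like the aggressive probe does; return the
    number of failed attempts before the port became reachable
    (equal to `attempts` if it never did)."""
    failures = 0
    for _ in range(attempts):
        if tcp_probe(host, port):
            break
        failures += 1
    return failures
```

Against a pod in the state described above, a probe loop like this would keep failing on 127.0.0.1:8080 even though the kserve container appears healthy, which matches the "failed 72 times" counter in the log.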
We don't know exactly why this is happening; it may be related to the autoscaling settings, since those are what we changed most recently.
Since we are missing many kserve logs (due to a bug in 0.11), we should prioritize upgrading our Docker images to 0.11 so that we get all the logs we should be getting, hopefully giving us a clue about what's happening.