
ORES command line service sometimes hangs
Closed, Resolved, Public

Description

The ORES command line utility seems to hang fairly often for me.

This occurs when I enter the following command on a machine running Ubuntu 14.04.5:

cat <input json lines file> | \
	ores score_revisions https://ores.wikimedia.org wikidatawiki itemquality --verbose \
	> <output json lines file>

Below is the traceback after I hit ctrl-c. I'm happy to provide any additional details as needed!

File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 108, in _score_request
    doc = response.json()
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/models.py", line 866, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.4/json/decoder.py", line 361, in raw_decode
    raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:



Traceback (most recent call last):
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 92, in _score
    for score in future.result():
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 402, in result
    return self.__get_result()
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 354, in __get_result
    raise self._exception
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 110, in _score_request
    raise RuntimeError("Non-json response: " + response.text[:100])
RuntimeError: Non-json response: <!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>Wikimedia Error</title>
<style>
* { margi

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/scratch2/wmf/scripts/venv/bin/ores", line 11, in <module>
    sys.exit(main())
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/ores.py", line 55, in main
    module.main(sys.argv[2:])
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/utilities/score_revisions.py", line 63, in main
    run(ores_host, context, model_names, input, output, verbose)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/utilities/score_revisions.py", line 73, in run
    for rev_doc, score_doc in zip(rev_docs, scores):
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 93, in _score
    yield score
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 574, in __exit__
    self.shutdown(wait=True)
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 131, in shutdown
    t.join()
  File "/usr/lib/python3.4/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.4/threading.py", line 1076, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 38, in _python_exit
    t.join()
  File "/usr/lib/python3.4/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.4/threading.py", line 1076, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
^C

Event Timeline

Restricted Application added a subscriber: Aklapper.

I talked to @Hall1467 about this issue and advised him to update his ORES utility and try again. We'll see how that goes.

I updated the ORES utility to 1.2.0 and reran the command. It ran for a little while and then hung for 20 minutes. I then hit Ctrl-C and received the following traceback:

    conn.connect()
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fa9be36aa58>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 649, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 376, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='ores.wikimedia.org', port=443): Max retries exceeded with url: /v3/scores/wikidatawiki/?models=itemquality&revids=116033522%7C84779512%7C115826343%7C61614490%7C108087554%7C92575545%7C121117530%7C96717679%7C68377038%7C66468458%7C83633304%7C94897235%7C105329175%7C48736084%7C66862220%7C110225233%7C71763593%7C97133047%7C99902559%7C45875366%7C55438714%7C99648371%7C94293481%7C69584179%7C35819965%7C76135829%7C112304471%7C54261450%7C59348994%7C79767221%7C55746663%7C90097592%7C33882301%7C103227383%7C122766730%7C55769681%7C86935231%7C63592040%7C97731734%7C75524422%7C84230328%7C80597049%7C114623351%7C38468994%7C102109440%7C77376903%7C113820765%7C79102363%7C86176123%7C36478344 (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fa9be36aa58>: Failed to establish a new connection: [Errno 110] Connection timed out',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 92, in _score
    for score in future.result():
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 402, in result
    return self.__get_result()
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 354, in __get_result
    raise self._exception
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 106, in _score_request
    verify=True, stream=True)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='ores.wikimedia.org', port=443): Max retries exceeded with url: /v3/scores/wikidatawiki/?models=itemquality&revids=116033522%7C84779512%7C115826343%7C61614490%7C108087554%7C92575545%7C121117530%7C96717679%7C68377038%7C66468458%7C83633304%7C94897235%7C105329175%7C48736084%7C66862220%7C110225233%7C71763593%7C97133047%7C99902559%7C45875366%7C55438714%7C99648371%7C94293481%7C69584179%7C35819965%7C76135829%7C112304471%7C54261450%7C59348994%7C79767221%7C55746663%7C90097592%7C33882301%7C103227383%7C122766730%7C55769681%7C86935231%7C63592040%7C97731734%7C75524422%7C84230328%7C80597049%7C114623351%7C38468994%7C102109440%7C77376903%7C113820765%7C79102363%7C86176123%7C36478344 (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fa9be36aa58>: Failed to establish a new connection: [Errno 110] Connection timed out',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/scratch2/wmf/scripts/venv/bin/ores", line 11, in <module>
    sys.exit(main())
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/ores.py", line 57, in main
    module.main(sys.argv[2:])
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/utilities/score_revisions.py", line 71, in main
    run(ores_host, context, model_names, input, output, verbose)
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/utilities/score_revisions.py", line 81, in run
    for rev_doc, score_doc in zip(rev_docs, scores):
  File "/export/scratch2/wmf/scripts/venv/lib/python3.4/site-packages/ores/api.py", line 93, in _score
    yield score
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 574, in __exit__
    self.shutdown(wait=True)
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 131, in shutdown
    t.join()
  File "/usr/lib/python3.4/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.4/threading.py", line 1076, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

@Ladsgroup, I think this might be related to poolcounter. Can you remind me of the constraints that will result in 429 responses?

I plan to take a pass through the code and do a sample run to see if I can get some 429 responses too with default settings.

When running the score_revisions utility with 2 workers (max of 2 parallel requests) I still get some 429 responses from ORES. Maybe poolcounter doesn't release the lock as fast as I can start a new connection. Hmm.

This is a basic hammering device I built:

hammertime.py
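import random
import sys
import threading
import time

import requests

URL = 'https://ores.wikimedia.org/v3/scores/enwiki/{0}/drafttopic'


def thread():
    """Request 100 consecutive revisions and print the mean response time."""
    response_times = []
    # Each thread starts from its own random revision ID offset.
    j = 8179679 + int(20000 * random.random())
    for i in range(100):
        start = time.time()
        r = requests.get(URL.format(i + 1000 + j))
        response_times.append(time.time() - start)
        try:
            print(r.json()['enwiki']['scores'].keys())
        except (KeyError, TypeError, ValueError):
            # Error document or non-JSON body: show the status and the
            # start of the body instead of failing on a second r.json().
            print(r.status_code, r.text[:100])
    print(sum(response_times) / len(response_times))


# The number of parallel threads is the first command-line argument.
for i in range(int(sys.argv[1])):
    threading.Thread(target=thread).start()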

If you run it with 6 threads:

python hammertime.py 6

it doesn't produce any errors. With 15 threads, some requests (but not most) start to return errors.
I suggest first making sure that you really have only two parallel connections.
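
One way to check the "really two parallel connections" assumption on the client side is to count in-flight calls directly. The following is a minimal sketch, not part of the ORES client: ConcurrencyGauge, measure_peak and fake_request are hypothetical names, and fake_request stands in for a real HTTP call. It only measures how many calls the client has in flight at once, which is not necessarily what the server or poolcounter sees.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyGauge:
    """Tracks how many calls are in flight and the highest count seen."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.current -= 1
        return False  # never swallow exceptions


def measure_peak(task, n_tasks, max_workers):
    """Run task() n_tasks times on a pool and return the peak parallelism."""
    gauge = ConcurrencyGauge()

    def wrapped():
        with gauge:
            return task()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(wrapped) for _ in range(n_tasks)]
        for future in futures:
            future.result()
    return gauge.peak


def fake_request():
    # Stand-in for a real HTTP call; replace with the actual request.
    time.sleep(0.02)


peak = measure_peak(fake_request, n_tasks=10, max_workers=2)
print("peak parallel calls:", peak)
```

If the reported peak is higher than the configured number of workers, something other than the pool is issuing requests.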

So what exactly are the constraints we should expect on ores.wikimedia.org?

With fewer than 5 parallel connections, everything should be fine. Between 5 and 7, you should get slower responses (the average response time should increase, but you should not get 429s). Above 7, you should get 429s for the extra connections, but since poolcounter releases its locks really fast, you basically need to go as high as around 12 to actually see 429s.
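
For a client that does run into 429s, a common approach is to retry with exponential backoff. This is a minimal sketch under stated assumptions, not how score_revisions handles retries: fetch_with_backoff is a hypothetical helper, and fetch is any callable returning a (status_code, body) pair, so the logic can be exercised without touching ores.wikimedia.org.

```python
import time


def fetch_with_backoff(fetch, retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429.

    Waits base_delay * 2**attempt seconds between attempts and raises
    RuntimeError once all `retries` attempts have been rejected.
    """
    for attempt in range(retries):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still rate limited after %d attempts" % retries)


# Usage with a fake server that rejects the first two calls.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing sleep as a parameter keeps the helper testable; in real use the default time.sleep applies.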

I just cleaned some things up in my analysis and was able to confirm that I get no 429s with two parallel connections. So I'm going to close this task. I'll open another for some superficial improvements I made to score_revisions during my testing.

I was just working on a big batch job that got hung at score 300k. When I ^C'd the process, I got this:

Traceback (most recent call last):
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/ores-1.2.1-py3.5.egg/ores/api.py", line 96, in _score
    for score in future.result():
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/ores-1.2.1-py3.5.egg/ores/api.py", line 122, in _score_request
    raise RuntimeError(doc['error'])
RuntimeError: {'message': 'Traceback (most recent call last):\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 384, in _make_request\n    six.raise_from(e, None)\n  File "<string>", line 2, in raise_from\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 380, in _make_request\n    httplib_response = conn.getresponse()\n  File "/usr/lib/python3.5/http/client.py", line 1198, in getresponse\n    response.begin()\n  File "/usr/lib/python3.5/http/client.py", line 297, in begin\n    version, status, reason = self._read_status()\n  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status\n    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")\n  File "/usr/lib/python3.5/socket.py", line 576, in readinto\n    return self._sock.recv_into(b)\n  File "/usr/lib/python3.5/ssl.py", line 937, in recv_into\n    return self.read(nbytes, buffer)\n  File "/usr/lib/python3.5/ssl.py", line 799, in read\n    return self._sslobj.read(len, buffer)\n  File "/usr/lib/python3.5/ssl.py", line 583, in read\n    v = self._sslobj.read(len, buffer)\nsocket.timeout: The read operation timed out\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/adapters.py", line 449, in send\n    timeout=timeout\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 638, in urlopen\n    _stacktrace=sys.exc_info()[2])\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/util/retry.py", line 367, in increment\n    raise six.reraise(type(error), error, _stacktrace)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/packages/six.py", line 686, in reraise\n    raise value\n  File 
"/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 600, in urlopen\n    chunked=chunked)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 386, in _make_request\n    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout\n    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)\nurllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host=\'www.wikidata.org\', port=443): Read timed out. (read timeout=5.0)\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 101, in _request\n    auth=auth)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 524, in request\n    resp = self.send(prep, **send_kwargs)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 659, in send\n    history = [resp for resp in gen] if allow_redirects else []\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 659, in <listcomp>\n    history = [resp for resp in gen] if allow_redirects else []\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 238, in resolve_redirects\n    **adapter_kwargs\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 637, in send\n    r = adapter.send(request, **kwargs)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/adapters.py", line 529, in send\n    raise ReadTimeout(e, request=request)\nrequests.exceptions.ReadTimeout: HTTPSConnectionPool(host=\'www.wikidata.org\', port=443): Read timed out. 
(read timeout=5.0)\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "./ores/wsgi/routes/v3/util.py", line 101, in process_score_request\n    score_response = scoring_system.score(score_request)\n  File "./ores/scoring_systems/scoring_system.py", line 59, in score\n    response = self._score(request)\n  File "./ores/scoring_systems/celery_queue.py", line 192, in _score\n    return super()._score(*args, **kwargs)\n  File "./ores/scoring_systems/scoring_system.py", line 104, in _score\n    request, missing_model_set_revs)\n  File "./ores/scoring_systems/scoring_system.py", line 151, in _extract_root_caches\n    model_set, rev_ids, injection_caches=request.injection_caches)\n  File "./ores/scoring_context.py", line 173, in extract_root_dependency_caches\n    for rev_id, (error, _) in zip(rev_ids, error_root_vals):\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 123, in _extract_many\n    rev_docs = self.get_rev_doc_map(revids_to_lookup, rvprop=rvprop)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 230, in get_rev_doc_map\n    return {rd[\'revid\']: rd for rd in rev_docs}\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 230, in <dictcomp>\n    return {rd[\'revid\']: rd for rd in rev_docs}\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 241, in query_revisions_by_revids\n    **params)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 309, in get\n    continuation=continuation)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 171, in request\n    files=files)\n  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", 
line 103, in _request\n    raise TimeoutError(str(e)) from e\nmwapi.errors.TimeoutError: HTTPSConnectionPool(host=\'www.wikidata.org\', port=443): Read timed out. (read timeout=5.0)\n', 'code': 'internal server error'}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/home/halfak/venv/3.5/bin/ores", line 9, in <module>
    load_entry_point('ores==1.2.1', 'console_scripts', 'ores')()
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/ores-1.2.1-py3.5.egg/ores/ores.py", line 57, in main
    module.main(sys.argv[2:])
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/ores-1.2.1-py3.5.egg/ores/utilities/score_revisions.py", line 90, in main
    parallel_requests, retries, input, output, verbose)
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/ores-1.2.1-py3.5.egg/ores/utilities/score_revisions.py", line 103, in run
    for rev_doc, score_doc in zip(rev_docs, scores):
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/ores-1.2.1-py3.5.egg/ores/api.py", line 97, in _score
    yield score
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 581, in __exit__
    self.shutdown(wait=True)
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 139, in shutdown
    t.join()
  File "/usr/lib/python3.5/threading.py", line 1054, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.5/threading.py", line 1070, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

Here's a formatted version of the internal error that ORES produced:

Traceback (most recent call last):
  File "./ores/wsgi/routes/v3/util.py", line 101, in process_score_request
    score_response = scoring_system.score(score_request)
  File "./ores/scoring_systems/scoring_system.py", line 59, in score
    response = self._score(request)
  File "./ores/scoring_systems/celery_queue.py", line 192, in _score
    return super()._score(*args, **kwargs)
  File "./ores/scoring_systems/scoring_system.py", line 104, in _score
    request, missing_model_set_revs)
  File "./ores/scoring_systems/scoring_system.py", line 151, in _extract_root_caches
    model_set, rev_ids, injection_caches=request.injection_caches)
  File "./ores/scoring_context.py", line 173, in extract_root_dependency_caches
    for rev_id, (error, _) in zip(rev_ids, error_root_vals):
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 123, in _extract_many
    rev_docs = self.get_rev_doc_map(revids_to_lookup, rvprop=rvprop)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 230, in get_rev_doc_map
    return {rd['revid']: rd for rd in rev_docs}
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 230, in <dictcomp>
    return {rd['revid']: rd for rd in rev_docs}
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 241, in query_revisions_by_revids
    **params)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 309, in get
    continuation=continuation)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 171, in request
    files=files)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 103, in _request
    raise TimeoutError(str(e)) from e
mwapi.errors.TimeoutError: HTTPSConnectionPool(host='www.wikidata.org', port=443): Read timed out. (read timeout=5.0)

So it looks like we got a general error from ORES and while reading from the ThreadPool everything hung. We can tell by the elif lock.acquire(block, timeout) before the KeyboardInterrupt that the whole system was blocked on getting a return value of some sort. Why doesn't it continue in the case that an exception was thrown within ORES? Will need to investigate that.

https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future.result suggests that our "result()" call should raise an exception and that would cause the script itself to crash. But that doesn't seem to be happening. Instead, things just block for a long time.
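
One reading that is consistent with the tracebacks above (an assumption, not a confirmed diagnosis): result() does raise, but the exception has to leave the `with ThreadPoolExecutor` block first, and the executor's __exit__ calls shutdown(wait=True), which joins every worker thread. If another worker is still stuck in a request without a timeout, the exception cannot surface until that request returns. The sketch below reproduces that behavior with stand-in functions; failing_request, slow_request and run are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def failing_request():
    raise RuntimeError("simulated server error")


def slow_request(seconds):
    # Stand-in for a request that has no timeout and is slow to return.
    time.sleep(seconds)
    return "done"


def run(slow_seconds):
    """Return (exception raised by result(), seconds until it surfaced)."""
    start = time.monotonic()
    caught = None
    try:
        with ThreadPoolExecutor(max_workers=2) as pool:
            pool.submit(slow_request, slow_seconds)
            bad = pool.submit(failing_request)
            bad.result()  # re-raises the worker's RuntimeError
    except RuntimeError as e:
        # Only reached after __exit__ has called shutdown(wait=True),
        # that is, after slow_request has finished.
        caught = e
    return caught, time.monotonic() - start


caught, elapsed = run(0.3)
print(type(caught).__name__, "surfaced after", round(elapsed, 2), "seconds")
```

With a worker that never returns, the same structure would block indefinitely, which would match the lock.acquire frame under t.join() in the tracebacks.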

Ladsgroup raised the priority of this task from High to Needs Triage.
Ladsgroup moved this task from Unorganized to Maintenance/cleanup on the Machine-Learning-Team board.
Ladsgroup triaged this task as Medium priority. Dec 5 2018, 2:29 PM
awight added a subscriber: awight.

Hi, I'm adopting this bug because it seems related to a glitch we discovered today. A researcher using the ORES Python client with a parallelism of 20 is finding that most network connections are in the TIME_WAIT TCP state, with very few in the ESTABLISHED state. It's possible that we're not closing the socket properly.

I'll make sure we're using context managers to guarantee that resources are released regardless of exception or success.
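
As an illustration of that guarantee, here is a minimal sketch using contextlib.closing. FakeResponse is a hypothetical stand-in for a streamed HTTP response, and read_scores is an illustrative name rather than a function in the ORES client; the point is only that close() runs both when parsing succeeds and when it raises.

```python
import json
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in for a streamed HTTP response."""

    def __init__(self, body):
        self.body = body
        self.closed = False

    def json(self):
        return json.loads(self.body)

    def close(self):
        self.closed = True


def read_scores(response):
    """Parse the body, closing the response whether or not parsing works."""
    with closing(response):
        return response.json()


ok = FakeResponse('{"score": 1}')
parsed = read_scores(ok)

bad = FakeResponse("<!DOCTYPE html>")
try:
    read_scores(bad)
    failed = False
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    failed = True

print(parsed, ok.closed, failed, bad.closed)
```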

I think my last comment describes a different bug, so I'm splitting it into T213582.