Crawler unable to reach https://toolhub.toolforge.org/toolinfo.json from eqiad k8s cluster
Closed, Resolved | Public | Bug Report

Description

ERROR: Timeout connecting to https://toolhub.toolforge.org/toolinfo.json
Traceback (most recent call last):
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connection.py", line 177, in _new_conn
    % (self.host, self.timeout),
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fb389ddc5c0>, 'Connection to toolhub.toolforge.org timed out. (connect timeout=5)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='toolhub.toolforge.org', port=443): Max retries exceeded with url: /toolinfo.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fb389ddc5c0>, 'Connection to toolhub.toolforge.org timed out. (connect timeout=5)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/app/toolhub/apps/crawler/tasks.py", line 178, in fetch_content
    timeout=(5, 13),
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='toolhub.toolforge.org', port=443): Max retries exceeded with url: /toolinfo.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fb389ddc5c0>, 'Connection to toolhub.toolforge.org timed out. (connect timeout=5)'))
ERROR: Failed to fetch https://toolhub.toolforge.org/toolinfo.json: Connect Timeout

Event Timeline

The *.toolforge.org ingress is not behind the text-lb CDN edge, so this should be attempting to route through the url-downloader proxy. Having an environment with matching network restrictions to test things from (T290357: Maintenance environment needed for running one-off commands) would be helpful for working out what is really going wrong here.
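One way to make the proxy routing explicit instead of relying on ambient settings is to pass the proxy to the HTTP call directly. The following is only a sketch under assumptions: `build_fetch_kwargs` and `CRAWLER_TIMEOUT` are hypothetical names, and `http://proxy.example:8080` is a placeholder because the real url-downloader address is not given in this task. The `timeout` and `proxies` keyword arguments are the ones accepted by `requests.get()`, and the `(5, 13)` tuple is the connect and read timeout pair seen in the traceback.

```python
from typing import Dict, Optional, Tuple

# (connect timeout, read timeout) in seconds, matching timeout=(5, 13)
# in the traceback above.
CRAWLER_TIMEOUT: Tuple[int, int] = (5, 13)


def build_fetch_kwargs(proxy_url: Optional[str]) -> Dict[str, object]:
    """Build keyword arguments for requests.get() with an explicit proxy.

    proxy_url is a stand-in for the real url-downloader address, which
    is not stated in this task. Pass None to connect directly.
    """
    kwargs: Dict[str, object] = {"timeout": CRAWLER_TIMEOUT}
    if proxy_url:
        # requests expects a mapping of URL scheme to proxy URL.
        kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
    return kwargs


# Hypothetical usage (needs the third-party requests package and network access):
#   import requests
#   resp = requests.get(url, **build_fetch_kwargs("http://proxy.example:8080"))
```

Keeping the proxy in one helper would make it easier to test the crawler's settings without a network connection.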

bd808 triaged this task as High priority. Sep 28 2021, 10:15 PM
bd808 moved this task from Backlog to Research needed on the Toolhub board.

Change 724851 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] crawler: set explicit proxy configuration

https://gerrit.wikimedia.org/r/724851

bd808 changed the task status from Open to In Progress. Sep 29 2021, 9:53 PM
bd808 claimed this task.
bd808 moved this task from Research needed to In Progress on the Toolhub board.
bd808 changed the subtype of this task from "Task" to "Bug Report".

Change 724851 merged by jenkins-bot:

[wikimedia/toolhub@main] crawler: set explicit proxy configuration

https://gerrit.wikimedia.org/r/724851

Change 724859 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2021-09-29-223524-production

https://gerrit.wikimedia.org/r/724859

Change 724859 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2021-09-29-223524-production

https://gerrit.wikimedia.org/r/724859

Change 725060 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Do not force cronjob envvars to uppercase

https://gerrit.wikimedia.org/r/725060

Change 725060 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Do not force cronjob envvars to uppercase

https://gerrit.wikimedia.org/r/725060

Change 725180 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: set https_proxy envvar

https://gerrit.wikimedia.org/r/725180

Change 725181 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 021-10-01-024845-production

https://gerrit.wikimedia.org/r/725181

Change 725180 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: set https_proxy envvar

https://gerrit.wikimedia.org/r/725180

Change 725181 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 021-10-01-024845-production

https://gerrit.wikimedia.org/r/725181

Change 725376 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add "localhost" to no_proxy envvar

https://gerrit.wikimedia.org/r/725376

Change 725376 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add "localhost" to no_proxy envvar

https://gerrit.wikimedia.org/r/725376

Change 725384 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add envoy and mcrouter sidecars to cronjob

https://gerrit.wikimedia.org/r/725384

I now have the eqiad deployment configured to crawl 4 different URLs to get a better picture of what works and what fails.

Runs are still not completing, but I do at least keep getting slightly different errors as I continue to try and find the root problem.

Change 725384 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add envoy and mcrouter sidecars to cronjob

https://gerrit.wikimedia.org/r/725384

Change 725428 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Set CronJob's backoffLimit back to 1

https://gerrit.wikimedia.org/r/725428

Change 725428 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Set CronJob's backoffLimit back to 1

https://gerrit.wikimedia.org/r/725428

Change 725430 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Set concurrencyPolicy=Replace for CronJob

https://gerrit.wikimedia.org/r/725430

Change 725430 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Set concurrencyPolicy=Replace for CronJob

https://gerrit.wikimedia.org/r/725430

Lots of things were wrong from the start of this task. We needed to set https_proxy in the environment, add the envoy and mcrouter sidecars, add 'localhost' to the no_proxy exception list, and tell Kubernetes that it was OK to replace the prior job's pod with a new one when the schedule trips (a workaround for sidecars not knowing to terminate with the main container).
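To illustrate how https_proxy and the no_proxy exception list interact, here is a minimal sketch using only the Python standard library. It is not the crawler's code: `proxy_for` and `example_env` are hypothetical names, the proxy address is a placeholder, and the requests library applies its own (similar, but not identical) resolution logic. The idea it shows is that a host listed in no_proxy, such as localhost, is reached directly, while other HTTPS hosts go through the configured proxy.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit
from urllib.request import proxy_bypass_environment


def proxy_for(url: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the proxy URL that applies to url, or None for a direct connection."""
    # Turn names like "https_proxy" and "no_proxy" into {"https": ..., "no": ...},
    # the dictionary shape the standard library uses for proxy settings.
    proxies = {
        name.lower()[:-6]: value
        for name, value in env.items()
        if name.lower().endswith("_proxy") and value
    }
    parts = urlsplit(url)
    # Hosts listed in no_proxy bypass the proxy entirely.
    if proxy_bypass_environment(parts.netloc, proxies):
        return None
    # Otherwise use the proxy registered for this URL scheme, if any.
    return proxies.get(parts.scheme)


# Placeholder values; the real proxy address is not given in this task.
example_env = {
    "https_proxy": "http://proxy.example:8080",
    "no_proxy": "localhost,127.0.0.1",
}
```

With `example_env`, `proxy_for("https://toolhub.toolforge.org/toolinfo.json", example_env)` resolves to the placeholder proxy, while a URL on localhost resolves to `None`, meaning a direct connection.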