
Wikibugs' gitlab connector stops working without a strong sign of why
Closed, Resolved · Public · BUG REPORT

Description

Around 2024-05-08 16:30Z I am looking at the logs for the gitlab component and do not see any activity since 2024-05-07 20:49:15Z. The service was previously restarted at 2024-05-07 17:36Z with a message of "Nothing logged since 2024-05-06T02:01:08Z". The event loop reading from gitlab-webhooks' SSE endpoint can apparently lock up without crashing and triggering a restart?

Log messages showing unhandled read errors that did trigger a restart have also been captured multiple times:

2024-05-06T00:22:16Z aiorun ERROR: Unhandled exception; stopping loop: 'Task exception was never retrieved'
Traceback (most recent call last):
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/client_proto.py", line 83, in connection_lost
    uncompleted = self._parser.feed_eof()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp/_http_parser.pyx", line 507, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
  Not enough data for satisfy transfer length header.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiorun.py", line 219, in new_coro
    await coro
  File "/workspace/src/wikibugs2/gitlab.py", line 48, in async_main
    async for event in reader:
  File "/workspace/src/wikibugs2/events.py", line 111, in __anext__
    async for bline in self._sse_data_feed.content:
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 44, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 307, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 341, in readuntil
    await self._wait("readuntil")
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 302, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
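
One way to surface the silent-hang case, assuming the SSE reader keeps using aiohttp (as the traceback shows), is to give the request a per-read socket timeout so a stalled stream raises instead of blocking the async for forever. A rough sketch with an illustrative value:

import aiohttp

# sock_read bounds the silence between received chunks, while total=None leaves
# the long-lived SSE response itself open. 120 seconds is an illustrative value.
SSE_TIMEOUT = aiohttp.ClientTimeout(total=None, sock_read=120)

# Passing this to the ClientSession (or to the individual GET) means a silently
# stalled stream raises a timeout error at the pending read instead of hanging
# forever, so aiorun stops the loop and the job gets restarted as usual.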

Details

Title: gitlab and concurrency fixes
Reference: toolforge-repos/wikibugs2!37
Author: bd808
Source Branch: work/bd808/gitlab
Dest Branch: main

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2024-05-08T16:45:19Z] <wmbot~bd808@tools-bastion-12> Restarted gitlab task. Nothing logged since 2024-05-07T20:49:15Z (T364490)

Possibly caused by https://sal.toolforge.org/log/Aq_tWI8BGiVuUzOd6v44 (GitLab rebuild/restart for upstream security release)?

Figuring out how to add a reasonable health check to the continuous job would be good while we also look for lockup causes that we can correct.
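
One low-tech option, sketched below with a hypothetical path and thresholds (not necessarily what the eventual fix does): touch a heartbeat file whenever a line is read from the SSE feed, and have the health check fail when the file goes stale.

import pathlib
import time

HEARTBEAT = pathlib.Path("/tmp/gitlab-task.heartbeat")  # hypothetical path

def record_heartbeat() -> None:
    # Call this each time a line (event or keepalive) is read from the SSE
    # feed, so a stalled stream shows up as a stale file.
    HEARTBEAT.touch()

def is_healthy(max_age: float = 300.0) -> bool:
    # Health check: fail if nothing has been read from the feed recently.
    try:
        return (time.time() - HEARTBEAT.stat().st_mtime) < max_age
    except FileNotFoundError:
        return False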

> Possibly caused by https://sal.toolforge.org/log/Aq_tWI8BGiVuUzOd6v44 (GitLab rebuild/restart for upstream security release)?

Restarts of gitlab.wikimedia.org should not have any way of directly influencing wikibugs.

I haven't updated the component diagram at https://www.mediawiki.org/wiki/Wikibugs#Components for the new gitlab component yet, but the connection between wikibugs and gitlab.wikimedia.org is doubly indirect. The wikibugs2 gitlab task connects to https://gitlab-webhooks.toolforge.org/sse/ and then receives server-sent events telling it about activity in gitlab. Those SSE packets are propagating data posted to https://gitlab-webhooks.toolforge.org/hooks by gitlab.wikimedia.org.
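
For reference, the consumer side of that flow looks roughly like this sketch (illustrative only, not the actual wikibugs2 code): subscribe to the SSE endpoint and accumulate the event:/data: fields that gitlab-webhooks relays.

import asyncio
import json

import aiohttp

SSE_URL = "https://gitlab-webhooks.toolforge.org/sse/"

async def consume() -> None:
    async with aiohttp.ClientSession() as session:
        async with session.get(SSE_URL, headers={"Accept": "text/event-stream"}) as resp:
            event, data = None, []
            async for raw in resp.content:
                line = raw.decode("utf-8").rstrip("\r\n")
                if line.startswith(":"):      # server comment, e.g. ": Initial connect ping"
                    continue
                if line.startswith("event:"):
                    event = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    data.append(line[len("data:"):].strip())
                elif line == "":              # a blank line terminates one SSE message
                    if data:
                        print(event, json.loads("\n".join(data)))
                    event, data = None, []

asyncio.run(consume())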

Mentioned in SAL (#wikimedia-cloud) [2024-05-09T15:17:05Z] <wmbot~bd808@tools-bastion-12> Restarted gitlab task. Nothing logged since 2024-05-09T06:37:49Z (T364490)

bd808 changed the task status from Open to In Progress. May 9 2024, 7:48 PM
bd808 claimed this task.
bd808 triaged this task as High priority.
bd808 moved this task from Backlog to Doing on the Wikibugs board.

Mentioned in SAL (#wikimedia-cloud) [2024-05-10T17:22:34Z] <wmbot~bd808@tools-bastion-12> Built new image from git hash 35ad598f (T364490).

Mentioned in SAL (#wikimedia-cloud) [2024-05-10T17:25:12Z] <wmbot~bd808@tools-bastion-12> Restarted gitlab job to enable health check (T364490)

Let's see what happens with the health-check band-aid in place for a while.

> Let's see what happens with the health-check band-aid in place for a while.

$ kubectl get po | grep -E 'NAME|gitlab'
NAME                        READY   STATUS    RESTARTS         AGE
gitlab-5c579bb7bc-xj54k     1/1     Running   28 (5h20m ago)   2d20h

The process is seeing a lot of restarts, but has not hung again yet. Based on log grepping, I think that 19 of these restarts were in response to short reads from the SSE server. I guess that would mean the other 9 were health-check failures? I'm not having much luck yet finding any reporting on the failed checks. This might be related to our Kubernetes version: it looks like some non-trivial work was done on event reporting in v1.29, and Toolforge is currently running v1.24.17.

It looks like the gitlab-webhooks server sees connections closed, at least sometimes, at the same moment the wikibugs client sees a premature end of response:

wikibugs logs
2024-05-13T23:51:26Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
Traceback (most recent call last):
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/client_proto.py", line 83, in connection_lost
    uncompleted = self._parser.feed_eof()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp/_http_parser.pyx", line 507, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
  Not enough data for satisfy transfer length header.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/src/wikibugs2/gitlab.py", line 109, in async_main
    async for event in reader:
  File "/workspace/src/wikibugs2/events.py", line 111, in __anext__
    async for bline in self._sse_data_feed.content:
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 44, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 307, in readline
    return await self.readuntil()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 341, in readuntil
    await self._wait("readuntil")
  File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/aiohttp/streams.py", line 302, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
gitlab-webhooks logs
2024-05-13T23:51:26Z asyncio DEBUG: <_SelectorSocketTransport fd=11 read=polling write=<idle, bufsize=0>> received EOF
2024-05-13T23:51:26Z asyncio DEBUG: <_SelectorSocketTransport fd=12 read=polling write=<idle, bufsize=0>> received EOF
2024-05-13T23:51:26Z glwebhooks.sinks.sse INFO: Disconnect from 8778807158837
2024-05-13T23:51:26Z glwebhooks.sinks.sse INFO: Disconnect from 8778806535719

The 2 clients that gitlab-webhooks decided had disconnected are most likely wikibugs and wikibugs-testing.

This makes it look to me like something sitting between the client and server is closing the connection. The data flow looks something like gitlab-webhooks → k8s egress nginx → toolforge nginx → wikibugs. It is very possible that either of those intermediate nginx instances could be terminating the connection.

$ kubectl logs gitlab-webhooks-5b4d9cddd6-bb2z8 | grep "Disconnect from"
2024-05-13T20:03:18Z glwebhooks.sinks.sse INFO: Disconnect from 8778805946219
2024-05-13T20:11:11Z glwebhooks.sinks.sse INFO: Disconnect from 8778805945646
2024-05-13T20:18:16Z glwebhooks.sinks.sse INFO: Disconnect from 8778806536064
2024-05-13T20:19:15Z glwebhooks.sinks.sse INFO: Disconnect from 8778805949877
2024-05-13T20:24:27Z glwebhooks.sinks.sse INFO: Disconnect from 8778806529176
2024-05-13T20:27:19Z glwebhooks.sinks.sse INFO: Disconnect from 8778805945784
2024-05-13T20:27:19Z glwebhooks.sinks.sse INFO: Disconnect from 8778805947204
2024-05-13T20:33:52Z glwebhooks.sinks.sse INFO: Disconnect from 8778805940865
2024-05-13T20:33:52Z glwebhooks.sinks.sse INFO: Disconnect from 8778805945610
2024-05-13T20:41:03Z glwebhooks.sinks.sse INFO: Disconnect from 8778805973258
2024-05-13T20:41:03Z glwebhooks.sinks.sse INFO: Disconnect from 8778805973144
2024-05-13T20:48:49Z glwebhooks.sinks.sse INFO: Disconnect from 8778805968405
2024-05-13T20:48:49Z glwebhooks.sinks.sse INFO: Disconnect from 8778807153813
2024-05-13T20:51:46Z glwebhooks.sinks.sse INFO: Disconnect from 8778805940679
2024-05-13T20:54:49Z glwebhooks.sinks.sse INFO: Disconnect from 8778805940889
2024-05-13T20:54:50Z glwebhooks.sinks.sse INFO: Disconnect from 8778805973852
2024-05-13T21:07:25Z glwebhooks.sinks.sse INFO: Disconnect from 8778805973474
2024-05-13T21:15:01Z glwebhooks.sinks.sse INFO: Disconnect from 8778807154653
2024-05-13T21:15:02Z glwebhooks.sinks.sse INFO: Disconnect from 8778805940775
2024-05-13T23:15:57Z glwebhooks.sinks.sse INFO: Disconnect from 8778805945913
2024-05-13T23:21:06Z glwebhooks.sinks.sse INFO: Disconnect from 8778807159095
2024-05-13T23:21:47Z glwebhooks.sinks.sse INFO: Disconnect from 8778807154164
2024-05-13T23:51:26Z glwebhooks.sinks.sse INFO: Disconnect from 8778807158837
2024-05-13T23:51:26Z glwebhooks.sinks.sse INFO: Disconnect from 8778806535719
$ grep ERROR logs/gitlab.log.2024-05-13
2024-05-12T00:32:02Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-12T01:07:51Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-12T01:37:39Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-12T02:13:31Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-12T11:54:23Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-12T23:59:54Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T00:06:23Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T11:59:51Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T12:42:30Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T20:27:19Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T20:33:52Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T20:41:03Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T20:48:49Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T20:54:50Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T21:15:02Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
2024-05-13T23:51:26Z wikibugs2.gitlab ERROR: Client failure reading from SSE service
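
If an intermediary is the culprit, a client-side mitigation (sketched below; not necessarily what !37 does) is to treat a premature end of the SSE response as routine and reconnect with a short backoff rather than letting the exception stop the event loop.

import asyncio

import aiohttp

async def consume_forever(url: str, backoff: float = 5.0) -> None:
    while True:
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url, headers={"Accept": "text/event-stream"}) as resp:
                    async for line in resp.content:
                        ...  # process the SSE line as usual
        except (aiohttp.ClientPayloadError, aiohttp.ClientConnectionError) as exc:
            # Intermediate proxies can close the long-lived response; note it and retry.
            print(f"SSE connection dropped ({exc!r}); reconnecting in {backoff}s")
            await asyncio.sleep(backoff)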

I wonder what would happen if, instead of connecting via the Toolforge proxy and ingress nginx, wikibugs reached across namespaces to connect directly to the gitlab-webhooks service?

I have no name!@shell-1715647529:~$ curl -v --no-buffer -H 'Accept: text/event-stream' 'http://gitlab-webhooks.tool-gitlab-webhooks.svc.tools.local:8000/sse/'
*   Trying 10.99.122.84:8000...
* Connected to gitlab-webhooks.tool-gitlab-webhooks.svc.tools.local (10.99.122.84) port 8000 (#0)
> GET /sse/ HTTP/1.1
> Host: gitlab-webhooks.tool-gitlab-webhooks.svc.tools.local:8000
> User-Agent: curl/7.81.0
> Accept: text/event-stream
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200
< content-type: text/event-stream
< cache-control: no-cache
< connection: keep-alive
< x-accel-buffering: no
< date: Tue, 14 May 2024 00:54:28 GMT
< Transfer-Encoding: chunked
<
: Initial connect ping
event: keepalive
data: {"keepalive": 8318002.496348747}

Mentioned in SAL (#wikimedia-cloud) [2024-05-14T16:46:58Z] <wmbot~bd808@tools-bastion-12> Configure GITLAB_EVENTS_URL to point to http://gitlab-webhooks.tool-gitlab-webhooks.svc.tools.local:8000/sse/ as a test of bypassing k8s ingress (T364490)

Mentioned in SAL (#wikimedia-cloud) [2024-05-14T17:00:42Z] <wmbot~bd808@tools-bastion-12> Configure GITLAB_EVENTS_URL to point to http://gitlab-webhooks.tool-gitlab-webhooks.svc.tools.local:8000/sse/ to bypass k8s ingress and Toolforge front proxy. (T364490)

No connection losses triggering restarts and no mystery hangs tripping the health check since cutting the Kubernetes ingress layer out of the connection between wikibugs and gitlab-webhooks with T364490#9796259.

$ kubectl get po | grep -E 'NAME|gitlab'
NAME                        READY   STATUS    RESTARTS      AGE
gitlab-5c579bb7bc-q56g5     1/1     Running   0             19h

This project has been a real tutorial on weak points in the Toolforge networking layer recently.

$ kubectl get po | grep -E 'NAME|gitlab'
NAME                        READY   STATUS    RESTARTS      AGE
gitlab-5c579bb7bc-q56g5     1/1     Running   0             39h
$ zgrep ERROR -c logs/gitlab.log* | sort -h
logs/gitlab.log:0
logs/gitlab.log.2024-05-08:2
logs/gitlab.log.2024-05-09:1
logs/gitlab.log.2024-05-10:5
logs/gitlab.log.2024-05-11:16
logs/gitlab.log.2024-05-13:16
logs/gitlab.log.2024-05-14:6
logs/gitlab.log.2024-05-15:0

I am going to call this resolved, but will reopen if things regress. It looks like the issue was related to Kubernetes ingress reload/restart cycles breaking the long-running server-sent events response, both in the case of the mysterious hanging client (likely the TCP connection was not closed, but the layer 7 content stream stopped working) and in the case of the payload error (likely the TCP connection was closed prematurely).

I think this can be classified as a noisy neighbors problem with Toolforge webservices generally. The Toolforge Kubernetes ingress is built on a limited number (3 today) of nginx instances grouped behind a load balancer. Each nginx instance can handle an inbound request for any active Ingress/Service. To do this, each nginx reloads its config whenever an Ingress object is added to or removed from a tool namespace. These reloads are done in a very optimized and graceful way, but it seems there is an eventual limit on how long a worker on the old config is allowed to stay alive.

It is possible that the limit is actually the OOM killer eventually reaping the entire nginx process tree. At the moment, each of the 3 nginx-ingress Pods in https://k8s-status.toolforge.org/namespaces/ingress-nginx-gen2/ has been restarted following an OOM kill at least once. That frequency, however, doesn't align with the prior failure rate for the wikibugs<->gitlab-webhooks connection.

$ kubectl sudo -n ingress-nginx-gen2 get po
NAME                                             READY   STATUS    RESTARTS        AGE
ingress-nginx-gen2-controller-5476b745d7-85nd8   1/1     Running   5 (2d20h ago)   59d
ingress-nginx-gen2-controller-5476b745d7-fx8cc   1/1     Running   8 (12d ago)     59d
ingress-nginx-gen2-controller-5476b745d7-qt7l5   1/1     Running   6 (2d20h ago)   59d