
intake-analytics is responsible for up to 85% of varnish backend fetch errors
Closed, Resolved (Public)

Authored By: Vgutierrez, Apr 14 2022, 12:16 PM

Description

Taking a look at varnish fetch errors by Host header, I've noticed that intake-analytics.wikimedia.org accounts for up to 85% of the errors (https://logstash.wikimedia.org/goto/cb80118b55842dee3d17be39d4a82275):

image.png (1×1 px, 73 KB)

Considering the eventgate dashboard data (https://grafana.wikimedia.org/goto/_R3QsPUnz?orgId=1), we're talking about 3k-4k POST requests per second, so from the service perspective it isn't a big percentage of traffic. I'm wondering whether we should try to decrease this figure, or just acknowledge it as an expected value and live with it.

Event Timeline

There are a very large number of changes, so older changes are hidden.

My suspicion is that these workers need more CPU and/or memory.
We recently doubled the number of replica pods for the eventgate-analytics-external service from 6 to 12 on March 4th.
This did reduce the overall latency a little, but not by very much. link

image.png (883×1 px, 399 KB)

Looking at the pod resources, we can see that the memory is much nearer to the limit than the CPU.

image.png (704×1 px, 169 KB)

So I wonder if we should increase the amount of RAM available to each eventgate-analytics-external pod to see if it makes a difference to this latency and to the overall error rate.

If it doesn't, then perhaps something within eventgate is the cause. Perhaps it is the call out to the stream_config_uri that is taking the extra time, or something similar.

Change 785151 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the RAM request and limit for eventgate pods

https://gerrit.wikimedia.org/r/785151

Change 785151 merged by jenkins-bot:

[operations/deployment-charts@master] Increase the RAM request and limit for eventgate pods

https://gerrit.wikimedia.org/r/785151

I have increased the amount of RAM available to the eventgate-analytics-external deployment from 300M to 400M.
I'll now monitor to see if it improves the error rate.

The RAM upgrade has not resulted in any improvement.

image.png (471×1 px, 121 KB)

I am now investigating by capturing network traffic from the eventgate-analytics-external pods and looking for 503 responses in Wireshark.

Method of capture (consolidated into a script sketch after the list):

  • ssh to deploy1002.eqiad.wmnet
  • kube_env eventgate-analytics-external eqiad
  • kubectl get pods -o wide to find out where the individual pods are running
  • ssh to a worker node where a pod is running, identified in the previous step
  • sudo docker ps|grep eventgate-analytics-external|grep POD to obtain the container ID
  • sudo docker top $container_id to obtain the pid of the running process
  • sudo nsenter -n -t $pid to enter the network namespace of the pod
  • sudo tcpdump -i any -w eventgate-analytics-external-kubernetes10xx.pcap to write the pcap file
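For reference, here are the same steps consolidated into a rough shell sketch. This is only illustrative: it assumes a single container matches the grep, that the container ID and PID sit in the first and second columns of the docker ps and docker top output respectively, and the pcap filename is a placeholder as above.

# On deploy1002.eqiad.wmnet: select the cluster and find where the pods are running
kube_env eventgate-analytics-external eqiad
kubectl get pods -o wide

# On the kubernetes worker node hosting the chosen pod:
container_id=$(sudo docker ps | grep eventgate-analytics-external | grep POD | awk '{print $1}')
pid=$(sudo docker top "$container_id" | awk 'NR==2 {print $2}')
# Run tcpdump inside the pod's network namespace
sudo nsenter -n -t "$pid" tcpdump -i any -w eventgate-analytics-external-kubernetes10xx.pcap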

Well, this is a bit confusing.

I've examined packet captures from two pods in eqiad and another in codfw. In each of these I can see the unencrypted traffic between the envoy tls-proxy container and the main eventgate node app.

I was searching for 503 responses generated by the app, which would match the 503 responses seen by varnish, but there weren't any.

I examined the logs from the app with: kubectl logs -f eventgate-analytics-external-production-6d665c7cdc-4l55h -c eventgate-analytics-external|grep "error/5" whilst at the same time capturing traffic from the container with tcpdump.

We see errors in the log like this fairly frequently.

btullis@marlin-wsl:/mnt/c/Users/Ben Tullis/tmp$ cat response.txt |jq .
{
  "name": "eventgate-analytics-external",
  "hostname": "eventgate-analytics-external-production-6d665c7cdc-4l55h",
  "pid": 141,
  "level": "ERROR",
  "message": "500: internal_error",
  "stack": "BadRequestError: request aborted\n    at IncomingMessage.onAborted (/srv/service/node_modules/raw-body/index.js:231:10)\n    at IncomingMessage.emit (events.js:189:13)\n    at abortIncoming (_http_server.js:449:9)\n    at socketOnClose (_http_server.js:442:3)\n    at Socket.emit (events.js:194:15)\n    at TCP._handle.close (net.js:600:12)",
  "status": 500,
  "type": "internal_error",
  "detail": "request aborted",
  "request_id": "78e5ba26-506e-4334-ba45-965d1466ca9f",
  "request": {
    "url": "/v1/events?hasty=true",
    "headers": {
      "content-length": "163",
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
      "content-type": "text/plain;charset=UTF-8",
      "x-request-id": "78e5ba26-506e-4334-ba45-965d1466ca9f"
    },
    "method": "POST",
    "params": {
      "0": "/v1/events"
    },
    "query": {
      "hasty": "true"
    },
    "remoteAddress": "127.0.0.1",
    "remotePort": 58320
  },
  "levelPath": "error/500",
  "msg": "500: internal_error",
  "time": "2022-04-29T16:35:02.605Z",
  "v": 0
}

This states that it's an error 500, but when we look at the network traffic from the same time we can see that a 400 Bad Request is actually sent to the client. The client in this case is the envoy tls-proxy container.

image.png (464×927 px, 48 KB)

The 400 error always occurs after a POST message with no payload and the TCP connection is torn down after this error with a RST flagged packet being sent by the client.

image.png (760×1 px, 128 KB)

My first thoughts are:

  • perhaps this is a client browser opening a connection but sending an empty POST body
  • perhaps it is an issue with the way that we are batching messages from varnish to the back-end, somehow ending a batch on the empty line between a POST header and its payload
  • why does the app think it's sending a 500, when in fact a 400 is sent over the wire?
  • where is that 400 turned into a 503 that varnish sees? Would this be in ats-be or in envoy-tls?

Continuing to investigate.

I have a few errors logged by ats-be attempting to connect to eventgate-analytics-external.discovery.wmnet - but fewer than I would have expected.

Take this host cp3050 as an example:

btullis@cp3050:/var/log/trafficserver$ tail -f error.log
20220429.06h09m07s CONNECT:[0] could not connect [CONNECTION_ERROR] to 10.2.2.32 for 'https://wdqs.discovery.wmnet/sparql'
20220429.06h14m36s CONNECT: could not connect to 10.2.2.52 for 'https://eventgate-analytics-external.discovery.wmnet:4692/v1/events?hasty=true' (setting last failure time) connect_result=5
20220429.07h37m13s CONNECT: could not connect to 10.2.2.1 for 'https://appservers-rw.discovery.wmnet/w/index.php?title=Spezial:Importieren&action=submit' (setting last failure time) connect_result=5
20220429.08h03m32s CONNECT: could not connect to 10.2.2.22 for 'https://api-rw.discovery.wmnet/w/api.php' (setting last failure time) connect_result=5
20220429.08h52m28s CONNECT: could not connect to 10.2.2.22 for 'https://api-rw.discovery.wmnet/w/api.php' (setting last failure time) connect_result=5
20220429.09h51m48s CONNECT: could not connect to 10.2.2.22 for 'https://api-rw.discovery.wmnet/w/api.php' (setting last failure time) connect_result=5
20220429.11h39m03s CONNECT: could not connect to 10.2.2.52 for 'https://eventgate-analytics-external.discovery.wmnet:4692/v1/events?hasty=true' (setting last failure time) connect_result=5
20220429.11h39m43s CONNECT: could not connect to 10.2.2.52 for 'https://eventgate-analytics-external.discovery.wmnet:4692/v1/events?hasty=true' (setting last failure time) connect_result=5
20220429.11h44m37s CONNECT: could not connect to 10.2.2.52 for 'https://eventgate-analytics-external.discovery.wmnet:4692/v1/events?hasty=true' (setting last failure time) connect_result=5

That shows 3 errors within 5 minutes, but nothing in the hours following that.

> perhaps this is a client browser opening a connection but sending an empty POST body

This seems likely, the endpoint is public so anyone can do anything with it :)

> why does the app think it's sending a 500, when in fact a 400 is sent over the wire?

Good q! EventGate uses service-template-node. Here's the request logging code. Just gave it a quick read, and there are several places in there where an HTTPError 500 might be constructed, but I can't say I follow it all. It would take some time to figure out, but I'd guess that somehow the 400 response is finished, but some error is thrown directly after, and the error handling is not being passed response info, so it just logs the error as a '500'.

> where is that 400 turned into a 503 that varnish sees? Would this be in ats-be or in envoy-tls?

A mystery indeed!

>> perhaps this is a client browser opening a connection but sending an empty POST body
>
> This seems likely, the endpoint is public so anyone can do anything with it :)

I agree that this seems likely, but on further inspection it turns out that we have a specific log message for an empty request body.
There are many more of the 'request aborted' log entries than there are these 'empty message body' entries.

image.png (982×1 px, 197 KB)

https://logstash.wikimedia.org/goto/3c01643d08222eff63ba16ff27743497

Within a specific 15 minute window there were 141 of these application logs about empty bodies:

image.png (502×1 px, 93 KB)

During the same window there were an almost identical number of 503 errors received by varnish (719) and 500 errors logged by the application (722):

image.png (573×1 px, 115 KB)

image.png (573×1 px, 120 KB)

This makes me think that the empty message body errors are being correctly handled by ats-be and varnish, whereas the 5xx errors are caused by something else.
Continuing to investigate.

Is it possible that the request aborted errors are actually requests being terminated mid-flight by the client? Client closes the connection before the server has a chance to respond?

> Is it possible that the request aborted errors are actually requests being terminated mid-flight by the client? Client closes the connection before the server has a chance to respond?

Yes, maybe that's it, but it seems like an unusually regular pattern to me. It's also difficult to pin down exactly what the client is and what the server is in this case, plus we have all of the keepalive timeouts to consider.

Let me walk through the connections, to make sure that I've got it straight in my head:

  1. Client browser (or app) opens a connection to intake-analytics.wikimedia.org - which is directed with gdns to a nearby caching proxy server.
    • That client browser does TLS negotiation with ats-fe
    • After TLS is complete, the client opens a connection to varnish-frontend and varnish accepts the POST request on behalf of the client
  2. Varnish sends the request on to traffic_manager (ats-be) running on port 3128 on that caching proxy server
  3. Traffic manager then opens a TLS connection to 10.2.2.52:4692, which goes through the low-traffic LVS master (lvs1019) and is allocated to a kubernetes worker - a connection may already exist and be reused.
    • Port 4692 is picked up by the envoy TLS proxy within each pod, so traffic manager is now the client as far as envoy is concerned
    • envoy receives the POST request and forwards it to port 8192 in the eventgate pod

So in total we make use of four HTTP connections for each incoming event, two of which are wrapped in TLS:

  1. remote client -> varnish (TLS)
  2. varnish -> ats-be
  3. ats-be -> envoy (TLS)
  4. envoy -> eventgate

So if a remote client drops the connection suddenly (as is normal) then varnish will know that the connection has been torn down, but the upstream components might not yet. Sound about right?

Something is causing the errors in eventgate, which from the logs look like 500 internal errors (but I haven't verified that with tcpdump).

These get back through envoy and are reported to varnish as 503 errors from ats-be.

I'm still trying to work out how this happens...

Currently I'm generating varnish logs with the following command on cp3056:

sudo varnishlog -a -A -w ~/varnishlog-cp3056.log -n frontend -q "RespStatus >= 500"

To be accurate, the remote client talks to HAProxy over a TLS connection and HAProxy forwards the traffic to varnish in clear text via a unix socket. You should also take into account that HTTP/2 is terminated on HAProxy, and everything else is HTTP/1.1 even if it's over a TLS socket (ats-be -> envoy).

If a client goes away suddenly I wouldn't expect a 5xx on the server side but a 400 (bad request) or something of the sort

It does seem that a 400 bad request is being sent to the client. I think that perhaps the 500 reported by eventgate in its logs is not correct. It would require grokking and tracing service-template-node's request error handling and logging code to be sure.

Thanks @Vgutierrez for the clarification on that. I hadn't picked up on the progress of the HAProxy migration, but that makes perfect sense.

One of the things that jumps out at me from the varnishlog extracts that I've been examining is this:

-   FetchError     Resource temporarily unavailable
-   FetchError     straight insufficient bytes
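One way to isolate just those transactions would be to filter on the FetchError tag, reusing the -q query mechanism from the varnishlog command above; a sketch (the exact regex is an assumption based on the two messages):

sudo varnishlog -n frontend -q 'FetchError ~ "Resource temporarily unavailable|straight insufficient bytes"'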

I wonder if perhaps eventgate could be sending a different amount of data from that mentioned in the Content-Length header.

> It does seem that a 400 bad request is being sent to the client. I think that perhaps the 500 reported by eventgate in its logs is not correct.

Yes, I agree. In all of the packet captures that I've seen so far I've yet to see a 500 or a 503 on the wire between eventgate and envoy, whereas I have seen 400s.

I had the beginnings of a theory, based on some reading around varnish, but now I don't think that it's valid. I'll write it down to see if people agree or if it triggers any other ideas.

It started when I came across this issue: https://github.com/varnishcache/varnish-cache/issues/2629
...which has some similarities to what we're seeing.

Firstly, it matches in that it's a POST request, and the FetchError entries from varnishlog match (Resource temporarily unavailable and straight insufficient bytes).

It also mentions: req.body read error: 11 (Resource temporarily unavailable) - backend write error: 11 (Resource temporarily unavailable)
...of which we see a lot in the logstash logs.

The first commenter on the issue says:

But reading the client request body and passing the bytes along to the backend request happens in a loop during fetch (unless the client request body was cached). So it's an error that happens during fetch, and hence gets logged as a FetchError. Since it's happening during the loop that writes to the backend, Varnish logs something about a backend write error, even if the problem has nothing to do with writing to the backend.

I've seen this many, many times, and believe it possible that it never, not once, had anything to do with the fetch or the backend. But rather always some kind of hangup or timeout or cancel on the part of the client (which of course can happen all the time).

So this would suggest that these errors could be caused by the client misbehaving and truncating the message body, but under these conditions it could appear to be a backend server error because it happens during the fetch loop.

I then found this issue: https://stackoverflow.com/q/53523836/3956824
...where similar issues were seen when attempting to use the message body when requesting a custom error page.

I then wondered if it could be similar for us, in that POST requests were either:

  1. being retried (without the original request body being cached)
  2. being used to generate custom error pages

Well, I'm pretty sure that neither of those is the case for us, so that blows my theory out of the water somewhat.

The next thing I plan to investigate is whether it's some kind of issue with compression of the responses causing inconsistent content-length values.

Hi, haven't deeply read or understood this issue (sorry!) but I wanted to point out T264021 as potentially related?

Thanks @CDanis - yes, that looks very likely. I also think that the latency ticket T294911: Apparent latency warning in 90th centile of eventgate-logging-external is probably related.

It seems to me that intake-logging and intake-analytics are affected by the same underlying issue(s) - although the throughput to intake-analytics is much greater, so the effects are more visible.

We had a similar issue with eventgate-main in T249745, but that setup is different, since it is not public-facing.
In that case we ended up increasing the client (mediawiki) timeout as well as enabling retrying of the POST requests via envoy.

I guess we could look at those options for the external interfaces, but it would be nice to be able to work out why the timeouts are occurring if we can.

Change 790289 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Increase the connect_timeout for eventgate based services

https://gerrit.wikimedia.org/r/790289

I suggest that we increase the connect_timeout value for the local_service cluster definition within envoy, for which I have created a patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/790289

I know at the moment eventgate doesn't use the shared tls_helpers.tpl that is available in the common templates: T303543: eventgate chart should use common_templates
...so it has a local copy of these templates and the change that I have proposed is only to the local copy.

If we wished to migrate to the common templates we would have to parameterize connect_timeout here, I believe.

For now, I'd be happy to modify the local copy of the tls_helpers in the eventgate chart and redeploy all eventgate apps with a 5s connect_timeout, to see if this eradicates the issue.

Ideally we would be able to fix eventgate so that new connections were always available in less than 1.0 seconds, but that might be challenging.

If that fixes it, at the very least it will give us some information about what is going on. Added some comments on the patch, but +1 in general to try.

There's some discussion ongoing in #wikimedia-traffic about how to troubleshoot this further, possibly using tools such as stap, tcpaccept, and tcpconnlat.
I'll continue with that work before trying out this increase in the connect_timeout.
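For reference, the bcc-based tools mentioned there are shipped in the bpfcc-tools package (see the SAL entry below) and carry a -bpfcc suffix on Debian. A minimal sketch of running them on a kubernetes worker, with default options only:

# Trace accepted TCP connections (server side) as they arrive
sudo tcpaccept-bpfcc
# Trace TCP active-connect latency (time from SYN to the handshake completing) per connection
sudo tcpconnlat-bpfcc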

Looking at https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPanel=37&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&editPanel=37 and temporarily adding a legend with max to the right:

image.png (695×2 px, 126 KB)

I notice that the things that take ~1s to be served are things that make zero sense to take 1s to serve, like /robots.txt or /_info. Those are extremely cheap endpoints (returning a small text file or a one-liner of JSON) IIRC. p99 for GET /_info is always at the highest (1s) bucket, regardless of the point in time in the last 1h; the pattern doesn't change much btw if we zoom out to 30 days. Note that these metrics are generated from eventgate itself; this high latency is at the app level and unrelated to any resources eventgate may need to reach out to or the proxy levels upstream of it. I can't help but feel it's somehow related to the errors discussed in this task.

Mentioned in SAL (#wikimedia-operations) [2022-05-10T11:05:44Z] <btullis> temporarily install bpfcc-tools on kubernetes1013 (T306181)

Well I'm not getting anywhere very fast with this.
I now understand from @akosiaris that the 1.0 second upper limit in the quantiles graphs comes from the fact that the sampling buckets have a 1.0 second maximum.

https://github.com/wikimedia/eventgate/blob/00a110630dfc853deb0da8bec6455fe1a2b0cb5d/lib/util.js#L117

That configuration produces prometheus metrics like this.

curl -s http://10.64.66.166:9102/metrics |grep express_router_request_duration_seconds_bucket |grep POST
express_router_request_duration_seconds_bucket{le="0.01",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
express_router_request_duration_seconds_bucket{le="0.05",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
express_router_request_duration_seconds_bucket{le="0.1",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
express_router_request_duration_seconds_bucket{le="0.3",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 194981515
express_router_request_duration_seconds_bucket{le="1",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 252402123
express_router_request_duration_seconds_bucket{le="+Inf",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"} 253557747

These are translated into the quantile values by k8s_rules when being scraped.

We have proposed creating a new bucket for requests that take up to 10 seconds but that hasn't happened yet.

In the meantime I could calculate the count of requests which take more than 1.0 second - but only by scraping the metrics myself and subtracting...
express_router_request_duration_seconds_bucket{le="1",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"}
...from:
express_router_request_duration_seconds_bucket{le="+Inf",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"}

I'd have to do this for every pod as well, so I'm not keen on doing that if I can help it.
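If it did come to that, here is a rough sketch of the manual approach. It assumes the same kube_env setup as for the packet captures, that the pod IP is the sixth column of kubectl get pods -o wide and that all pods are running, plus the :9102 metrics port from the curl above; note these buckets are cumulative counters since pod start, not a rate.

kube_env eventgate-analytics-external eqiad
for ip in $(kubectl get pods -o wide --no-headers | awk '{print $6}'); do
  curl -s "http://${ip}:9102/metrics" | awk -v ip="$ip" '
    /express_router_request_duration_seconds_bucket/ && /le="\+Inf"/ && /method="POST",status="202"/ { inf = $NF }
    /express_router_request_duration_seconds_bucket/ && /le="1",/    && /method="POST",status="202"/ { one = $NF }
    END { printf "%s: %d POST requests took longer than 1s\n", ip, inf - one }'
done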

I also now understand and agree with the statement:

these metrics are generated from eventgate itself, this high latency is at the app level and unrelated to any resources eventgate may need to reach out to or the proxy levels upstream of it

...so I've been delving into eventgate at the app level, profiling it with perf and creating flame charts - but I haven't been able to find a smoking gun in terms of a cause of the latency.

I identified the node process that was running eventgate-analytics-external, then ran:

sudo perf record -F99 -p 3424786 -g -- sleep 60

...which created a perf.data file. I could then generate the text version with: sudo perf script > perf.out

...Finally I used the flamegraph code from here: https://github.com/brendangregg/FlameGraph

...and ran ../src/FlameGraph/stackcollapse-perf.pl perf.out > perf.folded followed by ../src/FlameGraph/flamegraph.pl perf.folded > eventgate-analytics-external.svg
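Putting that together, the whole profiling workflow was (PID and paths as in the steps above):

# 60-second on-CPU profile of the eventgate node process
sudo perf record -F99 -p 3424786 -g -- sleep 60
sudo perf script > perf.out

# Fold the stacks and render the SVG with the FlameGraph tools
../src/FlameGraph/stackcollapse-perf.pl perf.out > perf.folded
../src/FlameGraph/flamegraph.pl perf.folded > eventgate-analytics-external.svg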

The resulting flame graph (eventgate-analytics-external.svg) is as shown.

Unfortunately, I'm at a bit of a loss to work out from that why some connections are slow.

I've done more analysis of packet captures from eventgate-analytics-external and I still can't find any evidence of high latency, at least in the pod where I was running the capture.
Here is a representative graph of round-trip time by sequence number for an individual TCP stream, showing that packets returned from eventgate rapidly, with 20ms being an outlier.

image.png (832×1 px, 111 KB)

Change 790703 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Bump to eventgate 1.8.1 to get 10s latency bucket in metrics

https://gerrit.wikimedia.org/r/790703

> We have proposed creating a new bucket for requests that take up to 10 seconds but that hasn't happened yet.

Just to be clear on this, all that is needed is to deploy a version of eventgate that emits the extra bucket. The infrastructure side will pick it up immediately, there is no need for any configuration there.

> In the meantime I could calculate the count of requests which take more than 1.0 second - but only by scraping the metrics myself and subtracting...
> express_router_request_duration_seconds_bucket{le="1",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"}
> ...from:
> express_router_request_duration_seconds_bucket{le="+Inf",service="eventgate-analytics-external",path="v1/events",method="POST",status="202"}
>
> I'd have to do this for every pod as well, so I'm not keen on doing that if I can help it.

Ask and you shall receive ;-). Is this https://w.wiki/59Nx what you want? Note the diurnal pattern btw, which is expected.

> all that is needed is to deploy a version of eventgate

On it. I had issues with npm and 2FA, now resolved (thanks to Timo!).
https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/790703

Change 790703 merged by Ottomata:

[eventgate-wikimedia@master] Bump to eventgate 1.8.1 to get 10s latency bucket in metrics

https://gerrit.wikimedia.org/r/790703

Mentioned in SAL (#wikimedia-operations) [2022-05-10T15:35:17Z] <ottomata> rolling deploy/restart of all eventgate services to get 10s latency bucket metric - T306181

Change 790727 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Double the number of replicas of eventgate-analytics-external

https://gerrit.wikimedia.org/r/790727

Change 790727 merged by jenkins-bot:

[operations/deployment-charts@master] Double the number of replicas of eventgate-analytics-external

https://gerrit.wikimedia.org/r/790727

I have now deployed the change to double the number of replica pods for eventgate-analytics-external. Monitoring for any changes to the p90 and p99 response times.

> I have now deployed the change to double the number of replica pods for eventgate-analytics-external. Monitoring for any changes to the p90 and p99 response times.

Yup, it's there. Subtle but noticeable. Shaved off ~1s from p99 and ~80-100ms from p90

p99:

image.png (697×2 px, 110 KB)

p90:

image.png (696×2 px, 101 KB)

As traffic increases according to the usual daily pattern it has, I expect that difference to become even less noticeable (it would take a few days for the pattern to become easier to see).

@BTullis, double or even triple capacity again, I'd say? It should validate the experiment and findings even more prominently.

Tangentially, this is about the point in time where the importance of having a latency SLO becomes apparent. Having an SLO, e.g. 99% of all requests to eventgate-analytics-external complete in less than X ms for a sensible value of X (e.g. 100ms/200ms), would have raised the issue a long time ago. Tail latency issues tend to be an early warning for larger problems (e.g. errors) down the line.

> Yup, it's there. Subtle but noticeable. Shaved off ~1s from p99 and ~80-100ms from p90

Agreed, but it's much more subtle than I had hoped. Also, it doesn't seem to have made any difference at all to the number of errors being seen by the varnish caches.

image.png (511×1 px, 150 KB)

https://logstash.wikimedia.org/goto/7b4bf52ef3221135ac48564f784c6342

> Double or even triple capacity again, I'd say? It should validate the experiment and findings even more prominently.

OK, I'll double it again for now and see what happens.

> Tangentially, this is about the point in time where the importance of having a latency SLO becomes apparent. Having an SLO, e.g. 99% of all requests to eventgate-analytics-external complete in less than X ms for a sensible value of X (e.g. 100ms/200ms), would have raised the issue a long time ago.

Yes, I agree. We have historically had an Icinga alert in place for this tail latency, although not a formal SLO. When I was converting this Icinga check to a prometheus check I discovered that the check wasn't working and we were in a state whereby the new prometheus alert would be firing all the time. This was in November 2021 - T294911: Apparent latency warning in 90th centile of eventgate-logging-external but I couldn't fix it at the time, so just increased the threshold.

I have stumbled upon this issue with HAProxy, which seems to fit some of the symptoms we are seeing.
https://github.com/haproxy/haproxy/issues/171

Namely that the bodies are missing from a small percentage of the POST requests, particularly those using chunked encoding.
I'm investigating to see whether this issue might be applicable to our environment.

However, they talk about the option http-use-htx to revert to the pre-2.0 HAProxy behaviour regarding chunked encoding. That option doesn't seem to be available any more in HAProxy 2.4.

Beginning with HAProxy 2.1, HTX is the only way to go. On another issue (https://github.com/haproxy/haproxy/issues/1301#issuecomment-866996040) Willy Tarreau mentions that in HTX mode HAProxy always dechunks the payload. Could we try to get a request PoC that triggers the error?

Change 791320 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Double the number of eventgate_analytics_external replicas

https://gerrit.wikimedia.org/r/791320

Change 791320 merged by jenkins-bot:

[operations/deployment-charts@master] Double the number of eventgate_analytics_external replicas

https://gerrit.wikimedia.org/r/791320

Change 791594 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] eventgate_analytics_external: Lower to 35 replicas

https://gerrit.wikimedia.org/r/791594

Change 791594 merged by Alexandros Kosiaris:

[operations/deployment-charts@master] eventgate_analytics_external: Lower to 35 replicas

https://gerrit.wikimedia.org/r/791594

The 50% bump in capacity didn't make any noticeable difference this time around. :-(

> The 50% bump in capacity didn't make any noticeable difference this time around. :-(

Thanks for trying it and it's still a useful experiment.

There was some discussion on another ticket today, which seems likely to be related: T295427: Problem with delay caused by intake-analytics.wikimedia.org

My hope is that we might be able to use those reports to identify some test cases that always (or often) cause delayed responses or errors in intake-analytics.wikimedia.org
If we can do that then I can use some mitmproxy based investigation into why it's happening.

>> The 50% bump in capacity didn't make any noticeable difference this time around. :-(
>
> Thanks for trying it and it's still a useful experiment.
>
> There was some discussion on another ticket today, which seems likely to be related: T295427: Problem with delay caused by intake-analytics.wikimedia.org
>
> My hope is that we might be able to use those reports to identify some test cases that always (or often) cause delayed responses or errors in intake-analytics.wikimedia.org
> If we can do that then I can use some mitmproxy based investigation into why it's happening.

If you add 8.8.8.7 intake-analytics.wikimedia.org to your hosts file it should cause intake-analytics to timeout consistently I think? Maybe that would help?
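For anyone who wants to try that, a minimal sketch of the suggested test (on a test client only; remember to remove the line again afterwards):

# Point intake-analytics at an address that won't answer, per the suggestion above
echo '8.8.8.7 intake-analytics.wikimedia.org' | sudo tee -a /etc/hosts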

Thanks @AlexisJazz for that suggestion. I think it might well help in terms of finding and fixing this because, as I understand it, these requests to intake-analytics.wikimedia.org should all be sent asynchronously using the Beacon API, as stated here [T295427#7817708].

I've also asked a question of the Quality-and-Test-Engineering-Team via the #talk-to-qte Slack channel here about whether we could automate some browser testing of this case where intake-analytics.wikimedia.org times out or is unreachable. This might help to expose any cases where the call happens synchronously.

I'm still trying to work toward identifying any server-side causes for these timeouts too, but perhaps if we can identify a specific type of event from the client that causes the timeout, this will make it easier.

Change 790289 abandoned by Btullis:

[operations/deployment-charts@master] Increase the connect_timeout for eventgate based services

Reason:

We don't think that it is this connect_timeout after all. Rethinking.

https://gerrit.wikimedia.org/r/790289

> I notice that the things that take ~1s to be served are things that make zero sense to take 1s to serve, like /robots.txt or /_info. Those are extremely cheap endpoints (returning a small text file or a one-liner of JSON) IIRC.

"Extremely" can't be stressed enough here. The handler for /robots.txt is as follows:

eventgate/routes/root.js
router.get( '/robots.txt', ( req, res ) => {
  res.type('text/plain').end('User-agent: *\nDisallow: /\n');
} );

Likewise, the handler for /_info stringifies a small-ish in-memory data structure and sends it to the client.

In my local development environment, these endpoints take ~1 ms to return a response from the perspective of the app. Obviously, my dev env isn't as sophisticated as our prod env but I'd hope that these endpoints behave roughly the same regardless of where they're executing.
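As a quick sanity check, one could time the endpoints directly with curl against a local dev instance; a sketch, assuming the app listens on port 8192 as it does inside the pods:

curl -s -o /dev/null -w 'robots.txt served in %{time_total}s\n' http://localhost:8192/robots.txt
curl -s -o /dev/null -w '_info served in %{time_total}s\n' http://localhost:8192/_info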

I was able to take advantage of the fact that eventgate is an express app and so has quite detailed debug logging built in. Running:

DEBUG="express:*" ./node_modules/.bin/eventgate-wikimedia-dev -c ./eventgate.config.yaml

yielded output like:

  express:router query  : /robots.txt +1ms
  express:router expressInit  : /robots.txt +0ms
{"name":"eventgate-devserver","hostname":"...","pid":..,"level":10,"msg":"Incoming request","request_id":"22011fe0-ddde-11ec-a617-4536cb344e76","request":{"url":"/robots.txt","headers":{"cache-control":"max-age=0","user-agent":"...","x-request-id":"22011fe0-ddde-11ec-a617-4536cb344e76"},"method":"GET","params":{"0":"/robots.txt"},"query":{},"remoteAddress":"127.0.0.1","remotePort":51236},"levelPath":"trace/req","time":"2022-05-27T16:57:46.463Z","v":0}
  express:router compression  : /robots.txt +1ms
  express:router jsonParser  : /robots.txt +0ms
  express:router urlencodedParser  : /robots.txt +0ms
  express:router router  : /robots.txt +0ms
  express:router dispatching GET /robots.txt +0ms

@BTullis: Running the app with that level of logging enabled might help us isolate where the delays are coming from.

>> I notice that the things that take ~1s to be served are things that make zero sense to take 1s to serve, like /robots.txt or /_info. Those are extremely cheap endpoints (returning a small text file or a one-liner of JSON) IIRC.
>
> "Extremely" can't be stressed enough here. The handler for /robots.txt is as follows:

Thanks for this. Your sentence made me ask again the perennial "what on <insert preferred planet>?", and that's where I realized that this is exactly 1000 times larger than it should be (and than the number your debugging tests are reporting). Anyway, to cut a long story short, eventgate-wikimedia right now uses service-runner 2.8.4, which lacks https://github.com/wikimedia/service-runner/commit/765ce105525340491b5ea40960fecaf35d1622d4, which is present in 3.1.0 and makes seconds actually be seconds, not milliseconds. The graphs are misleading currently but should be fixed.

I'll upload a patch to bump service-runner to 3.1.0 for eventgate-wikimedia and I'll let @BTullis and @Ottomata take it from there. That should at least close this rabbit hole I dragged us into and allow us to look into other reasons for what the task is about.

Change 801393 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[eventgate-wikimedia@master] Bump to service-runnner 3.1.0

https://gerrit.wikimedia.org/r/801393

Thanks @phuedx and @akosiaris for that information and for the patch. That's a great find about service-runner, so I'm keen to deploy that as it should help us sort the signal from the noise. I don't currently have +2 rights on https://gerrit.wikimedia.org/r/admin/repos/eventgate-wikimedia so I can't merge that patch myself. @Ottomata do you have the rights to add me?

Change 801393 merged by Alexandros Kosiaris:

[eventgate-wikimedia@master] Bump to service-runnner 3.1.0

https://gerrit.wikimedia.org/r/801393

Change 801409 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] eventgate-analytics: Bump 2022-05-30-145633-production

https://gerrit.wikimedia.org/r/801409

Change 801409 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics: Bump 2022-05-30-145633-production

https://gerrit.wikimedia.org/r/801409

Didn't work btw, turns out that eventgate also needs a service-runner bump. PR at https://github.com/wikimedia/eventgate/pull/19

Hm, I think we stopped using the github commit sha to install, and instead rely on NPM like everything else. I've just published your PR to NPM.

Change 803242 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use latest image version in all remaining eventgate services

https://gerrit.wikimedia.org/r/803242

I'm a bit confused by the state of things now.

  1. Has the update to service-runner 3.1.0 been successful for eventgate-analytics-external?

I can see a change in the timings for this service almost 6 days ago, but it's not in the direction that I was expecting. e.g.
Grafana link

image.png (933×1 px, 1 MB)

I was expecting this change to reduce the times shown by a factor of 1000.

  2. Do the grafana graphs need to be updated to take into account the new units? I assumed that they wouldn't need updating because the values are to be generated in seconds, which is what the axis labels are currently set to use, but maybe I've misunderstood that.

I have created a patch to use the new docker image for all of the other eventgate services (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803242) but I'd like to be sure that the change to the timings measurement is doing what we expect, before proceeding with that.

> I'm a bit confused by the state of things now.

> 1. Has the update to service-runner 3.1.0 been successful for eventgate-analytics-external?

No. We ended up also needing https://github.com/wikimedia/eventgate/pull/19 and while that has been merged and the npm package published per T306181#7970993, there's been no new image built yet by the pipeline to incorporate that change. That also means that the grafana graphs should show no noticeable difference.

> 2. Do the grafana graphs need to be updated to take into account the new units? I assumed that they wouldn't need updating because the values are to be generated in seconds, which is what the axis labels are currently set to use, but maybe I've misunderstood that.

You've assumed correctly: the grafana dashboards have always been in seconds, but eventgate was publishing milliseconds. With https://github.com/wikimedia/eventgate/pull/19 merged, and once we deploy a new image built with that incorporated, eventgate will be publishing seconds and that discrepancy should be eliminated.

> I have created a patch to use the new docker image for all of the other eventgate services (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803242) but I'd like to be sure that the change to the timings measurement is doing what we expect, before proceeding with that.

That change would be awesome, once we have a new image built and that image is used instead of 2022-05-30-145633-production. A simple change to e.g. the README of https://gerrit.wikimedia.org/r/plugins/gitiles/eventgate-wikimedia/+/refs/heads/master/README.md should suffice for that.

Great, thanks for the summary @akosiaris - So the reduction in replicas alone explains the slight increase in latency then.

I still can't push to that repo, so I've added the following access request to grant the Analytics group push rights to eventgate-wikimedia.
https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/802950

Once that (or a similar rights change) has come through, I'll update the README as you suggest to trigger new containers and try it again.

Change 803471 had a related patch set uploaded (by Btullis; author: Btullis):

[eventgate-wikimedia@master] Trigger deployment pipeline

https://gerrit.wikimedia.org/r/803471

Change 803471 merged by Btullis:

[eventgate-wikimedia@master] Trigger deployment pipeline

https://gerrit.wikimedia.org/r/803471

Change 803483 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the eventgate image used by eventgate-analytics-external

https://gerrit.wikimedia.org/r/803483

Change 803483 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the eventgate image used by eventgate-analytics-external

https://gerrit.wikimedia.org/r/803483

Change 803242 merged by jenkins-bot:

[operations/deployment-charts@master] Use latest image version in all remaining eventgate services

https://gerrit.wikimedia.org/r/803242

I've run out of time to work on this for now, so I'm removing the Data-Engineering-Kanban tag until I (or someone else) can pick it up again.

We have not managed to get to the bottom of why these errors are occurring, but we have cleared up any lingering doubts about latency being a cause. The latency graphs here are much happier now.

I still have some ideas around T306181#7903490 and/or T306181#7920724 which could be the cause of the 503 and/or 400 errors, but not much more time to investigate right now.

Thanks ben!

Seconded. Thanks for all of your work on this.

This is highly related to T317051 and I think we can close this one and just blame it on how Varnish reports some client-triggered errors.

> This is highly related to T317051 and I think we can close this one and just blame it on how Varnish reports some client-triggered errors.

Yes, I see. Thanks for the explanation @Vgutierrez. I'm happy for this ticket to be closed too.

Sorry about the fuss and thanks for your thorough investigation @BTullis