We need additional insight into Envoy's proxy time, that is, the sum of the times:
- between a request arriving and being sent to the upstream
- between a response being received from the upstream and the full response being sent to the downstream
There are additional complications in how filters and services such as the rate limiting service affect these times, but for the purposes of this exercise we don't need to capture those nuances, as long as we get raw values for the two intervals above.
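To make the definition concrete, here is a minimal sketch of the value we are after, assuming four per-request timestamps are available. The `RequestTimings` type and its field names are hypothetical and are not existing Envoy identifiers.

```python
from dataclasses import dataclass


@dataclass
class RequestTimings:
    # Hypothetical per-request timestamps, in milliseconds on a monotonic clock.
    downstream_request_received: float  # request arrives at the proxy
    upstream_request_sent: float        # request is sent to the upstream
    upstream_response_received: float   # response is received from the upstream
    downstream_response_sent: float     # full response is sent to the downstream


def proxy_time_ms(t: RequestTimings) -> float:
    """Sum of the two intervals spent inside the proxy, excluding upstream time."""
    request_leg = t.upstream_request_sent - t.downstream_request_received
    response_leg = t.downstream_response_sent - t.upstream_response_received
    return request_leg + response_leg


# A request that spends 50 ms in the upstream and 5 ms in the proxy.
example = RequestTimings(
    downstream_request_received=0.0,
    upstream_request_sent=2.0,
    upstream_response_received=52.0,
    downstream_response_sent=55.0,
)
print(proxy_time_ms(example))
```

In this example the total request duration is 55 ms, but only 5 ms of that is attributable to the proxy, which is the figure we want to observe.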
These statistics should be exposed to Prometheus as a histogram (our ultimate goal is to calculate the 99th percentile of proxy time).
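For illustration, the sketch below estimates a percentile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the general approach Prometheus takes for classic histograms. The function name, the bucket boundaries and the counts are made up for the example and are not Envoy defaults.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket, so report the
                # highest finite bound instead of interpolating to infinity.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical proxy-time buckets in milliseconds (cumulative counts).
proxy_time_buckets = [(1.0, 50), (5.0, 90), (10.0, 99), (25.0, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, proxy_time_buckets)
print(p99)
```

The estimate is only as precise as the bucket boundaries, so the buckets chosen for a proxy-time histogram would need to be fine-grained around the expected 99th percentile.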
I have spent a long time looking at the existing metrics to determine whether proxy time can be derived from them. It appears that this functionality does not exist, and that any result approximated from the existing histograms (for example, by juggling the existing upstream and downstream time histograms) would be inaccurate, since the difference between two percentiles is not the percentile of the per-request difference.
Currently our response time SLI is the time from request to response of API content to the user. This means that our response time is bounded by the response times of the appservers, databases and other components along the path. This is correct when the API server is viewed as an application rather than a proxy. Our response time SLIs would be a lot lower, and more useful, if they were independent of these factors and reflected the real amount of time the proxy takes to serve requests.