
Termbox service: unusual errors that could be from envoy
Open, Needs Triage, Public

Description

We're seeing errors coming from our Termbox service that look like this. We're trying to make sure we understand the different types of timeout so that we can minimize them where possible. We're also seeing a flurry of related errors that we think is coming from envoy.

We didn't find a corresponding error from an app server that we were expecting to see (e.g. an error from Special:EntityData), so we suspect our connection problems may lie with envoy.

So we wonder whether these errors might actually come from the envoy TLS service, which we don't fully understand. How would we track down where they are coming from? Should we be taking any action on them?

Perhaps @akosiaris or @JMeybohm could give us a hint? We guess it might be related to T254581.

Event Timeline

Tarrow added a project: serviceops.
Tarrow added subscribers: JMeybohm, serviceops.
Tarrow removed a subscriber: serviceops.

Envoy is being documented at https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is used by termbox to talk to MediaWiki (it's a component of a service mesh). The idea is to have low-cost persistent TLS connections, with retries and telemetry. For more insight, aside from the doc link above, the following Grafana dashboard is useful: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now

It is expected and absolutely normal that connections will occasionally be terminated and re-established by envoy, as the network is not infallible. Some failures will be "masked" by envoy's retry logic, at the cost of extra latency of course.
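As an aside, if you want to quantify retries and connection failures outside of Grafana, the same counters can be pulled straight from the Prometheus HTTP API. The sketch below assumes envoy's standard statistic names (envoy_cluster_upstream_rq_retry, envoy_cluster_upstream_cx_connect_fail), an app="termbox" scrape label, and a placeholder Prometheus URL; none of those are verified against the actual setup.

```python
# Sketch: query retry / connection-failure counters for the termbox -> mwapi-async
# path from a Prometheus-compatible API. Metric names, labels and the endpoint URL
# are assumptions based on envoy's standard stats, not the verified configuration.
import requests

PROMETHEUS_URL = "https://prometheus.example.org"  # placeholder endpoint

QUERIES = {
    # Retries envoy performed on behalf of termbox in the last hour.
    "retries": 'sum(increase(envoy_cluster_upstream_rq_retry{app="termbox"}[1h]))',
    # Upstream connection failures (these surface as UF response flags / 503s).
    "connect_failures": 'sum(increase(envoy_cluster_upstream_cx_connect_fail{app="termbox"}[1h]))',
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = result[0]["value"][1] if result else "0"
    print(f"{name}: {value}")
```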

Using the dashboard above can help track down some of the errors. Logs from envoy for termbox are also in logstash; just remove the severity filter and they'll appear.

Parsing them can be done using https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage (a small parsing sketch follows the notes below).

A couple of notes though.

  • Those log entries aren't parsed into a JSON object, unfortunately
  • envoy uses HTTP/2 terminology for some things internally, even if HTTP/1.1 is used. E.g. you will see %REQ(:AUTHORITY)%; that is the :authority HTTP/2 pseudo-header (https://tools.ietf.org/html/rfc7540#section-8.1.2), which is equivalent to the Host header in HTTP/1.1
  • The response flags are usually telling, e.g. UF: upstream connection failure (alongside a 503 response code), or URX: the request was rejected because the upstream retry limit (HTTP) or maximum connect attempts (TCP) was reached, and so on
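To make the notes above concrete, here is a minimal sketch of how such a log line could be picked apart in Python. It assumes the upstream default access log format documented at the link above; if the termbox sidecar uses a custom format string, the regular expression would need to be adjusted, and the example line at the bottom is hypothetical.

```python
# Sketch: parse an envoy access log line in the documented default format and
# explain the response flags. Assumes the default format string; a custom
# format in the actual deployment would need a different regular expression.
import re

# Default format: [START_TIME] "METHOD PATH PROTOCOL" RESPONSE_CODE RESPONSE_FLAGS
#                 BYTES_RECEIVED BYTES_SENT DURATION X-ENVOY-UPSTREAM-SERVICE-TIME
#                 "X-FORWARDED-FOR" "USER-AGENT" "X-REQUEST-ID" ":AUTHORITY" "UPSTREAM_HOST"
ACCESS_LOG_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<response_code>\d+) (?P<response_flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) (?P<upstream_ms>\S+) '
    r'"(?P<x_forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"'
)

# A few of the documented response flags (see the access log docs for the full list).
RESPONSE_FLAGS = {
    "UF": "Upstream connection failure",
    "URX": "Upstream retry limit (HTTP) or max connect attempts (TCP) reached",
    "UT": "Upstream request timeout",
    "UC": "Upstream connection termination",
    "-": "No flags set",
}

def explain(line: str) -> None:
    match = ACCESS_LOG_RE.match(line)
    if not match:
        print("line does not match the default access log format")
        return
    fields = match.groupdict()
    flags = fields["response_flags"]
    meaning = ", ".join(RESPONSE_FLAGS.get(f, f) for f in flags.split(","))
    # :authority is the HTTP/2 pseudo-header, equivalent to Host in HTTP/1.1.
    print(f'{fields["response_code"]} {flags} ({meaning}) '
          f'authority={fields["authority"]} upstream={fields["upstream_host"]}')

# Hypothetical example line in the default format:
explain('[2020-06-15T10:00:00.000Z] "GET /wiki/Special:EntityData/Q42.json HTTP/1.1" '
        '503 UF 0 91 3001 - "-" "termbox" "abc-123" "www.wikidata.org" "10.2.1.5:443"')
```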

Hopefully the above helps shed a bit of light.

Finally, as far as the "Should we be taking any action about these?" question goes, my answer would be to use the service's SLO as a guide. As pointed out in T255410, it doesn't seem worth investigating these further right now.