
Termbox service: unusual errors that could be from envoy
Open, Needs Triage, Public

Description

We're seeing errors coming from our Termbox service that look like this. We're trying to make sure we understand the different types of timeout so that we can minimize them where possible. We're also seeing a flurry of related errors that we think is coming from envoy.

We didn't find a corresponding error from an app server that we were expecting to see (e.g. an error from Special:EntityData), so we suspect our connection problems may lie with envoy.

So we wonder whether these errors might actually come from the envoy TLS service, which we don't fully understand. How would we track down where they are coming from? Should we be taking any action on them?

Perhaps @akosiaris or @JMeybohm could give us a hint? We guess it might be related to T254581.

Event Timeline

Tarrow added a project: serviceops.
Tarrow added subscribers: JMeybohm, serviceops.
Tarrow removed a subscriber: serviceops.

Envoy is being documented at https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is used by termbox to talk to MediaWiki (it's a component of a service mesh). The idea is to have low-cost persistent TLS connections, with retries and telemetry. For more insight, aside from the doc link above, the following Grafana dashboard is useful: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now

It is expected and absolutely normal that connections will occasionally be terminated and re-established by envoy, as the network is not infallible. Some failures will be "masked" by envoy's retry logic, at the cost of extra latency of course.
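As an aside, if you want to quantify retries and connection failures outside of Grafana, the same counters can be pulled straight from the Prometheus HTTP API. The sketch below assumes envoy's standard statistic names (envoy_cluster_upstream_rq_retry, envoy_cluster_upstream_cx_connect_fail), an app="termbox" scrape label, and a placeholder Prometheus URL; none of those are verified against the actual setup.

```python
# Sketch: query retry / connection-failure counters for the termbox -> mwapi-async
# path from a Prometheus-compatible API. Metric names, labels and the endpoint URL
# are assumptions based on envoy's standard stats, not the verified configuration.
import requests

PROMETHEUS_URL = "https://prometheus.example.org"  # placeholder endpoint

QUERIES = {
    # Retries envoy performed on behalf of termbox in the last hour.
    "retries": 'sum(increase(envoy_cluster_upstream_rq_retry{app="termbox"}[1h]))',
    # Upstream connection failures (these surface as UF response flags / 503s).
    "connect_failures": 'sum(increase(envoy_cluster_upstream_cx_connect_fail{app="termbox"}[1h]))',
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = result[0]["value"][1] if result else "0"
    print(f"{name}: {value}")
```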

Using the dashboard above can help track down some of the errors. Logs from envoy for termbox are also in logstash; just remove the severity filter and they'll appear.

Parsing them can be done using https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage (a small parsing sketch follows the notes below).

A couple of notes though.

  • Those log entries aren't parsed into a JSON object, unfortunately
  • envoy uses HTTP/2 terminology for some things internally, even if HTTP/1.1 is used. E.g. you will see %REQ(:AUTHORITY)%; that is the :authority HTTP/2 pseudo-header (https://tools.ietf.org/html/rfc7540#section-8.1.2), which is equivalent to the Host header in HTTP/1.1
  • The response flags are usually telling, e.g. UF: upstream connection failure (alongside a 503 response code), or URX: the request was rejected because the upstream retry limit (HTTP) or maximum connect attempts (TCP) was reached, and so on
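To make the notes above concrete, here is a minimal sketch of how such a log line could be picked apart in Python. It assumes the upstream default access log format documented at the link above; if the termbox sidecar uses a custom format string, the regular expression would need to be adjusted, and the example line at the bottom is hypothetical.

```python
# Sketch: parse an envoy access log line in the documented default format and
# explain the response flags. Assumes the default format string; a custom
# format in the actual deployment would need a different regular expression.
import re

# Default format: [START_TIME] "METHOD PATH PROTOCOL" RESPONSE_CODE RESPONSE_FLAGS
#                 BYTES_RECEIVED BYTES_SENT DURATION X-ENVOY-UPSTREAM-SERVICE-TIME
#                 "X-FORWARDED-FOR" "USER-AGENT" "X-REQUEST-ID" ":AUTHORITY" "UPSTREAM_HOST"
ACCESS_LOG_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<response_code>\d+) (?P<response_flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) (?P<upstream_ms>\S+) '
    r'"(?P<x_forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"'
)

# A few of the documented response flags (see the access log docs for the full list).
RESPONSE_FLAGS = {
    "UF": "Upstream connection failure",
    "URX": "Upstream retry limit (HTTP) or max connect attempts (TCP) reached",
    "UT": "Upstream request timeout",
    "UC": "Upstream connection termination",
    "-": "No flags set",
}

def explain(line: str) -> None:
    match = ACCESS_LOG_RE.match(line)
    if not match:
        print("line does not match the default access log format")
        return
    fields = match.groupdict()
    flags = fields["response_flags"]
    meaning = ", ".join(RESPONSE_FLAGS.get(f, f) for f in flags.split(","))
    # :authority is the HTTP/2 pseudo-header, equivalent to Host in HTTP/1.1.
    print(f'{fields["response_code"]} {flags} ({meaning}) '
          f'authority={fields["authority"]} upstream={fields["upstream_host"]}')

# Hypothetical example line in the default format:
explain('[2020-06-15T10:00:00.000Z] "GET /wiki/Special:EntityData/Q42.json HTTP/1.1" '
        '503 UF 0 91 3001 - "-" "termbox" "abc-123" "www.wikidata.org" "10.2.1.5:443"')
```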

Hopefully the above helps shed a bit of light.

Finally, as far as the "Should we be taking any action about these?" question goes, my answer would be to use the service's SLO as a guide. As pointed out in T255410, it doesn't seem worth investigating these further right now.