
Termbox SSR connection terminated very often
Open · Needs Triage · Public

Description

Starting last Wednesday, 2020-06-10, the number of connection failures with the Termbox SSR service seems to have risen sharply, to about 300-450 per hour.

https://logstash.wikimedia.org/goto/808228af4133ed6e857a168de0c6cd4c

The error message is given as:

Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: encountered a bad response from the remote renderer

with the content being:

upstream connect error or disconnect/reset before headers. reset reason: connection termination

It is currently unclear what is going on or what the effects are. At the very least there is lots of logspam that started recently.

IRC log of discussion: https://wm-bot.wmflabs.org/browser/index.php?start=06%2F15%2F2020&end=06%2F15%2F2020&display=%23wikimedia-serviceops

Event Timeline

Michael created this task. · Jun 15 2020, 9:53 AM
Michael updated the task description. · Jun 15 2020, 10:02 AM

Change 605554 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] services_proxy: Add keepalive, retries

https://gerrit.wikimedia.org/r/605554

Change 605554 merged by Alexandros Kosiaris:
[operations/puppet@production] services_proxy: Add keepalive, retries

https://gerrit.wikimedia.org/r/605554
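
For context, the kind of Envoy settings such a change introduces might look like the fragment below: keepalive on upstream connections plus a retry policy for connections reset before headers. This is an illustrative sketch using Envoy's field names, not the actual patch, which lives in operations/puppet at the Gerrit link above; all values are hypothetical.

```yaml
# Illustrative only: keepalive and retry settings of the kind the patch adds.
clusters:
  - name: termbox
    connect_timeout: 1s              # hypothetical value
    upstream_connection_options:
      tcp_keepalive:
        keepalive_time: 300          # probe idle connections instead of silently losing them
---
routes:
  - match: { prefix: "/" }
    route:
      cluster: termbox
      retry_policy:
        retry_on: reset,connect-failure   # retry when upstream resets before headers
        num_retries: 1
```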

Addshore updated the task description. · Jun 15 2020, 10:49 AM
Michael added a comment (edited). · Jun 15 2020, 2:41 PM

The patch above (Ib7bfacf88/02109a44) seemingly resolved the original issue, but some new errors started appearing (though much less common than before).

Those new errors seem to come from a timeout in Termbox/Kubernetes, which results in a timeout in the PHP/MediaWiki code.

The following two links illustrate that error for the rendering process of a single item: 庞家村 (Q14836111)

Termbox/k8s: https://logstash.wikimedia.org/goto/480ada0d3692e0cadd04e9e09f9d95f5
Mediawiki: https://logstash.wikimedia.org/goto/fdc9bc9dab0d92bc440f7c6a20e016be

This was judged not to be pressing (@Tarrow, @Addshore, please correct me if I'm talking rubbish here). Currently, the internal timeouts seem to be 3 seconds on both the MediaWiki side (ssrServerTimeout?) and the Termbox side. It is not clear what we can do about these errors: while the internal requests should not take longer than 3 seconds, they sometimes do.

Follow-ups:

Termbox/k8s: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-syslog-2020.06.15/syslog?id=AXK3_osLZmYAikdJbyT-&_g=h@e3739c2
Mediawiki: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2020.06.15/mediawiki?id=AXK3_omyK6TSR36GrxZd&_g=h@e1c60c6
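
To make the timeout behaviour concrete, here is a hypothetical sketch of a remote-renderer client applying a request timeout in the spirit of ssrServerTimeout. This is illustrative Python, not MediaWiki's actual PHP TermboxRemoteRenderer; the function and parameter names are invented.

```python
# Hypothetical sketch, not MediaWiki's real implementation.
import urllib.error
import urllib.request


def render_termbox(url, ssr_server_timeout=3.0):
    """Fetch server-side-rendered termbox HTML, failing after the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=ssr_server_timeout) as resp:
            return resp.read().decode("utf-8")
    except (urllib.error.URLError, OSError) as exc:
        # Timeouts and connection resets alike surface to the caller as the
        # generic "bad response" error, matching the logspam seen above.
        raise RuntimeError(
            "encountered a bad response from the remote renderer") from exc
```

Note how any failure, timeout or otherwise, collapses into one generic error, which is consistent with both error classes in the logs pointing at the same message.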

Logstash links are unfortunately not copy-pastable. Anyone clicking on those links will get an "Unable to completely restore the URL" error; be sure to use the share functionality.

> Logstash links are unfortunately not copy-pastable. Anyone clicking on those links will get an "Unable to completely restore the URL" error; be sure to use the share functionality.

Meh! Thanks for pointing that out. I still have those tabs open. I'll try to see if there is some persistent share functionality there.

@akosiaris I edited my comment and it now has links that hopefully should work. There should be only a single event behind each of the searches linked there.

Also, I added another follow-up task to understand those k8s timeouts themselves.

@Michael thanks for writing this up!

So, if it is safe to assume the MW -> termbox timeout is 3s, I would suggest we configure the envoys accordingly by setting tls.upstream_timeout: "3s" in termbox's values.yaml, as well as timeout: "3s" in the appservers' envoy config (also lowering keepalive there).

> @Michael thanks for writing this up!
>
> So, if it is safe to assume the MW -> termbox timeout is 3s, I would suggest we configure the envoys accordingly by setting tls.upstream_timeout: "3s" in termbox's values.yaml, as well as timeout: "3s" in the appservers' envoy config (also lowering keepalive there).

That sounds sensible to me, but I'm really not the expert here. What do @Addshore and @Tarrow think? Note that in the context of T255436 we might look at timeouts again, but that work has not yet been prioritized, so I cannot speculate on when it might happen.
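
The suggestion above might translate into fragments like the following. Only the two settings named in the comment come from the discussion; the file paths and surrounding structure are guesses.

```yaml
# termbox values.yaml (fragment, illustrative placement)
tls:
  upstream_timeout: "3s"
```

```yaml
# appservers envoy config (fragment, illustrative placement;
# the keepalive setting mentioned above would also be lowered here,
# though no concrete value was given in the comment)
timeout: "3s"
```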

One thing we realized, relating to this ticket and timeouts, is that it would likely make sense for the initial MW -> termbox timeout to be longer than the termbox -> MW timeouts.
In an ideal situation, when the inner MediaWiki request fails, the first MW request would receive an error message from the termbox service saying so; with the current behaviour, the first MW -> termbox request times out before that can happen.
Another thing to take into account is that termbox can make multiple requests to MediaWiki, depending on the state of the cache. I guess these each have a timeout?

In summary: we should look at all of this and figure out the correct values.
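
The nested-timeout interaction described above can be sketched as follows. All names and timeout values are illustrative, not the real configuration; the point is that an outer timeout equal to the inner one masks the inner error.

```python
# Sketch of the nested-timeout problem: if the outer (MW -> termbox)
# timeout equals the inner (termbox -> MW) timeout, the outer caller
# gives up at the same moment the inner call fails, so it reports a
# generic timeout instead of the inner error message.

def inner_request(elapsed, inner_timeout=3.0):
    """termbox -> MW request: fails with a descriptive error on timeout."""
    if elapsed >= inner_timeout:
        return ("error", "inner MediaWiki request timed out")
    return ("ok", "rendered termbox")

def outer_request(elapsed, outer_timeout=3.0, inner_timeout=3.0):
    """MW -> termbox request wrapping the inner one."""
    if elapsed >= outer_timeout:
        return ("timeout", None)  # MW never sees the inner error
    return inner_request(elapsed, inner_timeout)

# With equal 3s timeouts, a slow inner call surfaces as a bare outer timeout:
assert outer_request(3.0) == ("timeout", None)
# With a longer outer timeout, MW receives the inner error message instead:
assert outer_request(3.0, outer_timeout=4.0) == (
    "error", "inner MediaWiki request timed out")
```

This is why making the MW -> termbox timeout longer than the termbox -> MW timeouts, as proposed above, would let the more informative inner error reach MediaWiki.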