ticket.wikimedia.org down: upstream connect error or disconnect/reset before headers
Closed, Resolved (Public)

Assigned To

	eoghan

Authored By

	RhinosF1
	Jan 6 2024, 3:19 PM

Description

15:07:50 <AntiComposite> upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111 on ticket.wikimedia.org

Manual #page given by TheresNoTime

Details

	Subject	Repo	Branch	Lines +/-
	[vrts] Adjust restart and oom policy for clamav and vrts services	operations/puppet	production	+24 -4
	vrts: auto-restart apache2 on-failure	operations/puppet	production	+5 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		eoghan	T354478 ticket.wikimedia.org down: upstream connect error or disconnect/reset before headers
		Open		LSobanski	T354479 ticket.wikimedia.org should page when down

Event Timeline

RhinosF1 created this task. Jan 6 2024, 3:19 PM

Restricted Application added a subscriber: Aklapper. Jan 6 2024, 3:19 PM

RhinosF1 triaged this task as Unbreak Now! priority. Jan 6 2024, 3:20 PM

RhinosF1 added projects: Znuny, serviceops.

RhinosF1 updated the task description.

LSobanski edited projects, added collaboration-services; removed serviceops. Jan 6 2024, 3:20 PM

Jelto added subscribers: Arnoldokoth, LSobanski. Jan 6 2024, 3:21 PM

I restarted clamav-daemon and apache2 on the host, both had been stopped. It looks like it was a memory pressure issue again. Short term, we could look at auto-restarting apache/clamav on failure, longer term we should investigate whether increasing the memory allocation of the VM would be possible/worthwhile. I'll take care of more in-depth investigation and follow-up on Monday.

Krd subscribed. Jan 6 2024, 3:34 PM

DannyS712 subscribed. Jan 6 2024, 7:50 PM

Peachey88 subscribed. Jan 7 2024, 12:01 AM

https://gerrit.wikimedia.org/r/c/operations/puppet/+/988081

NoOnEtHeMaStA added a commit: rMEXTf658c3375869: Update git submodules. Jan 7 2024, 9:15 PM

Aklapper removed a commit: rMEXTf658c3375869: Update git submodules. Jan 7 2024, 9:24 PM

Change 988410 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] vrts: auto-restart apache2 on-failure

https://gerrit.wikimedia.org/r/988410

gerritbot added a project: Patch-For-Review. Jan 8 2024, 9:23 AM

Note that memory starvation (the usual trigger for an OOM) wasn't captured in https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=vrts1001&var-datasource=thanos&var-cluster=misc&from=1704550424958&to=1704556818728&viewPanel=4.

This isn't unexpected in some cases: memory usage is a gauge and our polling interval is 60s, so if memory use increases suddenly between 2 probes and triggers the OOM killer, the spike won't show up in the graphs. There were apparently 2 OOM events, one at 15:00:24 killing clamd and one about 20 seconds later, at 15:00:43, killing some cgi-bin process of apache2 (logs are truncated past a certain string length, so I only see task=/opt/otrs/bin/c).
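To illustrate why a gauge polled every 60s can miss a short spike, here is a minimal Python sketch. The trace values (a 40% baseline with a 20-second spike to 99%) and the helper name `sampled_peak` are made up for illustration; they are not data from vrts1001.

```python
def sampled_peak(usage_by_second, poll_interval=60):
    """Highest value a poller would see if it reads the gauge every poll_interval seconds."""
    return max(usage_by_second[::poll_interval])


# Hypothetical per-second memory usage (%), three minutes long.
trace = [40] * 180
for second in range(70, 90):  # spike sits between the polls at t=60 and t=120
    trace[second] = 99

true_peak = max(trace)           # what actually happened on the host
seen_peak = sampled_peak(trace)  # what the graph would show

print(true_peak, seen_peak)  # 99 40
```

A shorter polling interval narrows the blind spot but does not remove it, so the kernel and systemd logs remain the reliable record of an OOM kill.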

It's interesting that killing clamd wasn't enough. It might mean the email that made it through also somehow overwhelmed the apache2 processes.

Also interesting is that apache2 already has Restart=on-abort

akosiaris@vrts1001:~$ systemctl show --property=Restart apache2.service
Restart=on-abort

which, intuitively, I would expect to have this covered. However, on-abort means restart on an unclean signal, and the main apache2 process did NOT receive the SIGKILL from the OOM killer. It was a child process that did (whereas in the clamd case, it was clamd itself that got the SIGKILL).

Jan 06 15:00:43 vrts1001 systemd[1]: apache2.service: A process of this unit has been killed by the OOM killer.
Jan 06 15:00:45 vrts1001 systemd[1]: apache2.service: Failed with result 'oom-kill'.
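For reference, the restart conditions can be written down as a small lookup. This is a sketch transcribed from the Restart= table in the systemd.service man page; the cause labels and the helper `would_restart` are shortened, illustrative names rather than systemd identifiers.

```python
# Exit causes that lead to an automatic restart for each Restart= value,
# per the table in systemd.service(5). Labels are illustrative shorthand.
ALL_CAUSES = {"clean", "unclean-exit-code", "unclean-signal", "timeout", "watchdog"}

RESTART_ON = {
    "no": set(),
    "on-success": {"clean"},
    "on-failure": {"unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
    "always": set(ALL_CAUSES),
}


def would_restart(policy: str, cause: str) -> bool:
    """True if the given Restart= policy restarts the service for this exit cause."""
    return cause in RESTART_ON[policy]


print(would_restart("on-abort", "unclean-signal"))     # True
print(would_restart("on-abort", "unclean-exit-code"))  # False
```

The 'oom-kill' result in the log above, where only a child process was killed, is not one of the table's rows, which fits with on-abort not firing in this incident.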

The reason systemd did indeed stop the entire service is OOMPolicy=stop:

systemctl show --property=OOMPolicy apache2.service
OOMPolicy=stop

for which docs[1] say

This setting takes one of continue, stop or kill. If set to continue and a process in the unit is killed by the OOM killer, this is logged but the unit continues running. If set to stop the event is logged but the unit is terminated cleanly by the service manager. If set to kill and one of the unit's processes is killed by the OOM killer the kernel is instructed to kill all remaining processes of the unit too, by setting the memory.oom.group attribute to 1; also see kernel page Control Group v2.

stop is probably not what we want here. In fact, scheduled jobs aside, I am not sure why a long-running service would default to stop rather than kill or continue: kill for when the software can't safely recover from such an event, continue for when it can (cgi-bin processes, as children of apache2, tend to be OK with this).

[1] https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html

Change 988410 merged by Jelto:

[operations/puppet@production] vrts: auto-restart apache2 on-failure

https://gerrit.wikimedia.org/r/988410

Maintenance_bot removed a project: Patch-For-Review. Jan 8 2024, 12:30 PM

@akosiaris That's great, thanks so much for digging into those and writing it up!

@Jelto has merged the auto-restart for apache, and we'll work on updating the OOMPolicy as well.

We should also try to guard against crash-loops for anything we're restarting automatically, which I think we can do with StartLimitIntervalSec= and StartLimitBurst=. Maybe 10 restarts over 10 minutes?

Jelto moved this task from Incoming to Work in Progress (Tracking tasks) on the collaboration-services board. Jan 8 2024, 4:27 PM

Jelto merged a task: T354477: ProbeDown (vrts1001).

Jelto added a subscriber: phaultfinder.

In T354478#9441360, @eoghan wrote:

> @akosiaris That's great, thanks so much for digging into those and writing it up!
>
> @Jelto has merged the auto-restart for apache, and we'll work on updating the OOMPolicy as well.

If you do that, you probably want to revert the Restart=on-failure setting and let the Debian-shipped Restart=on-abort take over. The reason is that ExecStart, ExecStop and ExecReload all call apachectl, which returns >0 if ANY error occurs (configuration errors as well as pretty much anything else). I think we probably don't want to automatically restart apache many times in a row when e.g. a config error has been shipped. It's probably going to be confusing to whoever is debugging.

> We should also try to guard against crash-loops for anything we're restarting automatically, which I think we can do with StartLimitIntervalSec= and StartLimitBurst=. Maybe 10 restarts over 10 minutes?

There is already RestartSec=5s in the merged patch, which will have systemd sleep for 5 seconds between restarts. Both of the above settings, at least at their default values (10s and 5 respectively), are rendered moot while that exists: with 5 seconds between restarts, 5 starts can never fit inside a 10-second interval.

Now as to what numbers actually make sense, that's an interesting question. Transient situations that are detectable by systemd (e.g. OOM-killer showing up, a segfault or reception of SIGBUS/SIGABRT) will either not be immediately reproducible or they will be over in a matter of minutes at most. Anything else (detectable by systemd) that persists longer than that tends to require a human to intervene. With RestartSec=5s I'd gravitate around leaving StartLimitBurst at the default value of 5 and setting StartLimitIntervalSec to something like double the product of the other 2 (5 x 5 = 25 seconds, so something like 50 to 60 seconds).
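As a rough check of these numbers, the following Python sketch models a crash loop under a start limit. It assumes a simplified fixed-window counter (the window opens at the first start, and up to StartLimitBurst starts are allowed within StartLimitIntervalSec) and a service that dies immediately after every start. The function name and the model are illustrative, not systemd's actual implementation.

```python
def first_refused_start(restart_sec, interval_sec, burst, horizon_sec=3600):
    """Time (s) at which a start attempt is refused, or None if the limit is never hit.

    Simplified model: the service fails right after each start and is retried
    restart_sec later; starts are counted in a window opened by the first start.
    """
    window_begin = None
    count = 0
    t = 0
    while t <= horizon_sec:
        if window_begin is None or t - window_begin > interval_sec:
            window_begin, count = t, 1  # window expired: open a new one
        elif count < burst:
            count += 1                  # still within the start budget
        else:
            return t                    # start limit hit, crash loop stops
        t += restart_sec
    return None


# Defaults (10s, 5) with RestartSec=5s: the limit never triggers.
print(first_refused_start(5, 10, 5))    # None
# Suggested 50s or 60s interval with burst 5: loop stops at the 6th attempt.
print(first_refused_start(5, 50, 5))    # 25
print(first_refused_start(5, 60, 5))    # 25
# "10 restarts over 10 minutes": loop stops at the 11th attempt.
print(first_refused_start(5, 600, 10))  # 50
```

Under this model the interval has to exceed RestartSec times StartLimitBurst (25 seconds here) for the limit to matter at all, which is why doubling that product leaves some headroom for starts that take a few seconds to fail.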

Jelto merged a task: T354040: ProbeDown - vrts1001. Jan 8 2024, 4:43 PM

Change 988739 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [vrts] Adjust restart and oom policy for clamav and vrts services

https://gerrit.wikimedia.org/r/988739

gerritbot added a project: Patch-For-Review. Jan 8 2024, 11:19 PM

Change 988739 merged by Dzahn:

[operations/puppet@production] [vrts] Adjust restart and oom policy for clamav and vrts services

https://gerrit.wikimedia.org/r/988739

Maintenance_bot removed a project: Patch-For-Review. Jan 9 2024, 3:31 PM

LSobanski lowered the priority of this task from High to Medium. Jan 16 2024, 3:41 PM

Jelto mentioned this in T354479: ticket.wikimedia.org should page when down. Jan 16 2024, 4:27 PM

Dropping the priority as we are currently monitoring the effect of the recent restart and oom policy changes.

I think we've monitored long enough. VRTS has been stable since the recent version upgrade.
