Investigate intermittent delay for basic uwsgi requests.
Closed, Resolved, Public
Authored By

	Halfak
	Sep 6 2019, 9:33 PM

Description

In T231222: Address icinga noise from wmflabs, we discovered that the wmflabs ORES cluster has an intermittent delay in serving simple requests from the web nodes. Essentially, about 0.5% of requests take a few seconds, while the other 99.5% finish in less than 0.01 seconds.

We don't know what causes this, but we know it doesn't seem to happen on the WMFLabs staging instance. Does it happen in the Beta cluster? Does it happen in prod?
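One way to check each environment is to time many sequential requests against a cheap endpoint and count the slow ones. The following is a minimal sketch, not the script used in this investigation; the function names and the 1-second threshold are assumptions chosen for illustration.

```python
import time
import urllib.request


def time_request(url, timeout=30.0):
    """Fetch url once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def measure(url, count):
    """Time `count` sequential requests to url."""
    return [time_request(url) for _ in range(count)]


def summarize(durations, slow_threshold=1.0):
    """Report how many durations are at or above slow_threshold seconds."""
    slow = sum(1 for d in durations if d >= slow_threshold)
    return {
        "requests": len(durations),
        "slow": slow,
        "slow_fraction": slow / len(durations) if durations else 0.0,
        "max_seconds": max(durations, default=0.0),
    }
```

Calling `summarize(measure(url, 1000))` with the URL of a simple endpoint on the node under test would give a slow fraction near 0.005 if the node shows the behaviour described above.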

Details

	Subject	Repo	Branch	Lines +/-
	Switches ores.wmflabs monitoring to use new ores-web-(04,05,06)	operations/puppet	production	+1 -1

Related Objects

Mentioned In: T231222: Address icinga noise from wmflabs
Mentioned Here: T231222: Address icinga noise from wmflabs

Event Timeline

Halfak created this task. Sep 6 2019, 9:33 PM

Restricted Application added a subscriber: Aklapper. Sep 6 2019, 9:33 PM

Halfak mentioned this in T231222: Address icinga noise from wmflabs. Sep 6 2019, 9:33 PM

I confirmed that I do not see the problem:

in Production (tested ores1001.eqiad.wmnet)
in the Beta cluster (tested deployment-ores01.eqiad.wmflabs)
in the Staging "cluster" (tested ores-staging-01.eqiad.wmflabs)

I do see the problem:

in the WMFLabs cluster on ores-web-(01,02,03)

Recently, we've been receiving many more requests, which could be related. See https://grafana-labs-admin.wikimedia.org/d/000000006/ores-labs?orgId=1&panelId=1&fullscreen&edit&tab=metrics&from=1566199557016&to=1568064680013

Our request rate doubled on Sept. 1st and has stayed fairly steady since.

I got uwsgitop running on ores-web-01. I was able to confirm that, while the timing script is hanging, the majority of workers are idle. It seems that whatever routes requests out to the workers is getting jammed up. I imagine this is related to us periodically seeing "queue full" warnings.
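The same check can be scripted against the uWSGI stats server, which is what uwsgitop reads from. The sketch below assumes the stats server is enabled on a TCP address (the `--stats` option) and that the JSON it returns has a top-level `listen_queue` value and a `workers` list whose entries carry a `status` field. Those field names are how I recall the stats output and should be verified against a live instance. The function names are hypothetical.

```python
import json
import socket


def read_stats(host, port, timeout=5.0):
    """Read the JSON document a uWSGI stats server sends on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after one document
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def summarize_workers(stats):
    """Count idle workers and report the listen queue depth.

    Assumes the field names "workers", "status" and "listen_queue".
    """
    workers = stats.get("workers", [])
    idle = sum(1 for w in workers if w.get("status") == "idle")
    return {
        "workers": len(workers),
        "idle": idle,
        "not_idle": len(workers) - idle,
        "listen_queue": stats.get("listen_queue", 0),
    }
```

A snapshot with mostly idle workers and a non-zero `listen_queue`, taken while a request is hanging, would match the observation above.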

From: https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html

"If your (Linux) server seems to have lots of idle workers, but performance is still sub-par, you may want to look at the value of the ip_conntrack_max system variable (/proc/sys/net/ipv4/ip_conntrack_max) and increase it to see if it helps."

Seems relevant. I can't find this variable on the web nodes though.

It seems like that variable is deprecated and we'll need to enable a kernel module to work with the new one. See https://bugs.launchpad.net/swift/+bug/1354909

$ sudo sysctl net.netfilter.nf_conntrack_max
sysctl: cannot stat /proc/sys/net/netfilter/nf_conntrack_max: No such file or directory
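Because the variable has had more than one name, it can help to probe every known location before concluding it is absent. This is a rough sketch: the candidate list is an assumption based on the two names mentioned above plus `net/nf_conntrack_max`, which I believe some kernels also expose. The `sysctl_root` parameter is a hypothetical convenience for testing. If no file is found, the connection tracking module is probably not loaded, as the error above suggests.

```python
from pathlib import Path

# Candidate locations relative to /proc/sys, newest name first (assumed list).
CANDIDATES = (
    "net/netfilter/nf_conntrack_max",
    "net/nf_conntrack_max",
    "net/ipv4/ip_conntrack_max",
)


def find_conntrack_max(sysctl_root="/proc/sys"):
    """Return (path, value) for the first conntrack limit found, else None."""
    root = Path(sysctl_root)
    for relative in CANDIDATES:
        path = root / relative
        if path.is_file():
            return str(path), int(path.read_text().strip())
    return None
```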

Halfak claimed this task. Sep 11 2019, 9:02 PM

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

Change 537420 had a related patch set uploaded (by Halfak; owner: Halfak):
[operations/puppet@production] Switches ores.wmflabs monitoring to use new ores-web-(04,05,06)

https://gerrit.wikimedia.org/r/537420

gerritbot added a project: Patch-For-Review. Sep 17 2019, 1:43 PM

It looks like the solution here is to reduce memory pressure on the machines. We've started up 3 new VMs with more memory and the issue seems to be resolved.
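To compare memory headroom between the old and new VMs, one option is to read `/proc/meminfo` on each and compute the share of memory still available. This is only an illustrative sketch and was not part of the investigation. It assumes a kernel that reports `MemAvailable`, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # most fields are in kB
    return values


def available_fraction(values):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = values.get("MemTotal")
    available = values.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total
```

On a node, `available_fraction(parse_meminfo(open("/proc/meminfo").read()))` gives a single number that can be compared between hosts. A value close to zero would be consistent with memory pressure.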

Change 537420 merged by Alexandros Kosiaris:
[operations/puppet@production] Switches ores.wmflabs monitoring to use new ores-web-(04,05,06)