Investigate intermittent delay for basic uwsgi requests.
Closed, Resolved (Public)

Description

In T231222: Address icinga noise from wmflabs, we discovered that the wmflabs ORES cluster has an intermittent delay in serving simple requests from the web nodes. Roughly 0.5% of requests take a few seconds, while the other 99.5% finish in less than 0.01 seconds.

We don't know what causes this, but it doesn't seem to happen on the WMFLabs staging instance. Does it happen in the Beta cluster? Does it happen in prod?
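A minimal sketch of how the slow-request fraction described above could be measured from a client. The function names, the 1-second threshold, and the sample count are assumptions made for illustration; this is not the timing script used in this task.

```python
import time
import urllib.request


def time_request(url, timeout=30.0):
    """Return the wall-clock seconds taken to fetch `url` once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def slow_fraction(latencies, threshold=1.0):
    """Return the fraction of latencies (in seconds) at or above `threshold`."""
    if not latencies:
        return 0.0
    slow = sum(1 for seconds in latencies if seconds >= threshold)
    return slow / len(latencies)


def sample(url, count=1000):
    """Time `count` sequential requests and return the list of latencies."""
    return [time_request(url) for _ in range(count)]
```

Calling `slow_fraction(sample(url))` against a cheap endpoint on each cluster (the URL is whatever simple request you want to probe) gives a number that can be compared across production, Beta, staging, and the WMFLabs web nodes. Sequential requests keep the client from adding its own queueing to the measurement.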

Event Timeline

I confirmed that I do not see the problem:

  • in Production (tested ores1001.eqiad.wmnet)
  • in the Beta cluster (tested deployment-ores01.eqiad.wmflabs)
  • in the Staging "cluster" (tested ores-staging-01.eqiad.wmflabs)

I do see the problem:

  • in the WMFLabs cluster on ores-web-(01,02,03)

Recently, we've been getting hit with a lot more requests, which could be related. See https://grafana-labs-admin.wikimedia.org/d/000000006/ores-labs?orgId=1&panelId=1&fullscreen&edit&tab=metrics&from=1566199557016&to=1568064680013

Our request rate doubled on Sept. 1st and seems to be pretty steady.

I got uwsgitop running on ores-web-01. I was able to confirm that, while the timing script is hanging, the majority of workers are idle. It seems that whatever routes requests out to the workers is getting jammed up. I imagine this is related to the "queue full" warnings we see periodically.
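The "idle workers but stuck requests" observation can also be checked from the JSON that the uWSGI stats server emits, which is the same source uwsgitop reads. The sketch below assumes the stats document has top-level `listen_queue` and `listen_queue_errors` fields and a `workers` list whose entries carry a `status` of `idle` or `busy`; verify those field names against the output of your uWSGI version. The helper names and the 50% idle ratio are hypothetical.

```python
import json
from collections import Counter


def summarize_uwsgi_stats(raw_json):
    """Summarize worker states and listen-queue depth from stats-server JSON."""
    stats = json.loads(raw_json)
    workers = stats.get("workers", [])
    states = Counter(worker.get("status", "unknown") for worker in workers)
    return {
        "workers": len(workers),
        "idle": states.get("idle", 0),
        "busy": states.get("busy", 0),
        "listen_queue": stats.get("listen_queue", 0),
        "listen_queue_errors": stats.get("listen_queue_errors", 0),
    }


def looks_jammed(summary, idle_ratio=0.5):
    """True when requests are queued while at least `idle_ratio` of workers are idle."""
    if summary["workers"] == 0:
        return False
    mostly_idle = summary["idle"] / summary["workers"] >= idle_ratio
    return mostly_idle and summary["listen_queue"] > 0
```

A snapshot where `looks_jammed` is true while a timing request is hanging would match what uwsgitop showed here: requests waiting in the listen queue while workers sit idle.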

From: https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html

If your (Linux) server seems to have lots of idle workers, but performance is still sub-par, you may want to look at the value of the ip_conntrack_max system variable (/proc/sys/net/ipv4/ip_conntrack_max) and increase it to see if it helps.

Seems relevant. I can't find this variable on the web nodes though.

It seems like that variable is deprecated and we'll need to enable a kernel module to work with the new one. See https://bugs.launchpad.net/swift/+bug/1354909

$ sudo sysctl net.netfilter.nf_conntrack_max
sysctl: cannot stat /proc/sys/net/netfilter/nf_conntrack_max: No such file or directory
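The same check can be scripted so that it tries both the current and the legacy name and reports cleanly when neither exists, as on these web nodes. This is a sketch only; the function name is hypothetical, and a missing file most likely just means the conntrack module is not loaded.

```python
from pathlib import Path

CONNTRACK_PATHS = (
    "/proc/sys/net/netfilter/nf_conntrack_max",  # current name
    "/proc/sys/net/ipv4/ip_conntrack_max",       # legacy name quoted in the uWSGI docs
)


def read_conntrack_max(paths=CONNTRACK_PATHS):
    """Return (path, value) for the first readable limit, or None if none is found."""
    for path in paths:
        try:
            return path, int(Path(path).read_text().strip())
        except (OSError, ValueError):
            # Missing file, unreadable file, or unexpected contents: try the next name.
            continue
    return None
```

A return value of `None` corresponds to the "No such file or directory" error above.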

Change 537420 had a related patch set uploaded (by Halfak; owner: Halfak):
[operations/puppet@production] Switches ores.wmflabs monitoring to use new ores-web-(04,05,06)

https://gerrit.wikimedia.org/r/537420

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.

It looks like the solution here is to reduce memory pressure on the machines. We've started up three new VMs with more memory, and the issue seems to be resolved.
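One rough way to keep an eye on memory pressure on a web node is to compare `MemAvailable` with `MemTotal` in `/proc/meminfo`. The sketch below is an illustration only, not how the pressure was diagnosed in this task; it assumes a kernel that reports `MemAvailable`, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values keyed by field name."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = meminfo.get("MemTotal")
    available = meminfo.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total
```

On a node, pass the contents of `/proc/meminfo` to `parse_meminfo`. A persistently small fraction on the old web nodes, compared with the new ones, would be consistent with the explanation above.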

Change 537420 merged by Alexandros Kosiaris:
[operations/puppet@production] Switches ores.wmflabs monitoring to use new ores-web-(04,05,06)

https://gerrit.wikimedia.org/r/537420