Description
It doesn't look like we are actually constantly going down. What is all the noise about? How can we reduce it?
Related tasks:
- T232228: Investigate intermittent delay for basic uwsgi requests
- T232164: ORES is creating a lot of metrics. This is due to revid count-based metrics
- T189605: Beta cluster ORES is emitting statsd errors
- T218567: Fix statsd config handling in ores
Event Timeline
I just restarted nginx on ores-lb-03 to see if that was what was causing the problem. Next I'll be looking into getting a clean deploy into wmflabs.
Also, I'm seeing a lot of lines that look like this in the uwsgi logs:
[2019-09-04T14:38:43] statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pusher_statsd/plugin.c line 40]
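That "Resource temporarily unavailable" is EAGAIN from a non-blocking sendto() on the plugin's UDP socket, i.e. the kernel send buffer was full at the moment uwsgi tried to push a metric. The real code is C inside uwsgi's stats_pusher_statsd plugin; this is just a hypothetical Python sketch of the same fire-and-forget, drop-on-EAGAIN pattern:

```python
import socket

class NonBlockingStatsd:
    """Fire-and-forget statsd counter client (illustrative sketch only).

    Mirrors what uwsgi's statsd pusher does: send UDP without blocking,
    and if the kernel send buffer is full (EAGAIN), drop the metric
    rather than stall a worker.
    """

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.setblocking(False)
        self.dropped = 0

    def incr(self, name, value=1):
        payload = f"{name}:{value}|c".encode()
        try:
            self.sock.sendto(payload, self.addr)
        except BlockingIOError:
            # EAGAIN -> "Resource temporarily unavailable" in the uwsgi log
            self.dropped += 1
```

Because the socket is UDP and unconnected, a missing statsd listener doesn't raise here; only send-buffer pressure does.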
I'm also seeing some of these:
[2019-09-04T06:25:41] fork(): Cannot allocate memory [core/master_utils.c line 729]
Looks like we're restarting once per day: https://grafana-labs-admin.wikimedia.org/d/000000006/ores-labs?orgId=1&panelId=6&fullscreen&from=1567046937558&to=1567595497214
Looks like there are periodic timeouts coming from the wmflabs system. I ran a script to make 1000 requests to the fakewiki pseudo-model (which should return right away) and I got many requests not responding for upwards of 60 seconds!
Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
0.1881   0.2194   0.3793   2.3360   2.8895   61.2457
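The script itself wasn't attached to the task; a minimal sketch of that kind of timing harness (the URL below is a stand-in for the actual labs endpoint, and the summary mimics R's summary() output seen above):

```python
import statistics
import time

def time_requests(fetch, n=1000):
    """Call fetch() n times; return per-request wall-clock latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.monotonic()
        fetch()
        latencies.append(time.monotonic() - start)
    return latencies

def summarize(latencies):
    """Six-number summary, similar to R's summary()."""
    q1, median, q3 = statistics.quantiles(latencies, n=4)
    return {
        "min": min(latencies),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(latencies),
        "q3": q3,
        "max": max(latencies),
    }

if __name__ == "__main__":
    # Demo with a fake fetch; against the real service you would pass e.g.
    #   lambda: urllib.request.urlopen(ORES_URL, timeout=90).read()
    # where ORES_URL is the fakewiki pseudo-model endpoint (placeholder).
    print(summarize(time_requests(lambda: time.sleep(0.001), n=50)))
```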
No request should hang for more than 15 seconds no matter what because of the timeouts we have in place. It looks like about 2% of the requests (19/1000) hung for more than 15 seconds.
I'm curious if the timeouts are only seen from a single virtual machine, or if you're seeing the same results from multiple ORES virtual machines. Could you please run your script on a couple of different hosts and share the results with host names?
@JHedden, I'm running this script from my local machine. But I imagine icinga is making its requests from a different location than me and it's getting the same behavior. Right now, I'm trying to eliminate various explanations for the cause of the delay.
I just ran a request that should have been drawn from ORES redis cache 1000 times and got similar results. Here's a snippet:
0.3344287872314453 0.34113192558288574 0.31957292556762695 0.34860801696777344 0.3475968837738037 0.36787891387939453 0.36493802070617676 0.35985302925109863 0.3482813835144043 0.34537506103515625 1.3687825202941895 0.3382432460784912 0.35638856887817383 4.523691654205322 34.91725969314575 28.653353214263916 0.3615725040435791 0.35713720321655273 20.093085765838623 10.441282510757446 10.40427017211914 12.945164918899536 10.117834091186523 6.772924184799194 8.29482626914978 10.932244777679443 0.3201889991760254 0.3156282901763916 0.31780529022216797 0.31928396224975586 0.31522202491760254 0.32955360412597656 0.3060317039489746 0.31044483184814453 0.3071455955505371 0.3259284496307373 0.3149731159210205 0.3131980895996094 0.32795023918151855 0.32677674293518066 0.31515979766845703 0.36296677589416504 0.31809473037719727 0.30800867080688477
So the delay is hitting us even when we circumvent celery entirely. This request still hits redis to get the previously generated score. Next I want to try a set of requests that hit the main page or something that doesn't even need to consult celery.
This is a snippet from requesting model_info:
0.3350839614868164 0.3268578052520752 0.3501322269439697 0.3274261951446533 0.35183024406433105 0.3265092372894287 0.3343515396118164 0.32044053077697754 0.3266129493713379 0.3357844352722168 0.3429696559906006 0.3407778739929199 0.32950711250305176 0.33585238456726074 0.32996439933776855 0.33040285110473633 0.3498098850250244 0.3398115634918213 0.39438486099243164 3.5465455055236816 5.110769987106323 10.264256000518799 9.573215961456299 6.922254800796509 10.505949020385742 11.980973243713379 6.962685585021973 11.637609481811523 13.119173526763916 20.195621728897095 8.294030666351318 20.778366327285767 19.46211314201355
This doesn't touch redis or celery. Same problem.
I get a similar pattern running this on the ores-web nodes directly. I don't see any CPU spikes or increases in iowait. What the heck?
Confirmed that even locally, I see:
0.00615000724792 0.00727391242981 0.00678396224976 0.00703287124634 0.00627303123474 0.00723099708557 0.219942092896 0.812291145325 1.29059314728 0.680955886841 2.28662204742 1.30370402336 2.19436597824 0.312132120132 0.0107660293579 0.0107569694519 0.0110359191895 0.0114378929138 0.0110859870911 0.0118498802185 0.00923895835876
Let's try something even simpler -- like the main page.
4.0565340519 1.35831594467 2.44845795631 0.72114610672 0.00227880477905 0.00223803520203 0.00190401077271 0.00235319137573 0.00181698799133
So I've worked out that this issue does not happen on staging. I think that is because staging is sending metrics to localhost. So my newest and bestest hypothesis is that we're blocking on waiting to send metrics to labmon1001. I've reached out to the Cloud Services people about labmon1001 to find out if there could be some issue there. @JHedden and @bd808 have been helping me make sense of where this is failing but we haven't found any clear issues.
JH suggested that maybe we're creating too many new metrics, so I worked on T232164: ORES is creating a lot of metrics. This is due to revid count-based metrics to cut out our primary source of random new metric paths.
It's clear that the errors we are seeing are coming from uwsgi's statsd plugin. I don't know if we're seeing any messages from our own statsd metrics code in python as I imagine they would be reported differently should an exception be thrown.
I've looked at packet captures and traced processes on both ends; as far as I can tell, the metrics are being sent and stored correctly in graphite's whisper database.
I spot checked a few UWSGI metrics and confirmed there are no gaps or null data:
$ whisper-fetch ores/ores-web-01/uwsgi/worker/0/requests/count.wsp | wc -l
1440
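For reference, whisper-fetch prints one timestamp/value line per retention step, with "None" for missing points, so (assuming the same path and whisper-fetch's usual --from flag) counting gaps directly would look like:

```shell
# Count null datapoints over the last 24h of the same whisper file
whisper-fetch --from=$(date -d '24 hours ago' +%s) \
    ores/ores-web-01/uwsgi/worker/0/requests/count.wsp | grep -c 'None'
```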
Do you have an example metric that is not getting updated? I'd also recommend disabling stats-push, restarting the app and confirming the ~10 second delay is related to the stats-push.
also note that I found a number of out of memory events on ores-web-01 that I don't think are directly related, but could likely be causing some additional pain.
$ zgrep -c oom-killer /var/log/syslog.*
/var/log/syslog.1:5
/var/log/syslog.2.gz:22
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:15
/var/log/syslog.6.gz:3
/var/log/syslog.7.gz:0
OK! So I cut the number of web workers per node from 48 to 36, and both the oom and statsd noise dropped. I've tentatively re-enabled icinga-wm in the #wikimedia-ai channel and will keep monitoring.
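For anyone reproducing this: the worker count lives in the uwsgi config. A hypothetical fragment (the path and surrounding options are placeholders; the actual deployment is puppet-managed and may differ):

```ini
; /etc/uwsgi/apps-enabled/ores.ini  (hypothetical path)
[uwsgi]
; was 48; fewer workers -> less memory pressure -> fewer oom-killer events
processes = 36
```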
That didn't work for us. But in the process of exploring this with @ACraze, we noticed that the celery workers only had 4 processes running. So we bumped that up to 16. We also deployed code updates so that the system matches the production configuration.
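The celery-side change is the worker concurrency. A hypothetical invocation using celery's standard CLI flags (the app module name is a placeholder; the actual service config may set this elsewhere):

```shell
# was effectively 4 worker processes; bump to 16
celery worker -A <ores_app_module> --concurrency=16
```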
Right now, we're still seeing some periodic slowdowns but nothing that would trigger icinga. So I'm declaring this done and filing a follow-up task: T232228: Investigate intermittent delay for basic uwsgi requests.