
Address icinga noise from wmflabs
Closed, Resolved · Public

Description

It doesn't look like we are actually constantly going down. What is all the noise about? How can we reduce it?

Event Timeline

I registered #wikimedia-ores and set a topic.

Halfak renamed this task from "Re-route labs-related icinga to #wikimedia-ores" to "Address icinga noise from wmflabs". Sep 4 2019, 3:43 PM
Halfak updated the task description.

I just restarted nginx on ores-lb-03 to see if that was what was causing the problem. Next I'll be looking into getting a clean deploy into wmflabs.

Also, I'm seeing a lot of lines that look like this in the uwsgi logs:

[2019-09-04T14:38:43] statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pusher_statsd/plugin.c line 40]

I'm also seeing some lines like this:

[2019-09-04T06:25:41] fork(): Cannot allocate memory [core/master_utils.c line 729]

Looks like there are periodic timeouts coming from the wmflabs system. I ran a script to make 1,000 requests to the fakewiki pseudo-model (which should return right away) and got many requests not responding for upwards of 60 seconds! Summary of response times (seconds):

Min.   : 0.1881  
1st Qu.: 0.2194  
Median : 0.3793  
Mean   : 2.3360  
3rd Qu.: 2.8895  
Max.   :61.2457

No request should hang for more than 15 seconds, no matter what, because of the timeouts we have in place. Yet about 2% of the requests (19/1000) hung for more than 15 seconds.
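For reference, the timing loop was essentially this shape. This is only a sketch: the endpoint URL, request count, and timeout here are placeholders rather than the exact script I ran.

import time
import statistics
import requests

# Placeholder URL for a fakewiki pseudo-model request; not necessarily the exact endpoint used.
URL = "https://ores.wmflabs.org/v3/scores/fakewiki/123456/fake"
timings = []

for _ in range(1000):
    start = time.monotonic()
    try:
        requests.get(URL, timeout=120)
    except requests.RequestException:
        pass  # keep the elapsed time for failed requests too
    timings.append(time.monotonic() - start)

print("min:   ", min(timings))
print("median:", statistics.median(timings))
print("mean:  ", statistics.mean(timings))
print("max:   ", max(timings))
print("over 15s:", sum(1 for t in timings if t > 15))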

I'm curious if the timeouts are only seen from a single virtual machine, or if you're seeing the same results from multiple ORES virtual machines. Could you please run your script on a couple of different hosts and share the results with host names?

@JHedden, I'm running this script from my local machine. But I imagine icinga is making its requests from a different location than me and it's getting the same behavior. Right now, I'm trying to eliminate various explanations for the cause of the delay.

I ran a request that should be served straight from the ORES Redis cache 1,000 times and got similar results. Here's a snippet (response times in seconds):

0.3344287872314453
0.34113192558288574
0.31957292556762695
0.34860801696777344
0.3475968837738037
0.36787891387939453
0.36493802070617676
0.35985302925109863
0.3482813835144043
0.34537506103515625
1.3687825202941895
0.3382432460784912
0.35638856887817383
4.523691654205322
34.91725969314575
28.653353214263916
0.3615725040435791
0.35713720321655273
20.093085765838623
10.441282510757446
10.40427017211914
12.945164918899536
10.117834091186523
6.772924184799194
8.29482626914978
10.932244777679443
0.3201889991760254
0.3156282901763916
0.31780529022216797
0.31928396224975586
0.31522202491760254
0.32955360412597656
0.3060317039489746
0.31044483184814453
0.3071455955505371
0.3259284496307373
0.3149731159210205
0.3131980895996094
0.32795023918151855
0.32677674293518066
0.31515979766845703
0.36296677589416504
0.31809473037719727
0.30800867080688477

So the delay is hitting us even when we circumvent celery entirely; this request still hits redis to fetch the previously generated score. Next I want to try requests against the main page or something else that doesn't need to consult redis or celery at all.

This is a snippet from requesting model_info:

0.3350839614868164
0.3268578052520752
0.3501322269439697
0.3274261951446533
0.35183024406433105
0.3265092372894287
0.3343515396118164
0.32044053077697754
0.3266129493713379
0.3357844352722168
0.3429696559906006
0.3407778739929199
0.32950711250305176
0.33585238456726074
0.32996439933776855
0.33040285110473633
0.3498098850250244
0.3398115634918213
0.39438486099243164
3.5465455055236816
5.110769987106323
10.264256000518799
9.573215961456299
6.922254800796509
10.505949020385742
11.980973243713379
6.962685585021973
11.637609481811523
13.119173526763916
20.195621728897095
8.294030666351318
20.778366327285767
19.46211314201355

This doesn't touch redis or celery. Same problem.

I get a similar pattern running this on the ores-web nodes directly. I don't see any CPU spikes or increases in iowait. What the heck?

Confirmed that even locally, I see:

0.00615000724792
0.00727391242981
0.00678396224976
0.00703287124634
0.00627303123474
0.00723099708557
0.219942092896
0.812291145325
1.29059314728
0.680955886841
2.28662204742
1.30370402336
2.19436597824
0.312132120132
0.0107660293579
0.0107569694519
0.0110359191895
0.0114378929138
0.0110859870911
0.0118498802185
0.00923895835876

Let's try something even simpler -- like the main page.

4.0565340519
1.35831594467
2.44845795631
0.72114610672
0.00227880477905
0.00223803520203
0.00190401077271
0.00235319137573
0.00181698799133

So I've worked out that this issue does not happen on staging, and I think that's because staging sends its metrics to localhost. My current best hypothesis is that we're blocking while waiting to send metrics to labmon1001. I've reached out to the Cloud Services people to find out if there could be some issue on the labmon1001 side. @JHedden and @bd808 have been helping me make sense of where this is failing, but we haven't found any clear issues yet.
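For context, each statsd metric is just a small UDP datagram sent to the metrics host. A minimal sketch of a single counter send is below; the hostname and port are placeholders (not necessarily what our config points at), and the non-blocking socket is an assumption to show where an EAGAIN ("Resource temporarily unavailable", as in the uwsgi log above) would surface.

import socket

# Placeholder statsd destination; host/port are assumptions for illustration.
STATSD_ADDR = ("labmon1001.eqiad.wmnet", 8125)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)  # assumption: non-blocking, so a full send buffer raises EAGAIN

try:
    # A statsd counter is a one-line plaintext datagram: "<name>:<value>|c"
    sock.sendto(b"ores.example.requests:1|c", STATSD_ADDR)
except BlockingIOError:
    # errno EAGAIN, i.e. "Resource temporarily unavailable" -- the same error
    # text the uwsgi statsd pusher is logging
    pass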

JH suggested that maybe we're creating too many new metrics, so I worked on T232164 (ORES is creating a lot of metrics due to revid count-based metrics) to cut out our primary source of random new metric paths.
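To illustrate the problem (with hypothetical metric names, not our actual paths): embedding a revision ID in the metric name creates a brand-new Graphite/whisper series for every revision, whereas a fixed name keeps the path count bounded.

# Hypothetical examples only -- not ORES's real metric names.
rev_id = 912345678

bad_metric = "ores.enwiki.revision.{0}.scored".format(rev_id)
# -> "ores.enwiki.revision.912345678.scored": a new whisper file per revision

good_metric = "ores.enwiki.revisions_scored"
# -> one series, incremented once per scored revision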

It's clear that the errors we are seeing are coming from uwsgi's statsd plugin. I don't know whether our own statsd metrics code in Python is hitting the same problem, since I imagine an exception there would be reported differently.

I've looked at packet captures and traced processes on both ends; as far as I can tell, the metrics are being sent and stored correctly in Graphite's whisper database.

I spot-checked a few uwsgi metrics and confirmed there are no gaps or null data:

$ whisper-fetch ores/ores-web-01/uwsgi/worker/0/requests/count.wsp | wc -l
1440

Do you have an example metric that is not getting updated? I'd also recommend disabling stats-push, restarting the app, and confirming whether the ~10 second delay is related to stats-push.
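For reference, the uwsgi setting in question is stats-push. A sketch of the sort of line to comment out for that test is below; the host, port, and prefix are placeholders, not the actual production config.

[uwsgi]
# placeholder sketch, not ORES's real config
plugin = stats_pusher_statsd
# comment this line out and restart to see whether the delay follows stats-push:
# stats-push = statsd:labmon1001.eqiad.wmnet:8125,ores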

Also note that I found a number of out-of-memory events on ores-web-01. I don't think they're directly related, but they could be causing some additional pain.

$ zgrep -c oom-killer /var/log/syslog.*
/var/log/syslog.1:5
/var/log/syslog.2.gz:22
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:15
/var/log/syslog.6.gz:3
/var/log/syslog.7.gz:0

OK! I cut the number of web workers per node from 48 to 36, and both the OOM and statsd noise have dropped. I've tentatively re-enabled icinga-wm in the #wikimedia-ai channel and will keep monitoring.
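For reference, the change amounts to lowering the uwsgi worker count. A hedged ini sketch is below; the option name is standard uwsgi, but the file layout is a placeholder rather than the actual ORES configuration.

[uwsgi]
# placeholder sketch of the worker-count change, not the actual ORES config
# processes = 48   # previous value: too many forks for the available memory
processes = 36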

Halfak closed this task as Resolved. Sep 6 2019, 9:33 PM
Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.
Halfak added a subscriber: ACraze.

That didn't work for us. But while exploring this with @ACraze, we noticed that celery was only running 4 worker processes, so we bumped that up to 16. We also deployed code updates so the system matches the production configuration.
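A minimal sketch of the kind of change this describes, assuming a standard Celery setup; the app name and broker URL are placeholders, not ORES's actual wiring.

from celery import Celery

# Placeholder app and broker; the point is only the worker_concurrency setting.
app = Celery("ores_sketch", broker="redis://localhost:6379/0")

# Previously only 4 worker processes were running; raise that to 16.
app.conf.worker_concurrency = 16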

Right now, we're still seeing some periodic slowdowns but nothing that would trigger icinga. So I'm declaring this done and filing a follow-up task: T232228: Investigate intermittent delay for basic uwsgi requests.