
ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart)
Closed, Resolved · Public

Description

Today celery was killed repeatedly on the ORES codfw hosts:

ores2001.codfw.wmnet: Jan 14 06:25:15 ores2001 kernel: [20960005.387705] Out of memory: Kill process 10948 (celery) score 36 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:20 ores2001 kernel: [20960009.490598] Out of memory: Kill process 11429 (celery) score 36 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:23 ores2001 kernel: [20960013.535409] Out of memory: Kill process 14602 (celery) score 36 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:29 ores2001 kernel: [20960014.446386] Out of memory: Kill process 11256 (celery) score 35 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:29 ores2001 kernel: [20960015.442553] Out of memory: Kill process 20074 (celery) score 35 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:29 ores2001 kernel: [20960017.391891] Out of memory: Kill process 13538 (celery) score 35 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:29 ores2001 kernel: [20960019.497964] Out of memory: Kill process 14019 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:29 ores2001 kernel: [20960019.557037] Out of memory: Kill process 15973 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:31 ores2001 kernel: [20960020.429315] Out of memory: Kill process 13777 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:31 ores2001 kernel: [20960021.413731] Out of memory: Kill process 13693 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:39 ores2001 kernel: [20960024.471984] Out of memory: Kill process 10363 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:39 ores2001 kernel: [20960026.468521] Out of memory: Kill process 10533 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:39 ores2001 kernel: [20960028.431911] Out of memory: Kill process 10630 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:43 ores2001 kernel: [20960031.406687] Out of memory: Kill process 11099 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:43 ores2001 kernel: [20960033.151720] Out of memory: Kill process 11508 (celery) score 34 or sacrifice child
ores2001.codfw.wmnet: Jan 14 06:25:44 ores2001 kernel: [20960034.367117] Out of memory: Kill process 10868 (celery) score 34 or sacrifice child

ores2002.codfw.wmnet: Jan 14 06:25:29 ores2002 kernel: [20960564.242765] Out of memory: Kill process 24990 (celery) score 35 or sacrifice child
ores2002.codfw.wmnet: Jan 14 06:25:30 ores2002 kernel: [20960565.541888] Out of memory: Kill process 23085 (celery) score 35 or sacrifice child
ores2002.codfw.wmnet: Jan 14 06:25:32 ores2002 kernel: [20960567.298123] Out of memory: Kill process 21942 (celery) score 34 or sacrifice child
ores2002.codfw.wmnet: Jan 14 06:25:35 ores2002 kernel: [20960570.294081] Out of memory: Kill process 22547 (celery) score 34 or sacrifice child
ores2002.codfw.wmnet: Jan 14 06:25:36 ores2002 kernel: [20960571.285116] Out of memory: Kill process 22791 (celery) score 34 or sacrifice child
ores2002.codfw.wmnet: Jan 14 06:25:37 ores2002 kernel: [20960572.043351] Out of memory: Kill process 25893 (celery) score 34 or sacrifice child

ores2005.codfw.wmnet: Jan 14 06:25:08 ores2005 kernel: [20957911.232125] Out of memory: Kill process 25376 (celery) score 41 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:15 ores2005 kernel: [20957916.541963] Out of memory: Kill process 30622 (celery) score 37 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:21 ores2005 kernel: [20957921.581982] Out of memory: Kill process 29757 (celery) score 36 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:21 ores2005 kernel: [20957923.510427] Out of memory: Kill process 30451 (celery) score 36 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:23 ores2005 kernel: [20957925.286007] Out of memory: Kill process 25932 (celery) score 35 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:23 ores2005 kernel: [20957926.367625] Out of memory: Kill process 19148 (celery) score 35 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:24 ores2005 kernel: [20957927.283672] Out of memory: Kill process 32869 (celery) score 35 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:27 ores2005 kernel: [20957928.412759] Out of memory: Kill process 28230 (celery) score 35 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:27 ores2005 kernel: [20957929.421421] Out of memory: Kill process 30671 (celery) score 35 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:29 ores2005 kernel: [20957931.354109] Out of memory: Kill process 28560 (celery) score 35 or sacrifice child
ores2005.codfw.wmnet: Jan 14 06:25:31 ores2005 kernel: [20957933.294028] Out of memory: Kill process 28387 (celery) score 35 or sacrifice child

ores2008.codfw.wmnet: Jan 14 06:25:19 ores2008 kernel: [20956860.721316] Out of memory: Kill process 30348 (celery) score 36 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:20 ores2008 kernel: [20956862.774618] Out of memory: Kill process 32102 (celery) score 36 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:24 ores2008 kernel: [20956865.918839] Out of memory: Kill process 38756 (celery) score 36 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:28 ores2008 kernel: [20956869.855087] Out of memory: Kill process 6424 (celery) score 35 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:28 ores2008 kernel: [20956869.906351] Out of memory: Kill process 4290 (celery) score 35 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:30 ores2008 kernel: [20956872.729098] Out of memory: Kill process 38836 (celery) score 35 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:31 ores2008 kernel: [20956873.701863] Out of memory: Kill process 30511 (celery) score 35 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:32 ores2008 kernel: [20956874.699207] Out of memory: Kill process 33477 (celery) score 35 or sacrifice child
ores2008.codfw.wmnet: Jan 14 06:25:35 ores2008 kernel: [20956877.735192] Out of memory: Kill process 33768 (celery) score 35 or sacrifice child

ores2009.codfw.wmnet: Jan 14 06:25:10 ores2009 kernel: [20955449.877789] Out of memory: Kill process 738 (celery) score 39 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:10 ores2009 kernel: [20955451.110142] Out of memory: Kill process 7289 (celery) score 36 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:14 ores2009 kernel: [20955455.424752] Out of memory: Kill process 7209 (celery) score 35 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:18 ores2009 kernel: [20955459.592900] Out of memory: Kill process 11906 (celery) score 35 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:23 ores2009 kernel: [20955463.567749] Out of memory: Kill process 1323 (celery) score 35 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:27 ores2009 kernel: [20955468.150691] Out of memory: Kill process 509 (celery) score 35 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:35 ores2009 kernel: [20955472.300157] Out of memory: Kill process 9055 (celery) score 35 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:38 ores2009 kernel: [20955476.284701] Out of memory: Kill process 3433 (celery) score 35 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:38 ores2009 kernel: [20955477.440998] Out of memory: Kill process 4931 (celery) score 34 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:38 ores2009 kernel: [20955478.353819] Out of memory: Kill process 372 (celery) score 34 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:38 ores2009 kernel: [20955479.289623] Out of memory: Kill process 1184 (celery) score 34 or sacrifice child
ores2009.codfw.wmnet: Jan 14 06:25:39 ores2009 kernel: [20955480.202927] Out of memory: Kill process 10438 (celery) score 34 or sacrifice child

Metrics around that time: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1578980790712&to=1578985954606

From /srv/log/ores/main.log there seem to be a ton of ChangeProp-related entries like:

[2020-01-13T06:25:57] [pid: 8261] 10.64.16.21 (-) {30 vars in 418 bytes} [Mon Jan 13 06:25:54 2020] POST /v3/precache => generated 2239 bytes in 474 msecs (HTTP/1.1 200) 8 headers in 288 bytes (1 switches on core 0) user agent "ChangePropagation/WMF"

On ores2001 I moved the whole /srv/log/ores dir to /home/elukey/14012020_celery_oom to prevent logrotate from deleting data that may be needed for debugging.

Event Timeline

elukey triaged this task as High priority. Jan 14 2020, 7:31 AM
elukey created this task.

When looking into this, I came across: https://docs.celeryproject.org/en/latest/userguide/workers.html#max-memory-per-child-setting

That might be useful.
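
For reference, a minimal sketch of how that setting could be wired up (the app name, broker URL and the 2GB limit below are illustrative, not what ORES actually uses):

from celery import Celery

# Hypothetical app name and broker URL, for illustration only.
app = Celery("ores_scoring", broker="redis://localhost:6379/0")

# If a pool worker's resident memory exceeds this many kilobytes, celery
# lets it finish its current task and then replaces it with a fresh child,
# instead of waiting for the kernel OOM killer to step in.
app.conf.worker_max_memory_per_child = 2000000  # ~2GB, illustrative limit

# Equivalent command-line form:
#   celery -A ores_scoring worker --max-memory-per-child=2000000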

I also ran some queries looking for a rogue edit during the time window where we saw the OOM errors; see https://quarry.wmflabs.org/query/41352. I didn't see anything obvious. The largest change during this window appears to have been in Wikidata, and it was trivial to score.

I had a look at the request log on ores2001 and couldn't find any requests that look concerning.

Hypotheses:

  1. celery got into a weird state and went crazy. It may not happen again even if we re-run an identical scoring job.
  2. there was a revision that required *a ton* of data for feature extraction. This is possible if a page is really, really big, but there are limits inside MediaWiki to prevent this.

This happened in eqiad and codfw at the *exact same time*. You can see ores1003 go from 22GB available to 0.4GB available over the course of seconds, while other nodes seem undisturbed. I think the evidence is leaning towards 2 as an explanation. If it was a specific revision that caused the problem, maybe we would find it in the ChangeProp logs.

@akosiaris, I'm at a loss here. I can't figure out what might have happened and I am struggling to reproduce the behavior. What do you think a good next step might be?

Thanks for the ping. Notes:

  • This is also evident at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=ores&var-instance=All&from=1578838913888&to=1579046093731.
  • Also evident at https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1578871146375&to=1579020751546 (look at the returned errors)
  • The span of time that memory consumption was increased coincides with CPU consumption being increased.
  • Network utilization also doubled (both in and out) during that timeframe
  • 06:25 UTC is the time that cron.daily is scheduled and thus logrotate is scheduled. ORES has rules for logrotate, both for uwsgi and for celery
  • There are no core dumps on any of the ores2* hosts
  • Despite a couple of similar bumps on 3 machines in eqiad around 06:25 UTC, memory usage in eqiad over the larger timespan from 2020-01-13 08:00 UTC to 2020-01-14 06:25 UTC follows a different pattern, indicating that this is indeed localized to codfw.
  • Indeed, ores1003 had an OOM-killer event that day at 06:25.
  • changeprop hits both DCs in order to populate the cache. That means we can probably exclude it and focus instead on external requests.
  • The "3rd party ORES requests by wiki" graph on the above dashboard doesn't help with singling out a specific wiki by traffic pattern.
  • No host reached memory usage close to physical memory during the timeframe. Top usage values seem to be close to 45GB, which is ~20GB short of the available physical memory. That in itself doesn't say too much, however; memory usage could have spiked in mere seconds and never shown up in our stats (scraped every 10 secs).

So, with that data in mind:

A single revision could be the culprit, but not by consuming the extra memory alone. The persistence of CPU usage, network usage and errors consistently being returned throughout the ~22h duration indicates a different kind of problem, perhaps a form of livelock?
It seems the trigger was logrotate (which just sends SIGTERM); it probably caused memory usage to spike and the oom-killer to show up, possibly because celery, on getting the SIGTERM, tried to clean up after its children that were running haywire.

@Halfak, I would try to dig up external requests from one of the hosts and figure out if something looks weird, perhaps even replaying them on a test host.

I've run the following on the logs @elukey pointed out above on ores2001, and it returned:

grep '2020-01-13T08' main.log.1 | grep -Ev 'ChangePropagation/WMF|Twisted' | grep -o 'GET .* =>' | sort | uniq -c | sort -rn | head -10

10 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=369794500%7C25571593%7C89266065%7C17774726%7C42058451%7C29702962%7C222202444%7C183872080%7C440423560%7C633467137%7C102713474%7C632207811%7C13527996%7C261758911%7C50432264%7C282234079%7C916920637%7C236072947%7C763709508%7C134869098%7C864011429%7C113723336%7C435975497%7C51081591%7C350737899%7C918236592%7C335683862%7C89385920%7C140984233%7C261346667%7C14169952%7C752586056%7C11592094%7C533153411%7C520861681%7C125518897%7C585011168%7C131943610%7C73642523%7C476166876%7C548027875%7C96989224%7C367730946%7C578609949%7C37436765%7C304754025%7C58340771%7C766455804%7C596833605%7C27940831 =>
 9 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=741735209%7C299553874%7C4430044%7C49559387%7C411954660%7C140603723%7C760300962%7C39541465%7C223934999%7C497905817%7C13718517%7C930921506%7C729910390%7C334283508%7C58406308%7C570928916%7C387268682%7C917730312%7C35954265%7C128312583%7C179805606%7C62627269%7C20513918%7C7732138%7C26190951%7C64664252%7C69147040%7C56654816%7C224562782%7C301946983%7C87823055%7C177507674%7C13437793%7C50492266%7C43118355%7C68658551%7C73921885%7C58490948%7C187496392%7C678086483%7C35771333%7C562829811%7C56870136%7C153793197%7C910955384%7C302006304%7C18409943%7C106952127%7C35740914%7C23964373 =>
 9 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=735549004%7C71889799%7C42343619%7C12939454%7C1602389%7C702286322%7C694862924%7C39479845%7C695027398%7C11414822%7C34772245%7C559072875%7C19727354%7C21502820%7C319758531%7C71016499%7C339706857%7C12305354%7C384210977%7C8577056%7C126158182%7C19798159%7C241007091%7C11942461%7C902051152%7C370422151%7C63887749%7C46296410%7C26590947%7C651844692%7C20436821%7C103896225%7C293698258%7C9449068%7C46456562%7C759511194%7C78759403%7C52487091%7C237976217%7C70356806%7C30360521%7C11845498%7C71192740%7C139401743%7C29798047%7C14798978%7C3516190%7C137891%7C123983340%7C634219485 =>
 9 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=299958569%7C864657122%7C385914282%7C26317769%7C8583188%7C325549586%7C18909905%7C296362743%7C2647781%7C96739572%7C513700963%7C208164548%7C272832987%7C223864650%7C346086959%7C385380698%7C92029968%7C463905830%7C274699187%7C57541448%7C236075590%7C40331699%7C4127646%7C805400612%7C214144409%7C144155297%7C105111628%7C421184549%7C384685283%7C31567034%7C343657753%7C40402714%7C688740834%7C312747435%7C30624240%7C215627312%7C13158429%7C453139672%7C347478381%7C52367268%7C8585076%7C587970640%7C122034762%7C382432793%7C6800638%7C58126801%7C43727038%7C65351791%7C33992387%7C29856037 =>
 9 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=197080574%7C190611098%7C729729995%7C61503991%7C206344852%7C102695394%7C7530704%7C54787362%7C63174788%7C75791273%7C30061020%7C177469006%7C72710154%7C25054282%7C314896191%7C192548375%7C304197775%7C33171377%7C428129548%7C13857408%7C327033351%7C211497975%7C83019696%7C901506922%7C147487367%7C217978867%7C142558259%7C393552713%7C414924838%7C203112599%7C690265523%7C25968266%7C774737763%7C169912313%7C36934580%7C19708541%7C121657766%7C745156036%7C338467077%7C552335586%7C60372580%7C89055747%7C22282219%7C90986628%7C619313334%7C538037874%7C347928600%7C312195546%7C92447899%7C325681095 =>
 8 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=858946332%7C61366030%7C624002764%7C74521017%7C30962769%7C716757059%7C256617026%7C100292220%7C340070036%7C112457912%7C616975911%7C141929980%7C30238237%7C694038437%7C8574502%7C269736762%7C91131001%7C30307667%7C162737479%7C252144576%7C405142415%7C386047400%7C28205823%7C135960210%7C686425434%7C142166407%7C806207756%7C34417601%7C822772626%7C18656908%7C113667722%7C26396476%7C372815988%7C9684266%7C636861857%7C42585009%7C141842992%7C100211561%7C79392747%7C74332985%7C18019734%7C49307822%7C863083379%7C411469403%7C59487310%7C406671534%7C560127459%7C153091752%7C13436468%7C23532374 =>
 8 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=74465745%7C14987963%7C31738251%7C5864069%7C29463577%7C5733078%7C554561606%7C884316922%7C18226089%7C52341325%7C545695706%7C126529743%7C192239859%7C188533206%7C108492056%7C14336773%7C816776150%7C114225615%7C69843223%7C87250132%7C28137884%7C770773809%7C13884069%7C13457917%7C72973227%7C733545586%7C60667318%7C283547250%7C133864102%7C262605062%7C124606730%7C64267025%7C104411400%7C26711331%7C535195231%7C866993757%7C873598199%7C47343012%7C465754310%7C117027975%7C73452000%7C243120407%7C43640771%7C31685822%7C41283774%7C65901727%7C20031268%7C499245925%7C761150371%7C3516881 =>
 8 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=673669195%7C228211094%7C39480045%7C86396282%7C223810153%7C668646246%7C54561731%7C922052855%7C918372480%7C20279788%7C409760841%7C30486769%7C207467669%7C306735765%7C39081622%7C47800673%7C500609784%7C145797088%7C497972054%7C76778184%7C29791926%7C364056753%7C434354060%7C371089881%7C537602977%7C94824800%7C176294979%7C70571999%7C560910307%7C153136572%7C59839748%7C552817835%7C12603349%7C915409937%7C39086410%7C2604291%7C29851526%7C170763411%7C68370064%7C49815377%7C880228471%7C81799839%7C223166642%7C216793452%7C8586444%7C28989833%7C46088411%7C563088692%7C231144453%7C172792646 =>
 8 GET /v3/scores/enwiki/?models=draftquality%7Cdamaging%7Cgoodfaith%7Cdrafttopic&revids=594014857%7C30754301%7C281986945%7C260095029%7C6801394%7C110051505%7C90812157%7C11737945%7C40776699%7C551447133%7C69114760%7C7993498%7C615094357%7C46933373%7C9153173%7C66266510%7C566291706%7C328699349%7C58085047%7C309010229%7C593968614%7C778583184%7C719730083%7C35563163%7C6800745%7C112613926%7C515768865%7C35661797%7C776404222%7C822157138%7C80542160%7C13949453%7C577282913%7C229970748%7C38952592%7C8574522%7C76579793%7C35119097%7C31709556%7C469133330%7C385588292%7C59451093%7C430852244%7C117849827%7C694141042%7C587749805%7C13438216%7C919963489%7C117287638%7C88445976 =>

I have no idea whether they could be the issue here, but it does seem weird that there are multiple requests for each.

Just a quick note: an extremely similar event occurred today at 06:30 UTC. The entire ORES cluster seems to have had OOM issues for a while.

Created a dir on ores1002 with logs (/home/elukey/0402020_celery_oom).

The eqiad and codfw graphs both point to a similar issue as last time. 06:25 UTC seems to be the start of the incident. Memory usage, however, had increased by close to 100% nine hours before the event. The trigger is probably logrotate again (it happens every day anyway; if it were the root cause we would see this all the time), but the cause is probably something in the traffic patterns.

Happened again today at 06:25 UTC when logrotate ran; I forced a puppet run on all hosts to recover quickly.

The 7-day view shows a nice increase in memory usage:

https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-7d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores&var-instance=All&fullscreen&panelId=86

Could it be that a reload now hits the OOM threshold, which in turn causes celery to be killed?

There you go: https://tools.wmflabs.org/sal/log/AXAM254BfYQT6VcDATbh

The deployment matches the start of the memory growth. @Halfak, do you have any idea what caused it? Should we roll back?

T243451 does explain the higher memory usage. It even points out that the higher memory usage is worrisome; however, it was deployed anyway.

I've tried to reproduce this; it's easily reproducible after all. Just do what logrotate does and issue systemctl reload uwsgi-ores. CPU usage spikes and reaches 100% on all CPUs in the machine for several seconds. Memory usage spikes as well, and then the OOM killer shows up as the machine runs out of memory. The best candidate for the OOM killer to kill is celery, as it is the big memory user.

I am not sure what uwsgi-ores does when receiving that reload signal (it's a HUP signal[1] for what it's worth), but it needs to be investigated. The same thing seems to happen on a restart (a TERM signal[2]), so we don't have the quick workaround of restarting instead of reloading. It is also probably prudent to revert the deploy in T243451 while the investigation is ongoing, as otherwise we will have a minor outage every time a reload/restart happens.

[1] https://en.wikipedia.org/wiki/SIGHUP
[2] https://en.wikipedia.org/wiki/Signal_(IPC)#SIGTERM

Mentioned in SAL (#wikimedia-operations) [2020-02-06T15:53:04Z] <halfak@deploy1001> Started deploy [ores/deploy@50a101a]: T242705

Mentioned in SAL (#wikimedia-operations) [2020-02-06T15:57:39Z] <halfak@deploy1001> Finished deploy [ores/deploy@50a101a]: T242705 (duration: 04m 35s)

OK rolled back.

Looking at what happened, it seems like memory pressure was *way worse* in eqiad than in codfw.

I wonder how that could be?

Re. uwsgi restarts: when ORES restarts uwsgi, it loads the complete models into memory and then discards some parts that are unneeded in the uwsgi workers in order to free up some memory. I think we can do some other things to free up even more, but that won't reduce the initial spike in memory usage. I'll look into it anyway to see if I can replicate and fix it.

When I start up the deployment ORES config locally with 4 workers, I can see that we are using ~2516000 bytes of RES for two processes. It looks like my available RAM goes down by ~5000000 bytes so that lines up with actual memory usage.

I would expect this memory usage to go down, but it doesn't, so I'm investigating our strategy for releasing some of the memory-hungry bits in uwsgi workers.

It happened again this morning, even though the deployment was rolled back. From dmesg I can again see celery being killed by the OOM killer:

[Fri Feb  7 06:33:29 2020] Out of memory: Kill process 5683 (celery) score 34 or sacrifice child
[Fri Feb  7 06:33:29 2020] Killed process 5683 (celery) total-vm:5361972kB, anon-rss:2272892kB, file-rss:0kB, shmem-rss:0kB

In case it is needed, I copied logs of this occurrence to /home/elukey/0702020_celery_oom on ores1008.

Simply running systemctl reload uwsgi-ores doesn't seem to trigger the issue anymore.

EDIT: correction, it doesn't trigger the OOM, but I can see uwsgi spiking in CPU usage right after issuing the reload.

On ores1008 there is a difference in impact though:

Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:09,274: ERROR/MainProcess] Process 'ForkPoolWorker-2057' pid:16737 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:10,887: ERROR/MainProcess] Process 'ForkPoolWorker-2056' pid:16617 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,881: ERROR/MainProcess] Process 'ForkPoolWorker-2054' pid:16573 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,903: ERROR/MainProcess] Process 'ForkPoolWorker-2053' pid:16464 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,922: ERROR/MainProcess] Process 'ForkPoolWorker-2052' pid:16454 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,933: ERROR/MainProcess] Process 'ForkPoolWorker-2050' pid:16403 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,944: ERROR/MainProcess] Process 'ForkPoolWorker-2047' pid:16216 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,955: ERROR/MainProcess] Process 'ForkPoolWorker-2046' pid:16215 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,966: ERROR/MainProcess] Process 'ForkPoolWorker-2044' pid:16170 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,978: ERROR/MainProcess] Process 'ForkPoolWorker-2040' pid:16112 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:11,989: ERROR/MainProcess] Process 'ForkPoolWorker-2030' pid:15924 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:12,000: ERROR/MainProcess] Process 'ForkPoolWorker-2010' pid:15330 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:12,011: ERROR/MainProcess] Process 'ForkPoolWorker-2006' pid:15216 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:12,022: ERROR/MainProcess] Process 'ForkPoolWorker-1996' pid:15087 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:12,033: ERROR/MainProcess] Process 'ForkPoolWorker-1902' pid:2494 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:12 ores1008 celery-ores-worker[26100]: [2020-02-05 06:25:12,045: ERROR/MainProcess] Process 'ForkPoolWorker-1901' pid:2399 exited with 'signal 9 (SIGKILL)'
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]:
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]:  -------------- celery@ores1008 v4.1.1 (latentcall)
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: ---- **** -----
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: --- * ***  * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.11 2020-02-04 06:30:10
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: -- * - **** ---
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: - ** ---------- [config]
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: - ** ---------- .> app:         ores.scoring_systems.celery_queue:0x7f45fe058c88
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: - ** ---------- .> transport:   redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: - ** ---------- .> results:     redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: - *** --- * --- .> concurrency: 90 (prefork)
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]: --- ***** -----
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]:  -------------- [queues]
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]:                 .> celery           exchange=celery(direct) key=celery
Feb 05 06:25:40 ores1008 celery-ores-worker[26100]:
Feb 05 06:25:49 ores1008 systemd[1]: celery-ores-worker.service: Main process exited, code=exited, status=1/FAILURE
Feb 05 06:25:49 ores1008 systemd[1]: celery-ores-worker.service: Unit entered failed state.
Feb 05 06:25:49 ores1008 systemd[1]: celery-ores-worker.service: Failed with result 'exit-code'.

vs

Feb 07 06:25:29 ores1008 celery-ores-worker[37401]: /srv/deployment/ores/deploy-cache/revs/039251fecb319ba7f02f7c13886fc7ee1a5d449c/venv/lib/python3.5/site-packages/smart_open/smart
Feb 07 06:25:29 ores1008 celery-ores-worker[37401]:   'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Feb 07 06:25:29 ores1008 celery-ores-worker[37401]: [2020-02-07 06:25:28,607: ERROR/MainProcess] Process 'ForkPoolWorker-958' pid:13446 exited with 'signal 9 (SIGKILL)'
Feb 07 06:25:29 ores1008 celery-ores-worker[37401]: [2020-02-07 06:25:28,618: ERROR/MainProcess] Process 'ForkPoolWorker-948' pid:10765 exited with 'signal 9 (SIGKILL)'
Feb 07 06:25:34 ores1008 celery-ores-worker[37401]: [2020-02-07 06:25:34,537: ERROR/MainProcess] Process 'ForkPoolWorker-1178' pid:5683 exited with 'signal 9 (SIGKILL)'
Feb 07 06:25:34 ores1008 celery-ores-worker[37401]: [2020-02-07 06:25:34,548: ERROR/MainProcess] Process 'ForkPoolWorker-961' pid:13497 exited with 'signal 9 (SIGKILL)'
Feb 07 07:51:12 ores1008 celery-ores-worker[37401]: [2020-02-07 07:51:12,826: ERROR/MainProcess] Process 'ForkPoolWorker-1350' pid:27574 exited with 'signal 9 (SIGKILL)'
Feb 07 07:52:51 ores1008 celery-ores-worker[37401]: [2020-02-07 07:52:51,975: ERROR/MainProcess] Process 'ForkPoolWorker-1366' pid:28268 exited with 'signal 9 (SIGKILL)'

In fact, when I ran puppet this time I didn't see any mention of celery being restarted on any host, so I guess it just respawned new workers?

Might not be related, but https://github.com/unbit/uwsgi/issues/296 seems interesting; pasting it here as a reference. A lot of people report --ignore-sigpipe as a useful parameter to add to keep uwsgi workers from getting stuck if the master is killed before them (shouldn't happen, but..).

Ok, so I did some tests (even with --ignore-sigpipe), but nothing good to report. I also straced the uwsgi master and worker processes right after a reload, and I can see two things:

  1. the CPU spike is related to the work done by the workers to shut down (and finally exit) and by the master process to spawn new ones.
  2. given what https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html says, it seems to me that the extra memory consumption could be due to what Aaron mentioned, plus the fact that some (old) workers might be kept running due to outstanding requests, leading to memory pressure.

It doesn't seem to be a uwsgi issue from what I can see.

Something to note (it might be a totally unrelated/known bug/feature) is that upon startup I see a lot of the following exceptions in main.log:

[2020-02-07T06:25:20] Exception ignored in: <bound method Dict.__del__ of <enchant.Dict object at 0x7f773ab2bcc0>>
[2020-02-07T06:25:20] Traceback (most recent call last):
[2020-02-07T06:25:20]   File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/enchant/__init__.py", line 576, in __del__
[2020-02-07T06:25:20]   File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/enchant/__init__.py", line 638, in _free
[2020-02-07T06:25:20]   File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/enchant/__init__.py", line 346, in _free_dict
[2020-02-07T06:25:20]   File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/enchant/__init__.py", line 353, in _free_dict_data

That looks like an unresolved bug: https://github.com/pyenchant/pyenchant/issues/98

Mentioned in SAL (#wikimedia-operations) [2020-02-07T10:36:43Z] <akosiaris> conduct experiments with stopping/starting uwsgi-ores on ores2001 T242705

Let me add my own finding. Doing systemctl stop uwsgi-ores triggers the issue; it's during the stop phase that the uwsgi workers go haywire on CPU and memory usage. Running systemctl start uwsgi-ores after that does not cause any significant CPU or memory increase.

In my limited tests, during that stop step CPU usage skyrockets to 100% on all cores and memory usage temporarily spikes by >20GB. The stop step also takes >20 secs to complete.

By contrast, during the start step, CPU usage remains constant and the memory usage increase is less than 3GB over the next 30s.

So we should focus on what happens when a worker is stopped (receives the TERM signal I guess?)

I just deployed a change to beta that dramatically reduced the memory usage of uwsgi processes.

See https://github.com/wikimedia/ores/pull/337

This change addresses what I think is a garbage collection issue by loading models and trimming the memory-intensive bits we don't need in uwsgi in a separate process, and then killing that process to free the memory. It makes our restarts a bit slower, but it is working consistently.

It looks like this will help give us some breathing room. I'd like to try deploying this change on Monday's deployment window.
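
For readers without access to the PR, a rough sketch of the pattern described above; load_full_model and strip_unneeded_bits are hypothetical stand-ins for the real ORES code, and the point is only that the expensive load/trim work happens in a short-lived child process whose memory goes back to the OS when it exits:

import multiprocessing

def _load_and_trim(model_path, conn):
    # Hypothetical helpers standing in for the real ORES loading code.
    model = load_full_model(model_path)   # big, temporary allocation
    slim = strip_unneeded_bits(model)     # keep only what uwsgi needs
    conn.send(slim)                       # pickled back to the parent
    conn.close()

def load_trimmed_model(model_path):
    parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_load_and_trim,
                                   args=(model_path, child_conn))
    proc.start()
    slim = parent_conn.recv()   # only the trimmed model crosses the pipe
    proc.join()                 # child exits; its peak memory is released
    return slim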

I'd like to try deploying this change on Monday's deployment window.

+1. Since we have a way to reproduce, we can test it easily and revert if needed.

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:00:21Z] <halfak@deploy1001> Started deploy [ores/deploy@a6f4f14]: T242705

Mentioned in SAL (#wikimedia-operations) [2020-02-10T21:12:39Z] <halfak@deploy1001> Finished deploy [ores/deploy@a6f4f14]: T242705 (duration: 12m 18s)

Mentioned in SAL (#wikimedia-operations) [2020-02-11T07:57:13Z] <akosiaris> T242705 systemctl stop uwsgi-ores on ores2001.

The result of running the command in the SAL entry above is at:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ores2001&var-datasource=codfw%20prometheus%2Fops&var-cluster=ores&from=1581407424059&to=1581407664742

It's not entirely clear there, probably because the entire event lasts ~20s and Prometheus might not have time to scrape it in its entirety. A second attempt led to:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ores2001&var-datasource=codfw%20prometheus%2Fops&var-cluster=ores&from=1581409754768&to=1581410083951

In that one it's clearer what's happening: CPU and memory usage are clearly increased during the uwsgi stop phase. Interestingly, during the start step, nothing of the sort happens.

The per-second numbers on the host are more nuanced. Using htop/top on the host I see 100% CPU usage and a memory usage increase of ~20GB when uwsgi-ores is stopped. Prometheus doesn't have the resolution of top/htop (it scrapes every 15s), so it can't pick up the event in all its glory.
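
To capture those per-second numbers without relying on Prometheus, a throwaway sketch like the following could be left running in one terminal while stopping uwsgi-ores in another (Linux-only; it just samples MemAvailable from /proc/meminfo every second):

import time

def mem_available_kb():
    # Read the MemAvailable line from /proc/meminfo (present on Linux >= 3.14).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return None

# One sample per second for a minute, enough to cover the ~20s stop phase.
for _ in range(60):
    print(time.strftime("%H:%M:%S"), mem_available_kb(), "kB available")
    time.sleep(1)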

I don't think a6f4f14 has addressed the underlying issue. It has probably just bought us some breathing room, but my guess is that the moment the average free memory on the hosts goes below 20GB we will start seeing the issue again, which probably means limits on what features can be deployed product-wise.

Maybe related:

Till uWSGI 2.1, by default, sending the SIGTERM signal to uWSGI means “brutally reload the stack” while the convention is to shut an application down on SIGTERM. To shutdown uWSGI use SIGINT or SIGQUIT instead. If you absolutely can not live with uWSGI being so disrespectful towards SIGTERM, by all means enable the die-on-term option. Fortunately, this bad choice has been fixed in uWSGI 2.1

Could it be that the listen queue is filling up when we stop uwsgi?

Maybe related:

Till uWSGI 2.1, by default, sending the SIGTERM signal to uWSGI means “brutally reload the stack” while the convention is to shut an application down on SIGTERM. To shutdown uWSGI use SIGINT or SIGQUIT instead. If you absolutely can not live with uWSGI being so disrespectful towards SIGTERM, by all means enable the die-on-term option. Fortunately, this bad choice has been fixed in uWSGI 2.1

We have passed --die-on-term for a very long time (since the beginning, IIRC), so probably not.

Volans renamed this task from Ores celery OOM event in codfw to Ores celery OOM events in all hosts. Edited Feb 12 2020, 6:55 AM
Volans subscribed.

FWIW, all hosts are showing OOMs; I've renamed the task accordingly:

$ sudo cumin 'A:ores' 'zgrep -c "Out of memory: Kill process" /var/log/syslog*'
18 hosts will be targeted:
ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) ores1007.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:4
/var/log/syslog.1:5
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:2
/var/log/syslog.6.gz:45
/var/log/syslog.7.gz:39
===== NODE GROUP =====
(1) ores2008.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:4
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:29
===== NODE GROUP =====
(1) ores1005.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:5
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:145
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:0
===== NODE GROUP =====
(1) ores1008.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:5
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:4
/var/log/syslog.6.gz:5
/var/log/syslog.7.gz:40
===== NODE GROUP =====
(1) ores2005.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:104
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:0
===== NODE GROUP =====
(1) ores2002.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:15
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:70
===== NODE GROUP =====
(1) ores2007.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:5
/var/log/syslog.6.gz:2
/var/log/syslog.7.gz:27
===== NODE GROUP =====
(1) ores1004.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:5
/var/log/syslog.6.gz:31
/var/log/syslog.7.gz:41
===== NODE GROUP =====
(1) ores2003.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:9
/var/log/syslog.7.gz:38
===== NODE GROUP =====
(1) ores1001.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:35
===== NODE GROUP =====
(1) ores1009.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:36
===== NODE GROUP =====
(1) ores2006.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:30
===== NODE GROUP =====
(1) ores2009.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:1
/var/log/syslog.7.gz:36
===== NODE GROUP =====
(1) ores1003.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:2
/var/log/syslog.7.gz:72
===== NODE GROUP =====
(1) ores2004.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:3
/var/log/syslog.7.gz:47
===== NODE GROUP =====
(1) ores2001.codfw.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:30
/var/log/syslog.7.gz:29
===== NODE GROUP =====
(1) ores1002.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:0
/var/log/syslog.1:7
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:153
/var/log/syslog.8.gz:0
===== NODE GROUP =====
(1) ores1006.eqiad.wmnet
----- OUTPUT of 'zgrep -c "Out of.../var/log/syslog*' -----
/var/log/syslog:2
/var/log/syslog.1:0
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:0
/var/log/syslog.4.gz:0
/var/log/syslog.5.gz:0
/var/log/syslog.6.gz:0
/var/log/syslog.7.gz:94
================

Today this caused quite a lot of spam on the #wikimedia-operations channel because on multiple hosts the Nagios NRPE daemon failed too:

nrpe[21365]: fork() failed with error 12, bailing out...
systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: nagios-nrpe-server.service: Failed to fork: Cannot allocate memory

The NRPE daemon is the one performing all the local checks for Icinga, hence the noise of all the alerts going off because of the failed NRPE.

Halfak renamed this task from Ores celery OOM events in all hosts to ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart). Feb 12 2020, 7:37 PM

OK, so I've done some tests. It's clear that we can see this CPU/memory spike when shutting down uwsgi. Essentially, all of the child processes (workers) suddenly use as much CPU as they can. top doesn't report more memory being used by the workers, but it does report a precipitous drop in available memory while shutting down. Further, the logs report:

[2020-02-12T06:25:20] worker 5 buried after 16 seconds
[2020-02-12T06:25:21] worker 1 buried after 17 seconds
[2020-02-12T06:25:21] worker 10 buried after 17 seconds
[2020-02-12T06:25:21] worker 31 buried after 17 seconds
[2020-02-12T06:25:21] worker 33 buried after 17 seconds
[2020-02-12T06:25:21] worker 34 buried after 17 seconds
[2020-02-12T06:25:22] worker 4 buried after 18 seconds
[2020-02-12T06:25:22] worker 8 buried after 18 seconds
[2020-02-12T06:25:22] worker 14 buried after 18 seconds
[2020-02-12T06:25:22] worker 41 buried after 18 seconds
[2020-02-12T06:25:22] worker 42 buried after 18 seconds
[2020-02-12T06:25:22] worker 43 buried after 18 seconds
[2020-02-12T06:25:22] worker 47 buried after 18 seconds

When they used to report something like:

[2020-02-12T06:25:04] Gracefully killing worker 38 (pid: 18227)...
[2020-02-12T06:25:04] Gracefully killing worker 39 (pid: 18228)...
[2020-02-12T06:25:04] Gracefully killing worker 40 (pid: 18229)...
[2020-02-12T06:25:04] Gracefully killing worker 41 (pid: 18230)...
[2020-02-12T06:25:04] Gracefully killing worker 42 (pid: 18231)...
[2020-02-12T06:25:04] Gracefully killing worker 43 (pid: 18232)...
[2020-02-12T06:25:04] Gracefully killing worker 44 (pid: 18233)...
[2020-02-12T06:25:04] Gracefully killing worker 47 (pid: 18236)...

I ran some tests on our staging server in WMFLabs -- ores-staging-01.eqiad.wmflabs -- and was able to replicate the behavior. I also tried rolling back our code to previous versions; even with our code from Aug 2nd, 2019 I was able to replicate the same behavior on shutdown. This all points to a change on our machines -- probably applied via puppet -- causing the problem.

One final note: it looks like https://github.com/unbit/uwsgi/issues/296 covers a similar issue. I was not able to get any of the workarounds listed there to work.

From https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html

To shutdown uWSGI use SIGINT or SIGQUIT instead. If you absolutely can not live with uWSGI being so disrespectful towards SIGTERM, by all means enable the die-on-term option.

I tried manually sending a SIGQUIT to the main uwsgi process and saw the same behavior.

I also get the same behavior if I just sudo kill the main process.

I tried removing the --die-on-term option and I get the same behavior.

Here's an strace of one of the child processes that goes berserk:

strace: Process 25405 attached
epoll_wait(7, 0x7ffef404cd7c, 1, -1)    = -1 EINTR (Interrupted system call)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=25169, si_uid=33} ---
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1, si_uid=0} ---
getpid()                                = 25405
futex(0x7fe08c354aac, FUTEX_WAKE_PRIVATE, 2147483647) = 0
getpid()                                = 25405
getpid()                                = 25405
close(103)                              = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe08d79f0c0}, {sa_handler=0x55cc34255ad0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe08d79f0c0}, 8) = 0
munmap(0x7fe052e33000, 262144)          = 0
munmap(0x7fe054710000, 262144)          = 0
munmap(0x7fe0546d0000, 262144)          = 0
munmap(0x7fe0549f4000, 262144)          = 0
munmap(0x7fe06f979000, 262144)          = 0
getpid()                                = 25405
getpid()                                = 25405
munmap(0x7fe07e074000, 33554432)        = 0
munmap(0x7fe07b873000, 33554432)        = 0
munmap(0x7fe0780db000, 33554432)        = 0
munmap(0x7fe0760db000, 33554432)        = 0
munmap(0x7fe0740db000, 33554432)        = 0
munmap(0x7fe0720db000, 33554432)        = 0
munmap(0x7fe0700db000, 33554432)        = 0
munmap(0x7fe069de4000, 33554432)        = 0
munmap(0x7fe064e76000, 33554432)        = 0
munmap(0x7fe062e76000, 33554432)        = 0
munmap(0x7fe060e76000, 33554432)        = 0
munmap(0x7fe05ee36000, 33554432)        = 0
munmap(0x7fe05ce36000, 33554432)        = 0
munmap(0x7fe05ae36000, 33554432)        = 0
exit_group(30)                          = ?
+++ exited with 30 +++

Note that the output hangs on the line that starts with "rt_sigaction(SIGINT, ..." while CPU and memory go crazy for a few seconds.

@akosiaris, any updates on testing with gunicorn?

I ran a test with gunicorn on my laptop and I was able to replicate this behavior.

I started the server with this:

~/venv/3.5/bin/gunicorn --workers=16 --preload ores_wsgi:application

The --preload option is important; otherwise each fork loads the application separately and shares no memory.

Then I just sent a SIGINT to the main process and watched top.

I observed the same race to 100% CPU along with a spike in memory usage.


So maybe this is ORES. I think our next test should be with a very simple test application to see if the behavior is common to wsgi-related forks.

I was able to replicate the behavior with a very simple Flask application.

app.py:

$ cat app.py 
from flask import Flask

app = Flask(__name__)

my_data = {"data{0}".format(i): "value{0}".format(i) for i in range(2000000)}


@app.route("/")
def index():
  return "I have {0} data items totalling {1} characters".format(
    len(my_data), sum(len(k) + len(v) for k, v in my_data.items()))

I started it up with

~/venv/3.5/bin/gunicorn --workers=16 --preload app:app

And when I sent a SIGINT (^C) to the main process I nearly lost my laptop to the memory/CPU spike.

@ACraze and I were discussing this and we were wondering if maybe it's time to try threads. In some simple tests, they seemed to work really well. We'd need to do some load testing. And it might cause some instability that we'd want to look out for.

@akosiaris, I've run a couple of tests now using threads rather than process-based workers and the behavior is fixed under that configuration.

What do you think about experimenting with threads? Given that web workers are very IO-bound, I think python's GIL won't bother us.

Another option is to use pseudo-threads (event-based). See https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
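
For the record, a minimal sketch of what the threaded setup could look like as a gunicorn config file (the worker/thread counts are illustrative, not a proposed production setting):

# gunicorn.conf.py -- illustrative values only
# Start with: gunicorn -c gunicorn.conf.py ores_wsgi:application
worker_class = "gthread"  # thread-based workers instead of prefork processes
workers = 4               # a few processes so several cores are still used
threads = 16              # threads per process; fine while requests are IO-bound
preload_app = True        # load the models once in the master and share them

The event-based ("pseudo-thread") route mentioned above would instead set worker_class to one of gunicorn's async workers (e.g. gevent), with its own trade-offs.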

I was able to replicate the behavior with a very simple Flask application.

app.py:

$ cat app.py 
from flask import Flask

app = Flask(__name__)

my_data = {"data{0}".format(i): "value{0}".format(i) for i in range(10000000)}


@app.route("/")
def index():
  return "I have {0} data items totalling {1} characters".format(
    len(my_data), sum(len(k) + len(v) for k, v in my_data.items()))

I started it up with

~/venv/3.5/bin/gunicorn --workers=16 --preload app:app

And when I sent a SIGINT (^C) to the main process I nearly lost my laptop to the memory/CPU spike.

I just had a quick look at the task and I was surprised by this, as it runs counter to my understanding of how memory allocation/freeing works. I expect the CPU spike on termination, as the Python garbage collector would have a considerable amount of work to do, but not the memory spike.

Just to make sure, I created a new venv and ran the same simple Flask application, and I wasn't able to reproduce the reported issue. I did indeed see the CPU spike, but no memory spike. @Halfak, I think you nearly lost your laptop because of the memory usage in the first place. It probably pushed the machine into its memory limits, making it swap heavily to disk and thus making it way less responsive.

@akosiaris, I've run a couple of tests now using threads rather than process-based workers and the behavior is fixed under that configuration.

Could you please share the code/commands for those tests? I'd love to see them.

What do you think about experimenting with threads? Given that web workers are very IO-bound, I think python's GIL won't bother us.

Experimenting is fine with me, but the assumption that the GIL won't bother us much in this case might not hold up to the incoming request rate. As you point out, Python's GIL means that essentially only one thread can run at a time, and if for whatever reason that one thread delays serving a request (e.g. aggressive garbage collection, a bug in a code path causing it to unnecessarily loop over structures, some recursion happening way too many times) it will, at the very least, push latencies for all requests to higher levels, potentially causing backlogs that can bring the service down. Before we contemplate pushing such a change to production, it would need extensive testing in other environments as well as tests simulating such (difficult to debug) failure scenarios.

I made this stack overflow post: https://stackoverflow.com/questions/61130651/memory-available-free-plummets-and-cpu-spikes-when-shutting-down-uwsgi-gunicor

Based on a response, I experimented with sigterm and uwsgi's "--die-on-term" option but I still got the problematic behavior.

Do you think this is Flask-specific? How hard would it be to port this code to another Python micro WSGI web framework?

We've been testing with gunicorn as well. The stackoverflow question has notes about running the test app with gunicorn.

The issue happens from time to time, always at the same time of day. @Halfak, is there any possible solution that you have in mind? If you need help from SRE, speak up :)

Oh yes. We need help from SRE. I'm at my limit here. @akosiaris was working with us on this in the past but a fresh set of eyes might do well. The stack overflow post provides the best summary and a nice simple chunk of example code.

@Gilles and @elukey, you've both offered to help us out here. I'm really excited that we may be able to work out this issue. Have you been able to make any progress, or do you have any questions I could answer?

@dpifke has looked into this. Dave, can you share notes here about what you've tried and what you understand about this problem?

I need to review the task and think about next steps, if any; let me know if there is anything outstanding to test/review/etc. in the meantime :)

Currently, the basic problem of the task is still outstanding. The stack overflow post is the best reference for it, as it demonstrates the problem via a simple example -- not full-blown ORES. https://stackoverflow.com/questions/61130651/memory-available-free-plummets-and-cpu-spikes-when-shutting-down-uwsgi-gunicor

Essentially, a collection of uwsgi or gunicorn web workers consumes a ton more memory in the few seconds after being shut down. This causes OOMs for us, and it doesn't make any sense. The stack overflow post gives you the code necessary to demonstrate this on your own hardware.

@Halfak, have we reported the issue upstream asking for advice (https://github.com/unbit/uwsgi)? IIRC they are very open to helping out the community, and you have already listed how to repro etc. in the stack overflow post. Upstream may have already seen the issue and could give us a quick fix while we are testing our own solutions.

Upstream suggests using --skip-atexit-teardown (have we tried it before?):

My bet is that when you shutdown the workers, the PyObject c structure of each python object is 'touched' forcing the 'copy part' of the COW behaviour.

You should check if this is the true cause by adding the --skip-atexit-teardown option that will skip the Py_Finalize() call

My bet is that when you shutdown the workers, the PyObject c structure of each python object is 'touched' forcing the 'copy part' of the COW behaviour.

That bet seems to be absolutely correct. It's interesting that the cleanup would touch the Python objects to such a degree, but a quick test on ores1001 validated that --skip-atexit-teardown works. I'll prepare a patch to apply it and send it for review.
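
A small standalone sketch of the copy-on-write effect upstream is describing (illustrative only, not uwsgi itself): after a fork the child shares the parent's pages for free, but merely walking the Python objects updates their refcounts, which writes to those shared pages and forces private copies; Py_Finalize() does that kind of touching for every object, which is what --skip-atexit-teardown avoids.

import os

# Build a large structure in the parent; distinct strings so the peephole
# optimizer can't collapse them into a single shared constant.
data = [str(i) * 20 for i in range(2000000)]

pid = os.fork()
if pid == 0:
    # Child: a read-only traversal still bumps every object's refcount,
    # dirtying the shared pages and growing the child's RSS by roughly
    # the size of the structure (watch it with top or /proc/<pid>/status).
    sum(len(s) for s in data)
    os._exit(0)
else:
    os.waitpid(pid, 0)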

@elukey Many thanks for pursuing this and persisting in chasing it down.

Change 608282 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores::web: Pass skip-atexit-teardown to uwsgi

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608282

Change 608282 merged by Alexandros Kosiaris:
[operations/puppet@production] ores::web: Pass skip-atexit-teardown to uwsgi

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608282

akosiaris lowered the priority of this task from High to Low. Jul 1 2020, 11:02 AM

I'll lower the priority for this; we may have the solution. https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores&var-instance=All&from=now-7d&to=now lacks the pattern on the 30th and the 1st, whereas it is clearly visible on all the days before that. I'll leave it open for a couple more days though, so we can monitor.

Fantastic :) Thanks for the quick turnaround, @akosiaris.

https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&edit&fullscreen&orgId=1&from=now-10d&to=now&var-site=eqiad&var-cluster=ores&var-instance=All&var-datasource=thanos says we have been OK since the change, and grepping logs on all hosts shows that the last time the OOM showed up was on Jun 29th. I'll resolve this; it seems like we've solved it.