
ORES celery workers maxing CPU in Cloud VPS
Open, LowPublic

Description

Our ORES workers are maxing out their CPUs in Cloud VPS (labs), yet our request rate has dropped by half. What is going on?

Event Timeline

From IRC, while I was thinking through this:

[08:08:32] <halfak> Looks like our ability to process requests is diminished. 
[08:09:29] >chanserv< quiet #wikimedia-ai icinga-wm
[08:09:30] * ChanServ sets quiet on *!*@wikimedia/bot/icinga-wm
[08:09:35] <halfak> I told icinga to shuttup. 
[08:09:44] <halfak> So now to look into what's going on. 
[08:45:31] <halfak> Something really weird is going on.  
[08:45:59] <halfak> I just cleared out celery and restarted everything and we're still getting *pinned* on 3/4 of our celery workers. 
[08:46:03] <halfak> There was no lull. 
[08:46:35] <halfak> And AFAICT, there's not a queue of incoming requests that is filling up. 
[08:46:47] <halfak> Something else is using a ton of CPU on these machines. 
[08:47:19] <halfak> I have looked through the user agents.  There's no clear indication of anyone.  It's almost all just "-"
[08:47:49] <halfak> I suspect that celery is spinning on its own, so I'm going to try temporarily disabling the load balancer -- thus cutting outside traffic and monitoring what celery does. 
[08:54:14] <halfak> OK it looks like CPU usage on the workers has fallen, but we still have one worker process on each machine taking up 100% of a core. 
[08:54:43] <halfak> This should be impossible.  We are using signal based timeouts that should eliminate the possibility of a run-away celery process.  But here it is happening.
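The signal-based timeouts mentioned above follow a standard CPython pattern. Here is a minimal sketch of that pattern (not ORES's actual implementation) which also shows its blind spot, i.e. how a "timed out" task could still pin a core:

```python
import signal
import time

class Timeout(Exception):
    pass

def run_with_timeout(func, seconds):
    """Run func(), raising Timeout if it runs longer than `seconds`.

    Caveats relevant to the symptom above: SIGALRM only works in the
    main thread, and the handler's exception is raised *between*
    Python bytecodes. A task stuck inside one long-running C call
    (e.g. deep in a native extension) won't see the Timeout until
    control returns to the interpreter -- so a "timed out" worker
    process can keep burning 100% of a core.
    """
    def handler(signum, frame):
        raise Timeout("%r exceeded %ds" % (func, seconds))

    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)        # schedule SIGALRM
    try:
        return func()
    finally:
        signal.alarm(0)          # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

For example, `run_with_timeout(lambda: time.sleep(10), 2)` raises `Timeout` after about 2 seconds, while `run_with_timeout(lambda: 6 * 7, 2)` returns normally.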

After a bunch more digging in the logs, I'm finding some revids that never seem to finish scoring. E.g., https://ores.wmflabs.org/v3/scores/enwiki/559565916/damaging?features hangs indefinitely.
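To tell "never finishes" apart from "merely slow", a bounded probe against a URI like the one above is handy. A sketch with stdlib urllib (the timeout value is arbitrary):

```python
import time
import urllib.request

def probe(url, timeout=30):
    """Fetch `url` with a hard socket timeout.

    Returns (body, elapsed_seconds). A slow request comes back with
    its timing; one that never sends anything raises URLError /
    socket.timeout after `timeout` seconds instead of hanging the
    probe forever.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    return body, time.perf_counter() - start
```

E.g. `probe("https://ores.wmflabs.org/v3/scores/enwiki/559565916/damaging?features", timeout=60)` either returns the score with its latency or raises after a minute.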

But I can run the model locally and get a score quickly enough. Note that running the model locally requires a lot of preamble (e.g. loading the model), so the timing should be faster when everything is already in memory.

$ time revscoring score models/enwiki.damaging.gradient_boosting.model --host=https://en.wikipedia.org 559565916
559565916	{"probability": {"false": 0.8600289360409357, "true": 0.13997106395906433}, "prediction": false}

real	0m8.595s
user	0m8.007s
sys	0m0.343s

OK I just replicated this on ores-worker-04:

time revscoring score /srv/ores/config/submodules/editquality/models/enwiki.damaging.gradient_boosting.model --host https://en.wikipedia.org 559565916
559565916	{"probability": {"false": 0.8600289360409357, "true": 0.13997106395906433}, "prediction": false}

real	0m10.095s
user	0m9.568s
sys	0m0.392s
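The "preamble" point can be made concrete with a stand-in: load cost dominates when a model is re-loaded on every invocation (as the CLI timings above do) but amortizes away in a long-lived worker. Everything below (the pickled blob, the no-op score()) is an illustrative placeholder, not revscoring's API:

```python
import pickle
import time

# Stand-in for a scoring model: a large pickled blob whose load time
# is non-trivial, playing the role of the gradient boosting model.
BLOB = pickle.dumps(list(range(500_000)))

def load_model():
    return pickle.loads(BLOB)           # the one-time "preamble"

def score(model, rev_id):
    return {"prediction": False}        # placeholder for predict()

# Cold: reload the model for every request (what each CLI run does).
t0 = time.perf_counter()
for _ in range(5):
    m = load_model()
    score(m, 559565916)
cold = time.perf_counter() - t0

# Warm: load once, score repeatedly (what a celery worker does).
t0 = time.perf_counter()
m = load_model()
for _ in range(5):
    score(m, 559565916)
warm = time.perf_counter() - t0

print(f"cold: {cold:.4f}s  warm: {warm:.4f}s")
```

The warm loop should come out well under the cold one, which is why the ~10s CLI timing is an upper bound on what a worker should need per score.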

Either way, the model should still be fast enough to complete before our timeout, so that's a dead end.

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.