
ORES celery workers maxing CPU in Cloud VPS
Open, LowPublic

Description

Our ORES workers are maxing out their CPUs in Cloud VPS (labs), yet our request rate has dropped by half. What is going on?

Event Timeline

From IRC, while I was thinking through this:

[08:08:32] <halfak> Looks like our ability to process requests is diminished. 
[08:09:29] >chanserv< quiet #wikimedia-ai icinga-wm
[08:09:30] * ChanServ sets quiet on *!*@wikimedia/bot/icinga-wm
[08:09:35] <halfak> I told icinga to shuttup. 
[08:09:44] <halfak> So now to look into what's going on. 
[08:45:31] <halfak> Something really weird is going on.  
[08:45:59] <halfak> I just cleared out celery and restarted everything and we're still getting *pinned* on 3/4 of our celery workers. 
[08:46:03] <halfak> There was no lull. 
[08:46:35] <halfak> And AFAICT, there's not a queue of incoming requests that is filling up. 
[08:46:47] <halfak> Something else is using a ton of CPU on these machines. 
[08:47:19] <halfak> I have looked through the user agents.  There's no clear indication of anyone.  It's almost all just "-"
[08:47:49] <halfak> I suspect that celery is spinning on its own, so I'm going to try temporarily disabling the load balancer -- thus cutting outside traffic and monitoring what celery does. 
[08:54:14] <halfak> OK it looks like CPU usage on the workers has fallen, but we still have one worker process on each machine taking up 100% of a core. 
[08:54:43] <halfak> This should be impossible.  We are using signal based timeouts that should eliminate the possibility of a run-away celery process.  But here it is happening.
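The signal-based timeouts mentioned above follow a standard CPython pattern. Here is a minimal sketch of that pattern (not ORES's actual implementation) which also shows its blind spot, i.e. how a "timed out" task could still pin a core:

```python
import signal
import time

class Timeout(Exception):
    pass

def run_with_timeout(func, seconds):
    """Run func(), raising Timeout if it runs longer than `seconds`.

    Caveats relevant to the symptom above: SIGALRM only works in the
    main thread, and the handler's exception is raised *between*
    Python bytecodes. A task stuck inside one long-running C call
    (e.g. deep in a native extension) won't see the Timeout until
    control returns to the interpreter -- so a "timed out" worker
    process can keep burning 100% of a core.
    """
    def handler(signum, frame):
        raise Timeout("%r exceeded %ds" % (func, seconds))

    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)        # schedule SIGALRM
    try:
        return func()
    finally:
        signal.alarm(0)          # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

For example, `run_with_timeout(lambda: time.sleep(10), 2)` raises `Timeout` after about 2 seconds, while `run_with_timeout(lambda: 6 * 7, 2)` returns normally.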

After a bunch more digging in the logs, I'm finding some revids that never seem to finish scoring. E.g., https://ores.wmflabs.org/v3/scores/enwiki/559565916/damaging?features hangs indefinitely.
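To tell "never finishes" apart from "merely slow", a bounded probe against a URI like the one above is handy. A sketch with stdlib urllib (the timeout value is arbitrary):

```python
import time
import urllib.request

def probe(url, timeout=30):
    """Fetch `url` with a hard socket timeout.

    Returns (body, elapsed_seconds). A slow request comes back with
    its timing; one that never sends anything raises URLError /
    socket.timeout after `timeout` seconds instead of hanging the
    probe forever.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    return body, time.perf_counter() - start
```

E.g. `probe("https://ores.wmflabs.org/v3/scores/enwiki/559565916/damaging?features", timeout=60)` either returns the score with its latency or raises after a minute.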

But I can run the model locally and get a score quickly enough. Note that running the model locally requires a lot of preamble (e.g. loading the model), so the timing should be faster when everything is already in memory.

$ time revscoring score models/enwiki.damaging.gradient_boosting.model --host=https://en.wikipedia.org 559565916
559565916	{"probability": {"false": 0.8600289360409357, "true": 0.13997106395906433}, "prediction": false}

real	0m8.595s
user	0m8.007s
sys	0m0.343s

OK I just replicated this on ores-worker-04:

time revscoring score /srv/ores/config/submodules/editquality/models/enwiki.damaging.gradient_boosting.model --host https://en.wikipedia.org 559565916
559565916	{"probability": {"false": 0.8600289360409357, "true": 0.13997106395906433}, "prediction": false}

real	0m10.095s
user	0m9.568s
sys	0m0.392s
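The "preamble" point can be made concrete with a stand-in: load cost dominates when a model is re-loaded on every invocation (as the CLI timings above do) but amortizes away in a long-lived worker. Everything below (the pickled blob, the no-op score()) is an illustrative placeholder, not revscoring's API:

```python
import pickle
import time

# Stand-in for a scoring model: a large pickled blob whose load time
# is non-trivial, playing the role of the gradient boosting model.
BLOB = pickle.dumps(list(range(500_000)))

def load_model():
    return pickle.loads(BLOB)           # the one-time "preamble"

def score(model, rev_id):
    return {"prediction": False}        # placeholder for predict()

# Cold: reload the model for every request (what each CLI run does).
t0 = time.perf_counter()
for _ in range(5):
    m = load_model()
    score(m, 559565916)
cold = time.perf_counter() - t0

# Warm: load once, score repeatedly (what a celery worker does).
t0 = time.perf_counter()
m = load_model()
for _ in range(5):
    score(m, 559565916)
warm = time.perf_counter() - t0

print(f"cold: {cold:.4f}s  warm: {warm:.4f}s")
```

The warm loop should come out well under the cold one, which is why the ~10s CLI timing is an upper bound on what a worker should need per score.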

Either way, the model should still be fast enough to complete before our timeout, so that's a dead end.

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.