Our ORES workers are maxing out in labs. But our request rate has dropped in half. What is going on?
Description
Description
Event Timeline
Comment Actions
From IRC while I have been thinking through this:
[08:08:32] <halfak> Looks like our ability to process requests is diminished. [08:09:29] >chanserv< quiet #wikimedia-ai icinga-wm [08:09:30] * ChanServ sets quiet on *!*@wikimedia/bot/icinga-wm [08:09:35] <halfak> I told icinga to shuttup. [08:09:44] <halfak> So now to look into what's going on. [08:45:31] <halfak> Something really weird is going on. [08:45:59] <halfak> I just cleared out celery and restarted everything and we're still getting *pinned* on 3/4 of our celery workers. [08:46:03] <halfak> There was no lull. [08:46:35] <halfak> And AFAICT, there's not a queue of incoming requests that is filling up. [08:46:47] <halfak> Something else is using a ton of CPU on these machines. [08:47:19] <halfak> I have looked through the user agents. There's no clear indication of anyone. It's almost all just "-" [08:47:49] <halfak> I suspect that celery is spinning on its own, so I'm going to try temporarily disabling the load balancer -- thus cutting outside traffic and monitoring what celery does. [08:54:14] <halfak> OK it looks like CPU usage on the workers has fallen, but we still have one worker process on each machine taking up 100% of a core. [08:54:43] <halfak> This should be impossible. We are using signal based timeouts that should eliminate the possibility of a run-away celery process. But here it is happening.
Comment Actions
After a bunch more digging in the logs, I'm finding some revids that never seem to finish scoring. E.g. https://ores.wmflabs.org/v3/scores/enwiki/559565916/damaging?features never seems to finish.
But I can run the model locally and get a score quickly enough. Note running the model locally requires a lot of pre-amble so the timing should be faster when everything is already in memory.
$ time revscoring score models/enwiki.damaging.gradient_boosting.model --host=https://en.wikipedia.org 559565916 559565916 {"probability": {"false": 0.8600289360409357, "true": 0.13997106395906433}, "prediction": false} real 0m8.595s user 0m8.007s sys 0m0.343s
Comment Actions
OK I just replicated this on ores-worker-04:
time revscoring score /srv/ores/config/submodules/editquality/models/enwiki.damaging.gradient_boosting.model --host https://en.wikipedia.org 559565916 559565916 {"probability": {"false": 0.8600289360409357, "true": 0.13997106395906433}, "prediction": false} real 0m10.095s user 0m9.568s sys 0m0.392s
It still should be fast enough to complete in time. So that's a dead end.