Page MenuHomePhabricator

ORES redis: max number of clients reached...
Closed, ResolvedPublic

Description

On Sept 26 ~ 01:05 UTC we found out that ores was erroring with redis complaining that max number of clients reached...

@CDanis restarted celery workers, and uwsgi and later depooled codfw and pooled eqiad. It appears this brought the service back.

https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1601066376291&to=1601089247711


SAL:

02:04 cdanis@cumin2001: conftool action : set/pooled=false; selector: dnsdisc=ores,name=eqiad
02:04 cdanis@cumin2001: conftool action : set/pooled=true; selector: dnsdisc=ores,name=codfw
01:56 cdanis: ❌cdanis@cumin2001.codfw.wmnet ~ 🕙🍺 sudo cumin 'A:ores and A:codfw' 'systemctl restart celery-ores-worker.service uwsgi-ores.service '
01:48 cdanis@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=ores,name=codfw
01:48 cdanis@cumin1001: conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad
01:17 cdanis: ❌cdanis@ores2001.codfw.wmnet ~ 🕤🍺 sudo systemctl restart uwsgi-ores.service
01:11 cdanis: ✔️ cdanis@ores2001.codfw.wmnet ~ 🕘🍺 sudo systemctl restart celery-ores-worker.service

Event Timeline

jijiki created this task.Sep 26 2020, 8:34 AM
Restricted Application added a project: Machine Learning Platform. · View Herald TranscriptSep 26 2020, 8:34 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

"busy web workers" graph, which correlates quite well with the slowdown: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&from=1600951547072&to=1601099479878

The system almost went into this state on Thursday, but managed to recover. On Friday it deadlocked and wasn't able to.

Below is some speculation as to the cause and some thoughts on followup actions to be taken (all from #wikimedia-operations at the time). Times are US Eastern (one of these days I'll get glowing-bear to output UTC...).

21:51:40	<cdanis>	what I'm suspecting is happening is that there's no timeouts for the redis requests, or a timeout so long that the query gets killed by the client or at the traffic layer before it has a chance to get through the queueing in redis 
21:51:53	<cdanis>	and so then once you get into this state you never have any 'good' throughput
21:54:00	<cdanis>	okay, eqiad seems to be happily handling this traffic
21:54:31	<cdanis>	it's been long enough for TTLs for records pointing to codfw to clear, going to percussively restart things there
21:55:01	<chrisalbon>	sounds good, yeah looks good, API is up and serving fine
21:55:03	<cdanis>	the redises there are *still* hung on too many clients, btw, and it shouldn't've been getting new queries for at least 3 minutes now
21:55:18	<cdanis>	so once we get into this state, it seems we don't work our way out of it
22:00:01	<cdanis>	the number of steady-state connections from the ORES hosts in codfw (which were just restarted and haven't gotten traffic yet) is about 1/3rd to 1/2 of what seems to be the ceiling of allowed # of redis connections, btw
22:00:17	<cdanis>	so that's a second thing that should probably be fixed, not a comfortable amount of headroom
22:00:54	<cdanis>	https://phabricator.wikimedia.org/P12803
22:01:36	<cdanis>	okay, I think ores@codfw is healthier now
22:02:40	<chrisalbon>	okay thanks, ill make tickets for those two fixes so we don't return to this situation on a future friday night
22:02:49	<cdanis>	I'm actually going to restore the traffic flow as it was before, depooling eqiad and repooling codfw, because I'm not 100% sure it is meant to be active/active, and that's not the kind of change I want to make on a Friday night :)
22:03:15	<chrisalbon>	cool
22:03:36	<cdanis>	thanks! can you file a third to verify that ores dnsdisc should have both DCs pooled?
22:03:47	<chrisalbon>	yep

every time it starts at 16:00

calbon added a comment.EditedSep 26 2020, 7:41 PM

It started happening again, I went into each Ores200X box and manually sudo service uwsgi-ores restart to restart the service.

Edit: 30 minutes later I checked back and it appears to be working fine after the restart.

SD0001 added a subscriber: SD0001.Sep 27 2020, 5:59 PM

I have a bot that queries ORES a lot (I believe most of its requests are probably being fulfilled from the caches) and indeed it's persistently been running into 500 errors since 25 Sep.

I've sudo service uwsgi-ores restart on all ORES200x boxes again. This fixes the problem temporarily (~24 hours)
and tomorrow we'll start on a permanent solution

calbon added a subscriber: Halfak.Sep 27 2020, 8:29 PM

Any ideas on causes and solutions @Halfak?

awight added a subscriber: awight.Sep 28 2020, 7:44 AM

Possibly related to T181632: Celery manager implodes horribly if Redis goes down. In the past, Redis was a single point of failure and if Celery could not connect, it would lock up unrecoverably. We thought this was fixed, but it's worth keeping in mind as a possibility.

ArielGlenn triaged this task as High priority.Sep 28 2020, 9:49 AM

Possibly related to T181632: Celery manager implodes horribly if Redis goes down. In the past, Redis was a single point of failure and if Celery could not connect, it would lock up unrecoverably. We thought this was fixed, but it's worth keeping in mind as a possibility.

It seems that this case is about the uwsgi connections on the score cache redis (port 6380) and not the celery one (port 6379)

I'm not familiar with this problem. Anything change with the deployment recently? Did any overload errors happen during the outage? If not, that would line up with uwsgi-level failures.

Nope no deploys have happened recently. It has been happening every few hours since the 24th

Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing down ores: https://logstash.wikimedia.org/goto/f91cdb8acc0f46a02bd7552ffaf0efd4

  • Suddenly, there's a 1M more requests a day by OKAPI, have this been counted for? specially during the switchover that we are basically half the capacity.
  • The user agent given by OKAPI is wrong and has a typo. It's "okpi@wikimedia.org" which would be correct if you wanted ocean of pi
  • The requests OKAPI is making are exactly most recent edits (without any delay), well, beside CP and Mediawiki, now we have a third one that is causing the race condition on ores and this would explain the web nodes starving redis cache connections, because they see it's already being scored by another thread, so they wait for it to come back while holding the connection open.

I highly recommend turning the ores part of OKAPI off for no w to see if ores comes back to normal. If they want to continue, I'd recommend, at least giving it a delay so other parts of ores can compute and cache it.

Joe added a subscriber: Joe.Sep 30 2020, 6:18 AM

Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing down ores: https://logstash.wikimedia.org/goto/f91cdb8acc0f46a02bd7552ffaf0efd4

  • Suddenly, there's a 1M more requests a day by OKAPI, have this been counted for? specially during the switchover that we are basically half the capacity.

My expectation would be that if such a volume of requests is going to happen, we (SRE and the service owners) would be contacted beforehand. If this was any other external client, we would just ban it with no second thoughts.

Having said that, it's probably time we invest some engineering in ensuring ORES cannot be overwhelmed like that. Ideas could be:

  • Adding rate-limiting to the envoy frontends
  • Separating a smallish cluster for external requests, and dedicating the larger portion of computing power to our internal needs (although this would need some redis trickery to be effective)

Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing down ores: https://logstash.wikimedia.org/goto/f91cdb8acc0f46a02bd7552ffaf0efd4

  • Suddenly, there's a 1M more requests a day by OKAPI, have this been counted for? specially during the switchover that we are basically half the capacity.

My expectation would be that if such a volume of requests is going to happen, we (SRE and the service owners) would be contacted beforehand. If this was any other external client, we would just ban it with no second thoughts.

Yes, we would.

Having said that, it's probably time we invest some engineering in ensuring ORES cannot be overwhelmed like that. Ideas could be:

  • Adding rate-limiting to the envoy frontends
  • Separating a smallish cluster for external requests, and dedicating the larger portion of computing power to our internal needs (although this would need some redis trickery to be effective)

Yes, a separation of concerns is needed here I think. That is, powering RC and external users using the same infrastructure is prone to those kind of issues. Ideally we should just be able to run different ORES installations for different purposes so that the needs of one set of users don't end up causing issues to the other sets.

This sounds good to me, my only worry is that it would be too much work for something that's going to be replaced in the soon-ish future. I leave the decision to you.

Okapi team, please consider following the https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_revision_score event stream rather than attaching to the on-demand service.

Hey, connecting with folks on ORES team around this today - sorry we were given advice that the ORES stream may have some data integrity issues when implementing since we need the whole corpus. Pulling in our engineers to the conversation to elaborate. We'll dig back into the streams and discuss today on the call to move things over.

Also hah at Ocean of Pi, that would also be a good team name :)). We'll fix the typo though

Joe added a comment.EditedSep 30 2020, 4:46 PM

Hey, connecting with folks on ORES team around this today - sorry we were given advice that the ORES stream may have some data integrity issues when implementing since we need the whole corpus. Pulling in our engineers to the conversation to elaborate. We'll dig back into the streams and discuss today on the call to move things over.

Also hah at Ocean of Pi, that would also be a good team name :)). We'll fix the typo though

As a general advice, if you plan on calling a WMF production service with a large amount of requests, you should warn the relevant teams beforehand. Doing so can not only ensure SREs check that the volume of calls is not excessive when you deploy, but also that we spend a smaller amount of time trying to find the root cause of a production problem in case it happens.

Please let us know the next time :)

Hey, connecting with folks on ORES team around this today - sorry we were given advice that the ORES stream may have some data integrity issues when implementing since we need the whole corpus. Pulling in our engineers to the conversation to elaborate. We'll dig back into the streams and discuss today on the call to move things over.

If there are scores missing from the stream, feel free to hit the endpoint but I suggest waiting for scores from the stream before hitting ores so that way you filter at least 99% of them and then hit us for the 1% (which I'm pretty sure the stream is much better than 99% but let's assume it's so bad that's two 9s)

ACraze added a subscriber: ACraze.EditedSep 30 2020, 7:11 PM

If there are scores missing from the stream, feel free to hit the endpoint but I suggest waiting for scores from the stream before hitting ores so that way you filter at least 99% of them and then hit us for the 1% (which I'm pretty sure the stream is much better than 99% but let's assume it's so bad that's two 9s)

Agreed on using the stream when applicable, otherwise hit the endpoint and batch the revisions to score, if possible. You can batch up to 50 revids per request, see: https://www.mediawiki.org/wiki/ORES#API_usage

edit: Also implementing a delay between multiple requests on the endpoint could be helpful as well.

Around 23:48 we got another user report in #wikimedia-ai of Redis
exceptions when using the ORES API.

Joe added a comment.EditedOct 1 2020, 6:23 AM

The request hammering from okapi is continuing, we might need to ban the UA at the edge.

@RBrounley_WMF can you ensure that the requests to ORES are deactivated for the time being?

The graph below shows clearly the okapi traffic didn't go down after this was reported:

My expectation was that after we reported this problem to you, you'd involve the engineers and ask them to stop the process. This clearly didn't happen.

If the flux of requests doesn't stop, I'll just have to ban or rate-limit the user-agent for the time being.

Change 631385 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache-text: add throttling for calls to ORES from the OKAPI

https://gerrit.wikimedia.org/r/631385

Joe added a comment.Oct 1 2020, 7:16 AM

I've created the patch above, that can be used in case the okapi causes further problems to ORES. It's here as a stopgap measure if issues show up again today.

Joe added a comment.Oct 1 2020, 8:38 AM

I also noticed that the number of requests for /v3/precache has increased a lot around the time of the issue. This points to Change-Propagation being a contributor to the issues too.

Specifically, looking at the logs on the servers I found the following number of calls to /v3/precache:

Sep 21: 2M
Sep 22: 2M
Sep 23: 2 M
Sep 24: 2.2M
Sep 25: 2.35 M
Sep 26: 2.4M
Sep 27: 2.3M
Sep 28: 2.2M
Sep 29: 2.15M
Sep 30: 2 M

I didn't dig any deeper, but a 20% increase in calls to changeprop at peak seems relevant as well. Overall the number of requests ORES had to manage moved from an average of 3.3M to over 4M due to the increase in requests to change-propagation and the addition of the OKAPI ~ 500k calls per day.

It seems like ORES right now is operating almost "at-capacity", and the additional traffic just made it get over the threshold and in what looks like a positive feedback loop between redis maxing out connections and busy web workers.

It seems like ORES right now is operating almost "at-capacity", and the additional traffic just made it get over the threshold and in what looks like a positive feedback loop between redis maxing out connections and busy web workers.

I can't say with 100% certainty but it's not just the higher number of requests, it's that the fact they are being requested immediately after the edit (I manually double checked several requests coming from OKAPI) while ChangePropagation and mediawiki are sending the exact same request so uwsgi workers are stuck waiting for results all at the same time. To test this hypotheses if OKAPI just delays the requests for 1min or 30s, I have a high confidence that we wouldn't run into issues (while the number of requests stays the same)

We’re working to patch up our end, switching to streams with querying the Ores api when streams fail. Sorry will update soon

Joe added a comment.Oct 2 2020, 9:10 AM

Checkin in to report that calls from OKAPI have stopped tonight. Thanks @RBrounley_WMF (and the team)!

So if we still see the starvation pattern, we have to search for the root cause further.

calbon added a comment.Oct 2 2020, 3:12 PM

I just had to do another service restart.

Joe added a comment.Oct 2 2020, 3:57 PM

Small status update:

in order to grant everyone a quieter weekend (hopefully!), we've repooled eqiad and raised manually the max clients for rdb2003:6380 to 10k (from 4064). This should give us some capacity and growth room.

Change 632662 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] redis::instance: raise the file limit to match maxclients

https://gerrit.wikimedia.org/r/632662

calbon added a comment.Oct 8 2020, 9:28 PM

Suspiciously similar just happened to ores on eqiad. I reset the uwsgi service on all ores100x boxes and will monitor.

Change 633578 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Make redis connections for cache slightly healthier

https://gerrit.wikimedia.org/r/633578

^ This will make redis connection handling slightly healthier but I can't say it will handle this case as observability of ores is not great

Change 632662 merged by Giuseppe Lavagetto:
[operations/puppet@production] redis::instance: raise the file limit to match maxclients

https://gerrit.wikimedia.org/r/632662

Change 638467 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] ores: Reduce number of requests that trigger a restart of uwsgi worker

https://gerrit.wikimedia.org/r/638467

Mentioned in SAL (#wikimedia-operations) [2020-11-05T17:52:38Z] <akosiaris> restart uwsgi-ores in all ores1* nodes per complaint on IRC that max redis clients have been reached T263910

This will very likely fix it: https://github.com/wikimedia/ores/pull/352 I can deploy it if @akosiaris or @calbon approve the PR

calbon added a comment.Fri, Nov 6, 7:18 PM

@akosiaris Can you review it? I don't know enough about the nodes vs redis connection to intelligently review.

Change 639898 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] ores: Reduce number of uwsgi workers to 50

https://gerrit.wikimedia.org/r/639898

@akosiaris Can you review it? I don't know enough about the nodes vs redis connection to intelligently review.

Sure, I 've been reviewing already.

Change 639898 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Reduce number of uwsgi workers to 50

https://gerrit.wikimedia.org/r/639898

Change 638467 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Reduce number of requests that trigger a restart of uwsgi worker

https://gerrit.wikimedia.org/r/638467

calbon added a comment.Mon, Nov 9, 2:28 PM
This comment was removed by calbon.
Ladsgroup closed this task as Resolved.Thu, Nov 12, 5:59 PM
Ladsgroup claimed this task.

Guess when changes got merged: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&from=1604640082823&to=1605203814766

If it happens again, I'll use other tools in my toolbox.

Restricted Application added a project: User-Ladsgroup. · View Herald TranscriptThu, Nov 12, 5:59 PM
Maintenance_bot moved this task from Incoming to Done on the User-Ladsgroup board.Thu, Nov 12, 6:15 PM