
~3000% increase in session redis memory usage, causing evictions and session loss
Closed, Resolved (Public)

Description

Bytes in on the application server cluster increased by about 40% over the past day:

This matches an increase in bytes out on the memcached / session redis hosts:

It seems that the issue is with redis usage rather than memcached; redis memory usage jumped from ~15 MB to >500 MB:

This is forcing redis to evict keys:

Event Timeline

ori created this task. Jan 29 2016, 10:58 PM
ori raised the priority of this task to Unbreak Now!
ori updated the task description.
ori added a subscriber: ori.
Restricted Application added a subscriber: Aklapper. Jan 29 2016, 10:58 PM
greg added a subscriber: greg. Jan 29 2016, 10:59 PM

Change 267387 had a related patch set uploaded (by Anomie):
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267387

Change 267388 had a related patch set uploaded (by BryanDavis):
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267388

Change 267388 abandoned by BryanDavis:
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267388

Change 267388 restored by BryanDavis:
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267388

Change 267388 had a related patch set uploaded (by BryanDavis):
SessionManager: Don't save non-persisted sessions to backend storage

https://gerrit.wikimedia.org/r/267388

Change 267387 merged by jenkins-bot:
SessionManager: Don't save non-persisted sessions to backend storage

https://gerrit.wikimedia.org/r/267387

Change 267388 merged by jenkins-bot:
SessionManager: Don't save non-persisted sessions to backend storage

https://gerrit.wikimedia.org/r/267388

On investigation, @Anomie determined that SessionManager was potentially fetching data from the backing cache multiple times while handling a single request. https://gerrit.wikimedia.org/r/267387 (& the https://gerrit.wikimedia.org/r/267388 cherry-pick) add an in-process cache to stop this. TCP dump traces taken by @ori appeared to show ~7 fetches of the same session for a single request before the patch was applied.

What seems to have been happening here is two related problems:

  1. SessionManager was saving data (expiry 1 hour) for every anon pageview. This caused memory usage to skyrocket, and eventually start evicting keys. That's T125194.
  2. SessionManager was loading the session data from redis several times per request, which caused an increase in the traffic to and from redis.

#2 was solved by introducing an in-memory cache, so each session's data would be loaded only once per request. It's possible that data still might be saved multiple times per request, however. We can look into mitigating that if that too is a problem.
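
As a rough illustration of the fix for #2 (not the actual SessionManager code, which is PHP), a per-request cache in front of the session store could look like the sketch below. `RequestSessionCache` and `CountingBackend` are hypothetical names, and the backend is a plain dict standing in for redis.

```python
from typing import Callable, Dict, Optional


class RequestSessionCache:
    """Per-request memo in front of a slower session store (hypothetical)."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]) -> None:
        self._fetch = fetch  # e.g. a function that performs the redis GET
        self._loaded: Dict[str, Optional[dict]] = {}

    def load(self, session_id: str) -> Optional[dict]:
        # Hit the backend only the first time this id is seen in the request.
        if session_id not in self._loaded:
            self._loaded[session_id] = self._fetch(session_id)
        return self._loaded[session_id]


class CountingBackend:
    """Stand-in for redis that counts how many fetches it receives."""

    def __init__(self, data: Dict[str, dict]) -> None:
        self.data = data
        self.fetches = 0

    def get(self, session_id: str) -> Optional[dict]:
        self.fetches += 1
        return self.data.get(session_id)


backend = CountingBackend({"abc": {"user": "Example"}})
cache = RequestSessionCache(backend.get)

# Seven loads within one request, as seen in the traces before the patch.
for _ in range(7):
    cache.load("abc")

print(backend.fetches)  # 1: only the first load reaches the backend
```

The cache object is assumed to live for one request only, so it never serves data that another request has since changed.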

#1 was solved by only saving to redis when the session is persisted.[1] Non-persisted sessions are saved in-memory only, as any data they might contain is not needed outside the context of the individual request. If this breaks anything, the solution is for that thing to persist the session like it should have been doing in the first place. We note that the baseline memory usage is still increased by (estimated) about 350–400 bytes per session, due to the extra metadata being stored to support the improved security checks offered by SessionManager.

[1]: A "persisted" session is one that has had cookies sent to the user, or whatever is equivalent.
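
The rule for #1 can be sketched in the same way. This is a minimal illustration under assumed names (`Session`, `SessionStore`, `persist`), not the real SessionManager API; both stores are dicts, with `backend` standing in for redis.

```python
class Session:
    def __init__(self, session_id: str) -> None:
        self.id = session_id
        self.data: dict = {}
        # Becomes True once cookies (or an equivalent) are sent to the user.
        self.persisted = False

    def persist(self) -> None:
        self.persisted = True


class SessionStore:
    def __init__(self) -> None:
        self.backend: dict = {}     # stand-in for redis
        self.in_process: dict = {}  # lives only for the current request

    def save(self, session: Session) -> None:
        # Every session is kept in memory for the rest of the request.
        self.in_process[session.id] = dict(session.data)
        # Only persisted sessions are written to the shared backend.
        if session.persisted:
            self.backend[session.id] = dict(session.data)


store = SessionStore()

anon = Session("anon-1")       # anonymous pageview, never persisted
store.save(anon)

logged_in = Session("user-1")  # persisted, e.g. after login
logged_in.persist()
store.save(logged_in)

print(sorted(store.backend))  # ['user-1']
```

With this rule an anonymous pageview adds nothing to the backend, so backend memory grows with the number of persisted sessions rather than with pageviews.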

Via https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=&vl=&x=&n=&hreg[]=mc10[0-9][0-9]*&mreg[]=used_memory&gtype=line&glegend=show&aggregate=1&embed=1:

Baseline is certainly up a bit, as expected given the increased base session contents, but the runaway growth seems to have been taken care of.

ori closed this task as Resolved. Jan 30 2016, 8:50 AM
ori claimed this task.
Tgr added a subscriber: Tgr. Feb 14 2016, 8:24 PM

> We note that the baseline memory usage is still increased by (estimated) about 350–400 bytes per session, due to the extra metadata being stored to support the improved security checks offered by SessionManager.

Seems to be about a 4x increase.
http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Memcached+eqiad&h=mc1001.eqiad.wmnet&jr=&js=&v=523764096.000000&m=used_memory&vl=bytes&ti=used_memory

It doesn't look so bad if you look at all the servers instead of just mc1001: https://ganglia.wikimedia.org/latest/stacked.php?m=used_memory&c=Memcached%20eqiad&r=week&st=1455218004&host_regex=


It seems that 14 servers were in use before, but only 8 are in use now for some reason.