
~3000% increase in session redis memory usage, causing evictions and session loss
Closed, Resolved (Public)

Assigned To
Authored By
ori
Jan 29 2016, 10:58 PM
Referenced Files
F3359784: stacked.php.png
Feb 15 2016, 2:08 AM
F3359124: graph.png
Feb 14 2016, 8:24 PM
F3292921: graph.png
Jan 30 2016, 6:51 AM
F3291688: usedmem.png
Jan 29 2016, 10:58 PM
F3291664: memcaches.png
Jan 29 2016, 10:58 PM
F3291699: evict.png
Jan 29 2016, 10:58 PM
F3291671: abcd.png
Jan 29 2016, 10:58 PM

Description

Bytes in on the application server cluster increased by about 40% over the past day:

abcd.png (237×577 px, 23 KB)

This matches an increase in bytes out on the memcached / session redis hosts:

memcaches.png (237×577 px, 20 KB)

The issue appears to be with redis usage rather than memcached: redis memory usage jumped from ~15 MB to >500 MB:

usedmem.png (419×577 px, 30 KB)
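As a rough check on the "~3000%" figure in the task title, the percentage increase can be computed from the two approximate readings above. This is only a sketch: 15 and 500 are the rounded values quoted in the description, and since the later reading is ">500 MB" the result is a lower bound.

```python
def percent_increase(before, after):
    """Return the relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Approximate readings from the description, in MB (assumed rounded values).
print(round(percent_increase(15, 500)))  # 3233
```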

This is forcing redis to evict keys:

evict.png (293×577 px, 19 KB)

Event Timeline

ori raised the priority of this task to Unbreak Now!.
ori updated the task description.
ori subscribed.

Change 267387 had a related patch set uploaded (by Anomie):
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267387

Change 267388 had a related patch set uploaded (by BryanDavis):
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267388

Change 267388 abandoned by BryanDavis:
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267388

Change 267388 restored by BryanDavis:
SessionManager: Don't save sessions until they're persisted

https://gerrit.wikimedia.org/r/267388

Change 267388 had a related patch set uploaded (by BryanDavis):
SessionManager: Don't save non-persisted sessions to backend storage

https://gerrit.wikimedia.org/r/267388

Change 267387 merged by jenkins-bot:
SessionManager: Don't save non-persisted sessions to backend storage

https://gerrit.wikimedia.org/r/267387

Change 267388 merged by jenkins-bot:
SessionManager: Don't save non-persisted sessions to backend storage

https://gerrit.wikimedia.org/r/267388

On investigation, @Anomie determined that SessionManager was potentially fetching data from the backing cache multiple times while handling a single request. https://gerrit.wikimedia.org/r/267387 (and its cherry-pick, https://gerrit.wikimedia.org/r/267388) adds an in-process cache to stop this. tcpdump traces taken by @ori appeared to show ~7 fetches of the same session for a single request before the patch was applied.

Two related problems seem to have been happening here:

  1. SessionManager was saving data (expiry 1 hour) for every anon pageview. This caused redis memory usage to skyrocket, and redis eventually started evicting keys. That's T125194.
  2. SessionManager was loading the session data from redis several times per request, which increased the traffic to and from redis.

#2 was solved by introducing an in-memory cache, so each session's data would be loaded only once per request. It's possible that data still might be saved multiple times per request, however. We can look into mitigating that if that too is a problem.
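The in-memory cache idea can be sketched as follows. This is a minimal illustration in Python, not the actual SessionManager code (which is PHP); `CountingBackend`, `RequestSessionCache`, and `demo_fetch_count` are hypothetical names, and the backend is a plain dict standing in for redis.

```python
class CountingBackend:
    """Hypothetical stand-in for the redis session store; counts fetches."""

    def __init__(self, data):
        self._data = dict(data)
        self.fetches = 0

    def get(self, key):
        self.fetches += 1
        return self._data.get(key)


class RequestSessionCache:
    """Per-request cache: each session key hits the backend at most once."""

    _MISSING = object()

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def load(self, key):
        cached = self._cache.get(key, self._MISSING)
        if cached is self._MISSING:
            cached = self._backend.get(key)
            self._cache[key] = cached  # remember misses (None) as well
        return cached


def demo_fetch_count(loads=7):
    """Load the same session `loads` times and report backend fetches."""
    backend = CountingBackend({"session:abc": {"user": "anon"}})
    cache = RequestSessionCache(backend)
    for _ in range(loads):
        cache.load("session:abc")
    return backend.fetches
```

With a cache object created once per request, seven loads of the same session result in a single backend fetch. The cache is deliberately scoped to one request so that it never serves data that another request has since changed.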

#1 was solved by only saving to redis when the session is persisted.[1] Non-persisted sessions are saved in-memory only, as any data they might contain is not needed outside the context of the individual request. If this breaks anything, the solution is for that thing to persist the session like it should have been doing in the first place. We note that the baseline memory usage is still increased by (estimated) about 350–400 bytes per session, due to the extra metadata being stored to support the improved security checks offered by SessionManager.

[1]: A "persisted" session is one that has had cookies sent to the user, or whatever is equivalent.
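A minimal sketch of the "only save when persisted" rule, again in Python rather than the real PHP implementation. `DictBackend`, `Session`, `persist()`, and `save()` are hypothetical names chosen for illustration; the 3600-second TTL mirrors the 1-hour expiry mentioned above.

```python
class DictBackend:
    """Hypothetical stand-in for the redis session store."""

    def __init__(self):
        self.saved = {}

    def set(self, key, value, ttl):
        self.saved[key] = (value, ttl)


class Session:
    """Session whose data stays in memory until it is persisted."""

    def __init__(self, session_id, backend, ttl=3600):
        self.id = session_id
        self.data = {}
        self.persisted = False
        self._backend = backend
        self._ttl = ttl

    def persist(self):
        """Mark the session as persisted, e.g. once cookies have been sent."""
        self.persisted = True

    def save(self):
        """Write to the backend only if persisted; return True if written."""
        if not self.persisted:
            return False  # data lives only for the current request
        self._backend.set(self.id, dict(self.data), self._ttl)
        return True
```

Under this sketch, an anon pageview that never calls `persist()` leaves nothing in the backend, which is the behavior that stops the per-pageview key growth.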

Via https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=&vl=&x=&n=&hreg[]=mc10[0-9][0-9]*&mreg[]=used_memory&gtype=line&glegend=show&aggregate=1&embed=1:

graph.png (569×747 px, 49 KB)

The baseline is certainly up a bit, as expected given the increased base session contents, but the runaway growth seems to have been taken care of.

ori claimed this task.

> We note that the baseline memory usage is still increased by (estimated) about 350–400 bytes per session, due to the extra metadata being stored to support the improved security checks offered by SessionManager.

Seems to be about a 4x increase.
http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Memcached+eqiad&h=mc1001.eqiad.wmnet&jr=&js=&v=523764096.000000&m=used_memory&vl=bytes&ti=used_memory

graph.png (373×747 px, 26 KB)

It doesn't look so bad if you look at all the servers instead of just mc1001: https://ganglia.wikimedia.org/latest/stacked.php?m=used_memory&c=Memcached%20eqiad&r=week&st=1455218004&host_regex=

stacked.php.png (457×781 px, 55 KB)

It seems that 14 servers were in use before, but only 8 are in use now for some reason.