Establish an SLA for session storage
Closed, ResolvedPublic

Description

Sessions are currently stored in Redis, a highly-optimized in-memory store with request latency reportedly on the order of ~1 ms. Given the nature of the proposed session storage service, latency can only be higher than what we're accustomed to; it seems implied that we are striking a bargain to trade away some latency in return for multi-master replication, but we have not yet agreed upon a price.

We should establish an SLA for the proposed session storage service.

Per discussion below:

| Action | Redis (ms) | Multiplier | Mean latency (ms) | p90 latency (ms) | Error rate |
| GET | 1 | 2 | 2 | 3 | 1% |
| SET | 45 | 1 | 45 | 75 | 1% |
| DELETE | 1 | 10 | 10 | 15 | 1% |

Event Timeline

Eevans triaged this task as Medium priority. Dec 11 2018, 8:41 PM
Eevans created this task.

Had an extended conversation with @Eevans about this on IRC today. His values are good with me -- as long as he/Core Platform feel that session storage is fast enough I'm happy to take the increased latency in exchange for a move toward multi-master.

@aaron and @Krinkle may have more insight into current actual performance of session storage, though there's good reason to think that what's in Grafana at the moment is quite wrong.

Overall, this is a tricky question because we 1) don't have an SLA on page loads and 2) don't have any visibility on current performance, so we're operating pretty blind on setting a number. I would much rather implement and react than overthink upfront.

In practice these numbers will be lower, but it's good to have headroom and account for possible future increase in traffic to session storage.

It's true that we haven't been systematic about performance numbers in our system in general and I'd also like to see us start doing that whenever possible. Eventually, we should be able to have the full picture :)

I don't think we need to overthink this, but knowing what kind of latency increase we can expect might drive us to choices of implementation technologies different from the ones we currently picked.

So we need to understand two things:

  • How many times per user request we access the session data. Right now we separate every tiny piece of it and so it might mean we make - say - 50 requests to the session redis per user request.
  • How much time we spend performing those requests to redis collectively.

Any number we come up with as a Service Level Objective needs to start from those two pieces of data. AIUI neither of those things is currently measured, either.
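
As a rough illustration of the first measurement, something like the sketch below could count session-store accesses per web request. This is a hypothetical helper, not existing code; the class name, metric name, and wiring are assumptions.

use MediaWiki\MediaWikiServices;

// Hypothetical sketch: count session-store accesses during a web request and
// report the total to statsd when the request shuts down.
class SessionStoreAccessCounter {
	/** @var int */
	private static $count = 0;

	public static function record() {
		if ( self::$count === 0 ) {
			// First access in this request: arrange a single flush at shutdown
			register_shutdown_function( function () {
				MediaWikiServices::getInstance()->getStatsdDataFactory()->gauge(
					'session.store.accesses_per_request',
					self::$count
				);
			} );
		}
		self::$count++;
	}
}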

I don't think we need to overthink this, but knowing what kind of latency increase we can expect might drive us to choices of implementation technologies different from the ones we currently picked.

So we need to understand two things:

  • How many times per user request we access the session data. Right now we separate every tiny piece of it and so it might mean we make - say - 50 requests to the session redis per user request.
  • How much time we spend performing those requests to redis collectively.

Any number we come up with as Service Level Objective needs to start from those two pieces of data. AIUI neither of those things is currently measured, too.

@Anomie is this something you can help us with? How many times do we go to session storage for a single request? What would it take to instrument storage so that we could collect (session storage) request latency?

I don't think we need to overthink this, but knowing what kind of latency increase we can expect might drive us to choices of implementation technologies different from the ones we currently picked.

So we need to understand two things:

  • How many times per user request we access the session data. Right now we separate every tiny piece of it and so it might mean we make - say - 50 requests to the session redis per user request.
  • How much time we spend performing those requests to redis collectively.

Any number we come up with as Service Level Objective needs to start from those two pieces of data. AIUI neither of those things is currently measured, too.

@Anomie is this something you can help us with? How many times do we go to session storage for a single request?

Short answer: We access the storage zero or one time for almost all requests.

Long answer: When we first deployed SessionManager, there were two major issues that came up in T125267: ~3000% increase in session redis memory usage, causing evictions and session loss: it was writing non-persisted sessions for anons into the storage, making memory usage explode, and it was reading sessions back from the storage multiple times per request. We fixed that by introducing an in-process caching layer, where non-persisted-session writes went only to the IPC and the multiple loads could load from the IPC subsequent times during the request. A new non-persisted session (that never gets persisted) shouldn't hit the storage backend at all now.

There's still the possibility of hitting the storage more than once for logins, logouts, password changes, and other things that change the session ID. Those should be rare enough to not be worried about.

Note that the current code "hitting the session storage" fetches all of the data for the session ID. If "right now we separate every tiny piece of it" means you're changing that somehow, then my explanation above likely doesn't apply.

What would it take to instrument storage so that we could collect (session storage) request latency?

If you want to measure the reads/writes of the storage itself, ignoring accesses of the in-process cache, probably the thing to do would be to either instrument the relevant BagOStuff class or to put an instrumenting wrapper around the storage BagOStuff before SessionManager::__construct() wraps it in a CachedBagOStuff (which implements the in-process cache). The alternative would be to find all the interesting uses of that object and add instrumenting code at each place, complicated by the need to first check whether the in-process cache already has it cached.
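
For the wrapper option, here is a minimal sketch of the timing side of such instrumentation, assuming roughly the current BagOStuff method signatures. The class name and metric keys are made up, and a real version would have to extend BagOStuff so SessionManager accepts it.

use MediaWiki\MediaWikiServices;

// Hypothetical sketch only: times get/set/delete against the underlying store
// and reports latencies to statsd. Signatures are approximate.
class TimingSessionStoreWrapper {
	/** @var BagOStuff */
	private $store;

	public function __construct( BagOStuff $store ) {
		$this->store = $store;
	}

	public function get( $key, $flags = 0 ) {
		return $this->timed( 'get', function () use ( $key, $flags ) {
			return $this->store->get( $key, $flags );
		} );
	}

	public function set( $key, $value, $exptime = 0, $flags = 0 ) {
		return $this->timed( 'set', function () use ( $key, $value, $exptime, $flags ) {
			return $this->store->set( $key, $value, $exptime, $flags );
		} );
	}

	public function delete( $key, $flags = 0 ) {
		return $this->timed( 'delete', function () use ( $key, $flags ) {
			return $this->store->delete( $key, $flags );
		} );
	}

	private function timed( $op, callable $callback ) {
		$start = microtime( true );
		$result = $callback();
		$ms = ( microtime( true ) - $start ) * 1000;
		MediaWikiServices::getInstance()->getStatsdDataFactory()
			->timing( "session.storage.$op", $ms );
		return $result;
	}
}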

I was asking because looking at what's currently stored in the "service", I see both mwsession objects (that are created I guess by the user session), and objects that have the form $wiki:echo:(alert|seen|message) which seem to be created by... Flow?

There is a huge number of such objects, which is a problem in itself - we have 2M echo objects compared to just 16k session objects.

So my next question would be - do those objects need to be active-active in writing like the session objects? How important is latency in accessing those? @Catrope maybe you know how to answer this question (or know who could)?

Is this part of redis_sessions? IOW, is it included in the rates shown here: https://grafana.wikimedia.org/d/000000174/redis?orgId=1&panelId=8&fullscreen ?

Yes it is. So we might have amply overestimated the number of requests we get for sessions specifically.

It might be the case that echo needs to use persistent storage too; from my discussion on IRC with @Tgr:

<tgr>	the code uses ObjectCache::getMainStashInstance()
<tgr>	* Stashes should be configured to propagate changes to all data-centers.
<tgr>	* Callers should be prepared for:
<tgr>	*   - a) Writes to be slower in non-"primary" (e.g. HTTP GET/HEAD only) DCs
<tgr>       *   - b) Reads to be eventually consistent, e.g. for get()/getMulti()
<tgr>	oh yeah it also says the data is not stored elsewhere
<tgr>	so someone should probably review usages of getMainStashInstance
<tgr>	at a glance it is used by Echo, CentralAuth, AbuseFilter, ConfirmEdit at least

So we need to take a hard look, because if it's data not stored elsewhere (which is... the opposite of what I'd expect from something called ObjectCache) and it needs to be written from both datacenters, then it needs a different data store than redis too. I'm not sure those data should live in the same place where we store our sessions (actually, I explicitly think they shouldn't).

Per the docs: Stash objects are BagOStuff instances suitable for storing lightweight data that is not canonically stored elsewhere (such as RDBMS). Stashes should be configured to propagate changes to all data-centers.

At a glance, used by AbuseFilter, FlaggedRevs, CentralAuth, CentralNotice, ConfirmEdit, Echo, TorBlock. Not sure if these all actually rely on data persistence (although that's really the only reason to use it instead of something more appropriate like the WAN cache). CentralAuth definitely needs it (it does its own session handling, basically), ConfirmEdit I think should be fine as long as the requests for captcha display and submission are guaranteed to go to the same DC; I'm not familiar with the rest.
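
For context, the usage pattern under discussion looks roughly like the sketch below (the key name, user ID, and TTL are purely illustrative, not Echo's actual code). The important property is that the value has no other canonical home, unlike WAN cache entries that can be regenerated from the database.

// Illustrative only; not Echo's actual code.
$stash = ObjectCache::getMainStashInstance();
$userId = 12345; // example

$key = $stash->makeKey( 'echo', 'seen-time', 'alert', $userId );

// Write: this value is not stored anywhere else.
$stash->set( $key, wfTimestampNow(), 86400 );

// Read: if the stash loses the key, the information is simply gone.
$seenTime = $stash->get( $key );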

Sessions are currently stored in Redis, a highly-optimized in-memory store with request latency reportedly on the order of ~1 ms.

Aren't we using Redis through nutcracker? That can't be 1ms, right?

$ mwscript maintenance/eval.php --wiki=enwiki
> $stash = ObjectCache::getMainStashInstance();
> $key = $stash->makeKey( 'flaggedrevs', 'statsUpdated' );
> list($usec, $sec) = explode(" ", microtime()); for ($i=0;$i<1000; $i++) { $stash->get('enwiki:flaggedrevs:statsUpdated'); }; list($usec1, $sec1) = explode(" ", microtime())
> var_dump($sec1 - $sec + $usec1 - $usec)
float(0.466629)

Getting that specific key 1000 times took 466 milliseconds if I read it correctly. So yes, it kinda is 1 ms or so.
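
For what it's worth, the same measurement reads a bit more directly with microtime( true ), which skips the string parsing (same eval.php session as above):

> $stash = ObjectCache::getMainStashInstance();
> $t = microtime( true ); for ( $i = 0; $i < 1000; $i++ ) { $stash->get( 'enwiki:flaggedrevs:statsUpdated' ); } $elapsed = microtime( true ) - $t;
> // total seconds for 1000 gets; conveniently, the same number is also the mean per-get latency in milliseconds
> var_dump( $elapsed );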

To add to what @Tgr found, we have to search for usage of MediaWikiServices::getInstance()->getMainObjectStash(); as that's what that method uses under the hood.

There are quite a few uses of it here:
https://codesearch.wmflabs.org/search/?q=getMainObjectStash&i=nope&files=&repos=

I would propose that whatever is not extremely valuable (so anything that's not in the session store) should be stored in memcached via mcrouter, so that we can have multi-dc writes and broadcast both writes and evictions if needed. We can't realistically support sub-millisecond latencies in the service we're designing, and most uses of this cache are indeed for caching purposes.

@aaron do you think using mcrouter for MainObjectStash is feasible?

Meanwhile, I guess we should create a new ticket to talk about how MainObjectStash is used and where to migrate it.

Circling back to the SLA, I think it's off by ~1 order of magnitude for errors (at the very least), and I don't think our p90 (which is not such a steep percentile to measure against; given our volumes I'd expect p95 to be more significant) should exceed 50 ms, tops.

I'm ok with a penalty due to the multi-dc nature of the new service, but given I doubt the current p99 exceeds 10 ms, I wouldn't want to make things so much worse.

Looking at the callers, some of them could get by on mcrouter. Ideally there would be some cheap disk persistence, but it's not the end of the world for those callers. As long as restarts and network problems are not extremely common, then it doesn't matter much. I do wonder about how mcrouter handles higher-volume *replicated* queries, which we currently avoid by only replicating purges. I don't know if it does any kind of pipelining/batching logic, since it uses the ASCII protocol. I remember having a hard time looking at the code. It seems to just have an internal in-memory queue and replicates in async threads. It's worth experimenting with.

I think, for now, that the current callers could get by on a separate wan cache pool prefix (a small puppet patch) that makes mcrouter replicate SET commands...and rely on the rareness of narrow race conditions. It's not ideal, but is doable, if that lets things move forward.

EvanProdromou subscribed.

So, I'm going to try to get some numbers on this.

So, the last round of discussion had these values:

| Action | Redis (ms) | Multiplier | Mean latency (ms) | p90 latency (ms) | Error rate |
| GET | 1 | 2 | 2 | 3 | 1% |
| SET | 45 | 10 | 450 | 750 | 1% |
| DELETE | 1 | 10 | 10 | 15 | 1% |

The SET metric for redis is very slow, so I wouldn't use 10x that figure.

CentralAuth (CA) logins involve HTTP POST, redirect => HTTP GET, redirect => HTTP GET. In terms of session store writes there are:
a) POST localwiki: session backend set + CA start-token set
b) GET centralwiki: session backend set + CA start-token delete + CA completion-token set
c) GET localwiki: session backend set + CA completion-token delete

1x450ms would already be slow, 7x that would be quite slow.

I think 10x of a normal SET in a properly configured/allocated memcached/redis setup (2-3ms) would be great, though impossible: it is below the eqiad <=> codfw RTT latency (36.1ms ping), so any real latency of the whole hyperswitch => cassandra stack will be higher.

If the store was LOCAL_QUORUM by default (with WRITE_SYNC making it EACH_QUORUM), then both token SETs could be cross-DC-async (the GET could use WaitConditionLoop in MediaWiki instead). That would get it to 5x cross-DC-sync writes, so 200ms writes would add 1 second. The token deletes could probably also be made cross-DC-async (the whole function is already not atomic and does not check if the delete succeeded either). That would get it down further to 3x cross-DC-sync writes, so 330ms writes would add 1 second.

For any given user, login is infrequent, so +1 second total for login should be tolerable. To add less than 1 second without any of these code changes would mean session write latency would have to be <= 142ms. One would hope that in the time it takes to do 4 cross-DC RTTs, the session store could save one value...I'd expect better than that for a good key/value system.
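
Spelling out the budget arithmetic above (purely illustrative; the write counts and the 1-second budget are the ones from this comment):

// Back-of-the-envelope: extra login time added by N synchronous
// cross-DC-visible session writes at a given per-write latency.
$budgetMs = 1000; // target: add no more than ~1 second to a CentralAuth login

foreach ( [ 7, 5, 3 ] as $syncWrites ) {
	printf( "%d sync writes: each must stay under ~%d ms\n",
		$syncWrites, $budgetMs / $syncWrites );
}
// 7 sync writes: each must stay under ~142 ms
// 5 sync writes: each must stay under ~200 ms
// 3 sync writes: each must stay under ~333 ms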

I don't have input on the numbers, but I do think we need a p99 and max as well.

For highly variable metrics (such as client-side operations) we generally look at p75. For server-side metrics, I generally recommend p99 and max.

Other percentiles could be listed if others find them useful, but I don't think we should attribute any meaning to a mean average or median (p50) as that ignores half the traffic.

  • The p99 defines what most operations should look like.
  • The max defines after what time we may assume error and return false to allow the program to continue if possible. The client can enforce this, but the service itself would ideally also attempt to return within this time, with an error if need be.
  • The 1% in-between defines our tolerance.

1% may seem small, but it's huge for backend services that face users (e.g. RESTBase, Varnish). If 1% of our Varnish requests had an issue, everyone could be affected at least once during a 3 minute browsing session (see this blog to learn why).

It's even more important for internal services (i.e. this service), as they are involved multiple times during a single user request.
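
To make the 1% figure concrete: if each backend-facing request independently has a 1% chance of a problem, the chance that a user hits at least one problem grows quickly with the number of requests in a session. The 100-request session below is an assumed example, not a measured value.

// P(at least one failure) = 1 - (1 - p)^n
$p = 0.01; // per-request error/slow rate
$n = 100;  // assumed number of backend-facing requests in a short browsing session
printf( "%.0f%% of such sessions hit at least one problem\n", ( 1 - pow( 1 - $p, $n ) ) * 100 );
// ~63% of such sessions hit at least one problem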

Change 499561 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/CentralAuth@master] Swap uses of READ_LATEST with new getKeyValueUponExistence() method

https://gerrit.wikimedia.org/r/499561

Change 499561 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@master] Swap uses of READ_LATEST with new getKeyValueUponExistence() method

https://gerrit.wikimedia.org/r/499561

@aaron I lost the thread here. Could you give me some candidate numbers for what would be acceptable performance in the table above? It sounds like 200-330ms SET would be better?

@Eevans have we been considering cross-DC writes in the performance testing? Are we comparing apples to apples here?

@Krinkle so, we need p99 and max on the above table? and no p50? Also, I assume by 1% you're talking about 1.00 - .99 and not the stated error rate above?

@Eevans have we been considering cross-DC writes in the performance testing? Are we comparing apples to apples here?

Cross-DC only occurs for DELETE operations and should be identical to POST, plus whatever inter-DC latency is. I have not included them in performance testing thus far because the focus (thus far) has been on Kask performance (and any fixes/optimizations needed there), and including DELETE would only provide a measurement of Cassandra over the WAN.

Why would there be cross-DC requests to Kask at all? MW should send requests only to Kask in the local DC, and then it's up to Cassandra to propagate the change to the other DCs. Or am I missing something?

Let me rephrase: The only thing which generates any cross-DC traffic synchronous to the client request is a DELETE, which uses Cassandra's each quorum. So it's Cassandra-on-Cassandra cross-DC (with the client ultimately waiting on that remote quorum). Otherwise, I assume MW will always send requests to local DC Kask instances, yes.

@mobrovac I think it's less about cross-DC writes to Kask, and more about discussion of possible performance of cross-DC writes in https://phabricator.wikimedia.org/T211721#5009838 above.

I think @aaron is talking about the total time for Kask request + Cassandra cross-DC sync, although I'm not sure I understand why that would be the case, right? It looks like a single login event takes 7 session writes (set or delete), which means we'd see a noticeable time increase for login.

I'd think that read-after-write consistency in a single DC would be our main goal. Am I reading @aaron's comment wrong, or is he expecting synchronous writes across DCs?

On a related note, do we want or need an SLA on consistency?

The SET metric for redis is very slow, so I wouldn't use 10x that figure.

CentralAuth (CA) logins involve HTTP POST, redirect => HTTP GET, redirect => HTTP GET. In terms of session store writes there are:
a) POST localwiki: session backend set + CA start-token set
b) GET centralwiki: session backend set + CA start-token delete + CA completion-token set
c) GET localwiki: session backend set + CA completion-token delete

Note that with the patches above having been merged a while back, none of the token operations involve cross-DC cassandra writes. The three user session writes still do. My understanding is session writes would be visible in all DCs once an HTTP response is sent (at least the old security team wanted that). Kask/Cassandra itself can be set to use a local quorum by default, with certain Kask/RestBagOStuff operations using a global quorum via WRITE_SYNC.
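
For illustration, this is roughly what the flag-based distinction looks like from the caller's side. It is a sketch only: $sessionStore, $key, $data, and $ttl stand in for whatever the real caller has in scope, and whether Kask/RestBagOStuff honours the flag this way is exactly what's under discussion.

/** @var BagOStuff $sessionStore  the session store instance from context */

// Default: the write only needs to be acknowledged in the local DC
// (LOCAL_QUORUM on the Cassandra side in the proposed setup).
$sessionStore->set( $key, $data, $ttl );

// Opt-in: ask the backend to confirm the write in all DCs before returning
// (the global/EACH_QUORUM path mentioned above).
$sessionStore->set( $key, $data, $ttl, BagOStuff::WRITE_SYNC );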

As far as relaxing the session writes to be DC-local goes, I would say:

  • Login/creation: stub and full sessions could have similar logic to getKeyValueUponExistence() during the Login/CentralAuth special page handshake; the presence of the session key or certain fields within it could be blocked on until they appear (see the sketch after this list). In theory, a different option would be making all CentralAuth stuff use the master DC via VCL...but many cases use localized special page names in the URL, so that would have to change first...and it would still seem fragile.
  • Session data changes: luckily, we've moved most random extensions using the session for key/value storage to their own system/cookies, but it is still possible for bits of user status/information to be stored there. The combination of that fact and the fact that end-user HTTP requests might alternate between DCs sometimes...makes things tricky. It's easiest to just do synchronous writes. In theory, CentralAuthSessionProvider could have a writePosIndex field in the session data array, MediaWiki::preOutputCommit could use a cookie like cpPosIndex but for sessions, and CentralAuthSessionProvider could wait for the expected write index to appear in the backend session data array (similar to ChronologyProtector::initPositions). That would cut down cases where cross-DC writes would be desired. OTOH, given that session writes could happen in any DC (though usually the master), and that they are already "racey" when done in parallel, it would be tricky to come up with a good scheme for write positions.
  • Logout: Given how "remember me" auth token cookies as well as all backend sessions are tied to the gu_auth_token field, which changes in the DB on logout, cross-DC session writes should not be needed for logout to work correctly (e.g. to avoid race conditions). Once that column value change replicates, any old sessions are properly treated as invalid.
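
A rough sketch of the "block until the key appears" idea from the first bullet, using the WaitConditionLoop helper; the function name, timeout, and surrounding wiring are assumptions for illustration:

use Wikimedia\WaitConditionLoop;

// Hypothetical sketch: poll the session store until a key written elsewhere
// has become visible locally, giving up after a few seconds.
function waitForSessionKey( BagOStuff $store, $key, $timeoutSec = 3 ) {
	$value = false;
	$loop = new WaitConditionLoop(
		function () use ( $store, $key, &$value ) {
			$value = $store->get( $key );
			return ( $value !== false )
				? WaitConditionLoop::CONDITION_REACHED
				: WaitConditionLoop::CONDITION_CONTINUE;
		},
		$timeoutSec
	);
	$loop->invoke();

	return $value; // false if the key never showed up within the timeout
}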

[ ... ]

... My understanding is session writes would be visible in all DCs once an HTTP response is sent (at least the old security team wanted that).

Uh oh, it sounds like we have a disconnect; this is the first I'm hearing of this.

During the requirements and RFC stages, talk centered around having read-your-write consistency in the data-center local to the application servers. As I understand it, other than in very exceptional circumstances, a user will be pinned to a data-center (a la geo DNS). And, AIUI, the worst-case scenarios of a racing remote-DC read beating an async replication didn't translate into any serious errors. In other words: weighed against improbable, temporary issues of correctness resulting in minor errors, having cross-DC latencies in-lined on every request did not look like a good bargain.

The exception to this was delete operations, since it seemed feasible that someone might be able to game the race on logout from a remote DC.

Kask/Cassandra itself can be set to use a local quorum by default, with certain Kask/RestBagOStuff operations using global quorum via WRITE_SYNC.

As of right now, Kask is hard-coded to do local quorum reads and writes, and each quorum deletes (consistent with the RFC).

As far as relaxing the session writes to be DC-local goes, I would say:

  • Login/creation: stub and full sessions could have similar logic to getKeyValueUponExistence() during the Login/CentralAuth special page handshake; the presence of the session key or certain fields within it could be blocked on until they appear. In theory, a different option would be making all CentralAuth stuff use the master DC via VCL...but many cases use localized special page names in the URL, so that would have to change first...and it would still seem fragile.
  • Session data changes: luckily, we've moved most random extensions using the session for key/value storage to their own system/cookies, but it is still possible for bits of user status/information to be stored there. The combination of that fact and the fact that end-user HTTP requests might alternate between DCs sometimes...makes things tricky. It's easiest to just do synchronous writes. In theory, CentralAuthSessionProvider could have a writePosIndex field in the session data array, MediaWiki::preOutputCommit could use a cookie like cpPosIndex but for sessions, and CentralAuthSessionProvider could wait for the expected write index to appear in the backend session data array (similar to ChronologyProtector::initPositions). That would cut down cases where cross-DC writes would be desired. OTOH, given that session writes could happen in any DC (though usually the master), and that they are already "racey" when done in parallel, it would be tricky to come up with a good scheme for write positions.
  • Logout: Given how "remember me" auth token cookies as well as all backend sessions are tied to the gu_auth_token field, which changes in the DB on logout, cross-DC session writes should not be needed for logout to work correctly (e.g. to avoid race conditions). Once that column value change replicates, any old sessions are properly treated as invalid.

Again, my understanding is that barring administrative changes (which are infrequent and expected to be disruptive), we have client/DC affinity, and that session operations will not be bouncing back and forth between data-centers. Is that not correct?

On a related note, do we want or need an SLA on consistency?

Can you explain what this would look like, how it would be measured?

Maybe something like:

  • Read-after-Write consistency within the data centre
  • Bounded Staleness between data centres, with bound = X seconds
  • Propagation error rate of Y%

?

Looking at the RFC, it says RAW in the DC, Eventual Consistency for multi-DC. If that went over fine during the RFC discussions, and if we're not worried about multi-DC client sessions, then I don't think it's necessary to establish staleness bounds.

I think 10x of a normal SET in a properly configured/allocated memcached/redis setup (2-3ms) would be great, though impossible: it is below the eqiad <=> codfw RTT latency (36.1ms ping), so any real latency of the whole hyperswitch => cassandra stack will be higher.

So, if we're doing RAW consistency in-DC, eventual consistency across DCs, then this becomes at least feasible, right? Does a number like 30ms make sense for mean latency on POST? Or even 45ms, since that's the current Redis value?

I updated the description so it has what seems to be the latest values.

I'd like to close this off soon.