
Harmonize CentralAuth and core session TTL and migrate CentralAuth sessions to Kask
Closed, ResolvedPublic

Description

Core sessions expire after one hour because that's the expiry time that Brion used when he implemented MediaWiki's first session storage feature in 2003 r1890.

CentralAuth sessions expire after 24 hours because that's the expiry time Andrew chose when he implemented the CentralAuth session concept in 2008 r33061.

I want to switch testwiki to multi-DC active/active mode within the next few working days, but the main remaining blocker is that these two expiry times differ, which prevents the two session types from being put in the same Kask container. Otherwise they have a very similar access pattern, both being managed by core's SessionManager, so it would make sense to put them in the same data store.

I think one hour is too short, because an editor might take longer than that to edit a page. Writing content is hard and we don't want to get in the way of it with unnecessary CSRF warnings.

Currently, logins without the "remember me" option expire after 24 hours of inactivity, because when the core session expires, the login is restored from the CentralAuth session. All logged-in users have a CentralAuth session. So I suggest making both expiry times 24 hours, since that is the best way to harmonize them without upsetting user expectations.

Event Timeline

The Kask config says

# WARNING: The value of $wgObjectCacheSessionExpiry in MediaWiki must
# correspond to the TTL defined here; If you alter default_ttl, update
# MediaWiki accordingly or problems with session renewal/expiry may occur.
default_ttl: 3600

I think the correct way to update it is to increase the Kask TTL first, and then $wgObjectCacheSessionExpiry. MediaWiki renews a session when its age is close to $wgObjectCacheSessionExpiry, so it should be safe to have a Kask TTL which is greater than $wgObjectCacheSessionExpiry.
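As a concrete sketch of that ordering (the Gerrit changes below are the real patches; this snippet only illustrates the MediaWiki half and assumes a plain settings file rather than the full wmf-config structure):

// Step 2, deployed only after the Kask chart's default_ttl (deployment-charts)
// has already been raised to 86400; a Kask TTL larger than this value is safe.
$wgObjectCacheSessionExpiry = 86400; // 24 hours, up from 3600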

Change 816059 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/deployment-charts@master] Increase core session expiry to 86400 to match CentralAuth

https://gerrit.wikimedia.org/r/816059

Change 816060 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Increase $wgObjectCacheSessionExpiry to 86400

https://gerrit.wikimedia.org/r/816060

Change 816061 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Move CentralAuth sessions to Kask

https://gerrit.wikimedia.org/r/816061

Cassandra operations

sessionstore hosts show a sawtooth pattern in disk space usage, peaking at about 24% of the 136 GB /srv partition.

sessionstore1002-df.png (270×467 px, 21 KB)

Naïvely multiplying disk space usage by 24 would imply that there is not enough space. So we have to dive a bit deeper into Cassandra internals to figure out what will happen when the TTL is increased.

On sessionstore1001 there is 9GB in the data directory, but most of that is in sstables that haven't been written to within the TTL. The most recent sstable is only 418MB. But compaction is not removing the old sstables. nodetool tablestats shows 9GB in use.

I see that SizeTieredCompactionStrategy is being used. Maybe switching to the TWCS compaction strategy would help. "In an expiring/TTL workload, the contents of an entire SSTable likely expire at approximately the same time, allowing them to be dropped completely, and space reclaimed much more reliably than when using SizeTieredCompactionStrategy or LeveledCompactionStrategy."

@Eevans should probably chime in here.

Cassandra operations

sessionstore hosts show a sawtooth pattern in disk space usage, peaking at about 24% of the 136 GB /srv partition.

sessionstore1002-df.png (270×467 px, 21 KB)

Naïvely multiplying disk space usage by 24 would imply that there is not enough space. So we have to dive a bit deeper into Cassandra internals to figure out what will happen when the TTL is increased.

On sessionstore1001 there is 9GB in the data directory, but most of that is in sstables that haven't been written to within the TTL. The most recent sstable is only 418MB. But compaction is not removing the old sstables. nodetool tablestats shows 9GB in use.

The value of gc_grace_seconds is 864000 seconds (the default), so it doesn't matter how quickly the TTL expires: values will hang around for (at least) 10 days afterward. I don't think that just extending the TTL from 3600 to 86400 is going to meaningfully change the utilization picture.

I see that SizeTieredCompactionStrategy is being used. Maybe switching to the TWCS compaction strategy would help. "In an expiring/TTL workload, the contents of an entire SSTable likely expire at approximately the same time, allowing them to be dropped completely, and space reclaimed much more reliably than when using SizeTieredCompactionStrategy or LeveledCompactionStrategy."

As I recall, session values are overwritten (resetting the TTL), which would tend to undermine TWCS. If you look at the output of STCS, the results here are actually quite good (and approximate what TWCS would do under ideal circumstances). Looking at sessionstore1001 as of now, we have one SSTable each with timestamps of July 14, 18, and 20, and 6 from today (the 22nd), and SSTables/read is never more than 1. I'd be surprised if we could optimize this further.


What is the expected utilization for CentralAuth sessions? How much additional storage will it require?

And separately from the issue of size (I'm sure the sessionstore cluster makes the most sense for this type of data), is this really the same thing as the core sessions? Should we be using a single storage namespace for both (as opposed to creating a new Kask instance for CentralAuth)?

The value of gc_grace_seconds is 864000 seconds (the default), so it doesn't matter how quickly the TTL expires: values will hang around for (at least) 10 days afterward.

That explains some things, thanks.

I don't think that just extending the TTL from 3600 to 86400 is going to meaningfully change the utilization picture.

I'll take that as approval for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/816059

As I recall, session values are overwritten (resetting the TTL), which would tend to undermine TWCS.

My assumption was that INSERT will add a whole new row to the store, superseding the previous row. So the old value will still exist with the same expiry and will be dropped during compaction. I know there is a concept of partial row updates, but I assumed they were only used for UPDATE.

If you look at the output of STCS, the results here are actually quite good (and approximate what TWCS would do under ideal circumstances). Looking at sessionstore1001 as of now, we have one SSTable each with timestamps of July 14, 18, and 20, and 6 from today (the 22nd), and SSTables/read is never more than 1. I'd be surprised if we could optimize this further.

I'm not worried about disk read rate, the host metrics show that the disk read rate is exactly zero. Their disk size is only double their RAM size, so it's not surprising all the data fits into memory. I'm worried about these hosts running out of disk space.

sessionstore1001 is coincidentally good at the moment, we happen to be at the bottom of a cycle. Look at disk usage over the last 60 days:

sessionstore1001-df.png (657×1 px, 94 KB)

Or look at sessionstore1003 which is around the midpoint of the range:

# ls -lh *-Data.db
-rw-r--r-- 1 cassandra cassandra 3.2G Jul 14 14:31 md-16486-big-Data.db
-rw-r--r-- 1 cassandra cassandra 2.3G Jul 20 11:15 md-16620-big-Data.db
-rw-r--r-- 1 cassandra cassandra 806M Jul 22 05:11 md-16681-big-Data.db
-rw-r--r-- 1 cassandra cassandra 692M Jul 23 17:46 md-16766-big-Data.db
-rw-r--r-- 1 cassandra cassandra 199M Jul 24 02:17 md-16787-big-Data.db
-rw-r--r-- 1 cassandra cassandra  45M Jul 24 02:50 md-16788-big-Data.db
-rw-r--r-- 1 cassandra cassandra  43M Jul 24 03:22 md-16789-big-Data.db
-rw-r--r-- 1 cassandra cassandra  43M Jul 24 03:54 md-16790-big-Data.db

Yes, it still has several sstables with dates going back as far as gc_grace_seconds, but the total size is double that of sessionstore1001. I think that is because STCS mixes data of various ages when it compacts, so that the oldest sstables are inflated with values older than gc_grace_seconds.

I read this DataStax article which has some more information about compaction strategies. My prediction is that if we switch to TWCS, that monthly sawtooth will stop and we'll have flat disk space usage on the order of gc_grace_seconds multiplied by the number of bytes inserted per second.
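To put a rough number on that prediction, a back-of-the-envelope sketch; the write rate used here is a placeholder, not a measured figure:

// Predicted steady-state disk usage if TWCS could drop whole expired SSTables.
// $bytesPerSecond is a hypothetical sustained insert rate, not a measurement.
$gcGraceSeconds = 864000;        // 10 days, the cluster default
$bytesPerSecond = 10 * 1024;     // placeholder: ~10 KiB/s of session writes
$steadyStateBytes = $gcGraceSeconds * $bytesPerSecond;
printf( "~%.1f GB retained\n", $steadyStateBytes / 1e9 ); // ~8.8 GB at this rate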

What is the expected utilization for CentralAuth sessions? How much additional storage will it require?

I can try to get some exact numbers out of Redis, or collect stats from MediaWiki. Every logged-in user has a CentralAuth session, but the write rate should be lower than core sessions, because they're only written to on login, logout and renewal. CentralAuth sessions are not being used as a general store of user-related data.

And separately from the issue of size (I'm sure the sessionstore cluster makes the most sense for this type of data), is this really the same thing as the core sessions? Should we be using a single storage namespace for both (as opposed to creating a new Kask instance for CentralAuth)?

They share a lot of code and concepts. They both support MW login. They were both in memcached before we started moving things around for multi-DC. It looks like a couple of days of work to create a new Kask instance.

There are about 4.6M CentralAuth session keys in Redis, with an average value size of 146 bytes and an average key size of 52 bytes. So around 860 MB of data, plus overhead.

Method:

// $redis is a connected phpredis client, pointed at a codfw session Redis replica.
$n = 0;
$sz = 0;
// Sample every key whose session ID starts with "00" (1/256 of this server's keys).
foreach ( $redis->keys( 'centralauth:session:00*' ) as $key ) {
    $n++;
    $sz += strlen( $redis->get( $key ) );
}
print "$n $sz\n";
// Output: 2230 325192

The session ID is hexadecimal, so that's 1/256 of the keys on 1/8 of the servers, for a sampling ratio of 1/2048. I used a codfw replica.
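As a sanity check of that extrapolation (all inputs are the sampled figures above; only the arithmetic is new):

$sampledKeys  = 2230;                          // keys matched by the sample above
$sampledBytes = 325192;                        // total value bytes in the sample
$ratio        = 2048;                          // 1/256 key prefixes x 1/8 servers
$totalKeys    = $sampledKeys * $ratio;         // ~4.57M sessions
$avgValue     = $sampledBytes / $sampledKeys;  // ~146 bytes per value
$avgKey       = 52;                            // measured key length
$totalBytes   = $totalKeys * ( $avgValue + $avgKey );
printf( "%d keys, ~%.0f MiB\n", $totalKeys, $totalBytes / 1048576 );
// 4567040 keys, ~862 MiB -- consistent with the ~860 MB figure above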

Comparing that to core sessions: the minimum partition count is around 75M and the partition size is 100 bytes. Disk space usage is ~10.3 GB, implying an overhead of 47 bytes per partition.
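A quick check of that overhead figure, assuming the ~10.3 GB is binary gigabytes (GiB):

$partitions   = 75e6;                       // minimum partition count
$diskBytes    = 10.3 * 1024 ** 3;           // ~10.3 GiB on disk (assumption: GiB)
$perPartition = $diskBytes / $partitions;   // ~147 bytes per partition
printf( "~%.0f bytes overhead\n", $perPartition - 100 ); // ~47 bytes beyond the 100-byte value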

So with gc_grace_seconds = 864000 we can expect CentralAuth sessions to consume about 10 GB of disk space. With STCS it could rise to 20 or 30 GB.
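One way to arrive at a figure of that order (an assumed reading of the arithmetic, not stated above: each live session is rewritten at least once per 24-hour TTL window, and superseded or expired rows linger for gc_grace_seconds before compaction can drop them):

// Hedged back-of-the-envelope; the one-rewrite-per-day rate is an assumption.
$liveSessions = 4.6e6;            // CentralAuth sessions counted above
$bytesPerRow  = 146 + 52 + 47;    // value + key + per-partition overhead
$writesPerDay = 1;                // assumed: one rewrite per session per TTL window
$retainedDays = 10 + 1;           // gc_grace_seconds (10 days) + the 1-day TTL
$estimate = $liveSessions * $bytesPerRow * $writesPerDay * $retainedDays;
printf( "~%.0f GB\n", $estimate / 1e9 ); // ~12 GB, the same order as the 10 GB figure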

Of course, we could reduce gc_grace_seconds. I understand that gc_grace_seconds prevents recreation of deleted data when a node rejoins the cluster after downtime. But 10 days is a very generous period of downtime, and after a day any recreated data would have expired anyway due to the TTL, at which point it is treated as tombstones.

Basically we've got 1GB of actual live data on a server with 64GB of RAM and 136GB of disk, and I want to make it 2GB, but due to the way it is tuned, I am scratching my head figuring out if it is going to fit. Probably it will fit.

Pinging @Reedy as security team member, about the proposed extension of the session expiry time.

[ ... ]

As I recall, session values are overwritten (resetting the TTL), which would tend to undermine TWCS.

My assumption was that INSERT will add a whole new row to the store, superseding the previous row. So the old value will still exist with the same expiry and will be dropped during compaction. I know there is a concept of partial row updates, but I assumed they were only used for UPDATE.

INSERT and UPDATE semantics are identical; if you're writing a value for something with an identical partition, it's an overwrite.

If you look at the output of STCS, the results here are actually quite good (and approximate what TWCS would do under ideal circumstances). Looking at sessionstore1001 as of now, we have one SSTable each with timestamps of July 14, 18, and 20, and 6 from today (the 22nd), and SSTables/read is never more than 1. I'd be surprised if we could optimize this further.

I'm not worried about disk read rate, the host metrics show that the disk read rate is exactly zero. Their disk size is only double their RAM size, so it's not surprising all the data fits into memory. I'm worried about these hosts running out of disk space.

sessionstore1001 is coincidentally good at the moment, we happen to be at the bottom of a cycle. Look at disk usage over the last 60 days:

sessionstore1001-df.png (657×1 px, 94 KB)

Or look at sessionstore1003 which is around the midpoint of the range:

# ls -lh *-Data.db
-rw-r--r-- 1 cassandra cassandra 3.2G Jul 14 14:31 md-16486-big-Data.db
-rw-r--r-- 1 cassandra cassandra 2.3G Jul 20 11:15 md-16620-big-Data.db
-rw-r--r-- 1 cassandra cassandra 806M Jul 22 05:11 md-16681-big-Data.db
-rw-r--r-- 1 cassandra cassandra 692M Jul 23 17:46 md-16766-big-Data.db
-rw-r--r-- 1 cassandra cassandra 199M Jul 24 02:17 md-16787-big-Data.db
-rw-r--r-- 1 cassandra cassandra  45M Jul 24 02:50 md-16788-big-Data.db
-rw-r--r-- 1 cassandra cassandra  43M Jul 24 03:22 md-16789-big-Data.db
-rw-r--r-- 1 cassandra cassandra  43M Jul 24 03:54 md-16790-big-Data.db

Yes, it still has several sstables with dates going back as far as gc_grace_seconds, but the total size is double that of sessionstore1001. I think that is because STCS mixes data of various ages when it compacts, so that the oldest sstables are inflated with values older than gc_grace_seconds.

It does mix them, but since it combines based on candidate size, the larger a file is the less often it is a candidate and the files it is compacted with have values of a similar age.

I read this DataStax article which has some more information about compaction strategies. My prediction is that if we switch to TWCS, that monthly sawtooth will stop and we'll have flat disk space usage on the order of gc_grace_seconds multiplied by the number of bytes inserted per second.

IME, TWCS only ever works as advertised for those pathological examples it was created for: totally ordered datasets. Even if it did produce a "better" result here, it would come with the overhead of using a less common/less understood compaction algorithm; it would create an exception. Since "better" in this context is the saving of disk space that we don't need for any other purpose (and given the narrow scope of this cluster, it's not likely we ever will), that doesn't seem like a good trade-off.

The same could be said for the 10-day gc_grace_seconds. On this cluster we could safely get away with something lower, and in doing so, lower disk utilization. But this time period equates to the upper bound on the duration of a node outage, the point of no return if you will, after which data integrity is in danger. Keeping that consistent with what the other clusters use (and what has become best practice for Cassandra generally) provides value, because it's easier for everyone to reason about.

There are also a few other (easier) things we could try to hasten tombstone GC. For example: lowering min_threshold and/or tombstone_threshold, or setting unchecked_tombstone_compaction to true. These would come at the "expense" of some additional I/O (probably trivial in this case), but might create a truer picture of utilization.

[ ... ]

And separately from the issue of size (I'm sure the sessionstore cluster makes the most sense for this type of data), is this really the same thing as the core sessions? Should we be using a single storage namespace for both (as opposed to creating a new Kask instance for CentralAuth)?

They share a lot of code and concepts. They both support MW login. They were both in memcached before we started moving things around for multi-DC.

Yes, but that was a problem wasn't it? Prior to moving sessions to Kask, we had a Redis cluster full of things that were by some definition similar. Having everything in one storage namespace made it hard to reason about access or utilization, and prevented you from acting on a single type of thing (for example, dropping one type of value in isolation while retaining others).

It looks like a couple of days of work to create a new Kask instance.

Perhaps @Joe can weigh in on whether he thinks this warrants a new namespace (he was involved in disentangling the Redis environment), and if so, what the lead-time is in creating one.

Since "better" in this context is the saving of disk space that we don't need for any other purpose (and given the narrow scope of this cluster, it's not likely we ever will), that doesn't seem like a good trade-off.

I'm predicting 50% disk space usage. My concern is that some future software deployment will accidentally increase session storage requirements, exceeding the 2x headroom, causing disk space usage to hit 100%.

But this time period equates to the upper bound on the duration of a node outage, the point of no return if you will, after which data integrity is in danger.

I'm not concerned about data integrity. It's fine to drop the whole database and start from empty. I'm only concerned about the site staying up.

A few considerations:

  • While I do prefer to have separate Kask installations for different logical functions, these are all sessions, even if they're coming from different parts of MediaWiki. The same database seems logical, and hence also using the same Kask installation, if the TTLs are made to coincide.
  • Data loss is not a problem, of course - we've wiped out our sessions to log everyone out multiple times over the years due to some bug we had to resolve. Data inconsistency would be.
  • I don't have a strong preference regarding how to reclaim space in Cassandra, of course. I want to point out that if disk space is running low, we can expand the cluster if needed. 50% usage is in the range where I'd start being vaguely nervous.

Since "better" in this context is the saving of disk space that we don't need for any other purpose (and given the narrow scope of this cluster, it's not likely we ever will), that doesn't seem like a good trade-off.

I'm predicting 50% disk space usage. My concern is that some future software deployment will accidentally increase session storage requirements, exceeding the 2x headroom, causing disk space usage to hit 100%.

I honestly think we'll be OK; we included this data when we sized the cluster originally, and I'm still confident it will work. There are a lot of things we can do to move the needle on tombstone GC and get utilization closer to its true value (if it becomes necessary).

Given the data on this cluster is for all intents and purposes bounded in size, I'd be comfortable running it as high as 75% utilization (we're not limited to 50%).

But this time period equates to the upper bound on the duration of a node outage, the point of no return if you will, after which data integrity is in danger.

I'm not concerned about data integrity. It's fine to drop the whole database and start from empty. I'm only concerned about the site staying up.

My reply got snipped here, but I was rationalizing why I'd rather keep a GC grace period of 10 days (since you had earlier suggested that we lower it).

A few considerations:

  • While I do prefer to have separate Kask installations for different logical functions, these are all sessions, even if they're coming from different parts of MediaWiki. The same database seems logical, and hence also using the same Kask installation, if the TTLs are made to coincide.
  • Data loss is not a problem, of course - we've wiped out our sessions to log everyone out multiple times over the years due to some bug we had to resolve. Data inconsistency would be.

Ok, fair enough. I guess it's easy enough to separate them later if needed.

Change 816059 merged by jenkins-bot:

[operations/deployment-charts@master] Increase core session expiry to 86400 to match CentralAuth

https://gerrit.wikimedia.org/r/816059

I deployed the increased TTL and tested it by regenerating my enwiki session and then doing select ttl(value) from values where key='...'; on sessionstore1001.

Change 816060 merged by jenkins-bot:

[operations/mediawiki-config@master] Increase $wgObjectCacheSessionExpiry to 86400

https://gerrit.wikimedia.org/r/816060

Change 816061 merged by jenkins-bot:

[operations/mediawiki-config@master] Move CentralAuth sessions to Kask

https://gerrit.wikimedia.org/r/816061

Mentioned in SAL (#wikimedia-operations) [2022-07-27T23:45:51Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: move CentralAuth sessions to Kask T313496 (duration: 05m 34s)

Mentioned in SAL (#wikimedia-operations) [2022-07-27T23:59:08Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: sync again now that scap proxy list is fixed T313730 T313496 (duration: 03m 25s)

I confirmed that my CentralAuth session is now in Cassandra and that the TTL is ~86400. No obvious change in request rate in the Cassandra dashboards in Grafana.

I did a normal request with debug logging. It showed no request for the CentralAuth session data. So I cleared my local wiki session to force a refresh of the login from the CentralAuth session. The subsequent request showed a read and write request for the CentralAuth session.