Page MenuHomePhabricator

Determine if per-request TTLs are needed
Closed, ResolvedPublic2 Estimated Story Points

Description

At @Fjalapeno 's request, I'm reviewing the old task list from this Etherpad. One task is for TTLs.

The RFC doesn't allow for setting TTLs per-request, and explicitly states that the service will have a single TTL configured per-server.

In the interest of simplicity, set operations as described here use a TTL defined on a service-wide basis (read: a single, configurable TTL, applied to all writes). In the event this proves inadequate, a backward-compatible change to enable client-supplied TTL overrides via a Cache-Control header is planned.

This task is to check if this "proves inadequate". We're done when we've...

  • checked MediaWiki code to see if we set different session TTLs for different sessions within a wiki for some reason
  • checked if we have session TTLs set at different values for different wikis

More on TTL semantics

One (implicit) design goal of Kask is to be better about our use of namespacing than our previous implementation under Redis (for session storage, central auth, and object stash). For example, we want sessions to be isolated in their own store, not intermixed with other, arbitrary data; Kask instances should be thought of as mapping 1:1 with some specific use-case.

The status quo at the time of this writing (both code and RFC) is that Kask supports a single TTL value. This value is configured server-side (Kask-side), and applies to every value written. This was done under the assumption that this behavior would fit that 1:1, instance/use-case mapping. The RFC states that if this assumption proves false, any supplied TTL must be less-than or equal-to the default. Put another way, what is at question is NOT whether arbitrary TTLs are possible, but whether the client can override the default with something shorter.

The rationale for limiting client TTLs like this is about bounding retention expectations. Unlike Redis, sessions in Kask are persisted and fall within Foundation policy with respect to retention of PII data. It should be possible to reason about retention and expiration based on an instances configuration, and this will no longer be the case if clients are able to write arbitrary TTLs.

Event Timeline

FYI, I think CentralAuth needs a TTL different to that of regular sessions (IIRC, this was cited as a reason for exposing the TTL). Regardless of whether or not this is true, we should probably consider using a different instance of the service for CentralAuth (it's trivial to do). If we do, we can configure it with a different default TTL as well.

FYI, I think CentralAuth needs a TTL different to that of regular sessions (IIRC, this was cited as a reason for exposing the TTL). Regardless of whether or not this is true, we should probably consider using a different instance of the service for CentralAuth (it's trivial to do). If we do, we can configure it with a different default TTL as well.

Is the thing that CentralAuth stores with a low TLL "a session"? That is, does it override an existing user session that would otherwise have a longer TTL? If that's the case, and actually intentional, we'll need per-write TTLs on the service. If CentralAuth writes something other than sessions to the store, it should probably use a separate instance. Unless we need some kind of consistency requirement between the actual sessions and the CentralAuth thingies.

We're done when we've...

checked MediaWiki code to see if we set different session TTLs for different sessions within a wiki for some reason
checked if we have session TTLs set at different values for different wikis

I think the answer to both of these is "no". (Disclaimer: I frequently discover that I know less about our operational environment than I think I do. So it is possible I'm mistaken about some or all of what follows.)

From the task description:

Put another way, what is at question is NOT whether arbitrary TTLs are possible, but whether the client can override the default with something shorter.

The ability to configure session TTLs is baked in to our current SessionManager/SessionBackend/BagOStuff implementation, and nothing I do in RESTBagOStuff will eliminate that configuration setting. So it will be possible for someone to *think* they are configuring an arbitrary TTL on the client (wiki) side. But we can, of course, ignore that setting. And in fact, we currently are ignoring it - the current RESTBagOStuff implementation never even sends it to Kask.

So if it suffices to provide a consistent TTL across all sessions on all wikis (and it appears it does) then I think that we are, technically speaking, done.

However, it feels sketchy to me that it is possible to configure a TTL that gets silently ignored. Someone might set $wgObjectCaacheSessionExpiry on a wiki that uses RESTBagOStuff and reasonably expect it to do something. But it would not.

With that said, I'm struggling to suggest an alternative, and I'm wondering if I'm overthinking this. Maybe we just document on https://www.mediawiki.org/wiki/Manual:$wgObjectCacheSessionExpiry that some object cache types may ignore this setting and call it a day?

For reference, here are the configuration values on which I based my analysis, all from the mediawiki-config repo. Two notes:

  1. the "timeout" in the memcached-pecl configuration is a communications timeout, and not related to session expiry.
  2. the "$wgObjectCacheSessionExpiry" global appears nowhere in the repository, so I assume we are using the default value of 3600 (seconds)

InitialiseSettings.php:

'wgSessionCacheType' => [
	'default' => 'redis_local',  // declared in redis.php
	'wikitech' => 'memcached-pecl',
],

redis.php:

foreach ( [ 'eqiad', 'codfw' ] as $dc ) {
	$wgObjectCaches["redis_{$dc}"] = [
		'class'       => 'RedisBagOStuff',
		'servers'     => [ "/var/run/nutcracker/redis_{$dc}.sock" ],
		'password'    => $wmgRedisPassword,
		'loggroup'    => 'redis',
		'reportDupes' => false
	];
}

$wgObjectCaches['redis_master'] = $wgObjectCaches["redis_{$wmfMasterDatacenter}"];
$wgObjectCaches['redis_local'] = $wgObjectCaches["redis_{$wmfDatacenter}"];

mc.php:

$wgObjectCaches['memcached-pecl'] = [
	'class'                => 'MemcachedPeclBagOStuff',
	'serializer'           => 'php',
	'persistent'           => false,
	'servers'              => defined( 'HHVM_VERSION' )
		? [ '/var/run/nutcracker/nutcracker.sock:0' ]
		: [ '127.0.0.1:11212' ],
	// Effectively disable the failure limit (0 is invalid)
	'server_failure_limit' => 1e9,
	// Effectively disable the retry timeout
	'retry_timeout'        => -1,
	'loggroup'             => 'memcached',
	'timeout'              => $wgMemCachedTimeout,
];

Oh, and FWIW, it is possible to configure RESTBagOStuff to send a cache-control header to Kask in the current implementation, with no code changes, as follows (tested on my local, hence 127.0.01.)

$wgObjectCaches['sessions'] = [
	'class' => 'RESTBagOStuff',
	'url' => 'http://127.0.0.1:8080/sessions/v1/',
	'httpParams' => [
		'readHeaders' => [],
		'writeHeaders' => [
			'content-type' => 'application/octet-stream',
			'cache-control' => 'max-age=' . $wgObjectCacheSessionExpiry,
		],
		'deleteHeaders' => [],
		'writeMethod' => 'POST',
	],
	'extendedErrorBodyFields' => [ 'type', 'title', 'detail', 'instance' ]
];
$wgSessionCacheType = 'sessions';

That goes a bit against the spirit of the BagOStuff class, because it ignores the $exptime parameter of the set() method and just brute-forces the header, but it does work. I'm not suggesting we do that - Kask as implemented would still ignore it so it would serve little point - but I just wanted to document it here in case it becomes relevant in further discussion.

[ ... ]

From the task description:

Put another way, what is at question is NOT whether arbitrary TTLs are possible, but whether the client can override the default with something shorter.

The ability to configure session TTLs is baked in to our current SessionManager/SessionBackend/BagOStuff implementation, and nothing I do in RESTBagOStuff will eliminate that configuration setting. So it will be possible for someone to *think* they are configuring an arbitrary TTL on the client (wiki) side. But we can, of course, ignore that setting. And in fact, we currently are ignoring it - the current RESTBagOStuff implementation never even sends it to Kask.

So if it suffices to provide a consistent TTL across all sessions on all wikis (and it appears it does) then I think that we are, technically speaking, done.

However, it feels sketchy to me that it is possible to configure a TTL that gets silently ignored. Someone might set $wgObjectCaacheSessionExpiry on a wiki that uses RESTBagOStuff and reasonably expect it to do something. But it would not.

When this came up during the RFC discussions, someone suggested they thought it was acceptable to do exactly this though, to use whatever the service was configured for, and ignore this setting. If memory serves, as part of rationalizing this it was explained that the BagOStuff interface was rife with things that only made sense in certain contexts, and were ignored entirely in others. I can't remember who it was that would have said this (so that we could follow up now), and it's possible I am mis-remembering. I agree, though it seems janky to have a configuration parameter that will be silently ignored. I see two alternatives to this:

  1. We implement client-provided TTLs (ala Cache-Control) as indicated (where the TTL is strictly less-than or equal-to to the default)
  2. We implement client-provided TTLs of arbitrary value

#1 seems like little more than window dressing. Kask will still be configured for a default TTL, and RESTBagOStuff will have to send something matching (or smaller). IOW, that MediaWiki setting will still not be the arbiter of TTL values.

#2 is a departure from what has been discussed thus far. We'd be moving where/how we establish the upper bound on retention from the service to the application (MediaWiki here, other applications where applicable moving forward). I'd certainly not rule this out, but I think we'd want to re-raise this for discussion and be prepared to weigh the cost:benefit ratio.

With that said, I'm struggling to suggest an alternative, and I'm wondering if I'm overthinking this. Maybe we just document on https://www.mediawiki.org/wiki/Manual:$wgObjectCacheSessionExpiry that some object cache types may ignore this setting and call it a day?

Is $wgObjectCacheSessionExpiry specific to session storage? Could we override the default with something obviously contrived (-1?), and add a comment explaining that the value is ignored in favor of what is configured on the service?

Is $wgObjectCacheSessionExpiry specific to session storage? Could we override the default with something obviously contrived (-1?)

Unfortunately, that would confuse logic in SessionBackend::renew(). That bit of code, per its docblock "Resets the TTL in the backend store if the session is near expiring". It bases "near expiring" on $wgObjectCacheSessionExpiry. So if the wiki's TTL setting and the Kask service's TTL setting are out of sync, this functionality won't work correctly.

Considering that code makes TTL configuration feel more important. A mismatch between core and Kask settings could more actively cause unexpected behavior than I had previously appreciated. It seems like, at a minimum, we need to document in core that the wiki TTL must be less than or equal to the Kask TTL. This could be done in the existing RESTBagOStuff comment that describes how to configure RESTBagOStuff for Kask.

We could be more aggressive, and have Kask return an error if the wiki sends a TTL greater than the Kask timeout. This would make Option #1 less like window dressing and more like a strong check for incompatible configuration settings. But maybe that is too unfriendly?

Change 512798 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Allow Kask TTL to override $wgObjectCacheSessionExpiry

https://gerrit.wikimedia.org/r/512798

Speculative code posted. While this may not be the right answer, it gives us something solid to critique.

tl;dr: I made BagOStuff and RESTBagOStuff slightly more extensible, derived a KaskBagOStuff class, and changed two lines in SessionBackend. The Kask service needs to send Cache-Control headers.

Constraints/Assumptions:

  • MediaWiki core has a globally configured TTL (via $wgObjectCacheSessionExpiry). It must retain this for compatibility with other session backends.
  • Kask has a globally configured TTL, which is not known to MediaWiki unless MediaWiki obtains it via HTTP.
  • The Kask TTL, if available, should supercede the MediaWiki TTL.
  • BagOStuff and RESTBagOStuff are generic and should not have per-bag TTLs, because that may not make sense in different usages.
  • BagOStuff and RESTBagOStuff may be used for general key/value storage, not just session storage, and therefore can’t contain anything specific to session storage.
  • We don’t mind deriving a minimal KaskBagOStuff from RESTBagOStuff
  • KaskBagOStuff, being specific to Kask, can internally have a per-bag TTL, and things related to session storage.
  • SessionBackend must be able to work with various BagOStuff types, not just KaskBagOStuff.
  • KaskBagOStuff needs to know about the Kask TTL before doing any POST requests (or else we’ll need larger changes to SessionBackend than we’d like). Fortunately, SessionBackend always does a GET before any POST, giving us an opportunity to transfer the TTL from Kask to MediaWiki core. Unfortunately, this GET will often be a 404, meaning that we have to send Cache-Control with 404s too.

Speculative implementation:

  • created a KaskBagOStuff class, derived from RESTBagOStuff.
  • added expiresSoon() and calculateExpiry() utility functions to BagOStuff, overrode them in KaskBagOStuff, and modified SessionBackend to use them.
  • added forwarding functions for calculateExpiry() and expiresSoon() to CachedBagOStuff, similar to its existing forwarding functions.
  • modified RESTBagOStuff to allow derived classes an opportunity to extract info from headers, and used that ability in KaskBagOStuff to extract the Kask TTL info from Cache-Control headers.
  • noted in DefaultSettings.php that $wgObjectCacheSessionExpiry may be overridden by some session storage backends.

New requirements on Kask:

  • pass a Cache-Control header with max-age equal to the Kask TTL with (at least all GET) responses. This includes negative responses (404s).

Notes:

  • if we’re not comfortable sending Cache-Control with 404s (especially when we are often going to immediately POST), we can discuss possibilities. KaskBagOStuff just needs somewhere to get that data.
  • the KaskBagOStuff class, if we keep it, will need tests. But as this is still speculative, I have not yet written any.
  • The new expiresSoon() and calculateExpiry() functions on BagOStuff may not be useful to all derived classes, but they are non-instrusive, don’t affect the BagOStuff state, are not specific to session storage, and might be useful for general key-value stores. In short, they are small warts, but perhaps not unacceptable ones. FWIW, BagOStuff already had a few expiry-related utility functions, so there is precedent.
WDoranWMF set the point value for this task to 2.May 28 2019, 7:22 PM

As discussed on IRC, another alternative is to not make any code changes, and instead simply require both MediaWiki core and Kask to have the same TTL setting.

As posted above in this task, it would be straightforward to have RESTBagOStuff send a Cache-Control header to Kask containing MediaWiki's TTL. Kask could then compare MediaWiki's TTL setting to its own TTL setting and log a warning if there were a discrepancy.

The post above claiming we could do with with no code changes assumes we're okay with $wgObjectCaches being dependent on $wgObjectCacheSessionExpiry. If we're not okay with that, this could still be done with a trivial code change to (only) RESTBagOStuff.

So, it sounds like any additional code here would be to deal with a misconfiguration problem, where the configured TTL for MediaWiki and Kask are grossly mismatched.

Institutional memory is hard, but I think we can handle this with human memory rather than with error-handling code. Since the configuration file formats for MediaWiki (PHP) and Kask (YAML) both allow comments, we can put comments in both files to warn against changing the one configuration value without changing the other.

To that end:

  • I've added a note in T224993 to add a comment to the MediaWiki configuration warning not to change the session timeout without changing the timeout in the Kask configuration.
  • I've opened T224995 to add that warning to the example and production YAML config files for Kask

I think that should resolve this ticket.

I concur with @EvanProdromou . I'm glad we investigated doing this in code, but my takeaway is that a code-based solution is unnecessarily awkward, and a configuration-based solution is sufficient and appropriate.

The consequence of a mismatch is relatively low: people must log in sometimes when they shouldn't have to. While this would be annoying, it also isn't as tragic as "Wikipedia crashes" or even "nobody can log in". The natural course of action when investigating a bug report of "I have to log in too often" would be to check the session timeout in configuration, and the comment there would lead to an easily-resolved answer of making the Kask and MediaWiki configurations match.

Change 512798 abandoned by BPirkle:
Allow Kask TTL to override $wgObjectCacheSessionExpiry

Reason:
After discussion, we are going to handle this in configuration instead of code.

https://gerrit.wikimedia.org/r/512798