Page MenuHomePhabricator

Document and communicate potentially breaking session storage serialization change
Closed, ResolvedPublic

Description

As part of T206016: Create a service for session storage and T161647: RFC: Deprecate using php serialization inside MediaWiki, we implemented T215533: Enable use of session storage service in MediaWiki and T222099: Staging release of RESTBagOStuff using Kask. In summary, this modified RESTBagOStuff to serialize session information to JSON and store it in the new Kask session storage service (implemented in Cassandra and sitting behind a REST API). This is part of the long-term goals of production active/active datacenters and eliminating use of PHP serialization.

However, this mangles PHP objects, which affected MediaWiki-extensions-OATHAuth . This extension is being adjusted, but this breaking change was a surprise to its developers, caused confusion, and took some time to track down.

Help avoid affecting others by properly documenting this change:

  • update WebRequest::setSessionData() documentation to indicate "proper" objects are not allowed
  • corresponding update to Session::set() documentation, and probably also a note in the class comment
  • corresponding update to functions with RESTBagOStuff, and probably also a note in the class comment
  • add something to RELEASE-NOTES
  • emails to mediawiki-l and wikitech-l
  • search for and update any relevant documentation on mediawiki.org

I'm adding as subscribers people who commented on T222099, where this was discussed. Feel free to unsubscribe yourself if you like.

Event Timeline

update WebRequest::setSessionData() documentation to indicate "proper" objects are not allowed

*that don't implement JsonSerializable (or something thereof) :)

This beggs the question, though. Why this solution rather than implementing proper PHP serialization in RestBagOStuff so $_SESSION and corresponding Session functionality works like everyone else expects it to?

In terms of T161647, this would fall under the "Any edge cases that require use of serialize or unserialize complicated classes" clause.

update WebRequest::setSessionData() documentation to indicate "proper" objects are not allowed

*that don't implement JsonSerializable (or something thereof) :)

Note that classes implementing JsonSerializable won't give behavior people probably expect either:

class MyObject implements JsonSerializable {
    private $foo, $bar, $baz;

    public function __construct( $foo, $bar, $baz ) {
        $this->foo = $foo;
        $this->bar = $bar;
        $this->baz = $baz;
    }

    public function jsonSerialize() {
        return [
            'foo' => $this->foo,
            'bar' => $this->bar,
            'baz' => $this->baz,
        ];
    }
}

$obj = new MyObject( 1, 42, 23 );

// Simulate storing in $_SESSION and reading back
$obj2 = json_decode( json_encode( $obj ), true );

var_dump( $obj2 );
/** Output:
array(3) {
  ["foo"]=>
  int(1)
  ["bar"]=>
  int(42)
  ["baz"]=>
  int(23)
}
*/

var_dump( $obj2 instanceof MyObject );
/** Output:
bool(false)
*/

The object got mangled to an associative array.

This beggs the question, though. Why this solution rather than implementing proper PHP serialization in RestBagOStuff so $_SESSION and corresponding Session functionality works like everyone else expects it to?

In terms of T161647, this would fall under the "Any edge cases that require use of serialize or unserialize complicated classes" clause.

  1. RESTBagOStuff could potentially be used for things other than sessions, with different security and behavior considerations
  2. in my experience, the more "external" data storage is to the serializing code, the more likely a need will eventually arise for the data to be conveniently readable by something other than its initial writer. And JSON is generally more convenient for things not named PHP. While this seems unlikely for sessions, we don't know what might eventually be stored in Kask via RESTBagOStuff.
  3. it sounds like little code would be affected, and the affected code would not be difficult to refactor, so I'm not convinced this rises to the "require" wording for T161647: RFC: Deprecate using php serialization inside MediaWiki edge cases.

Why this solution rather than implementing proper PHP serialization in RestBagOStuff so $_SESSION and corresponding Session functionality works like everyone else expects it to?

Because PHP serialization is extremely brittle and rather insecure. We should not use it for anything that leaves the process, ever.

Using PHP serialization means that, if anything(*) changes about any of the objects that got serialized as part of the session, we lose all session data.


(*) this includes innocent changes like re-ordering fields, or changing access modifiers on fields. Not to mention adding fields.

  1. RESTBagOStuff could potentially be used for things other than sessions, with different security and behavior considerations
  2. in my experience, the more "external" data storage is to the serializing code, the more likely a need will eventually arise for the data to be conveniently readable by something other than its initial writer. And JSON is generally more convenient for things not named PHP. While this seems unlikely for sessions, we don't know what might eventually be stored in Kask via RESTBagOStuff.

Neither of these are arguments for not supporting this use case.

  1. it sounds like little code would be affected, and the affected code would not be difficult to refactor, so I'm not convinced this rises to the "require" wording for T161647: RFC: Deprecate using php serialization inside MediaWiki edge cases.

PHP sessions normally support storing PHP-serializable objects. Breaking that assumption, but only when RESTBagOStuff happens to be being used, seems liable to become a recurring issue which we could easily avoid.

Using PHP serialization means that, if anything(*) changes about any of the objects that got serialized as part of the session, we lose all session data.

The fact that rare cases on upgrade might lose all session data is not particularly convincing as a reason to abandon the standard behavior entirely. We already lose sessions much more often than that due to timeouts or eviction.

Somewhat more interesting is that the rare cases might not actually lose the session data, but instead result in objects with invalid internal state. But we'd have similar issues with other caches storing those objects, and the issues wouldn't be specific to use of RESTBagOStuff.

(*) this includes innocent changes like re-ordering fields, or changing access modifiers on fields. Not to mention adding fields.

Re-ordering fields is fine, as they're referenced by name not position.

Adding fields is fine with the caveat that they'll have their default-initialized values, not the values assigned by __construct(). __wakeup() can be used to assign different values if necessary.

Removing fields is fine, but the object will have undeclared fields assigned for the removed fields (with unexpected names containing ASCII NUL if the removed fields were private or protected).

Renaming fields is handled as a removal of the old name and an addition of the new, so very likely the new field won't have the expected value. You're right that changing access modifiers of fields will have issues, as it's basically a field rename.

PHP sessions normally support storing PHP-serializable objects. Breaking that assumption, but only when RESTBagOStuff happens to be being used, seems liable to become a recurring issue which we could easily avoid.

My impression, based not only on T161647 but also on the Coding Conventions, was that we intended to (eventually) stop using PHP serialization for (at least most of) the other BagOStuff classes as well. In other words, yes the current change would make RESTBagOStuff an outlier, but only in that it had already moved in the intended direction and the other implementations were technical debt. Did I misunderstand?

I think we can require not having SessionManager support PHP serialization. For any legacy cases, if they exist, the caller could be made responsible for that instead, thus storing only a primitive string from the REST perspective.

The tricky bit will be how we communicate and detect this. So probably we'd need at least 1 major release cycle where SessionManager (and maybe RESTBagOStuff) somehow detects whether the value is JSON-friendly, and if not, emit a warning and use php serialization still. It would also need to detect this on the way out, possibly with a magic prefix that would be invalid JSON (any non-quoted character will be invalid JSON so that's easy).

While allowing writes during the migration should be safe (in that we'd "simply" not use that feature), allowing it to happen on read is something we'd probably want a feature flag for so that we can explicitly turn that off and make it hard fail and/or fallback to perceiving the value as absent. If we have a feature flag like that, I suppose we might as well use it both during read and during write.

For any legacy cases, if they exist

Our previous assessment, some of which is documented in T215533 but some of which occurred on IRC, was that legacy cases do not exist. (Of course, we can never guarantee what third-party wikis may be doing privately.)

My initial implementation added a KaskBagOStuff class, in order to not break RESTBagOStuff compatibility. We determined this was not necessary, and I proceeded to modify RESTBagOStuff directly. That change, which was merged four months ago, broke RESTBagOStuff compatibility in more ways than just serialization method (ex. PUT vs POST). If that was a mistake, there are more changes to make. FWIW, I have heard of no related breakage other than T233146, which started this conversation and which is being addressed without (additional) changes to RESTBagOStuff.

Regarding migration of existing production session data, that's being handled via MultiWriteBagOStuff and should not require RESTBagOStuff changes.

So ... I just checked an email from @CCicalese_WMF from about a week ago, and the 1.34 release branch is scheduled to be created on Sept. 29, three days from now. If I need to make changes to RESTBagOStuff before the release, time is growing short.

I think the open questions are:

  1. should RESTBagOStuff support PHP serialize/unserialize (either in addition to or instead of JSON)?
  2. should we use PHP or JSON serialization for sessions stored in Kask?
  3. does SessionManager and/or RESTBagOStuff need auto-detect backward compatibility code for serialization type?

Looking back, I was incorrect in my previous comment about PUT vs. POST. What I actually did in the merged code was allow specifying a write method as a constructor parameter, but I defaulted to PUT, which is what RESTBagOStuff had previously been using. So no breaking change there.

My suggestion: under a separate task, add an additional constructor parameter for serialization type, with choices "PHP" and "JSON", with a default of "PHP". This has the following benefits:

  1. maintains backward compatibility for any existing callers (even though we are not aware of any)
  2. still allows us to use JSON for sessions stored in Kask if we choose to (we'd just specify "JSON" in the config)
  3. permanently avoids the need for autodetect backward compatibility code. If we use JSON for sessions stored in Kask, we can migrate production via MultiWriteBagOStuff. Any other callers will find the new serialization type option helpful should they also choose to migrate
  4. allows us to deprecate then remove PHP serialization from RESTBagOStuff in a friendly and deprecation-policy-compliant way, should we choose to do so
  5. lets us get 1.34 out the door cleanly.

This is a simple change to code, so if no one objects very soon, I'm going to move forward with that plan and see if anyone likes the idea enough to merge it. :)

So ... I just checked an email from @CCicalese_WMF from about a week ago, and the 1.34 release branch is scheduled to be created on Sept. 29, three days from now. If I need to make changes to RESTBagOStuff before the release, time is growing short.

I think the open questions are:

  1. should RESTBagOStuff support PHP serialize/unserialize (either in addition to or instead of JSON)?
  2. should we use PHP or JSON serialization for sessions stored in Kask?
  3. does SessionManager and/or RESTBagOStuff need auto-detect backward compatibility code for serialization type?

Looking back, I was incorrect in my previous comment about PUT vs. POST. What I actually did in the merged code was allow specifying a write method as a constructor parameter, but I defaulted to PUT, which is what RESTBagOStuff had previously been using. So no breaking change there.

My suggestion: under a separate task, add an additional constructor parameter for serialization type, with choices "PHP" and "JSON", with a default of "PHP". This has the following benefits:

  1. maintains backward compatibility for any existing callers (even though we are not aware of any)
  2. still allows us to use JSON for sessions stored in Kask if we choose to (we'd just specify "JSON" in the config)
  3. permanently avoids the need for autodetect backward compatibility code. If we use JSON for sessions stored in Kask, we can migrate production via MultiWriteBagOStuff. Any other callers will find the new serialization type option helpful should they also choose to migrate
  4. allows us to deprecate then remove PHP serialization from RESTBagOStuff in a friendly and deprecation-policy-compliant way, should we choose to do so
  5. lets us get 1.34 out the door cleanly.

This is a simple change to code, so if no one objects very soon, I'm going to move forward with that plan and see if anyone likes the idea enough to merge it. :)

My 2¢: If we are serious about following through on T161647 and the coding conventions, then this seems like a Good Plan. However, we should at least create the issues to follow through on moving the other usages to JSON.

My suggestion: under a separate task, add an additional constructor parameter for serialization type, with choices "PHP" and "JSON", with a default of "PHP". This has the following benefits:

  1. maintains backward compatibility for any existing callers (even though we are not aware of any)
  2. still allows us to use JSON for sessions stored in Kask if we choose to (we'd just specify "JSON" in the config)
  3. permanently avoids the need for autodetect backward compatibility code. If we use JSON for sessions stored in Kask, we can migrate production via MultiWriteBagOStuff. Any other callers will find the new serialization type option helpful should they also choose to migrate
  4. allows us to deprecate then remove PHP serialization from RESTBagOStuff in a friendly and deprecation-policy-compliant way, should we choose to do so
  5. lets us get 1.34 out the door cleanly.

This is a simple change to code, so if no one objects very soon, I'm going to move forward with that plan and see if anyone likes the idea enough to merge it. :)

This sounds like a good plan to me.

Note the HMACing code called for there is simple enough. On save, based on existing code in Session::setSecret(), it's like

$serialized = serialize( $data );
$hmac = hash_hmac( 'sha256', $serialized, $hmacKey, true );
$toStore = 'phpserialized.' . base64_encode( $hmac ) . '.' . $serialized;

and on load, based on Session::getSecret(), perhaps

$pieces = explode( '.', $stored, 3 );
if ( count( $pieces ) !== 3 || $pieces[0] !== 'phpserialized' ) {
    throw ...;
}
list( , $hmac, $serialized ) = $pieces;
$checkHmac = hash_hmac( 'sha256', $serialized, $hmacKey, true );
if ( !hash_equals( $checkHmac, base64_decode( $hmac ) ) ) {
    throw ...;
}
$data = unserialize( $serialized );

// check for unserialize failure, if you want
if ( $data === false && $serialized !== serialize( false ) ) {
    throw ...;
}

@BPirkle why don't we just revert RESTBagOStuff to its pre-Kask version, and make the new version KaskBagOStuff? We'd need to change the config for WMF production but otherwise harmless.

@BPirkle why don't we just revert RESTBagOStuff to its pre-Kask version, and make the new version KaskBagOStuff? We'd need to change the config for WMF production but otherwise harmless.

"just" implies that would be easier. At this point, it'd be more work, not less.

We originally talked about using a KaskBagOStuff. We went this way (at least in my mind) because certain things about RESTBagOStuff were inflexible and poorly suited for many use cases. So it made more sense to improve it, which made it more suitable not only for Kask, but also for any other code that might want to use it. Otherwise, any other usages would likely need to derive their own version, and we'd end up with KaskBagOStuff, FooBagOStuff, BarBagOStuff, etc. So instead, we improved it for everyone.

Here are the changes we made to RESTBagOStuff, which are really pretty minimal code-wise:

  1. PUT vs POST. The pre-Kask version of RESTBagOStuff was hard-coded to use PUT, but Kask uses POST
  2. Headers. The pre-Kask RESTBagOStuff did not allow customizing them, but we need to for Kask
  3. Extended error handling. The pre-Kask RESTBagOStuff did not have a way to "ripple up" the error messages supplied by Kask, so helpful information for diagnosing problems would be lost.
  4. T161647: RFC: Deprecate using php serialization inside MediaWiki. Whether we end up using PHP or JSON serialization for Kask, as an organization we have decided that we prefer to not use PHP serialization for any data that leaves the process, unless we really, really need to. Having a class whose purpose in life is to send data outside not only the process boundary, but likely also to a different server who-knows-where on the internet, but which only supports PHP serialization, seems like a Bad Idea.

All of those changes, with the possible exception of extended error handling, are generally useful to any code that might use RESTBagOStuff.

If the change for T233963: Add serialization options to RESTBagOStuff (which is also a fairly small code change) gets merged, there will be no backward compatibility issues for any existing users of RESTBagOStuff. So our "breaking change" is reduced to only the PHP vs JSON effect on session storage, and only if we choose to continue using JSON for session storage. That would be true even with a KaskBagOStuff, so changing our class hierarchy wouldn't save us anything.

At this point, unless someone strongly objects to T233963: Add serialization options to RESTBagOStuff it is less work to stay on our current path, and it leaves us, in my opinion, in a better place.

@BPirkle why don't we just revert RESTBagOStuff to its pre-Kask version, and make the new version KaskBagOStuff? We'd need to change the config for WMF production but otherwise harmless.

"just" implies that would be easier. At this point, it'd be more work, not less.

We originally talked about using a KaskBagOStuff. We went this way (at least in my mind) because certain things about RESTBagOStuff were inflexible and poorly suited for many use cases. So it made more sense to improve it, which made it more suitable not only for Kask, but also for any other code that might want to use it. Otherwise, any other usages would likely need to derive their own version, and we'd end up with KaskBagOStuff, FooBagOStuff, BarBagOStuff, etc. So instead, we improved it for everyone.

Here are the changes we made to RESTBagOStuff, which are really pretty minimal code-wise:

  1. PUT vs POST. The pre-Kask version of RESTBagOStuff was hard-coded to use PUT, but Kask uses POST
  2. Headers. The pre-Kask RESTBagOStuff did not allow customizing them, but we need to for Kask
  3. Extended error handling. The pre-Kask RESTBagOStuff did not have a way to "ripple up" the error messages supplied by Kask, so helpful information for diagnosing problems would be lost.
  4. T161647: RFC: Deprecate using php serialization inside MediaWiki. Whether we end up using PHP or JSON serialization for Kask, as an organization we have decided that we prefer to not use PHP serialization for any data that leaves the process, unless we really, really need to. Having a class whose purpose in life is to send data outside not only the process boundary, but likely also to a different server who-knows-where on the internet, but which only supports PHP serialization, seems like a Bad Idea.

And the first one is really the only one that isn't backward compatible, isn't it?

All of those changes, with the possible exception of extended error handling, are generally useful to any code that might use RESTBagOStuff.

If the change for T233963: Add serialization options to RESTBagOStuff (which is also a fairly small code change) gets merged, there will be no backward compatibility issues for any existing users of RESTBagOStuff. So our "breaking change" is reduced to only the PHP vs JSON effect on session storage, and only if we choose to continue using JSON for session storage. That would be true even with a KaskBagOStuff, so changing our class hierarchy wouldn't save us anything.

At this point, unless someone strongly objects to T233963: Add serialization options to RESTBagOStuff it is less work to stay on our current path, and it leaves us, in my opinion, in a better place.

AFAIK, RESTBagOStuff was written for more less this purpose, and went unused (by us, at least) in the meantime. I think fixing it is the Right Thing™.

At this point, unless someone strongly objects to T233963: Add serialization options to RESTBagOStuff it is less work to stay on our current path, and it leaves us, in my opinion, in a better place.

Asked and answered! Thanks for taking on the complexities here; makes sense to me.

So, it sounds like we need to make a decision about whether we use JSON or PHP serialization in the RESTBagOStuff -> Kask interface.

My understanding is that if a session variable is a serializable object, but isn't JSONSerializable, we fail to serialize the data?

Looking at our extensions, it seems like there are a lot of calls to the session storage, and some of the calls seem to use complex data structures, although I haven't found one that uses objects specifically.

I think @daniel has a good point, that we shouldn't use PHP serialization across process boundaries, and I'd prefer to comply with existing RFCs and coding conventions.

Is it possible for us to wrap serializable-but-not-JSONSerializable objects somehow, so that they're encoded in the JSON? Maybe something like:

{
  "property": {
       "__tag": "__RESTBAGOSTUFF_PHP_SERIALIZATION_COMPATIBILITY__",
       "__data": "PHP serialization form of the object"
   }
}

We could walk the tree of session variables before encoding, and then walk the tree after decoding, to convert non-JSON-compatible objects from their wrappers.

This would still mean that we're sending PHP-serialized data over the wire, but it's a good stopgap until extensions change their code.

I think our next best option is to find where objects are being stored in the production extensions, and change them to be JSON-serializable (or do something else).

Using PHP serialization is probably my last choice. If we really can't check and fix our production extensions, it's probably what we need to go with, but I think it's going to be a something we have to deal with in the future.

My understanding is that if a session variable is a serializable object, but isn't JSONSerializable, we fail to serialize the data?

If the session variable is a PHP object (or is an array that contains PHP objects at some level of nesting), PHP serialization will often round-trip it while JSON never will (whether it's JSONSerializable or not).

I think @daniel has a good point, that we shouldn't use PHP serialization across process boundaries, and I'd prefer to use JSON in this case.

Note @daniel overstated the fragility in T233537#5518881, as detailed in T233537#5520231.

As for security, that was addressed by rMW2ed69af15cb8: Add optional serialization_type and hmac_key param values to RESTBagOStuff.

Is it possible for us to wrap serializable-but-not-JSONSerializable objects somehow, so that they're encoded in the JSON?

It would have to be be every object, not just non-JSONSerializable ones. JSONSerializable doesn't give you back the original object from json_decode(), it only gives you whatever non-object data structure was returned from jsonSerialize().

We might special case stdClass; we'd still have to mark it somehow to avoid it being returned as an associative array, but the "__data" part could be JSON-encoded (with the custom object handling) as an associative array instead of being a PHP-serialized string.

BTW, it's not just objects that can be problematic. Consider this:

$a = 'nope';
$data = [ &$a, &$a ];

// simulate storing to the session and reading back
$out = unserialize( serialize( $data ) );
$out[0] = 'ok';
echo $out[1]; // produces 'ok'

// simulate storing to the session and reading back
$out = json_decode( json_encode( $data ), true );
$out[0] = 'ok';
echo $out[1]; // produces 'nope'

Unfortunately references are much harder to detect than objects. Perhaps fortunately, I don't think we have much in MediaWiki that would care. But see PHP bug 36424 where upstream fixed an edge case that produced the "nope" result.

PHP serialization will often round-trip it while JSON never will (whether it's JSONSerializable or not).

My understanding is that JSONSerializable expects classes that implement it to also provide a static pseudo-constructor for construction from the JSONic data structure. That would allow for a full round trip.

PHP serialization will often round-trip it while JSON never will (whether it's JSONSerializable or not).

My understanding is that JSONSerializable expects classes that implement it to also provide a static pseudo-constructor for construction from the JSONic data structure. That would allow for a full round trip.

I see no mention of such an understanding at https://www.php.net/JsonSerializable or https://www.php.net/manual/en/jsonserializable.jsonserialize.php. And json_decode() still wouldn't round-trip it.

I see no mention of such an understanding at https://www.php.net/JsonSerializable or https://www.php.net/manual/en/jsonserializable.jsonserialize.php. And json_decode() still wouldn't round-trip it.

You are right, we would have to make thus up ourselves. Something like

interface JsonDeserializable extends JsonSerializable {
  static function jsonDeserialize( $data, JsonDeserializer $deserializer );
}

interface JsonDeserializer {
  function jsonDeserialize( $data );
}

The separate JsonDeserializer is needed for transparently desalinizing sub-structures.

@BPirkle @daniel @Anomie and I discussed the approach to transitioning from PHP to JSON serialisations. I'll share the notes doc with you. There appears to be a good path for it but since REST API is the highest priority we'll close of this work by moving to use PHP serialisation for now but schedule the work to convert to JSON for next quarter. I've added it to initiative planning and I will create a new task to track that work.

Because we're currently using JSON we'll have to make sure we consider the impact of switching, though the change is currently only available on Test Wiki, is that correct?

Correct, just testwiki, so the impact of a hard cutover would be minimal. I suggest we just do it without transition.

Status note: related tasks T233963: Add serialization options to RESTBagOStuff and T234779: Removed deprecated "legacy" serialization type from RESTBagOStuff are complete. RESTBagOStuff now supports (only) PHP and JSON serialization types, with (optional) hmac protection. It is in use on production wikis of all groups to store sessions in Kask, using PHP serialization and hmac protection.

I think the only open question for this task is whether we convert to JSON serialization in the spirit of T161647: RFC: Deprecate using php serialization inside MediaWiki, or consider this an exception and stay with PHP.

The other tasks mentioned herein are completed. This appears to be done.