Page MenuHomePhabricator

RFC: Deprecate using php serialization inside MediaWiki
Closed, ResolvedPublic

Description

The first version of this convention has since been published. See mw:Coding conventions/PHP § Don't use built in serialization.

Problem statement

PHP unserialize() and serialize() can execute code when given malicious input. In most cases this serialization format is unnecessary. As a hardening measure against making a mistake that could result in remote code execution, we should avoid this format, even in cases where the serialized data is stored in a trusted data-store (such as the db).

Threats this rfc is intended to counter:

  • A bug in MediaWiki allows a user to inject untrusted data into an unserialize call. Removing unserialize reduces the potential for mistakes.
  • An attacker somehow obtains write access to either the database or memcache, and wants to extend his/her access to arbitrary code execution.

Proposed guideline

This RFC proposes the following:

  • New code SHOULD use JSON instead of PHP serialization whenever possible for serializing data.
  • Serialization of primitive values and key-value structures MUST never use PHP serialization.
  • Any edge cases that require use of serialize or unserialize complicated classes, MUST protect the serialized blob with HMAC (e.g. keyed to $wgSecretKey) to protect against malicious modifications of the blob. This logic should be implemented in a class (e.g. MWSerializeWrapper) to avoid copy-pasted code all over the place
  • Using unserialize is fine if the data never leaves the current process. In particular $clone = unserialize( serialize( $obj ) )

In addition to the new guideline for new code, this RFC proposes that we start to (slowly) convert existing uses of PHP serialization including old data in the db. Most likely by using JSON. The eventual goal being to remove all legacy uses of php unserialize()

Good first candidates for conversion:

  • LocalisationCache
  • MediaHandler metadata. This is particularly risky because the API will unserialize regardless of which MediaHandler class is in use.

Things still allowed under this RFC

  • Using php serialization on data that we never ingest (unserialize) is fine. In particular the php serialization output format of the API is outside of the scope of this RFC.
  • Using unserialize is fine if the data never leaves the current process. In particular using $clone = unserialize( serialize( $obj ) ) as a hack to create a deep clone is fine.

Unanswered questions

  • How to deal with memcached. We could potentially use a custom memcache client - we already have a php implementation. Its unclear what sort of performance loss there would be compared to using the memcache php extension. We could also potentially modify the php memcache extension to do what we want. php Memcache also has a Memcached::SERIALIZER_JSON which is perhaps what we are looking for. More investigation is needed
  • Redis is similar to memcached. There is a SERIALIZER_NONE option we could perhaps use, and handle the serialization ourselves.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

We (ArchCom) discussed this new RFC briefly. While we like the idea (who likes PHP serialization anyway), it would be good to flesh this out a bit more. Some questions:

  • is this just asking for a policy like "new code should/must not use php serialization"?
  • which exceptions should be allowed (e.g. what about php ser as an API result format?)
  • what to do about libraries that implicitly use PHP serialization (e.g. memcached)?
  • what bits of code need changing, and what's the migration path?
  • what php-ser artifacts do we have in the database and other storage systems? How do we convert them, or provide B/C some other way?

Perhaps a mail to wikitech-l could help gather some input on these questions.

I think job data is currently also using serialize(). We are working towards making job data compatible with eventbus, which expects JSON or Avro.

@tstarling not only wakeup. There are several issues in play here AFAIK:

  1. Object unserialization can trigger running PHP code via __wakeup and/or __destruct and/or __toString.
  2. There are bugs in most PHP versions that make un-serializing arbitrary user data unsafe (RCE-level unsafe).
  3. In general, the way unserializer is written now in PHP source it makes a lot of assumptions about data and weeding out all corner cases which cause previous one to happen is very hard, due to the fact that extensions can define their own unserializing code, and to the fact that support for references (BTW another thing json does not have) requires allowing all kinds of shenanigans. That makes eliminating all corner cases mentioned above very hard.

Summarily, I would strongly advise against ever unserializing unfiltered user data. That's just hanging a sign saying 'please pwn me' on your back. With filtered and predictable data, as long as you know what's going in it should be fine, but if the data is simple it's also unnecessary. It's only necessary when you need to store complex objects and preserve things like references.

is this just asking for a policy like "new code should/must not use php serialization"?

Basically yes. Or more pedantically, new code must not use unserialize (Its consuming serialized data that is scary, creating it doesn't really matter). First step is to stop new cases from being added. Additionally I would like existing usages of unserialize() to be seen as technical debt, and thus code changes removing unserialize() calls to be seen as a Good Thing(tm) [All other things being equal].

which exceptions should be allowed (e.g. what about php ser as an API result format?)

I don't really care about php ser as an api result format [In the context of this rfc], as the security risk is not to MediaWiki. I do think php ser as an api format is an icky thing, and would strongly advise people not to use it, but I'm fine with letting other people evaluate how much they care about the risk for this. I do think that https://www.mediawiki.org/w/api.php?action=help&modules=php should have a warning on it, but I see that as being outside the scope of what I'm asking here.

Exceptions that I imagine would be times when people actually want to serialize actual classes, and not just arrays of primitive types. As I mentioned above, I'd like such cases to have an HMAC to verify that the data hasn't been messed with. An example of this would be AuthManager when the authentication session spans multiple pages.

At this stage, exceptions will probably be the things that are already using unserialize() but are non-trivial to migrate.

what to do about libraries that implicitly use PHP serialization (e.g. memcached)?

That's a very good point that I didn't think of. Even if we force using the php memcached client and change how serilization works, I imagine that there is probably code relying on being able to serialize random objects. Have to investigate this/think more about it.

what php-ser artifacts do we have in the database and other storage systems? How do we convert them, or provide B/C some other way?

(img|oi|fa)_metadata, us_fileprops db fields is the one that especially comes to mind. Page text contents can in principal be a serialized object. rc_params, log_params in db can potentially be a serialized object sometimes. Localisation cache cdb files, job queue job params, site_data and site_config in site db table

At the beginning, we could try to detect if the field is json or not (json and php serialize look very different, so this should be easily do-able) and call the appropriate handler. Eventually once old artifacts are removed (For ephemeral cached object, perhaps after some set amount of time, we could assume they are all gone. For other things we maybe need a conversion script or something), we could remove the unserialize() fallback. Many of these things, the migration plan would have to be on a case by case basis.


For some background, part of the reason for this proposal, is use of unserialize in LocalisationCache caused T161453 to be a much bigger deal than it otherwise would be.

Catrope added a subscriber: Catrope.EditedApr 26 2017, 8:28 PM

Note that Notifications writes and reads PHP serialized data to/from the event_extra DB field. We'd have to migrate that. We have a few serializeable classes (with __wakeup() methods to restore computed fields that aren't in the serialization; notably Flow\Model\UUID) so it's not completely trivial to migrate, but it shouldn't be hard.

Krinkle updated the task description. (Show Details)Apr 26 2017, 8:31 PM

If it's only internal DB usage, not sure why should it be migrated away from serialize. That seems to be one of the cases where using serialization is fine (with concerns about class changing still being there, but any other serialization format would have them as well).

BTW, also I don't think serialize() has any problems with any data, esp. RCE-level problems. May be forgetting something, but IIRC serializing arbitrary data is fine. Unserializing is when the trouble starts.

One handy (ab)use of php serialization is deep cloning: $clone = unserialize( serialize( $obj ) ). Should that still be allowed? If not, what's the alternative?

daniel moved this task from Inbox to Last Call on the TechCom-RFC board.Apr 28 2017, 3:23 PM

This RFC has seen quite a bit of discussion on Phabricator, but has not been discussed in an IRC meeting. ArchCom feels that the discussion was already quite fruitful, and only minor issues are left to discuss. It was agreed that the RFC be put on Final Call: if no new pertinent concerns are raised and remain ope by May 10, the RFC will be approved for implementation.

daniel updated the task description. (Show Details)Apr 28 2017, 3:25 PM

I'm wondering about the following use case where an extension sets ParserOutput->setExtensionData( 'SomeIdentifier', new SomeComplexObject() ) where it gets temporary serialised in ApiStashEdit (I suspect, I haven't really looked at it) and later is retrieved unserialized for further processing.

Reason for this inquiry is that I several times hit the issue and had to ensure that our SemanticData object has to correct __sleep/__wakeup arguments to be correctly retrieved via ParserOutput::getExtensionData.

PS: Doing a JSON serialization on such "complex" objects is just not feasible.

I'm wondering about the following use case where an extension sets ParserOutput->setExtensionData( 'SomeIdentifier', new SomeComplexObject() ) where it gets temporary serialised in ApiStashEdit (I suspect, I haven't really looked at it) and later is retrieved unserialized for further processing.

I believe that under this RFC the simple answer is that ApiStashEdit or whatever ultimately does the serialize and unserialize (ParserCache maybe?) should add an HMAC. Users of ParserOutput->setExtensionData wouldn't need to care unless someone decides we need to break everything that's passing PHP objects in.

I think before it will be finalized, the guideline on when to use json/serialize should be refined a bit. I'd propose something of the following:

  1. If there is a possibility of the data supplied by or tampered with by external users, always use JSON (or another neutral format), no exceptions.
  2. If there is a need to internally store complex objects/structures, with non-trivial serialization/unserialization logic, use serialize.
  3. For key-value data not requiring objects of specific classes, always use JSON. Consider the same for simple value objects, with appropriate factory methods.
  4. In borderline cases, tend towards JSON unless there is a clear need for advanced functionality offered by serialize.

This is in addition to the HMAC guideline.

One handy (ab)use of php serialization is deep cloning

How often it is needed and not possible to cover with __clone? I'd suggest deprecating this and if the objects are under our control, use proper APIs - either __clone or if for some reason it's not enough, custom interface. If we need to clone objects not under our control (libraries?) I'd advocate marking these cases as technical debt and complaining upstream until they are fixed.

How often it is needed and not possible to cover with __clone? I'd suggest deprecating this and if the objects are under our control, use proper APIs - either __clone or if for some reason it's not enough, custom interface. If we need to clone objects not under our control (libraries?) I'd advocate marking these cases as technical debt and complaining upstream until they are fixed.

Although making __clone do a deep clone assumes that you always want a deep clone for that object, no exceptions.

assumes that you always want a deep clone for that object, no exceptions.

True. We should make a decision then what we mean by clone, and if we mean other thing in this particular case, use custom interface. We're getting a bit offtopic though I think :)

daniel moved this task from Last Call to TechCom-Approved on the TechCom-RFC board.May 3 2017, 7:54 PM
daniel edited projects, added TechCom-RFC (TechCom-Approved); removed TechCom-RFC.

@Smalyshev implementing deep cloning by hand is quite annyoing for complex objects, especially if they are extensible. We currently use serialize/unserialize to clone Wikibase Entities. Works for all subclasses, no brittle traversal code needed.

brion added a subscriber: brion.May 10 2017, 8:24 PM

One possible stop-gap for back-compatibility of old data for usages that don't require complex classes or looping object graphs would be to use a custom unserialize that can create stdObject instances only and never runs code.

This wouldn't cover all cases though, and we should list the ones that require non-trivial classes.

brion added a comment.May 10 2017, 8:31 PM

Note for migration planning of usages -- IIRC serialized database stuff is mostly arrays or stdobjects, while memcache stuff has more serialized complex classes... (Memcache is a potential attack vector in many ways, and this makes it scarier!)

For memcached we could double-encode the values. Use $memcached->setOption( Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_JSON ), and add HMAC authentication to MemcachedBagOStuff. The data would be double-serialized, {"hmac": "...", "value": "O:..."}

daniel edited projects, added TechCom-RFC; removed TechCom-RFC (TechCom-Approved).
daniel updated the task description. (Show Details)

This RFC was due for a decision during the ArchCom meeting on May 10. It seems like not all concerns that were brought up during the Last Call period where addressed. The following additional points were brought up during the ArchCom meeting:

  • it's unclear whether we want to convert existing data (in the database) to JSON
  • we could use a restrictive custom unserialize implementation that.
  • do we have a clear migration plan for all uses of serialize/unserialize?
  • where should the HMAC magic go? It would be bad to spread it all over the codebase.
  • the memcached native library will (per default) unserialize php objects by itself. Even if we don't put objects into memcached, an attacker still could, and trigger unserialize this way. Other services/libraries, like Redis, may have the same problem.
  • Should ParserOutput::setExtensionData support only scalars?

It seems like there is a general consensus that it would indeed be a good idea to not use php's serialize method. The proposal should be amended to reflect the issues mentioned above, and in other comments.

One handy (ab)use of php serialization is deep cloning

I'd never heard of this hack before, but imo its ok as long as the serialized object is unserialized immediately. As long as the serialized data is never stored, it can't be manipulated by an adversary.

t's unclear whether we want to convert existing data (in the database) to JSON

I would say yes (Eventually, we don't have to do it immediately). Stopping use of serialize is pointless if we have code that fallsback to using unserialize() for back-compat.

where should the HMAC magic go? It would be bad to spread it all over the codebase.

Definitely. There should probably be a wrapper that handles this sort of thing. Instead of calling serialize, users could do something along the lines of $s = new MWSerializer( $config ); $s->serialize( $foo ); $s->unserialize( $bar ); etc.

the memcached native library will (per default) unserialize php objects by itself. Even if we don't put objects into memcached, an attacker still could, and trigger unserialize this way. Other services/libraries, like Redis, may have the same problem.

This is hard to fix. I guess we could change the php library version and use that instead if there's no performance difference (I imagine there's a reason why the native library exists, so that's probably a no-go). I suppose the only other option would be to patch the native library and use a custom version for us.

Should ParserOutput::setExtensionData support only scalars?

I don't think that's necessary.

@Bawolff can you please update the task description to reflect the current state of the discussion?

Bawolff updated the task description. (Show Details)Jun 27 2017, 8:22 AM

@Bawolff can you please update the task description to reflect the current state of the discussion?

Done.

As another semi-related note, it may make sense to add a __wakeUp() method that just throws an exception to high risk classes like ScopedCallback

daniel moved this task from Under discussion to Inbox on the TechCom-RFC board.Jun 27 2017, 10:35 AM

I have filed T169328: Protect against PHP code execution via memcached/unserialize for the memcached issue. It should not be part of this RFC. Making it policy to avoid unserialize() in PHP code is sensible regardless of the shortcomings of PHP's memcached library.

daniel moved this task from Inbox to Last Call on the TechCom-RFC board.Jul 10 2017, 5:43 PM

As per the ArchCom meeting on July 5th, this RFC is entering the Last Call period. It will be approved for implementation if now pertinent issues remain unaddressed by July 19.

As per the ArchCom meeting on July 19th, this RFC has been approved for implementation. No concerns where raised during the last call period.

@Bawolff now that this has been approved, can you turn this into a guideline on mediawiki.org? It should fit somehow with https://www.mediawiki.org/wiki/Security_for_developers I suppose, and it should also be mentioned on https://www.mediawiki.org/wiki/Security_checklist_for_developers

Tgr added a subscriber: Tgr.Dec 20 2017, 9:19 PM

PHP 7 introduces a class whitelist to unserialize. That protects against userland attacks, although not necessarily against PHP bugs. So I guess not a reason to reconsider but a good way to secure unserialize calls kept for B/C, once we bump the required PHP version.

Yes, this may reduce attack surface and eliminate the obvious and known issues. Though I can not promise there are no attacks that don't use classes that we need (or that classes that we may use are 100% secure against serialization attacks). Serialize is just too powerful and complex to be secure with arbitrary data...
OTOH, I think now that we are PHP 7 it may be worth checking into this just to have one more security layer there.

Tgr added a comment.Feb 9 2019, 2:41 AM

PHP 7 introduces a class whitelist to unserialize. That protects against userland attacks, although not necessarily against PHP bugs.

Now that we have switched to PHP7, it would be a quick win to add an empty whitelist everywhere where we don't expect classes (MediaHandler, HistoryBlob, Message, SiteConfiguration (hopefully), LogEntry/RecentChanges, probably more).

Krinkle added a subscriber: Krinkle.

Tagging for TechCom internally to talk about this week. Specifically, what are the next steps? To document at https://www.mediawiki.org/wiki/Manual:Coding_conventions/PHP?

daniel moved this task from Inbox to In progress on the TechCom board.Apr 17 2019, 8:32 PM
Krinkle closed this task as Resolved.May 9 2019, 11:45 PM
Krinkle assigned this task to daniel.
Krinkle updated the task description. (Show Details)