RFC: Deprecate using php serialization inside MediaWiki
Closed, Resolved (Public)

Description

The first version of this convention has since been published. See mw:Coding conventions/PHP § Don't use built in serialization.

Problem statement

PHP unserialize() and serialize() can execute code when given malicious input. In most cases this serialization format is unnecessary. As a hardening measure against making a mistake that could result in remote code execution, we should avoid this format, even in cases where the serialized data is stored in a trusted data-store (such as the db).

Threats this rfc is intended to counter:

  • A bug in MediaWiki allows a user to inject untrusted data into an unserialize call. Removing unserialize reduces the potential for mistakes.
  • An attacker somehow obtains write access to either the database or memcache, and wants to extend their access to arbitrary code execution.
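
To illustrate both threats, here is a minimal sketch of how a hand-written blob causes class code to run during unserialize(). The class TempFileExample is hypothetical and deliberately harmless; real attacks chain together magic methods (__wakeup, __destruct, and others) of classes that already exist in the code base.

```php
<?php
// Hypothetical class, standing in for any loaded class with a magic method.
class TempFileExample {
	public $path = '';
	public static $log = [];

	public function __wakeup() {
		// Runs automatically inside unserialize(), before the caller sees the object.
		self::$log[] = "wakeup ran for {$this->path}";
	}
}

// A blob written by hand; no TempFileExample instance was ever serialized.
$blob = 'O:15:"TempFileExample":1:{s:4:"path";s:11:"/etc/passwd";}';

// The attacker chose both the class and its property values.
$obj = unserialize( $blob );
```

After the call, TempFileExample::$log contains one entry, even though the calling code never asked for that class.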

Proposed guideline

This RFC proposes the following:

  • New code SHOULD use JSON instead of PHP serialization whenever possible for serializing data.
  • Serialization of primitive values and key-value structures MUST never use PHP serialization.
  • Any edge case that requires serializing or unserializing complex classes MUST protect the serialized blob with an HMAC (e.g. keyed to $wgSecretKey) to guard against malicious modification of the blob. This logic should be implemented in a class (e.g. MWSerializeWrapper) to avoid copy-pasted code all over the place.
  • Using unserialize is fine if the data never leaves the current process. In particular, $clone = unserialize( serialize( $obj ) ) as a way to deep-clone an object is fine.
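
The HMAC requirement above could look roughly like the following. This is only a sketch under assumptions: the class name SignedSerializerSketch, the "mac|payload" blob layout, and the choice of SHA-256 are all invented for illustration and are not an existing MediaWiki API. The key would come from something like $wgSecretKey.

```php
<?php
// Hypothetical wrapper; name and blob layout are assumptions, not MediaWiki code.
class SignedSerializerSketch {
	private $key;

	public function __construct( string $secretKey ) {
		$this->key = $secretKey;
	}

	public function serialize( $value ): string {
		$payload = serialize( $value );
		// Prefix the payload with a hex MAC computed over the payload.
		return hash_hmac( 'sha256', $payload, $this->key ) . '|' . $payload;
	}

	public function unserialize( string $blob ) {
		$parts = explode( '|', $blob, 2 );
		if ( count( $parts ) !== 2 ) {
			throw new InvalidArgumentException( 'Malformed blob' );
		}
		[ $mac, $payload ] = $parts;
		$expected = hash_hmac( 'sha256', $payload, $this->key );
		// Verify in constant time BEFORE the payload reaches unserialize().
		if ( !hash_equals( $expected, $mac ) ) {
			throw new InvalidArgumentException( 'HMAC mismatch' );
		}
		return unserialize( $payload );
	}
}

$s = new SignedSerializerSketch( 'stand-in for $wgSecretKey' );
$blob = $s->serialize( [ 'a' => 1 ] );
$roundTrip = $s->unserialize( $blob );

// A modified blob is rejected without ever being unserialized.
$rejected = false;
try {
	$s->unserialize( substr( $blob, 0, -1 ) . 'X' );
} catch ( InvalidArgumentException $e ) {
	$rejected = true;
}
```

Note that this only helps as long as the attacker cannot read the key; it does nothing for data written by someone who already has it.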

In addition to the guideline for new code, this RFC proposes that we start to (slowly) convert existing uses of PHP serialization, including old data in the db, most likely to JSON. The eventual goal is to remove all legacy uses of PHP unserialize().

Good first candidates for conversion:

  • LocalisationCache
  • MediaHandler metadata. This is particularly risky because the API will unserialize regardless of which MediaHandler class is in use.

Things still allowed under this RFC

  • Using php serialization on data that we never ingest (unserialize) is fine. In particular the php serialization output format of the API is outside of the scope of this RFC.
  • Using unserialize is fine if the data never leaves the current process. In particular using $clone = unserialize( serialize( $obj ) ) as a hack to create a deep clone is fine.
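
For reference, a small sketch of the deep-clone hack and how it differs from the clone keyword. NodeExample is a hypothetical class used only for this illustration.

```php
<?php
// Hypothetical class holding a nested object.
class NodeExample {
	public $child;

	public function __construct( $child = null ) {
		$this->child = $child;
	}
}

$original = new NodeExample( new NodeExample() );

// clone copies only the top-level object; the nested object is shared.
$shallow = clone $original;

// The round trip rebuilds the whole object graph inside the current process.
$deep = unserialize( serialize( $original ) );

$sharedAfterClone = ( $shallow->child === $original->child );
$sharedAfterRoundTrip = ( $deep->child === $original->child );
```

Here $sharedAfterClone is true and $sharedAfterRoundTrip is false. The serialized string never leaves the process, so there is nothing for an adversary to tamper with.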

Unanswered questions

  • How to deal with memcached. We could potentially use a custom memcache client; we already have a PHP implementation. It's unclear what sort of performance loss there would be compared to using the memcache PHP extension. We could also potentially modify the PHP memcache extension to do what we want. The PHP Memcached class also has a Memcached::SERIALIZER_JSON option, which is perhaps what we are looking for. More investigation is needed.
  • Redis is similar to memcached. There is a SERIALIZER_NONE option we could perhaps use, and handle the serialization ourselves.

Event Timeline

How often is it needed and not possible to cover with __clone? I'd suggest deprecating this and, if the objects are under our control, using proper APIs: either __clone or, if for some reason that's not enough, a custom interface. If we need to clone objects not under our control (libraries?) I'd advocate marking these cases as technical debt and complaining upstream until they are fixed.

Although making __clone do a deep clone assumes that you always want a deep clone for that object, no exceptions.

assumes that you always want a deep clone for that object, no exceptions.

True. We should then decide what we mean by clone, and if we mean something else in a particular case, use a custom interface. We're getting a bit off-topic though, I think :)

@Smalyshev implementing deep cloning by hand is quite annoying for complex objects, especially if they are extensible. We currently use serialize/unserialize to clone Wikibase Entities. Works for all subclasses, no brittle traversal code needed.

One possible stop-gap for back-compatibility of old data, for usages that don't require complex classes or looping object graphs, would be to use a custom unserialize that can create stdClass instances only and never runs code.

This wouldn't cover all cases though, and we should list the ones that require non-trivial classes.

Note for migration planning of usages: IIRC serialized database stuff is mostly arrays or stdClass objects, while memcache stuff has more serialized complex classes... (Memcache is a potential attack vector in many ways, and this makes it scarier!)

For memcached we could double-encode the values. Use $memcached->setOption( Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_JSON ), and add HMAC authentication to MemcachedBagOStuff. The data would be double-serialized, {"hmac": "...", "value": "O:..."}

This RFC was due for a decision during the ArchCom meeting on May 10. It seems that not all concerns brought up during the Last Call period were addressed. The following additional points were brought up during the ArchCom meeting:

  • it's unclear whether we want to convert existing data (in the database) to JSON
  • we could use a restrictive custom unserialize implementation that.
  • do we have a clear migration plan for all uses of serialize/unserialize?
  • where should the HMAC magic go? It would be bad to spread it all over the codebase.
  • the memcached native library will (per default) unserialize php objects by itself. Even if we don't put objects into memcached, an attacker still could, and trigger unserialize this way. Other services/libraries, like Redis, may have the same problem.
  • Should ParserOutput::setExtensionData support only scalars?

It seems like there is a general consensus that it would indeed be a good idea to not use php's serialize method. The proposal should be amended to reflect the issues mentioned above, and in other comments.

One handy (ab)use of php serialization is deep cloning

I'd never heard of this hack before, but imo it's ok as long as the serialized object is unserialized immediately. As long as the serialized data is never stored, it can't be manipulated by an adversary.

it's unclear whether we want to convert existing data (in the database) to JSON

I would say yes (eventually; we don't have to do it immediately). Stopping use of serialize is pointless if we have code that falls back to using unserialize() for back-compat.

where should the HMAC magic go? It would be bad to spread it all over the codebase.

Definitely. There should probably be a wrapper that handles this sort of thing. Instead of calling serialize, users could do something along the lines of $s = new MWSerializer( $config ); $s->serialize( $foo ); $s->unserialize( $bar ); etc.

the memcached native library will (per default) unserialize php objects by itself. Even if we don't put objects into memcached, an attacker still could, and trigger unserialize this way. Other services/libraries, like Redis, may have the same problem.

This is hard to fix. I guess we could change the php library version and use that instead if there's no performance difference (I imagine there's a reason why the native library exists, so that's probably a no-go). I suppose the only other option would be to patch the native library and use a custom version for us.

Should ParserOutput::setExtensionData support only scalars?

I don't think that's necessary.

@Bawolff can you please update the task description to reflect the current state of the discussion?

Done.

As another semi-related note, it may make sense to add a __wakeup() method that just throws an exception to high-risk classes like ScopedCallback.
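
A rough sketch of that idea follows. ScopedCallbackSketch and CallLogSketch are hypothetical stand-ins written for this comment, not the actual MediaWiki classes; the property is public here only to keep the hand-built blob simple. The sketch also clears the callback before throwing, as an extra precaution so that a destructor running on the half-built object has nothing to call.

```php
<?php
// Hypothetical helper that counts how often the callback fired.
final class CallLogSketch {
	public static $calls = 0;

	public static function hit(): void {
		self::$calls++;
	}
}

// Hypothetical stand-in for a class that runs a callback on destruction.
class ScopedCallbackSketch {
	public $callback;

	public function __construct( callable $callback ) {
		$this->callback = $callback;
	}

	public function __destruct() {
		if ( $this->callback !== null ) {
			call_user_func( $this->callback );
		}
	}

	public function __wakeup() {
		// Drop whatever the blob supplied, then refuse to be unserialized.
		$this->callback = null;
		throw new UnexpectedValueException( __CLASS__ . ' cannot be unserialized' );
	}
}

// Normal use: the callback fires once when the object goes out of scope.
$guard = new ScopedCallbackSketch( 'CallLogSketch::hit' );
unset( $guard );
$callsAfterNormalUse = CallLogSketch::$calls;

// Attack attempt: a hand-built blob naming the class and a callback.
$class = 'ScopedCallbackSketch';
$fn = 'CallLogSketch::hit';
$blob = sprintf(
	'O:%d:"%s":1:{s:8:"callback";s:%d:"%s";}',
	strlen( $class ), $class, strlen( $fn ), $fn
);

$blocked = false;
try {
	unserialize( $blob );
} catch ( UnexpectedValueException $e ) {
	$blocked = true;
}
$callsAfterAttack = CallLogSketch::$calls;
```

The count stays at 1 after the attack attempt: the blob's callback is never invoked.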

I have filed T169328: Protect against PHP code execution via memcached/unserialize for the memcached issue. It should not be part of this RFC. Making it policy to avoid unserialize() in PHP code is sensible regardless of the shortcomings of PHP's memcached library.

As per the ArchCom meeting on July 5th, this RFC is entering the Last Call period. It will be approved for implementation if no pertinent issues remain unaddressed by July 19.

As per the ArchCom meeting on July 19th, this RFC has been approved for implementation. No concerns were raised during the last call period.

@Bawolff now that this has been approved, can you turn this into a guideline on mediawiki.org? It should fit somehow with https://www.mediawiki.org/wiki/Security_for_developers I suppose, and it should also be mentioned on https://www.mediawiki.org/wiki/Security_checklist_for_developers

PHP 7 introduces a class whitelist to unserialize. That protects against userland attacks, although not necessarily against PHP bugs. So I guess not a reason to reconsider but a good way to secure unserialize calls kept for B/C, once we bump the required PHP version.

Yes, this may reduce attack surface and eliminate the obvious and known issues. Though I cannot promise that there are no attacks that rely only on classes we need (or that the classes we may use are 100% secure against serialization attacks). Serialize is just too powerful and complex to be secure with arbitrary data...
OTOH, I think now that we are on PHP 7 it may be worth looking into this just to have one more security layer there.

PHP 7 introduces a class whitelist to unserialize. That protects against userland attacks, although not necessarily against PHP bugs.

Now that we have switched to PHP 7, it would be a quick win to add an empty whitelist everywhere we don't expect classes (MediaHandler, HistoryBlob, Message, SiteConfiguration (hopefully), LogEntry/RecentChanges, probably more).
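
A minimal sketch of what an empty whitelist does, using the allowed_classes option that unserialize() accepts since PHP 7.0. GadgetExample is a hypothetical class standing in for anything an attacker might name in a blob. As noted above, this guards against userland classes but not necessarily against bugs in PHP itself.

```php
<?php
// Hypothetical class with a magic method an attacker might want to trigger.
class GadgetExample {
	public static $woke = false;

	public function __wakeup() {
		self::$woke = true;
	}
}

$arrayBlob = serialize( [ 'width' => 640, 'height' => 480 ] );
$objectBlob = serialize( new GadgetExample() );

// Plain arrays and scalars come back unchanged.
$meta = unserialize( $arrayBlob, [ 'allowed_classes' => false ] );

// Objects are not instantiated as their real class, so __wakeup() never runs;
// PHP substitutes a __PHP_Incomplete_Class placeholder instead.
$obj = unserialize( $objectBlob, [ 'allowed_classes' => false ] );
$isPlaceholder = ( $obj instanceof __PHP_Incomplete_Class );
```

Callers that expect only arrays can additionally treat any placeholder object in the result as corrupt data.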

Krinkle subscribed.

Tagging for TechCom internally to talk about this week. Specifically, what are the next steps? To document at https://www.mediawiki.org/wiki/Manual:Coding_conventions/PHP?

Krinkle assigned this task to daniel.
Krinkle updated the task description.

Change 662714 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] DNM: WANObjectCache: warn on non-JSONic values.

https://gerrit.wikimedia.org/r/662714