T122375 calls out the need to segment and minimize access to sensitive data within the WMF cluster. One important subset of this is user-specific information:
- Password hashes: T120484: Create password-authentication service for use by CentralAuth
- Sessions: T134811: Consider REST with SSL (HyperSwitch/Cassandra) for session storage, T137272: Create BagOStuff subclass for HTTP
- Checkuser information, especially IP addresses
Such information is valuable to a variety of attackers, and incidents like the recent one at Yahoo serve as a reminder of why protecting this data matters. Since all of these items need similar protection, it makes sense to handle them in a single, firewalled "UserInfo" service.
Goals
- Protect sensitive data:
  - Firewall protection for both the service & its backend storage.
  - Even with remote code execution on a client, there is no way to perform arbitrary queries (ex: listing sessions) against the underlying storage system.
  - Narrow API exposing only what is absolutely needed.
  - Clients connect & authenticate via TLS.
- Reliability: no single points of failure (SPOFs); operationally as simple as possible.
- Multi-DC support: solid multi-master & fail-over support. Availability for reads and basic session updates is more important than transactional cross-DC consistency. However, reliable logout functionality (session deletion) should only return success once the data has reached the remote DC. The system should recover automatically from short network partitions.
- Horizontally scalable, especially for session storage:
  - 9k reads/s
  - ~100 writes/s
  - total session size: ~1.4G uncompressed, max 9G
- Latency: read latency comparable to the current Redis backend. We do not have metrics for this at the moment, and need to set them up ahead of any migration.
- Ideally, support for TTLs & garbage collection in the storage backend. If not supported natively, this can be implemented manually in a background task.
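To make the narrow-API and manual-TTL points above concrete, here is a minimal node.js sketch. The in-memory Map merely stands in for the real storage backend, and the names (`SessionStore`, `sweep`) are illustrative, not taken from the actual prototype:

```javascript
// Sketch of a narrow session API with manually implemented TTL garbage
// collection, as would be needed if the backend lacks native TTLs.
// The only operations exposed are get/set/delete: there is deliberately
// no way to list or query sessions.
class SessionStore {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.sessions = new Map(); // sessionId -> { value, expiresAt }
  }

  set(id, value, now = Date.now()) {
    this.sessions.set(id, { value, expiresAt: now + this.ttlMs });
  }

  get(id, now = Date.now()) {
    const entry = this.sessions.get(id);
    // Treat expired-but-not-yet-swept entries as already gone.
    if (!entry || entry.expiresAt <= now) return null;
    return entry.value;
  }

  delete(id) {
    return this.sessions.delete(id);
  }

  // Background GC pass: drop expired entries, return how many were removed.
  sweep(now = Date.now()) {
    let removed = 0;
    for (const [id, entry] of this.sessions) {
      if (entry.expiresAt <= now) {
        this.sessions.delete(id);
        removed++;
      }
    }
    return removed;
  }
}
```

Injecting `now` keeps the sketch deterministic; a real implementation would run `sweep()` on a timer, and only if the chosen backend did not garbage-collect expired entries itself.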
Options and trade-offs
Our default choice for a new service with fairly moderate requirements like this is to leverage our node.js ecosystem. Performance is unlikely to be an issue. Crypto / TLS support is available via node's native OpenSSL integration.
The two default candidates for backend storage are MySQL and Cassandra. The query needs are primarily simple key/value storage, ideally with TTL support / garbage collection, and replication lag should be as short as possible.
MySQL
- + 13+ years of experience; a solid performer.
- - No automatic horizontal sharding.
- +- Multi-DC / multi-master design trade-offs are less suited for this use case.
  - Galera offers synchronous master/master replication, but cannot provide both high availability and low latency across less reliable WAN links between DCs. Temporary partitions lead to outages on one side of the partition.
  - Master-slave replication is complex to manage across DCs, and does not support session timestamp updates in both DCs.
  - Semi-sync replication combined with multi-source replication can be used to build an eventually consistent storage system, with availability provided by HAProxy. However, replication levels can only be configured globally, which means we would need to decide whether to provide reliable logouts or to fail plain session updates during partitions.
- (-) No built-in garbage collection (but relatively easy to automate).
- +- Replication lag on the order of seconds (master-slave).
Cassandra
- +- ~2 years of experience.
- + Automatic horizontal sharding.
- + Mature multi-master support, with last-write-wins reconciliation after partitions. Per-request consistency options.
- (+) Built-in TTL / garbage collection support.
- (-) Does not scale well beyond ~1T per instance, but our data set is much smaller (less than 10G).
- + Very low replication lag (parallel writes).
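The per-request consistency options are what make the logout requirement from the goals achievable: reads and basic session updates can use a DC-local quorum for low latency and availability during partitions, while session deletions wait for a quorum in every DC before reporting success. A hedged sketch of how such routing might look (the function and operation names are illustrative, not prototype code; in a real node.js service the returned names would correspond to the cassandra-driver consistency constants):

```javascript
// Illustrative mapping from operation type to Cassandra consistency level.
// LOCAL_QUORUM: acknowledged by a quorum in the local DC only (fast,
// stays available during a cross-DC partition).
// EACH_QUORUM: acknowledged by a quorum in every DC, so a logout only
// succeeds once the deletion has reached the remote DC.
function consistencyFor(op) {
  switch (op) {
    case 'read':   // session lookups
    case 'touch':  // basic session / timestamp updates
      return 'LOCAL_QUORUM';
    case 'logout': // session deletions: must be durable in all DCs
      return 'EACH_QUORUM';
    default:
      // The narrow API rejects anything else (no listing, no scans).
      throw new Error(`unknown operation: ${op}`);
  }
}
```

Note the trade-off this encodes: during a full cross-DC partition, reads and session updates keep working, while logouts fail loudly rather than pretending to have propagated.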
Overall, we are leaning towards using Cassandra. The primary reason for this is better support for multi-DC operation and sharding, combined with very moderate data sizes.
We prototyped a simple session storage backend on Cassandra, which sustained about 3k reads/s on a dual-core laptop. Based on this performance data, we roughly estimate that three nodes per DC should comfortably handle the expected read load.
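The node-count estimate can be sanity-checked with simple arithmetic. A sketch, where the `headroom` factor is an assumption for illustration rather than a measured figure:

```javascript
// Back-of-envelope capacity sizing from the prototype numbers above.
// perNodeReads is what the dual-core laptop sustained; production
// hardware should do at least as well, so the estimate is conservative.
function nodesPerDc(peakReads, perNodeReads, headroom) {
  return Math.ceil((peakReads * headroom) / perNodeReads);
}

// 9k reads/s at ~3k reads/s per node with no extra headroom
// gives 3 nodes per DC; doubling for headroom would give 6.
```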
Timeline and division of labor
Session storage is technically simpler than authentication. The timeline depends primarily on deciding on the general approach, determining hardware needs, and procuring & installing that hardware. If we decide by early August to continue with the prototype service / MediaWiki integration & Cassandra storage, it might be possible to be ready for a gradual roll-out by the end of Q1. That said, we consciously did not set a hard deadline, and do not intend to rush it.
In cooperation with Security, we are planning to prototype the auth service in Q1. Division of labor is planned as follows:
- Services will own the service, storage and API.
- Security will implement a crypto library for handling MediaWiki password hash schemes to be used by the service.
In Q2, Security and Services intend to work with Reading-Infrastructure-Team-Backlog on integrating the auth service as a CentralAuth backend, and gradually rolling it out to production. At this point, we'll need production hardware.