T122375 calls out the need to segment and minimize access to sensitive data within the WMF cluster. One important subset of this is user-specific information:
- Password hashes: T120484: Create password-authentication service for use by CentralAuth
- Sessions: T134811: Consider REST with SSL (HyperSwitch/Cassandra) for session storage, T137272: Create BagOStuff subclass for HTTP
- Checkuser information, especially IP addresses
Such information is valuable to a variety of attackers, and incidents like the recent one at Yahoo serve as a reminder of why protecting this data matters. Since all of these items need similar protection, it makes sense to handle them in a single, firewalled "UserInfo" service.
Goals
- Protect sensitive data:
  - Firewall protection for both the service & its backend storage.
  - Even with remote code execution on a client, there is no way to perform arbitrary queries (ex: listing sessions) against the underlying storage system.
  - Narrow API exposing only what is absolutely needed.
  - Clients connect & authenticate via TLS.
- Reliability: no single points of failure (SPOFs); operationally as simple as possible.
- Multi-DC support: solid multi-master & fail-over support. Availability for reads and basic session updates is more important than transactional cross-DC consistency. However, reliable logout functionality (session deletion) should only return success once the data has reached the remote DC. The system should recover automatically from short network partitions.
- Horizontally scalable, especially for session storage:
  - 9k reads/s
  - ~100 writes/s
  - total session size: ~1.4G uncompressed, max 9G
- Latency: read latency comparable to the current Redis backend. We do not have metrics for this at the moment, and need to set them up ahead of any migration.
- Ideally, support for TTLs & garbage collection in the storage backend. If not supported natively, this can be implemented manually in a background task.
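To make the narrow-API and manual-TTL points above concrete, here is a minimal node.js sketch. The in-memory Map merely stands in for the real storage backend, and the names (`SessionStore`, `sweep`) are illustrative, not taken from the actual prototype:

```javascript
// Sketch of a narrow session API with manually implemented TTL garbage
// collection, as would be needed if the backend lacks native TTLs.
// The only operations exposed are get/set/delete: there is deliberately
// no way to list or query sessions.
class SessionStore {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.sessions = new Map(); // sessionId -> { value, expiresAt }
  }

  set(id, value, now = Date.now()) {
    this.sessions.set(id, { value, expiresAt: now + this.ttlMs });
  }

  get(id, now = Date.now()) {
    const entry = this.sessions.get(id);
    // Treat expired-but-not-yet-swept entries as already gone.
    if (!entry || entry.expiresAt <= now) return null;
    return entry.value;
  }

  delete(id) {
    return this.sessions.delete(id);
  }

  // Background GC pass: drop expired entries, return how many were removed.
  sweep(now = Date.now()) {
    let removed = 0;
    for (const [id, entry] of this.sessions) {
      if (entry.expiresAt <= now) {
        this.sessions.delete(id);
        removed++;
      }
    }
    return removed;
  }
}
```

Injecting `now` keeps the sketch deterministic; a real implementation would run `sweep()` on a timer, and only if the chosen backend did not garbage-collect expired entries itself.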
Options and trade-offs
Our default choice for a new service with fairly moderate requirements like this is to leverage our node.js ecosystem. Performance is unlikely to be an issue. Crypto / TLS support is available via node's native OpenSSL integration.
The two default candidates for backend storage are MySQL and Cassandra. The query needs are primarily simple key/value storage, ideally with TTL support / garbage collection, and replication lag should be as short as possible.
MySQL
- + 13+ years of experience; a solid performer.
- - No automatic horizontal sharding.
- +- Multi-DC / multi-master design trade-offs are less suited for this use case.
  - Galera offers synchronous master/master replication, but cannot provide both high availability and low latency across less reliable WAN links between DCs. Temporary partitions lead to outages on one side of the partition.
  - Master-slave replication is complex to manage across DCs, and does not support session timestamp updates in both DCs.
  - Semi-sync replication combined with multi-source replication can be used to build an eventually consistent storage system, with availability provided by HAProxy. However, replication levels can only be configured globally, which means we would need to decide whether to provide reliable logouts or to fail plain session updates during partitions.
- (-) No built-in garbage collection (but relatively easy to automate).
- +- Replication lag on the order of seconds (master-slave).
Cassandra
- +- ~2 years of experience.
- + Automatic horizontal sharding.
- + Mature multi-master support, with last-write-wins reconciliation after partitions. Per-request consistency options.
- (+) Built-in TTL / garbage collection support.
- (-) Does not scale well beyond ~1T per instance, but our data set is much smaller (less than 10G).
- + Very low replication lag (parallel writes).
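The per-request consistency options are what make the logout requirement from the goals achievable: reads and basic session updates can use a DC-local quorum for low latency and availability during partitions, while session deletions wait for a quorum in every DC before reporting success. A hedged sketch of how such routing might look (the function and operation names are illustrative, not prototype code; in a real node.js service the returned names would correspond to the cassandra-driver consistency constants):

```javascript
// Illustrative mapping from operation type to Cassandra consistency level.
// LOCAL_QUORUM: acknowledged by a quorum in the local DC only (fast,
// stays available during a cross-DC partition).
// EACH_QUORUM: acknowledged by a quorum in every DC, so a logout only
// succeeds once the deletion has reached the remote DC.
function consistencyFor(op) {
  switch (op) {
    case 'read':   // session lookups
    case 'touch':  // basic session / timestamp updates
      return 'LOCAL_QUORUM';
    case 'logout': // session deletions: must be durable in all DCs
      return 'EACH_QUORUM';
    default:
      // The narrow API rejects anything else (no listing, no scans).
      throw new Error(`unknown operation: ${op}`);
  }
}
```

Note the trade-off this encodes: during a full cross-DC partition, reads and session updates keep working, while logouts fail loudly rather than pretending to have propagated.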
Overall, we are leaning towards using Cassandra. The primary reason for this is better support for multi-DC operation and sharding, combined with very moderate data sizes.
We prototyped a simple session storage backend on Cassandra, which sustained about 3k reads/s on a dual-core laptop. Based on this performance data, we roughly estimate that three nodes per DC should comfortably handle the expected read load.
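The node-count estimate can be sanity-checked with simple arithmetic. A sketch, where the `headroom` factor is an assumption for illustration rather than a measured figure:

```javascript
// Back-of-envelope capacity sizing from the prototype numbers above.
// perNodeReads is what the dual-core laptop sustained; production
// hardware should do at least as well, so the estimate is conservative.
function nodesPerDc(peakReads, perNodeReads, headroom) {
  return Math.ceil((peakReads * headroom) / perNodeReads);
}

// 9k reads/s at ~3k reads/s per node with no extra headroom
// gives 3 nodes per DC; doubling for headroom would give 6.
```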
Timeline and division of labor
Session storage is technically simpler than authentication. The timeline depends primarily on deciding on the general approach, determining hardware needs, and procuring & installing that hardware. If we decide by early August to continue with the prototype service / MediaWiki integration & Cassandra storage, it might be possible to be ready for a gradual roll-out by the end of Q1. That said, we consciously did not set a hard deadline, and do not intend to rush it.
In cooperation with Security, we are planning to prototype the auth service in Q1. Division of labor is planned as follows:
- Services will own the service, storage and API.
- Security will implement a crypto library for handling MediaWiki password hash schemes to be used by the service.
In Q2, Security and Services intend to work with Reading-Infrastructure-Team-Backlog on integrating the auth service as a CentralAuth backend, and gradually rolling it out to production. At this point, we'll need production hardware.