LoginNotify uses CheckUser data to determine whether the user is logging in from a known /24 or /64 subnet, that is, a subnet that the user has previously used. It connects to the database of every wiki the user is centrally attached to, and does a query against cu_changes, looking for a cuc_ip prefix. This can take many seconds.
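The "known subnet" notion above can be made concrete with a small sketch: an IPv4 address is reduced to its /24 network and an IPv6 address to its /64 network, and two logins count as coming from the same place when those prefixes match. (Python illustration only; the extension itself works on `cuc_ip_hex` prefixes.)

```python
import ipaddress

def subnet_prefix(ip: str) -> str:
    """Return the /24 (IPv4) or /64 (IPv6) network an address belongs to."""
    addr = ipaddress.ip_address(ip)
    bits = 24 if addr.version == 4 else 64
    return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

# Two addresses in the same /24 reduce to the same subnet key.
assert subnet_prefix("192.0.2.17") == subnet_prefix("192.0.2.200") == "192.0.2.0/24"
assert subnet_prefix("2001:db8::1") == "2001:db8::/64"
```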
The latency is hidden from the user by deferring notifications to a job. But this has the following problems:
- Complexity.
- DoS vulnerability.
- Uncontrolled latency between the login attempt and the notification.
- Difficult to reuse in other contexts (e.g. T211542: LoginNotify should expose the "is this an untrusted login?" check as a service).
- Extension dependency for 3rd party users, with privacy implications.
- CPU etc. resource cost of job execution.
There is a cache of the most recent subnet, but a user who legitimately moves from one subnet to another will typically trigger a CheckUser query.
Schema
I propose adding a table like
CREATE TABLE loginnotify_seen_net (
  -- Primary key
  lsn_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  -- Time since epoch divided by e.g. 15 days, rollover after ~2700 years
  lsn_time_bucket SMALLINT UNSIGNED,
  -- globaluser.gu_id or user.user_id (CentralIdLookup)
  lsn_user INT UNSIGNED,
  -- Truncated hash of IP address subnet
  lsn_subnet BIGINT,
  -- Index for checking whether a login is from a known IP, in a time range
  UNIQUE INDEX (lsn_user, lsn_subnet, lsn_time_bucket),
  PRIMARY KEY (lsn_id)
);
There would be a table like this in a shared DB, and also in the local wiki DB for non-CentralAuth wikis like officewiki.
The reasons for storing a hash of the subnet instead of the subnet itself are:
- Shorter than IPv6 hex
- Random collisions for a 64-bit hash are unlikely. We are hashing a 64-bit prefix into a 64-bit output, so the expected number of collisions globally for a given network is of order 1. Even if a state-level actor with a large IPv6 network could engineer a preimage collision, the payoff would be small.
- Better privacy story if we decide to extend the expiry time beyond CheckUser's 90 days. LoginNotify's cookie expiry is 180 days which seems reasonable to me. Storing only a hash protects it against casual snooping by admins, and prevents scope creep. We could include the user ID in the hash, so that it can't be used to correlate users.
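The hashing scheme sketched above could look something like the following. This is a sketch under assumptions, not the extension's actual code: the function names, the choice of SHA-256, and the `user_id|subnet` keying are illustrative. The digest is truncated to a signed 64-bit value so it fits the `lsn_subnet` BIGINT column, and the time bucket is simply seconds-since-epoch integer-divided by the bucket length.

```python
import hashlib
import ipaddress
import struct

def lsn_subnet_hash(central_user_id: int, ip: str) -> int:
    """Truncated hash of the user's subnet. Keying the hash with the user ID
    means the same subnet yields different rows for different users, so the
    column can't be used to correlate users. (Illustrative names.)"""
    addr = ipaddress.ip_address(ip)
    bits = 24 if addr.version == 4 else 64
    net = ipaddress.ip_network(f"{ip}/{bits}", strict=False)
    digest = hashlib.sha256(f"{central_user_id}|{net}".encode()).digest()
    # First 8 bytes as a signed 64-bit integer, to fit a BIGINT column.
    return struct.unpack(">q", digest[:8])[0]

def lsn_time_bucket(now: float, bucket_days: int = 15) -> int:
    """Time since epoch divided by the bucket length (e.g. 15 days)."""
    return int(now // (bucket_days * 86400))

# Same user, same /24: identical hash. Different user: unrelated hash.
assert lsn_subnet_hash(1, "192.0.2.17") == lsn_subnet_hash(1, "192.0.2.200")
assert lsn_subnet_hash(2, "192.0.2.17") != lsn_subnet_hash(1, "192.0.2.17")
```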
Data set size
Table size would probably be ~200MB.
MariaDB [enwiki]> SELECT /*tstarling*/ COUNT(*) FROM (
    SELECT DISTINCT cuc_actor, LEFT(cuc_ip_hex, 6)
    FROM cu_changes
    WHERE cuc_timestamp BETWEEN '20230801000000' AND '20230816000000'
      AND cuc_ip_hex NOT LIKE 'v6%'
  ) AS cu_uniq;
+----------+
| COUNT(*) |
+----------+
|   297359 |
+----------+
1 row in set (1 min 2.226 sec)

MariaDB [enwiki]> SELECT /*tstarling*/ COUNT(*) FROM (
    SELECT DISTINCT cuc_actor, LEFT(cuc_ip_hex, 19)
    FROM cu_changes
    WHERE cuc_timestamp BETWEEN '20230801000000' AND '20230816000000'
      AND cuc_ip_hex LIKE 'v6%'
  ) AS cu_uniq;
+----------+
| COUNT(*) |
+----------+
|   164333 |
+----------+
1 row in set (9.331 sec)
A single 15-day bucket for enwiki, at 18 bytes per row, would be about 8MB. A 90-day expiry time means six such buckets, or about 48MB. Multiply by a small single-digit factor to include non-enwiki accounts.
For a 180-day expiry time, the bucket size could be increased (e.g. doubled to 30 days) so that the storage size stays similar.
For comparison, cu_changes on enwiki is 3435 MB. You can see this proposal as creating a summary table for cu_changes. But the main performance advantage is in having a single central table. We are creating a summary table for ~900 cu_changes tables.
Data flow
LoginNotify would hook RecentChange_save, similarly to CheckUser. It would first check whether the subnet is known using a replica; if not, it would check the master. If the subnet is still unknown, it would INSERT IGNORE a new row in autocommit mode.
On login, the subnet would be checked against a replica.
The SELECT query would probably be fast enough that caching it is not worthwhile. The existing cache could go away, and so could the job.
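The write path described above can be sketched as follows. This uses SQLite as a stand-in for MariaDB (so `INSERT OR IGNORE` replaces `INSERT IGNORE`, and one in-memory connection plays both replica and master); the helper names and the six-bucket expiry window are illustrative, not the extension's API.

```python
import sqlite3

# Single in-memory DB standing in for both the replica and the master.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE loginnotify_seen_net (
        lsn_id INTEGER PRIMARY KEY AUTOINCREMENT,
        lsn_time_bucket INTEGER,
        lsn_user INTEGER,
        lsn_subnet INTEGER,
        UNIQUE (lsn_user, lsn_subnet, lsn_time_bucket)
    )
""")

def is_known(user_id: int, subnet_hash: int, bucket: int, window: int = 6) -> bool:
    """True if the (user, subnet) pair was seen in any unexpired bucket,
    e.g. the last six 15-day buckets for a 90-day expiry."""
    row = db.execute(
        "SELECT 1 FROM loginnotify_seen_net"
        " WHERE lsn_user = ? AND lsn_subnet = ? AND lsn_time_bucket >= ?",
        (user_id, subnet_hash, bucket - window + 1),
    ).fetchone()
    return row is not None

def record_seen(user_id: int, subnet_hash: int, bucket: int) -> None:
    """RecentChange_save path: check first, then insert-or-ignore."""
    if not is_known(user_id, subnet_hash, bucket):
        # The unique index plus INSERT OR IGNORE makes concurrent writers
        # harmless: a racing duplicate is silently dropped, not an error.
        db.execute(
            "INSERT OR IGNORE INTO loginnotify_seen_net"
            " (lsn_time_bucket, lsn_user, lsn_subnet) VALUES (?, ?, ?)",
            (bucket, user_id, subnet_hash),
        )

record_seen(42, 0x1234, 100)
record_seen(42, 0x1234, 100)          # duplicate: ignored, still one row
assert is_known(42, 0x1234, 100)      # known subnet -> no notification
assert not is_known(42, 0x9999, 100)  # unseen subnet -> would notify
```

The login path is just `is_known()` against a replica; pruning old rows reduces to deleting buckets below the expiry cutoff.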
There would be a new job to prune old rows, analogous to PruneCheckUserDataJob.
There would be a cron job maintenance script analogous to CheckUser's purgeOldData.php.
Migration
Development:
- Config variable LoginNotifyDatabase by analogy with CentralAuthDatabase. If null (the default), use the local wiki.
- Add an upgrade hook to create the local table.
- Config variable LoginNotifyUseCheckUser, initially true by default. If true, and data is absent from the new table, queue a job to read data from CheckUser.
Pre-deployment:
- Create the table.
- Configure LoginNotifyDatabase.
After 3 months:
- Set LoginNotifyUseCheckUser to false in production.
After 2 years:
- Deprecate LoginNotifyUseCheckUser and make it imply false.
- Permanently remove CheckUser integration.