
LoginNotify seen subnets table
Closed, ResolvedPublic

Description

LoginNotify uses CheckUser data to determine whether the user is logging in from a known /24 or /64 subnet, that is, a subnet that the user has previously used. It connects to the database of every wiki the user is centrally attached to, and does a query against cu_changes, looking for a cuc_ip prefix. This can take many seconds.

The latency is hidden from the user by deferring notifications to a job. But this has the following problems:

  • There is a cache of the most recent subnet, but a user legitimately moving from one subnet to another will still typically cause a query against CheckUser.

Schema

I propose adding a table like

CREATE TABLE loginnotify_seen_net (
    -- Primary key
    lsn_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    -- Time since epoch divided by e.g. 15 days, rollover after ~2700 years
    lsn_time_bucket SMALLINT UNSIGNED NOT NULL,
    -- globaluser.gu_id or user.user_id (CentralIdLookup)
    lsn_user INT UNSIGNED NOT NULL,
    -- Truncated hash of IP address subnet
    lsn_subnet BIGINT NOT NULL,
    -- Index for checking whether a login is from a known IP, in a time range
    UNIQUE INDEX (lsn_user, lsn_subnet, lsn_time_bucket),
    PRIMARY KEY (lsn_id)
);
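As a sanity check on the schema comment, the bucket arithmetic can be sketched as follows (Python, purely illustrative; the 15-day bucket size is the example value from the comment):

```python
SECONDS_PER_DAY = 86400
BUCKET_SECONDS = 15 * SECONDS_PER_DAY  # example bucket size from the schema comment

def time_bucket(unix_time: int) -> int:
    """The value stored in lsn_time_bucket: time since epoch divided by bucket size."""
    return unix_time // BUCKET_SECONDS

# A SMALLINT UNSIGNED holds 65536 distinct buckets, so rollover takes
# 65536 * 15 days, i.e. roughly 2700 years as the comment says.
rollover_years = 65536 * 15 / 365.25
```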

There would be a table like this in a shared DB, and also in the local wiki DB for non-CentralAuth wikis like officewiki.

The reasons for storing a hash of the subnet instead of the subnet itself are:

  • Shorter than IPv6 hex
  • Random collisions for a 64-bit hash are unlikely. We are hashing a 64-bit prefix into a 64-bit output, so the maximum possible number of collisions globally for a given network is of order 1. Even if a state-level actor with a large IPv6 network could engineer a collision, the payoff would be small.
  • Better privacy story if we decide to extend the expiry time beyond CheckUser's 90 days. LoginNotify's cookie expiry is 180 days which seems reasonable to me. Storing only a hash protects it against casual snooping by admins, and prevents scope creep. We could include the user ID in the hash, so that it can't be used to correlate users.
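For illustration, a hashing scheme along these lines with the user ID mixed in might look as follows. The function name and the choice of SHA-256 are my assumptions, not the extension's actual code:

```python
import hashlib
import ipaddress
import struct

def subnet_hash(ip: str, user_id: int) -> int:
    """Hash a /24 (IPv4) or /64 (IPv6) subnet, keyed by the central user ID,
    truncated to 64 bits for storage in lsn_subnet. Illustrative only."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 64
    net = ipaddress.ip_network(f'{ip}/{prefix}', strict=False)
    data = struct.pack('>Q', user_id) + net.network_address.packed
    return int.from_bytes(hashlib.sha256(data).digest()[:8], 'big')

# Two addresses in the same /24 hash identically for the same user:
assert subnet_hash('198.51.100.7', 42) == subnet_hash('198.51.100.200', 42)
# Mixing in the user ID prevents cross-user correlation of subnets:
assert subnet_hash('198.51.100.7', 42) != subnet_hash('198.51.100.7', 43)
```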

Data set size

Table size would probably be ~200MB.

MariaDB [enwiki]> SELECT /*tstarling*/ COUNT(*) from (SELECT DISTINCT cuc_actor, left(cuc_ip_hex,6) from cu_changes where cuc_timestamp BETWEEN '20230801000000' AND '20230816000000' AND cuc_ip_hex not like 'v6%') as cu_uniq;
+----------+
| COUNT(*) |
+----------+
|   297359 |
+----------+
1 row in set (1 min 2.226 sec)

MariaDB [enwiki]> SELECT /*tstarling*/ COUNT(*) from (SELECT DISTINCT cuc_actor, left(cuc_ip_hex,19) from cu_changes where cuc_timestamp BETWEEN '20230801000000' AND '20230816000000' AND cuc_ip_hex like 'v6%') as cu_uniq;
+----------+
| COUNT(*) |
+----------+
|   164333 |
+----------+
1 row in set (9.331 sec)

A single 15-day bucket for enwiki at 18 bytes per row would be about 8MB. With a 90 day expiry time it would be 48MB. Multiply by a small single-digit factor to include non-enwiki accounts.
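The arithmetic behind those estimates, as a quick sketch (the 18 bytes are the 14-byte unique key plus the 4-byte id; the figures in the text round the per-bucket size down to 8MB first):

```python
# Row counts from the two enwiki queries above (one 15-day bucket)
rows_v4 = 297359
rows_v6 = 164333

# lsn_time_bucket (2) + lsn_user (4) + lsn_subnet (8) + lsn_id (4) = 18 bytes
row_bytes = 2 + 4 + 8 + 4

bucket_mb = (rows_v4 + rows_v6) * row_bytes / 1e6   # about 8.3 MB per bucket
buckets_90d = 90 // 15                              # 6 buckets for a 90-day expiry
total_mb = bucket_mb * buckets_90d                  # about 50 MB
```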

For a 180-day expiry time, you could increase the bucket size so that storage size would be similar.

For comparison, cu_changes on enwiki is 3435 MB. You can see this proposal as creating a summary table for cu_changes. But the main performance advantage is in having a single central table. We are creating a summary table for ~900 cu_changes tables.

Data flow

LoginNotify would hook RecentChange_save similarly to CheckUser. It would check whether the subnet is known using a replica. Then it would check the master. If the subnet is still unknown then, in autocommit mode, it would INSERT IGNORE a new row.
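A minimal sketch of that flow, using SQLite to stay self-contained (SQLite spells INSERT IGNORE as INSERT OR IGNORE, and the replica/master split is collapsed into a single connection here):

```python
import sqlite3

# One connection stands in for both replica and master in this sketch.
db = sqlite3.connect(':memory:')
db.execute('''CREATE TABLE loginnotify_seen_net (
    lsn_id INTEGER PRIMARY KEY AUTOINCREMENT,
    lsn_time_bucket INTEGER NOT NULL,
    lsn_user INTEGER NOT NULL,
    lsn_subnet INTEGER NOT NULL,
    UNIQUE (lsn_user, lsn_subnet, lsn_time_bucket))''')

def record_seen(user: int, subnet_hash: int, bucket: int) -> bool:
    """Return True if this is a new (user, subnet, bucket) sighting."""
    # 1. SELECT against the replica (and, if absent there, the master).
    known = db.execute(
        'SELECT 1 FROM loginnotify_seen_net'
        ' WHERE lsn_user = ? AND lsn_subnet = ? AND lsn_time_bucket = ?',
        (user, subnet_hash, bucket)).fetchone()
    if known:
        return False
    # 2. Still unknown: INSERT OR IGNORE in autocommit mode; the unique
    #    index makes a racing duplicate insert harmless.
    db.execute('INSERT OR IGNORE INTO loginnotify_seen_net'
               ' (lsn_time_bucket, lsn_user, lsn_subnet) VALUES (?, ?, ?)',
               (bucket, user, subnet_hash))
    db.commit()
    return True

assert record_seen(42, 0xDEADBEEF, 1300) is True   # first sighting
assert record_seen(42, 0xDEADBEEF, 1300) is False  # already known
```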

On login, the subnet would be checked against a replica.

The SELECT query would probably be too fast to bother caching. The existing cache could go away. The job could go away too.

There would be a new job to prune old rows, analogous to PruneCheckUserDataJob.

There would be a cron job maintenance script analogous to CheckUser's purgeOldData.php.
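The pruning could look something like the following sketch; the batching, cutoff logic, and constants are assumptions on my part, not the actual job:

```python
import sqlite3

# Illustrative table with just the columns pruning needs.
db = sqlite3.connect(':memory:')
db.execute('''CREATE TABLE loginnotify_seen_net (
    lsn_id INTEGER PRIMARY KEY AUTOINCREMENT,
    lsn_time_bucket INTEGER NOT NULL)''')
db.executemany('INSERT INTO loginnotify_seen_net (lsn_time_bucket) VALUES (?)',
               [(b,) for b in (1295, 1296, 1300, 1301)])

BUCKET_SECONDS = 15 * 86400
EXPIRY_BUCKETS = 6  # 90 days at 15 days per bucket

def prune(now: int, batch_size: int = 1000) -> None:
    """Delete expired rows in batches so each transaction stays small."""
    cutoff = now // BUCKET_SECONDS - EXPIRY_BUCKETS
    while True:
        cur = db.execute(
            'DELETE FROM loginnotify_seen_net WHERE lsn_id IN'
            ' (SELECT lsn_id FROM loginnotify_seen_net'
            '  WHERE lsn_time_bucket < ? LIMIT ?)', (cutoff, batch_size))
        db.commit()
        if cur.rowcount == 0:
            break

prune(now=1302 * BUCKET_SECONDS)  # cutoff is bucket 1296: only the 1295 row goes
```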

Migration

Development:

  • Config variable LoginNotifyDatabase by analogy with CentralAuthDatabase. If null (the default), use the local wiki.
  • Add an upgrade hook to create the local table.
  • Config variable LoginNotifyUseCheckUser, initially true by default. If true, and data is absent from the new table, queue a job to read data from CheckUser.

Pre-deployment:

  • Create the table.
  • Configure LoginNotifyDatabase.

After 3 months:

  • Set LoginNotifyUseCheckUser to false in production.

After 2 years:

  • Deprecate LoginNotifyUseCheckUser and imply false.
  • Permanently remove CheckUser integration.

Event Timeline


The latency is particularly problematic in this case because, if there are a couple hundred milliseconds between the successful login and the outcome of the IP check, the attacker can simply disable login notifications during that time.

After reading https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html I decided to add an auto-incremented primary key, instead of a time index. If you leave off the integer primary key, the secondary index needs to duplicate the 14-byte unique key, so it actually uses more space overall. Purging can be done using the primary key, if you assume it's monotonic in time.

Generally sounds good to me:

  • It must be stored in x1 (in wikishared); centralauth is in s7, but I'd still avoid s7. For private wikis, it's fine to store it in the core db.
  • You could make lsn_id BIGINT to avoid headaches in the future.
  • We have bots that sometimes log in thousands of times a minute (yup...), so I'd be careful around those. Whether that's querying the replica first and not adding if it's already there, or avoiding storing anything for bots, etc., I don't care either way.

You could make lsn_id BIGINT to avoid headaches in the future.

We should have at least 50 years before it runs out. We'll probably discover other problems with the schema before then. The best future-proofing would be to allow an array of tables instead of a single table, each table having its own configuration. There would be one active table and zero or more archive tables. Because old data is discarded, you could switch to a new active table and drop the old table after the expiry time elapses. Maybe I'll file a task for that and we can come back to it in 30 years.

I'm tempted to just change it, but it adds 29% to the row size, and a lot of other unsigned ints are going to overflow before we hit this one.

We have bots that sometimes log in thousands of times a minute (yup...), so I'd be careful around those. Whether that's querying the replica first and not adding if it's already there, or avoiding storing anything for bots, etc., I don't care either way.

I'm querying the replica, and if it's not there, querying the master too. Mostly bots will see their IP address in the replica, but there will be a spike of writes to the table when each time bucket expires, every 15 days at midnight UTC. I put the master SELECT query very close to the INSERT, and the INSERT is done in autocommit mode, so for a frequently-editing bot, the INSERT spike should only last a few milliseconds. The master SELECT spike will last until the transaction is replicated.

When I tested this on my laptop, with replication enabled, the window for an INSERT spike was only 0.5ms.

Change 953749 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/LoginNotify@master] LoginNotify seen subnets table

https://gerrit.wikimedia.org/r/953749

You could make lsn_id BIGINT to avoid headaches in the future.

We should have at least 50 years before it runs out. We'll probably discover other problems with the schema before then. The best future-proofing would be to allow an array of tables instead of a single table, each table having its own configuration. There would be one active table and zero or more archive tables. Because old data is discarded, you could switch to a new active table and drop the old table after the expiry time elapses. Maybe I'll file a task for that and we can come back to it in 30 years.

A bug, incorrect usage of UPSERT and so on could cause this to run out faster, though. This has happened before (T296487: geotags table is wasting a lot of auto_increment values).

I'm tempted to just change it, but it adds 29% to the row size, and a lot of other unsigned ints are going to overflow before we hit this one.

I'm not too strongly against it, at the end of the day, if we hit issues, we can run schema changes on it.

We have bots that sometimes log in thousands of times a minute (yup...), so I'd be careful around those. Whether that's querying the replica first and not adding if it's already there, or avoiding storing anything for bots, etc., I don't care either way.

I'm querying the replica, and if it's not there, querying the master too. Mostly bots will see their IP address in the replica, but there will be a spike of writes to the table when each time bucket expires, every 15 days at midnight UTC. I put the master SELECT query very close to the INSERT, and the INSERT is done in autocommit mode, so for a frequently-editing bot, the INSERT spike should only last a few milliseconds. The master SELECT spike will last until the transaction is replicated.

Sounds good. Maybe add jitter to the bucket? Just thinking out loud; hopefully it wouldn't be too hard. Basically, treat the old stale bucket as still valid at random during the first couple of hours after the switch.

Change 953749 merged by jenkins-bot:

[mediawiki/extensions/LoginNotify@master] LoginNotify seen subnets table

https://gerrit.wikimedia.org/r/953749

@tstarling With the config:

$wgLoginNotifyUseCheckUser = true;
$wgLoginNotifyUseSeenTable = false;

I cannot get it to send a notification for a failed login on a new IP address. I can do so for a known IP address. As that is the default at the moment, should that be fixed?

Should the config variables in https://www.mediawiki.org/wiki/Extension:LoginNotify be updated as part of this ticket or should I raise a new one?

CheckUser had its schema changed recently (see T324907). Depending on the value of $wgCheckUserEventTablesMigrationStage, CheckUser may record login events in the cu_private_event table rather than the cu_changes table. Should LoginNotify be updated to support this in the future, or will it be deprecated by then?

@tstarling With the config:

$wgLoginNotifyUseCheckUser = true;
$wgLoginNotifyUseSeenTable = false;

I cannot get it to send a notification for a failed login on a new IP address. I can do so for a known IP address. As that is the default at the moment, should that be fixed?

If it's actually broken, it should definitely be fixed. The patch with that configuration is already on testwiki and will be going out with the train to the rest of the wikis this week. I'll test it.

Also, we can keep an eye on the production statistics as it rolls out.

I cannot get it to send a notification for a failed login on a new IP address. I can do so for a known IP address. As that is the default at the moment, should that be fixed?

It seems to work for me. Did you confirm that it works before my commits?

A few details to be aware of:

  • You need to use an IP address from a different /24 subnet.
  • You need to remove the cookie. It won't check the table if there is a cookie match.
  • Make sure jobs are run, e.g. use maintenance/runJobs.php.

Check the LoginNotify log channel for information about why it did or didn't send a notification.

Should the config variables in https://www.mediawiki.org/wiki/Extension:LoginNotify be updated as part of this ticket or should I raise a new one?

I'm not planning on updating the wiki.

CheckUser had its schema changed recently (see T324907). Depending on the value of $wgCheckUserEventTablesMigrationStage, CheckUser may record login events in the cu_private_event table rather than the cu_changes table. Should LoginNotify be updated to support this in the future, or will it be deprecated by then?

The obvious solution for that is to complete this task.

I cannot get it to send a notification for a failed login on a new IP address. I can do so for a known IP address. As that is the default at the moment, should that be fixed?

  • Make sure jobs are run, e.g. use maintenance/runJobs.php.

I forgot that mediawiki docker sets $wgJobRunRate to 0 by default, so jobs need to be run manually.

Running the jobs manually, the notification is sent for a new IP subnet.

It is curious though. Notifications for known IP subnets were being sent. Are they not done by a job as well?

It is curious though. Notifications for known IP subnets were being sent. Are they not done by a job as well?

If the IP is known via the cache, the notification is sent immediately, a job is not queued. That's pre-existing behaviour. In the new mode with $wgLoginNotifyUseCheckUser = false, all notifications are sent immediately.

Testing what I take to be the future settings:

$wgLoginNotifyUseCheckUser = false;
$wgLoginNotifyUseSeenTable = true;

I see that IP (v4 and v6) subnets are recorded in the loginnotify_seen_net table.

When failing a login, an email is sent, wording depending on whether it was an IP subnet that the user had already logged in with or not.

Similarly, when the entry in loginnotify_seen_net expires[1] and a failed login is attempted, an email is sent referring to a new IP.

I briefly tested that the new maintenance script removes the appropriate entries from the loginnotify_seen_net table.

As discussed above, I also briefly tested the current settings:

$wgLoginNotifyUseCheckUser = true;
$wgLoginNotifyUseSeenTable = false;

which appeared to provide the same functionality.

I tested with $wgLoginNotifyCookieExpire = 0; so that it would always(?) be checking the database for known IPs.

Test environment: LoginNotify 0.1 (fac8b00) 09:54, 13 September 2023.

Notes:

  1. This is done by setting $wgLoginNotifySeenExpiry and $wgLoginNotifySeenBucketSize to small values. However, you first need to alter the table by running: ALTER TABLE loginnotify_seen_net MODIFY COLUMN lsn_time_bucket BIGINT NOT NULL;. Otherwise, the shortest expiry is about 7-8 hours, which makes sense on a production wiki but makes it harder to test locally.

Change #1229247 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/LoginNotify@master] Drop CheckUser support

https://gerrit.wikimedia.org/r/1229247

Change #1229247 merged by jenkins-bot:

[mediawiki/extensions/LoginNotify@master] Drop CheckUser support

https://gerrit.wikimedia.org/r/1229247