Page MenuHomePhabricator

CentralAuth: Add globaluser.gu_email_normalized column
Open, Needs TriagePublic

Description

Product Safety and Integrity would like to propose adding a new column to the globaluser table – gu_email_normalized, which would contain the user e-mail, but in a normalized form. This field wouldn't be used for sending e-mails to the target user, but instead would serve as basis for future anti-abuse work.

Context

There are popular e-mail providers who are known to apply certain clean-up steps when resolving the recipient's e-mail. An example would be Gmail, where john.doe@gmail.com and johndoe+spam@gmail.com are resolved to the same address.

As of now, using the account's e-mail address as part of the anti-abuse signals is error-prone, as we cannot reliably match "any known form" of the user's e-mail (nor query for all users with the same "canonical" e-mail).

Proposed change

Add gu_email_normalized to globaluser. It's going to store the normalized value of the gu_email field.

We don't have defined yet how the "normalized" for of e-mail is going to look like – the two options that I can think of is: either human-readable text, stripped out of the unneeded characters or the same but hashed (so that it can have a constant length, smaller than the limit for gu_email). We're open for suggestions from DBA on how this can be done in a sustainable way.

The exact size of the field should make collisions unlikely, but at the same time it doesn't have to be human-readable.


Exact implementation of the normalization algorithm is not part of this request (PSI can figure it out, once the DB details are agreed on).

Event Timeline

Sorry. I missed this email cause gmail sent it to the "Promotions" folder. Let me think about it a bit.

So I looked into this. Adding the column is one aspect, you definitely need to add an index for it too. For the index, since you're not looking for similar results or ordering, maybe you can use hash indexes to improve lookup and reduce space.

The table is rather big (36GB) and it will grow faster given temporary accounts and the initiatives next year to increase logged in users. So I am worried about the scalability of this table.

Here is my suggestion: To offset the growth added by the new column, you can do a couple of things to improve things and even more:

1- There are two indexes on the table that don't make sense to me:

KEY `gu_locked` (`gu_name`,`gu_locked`)
KEY `gu_hidden_level` (`gu_name`,`gu_hidden_level`)

We do have a unique index on gu_name already so the extra columns won't provide any extra look up benefits. Only thing I can think of is that they are covering indexes but I don't know if it's actually the case and whether the performance gain (which would be quite small) is really worth the extra index.

2- Once you added the column and the index for email normalized, maybe drop the index on gu_email (KEY gu_email (gu_email)) but make sure any query that looks up based on gu_email, use the normalized version instead.

3- This is a bit of work but very worth it: gu_password_reset_key and gu_password_reset_expiration are basically empty for every row. (81M rows out of 83.6M). If you create a table similar to editcount one and only insert the row when there is actually a reset key, that could remove a lot of nulls and clean up a decent chunk of the table.

4- Is gu_cas_token really needed? Maybe it can be a smallint instead and then simply overflow back to 1 if it gets over the limit. That'll save 160MB which is not much but if it's easy to do, why not.

temporary accounts

Once passwords, emails and are in dedicated tables, temporary account will no longer have footprint on password and email table. They will also have no footprint on token table after 90 days once CentralAuth's resetAuthToken() is modified to remove the token entry (instead of setting it to a random string).

maybe drop the index on gu_email (KEY gu_email (gu_email))

This assumes the way to normalize email never changes. Otherwise it will be awkward.

gu_locked

See T373388: Merge CentralAuth locks into GlobalBlocking.

Note I propose to split several columns to dedicated tables. T391785: Introduce dedicated tables for CentralAuth passwords, emails and tokens

That won't help much in terms of storage. because you need to repeat the user id, so it's actually gonna increase the size not decrease. Whether that can be compensated with not storing rows for temp accounts need be measured (not to mention that if you want to store emails for temp accounts and co, it might immediately nullify any gain).

maybe drop the index on gu_email (KEY gu_email (gu_email))

This assumes the way to normalize email never changes. Otherwise it will be awkward.

I'm asking the index to be dropped, not the column.

make sure any query that looks up based on gu_email, use the normalized version instead.

It will be awkward if we change how "normalized version" is defined in the future

not to mention that if you want to store emails for temp accounts and co, it might immediately nullify any gain

What I mean: temporary accounts should not have rows in the proposed table at all, not even storing null or empty string.

That won't help much in terms of storage. because you need to repeat the user id, so it's actually gonna increase the size not decrease.

But after we split them to new tables, the next step would be moving password, email and token tables to a new cluster (T120484: Store CentralAuth password hashes outside the main database cluster).