
user_properties table bloat
Open, MediumPublic

Description

On enwiki, the user_properties table has about 60M rows for only 20M users. This is remarkable given that user_properties is meant to store only non-default options, precisely to save DB space. The index length is about 2.2 GB, and the data size is about 3.7 GB.

By sampling, the number of user_properties rows per user can be estimated. The problem is very dependent on user_id, and is mostly confined to user_id values less than 10M, i.e. users created before mid-2009.
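For reference, such sampling could be done with something like the following minimal sketch (using plain PDO for brevity rather than MediaWiki's DB layer; only the user_properties table and up_user column come from core, everything else is illustrative):

<?php
// Sketch only: estimate user_properties rows per user in 1M-wide
// user_id bands by probing 1000 evenly spaced IDs per band.
$db = new PDO( 'mysql:host=db-replica;dbname=enwiki', 'reader', 'secret' );
$stmt = $db->prepare( 'SELECT COUNT(*) FROM user_properties WHERE up_user = ?' );

for ( $base = 0; $base < 15000000; $base += 1000000 ) {
    $total = 0;
    for ( $i = 0; $i < 1000; $i++ ) {
        $stmt->execute( [ $base + $i * 1000 ] );
        $total += (int)$stmt->fetchColumn();
    }
    // Nonexistent user_ids contribute 0 rows here; the table below
    // was produced by sampling existing users instead.
    printf( "%8d  %.4f\n", $base, $total / 1000 );
}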

user_id     props/user
0           8.2615
1000000     5.8696
2000000     4.9534
3000000     4.8038
4000000     4.6013
5000000     4.3775
6000000     4.4137
7000000     5.3833
8000000     5.8919
9000000     6.4356
10000000    0.8789
11000000    1.1052
12000000    1.0005
13000000    0.9774
14000000    1.0987

Sampling 1000 users with user_id<10M, we find that the main culprits are:

searchNs-1: 968 users
skin: 964 users
thumbsize: 912 users

75% of the skin rows have an empty string as their value, which causes Skin::newFromKey() to return the default skin, the same as if the row were missing. The rest are mostly "monobook", presumably set manually via the UsabilityInitiative OptIn extension.

"searchNs-1" is a bug, it relates to searching the special namespace, which is not possible. It is "0" in all sampled rows.

"thumbsize" is "3" in all sampled rows, which is not the default, the default is "4" on all WMF wikis other than svwiki. In addition to bloat of the user_properties table, this causes fragmentation of the parser cache. There's no way 91% of users prior to 2009 manually set this value, it must have been set by a bug.

We should remove unnecessary or incorrectly inserted rows, and ensure that this does not happen again (e.g. as a consequence of the resolution of T38316).

Details

Reference
bz52777


Event Timeline


I wonder if dropping all preferences from closed wikis would be some low hanging fruit?

> I wonder if dropping all preferences from closed wikis would be some low hanging fruit?

I'm not sure it'd help with the original issue ("large prod tables slowing things down"), as the preference tables are per-wiki. It'd clean up the "accounts have preferences that we should ignore when analysing what to kill" issue, but that feels like a subset of the overall issue (and people often only look at a few big wikis for those numbers anyway)…

It makes the dataset of "things that need cleanup" smaller, if we just delete everything from those wikis carte blanche.

(Relatedly, I ran some interesting queries re: subtask of T171643, will be following up there later)

enwiki has about 200 million user_properties rows today (for about 40 million users); a newly registered user has 12 rows:

+------------------------------------------+---------------------------------------------------+
| up_property                              | up_value                                          |
+------------------------------------------+---------------------------------------------------+
| VectorSkinVersion                        | 1                                                 |
| echo-subscriptions-email-article-linked  | 1                                                 |
| echo-subscriptions-email-dt-subscription | 1                                                 |
| echo-subscriptions-email-edit-thank      | 1                                                 |
| echo-subscriptions-email-mention         | 1                                                 |
| echo-subscriptions-email-page-review     | 1                                                 |
| echo-subscriptions-web-article-linked    | 1                                                 |
| echo-subscriptions-web-reverted          | 0                                                 |
| popups                                   | 1                                                 |
| rcenhancedfilters-seen-tour              | 1                                                 |
| welcomesurvey-responses                  | {"_group":"NONE","_render_date":"20220126054124"} |
| wlenhancedfilters-seen-tour              | 1                                                 |
+------------------------------------------+---------------------------------------------------+

There are about a quarter million user registrations globally per month. Most of those mean four local user accounts (loginwiki, metawiki, mediawikiwiki + wherever they signed up, which is enwiki about half the time), so about 7 million new user_properties rows on enwiki per year (a 3% growth rate), 15 million on wikis where accounts get autocreated, and 60 million in total.

I'm gathering requirements for this. Can someone involved in GrowthExperiments, e.g. @kostajh, comment on whether it is a hard requirement to vary new user preferences by autocreation status? The user table doesn't have autocreation status, so it's not easy to determine after creation is complete. If it is a requirement, an approximation might be to check whether the local wiki is the CentralAuth home wiki.

If it's just a workaround for T276720 then that can be fixed in another way. But maybe what GrowthExperiments actually needs is a way to store tour completion globally. So if a user goes from wiki to wiki, they will keep getting prompted for a tour, until they do it on any wiki, then it disappears globally.

> I'm gathering requirements for this. Can someone involved in GrowthExperiments, e.g. @kostajh, comment on whether it is a hard requirement to vary new user preferences by autocreation status? […]

So far in GrowthExperiments we have set non-default preferences for some percentage of newly created users, and we have ignored autocreated users.

That has recently changed because we are now giving Growth features to 100% of newly created accounts (still excluding autocreated) on almost all Wikipedias (T301820), with a plan to eventually reach 100% on the rest of the Wikipedias.

We are discussing how to enable Growth features for some subset of existing user accounts in T296702: Scale: enable Growth features for existing accounts which would likely involve setting a non-default option for preferences.

We haven't determined a new strategy for autocreated accounts; that is being discussed in T292090: Research Spike: Enable Growth features for autocreated accounts (if the user already has Growth features).

> If it's just a workaround for T276720 then that can be fixed in another way. But maybe what GrowthExperiments actually needs is a way to store tour completion globally. […]

We mostly haven't worried about tour completion as a user goes from wiki to wiki, because so far, if you have Growth features enabled on e.g. enwiki, you will not have them automatically enabled when you go to eswiki. But it is something we will need a solution to and will discuss in T292090: Research Spike: Enable Growth features for autocreated accounts (if the user already has Growth features).

I'm thinking about this in terms of the cost of deployment of IP masking. It's proposed to have a user and globaluser row for "temporary" accounts, and there will be a lot of temporary accounts. But probably most extensions will want to treat temporary accounts the same as anonymous users, leaving them with default preferences. Delivering welcome banners and the like is probably best done after the user explicitly creates an account.

So I'm not sure we really need this, but I'd still better do a brain dump in case it's needed now or in the future.

Currently several extensions are setting user options from LocalUserCreated. Here's what they are trying to achieve:

  • The default preference value for new users may be a value different from the default for existing users.
  • The default for new users may change when a new version of the extension is deployed.
  • The default for new users may change when a configuration variable changes.
  • New users may be assigned to a random A/B test bucket and then receive different preferences depending on their bucket.
  • Possibly the default should be set based on global newness rather than local account autocreation.
  • The default for new users may later become the default for everyone, or vice versa.

My idea for efficiently achieving those requirements is to have extensions statically declare new user preferences. Have a new table which holds these declarations, say user_property_default:

CREATE TABLE user_property_default (
    upd_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    upd_property VARBINARY(255) NOT NULL,  -- preference name
    upd_user_type INT UNSIGNED NOT NULL,   -- e.g. normal / autocreated / temporary
    upd_min_user INT UNSIGNED NOT NULL,    -- first user_id this default applies to
    upd_min_bucket INT UNSIGNED NOT NULL,  -- first A/B bucket this default applies to
    upd_value BLOB,
    PRIMARY KEY (upd_id),
    UNIQUE KEY (upd_property, upd_user_type, upd_min_user, upd_min_bucket)
);

To figure out the default preference value for a given user, you search for the matching user_property_default row with the highest upd_min_user (and upd_min_bucket) not exceeding the given user's ID and bucket:

SELECT upd_value FROM user_property_default 
WHERE upd_property='$prefname' 
   AND upd_user_type='$my_type'
   AND upd_min_user <= '$my_id'
   AND upd_min_bucket <= '$my_bucket'
ORDER BY upd_min_user DESC, upd_min_bucket DESC
LIMIT 1;

When a new user is created, the configured declaration is compared against the current (highest upd_min_user) value in the database for each preference. If the declaration has changed, a new row is inserted into the database with upd_min_user being the user_id of the user being created.
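In code, that check might look roughly like the following sketch (PDO is used for brevity rather than MediaWiki's DB layer; the function name, the $declared shape and everything except the user_property_default table are hypothetical):

// Sketch: on user creation, materialize any changed declared defaults.
// $declared comes from extension registration, e.g.
// [ 'popups' => [ 'type' => 0, 'bucket' => 0, 'value' => '1' ] ].
function syncDeclaredDefaults( PDO $db, array $declared, int $newUserId ): void {
    foreach ( $declared as $property => $decl ) {
        // Find the currently effective declaration (highest upd_min_user).
        $stmt = $db->prepare(
            'SELECT upd_value FROM user_property_default
             WHERE upd_property = ? AND upd_user_type = ? AND upd_min_bucket = ?
             ORDER BY upd_min_user DESC LIMIT 1'
        );
        $stmt->execute( [ $property, $decl['type'], $decl['bucket'] ] );
        $current = $stmt->fetchColumn();

        if ( $current === false || $current !== $decl['value'] ) {
            // Declaration changed: the new default applies from this user onward.
            $ins = $db->prepare(
                'INSERT INTO user_property_default
                 (upd_property, upd_user_type, upd_min_user, upd_min_bucket, upd_value)
                 VALUES (?, ?, ?, ?, ?)'
            );
            $ins->execute( [ $property, $decl['type'], $newUserId,
                $decl['bucket'], $decl['value'] ] );
        }
    }
}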

The user bucket would just be "hash(user_id) mod 1000" or something similar. Most user_property_default rows would have upd_min_bucket=0 and so would catch all buckets. If you insert a row with upd_min_bucket=990 then it will only take effect for 1% of users. There would always be a fallback with upd_min_bucket=0 so that the search doesn't continue back to previous upd_min_user values.
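A bucket function matching that description could be as simple as this hypothetical sketch (the hash choice is arbitrary, as long as it is stable):

// Map a user ID to one of 1000 stable pseudo-random buckets.
function userBucket( int $userId ): int {
    // md5 keeps buckets uncorrelated with the raw ID; any stable hash works.
    return intval( substr( md5( (string)$userId ), 0, 6 ), 16 ) % 1000;
}

// A row with upd_min_bucket = 990 then matches buckets 990-999, i.e. 1%
// of users; the mandatory upd_min_bucket = 0 fallback row catches the rest.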

upd_user_type would be a small integer to allow the default to depend on autocreate flag and "temporary" status.

Sprinkle in some caching and stampede protection and I think it would mostly work. If the declared default changed depending on the request parameters, it would cause a bit of a mess, and preventing that would come down to code review.

I like the idea. While we are here, I suggest normalizing up_property so it could be shared with upd_property. upd_property wouldn't get repeated much, but normalization would make renaming properties in user_properties possible and easy.
Something like:

CREATE TABLE user_property_type (
    upt_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    upt_property VARBINARY(255) NOT NULL,
    PRIMARY KEY (upt_id),
    UNIQUE KEY (upt_property)
);

> I like the idea. While we are here, I suggest normalizing up_property so it could be shared with upd_property. […]

So this would be something like the NameTableStore system? Makes sense to me, and it can be done separately, before the other refactoring. Perhaps create a separate ticket for that?
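For reference, a rough sketch of how core's NameTableStore could be wired up for such a table (the wiring below is an assumption for illustration, not existing code, and the constructor arguments are abbreviated):

use MediaWiki\Storage\NameTableStore;

// Hypothetical wiring for the proposed user_property_type table.
$store = new NameTableStore(
    $loadBalancer,   // Wikimedia\Rdbms\ILoadBalancer
    $wanCache,       // WANObjectCache
    $logger,         // Psr\Log\LoggerInterface
    'user_property_type', 'upt_id', 'upt_property'
);

$id = $store->acquireId( 'skin' ); // inserts the row if it doesn't exist yet
$name = $store->getName( $id );    // back to 'skin'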

Yes, but I don't have time to do it (currently doing the actor migration and templatelinks normalization). Batching the schema changes would make things easier, but I don't know if @tstarling would want the extra work.

I talked to @Tgr about this today and I actually came up with a counter-proposal to the user_property_default table which I think is a bit more flexible.

We have done a similar thing with per-page A/B testing. You feed the page ID to a hash function (to make sure the bucketing is random) and then feed that to a bucketing system. You can even change the bucketing later without much trouble: say "I want the first bucket" and later "the first and second buckets".

In the page-ID-based A/B testing system that was built, the core is basically this:

	public function getPageRandom( int $pageId ): float {
		$random = intval( substr( md5( (string)$pageId ), 0, 6 ), 16 ) / 16777216;
		return round( $random, 3 );
	}

We can use something similar, but instead of feeding in the pageId, you feed (string)$userId . 'name-of-experiment' to the md5 function. That can easily be computed on the fly (we don't need cryptographically secure randomness here), takes up no storage, and is rather flexible, because you can just write code for whatever condition you need, for example enabling an A/B test only for users who registered after a certain time.
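Concretely, that would be something like the following (adapting the snippet above; getUserRandom is a hypothetical name):

	public function getUserRandom( int $userId, string $experiment ): float {
		// Same construction as getPageRandom(), but salted with the
		// experiment name so different experiments bucket independently.
		$random = intval( substr( md5( $userId . $experiment ), 0, 6 ), 16 ) / 16777216;
		return round( $random, 3 );
	}

	// e.g. put half of all users in the treatment group:
	// $inTreatment = $this->getUserRandom( $userId, 'my-experiment-2022' ) < 0.5;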

There are several use cases which currently contribute to user_properties row proliferation, and for most of them user_property_default would work well (e.g. changing the default for new users without affecting old users).

The problem with A/B tests is that you can have several of them running at the same time, and often it's important that the buckets are independent of each other, or correlated in some specific way. Simple modulo-based bucketing will make them correlated in a way that's probably unhelpful. Passing the user ID + pseudo-property-name through md5 or some other pseudo-random function makes the bucketing properties independent, which is better but still not always the right thing: when testing multiple elements of the same feature in parallel, you can easily have rules like "experiment 1 has buckets A, B, C, experiment 2 has buckets X, Y, Z, users should be uniformly distributed except users in bucket A should always be in bucket X". I guess you can work around those in the code, but it will make it less intuitive.

The other problem with pseudorandom bucketing is that you can't batch select users in some given bucket, which is sometimes needed (e.g. right now we need to export users from a certain experimental bucket into a mass mailer for an experiment with email prompts). I guess you can calculate md5 in SQL, but it will be inefficient for a large wiki.

> My idea for efficiently achieving those requirements is to have extensions statically declare new user preferences. Have a new table which holds these declarations, say user_property_default

I wonder if, instead of the user ID, the registration timestamp could be used. That way, the segment definitions could usually be identical between wikis, so the whole thing could be handled in configuration instead of a database table. That would make it easier to inspect what groups a feature is currently enabled for, and changing the default would be a simple config change, instead of running a maintenance script on every wiki.
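For example, something like this (a purely hypothetical configuration shape, not an existing MediaWiki setting):

// Hypothetical: conditional defaults keyed on registration timestamp.
$wgConditionalUserDefaults = [
	'popups' => [
		// [ value, applies to accounts registered at/after this timestamp ]
		[ '0', null ],              // baseline for all older accounts
		[ '1', '20220101000000' ],  // new default for accounts since 2022
	],
];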

In T54777#7856746, @Tgr wrote:

> There are several use cases which currently contribute to user_properties row proliferation, and for most of them user_property_default would work well (e.g. changing the default for new users without affecting old users).

> The problem with A/B tests is that you can have several of them running at the same time, and often it's important that the buckets are independent of each other, or correlated in some specific way. […]

You can give them a shared hash input, e.g. user_id . name of experiment A . name of experiment B (or simply just experiment A's name), and derive both bucket assignments from it.

> The other problem with pseudorandom bucketing is that you can't batch select users in some given bucket, which is sometimes needed. […]

For this one, the simplest solution would be to iterate over user ID ranges in the code, calculate the md5 value there, and pick the user IDs you want to send the mass email to. The whole point of this idea is to decouple storage from random bucketing.

I do understand you might end up with some edge cases where it would be hard or impossible to use md5 bucketing, but this is a trade-off between storage and computation and I honestly think it's worth it.

> The other problem with pseudorandom bucketing is that you can't batch select users in some given bucket, which is sometimes needed. […]

> For this one, the simplest solution would be to iterate over user ID ranges in the code, calculate the md5 value there, and pick the user IDs you want to send the mass email to. […]

To iterate on that: you can take the highest and lowest user IDs you want to reach, split that range into batches of 1000 (or 10k) and, without querying the DB at all, calculate the bucketing one by one ($userId++), pick the user IDs that need the email, and then run a query with user_id IN (list) for each batch.
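A sketch of that batching (a hypothetical helper, reusing the md5 construction above):

// Walk a user_id range and return the IDs falling into [$low, $high)
// of the experiment's pseudo-random space, without touching the DB.
function bucketedIdsInRange( int $minId, int $maxId, string $experiment,
	float $low, float $high ): array
{
	$ids = [];
	for ( $userId = $minId; $userId <= $maxId; $userId++ ) {
		$r = intval( substr( md5( $userId . $experiment ), 0, 6 ), 16 ) / 16777216;
		if ( $r >= $low && $r < $high ) {
			$ids[] = $userId;
		}
	}
	return $ids; // then query: ... WHERE user_id IN ( these ), batch by batch
}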

> But maybe what GrowthExperiments actually needs is a way to store tour completion globally. So if a user goes from wiki to wiki, they will keep getting prompted for a tour, until they do it on any wiki, then it disappears globally.

We currently have a bunch of tours where the "tour not seen yet" flag (non-default preference value) is set on LocalUserCreated, then deleted when the user sees the tour. We should flip that, so that tour flags are not stored for temporary accounts.
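The flipped version could look roughly like this (a sketch; UserOptionsManager is the real core service, but the option name and the surrounding wiring are illustrative):

use MediaWiki\MediaWikiServices;

// Nothing is written at account creation; the default for the
// hypothetical 'tour-foo-completed' option is 0 ("show the tour").
// Only once the user actually completes the tour is a row stored:
$userOptionsManager = MediaWikiServices::getInstance()->getUserOptionsManager();
$userOptionsManager->setOption( $user, 'tour-foo-completed', 1 );
$userOptionsManager->saveOptions( $user );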

We wanted to look into using GlobalPreferences for tour flags, but it's not a priority. Currently tours just aren't shown on wikis where the user is autocreated (which also means no user properties spam on auto-registration wikis).

@tstarling do you have a vague idea of when the core changes you indicated in T54777#7724456 would happen? Updating extensions which deal with lots of user preferences (like Echo or GrowthExperiments) would be a nontrivial amount of followup work, so I'm trying to figure out when to reserve some time for that.

Copied Tim's idea from T54777#7724456 to T321527: Support conditional defaults for user properties so it's easier to reference. Besides reducing user_property bloat it would also make progressive rollouts of features simpler.

Echo notification user properties have passed 100M rows now on enwiki alone. It's because Hooks::getNewUserPreferenceOverrides() in the Echo extension adds five rows for each new user. Please stop this.

I dropped 4M rows on enwiki belonging to the VectorSkinVersion up_property; the code was removed but not the rows. I haven't looked at other wikis though.

Mentioned in SAL (#wikimedia-operations) [2024-12-12T14:30:08Z] <Amir1> ladsgroup@mwmaint2002:~$ foreachwikiindblist all userOptions.php --delete VectorSkinVersion (T54777)