Page MenuHomePhabricator

Clean up watchlist and user properties of users if they don't log in for certain time
Open, Needs TriagePublic

Description

Currently, we have problems with overly large watchlist and user_properties, and it's adding non-negligible cost to our maintenance. A rather low-hanging fruit could be to simply purge these rows if the users don't log in or any other action such as editing, saving preferences, watching a page, etc. (tracked via user_touched) after a certain time.

My proposal goes as follows:

  • If the user hasn't logged-in (or any other action) for two years. Remove some user_properties rows such as rememberpassword, growthexperiments-tour-* (aka the tour has been seen and don't need to be repeated), etc.
  • After five years, all user_properties rows should be deleted
  • After ten years, their watchlist can be purged. We can send a reminder to the user via email one month before deletion of watchlist, but I don't think it's needed for user_properties purge.

This is a very common practice in the industry. Storage costs us real donor money and if the users have moved on, passed away, been banned, etc. we shouldn't be holding on to their settings or their watchlist forever.

For example. Even going with almost six years. We have this:

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select count(*) from user_properties;
+-----------+
| count(*)  |
+-----------+
| 265352966 |
+-----------+
1 row in set (5 min 38.219 sec)

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select count(*) from user_properties join user on user_id = up_user where user_touched < '20200000000000';
+-----------+
| count(*)  |
+-----------+
| 149216973 |
+-----------+
1 row in set (3 min 3.951 sec)

It's 56% of enwiki's user properties table.

See also:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Seems like a good idea, although I'd limit it to users with <100 edits or something like that. (The user properties for sure. For watchlist maybe entries for inactive users with lots of edits amount to a lot of rows.)

One thing to be mindful of is that you can use an account without logging in (OAuth, Pywikibot perma-cookies etc) and if it's used specifically for keeping a watchlist then there will be no logged actions either. Seems unlikely in practice; will become more likely if mobile apps switch to OAuth (which is being discussed).

Seems like a good idea, although I'd limit it to users with <100 edits or something like that. (The user properties for sure. For watchlist maybe entries for inactive users with lots of edits amount to a lot of rows.)

Yeah, for user properties, starting with users without any edits and then ramping it up to 100 edits max sounds like a good compromise. I do agree also watchlist should not have this limit specially because we want Special:UnwatchedPages and so on to be reflecting the reality. If someone is not checking their watchlist, their watchlist shouldn't hide those pages.

One thing to be mindful of is that you can use an account without logging in (OAuth, Pywikibot perma-cookies etc) and if it's used specifically for keeping a watchlist then there will be no logged actions either. Seems unlikely in practice; will become more likely if mobile apps switch to OAuth (which is being discussed).

Can we make OAuth and so on to bump user_touched after a hand-shake? It shouldn't be too hard to do. Also many other actions still bump touched value (saving preferences for example) so we have other layers of protections too.

I do agree also watchlist should not have this limit specially because we want Special:UnwatchedPages and so on to be reflecting the reality.

Although for that purpose specifically, you could also just dynamically filter out users by user_touched. I think FlaggedRevs does that for its pseudo-watchlist of unreviewed changes.

Can we make OAuth and so on to bump user_touched after a hand-shake?

We could try but 1) some OAuth tokens are valid forever without the need for further handshakes 2) it could backfire if it turns out that some broken client does handshakes on every request and is also doing parallel requests so you get lock contention on the user row (granted, it doesn't seem likely).

Also many other actions still bump touched value (saving preferences for example) so we have other layers of protections too.

For real users, yes. For an API client which is used to provide an alternative reading experience, I think it would be normal to not hit anything like that.

(I guess if you use it actively, you'd at least add items to your watchlist periodically. Does that bump user_touched? Or can we use wl_notificationtimestamp to somehow determine if the user is still using their watchlist?)

I guess we could just bump user_touched on every authenticated request if it's sufficiently old.

In T258098#6982046 I propose to introduce a gravestone table for watchlist (and may also apply for user_property). This is not intended to be a permanent long-term solution, and I do not oppose removing user_property and watchlist for inactive users

Are there any timelines for this change? Asking for Tech News.

I'm planning to start the clean up next week (I will start with echo and growth experiment rows that are the biggest offenders). Something like this would be great for tech news:

If a user hasn't logged in to their account for more than five years and has below 100 edits, their preferences will be removed and default preferences will be used if they log back in. This is to improve the storage of user data in our databases.

Change #1195200 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Introduce purgeUserOptions maint script

https://gerrit.wikimedia.org/r/1195200

Bugreporter2 renamed this task from Clean up watchlist and user properties of users if they don't login for certain time to Clean up watchlist and user properties of users if they don't log in for certain time.Oct 10 2025, 1:36 PM
Bugreporter2 updated the task description. (Show Details)
Bugreporter2 awarded a token.

Is there a chance of a user preference row containing something unrecoverable (a secret key or such)? Should we maybe send emails to users who will be affected?

Is there a chance of a user preference row containing something unrecoverable (a secret key or such)?

The only thing could be the watchlist rss key but I will double check.

Should we maybe send emails to users who will be affected?

For watchlist sure. But the rows added are mostly software generated stuff (with no actual knowledge of the users of what they mean and so on) and the number of users is way too large who mostly won't care.

Change #1195200 merged by jenkins-bot:

[mediawiki/core@master] Introduce purgeUserOptions maint script

https://gerrit.wikimedia.org/r/1195200

Mentioned in SAL (#wikimedia-operations) [2025-10-14T10:04:21Z] <Amir1> mwscript-k8s --follow --dblist=group0 -- purgeUserOptions.php (T406724)

In mediawikiwiki the number of rows of user_properties table has gone from 10,630,380 to 2,687,779. It's a bit of an outlier since I assume we created users automatically on mediawiki (at least for a period of time).

Top offenders before:

+------------------------------------------+----------+
| up_property                              | count(*) |
+------------------------------------------+----------+
| echo-subscriptions-web-wikibase-action   |  4304647 |
| echo-subscriptions-email-edit-thank      |   688500 |
| echo-subscriptions-web-article-linked    |   583628 |
| echo-subscriptions-email-mention         |   582398 |
| echo-subscriptions-web-reverted          |   579627 |
| echo-subscriptions-email-article-linked  |   576689 |
| uls-preferences                          |   446863 |
| echo-subscriptions-email-dt-subscription |   243959 |
| discussiontools-autotopicsub             |   218459 |
| visualeditor-hidebetawelcome             |   191623 |
| language                                 |   181336 |
| skin                                     |   180879 |
| thumbsize                                |   168195 |
| watchcreations                           |   138974 |
| watchlisttoken                           |   105089 |

After:

| echo-subscriptions-email-edit-thank      |   315817 |
| echo-subscriptions-email-dt-subscription |   243959 |
| discussiontools-autotopicsub             |   218459 |
| echo-subscriptions-web-article-linked    |   209977 |
| echo-subscriptions-email-mention         |   209640 |
| echo-subscriptions-web-reverted          |   208726 |
| echo-subscriptions-email-article-linked  |   207672 |
| uls-preferences                          |   168745 |
| visualeditor-hidebetawelcome             |    78966 |
| language                                 |    53802 |
| visualeditor-editor                      |    38905 |
| rcfilters-wl-collapsed                   |    37773 |
| rcfilters-wl-saved-queries               |    36069 |
| echo-subscriptions-email-edit-user-talk  |    35193 |
| mobile-editor                            |    18788 |
| visualeditor-hideusered                  |    16546 |
| usecodemirror-colorblind                 |    16351 |
| VectorSidebarVisible                     |    14664 |
| realme-urls                              |    12259 |
| visualeditor-diffmode-visual             |    10262 |

Mentioned in SAL (#wikimedia-operations) [2025-10-15T08:59:50Z] <Amir1> mwscript-k8s -- purgeUserOptions.php --wiki=loginwiki (T406724)

Mentioned in SAL (#wikimedia-operations) [2025-10-15T08:59:50Z] <Amir1> mwscript-k8s -- purgeUserOptions.php --wiki=loginwiki (T406724)

FTR: Number of rows in user_properties of loginwiki has gone from 10,608,436 to 3,820,137. The ones left:

mysql:research@dbstore1007.eqiad.wmnet [loginwiki]> select up_property, count(*) from user_properties group by up_property order by count(*) desc limit 10;
+-----------------------------------------+----------+
| up_property                             | count(*) |
+-----------------------------------------+----------+
| rememberpassword                        |  2784929 |
| echo-subscriptions-email-edit-user-talk |   850767 |
| cx-entrypoint-fd-status                 |    39101 |
| gender                                  |     5196 |
| language                                |     4566 |
| timecorrection                          |     3237 |
| cx_campaign_newarticle_shown            |     2996 |
| date                                    |     2777 |
| compact-language-links                  |     2651 |
| multimediaviewer-enable                 |     2504 |
+-----------------------------------------+----------+
10 rows in set (2.472 sec)

Is it possible someone is logged in forever so will never edit user touched but be active with using their watchlist?

Is it possible someone is logged in forever so will never edit user touched but be active with using their watchlist?

Nope, cookies expires after a year not to mention that due to unpredictable issues with session store sometimes everyone gets logged out (I wouldn't say it happens every month, but every half year or so. I remember T390514 for example)

You can manipulate cookie expiry, it's not enforced on the server side. Some bots do that. The session store has a one-day expiry but you don't need a valid session in the store if you have a valid user token cookie.

I assume we created users automatically on mediawiki (at least for a period of time).

Yeah, back when it was the central wiki for OAuth.

You can manipulate cookie expiry, it's not enforced on the server side. Some bots do that. The session store has a one-day expiry but you don't need a valid session in the store if you have a valid user token cookie.

Do you think it'd matter in this case? I don't think so. It's edge case of edge cases.

Someone could use an account for years just as a database for a watchlist and we wouldn't notice because user_touched wouldn't necessarily be updated. It's pretty unlikely though.

Ladsgroup moved this task from Triage to In progress on the DBA board.

When is this change happening @Ladsgroup?

Next week.

I was under the impression that user_touched was not updated on login. Maybe I'm mistaken but a quick look through the code I don't see anything that would set it on login.

Also if a user regularly logs in and edits on say enwiki, but has preferences on other wikis, is the plan to delete it on the other wikis that they don't regularly log in to?

I was under the impression that user_touched was not updated on login. Maybe I'm mistaken but a quick look through the code I don't see anything that would set it on login.

SpecialUserLogin::successfulAction() calls $user->touch(); which bumps the touch timestamp on WAN but I'm not seeing that being persistent on the db. Let me dig around this a bit.

Also if a user regularly logs in and edits on say enwiki, but has preferences on other wikis, is the plan to delete it on the other wikis that they don't regularly log in to?

Yes. They should global preferences but also no matter what, they get logged out after a while and would have to log back in (at the time scale we are talking about)

I do agree also watchlist should not have this limit specially because we want Special:UnwatchedPages and so on to be reflecting the reality.

Although for that purpose specifically, you could also just dynamically filter out users by user_touched. I think FlaggedRevs does that for its pseudo-watchlist of unreviewed changes.

This was removed in b9f54bb1e28a943849eae92e83688a8bbcdbc522 since login doesn't bump that field anymore after https://gerrit.wikimedia.org/r/c/mediawiki/core/+/202659 .

I was under the impression that user_touched was not updated on login. Maybe I'm mistaken but a quick look through the code I don't see anything that would set it on login.

Right.

Maybe using the loginnotify_seen_net table would work for this, from LoginNotify? AFAIK, LoginNotifySeenBucketSize is 8 days, which is the "granularity" of the recency of login "computers" for a user ID. Seems possible useful. The maintenance script is in core though. As a side note, it looks like LoginNotify is only enabled for Echo wikis.

cookies expires after a year

SessionBackend::renew() refreshes the cookies when there's less than half the lifetime to go, so as long as you visit often enough they'll never actually expire.

Generally speaking, I abhor when websites, forums, or shops delete user accounts after some period of inactivity.

I don't have many examples of such websites, and contrary to what OP stated, I would tend to say that it is an uncommon practice.

Rather than the purge itself (which I may actually agree with), the problem is the short length of time—usually only one or two years. To me, that's definitely too short. We may have hundreds of accounts, go on a trip for months, and it results in a mental burden of accounts to maintain. On several occasions, I've already gone back to some forums, shops, etc. after many years.


Regarding this ticket specifically, I can understand the need for this kind of database cleanup (even if WMF spends only about 1–2% of its budget on servers…), and the thresholds you've defined seem conservative enough.

The only data deletion that could really be harmful to the user is the watchlist deletion, and you've set it to ten years—which is the kind of duration I think is adequate.

I would just suggest sending several email reminders, not only one.

I do agree also watchlist should not have this limit specially because we want Special:UnwatchedPages and so on to be reflecting the reality.

Although for that purpose specifically, you could also just dynamically filter out users by user_touched. I think FlaggedRevs does that for its pseudo-watchlist of unreviewed changes.

This was removed in b9f54bb1e28a943849eae92e83688a8bbcdbc522 since login doesn't bump that field anymore after https://gerrit.wikimedia.org/r/c/mediawiki/core/+/202659 .

That actually gave me the idea that I can delete anything before 2015 since until that time logins bumped the value. It'll be a much smaller clean up but a decent one nonetheless. I will see after this, what we can do to gather the info of when the users have logged in last.

cookies expires after a year

SessionBackend::renew() refreshes the cookies when there's less than half the lifetime to go, so as long as you visit often enough they'll never actually expire.

In theory, yes. In reality and the timescale we are talking about, no. For example, as I said, we have had incidents with session storage in which all sessions have been lost and people had to log back in.

Mentioned in SAL (#wikimedia-operations) [2025-10-22T17:53:17Z] <Amir1> mwscript-k8s --dblist=small --follow -- purgeUserOptions.php --login-age 11 (T406724)

Is there a chance of a user preference row containing something unrecoverable (a secret key or such)? Should we maybe send emails to users who will be affected?

From gadgets?

I'm planning to start the clean up next week (I will start with echo and growth experiment rows that are the biggest offenders). Something like this would be great for tech news:

If a user hasn't logged in to their account for more than five years and has below 100 edits, their preferences will be removed and default preferences will be used if they log back in. This is to improve the storage of user data in our databases.

Not mentioning watchlists.

I'm planning to start the clean up next week (I will start with echo and growth experiment rows that are the biggest offenders). Something like this would be great for tech news:

If a user hasn't logged in to their account for more than five years and has below 100 edits, their preferences will be removed and default preferences will be used if they log back in. This is to improve the storage of user data in our databases.

Not mentioning watchlists.

We are not cleaning up watchlist yet. We need to built a notification system first.

In theory, yes. In reality and the timescale we are talking about, no. For example, as I said, we have had incidents with session storage in which all sessions have been lost and people had to log back in.

As was already pointed out to you, loss of the session does not force a re-login if you have the "remember me" token cookie.

Change #1199878 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] user: Update user_touched on db on login if it hasn't been updated

https://gerrit.wikimedia.org/r/1199878

In theory, yes. In reality and the timescale we are talking about, no. For example, as I said, we have had incidents with session storage in which all sessions have been lost and people had to log back in.

As was already pointed out to you, loss of the session does not force a re-login if you have the "remember me" token cookie.

Let me rephrase, we have had incidents that everyone got logged out as result and had to log back in. I clearly remember several. If you disagree, please bring someone who didn't need to login in the past five years at all and even if it's possible, it's so unlikely and improbable that it won't make a difference.

Change #1199878 merged by jenkins-bot:

[mediawiki/core@master] user: Update user_touched on db on login if it hasn't been updated

https://gerrit.wikimedia.org/r/1199878

Mentioned in SAL (#wikimedia-operations) [2025-11-10T13:14:56Z] <Amir1> mwscript-k8s --dblist=medium --follow -- purgeUserOptions.php --login-age 15 (T406724)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T14:25:47Z] <Amir1> mwscript-k8s --dblist=large -- purgeUserOptions.php --login-age 15 (T406724)

Mentioned in SAL (#wikimedia-operations) [2025-12-03T06:56:46Z] <Amir1> ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 popups (T406724)

Mentioned in SAL (#wikimedia-operations) [2025-12-03T07:01:07Z] <Amir1> ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 rememberpassword (T406724)

Change #1217226 had a related patch set uploaded (by E75ti; author: E75ti):

[mediawiki/core@master] Avoid empty ID range scans in purgeUserOptions

https://gerrit.wikimedia.org/r/1217226

Mentioned in SAL (#wikimedia-operations) [2026-01-05T04:07:37Z] <Amir1> mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 echo-subscriptions-web-article-linked (T406724)

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select up_property, count(*) from user_properties join user on user_id = up_user where user_touched < '20150000000000' group by up_property order by count(*) desc
 limit 20;
+-----------------------------------------+----------+
| up_property                             | count(*) |
+-----------------------------------------+----------+
| echo-subscriptions-web-article-linked   |  222197x |
| echo-subscriptions-email-page-review    |  221986x |
| echo-subscriptions-email-mention        |  221979x |
| echo-subscriptions-email-article-linked |  221975x |
| echo-subscriptions-web-reverted         |  221802x |
| echo-subscriptions-email-edit-thank     |  214986x |
| watchlisttoken                          |   83423x |
| timecorrection                          |   37871x |
| gettingstarted-task-toolbar-show-intro  |   31711x |
| thumbsize                               |   20651x |
| gender                                  |   19651x |
| language                                |   15248x |
| uls-preferences                         |   13006x |
| date                                    |   11725x |
| nickname                                |   10401x |
| skin                                    |   10182x |
| watchcreations                          |    8663x |
| mfWatchlistView                         |    6783x |
| fancysig                                |    6423x |
| underline                               |    4986x |
+-----------------------------------------+----------+
20 rows in set (2 min 51.755 sec)

Mentioned in SAL (#wikimedia-operations) [2026-01-06T11:58:11Z] <Amir1> ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 echo-subscriptions-email-page-review (T406724)

Mentioned in SAL (#wikimedia-operations) [2026-01-07T13:15:34Z] <Amir1> mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 echo-subscriptions-email-mention (T406724)

Mentioned in SAL (#wikimedia-operations) [2026-01-22T14:59:16Z] <Amir1> mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 5 thumbsize (T406724)