Page MenuHomePhabricator

[RFC] Devise plan for a cross-wiki watchlist back-end
Closed, ResolvedPublic5 Estimated Story Points

Description

One of the most requested features in the 2015 Community Wishlist Survey was a cross-wiki watchlist. In order to implement a cross-wiki watchlist that is reasonably performant we will need to build a back-end data infrastructure to support it. There are many options for such an infrastructure as well as options for the actual feature implementation.

Ideally, we would like to support the ability to view watchlist items of all cluster wikis from any watchlist (i.e. truly global watchlists). This would likely require creating a centralized table for storing all watchlist items from all wikis (as well as a table for all watchlist item properties once T100508 is implemented), similar to CentralAuth. In addition, we would either need to create a centralized table of all recent changes for all wikis, or support doing separate queries for each wiki that the user has watchlist items on (which could potentially be hundreds of wikis).

Questions we would like to answer:

  • Is keeping a central table of all recent changes for all wikis a doable option?
  • Is doing separate queries for each wiki a better or worse idea?
  • Are there ways that we could mitigate the impact of either of those options?
  • Are there other options that we haven't even thought about yet?

2016-07-20 notes:

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This probably makes more sense as a technical RFC than a task. Retitling...

kaldari renamed this task from Develop a technical implementation plan for a global watchlist (back-end only) to [RFC] Technical implementation plan for a global watchlist (back-end only).Jul 12 2016, 10:05 PM

However, be aware of the work others are doing on watchlist stuff (e.g. T124752) or you'll wind up undoing or redoing their work. And don't forget that watchlists still need to work for non-global wikis.

+1, and note that (IIUC) T124752 is aiming to use a solution that resolves (or is a step towards) the underlying problem, aka T3492: Multiple watchlists / T1352: RFC: Support for user-specific page lists.

@Anomie: I wonder how painful your second suggestion would actually be. Most people will just have watchlist items on their home wiki, Commons, and maybe 1 or 2 other wikis. I imagine very few people will have items on more than 10 wikis. [...]

Would it be possible to get stats out of Crosswatch, to examine how the long-tail of power-users utilize that tool? I would guess that the SWMT members and Stewards, and perhaps Commonists and Wikidatans, are the users with the most watchlists on diverse wikis, but it would be useful to get proof and numbers.

+1, and note that (IIUC) T124752 is aiming to use a solution that resolves (or is a step towards) the underlying problem, aka T3492: Multiple watchlists / T1352: RFC: Support for user-specific page lists.

My understanding is that they will probably be implementing a watchlist_props table, so we may need a global watchlist_props table as well.

@Anomie: I wonder how painful your second suggestion would actually be. Most people will just have watchlist items on their home wiki, Commons, and maybe 1 or 2 other wikis. I imagine very few people will have items on more than 10 wikis.

Then someone comes along and watches pages on every wiki on a bunch of sockpuppet accounts and reloads their watchlists constantly.

Also, wouldn't the global_watchlist table need to mirrored to all the shards for either solution? Even if you have a global_recentchanges table to join against locally, you still need to be able to do that from any wiki on any shard, right? Not just your home wiki. How is this handled with CentralAuth?

When something needs to access CentralAuth, it connects to the one shard with the centralauth database. Other global features generally do the same thing, either with a special database or connecting to e.g. metawiki's db.

When something needs to access CentralAuth, it connects to the one shard with the centralauth database. Other global features generally do the same thing, either with a special database or connecting to e.g. metawiki's db.

@Anomie: So does that mean that the CentralAuth tables are never joined against other wiki-specific tables (but are always queried separately)?

As far as I know, that is correct.

Either way it sounds pretty painful, and could use input from @jcrespo as our DBA.

I agree. "People does not use more than 10 wikis" may be true in 99.9% of the cases is the same as saying "a template will not be used more than 1 million times". And then you have the exceptions that literally bring down the site :-P (this is not speculation, this has happened already many times) because they were the exception and we didn't think of them on code.

This is not a "do not do this", this is a warning about 2 problems: a multi-wiki implementation (second suggestion) would require to potentially connect to 800 wikis per user. Even if you only connected to the wikis you have locally-attached. Probably not what we want. The second issue I want to bring up is that integrating features are nice for the end user when they work, but separate per-wiki features have an advantage, which is higher High Availability. There is probably always an outage going on Wikimedia sites, but as you are only a user of a few wikis and a few features, users only notice 1% of the issues. Whenever you implement such a feature, make sure you do not tightly couple the features- look at the recent centralauth issues (which by definition have to be central)- they have a large impact on all users of all wikis.

If I were to implement something like this, I would suggest some kind of opt-in extension, (let's say, on meta), that caches the items from the lists from all wikis in a non-synchronous way (jobs) -but that may create more problems that it solves. I do not have a good suggestion right now, but please have in mind the 2 things I mentioned before. The 2 wikis that tend to generate more "problems" right now are Commons and Wikidata, and that is because its "centralizing" nature.

DannyH renamed this task from [RFC] Technical implementation plan for a global watchlist (back-end only) to [RFC] Technical implementation plan for a cross-wiki watchlist (back-end only).Jul 13 2016, 6:14 PM
DannyH updated the task description. (Show Details)

The proposals and votes actually say "cross-wiki", not "global" -- I just checked, and nobody expressed a desire for an 800-wiki watchlist. One person mentions 5 wikis, another says 3 or 4, and a couple people refer to their home wiki plus Commons.

So we could build a watchlist that caps at 10 wikis. You choose which wikis that you want to include, using an interface that's on the page. If you want to add an 11th wiki, you have to take another one out.

I'm sure that the stewards will be disappointed, because they could really use a truly global watchlist. But if the choice is a more limited 10-wiki list vs not building it at all, then we should build the 10-wiki version.

Krinkle renamed this task from [RFC] Technical implementation plan for a cross-wiki watchlist (back-end only) to [RFC] Devise plan for a cross-wiki watchlist back-end.Jul 13 2016, 8:45 PM

So we could build a watchlist that caps at 10 wikis. You choose which wikis that you want to include, using an interface that's on the page. If you want to add an 11th wiki, you have to take another one out.

This is a really nice way of thinking about this. Expanding on that line of thought, how's this for a possible rollout strategy?

  • Version 0.1 - allow users on some smaller wiki a "list" of options, which is basically "commons"
  • Version 0.2 - allow users on many wikis to add "commons" to their list
  • Version 0.3 - allow users on those same wikis to add both commons and meta to their watchlist
  • etc...

The basic idea here is have all of the following that we grow (in parallel?):

  • Watchlist wikis: List of wikis where users can watch pages on target wikis
  • Target wikis: List of wikis that can host target pages for crosswiki watchlists
  • Per-user target lists : list of user-selected wikis that the user gets to watch (limit of 10(?))

In the versioned example above:

VersionWatchlist wikisTarget wikisPer-user target lists
0.111(none, global target list on watchlist wiki)
0.251"
0.352"
............
0.880011implemented, limit 10

At first, when the list of target wikis is smaller than 10, we don't need to worry about per-user target lists. If/when "target wikis" grows past 10, it will become important to have the per-user target list feature implemented.

The proposals and votes actually say "cross-wiki", not "global" -- I just checked, and nobody expressed a desire for an 800-wiki watchlist. One person mentions 5 wikis, another says 3 or 4, and a couple people refer to their home wiki plus Commons.

So we could build a watchlist that caps at 10 wikis. You choose which wikis that you want to include, using an interface that's on the page. If you want to add an 11th wiki, you have to take another one out.

I'm sure that the stewards will be disappointed, because they could really use a truly global watchlist. But if the choice is a more limited 10-wiki list vs not building it at all, then we should build the 10-wiki version.

I agree that we don't need to build in support for 800+ wikis.
I'm guessing that 10 is probably just a random low number, purely to start discussion; if so, jump to the end for a more specific number! -- but if we're being specific about the idea of 10, then I think we need more understanding of 1) the use-cases, 2) the technical limitations.

For example, I'm monolingual, and primarily active on a single project (Wikipedia), but I have actively used watchlists on at least 7 projects.

We want to encourage more cross-project collaboration and activity, and cross-wiki watchlists will be a fundamental part of that.

I think the number of projects, as seen by a bi-lingual user, might be a reasonable number to use for discussion/investigation purposes?

Specifically, I'd count/group them something like this: (with "2x" representing a bi-lingual person)

  • Metawiki - core
  • MediaWiki - core
  • Commons - core
  • Wikidata - core
  • 2x Wikipedia
  • 2x Wiktionary
  • 2x Wikibooks
  • 2x Wikinews
  • 2x Wikiversity
  • 2x Wikisource
  • 2x Wikiquote
  • 2x Wikivoyage
  • Wikispecies - multilingual
  • Outreach - less used
  • Wikimania - less used (multiple, but usually only 1 active at a time)
  • Chapter - less used (multiple, but probably only 1 of great interest to someone)
  • Incubator - less used
  • Test - less used

13 main projects, or 18 lines altogether, or 26 for a bi-lingual person.
From that, I'd suggest 20 as an absolute minimum to aim for, and 30-40 as a more ideal long-term target, for multi-lingual users, and for steward/global-sysop/interface-editor/SWMT/etc types of users.

Well, if we follow RobLa's plan -- which looks excellent to me -- then we're going to start with two wikis, and work up from there. If we get close to ten wikis without the world exploding, then we can definitely negotiate up.

Right now I want to survive first contact with ArchCom, before I start worrying about the bilingual Wikiversity/Wikispecies crossovers. :)

Right now I want to survive first contact with ArchCom, before I start worrying about the bilingual Wikiversity/Wikispecies crossovers. :)

I concur with Nick that we should come up with a safe number we might want to aim for in future before the RFC. Scaling down is easy, scaling up is hard. :)

To characterize the scaling challenges, do we have data on these bits?

  • Maximum / median number of watchers per page.
  • Maximum / median number of watched pages per user, across projects.
  • Maximum / median number of projects users watch pages on.

This problem also reminds me of a paper on feed propagation for sites like twitter. They found that using a hybrid push/pull solution is best for combining fast denormalized reads with limited load from high-fan-out sources.

I'm not going to try to make an argument for supporting someone who has watchlists on hundreds of wikis, but I will point out https://en.wikipedia.org/wiki/Zero_one_infinity_rule and that Quiddity already increased the number being discussed from 10 to 30–40.

The number being discussed is 10, which is a nice round number that will help us focus on whether this product is possible vs not. Negotiating over 10 vs 26, 40 or hundreds can happen after we know whether a cap works at all.

https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good

I think I have a proposed solution that would be terribly simple- but would
be mostly by infrastructure rather than code.

Watchlist and recentchanges are relatively small-medium tables. Assuming:

Their updates are independent from other tables.

Functionality only depends on these 2 tables.

You can get budget for a couple of extra servers.

Watchlist_ids have been implemented

We can replicate all the watchlists and the rcs only of all selected wikis
(global login) to this central server. I am thinking regular mysql
(multisource) replication, as it is probably the most reliable way, but
kafka or other method could be possible, too. We use this already for
analytic slaves.

With some trickery even single megatables with wiki key + original row
would be possible, if that's helpful.

Once that is set, assuming no watchlist edits happen from the global lists
(only reads) we can use almost the exact same code to query them, with no
overhead on the current servers.

Queries would be only slightly slower than a regular watchlist query
[verification pending], and no cache would be involved, nor 800 different
connections to different servers. Also, if it fails, or has performance
problems, local watchlists are not affected. Code investment would be
minimized too. Infrastructure setup time would be minimal, too.

I think I have a proposed solution that would be terribly simple- but would
be mostly by infrastructure rather than code.

This is a really interesting solution. I'm looking forward to discussing.

With some trickery even single megatables with wiki key + original row
would be possible, if that's helpful.

Without the megatable, you would still have to query each wiki separately (but from one server, which might help).

This sounds really helpful, though, since it gets around the need for a real separate global_recentchanges table.

I'm not sure a psuedo-global_watchlist table with just wiki key + original row is enough. You would still have the issue that there are multiple wl_user users in that table which are really the same user (but different local user ID). That's why we'd want a global user ID in the watchlist megatable, per @Anomie.

What's the relationship between this RFC and T105766: RFC: Dependency graph storage; sketch: adjacency list in DB? I mean, if we have a scalable mechanism that tracks de3pendencies, we can just model users as "depending" on pages tehy watch, and we'd have a global watchlist system, right?

E235: ArchCom RFC Meeting W29: Devise plan for a cross-wiki watchlist back-end (2016-07-20, #wikimedia-office)

Meeting started by robla at 21:00:02 UTC. The full logs are available at
https://tools.wmflabs.org/meetbot/wikimedia-office/2016/wikimedia-office.2016-07-20-21.00.log.html (and P3528)
.

Meeting summary

  • _LINK:_ https://phabricator.wikimedia.org/T126641 (robla, 21:00:27)
  • WIKIMEDIA MEETING CHANNEL | PLEASE NOTE: CHANNEL IS LOGGED AND PUBLICLY POSTED (DO NOT REMOVE THIS NOTE) | LOGS: HTTP://BOTS.WMFLABS.ORG/~WM-BOT/LOGS/%23WIKIMEDIA-OFFICE/ (robla, 21:00:35)
    • question discussed: Is keeping a central table of all recent changes for all wikis a doable option? (robla, 21:05:06)
    • 14:28:25  Getting back to the original question, it sounds like no one has ruled out the possibility of having a centralized mega recentchanges table. Is that correct? In other words, is it an option we should keep on the table? (robla, 21:29:20)
    • 14:29:21  kaldari: I think that's right, but I really miss having jynus in this meeting to talk about it (robla, 21:29:47)
    • Check total size of RC tables and watchlist tables (matt_flaschen, 21:32:25)
    • robla says: this doesn't have to be the only RFC-related meeting of the week; just the one regular time I currently plan to facilitate (robla, 21:32:52)
    • Special:Watchlist in the current code joins on watchlist, recentchanges, change_tags, and optionally page (DanielK_WMDE_, 21:35:52)
    • kaldari: Well you can't usefully replicate watchlist rows verbatim, because then wl_user will be the local user ID (DanielK_WMDE_, 21:36:37)
    • just add wl_global_user to the watchlist (DanielK_WMDE_, 21:38:48)
    • I think it's a bug, I think user_id mapping should be added to CA (DanielK_WMDE_, 21:46:05)
    • RoanKattouw notes tentative buy-in for 1) Add user ID to the CA tables 2) Jaime's replication strategy + rewrite/populate wl_user/wl_global_user w/global user ID (derived from the CA tables) (robla, 21:51:28)
    • robla plans to assign RoanKattouw as shepherd (robla, 21:52:33)
    • _LINK:_ https://phabricator.wikimedia.org/T2424 -- patch waiting for more reviews. (quiddity, 21:56:07)
    • RoanKattouw plans to write a summary on the task (robla, 21:59:39)

Meeting ended at 22:00:00 UTC.

Action items

  • (none)

People present (lines said)

  • RoanKattouw (99)
  • matt_flaschen (42)
  • TimStarling (39)
  • gwicke (38)
  • kaldari (31)
  • robla (27)
  • bd808 (17)
  • addshore (14)
  • anomie (12)
  • quiddity (11)
  • DanielK_WMDE_ (7)
  • James_F (6)
  • Scott_WUaS (6)
  • tgr (5)
  • SMalyshev (5)
  • MaxSem (3)
  • dannyh (3)
  • wm-labs-meetbot` (3)
  • musikanimal (3)
  • stashbot (2)
  • ori (2)
  • legoktm (2)

Generated by MeetBot 0.1.4 (http://wiki.debian.org/MeetBot)

I am sorry I could not attend but I am really overloaded these days- offline updates is what works better for me.

Some questions to things I read:

Recentchanges is only a 4GB table tops (commonswiki, enwiki). Watchlist is a 25GB one on enwiki, but I assume that is an exception, and everywhere else it is a small one. This makes a total of 150-200GB of combined size, something that could confortably fit into memory in our reference server.

I was not thinking of binary log manipulation, but something like triggers that updates a larger table (and hence we can add and remove fields as needed). E.g.:

DELIMITER //
CREATE TRIGGER recentchanges_AI
AFTER INSERT ON `enwiki`.`recentchanges` FOR EACH ROW
BEGIN
    INSERT INTO global_recentchanges (grc_wiki_id, ...)
        VALUES ('enwiki', ...)
END;
//
DELIMITER ;

To avoid data duplication, the original tables can be BLACKHOLE.

This is only a quick and dirty idea, as a proof of concept, and it would have a lot of problems, but it demonstrates that it works (assuming append only). I can think of more refined options, but that doesn't bring me concern. I am not worried (and you should not) about this being "possible" (I know it is, one way or another -e.g. I mentioned kafka or any other options), my worry is, supposed we have a megatable, can we really query it efficiently? That is a not-so-evident issue, as that extra column could throw away any change of this being a thing- the right answer is to create a proof of concept to demonstrate it can be done efficiently. Also the whole local user id vs. global id solving. how?

In my proposal, I do not intend to move current data from the current state- first because it would be completely backwards compatible, second because it would make that a SPOF. I am not going to like that I have plans to create dedicated RC slave (the role), but that is an infrastructure-only change (database consolidation), 100% transparent to the application. This proposal would mean duplicating data because it can be done *now* with no code change and does not affect the availability of the current wikis (while a single global megatable would be a SPOF for many people that would want. So that means "wasting" disk and servers (I would call it investing in a functionality that I assume is worth it (?)).

Of course this is only one potential solution, and I would mind you prefering something less db-focused (which has indeed some reliability issues) and more code-focused, but I liked because it has a large effort/results ratio.

Avoid unions, we would be talking a union of ~800 tables- that would not be viable.

Please avoid cross shard queries. We already have issues with 2 shards (centralauth + another wiki).

I do not like adding wl_global_user to the local tables (but I do not discard it either)- it will depend on the final "authoritative" answer of the storeage (duplication vs. replacing), but I do not have a perfect solution. If possible it should be added on the fly (at replication time).

I like in general your list of conclusions, and I agree with "global watchlsit table and the CA tables have to be on the same DB cluster" but it scares me for availability/security reasons. Also, this becoming the authoritative place for watchlists looks a complicated transition. I would suggest the "extra/optional functionality" option, then continue evolving from that.

I would strongly suggest setup a test global table(s) on the replica labs db and test if that works before investing more time on it.

(assuming append only).

You probably know this already, but for the record: recentchanges isn't append only. Updates occur for at least the rc_patrolled and rc_deleted fields, and deletes occur when the corresponding page is deleted and for rows older than a configured age (30 days in our setup). I'm sure appropriate trigger rules for UPDATE and DELETE could be written.

If possible it [wl_global_user] should be added on the fly (at replication time).

The one potential drawback is that it would probably be harder to handle attach and unattach events. On the other hand, we shouldn't have attach and unattach events anymore outside of when the account is created and when the account is renamed (in which case it's unattaching then reattaching the same local↔global ids), so we may not need to worry about it.

I like in general your list of conclusions, and I agree with "global watchlsit table and the CA tables have to be on the same DB cluster" but it scares me for availability/security reasons.

With some discussed changes to the table (namely including the local and global user IDs instead of depending on lu_name), only the CA localusers table would need to be on the same cluster. That table doesn't have any sensitive information.

I would suggest the "extra/optional functionality" option, then continue evolving from that.

+1

You probably know this already, but for the record: recentchanges isn't append only.

Well, I knew it, but I didn't have it into account. It makes things slightly more complex, but nothing blocking.

The one potential drawback is that it would probably be harder to handle attach and unattach events.

I already supposed things were not as clear, and that is why I put user control as one of the main "blockers" (we need to think a way to overcome that).

only the CA localusers table would need to be on the same cluster

My worries were not that CA localusers should be together, but putting it on the centralauth shard (which has all data). It would be better if the table was also copied the same way RC and WL, but that may create inconsistencies? I do not know, not a big deal (except the large amount of edge cases to consider).

Thing are not as easy as it looks in the beginning :-/

I am sorry I could not attend but I am really overloaded these days- offline updates is what works better for me.

Jaime, thank you for all of your hard work, and please don't feel bad about missing this particular discussion! Given the global nature of our movement, timing for these discussions will usually not be optimal for someone.

I don't want to hijack this task to talk about meeting timing; Conpherence room Z425 is a great place for that.

Check total size of RC tables and watchlist tables (matt_flaschen, 21:32:25)

Not sure how to calculate the total sizes of all the tables, but here are some spot-checks:

  • enwiki recentchanges: 6,775,678
  • enwiki watchlist: 73,838,521
  • dewiki recentchanges: 2,351,080
  • dewiki watchlist: 39,287,345
  • frwiki recentchanges: 4,835,281
  • frwiki watchlist: 24,969,254

@Addshore, @jcrespo: 2 questions...

  • When should we expect the new watchlist IDs (T130067) to be in place in production? A very rough estimate is fine.
  • What's the final plan for the watchlist expiry schema changes (T124752)? Are we doing a new watchlist_props table? Will there be a new timestamp field in the watchlist table?

These are row counts?

Yes, sorry. They are row counts.

Total row counts across all wikis:
recentchanges: 85M
watchlist: 548M

I didn't sum table sizes because the size info reported by s3-analytics-slave doesn't seem to match what @jcrespo reports (25GB vs 15GB for enwiki.watchlist for example).

My worries were not that CA localusers should be together, but putting it on the centralauth shard (which has all data). It would be better if the table was also copied the same way RC and WL, but that may create inconsistencies? I do not know, not a big deal (except the large amount of edge cases to consider).

I was about to suggest something similar: would we be able to replicate the centralauth.localuser table from the CA cluster to this new RC+WL cluster, for lookup purposes? I think it's probably good if the global RC+WL tables are not on the same cluster as the CA tables, so that heavy global watchlist usage can't cause availability issues for the CA tables (which could break login and cause lots of other problems; MW really doesn't like it when CA goes away).

@kaldari Sadly, because watchlist was designed without a primary key, it needs to be done in total read only mode, which means either a period of read only mode or a datacenter failover. This is tracked on T138810. Sadly, there is not a date for the next datacenter failover (even if in theory one will happen every 6 months) so we could program a master failover at some point so we do not keep waiting.

@Addshore, @jcrespo: 2 questions...

  • When should we expect the new watchlist IDs (T130067) to be in place in production? A very rough estimate is fine.

See @jcrespo's response above The initial estimate (T130067#2130097) was some time in September 2016!

  • What's the final plan for the watchlist expiry schema changes (T124752)? Are we doing a new watchlist_props table? Will there be a new timestamp field in the watchlist table?

We haven't been actively working on the watch list expiry for some months now and we would likely have to reevaluate everything before knowing for sure.

I'm pretty sure the below will be done:

  1. T130067 Add wl_id to watchlist tables on production dbs
  2. T125991 Add wl_timestamp to the watchlist table
  3. T127953 Populate wl_timestamp field
  4. T67187 Show when pages were added to watchlist

The watchlist props (T129486) table covers a number of cases as discussed in one of the RFCs. It would allow setting expiry times as well as tags for watched items among other things.
The alternative to this for expiring items would be to add another field to the watch list table to hold and expiry time stamp (IMO not the bets idea) or even a table to only hold the expiry.

All of the above including a bunch of other refactoring is blocked on getting the wl_id field on production dbs.

Someone notified ops on SOS about going forward with this. The right way would be to perform a hardware request *now* (as they take a lot of time to be processed, and depends on external vendors). I think starting with 2 servers on each datacenter would be the right call for a minimal HA. Maybe it doesn't even need new hardware, but the request should be done no matter what (if we reassign existing hardware, it should be trazable. I think the process for this is documented on wikitech (hardware-requests) but contact me if it is not clear.

In parallel, I would do a proof of concept on labs or somewhere else to show this is viable? I can help with that by setting some test triggers at some existing place in the current infrastructure. Please contact me the person(s) interested on moving this forward to setup an initial demo (it should not take very long for a couple of wikis).