Page MenuHomePhabricator
Paste P3528

ArchCom-RFC-2016W29-irc-E235.txt
ActivePublic

Authored by RobLa-WMF on Jul 20 2016, 10:04 PM.
21:00:02 <robla> #startmeeting [RFC] Devise plan for a cross-wiki watchlist back-end
21:00:02 <wm-labs-meetbot`> Meeting started Wed Jul 20 21:00:02 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:02 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:02 <wm-labs-meetbot`> The meeting name has been set to '_rfc__devise_plan_for_a_cross_wiki_watchlist_back_end'
21:00:27 <robla> #link https://phabricator.wikimedia.org/T126641
21:00:35 <robla> #topic Wikimedia meeting channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:00:58 <robla> hi everyone!
21:01:10 <kaldari> hello
21:01:36 <dannyh> hi
21:01:51 <musikanimal> hey!
21:02:03 <robla> kaldari: thanks for getting this filed!
21:02:19 <kaldari> NP :)
21:02:49 <robla> Niharika here? (or is timezone math kinda awful?)
21:02:54 <Scott_WUaS> Hey :)
21:02:57 <bd808> I don't think Niharika is going to make it. It's 02:30 in her timezone
21:02:58 <bd808> a bit late/early for an RfC meeting ;)
21:03:01 <kaldari> robla: too late for her :(
21:03:24 <MaxSem> wow, she has a life?
21:03:57 <robla> ok...we can potentially have a followup meeting at a different time to accomodate. this was billed as a "Problem definition" meeting, so more discussions to come (hopefully)
21:04:12 <kaldari> true
21:04:41 <robla> anyway, the list of questions: "Is keeping a central table of all recent changes for all wikis a doable option?"
21:05:06 <robla> #info question discussed: Is keeping a central table of all recent changes for all wikis a doable option?
21:05:14 <robla> so....is it?
21:05:35 <RoanKattouw> I think the discussion on the task already veered to "you don't need a global RC table, just a global watchlist table"
21:05:40 <TimStarling> I just pinged jynus, hopefully he can join
21:05:42 <matt_flaschen> It also would be nice if jynus was here.
21:05:43 <MaxSem> really a DBA question. how much data are we going to store? how often will it deadlock on inserts?
21:05:45 <gwicke> do we have data on the perf requirements?
21:05:50 <RoanKattouw> Or at least that's what I said on the task :D
21:05:51 * RoanKattouw is catching up on task comments now
21:05:56 <MaxSem> s/dead//
21:05:56 <gwicke> mean / max watch list sizes, etc
21:06:05 <TimStarling> I thought his idea of recentchanges tables replicated to a single server was interesting
21:06:10 <matt_flaschen> RoanKattouw, yeah, but I'm not sure if that's really because we can't do global, or just because "well, I guess we can limit to 10" (which might still have worse perf).
21:06:19 <kaldari> RoanKattouw: There have been comments suggesting a central RC table (and also comments suggesting other approaches)
21:06:27 <TimStarling> if they are all on one server, then 10 is probably way too low
21:06:30 <matt_flaschen> Yeah, me too. It should solve at least global_rc, but global_watchlist I'm not sure about.
21:07:01 <matt_flaschen> (me too re the replication ideas).
21:07:34 <kaldari> The idea of limiting to 10 was mainly to limit the impact of having to do cross-wiki queries, which would be eliminated by having a cental RC table AFAIK.
21:08:01 <Scott_WUaS> (Might this not be inevitable eventually possibly?)
21:08:23 <gwicke> Jaime said "Watchlist and recentchanges are relatively small-medium tables."
21:08:24 <matt_flaschen> I particularly thought jynus's idea of a central RC table (using the wiki tags + original row) was interesting. It sounded like some kind of view. I don't think that will work for watchlist, but it should for RC.
21:09:24 <gwicke> do we expect issues from heavy watchers across many projects?
21:09:37 <SMalyshev> I wonder why having single watchlist but separate rctables means a lot of queries? Can't we do a union query? Or it'll be too heavy?
21:09:47 <kaldari> gwicke: depends on the implementation
21:09:51 <TimStarling> with regular mysql replication you would presumably end up with one table per wiki
21:10:01 <RoanKattouw> SMalyshev: You can't union across multiple DBs that are on different servers
21:10:06 <matt_flaschen> gwicke, There are some legitimate heavy watcher use cases (especially a few pages on many, many wikis), such as stewards and small-wiki counter-vandalism.
21:10:15 <matt_flaschen> And we need to handle deliberate attempts to overload the system.
21:10:21 <TimStarling> with one table per wiki replicated to a single server, you could do a union
21:10:23 <quiddity> (and global interface editors, and global sysops)
21:10:26 <gwicke> matt_flaschen: do we have numbers (order of magnitude) for those?
21:10:39 <TimStarling> unions still have a cost per table associated with them
21:10:47 <SMalyshev> RoanKattouw: ah, we don't have all wikis in same server... can't we then do N=number of shares queries at least instead of number of wikis? Won't it be better?
21:11:13 <matt_flaschen> gwicke, I don't.
21:11:13 <TimStarling> but if you know which wikis you want to query in advance, you don't have to put every wiki in the union every time
21:11:28 <legoktm> when Gather was being developed, we had an RfC discussion about re-doing the watchlist schema...were those ideas taken into account for how a cross-wiki watchlist would work?
21:11:29 <musikanimal> ^
21:11:31 <matt_flaschen> SMalyshev, what does "number of shares" mean?
21:11:45 <RoanKattouw> matt_flaschen: Shards I think
21:11:48 <RoanKattouw> So 7 or 8
21:11:48 <SMalyshev> matt_flaschen: number of shards
21:11:49 <TimStarling> jynus said "With some trickery even single megatables with wiki key + original row would be possible"
21:12:11 <matt_flaschen> Yeah, I think that solves global_recentchanges.
21:12:16 <TimStarling> presumably he means modifying the binlog stream
21:12:28 <kaldari> legoktm: IDs have been added to the watchlist table, and there is strong potential for a new watchlist_props table in the near future.
21:12:48 <TimStarling> I guess there is a library to do that since I know there are other tools that modify the binlog stream
21:12:52 <kaldari> legoktm: not aware of other proposed schema changes
21:12:58 <RoanKattouw> If you had a global RC megatable and a global watchlist table would that be enough? Do we currently join RC/watchlist against any other tables like revision, logging, etc.?
21:13:07 <SMalyshev> also, for most watchlists most entries won't be changed... so we don't need a lot of data to know that. We need more data only for those that changed
21:13:10 <RoanKattouw> (I suspect not because RC is a denormalized table, but it's MediaWiki so who knows)
21:13:21 <kaldari> legoktm: do you have a URL for that RfC?
21:13:27 <gwicke> TimStarling: I understood his comment to mean that it contains the original row's content
21:13:34 <TimStarling> no, there's no need to join on other tables
21:13:49 <gwicke> which can be achieved with or without the binlog
21:13:50 <RoanKattouw> Good, because we would cut off that possibility permanently if we globalized the RC table
21:13:51 <matt_flaschen> gwicke, I think he said the original row's content *plus* the wiki ID (e.g. enwiki)
21:13:52 <TimStarling> watchlist/recentchanges is the only massive intersection
21:13:57 <RoanKattouw> (of joining I mean)
21:13:59 <kaldari> RoanKattouw: I believe that would be enough
21:14:09 <gwicke> matt_flaschen: yeah
21:14:12 <TimStarling> after you join those two, you have a chronological event list
21:14:35 <gwicke> he also mentioned the Kafka route
21:14:39 <legoktm> kaldari: the discussion was https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-01-14-21.00.log.html
21:14:40 <gwicke> eventbus already has RC events
21:14:49 <gwicke> we could basically follow that to keep a global table up to date
21:14:53 <matt_flaschen> T1352
21:14:54 <stashbot> T1352: RFC: Support for user-specific page lists in core - https://phabricator.wikimedia.org/T1352
21:14:55 <matt_flaschen> ?
21:14:56 <RoanKattouw> What about the tagging tables?
21:15:14 <RoanKattouw> The RC view on Special:Recentchanges displays tags, does it do that by joining or by doing separate lookups?
21:15:19 <anomie> Re joining, the existing Special:Watchlist joins against page for the non-expanded version, and against change_tag to display or filter by tags.
21:15:23 <RoanKattouw> (I suppose it could always be changed from the former to the latter if need be)
21:15:32 <TimStarling> well, I was going to say that display queries could be done after the join, in separate queries
21:15:50 <RoanKattouw> Yeah, sure
21:15:53 <TimStarling> but I suppose if you display 500 results from 50 wikis that could end up being a lot of display queries
21:15:57 <RoanKattouw> As long as it's just pulling in extra data for display
21:15:59 * DanielK_WMDE_ thinks modifying the binlog stream sounds scary
21:16:09 <RoanKattouw> But if you're using it for WHERE conditions it's harder
21:16:13 <TimStarling> yes
21:16:19 <RoanKattouw> For example, how do you do the tag filter on Special:RC
21:16:33 <TimStarling> it's probably inefficient already
21:16:34 <gwicke> group by project & send out a handful of batches at most
21:16:43 <matt_flaschen> DanielK_WMDE_, it does a little from the sound of it, but jynus does not normally take unnecessary risks from what I can tell.
21:16:54 <RoanKattouw> TimStarling: Yeah I'm pretty sure those queries already suck
21:17:06 <gwicke> could also denormalize some data into this global table if that speeds things up
21:17:06 <TimStarling> if we want efficient tag filters then we will have to replicate those tables to the central server
21:17:38 <matt_flaschen> We could probably use jynus's original row + wiki tag megatable for the tags.
21:17:44 <RoanKattouw> Hmm, I was thinking if we could denormalize tags into the RC schema but because of the one-to-many thing that's probably hard
21:18:10 <RoanKattouw> Also I think you can tag things after the fact
21:18:52 <RoanKattouw> matt_flaschen: Maybe, but let's be careful here
21:19:11 <TimStarling> so tags can go in the too-hard basket for now?
21:19:13 <RoanKattouw> Now if there's something else that is joined against the tag tables, you need to pull that in too, and before long you're globalizing the entire database
21:19:19 <gwicke> suppression needs to be handled as well; there are events for that in eventbus, too, but it's work to follow those
21:19:38 * robla lets this conversation continue because it looks like a good conv. PM me if you'd like me to move things along
21:19:38 <TimStarling> unless we need them for feature parity?
21:19:43 <RoanKattouw> gwicke: How would suppression work differently from how it already does?
21:20:24 <gwicke> RoanKattouw: this depends on whether we go to the primary db for the watchlist view, or not
21:20:30 <RoanKattouw> Re tagging, if you globalize that then you need to do lookup queries against the global database when rendering the history page, and if there's a feature somewhere that, say, filters Special:Log or history by tags (don't know if that exists) then you're screwed
21:20:47 <gwicke> if we have everything in the global tables, then we'd also need to handle suppression at that level
21:20:51 <RoanKattouw> Right, I see
21:20:56 <RoanKattouw> It's handled in the revision table now
21:21:15 <matt_flaschen> RoanKattouw, isn't it also in RC (rc_deleted), or is that a red herring?
21:21:17 <TimStarling> is the idea to show global watchlists via the existing UI? or would there be a separate special page?
21:21:27 <kaldari> Having aggregated central mega-tables for watchlist, watchlist_props (doesn't exist yet), recentchanges, and tags sounds like it would take a lot of DB storage. If we went down that route, would this likely require additional hardware?
21:21:28 <matt_flaschen> TimStarling, existing UI with current mockups.
21:21:36 <kaldari> TimStarling: existing UI
21:21:39 <RoanKattouw> matt_flaschen: Yeah maybe it is
21:21:57 <gwicke> kaldari: some extra machines / new cluster, probably
21:21:58 <RoanKattouw> I don't know enough about how suppression works
21:22:05 <TimStarling> so you need feature parity with all its filters, ideally
21:22:08 <anomie> I hope not existing UI, I'd prefer to keep my per-wiki watchlists instead of one giant mashup.
21:22:27 <TimStarling> namespace dropdown anyone?
21:22:28 <matt_flaschen> anomie, it has UI to choose which to show, you should just be able to leave it on current-wiki and have a sticky preference.
21:22:38 <RoanKattouw> Ouch, the namespace dropdown is painful
21:22:56 <RoanKattouw> You could denormalize your way out of that but the change propagation would be annoying
21:23:25 <matt_flaschen> anomie: https://meta.wikimedia.org/wiki/Community_Tech/Cross-wiki_watchlist#/media/File:XWL_03_just_Enwp.jpg
21:23:29 <gwicke> we do have the events already, so it's really not too hard technically
21:23:29 <kaldari> We can decide on the UI later. Either way, I don't think it will affect the back-end implementation.
21:23:31 <gwicke> but it's annoying
21:23:55 <DanielK_WMDE_> so, what if we leave SQL out of the picture entirely? Can we use Elastic for this? What other options do we have?
21:23:57 <gwicke> in terms of code, it would probably an event handler of a few dozen lines for suppression
21:24:44 <gwicke> DanielKWMDE: we currently use a join strategy, which makes updates cheap
21:24:48 <kaldari> is Redis an option?
21:24:56 <anomie> RoanKattouw: RevDel (including suppression via RevDel) uses rc_deleted. I don't know if we still have other suppression, or if we got rid of that now.
21:25:03 <TimStarling> I guess the namespace dropdown can refer only to the local wiki
21:25:06 <RoanKattouw> Right
21:25:10 <RoanKattouw> I think there's rev_deleted too right?
21:25:15 <robla> I just realized that we don't have a shepherd associated with this RFC. I'm going to assign RoanKattouw, but we can discuss that in the last 10 minutes of this hour
21:25:16 <gwicke> watchlists are accessed rarely & are not that performance critical, so this is a decent trade-off
21:25:20 <TimStarling> if you pick an option from it, you're implicitly selecting the local wiki
21:25:21 <RoanKattouw> rc_deleted can't be the only thing that's used because it's not persistent forever
21:25:50 <gwicke> so re storage options, it would ideally support joins
21:25:52 <anomie> A RevDel action updates any/all of rev_deleted, log_deleted, and rc_deleted, as applicable.
21:25:52 <matt_flaschen> RoanKattouw, yeah, there is rev_deleted, ar_deleted, log_deleted. Hopefully rc_deleted is just a de-normalized copy.
21:25:59 <gwicke> which narrows the field substantially
21:26:49 <DanielK_WMDE_> gwicke: wouldn't support for intersections be enough?
21:26:56 <musikanimal> I think we should definitely have a separate Special page. I can already hear the angry people spitting expletives, but additionally there are many watchlist user scripts or gadgets that could break
21:27:01 <gwicke> afaik, of the storage tech we currently use, only MySQL fits the bill
21:27:19 <anomie> matt_flaschen: Eew, lots of screen space taken for a feature I don't care about.
21:27:46 <gwicke> elasticsearch can do intersection, but it's not nearly as efficient as a join
21:27:50 <matt_flaschen> anomie, they're early mockups. We shouldn't focus too much on UI, though I take the point that limiting the UI can allow simpler technical decisions sometimes.
21:28:24 <kaldari> Getting back to the original question, it sounds like no one has ruled out the possibility of having a centralized mega recentchanges table. Is that correct? In other words, is it an option we should keep on the table?
21:28:37 <gwicke> if we went for denormalized (push) storage instead, then there would be a lot more options, and reads would be very fast
21:28:50 <gwicke> but you get a lot of other issues in exchange
21:28:51 <matt_flaschen> kaldari, yeah it seems pretty good especially with jynus's replication idea.
21:29:20 <robla> #info 14:28:25 <kaldari> Getting back to the original question, it sounds like no one has ruled out the possibility of having a centralized mega recentchanges table. Is that correct? In other words, is it an option we should keep on the table?
21:29:20 <RoanKattouw> kaldari: I think that's right, but I really miss having jynus in this meeting to talk about it
21:29:47 <robla> #info 14:29:21 <RoanKattouw> kaldari: I think that's right, but I really miss having jynus in this meeting to talk about it
21:30:09 <gwicke> it sounds like he has an idea of the sizes, but it might be worth double-checking the combined rctable sizes in particular
21:30:49 <RoanKattouw> Yeah
21:30:50 <kaldari> gwicke: definitely
21:31:26 <matt_flaschen> And the watchlist tables. On enwiki, there are about 18x as many watchlist entries as RC, but I didn't check the data size.
21:31:57 <gwicke> oh, right - I forgot that rc is not complete
21:32:20 <RoanKattouw> kaldari: One issue with a global RC table that I mentioned is that you'd also need to use it for the normal RC (i.e. Special:Recentchanges), and then some of the filters are harder to do, especially tag filtering
21:32:25 <matt_flaschen> #info Check total size of RC tables and watchlist tables
21:32:39 <SMalyshev> I still wonder though if it' won't we useful to have smaller global table for just matching against watchlist, and then use particular tables for data retrieval (which will be small probably)
21:32:48 <RoanKattouw> (Thankfully the namespace is denormalized in RC so that's not an issue)
21:32:52 <robla> #info robla says: this doesn't have to be the only RFC-related meeting of the week; just the one regular time I currently plan to facilitate
21:33:11 <anomie> RoanKattouw: Why, unless you wanted it to be Special:GlobalRecentChanges?
21:33:12 <TimStarling> Special:Watchlist in the current code joins on watchlist, recentchanges, change_tags, and optionally page
21:33:30 <RoanKattouw> anomie: Well I suppose we could store all RC data twice... but ideally we wouldn't, right?
21:33:34 <TimStarling> depending on the user option extendwatchlist
21:33:46 <RoanKattouw> Or did jynus's plan mean that he wanted to *replicate* the existing RC tables to a separate global table using Magic?
21:34:01 <kaldari> RoanKattouw: Why would we need to use the centralized table for Special:RecentChanges as well? Do you think it would inefficient to have local copies for each wiki?
21:34:02 <RoanKattouw> (As opposed to *moving* RC storage from local to global)
21:34:12 <gwicke> RoanKattouw: that's how I understood him
21:34:20 <anomie> "replicate" was how I read it, but I could have misread.
21:34:23 <gwicke> whether the old tables are dropped or not is a separate question
21:34:26 <RoanKattouw> kaldari: I mean other that it being a theoretical waste of space, no
21:34:26 <RoanKattouw> Oh, OK
21:34:27 <RoanKattouw> Then I misunderstood his plan, and there's no issue
21:34:31 <Scott_WUaS> kaldari: sounds likely
21:34:38 <TimStarling> he said "We can replicate all the watchlists and the rcs only of all selected wikis (global login) to this central server."
21:34:53 <RoanKattouw> The watchlist table however is not easy to replicate that way, we may want/need to actually globalize that one
21:34:55 <TimStarling> "You can get budget for a couple of extra servers."
21:34:56 <RoanKattouw> Because of the user ID issue
21:35:15 <RoanKattouw> kaldari: That answers your money question then. Sounds like this would live on dedicated hardware
21:35:45 <kaldari> RoanKattouw: can you elaborate on the User ID issue? I thought SUL would make this work OK.
21:35:47 <RoanKattouw> OK now I also understand why people were making such a big deal of wl_user and CentralIDLookup in the task
21:35:52 <DanielK_WMDE_> #info Special:Watchlist in the current code joins on watchlist, recentchanges, change_tags, and optionally page
21:35:59 <matt_flaschen> kaldari, it does, but the existing watchlist tables don't have global user ID.
21:36:05 <RoanKattouw> kaldari: Well you can't usefully replicate watchlist rows verbatim, because then wl_user will be the local user ID
21:36:06 <kaldari> oh
21:36:07 <matt_flaschen> So you can't just use wiki tag + original row, as I was saying in the task.
21:36:12 <RoanKattouw> So there are a few ways of dealing with that
21:36:35 <matt_flaschen> I was hoping there was a clever way to join from local user ID to global user ID, but I don't see local user ID in the centralauth tables.
21:36:36 <RoanKattouw> 1) you modify MW core to store the global user ID in wl_user always. That's ugly and confusing and probably breaks other joins/things
21:36:37 <DanielK_WMDE_> #info <RoanKattouw> kaldari: Well you can't usefully replicate watchlist rows verbatim, because then wl_user will be the local user ID
21:36:46 <RoanKattouw> 2) The replication script is very smart and maps the user IDs when replicating from local to global
21:36:54 <RoanKattouw> 3) You kill the local table and only use the global one
21:37:26 <tgr> just add wl_global_user to the watchlist
21:37:28 <kaldari> Would it make sense to add a new "global user ID" row to the watchlist table?
21:37:30 <tgr> not elegant but works
21:37:34 <RoanKattouw> Hmm that would work
21:37:36 <RoanKattouw> Great idea
21:37:43 <RoanKattouw> We could then index the local and global tables differnetly
21:38:00 <RoanKattouw> Thanks tgr and kaldari for coming up with that idea within 2 seconds of each other ;)
21:38:00 <bd808> and core has a global user concept now
21:38:02 <anomie> Note then you'd have to make sure to update it on attachment and unattachment.
21:38:33 <RoanKattouw> anomie: That does suck, but realistically, do users get attached/unattached after having made edits/performed actions in practice?
21:38:37 <RoanKattouw> I guess maybe the user creation log entry?
21:38:48 <DanielK_WMDE_> #info <tgr> just add wl_global_user to the watchlist
21:38:49 <anomie> RoanKattouw: A global rename now unattaches users, then reattaches them after the rename.
21:38:54 <RoanKattouw> aah
21:39:01 <RoanKattouw> OK yeah
21:39:24 <RoanKattouw> Also Matt claims that CA has no table mapping local user IDs to global ones but that could presumably be added if we needed it
21:39:46 <RoanKattouw> not that I look forward to the join gymnastics that he seems to be envisioning
21:40:08 <matt_flaschen> RoanKattouw, when I first thought of that I thought the wiki tag + original row was a special view, not binlog dark arts.
21:40:10 <tgr> on Wikimedia servers global attachedness status should never really change, rename unattaches/attaches but the ID will be the same
21:40:11 <matt_flaschen> But it could work.
21:40:20 <tgr> on other wikis anything could happen though
21:40:23 <matt_flaschen> RoanKattouw, also, technically the local user ID is in there as gu_salt. ;)
21:40:30 <robla> could the global ID be a hash of the local ID and the wiki ID? that way, the mapping would stay the same, but we could make it clear: "don't try assuming this mapping"
21:40:32 <RoanKattouw> LOL
21:40:56 <TimStarling> that is true, there is no mapping to user_id in CA
21:41:15 <TimStarling> that was an implementation error in the original CA, IMHO, which has never been corrected
21:41:17 <RoanKattouw> robla: But then I can't easily do queries like "give me all watchlist rows for User:Catrope across all wikis". That's what the global ID is or
21:41:21 <RoanKattouw> *is for
21:41:47 <quiddity> <tgr> on other wikis anything could happen though <--- would this prevent usage of the new functionality by other wikifarms?
21:41:54 <RoanKattouw> Even with Matt and Tim's hypothetical table that provides that mapping, I don't think that query would be practical
21:42:07 <kaldari> I imagine some maintenance script could be written to populate the global_user_id fields in the local watchlist tables
21:42:09 <RoanKattouw> You probably just need the global user ID to be there, one way or another
21:42:12 <anomie> RoanKattouw: `localwiki.user JOIN centralauth.globaluser ON(user_name = gu_name)`. Add in `JOIN localuser ON (lu_wiki=$wiki AND lu_name = gu_name)` if you care about the possibility of unattached accounts.
21:42:36 <RoanKattouw> anomie: Sure but now how do I do "all watchlist rows for global user ID 57 regardless of wiki"
21:42:51 <RoanKattouw> That, in my understanding, is the main motivation for globalizing the tables etc
21:43:21 <gwicke> if we keep local tables, can't we just keep using those for unattached local users?
21:43:22 <RoanKattouw> Oh, sorry, you're telling me how to map user IDs
21:43:41 <RoanKattouw> You can't join between the centralauth DB and local wiki DB generally speaking because of sharding
21:44:00 <RoanKattouw> kaldari: Yes, we would need to do that
21:44:07 <TimStarling> unless you also replicate the user table to the central server
21:44:15 <kaldari> lol
21:44:42 <matt_flaschen> I guess that mega-user table would only be need to rewrite the user ID at replication time.
21:44:53 <gwicke> given a local user, can we figure out the global user id if the user is attached?
21:45:10 <TimStarling> but like I say, I think it's a bug, I think user_id mapping should be added to CA
21:45:14 <matt_flaschen> Or we can short-circuit this and just make global_watchlist a real mega-table, not a view/binlog thing.
21:45:14 <RoanKattouw> gwicke: Yes, but only by using >=2 DB connections at present
21:45:23 <TimStarling> I don't think we should bend over backwards trying to accomodate that limitation
21:45:47 <RoanKattouw> Or as Tim says we could add that mapping in CA, and then co-locate the global watchlist table with the CA database
21:45:59 <bd808> Adding local wiki user_id to CA should be pretty easy I think
21:46:05 <DanielK_WMDE_> #info <TimStarling> I think it's a bug, I think user_id mapping should be added to CA
21:46:11 <RoanKattouw> That might save kaldari's budget from dedicated hardware purchases :) if the hardware can take it
21:46:19 <addshore> And adding user_id to ca makes sense!
21:46:24 <kaldari> Which team is in charge of CA?
21:46:31 <RoanKattouw> LOL
21:46:35 <bd808> kaldari: a fine question
21:46:38 <matt_flaschen> RoanKattouw, does that really save any storage, or just change what software does the writes (replication vs. explicit)?
21:46:41 <RoanKattouw> We literally had a meeting hours ago where we argued about that
21:46:53 <anomie> I'd say add the column to the localuser table. And at the same time possibly add gu_id, instead of having to use the name to join.
21:46:59 <bd808> s/argued/discussed/
21:47:00 <ori> last person to say "not it"
21:47:05 <bd808> not it
21:47:07 <anomie> not it
21:47:08 <RoanKattouw> matt_flaschen: It doesn't, I just meant that it would incentivize us to locate CA and global watchlist on the same DB servers, meaning kaldari doesn't have to buy his own servers
21:47:10 <robla> not it
21:47:14 <RoanKattouw> I was mostly joking
21:47:16 <TimStarling> we need a fake team that we can use as a sarcastic answer for questions like this
21:47:29 <bd808> "mediawiki-core"
21:47:29 <Scott_WUaS> TimStarling: user_id mapping should be added to CA makes much sense
21:47:31 <James_F> TimStarling: I know!
21:47:34 <ori> Site Reliability
21:47:36 <quiddity> Tiger Team! >.>
21:47:43 <RoanKattouw> haha
21:47:47 <James_F> TimStarling: We could call it the "Admin tools team" which was going to be responsible.
21:47:52 <gwicke> Platypus
21:47:56 <dannyh> ha
21:47:59 <RoanKattouw> Anyway, summing up
21:48:00 * James_F coughs.
21:48:06 <matt_flaschen> Stability team
21:48:08 <quiddity> Does this discussion help answer/refine the question about how many simultaneous watchlists we could support? (i.e. the previously suggested limit of 10 wikis per account. I'm hoping we can raise that to at least 20, and perhaps 40 or more.)
21:48:09 <RoanKattouw> It sounds like people are in favor of the following approach, if feasible:
21:48:19 <RoanKattouw> 1) Add user ID to the CA tables
21:48:21 <robla> yes...so RoanKattouw , can I assign you as shepherd for this one?
21:48:34 <matt_flaschen> quiddity, we may not need that limitation if we take the global approach. No promises yet. :)
21:48:45 <RoanKattouw> 2) Use Jaime's replication dark magic and additionally have it rewrite/populate wl_user/wl_global_user with the global user ID, which it derives from the CA tables
21:48:58 <RoanKattouw> 2b) this probably means the global watchlsit table and the CA tables have to be on the same DB cluster
21:48:59 <bd808> quiddity: I *think* if we go for the central storage option then the answer is "all wikis in the farm"
21:49:11 <RoanKattouw> Yes my belief is also that we wouldn't need a limit
21:49:32 <gwicke> the replication system could just look up the ids from the CA cluster
21:49:37 <gwicke> I don't think it needs to be co-located, as long as it can access CA
21:49:47 <bd808> Do we limit number of pages watched on a single wiki?
21:49:55 <James_F> Currently we don't.
21:50:00 <TimStarling> I think we should
21:50:00 <addshore> Nope
21:50:08 <RoanKattouw> Another alternative that was mentioned was adding wl_global_user to the local watchlist tables, have local MW instances populate it by using existing CA stuff, and have that mindlessly be replicated acros
21:50:09 <James_F> E.g. my personal account has ~20k pages watched.
21:50:14 <quiddity> there are many users with watchlists in the low tens-of-thousands.
21:50:15 <James_F> Which would be… bad to encourage.
21:50:23 <TimStarling> I remember there was a guy with 150k pages on his watchlist at a time when enwiki only had 1M pages total
21:50:28 <RoanKattouw> Do people feel that's better or worse than doing it in the replications tep?
21:50:32 <RoanKattouw> *replication step
21:50:39 <TimStarling> there wasn't really a good rationale for it, I asked him
21:50:58 <RoanKattouw> TimStarling: Betacommandbot had 200k or 400k or something, when it tried to view Special:Watchlist it caused an OOM
21:50:58 <TimStarling> if there was a limit he would have stopped clicking the button
21:51:10 <RoanKattouw> Brion ran DELETE FROM watchlist WHERE wl_user=NNN and that query took 14 minutes
21:51:15 <quiddity> Any discussion system that allows Topic/Thread watchlisting, is also a prolific incrementer of watchlist numbers...
21:51:17 <matt_flaschen> RoanKattouw, I sort of think better in replication step. Otherwise, it's arbitrary why just there.
21:51:28 <robla> #info RoanKattouw notes tentative buy-in for 1) Add user ID to the CA tables 2) Jaime's replication strategy + rewrite/populate wl_user/wl_global_user w/global user ID (derived from the CA tables)
21:51:33 <RoanKattouw> matt_flaschen: I guess it would pollute the general MW core schema
21:51:44 <matt_flaschen> Why not rev_global_user? (Clearly the answer is "we didn't need it", but that's not that satisfying when you're looking at a DB schema)
21:51:50 <RoanKattouw> Right :
21:51:51 <RoanKattouw> :)
21:52:05 <addshore> I think one of the reasons for large watchlists is the difficulty people have to selectively clear them.
21:52:12 <RoanKattouw> matt_flaschen: Other answer: "on enwiki the revision table is way too freaking large" ;)
21:52:29 <bd808> addshore: weren't you going to fix that? ;)
21:52:33 <robla> #info robla plans to assign RoanKattouw as shepherd
21:52:36 <addshore> Mhhmhm
21:52:41 <RoanKattouw> robla: Sounds reasonable
21:53:09 <addshore> bd808: *in progress*
21:53:23 <kaldari> A watchlist limit of 50,000 or 100,000 seems like it would be reasonable, and maybe deter some beans up noses.
21:53:41 <bd808> for implementation, would this be an extension that replaces the normal Special:Watchlist?
21:53:50 <addshore> I wonder how many people would hit that at present!
21:53:51 <matt_flaschen> Watchlist expiry would also help with this, especially if it's a prominent option.
21:54:03 <kaldari> bd808: probably
21:54:03 <tgr> keeping global IDs segregated in the CA database is a good thing since it makes it much easier to switch to a different single-sign-on extension
21:54:04 <bd808> or will Special:Watchlist in core have to learn about weird central tables that only exist on some farms?
21:54:12 <addshore> matt_flaschen: indeed
21:54:17 <matt_flaschen> T100508
21:54:18 <stashbot> T100508: Watchlist expiry: Watch pages for a specified time frame (2013) - https://phabricator.wikimedia.org/T100508
21:54:19 <RoanKattouw> addshore: Challenge accepted, I'll see if that query runs in a reasonable amount of time
21:54:30 <matt_flaschen> (That is also a community wishlist request in its own right)
21:55:01 <bd808> matt_flaschen: that's what I was taunting addshore about
21:55:25 <addshore> On that front we are still waiting for the wl_id field to land on production databases.
21:55:41 <kaldari> addshore: any ETA?
21:55:58 <bd808> "when jamie gets it done"
21:56:06 <addshore> It's getting closer :)
21:56:07 <quiddity> https://phabricator.wikimedia.org/T2424 -- patch waiting for more reviews.
21:56:11 <quiddity> See also*
21:56:23 <addshore> By the end of the year? (:
21:56:36 <quiddity> ("Add a direct unwatch link to entries on Special:Watchlist") /me pats stashbot
21:56:54 <addshore> quiddity: I think we also looked at that a bit!
21:57:18 * quiddity adds addshore to review https://gerrit.wikimedia.org/r/#/c/293892/ >.>
21:57:21 <bd808> quiddity: stashbot won't expand urls, just task numbers
21:57:29 <robla> we can continue the conversation in another channel. #wikimedia-tech? #wikimedia-operations? (or just move to Phab/email/async)
21:57:41 <bd808> next steps?
21:57:44 * robla gives 2 minute warning
21:57:46 <addshore> Oooh, please do quiddity
21:57:52 <matt_flaschen> quiddity, it's marked as a WIP, also is it blocked on design?
21:58:03 <kaldari> Do we want to discuss the pluses and minuses of a non-central-recentchanges-table approach?
21:58:17 <kaldari> And what that might look like?
21:58:17 <Scott_WUaS> kaldari: yes
21:58:20 * addshore waves goodbye!
21:58:27 <quiddity> matt_flaschen, yes, it needs code and design feedback.
21:58:40 <kaldari> although I guess we're out of time for now
21:58:42 <gwicke> the big minus of local watchlists is that it's hard to scale to many projects
21:58:47 <kaldari> I have to run do an interview
21:58:53 <robla> yeah, we should wrap this up. I think next steps are more async conversation, and then RoanKattouw can bring this up for another one of these
21:58:54 <addshore> quiddity: I'm sure we (wmde) would be happy to provide that 'maybe'
21:59:06 <RoanKattouw> I'll write a summary on the task
21:59:12 <Scott_WUaS> robla: great!
21:59:14 <robla> RoanKattouw: thank you!
21:59:35 <dannyh> thanks, everybody
21:59:36 <RoanKattouw> kaldari: Input on that welcome. I have opinions too. So I'll ask that on the task and then reply with my view
21:59:39 <robla> #info RoanKattouw plans to write a summary on the task
21:59:48 <matt_flaschen> Thanks to kaldari, dannyh, jynus, and everyone
21:59:53 <robla> thank you everyone!
22:00:00 <robla> #endmeeting