- Location: #wikimedia-office IRC channel
- Meeting type: Problem definition
- Time: 2016-07-20 (Wednesday) 21:00 UTC (2pm PDT, 23:00 CEST)
- Agenda:
- T126641: [RFC] Devise plan for a cross-wiki watchlist back-end
- Discussing these questions
- Is keeping a central table of all recent changes for all wikis a doable option?
- Is doing separate queries for each wiki a better or worse idea?
- Are there ways that we could mitigate the impact of either of those options?
- Are there other options that we haven't even thought about yet?
- Discussing these questions
- T126641: [RFC] Devise plan for a cross-wiki watchlist back-end
Auto-generated summary: T126641#2482285
Full log:
1 | 21:00:02 <robla> #startmeeting [RFC] Devise plan for a cross-wiki watchlist back-end |
---|---|
2 | 21:00:02 <wm-labs-meetbot`> Meeting started Wed Jul 20 21:00:02 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot. |
3 | 21:00:02 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. |
4 | 21:00:02 <wm-labs-meetbot`> The meeting name has been set to '_rfc__devise_plan_for_a_cross_wiki_watchlist_back_end' |
5 | 21:00:27 <robla> #link https://phabricator.wikimedia.org/T126641 |
6 | 21:00:35 <robla> #topic Wikimedia meeting channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ |
7 | 21:00:58 <robla> hi everyone! |
8 | 21:01:10 <kaldari> hello |
9 | 21:01:36 <dannyh> hi |
10 | 21:01:51 <musikanimal> hey! |
11 | 21:02:03 <robla> kaldari: thanks for getting this filed! |
12 | 21:02:19 <kaldari> NP :) |
13 | 21:02:49 <robla> Niharika here? (or is timezone math kinda awful?) |
14 | 21:02:54 <Scott_WUaS> Hey :) |
15 | 21:02:57 <bd808> I don't think Niharika is going to make it. It's 02:30 in her timezone |
16 | 21:02:58 <bd808> a bit late/early for an RfC meeting ;) |
17 | 21:03:01 <kaldari> robla: too late for her :( |
18 | 21:03:24 <MaxSem> wow, she has a life? |
19 | 21:03:57 <robla> ok...we can potentially have a followup meeting at a different time to accomodate. this was billed as a "Problem definition" meeting, so more discussions to come (hopefully) |
20 | 21:04:12 <kaldari> true |
21 | 21:04:41 <robla> anyway, the list of questions: "Is keeping a central table of all recent changes for all wikis a doable option?" |
22 | 21:05:06 <robla> #info question discussed: Is keeping a central table of all recent changes for all wikis a doable option? |
23 | 21:05:14 <robla> so....is it? |
24 | 21:05:35 <RoanKattouw> I think the discussion on the task already veered to "you don't need a global RC table, just a global watchlist table" |
25 | 21:05:40 <TimStarling> I just pinged jynus, hopefully he can join |
26 | 21:05:42 <matt_flaschen> It also would be nice if jynus was here. |
27 | 21:05:43 <MaxSem> really a DBA question. how much data are we going to store? how often will it deadlock on inserts? |
28 | 21:05:45 <gwicke> do we have data on the perf requirements? |
29 | 21:05:50 <RoanKattouw> Or at least that's what I said on the task :D |
30 | 21:05:51 * RoanKattouw is catching up on task comments now |
31 | 21:05:56 <MaxSem> s/dead// |
32 | 21:05:56 <gwicke> mean / max watch list sizes, etc |
33 | 21:06:05 <TimStarling> I thought his idea of recentchanges tables replicated to a single server was interesting |
34 | 21:06:10 <matt_flaschen> RoanKattouw, yeah, but I'm not sure if that's really because we can't do global, or just because "well, I guess we can limit to 10" (which might still have worse perf). |
35 | 21:06:19 <kaldari> RoanKattouw: There have been comments suggesting a central RC table (and also comments suggesting other approaches) |
36 | 21:06:27 <TimStarling> if they are all on one server, then 10 is probably way too low |
37 | 21:06:30 <matt_flaschen> Yeah, me too. It should solve at least global_rc, but global_watchlist I'm not sure about. |
38 | 21:07:01 <matt_flaschen> (me too re the replication ideas). |
39 | 21:07:34 <kaldari> The idea of limiting to 10 was mainly to limit the impact of having to do cross-wiki queries, which would be eliminated by having a cental RC table AFAIK. |
40 | 21:08:01 <Scott_WUaS> (Might this not be inevitable eventually possibly?) |
41 | 21:08:23 <gwicke> Jaime said "Watchlist and recentchanges are relatively small-medium tables." |
42 | 21:08:24 <matt_flaschen> I particularly thought jynus's idea of a central RC table (using the wiki tags + original row) was interesting. It sounded like some kind of view. I don't think that will work for watchlist, but it should for RC. |
43 | 21:09:24 <gwicke> do we expect issues from heavy watchers across many projects? |
44 | 21:09:37 <SMalyshev> I wonder why having single watchlist but separate rctables means a lot of queries? Can't we do a union query? Or it'll be too heavy? |
45 | 21:09:47 <kaldari> gwicke: depends on the implementation |
46 | 21:09:51 <TimStarling> with regular mysql replication you would presumably end up with one table per wiki |
47 | 21:10:01 <RoanKattouw> SMalyshev: You can't union across multiple DBs that are on different servers |
48 | 21:10:06 <matt_flaschen> gwicke, There are some legitimate heavy watcher use cases (especially a few pages on many, many wikis), such as stewards and small-wiki counter-vandalism. |
49 | 21:10:15 <matt_flaschen> And we need to handle deliberate attempts to overload the system. |
50 | 21:10:21 <TimStarling> with one table per wiki replicated to a single server, you could do a union |
51 | 21:10:23 <quiddity> (and global interface editors, and global sysops) |
52 | 21:10:26 <gwicke> matt_flaschen: do we have numbers (order of magnitude) for those? |
53 | 21:10:39 <TimStarling> unions still have a cost per table associated with them |
54 | 21:10:47 <SMalyshev> RoanKattouw: ah, we don't have all wikis in same server... can't we then do N=number of shares queries at least instead of number of wikis? Won't it be better? |
55 | 21:11:13 <matt_flaschen> gwicke, I don't. |
56 | 21:11:13 <TimStarling> but if you know which wikis you want to query in advance, you don't have to put every wiki in the union every time |
57 | 21:11:28 <legoktm> when Gather was being developed, we had an RfC discussion about re-doing the watchlist schema...were those ideas taken into account for how a cross-wiki watchlist would work? |
58 | 21:11:29 <musikanimal> ^ |
59 | 21:11:31 <matt_flaschen> SMalyshev, what does "number of shares" mean? |
60 | 21:11:45 <RoanKattouw> matt_flaschen: Shards I think |
61 | 21:11:48 <RoanKattouw> So 7 or 8 |
62 | 21:11:48 <SMalyshev> matt_flaschen: number of shards |
63 | 21:11:49 <TimStarling> jynus said "With some trickery even single megatables with wiki key + original row would be possible" |
64 | 21:12:11 <matt_flaschen> Yeah, I think that solves global_recentchanges. |
65 | 21:12:16 <TimStarling> presumably he means modifying the binlog stream |
66 | 21:12:28 <kaldari> legoktm: IDs have been added to the watchlist table, and there is strong potential for a new watchlist_props table in the near future. |
67 | 21:12:48 <TimStarling> I guess there is a library to do that since I know there are other tools that modify the binlog stream |
68 | 21:12:52 <kaldari> legoktm: not aware of other proposed schema changes |
69 | 21:12:58 <RoanKattouw> If you had a global RC megatable and a global watchlist table would that be enough? Do we currently join RC/watchlist against any other tables like revision, logging, etc.? |
70 | 21:13:07 <SMalyshev> also, for most watchlists most entries won't be changed... so we don't need a lot of data to know that. We need more data only for those that changed |
71 | 21:13:10 <RoanKattouw> (I suspect not because RC is a denormalized table, but it's MediaWiki so who knows) |
72 | 21:13:21 <kaldari> legoktm: do you have a URL for that RfC? |
73 | 21:13:27 <gwicke> TimStarling: I understood his comment to mean that it contains the original row's content |
74 | 21:13:34 <TimStarling> no, there's no need to join on other tables |
75 | 21:13:49 <gwicke> which can be achieved with or without the binlog |
76 | 21:13:50 <RoanKattouw> Good, because we would cut off that possibility permanently if we globalized the RC table |
77 | 21:13:51 <matt_flaschen> gwicke, I think he said the original row's content *plus* the wiki ID (e.g. enwiki) |
78 | 21:13:52 <TimStarling> watchlist/recentchanges is the only massive intersection |
79 | 21:13:57 <RoanKattouw> (of joining I mean) |
80 | 21:13:59 <kaldari> RoanKattouw: I believe that would be enough |
81 | 21:14:09 <gwicke> matt_flaschen: yeah |
82 | 21:14:12 <TimStarling> after you join those two, you have a chronological event list |
83 | 21:14:35 <gwicke> he also mentioned the Kafka route |
84 | 21:14:39 <legoktm> kaldari: the discussion was https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-01-14-21.00.log.html |
85 | 21:14:40 <gwicke> eventbus already has RC events |
86 | 21:14:49 <gwicke> we could basically follow that to keep a global table up to date |
87 | 21:14:53 <matt_flaschen> T1352 |
88 | 21:14:54 <stashbot> T1352: RFC: Support for user-specific page lists in core - https://phabricator.wikimedia.org/T1352 |
89 | 21:14:55 <matt_flaschen> ? |
90 | 21:14:56 <RoanKattouw> What about the tagging tables? |
91 | 21:15:14 <RoanKattouw> The RC view on Special:Recentchanges displays tags, does it do that by joining or by doing separate lookups? |
92 | 21:15:19 <anomie> Re joining, the existing Special:Watchlist joins against page for the non-expanded version, and against change_tag to display or filter by tags. |
93 | 21:15:23 <RoanKattouw> (I suppose it could always be changed from the former to the latter if need be) |
94 | 21:15:32 <TimStarling> well, I was going to say that display queries could be done after the join, in separate queries |
95 | 21:15:50 <RoanKattouw> Yeah, sure |
96 | 21:15:53 <TimStarling> but I suppose if you display 500 results from 50 wikis that could end up being a lot of display queries |
97 | 21:15:57 <RoanKattouw> As long as it's just pulling in extra data for display |
98 | 21:15:59 * DanielK_WMDE_ thinks modifying the binlog stream sounds scary |
99 | 21:16:09 <RoanKattouw> But if you're using it for WHERE conditions it's harder |
100 | 21:16:13 <TimStarling> yes |
101 | 21:16:19 <RoanKattouw> For example, how do you do the tag filter on Special:RC |
102 | 21:16:33 <TimStarling> it's probably inefficient already |
103 | 21:16:34 <gwicke> group by project & send out a handful of batches at most |
104 | 21:16:43 <matt_flaschen> DanielK_WMDE_, it does a little from the sound of it, but jynus does not normally take unnecessary risks from what I can tell. |
105 | 21:16:54 <RoanKattouw> TimStarling: Yeah I'm pretty sure those queries already suck |
106 | 21:17:06 <gwicke> could also denormalize some data into this global table if that speeds things up |
107 | 21:17:06 <TimStarling> if we want efficient tag filters then we will have to replicate those tables to the central server |
108 | 21:17:38 <matt_flaschen> We could probably use jynus's original row + wiki tag megatable for the tags. |
109 | 21:17:44 <RoanKattouw> Hmm, I was thinking if we could denormalize tags into the RC schema but because of the one-to-many thing that's probably hard |
110 | 21:18:10 <RoanKattouw> Also I think you can tag things after the fact |
111 | 21:18:52 <RoanKattouw> matt_flaschen: Maybe, but let's be careful here |
112 | 21:19:11 <TimStarling> so tags can go in the too-hard basket for now? |
113 | 21:19:13 <RoanKattouw> Now if there's something else that is joined against the tag tables, you need to pull that in too, and before long you're globalizing the entire database |
114 | 21:19:19 <gwicke> suppression needs to be handled as well; there are events for that in eventbus, too, but it's work to follow those |
115 | 21:19:38 * robla lets this conversation continue because it looks like a good conv. PM me if you'd like me to move things along |
116 | 21:19:38 <TimStarling> unless we need them for feature parity? |
117 | 21:19:43 <RoanKattouw> gwicke: How would suppression work differently from how it already does? |
118 | 21:20:24 <gwicke> RoanKattouw: this depends on whether we go to the primary db for the watchlist view, or not |
119 | 21:20:30 <RoanKattouw> Re tagging, if you globalize that then you need to do lookup queries against the global database when rendering the history page, and if there's a feature somewhere that, say, filters Special:Log or history by tags (don't know if that exists) then you're screwed |
120 | 21:20:47 <gwicke> if we have everything in the global tables, then we'd also need to handle suppression at that level |
121 | 21:20:51 <RoanKattouw> Right, I see |
122 | 21:20:56 <RoanKattouw> It's handled in the revision table now |
123 | 21:21:15 <matt_flaschen> RoanKattouw, isn't it also in RC (rc_deleted), or is that a red herring? |
124 | 21:21:17 <TimStarling> is the idea to show global watchlists via the existing UI? or would there be a separate special page? |
125 | 21:21:27 <kaldari> Having aggregated central mega-tables for watchlist, watchlist_props (doesn't exist yet), recentchanges, and tags sounds like it would take a lot of DB storage. If we went down that route, would this likely require additional hardware? |
126 | 21:21:28 <matt_flaschen> TimStarling, existing UI with current mockups. |
127 | 21:21:36 <kaldari> TimStarling: existing UI |
128 | 21:21:39 <RoanKattouw> matt_flaschen: Yeah maybe it is |
129 | 21:21:57 <gwicke> kaldari: some extra machines / new cluster, probably |
130 | 21:21:58 <RoanKattouw> I don't know enough about how suppression works |
131 | 21:22:05 <TimStarling> so you need feature parity with all its filters, ideally |
132 | 21:22:08 <anomie> I hope not existing UI, I'd prefer to keep my per-wiki watchlists instead of one giant mashup. |
133 | 21:22:27 <TimStarling> namespace dropdown anyone? |
134 | 21:22:28 <matt_flaschen> anomie, it has UI to choose which to show, you should just be able to leave it on current-wiki and have a sticky preference. |
135 | 21:22:38 <RoanKattouw> Ouch, the namespace dropdown is painful |
136 | 21:22:56 <RoanKattouw> You could denormalize your way out of that but the change propagation would be annoying |
137 | 21:23:25 <matt_flaschen> anomie: https://meta.wikimedia.org/wiki/Community_Tech/Cross-wiki_watchlist#/media/File:XWL_03_just_Enwp.jpg |
138 | 21:23:29 <gwicke> we do have the events already, so it's really not too hard technically |
139 | 21:23:29 <kaldari> We can decide on the UI later. Either way, I don't think it will affect the back-end implementation. |
140 | 21:23:31 <gwicke> but it's annoying |
141 | 21:23:55 <DanielK_WMDE_> so, what if we leave SQL out of the picture entirely? Can we use Elastic for this? What other options do we have? |
142 | 21:23:57 <gwicke> in terms of code, it would probably an event handler of a few dozen lines for suppression |
143 | 21:24:44 <gwicke> DanielKWMDE: we currently use a join strategy, which makes updates cheap |
144 | 21:24:48 <kaldari> is Redis an option? |
145 | 21:24:56 <anomie> RoanKattouw: RevDel (including suppression via RevDel) uses rc_deleted. I don't know if we still have other suppression, or if we got rid of that now. |
146 | 21:25:03 <TimStarling> I guess the namespace dropdown can refer only to the local wiki |
147 | 21:25:06 <RoanKattouw> Right |
148 | 21:25:10 <RoanKattouw> I think there's rev_deleted too right? |
149 | 21:25:15 <robla> I just realized that we don't have a shepherd associated with this RFC. I'm going to assign RoanKattouw, but we can discuss that in the last 10 minutes of this hour |
150 | 21:25:16 <gwicke> watchlists are accessed rarely & are not that performance critical, so this is a decent trade-off |
151 | 21:25:20 <TimStarling> if you pick an option from it, you're implicitly selecting the local wiki |
152 | 21:25:21 <RoanKattouw> rc_deleted can't be the only thing that's used because it's not persistent forever |
153 | 21:25:50 <gwicke> so re storage options, it would ideally support joins |
154 | 21:25:52 <anomie> A RevDel action updates any/all of rev_deleted, log_deleted, and rc_deleted, as applicable. |
155 | 21:25:52 <matt_flaschen> RoanKattouw, yeah, there is rev_deleted, ar_deleted, log_deleted. Hopefully rc_deleted is just a de-normalized copy. |
156 | 21:25:59 <gwicke> which narrows the field substantially |
157 | 21:26:49 <DanielK_WMDE_> gwicke: wouldn't support for intersections be enough? |
158 | 21:26:56 <musikanimal> I think we should definitely have a separate Special page. I can already hear the angry people spitting expletives, but additionally there are many watchlist user scripts or gadgets that could break |
159 | 21:27:01 <gwicke> afaik, of the storage tech we currently use, only MySQL fits the bill |
160 | 21:27:19 <anomie> matt_flaschen: Eew, lots of screen space taken for a feature I don't care about. |
161 | 21:27:46 <gwicke> elasticsearch can do intersection, but it's not nearly as efficient as a join |
162 | 21:27:50 <matt_flaschen> anomie, they're early mockups. We shouldn't focus too much on UI, though I take the point that limiting the UI can allow simpler technical decisions sometimes. |
163 | 21:28:24 <kaldari> Getting back to the original question, it sounds like no one has ruled out the possibility of having a centralized mega recentchanges table. Is that correct? In other words, is it an option we should keep on the table? |
164 | 21:28:37 <gwicke> if we went for denormalized (push) storage instead, then there would be a lot more options, and reads would be very fast |
165 | 21:28:50 <gwicke> but you get a lot of other issues in exchange |
166 | 21:28:51 <matt_flaschen> kaldari, yeah it seems pretty good especially with jynus's replication idea. |
167 | 21:29:20 <robla> #info 14:28:25 <kaldari> Getting back to the original question, it sounds like no one has ruled out the possibility of having a centralized mega recentchanges table. Is that correct? In other words, is it an option we should keep on the table? |
168 | 21:29:20 <RoanKattouw> kaldari: I think that's right, but I really miss having jynus in this meeting to talk about it |
169 | 21:29:47 <robla> #info 14:29:21 <RoanKattouw> kaldari: I think that's right, but I really miss having jynus in this meeting to talk about it |
170 | 21:30:09 <gwicke> it sounds like he has an idea of the sizes, but it might be worth double-checking the combined rctable sizes in particular |
171 | 21:30:49 <RoanKattouw> Yeah |
172 | 21:30:50 <kaldari> gwicke: definitely |
173 | 21:31:26 <matt_flaschen> And the watchlist tables. On enwiki, there are about 18x as many watchlist entries as RC, but I didn't check the data size. |
174 | 21:31:57 <gwicke> oh, right - I forgot that rc is not complete |
175 | 21:32:20 <RoanKattouw> kaldari: One issue with a global RC table that I mentioned is that you'd also need to use it for the normal RC (i.e. Special:Recentchanges), and then some of the filters are harder to do, especially tag filtering |
176 | 21:32:25 <matt_flaschen> #info Check total size of RC tables and watchlist tables |
177 | 21:32:39 <SMalyshev> I still wonder though if it' won't we useful to have smaller global table for just matching against watchlist, and then use particular tables for data retrieval (which will be small probably) |
178 | 21:32:48 <RoanKattouw> (Thankfully the namespace is denormalized in RC so that's not an issue) |
179 | 21:32:52 <robla> #info robla says: this doesn't have to be the only RFC-related meeting of the week; just the one regular time I currently plan to facilitate |
180 | 21:33:11 <anomie> RoanKattouw: Why, unless you wanted it to be Special:GlobalRecentChanges? |
181 | 21:33:12 <TimStarling> Special:Watchlist in the current code joins on watchlist, recentchanges, change_tags, and optionally page |
182 | 21:33:30 <RoanKattouw> anomie: Well I suppose we could store all RC data twice... but ideally we wouldn't, right? |
183 | 21:33:34 <TimStarling> depending on the user option extendwatchlist |
184 | 21:33:46 <RoanKattouw> Or did jynus's plan mean that he wanted to *replicate* the existing RC tables to a separate global table using Magic? |
185 | 21:34:01 <kaldari> RoanKattouw: Why would we need to use the centralized table for Special:RecentChanges as well? Do you think it would inefficient to have local copies for each wiki? |
186 | 21:34:02 <RoanKattouw> (As opposed to *moving* RC storage from local to global) |
187 | 21:34:12 <gwicke> RoanKattouw: that's how I understood him |
188 | 21:34:20 <anomie> "replicate" was how I read it, but I could have misread. |
189 | 21:34:23 <gwicke> whether the old tables are dropped or not is a separate question |
190 | 21:34:26 <RoanKattouw> kaldari: I mean other that it being a theoretical waste of space, no |
191 | 21:34:26 <RoanKattouw> Oh, OK |
192 | 21:34:27 <RoanKattouw> Then I misunderstood his plan, and there's no issue |
193 | 21:34:31 <Scott_WUaS> kaldari: sounds likely |
194 | 21:34:38 <TimStarling> he said "We can replicate all the watchlists and the rcs only of all selected wikis (global login) to this central server." |
195 | 21:34:53 <RoanKattouw> The watchlist table however is not easy to replicate that way, we may want/need to actually globalize that one |
196 | 21:34:55 <TimStarling> "You can get budget for a couple of extra servers." |
197 | 21:34:56 <RoanKattouw> Because of the user ID issue |
198 | 21:35:15 <RoanKattouw> kaldari: That answers your money question then. Sounds like this would live on dedicated hardware |
199 | 21:35:45 <kaldari> RoanKattouw: can you elaborate on the User ID issue? I thought SUL would make this work OK. |
200 | 21:35:47 <RoanKattouw> OK now I also understand why people were making such a big deal of wl_user and CentralIDLookup in the task |
201 | 21:35:52 <DanielK_WMDE_> #info Special:Watchlist in the current code joins on watchlist, recentchanges, change_tags, and optionally page |
202 | 21:35:59 <matt_flaschen> kaldari, it does, but the existing watchlist tables don't have global user ID. |
203 | 21:36:05 <RoanKattouw> kaldari: Well you can't usefully replicate watchlist rows verbatim, because then wl_user will be the local user ID |
204 | 21:36:06 <kaldari> oh |
205 | 21:36:07 <matt_flaschen> So you can't just use wiki tag + original row, as I was saying in the task. |
206 | 21:36:12 <RoanKattouw> So there are a few ways of dealing with that |
207 | 21:36:35 <matt_flaschen> I was hoping there was a clever way to join from local user ID to global user ID, but I don't see local user ID in the centralauth tables. |
208 | 21:36:36 <RoanKattouw> 1) you modify MW core to store the global user ID in wl_user always. That's ugly and confusing and probably breaks other joins/things |
209 | 21:36:37 <DanielK_WMDE_> #info <RoanKattouw> kaldari: Well you can't usefully replicate watchlist rows verbatim, because then wl_user will be the local user ID |
210 | 21:36:46 <RoanKattouw> 2) The replication script is very smart and maps the user IDs when replicating from local to global |
211 | 21:36:54 <RoanKattouw> 3) You kill the local table and only use the global one |
212 | 21:37:26 <tgr> just add wl_global_user to the watchlist |
213 | 21:37:28 <kaldari> Would it make sense to add a new "global user ID" row to the watchlist table? |
214 | 21:37:30 <tgr> not elegant but works |
215 | 21:37:34 <RoanKattouw> Hmm that would work |
216 | 21:37:36 <RoanKattouw> Great idea |
217 | 21:37:43 <RoanKattouw> We could then index the local and global tables differnetly |
218 | 21:38:00 <RoanKattouw> Thanks tgr and kaldari for coming up with that idea within 2 seconds of each other ;) |
219 | 21:38:00 <bd808> and core has a global user concept now |
220 | 21:38:02 <anomie> Note then you'd have to make sure to update it on attachment and unattachment. |
221 | 21:38:33 <RoanKattouw> anomie: That does suck, but realistically, do users get attached/unattached after having made edits/performed actions in practice? |
222 | 21:38:37 <RoanKattouw> I guess maybe the user creation log entry? |
223 | 21:38:48 <DanielK_WMDE_> #info <tgr> just add wl_global_user to the watchlist |
224 | 21:38:49 <anomie> RoanKattouw: A global rename now unattaches users, then reattaches them after the rename. |
225 | 21:38:54 <RoanKattouw> aah |
226 | 21:39:01 <RoanKattouw> OK yeah |
227 | 21:39:24 <RoanKattouw> Also Matt claims that CA has no table mapping local user IDs to global ones but that could presumably be added if we needed it |
228 | 21:39:46 <RoanKattouw> not that I look forward to the join gymnastics that he seems to be envisioning |
229 | 21:40:08 <matt_flaschen> RoanKattouw, when I first thought of that I thought the wiki tag + original row was a special view, not binlog dark arts. |
230 | 21:40:10 <tgr> on Wikimedia servers global attachedness status should never really change, rename unattaches/attaches but the ID will be the same |
231 | 21:40:11 <matt_flaschen> But it could work. |
232 | 21:40:20 <tgr> on other wikis anything could happen though |
233 | 21:40:23 <matt_flaschen> RoanKattouw, also, technically the local user ID is in there as gu_salt. ;) |
234 | 21:40:30 <robla> could the global ID be a hash of the local ID and the wiki ID? that way, the mapping would stay the same, but we could make it clear: "don't try assuming this mapping" |
235 | 21:40:32 <RoanKattouw> LOL |
236 | 21:40:56 <TimStarling> that is true, there is no mapping to user_id in CA |
237 | 21:41:15 <TimStarling> that was an implementation error in the original CA, IMHO, which has never been corrected |
238 | 21:41:17 <RoanKattouw> robla: But then I can't easily do queries like "give me all watchlist rows for User:Catrope across all wikis". That's what the global ID is or |
239 | 21:41:21 <RoanKattouw> *is for |
240 | 21:41:47 <quiddity> <tgr> on other wikis anything could happen though <--- would this prevent usage of the new functionality by other wikifarms? |
241 | 21:41:54 <RoanKattouw> Even with Matt and Tim's hypothetical table that provides that mapping, I don't think that query would be practical |
242 | 21:42:07 <kaldari> I imagine some maintenance script could be written to populate the global_user_id fields in the local watchlist tables |
243 | 21:42:09 <RoanKattouw> You probably just need the global user ID to be there, one way or another |
244 | 21:42:12 <anomie> RoanKattouw: `localwiki.user JOIN centralauth.globaluser ON(user_name = gu_name)`. Add in `JOIN localuser ON (lu_wiki=$wiki AND lu_name = gu_name)` if you care about the possibility of unattached accounts. |
245 | 21:42:36 <RoanKattouw> anomie: Sure but now how do I do "all watchlist rows for global user ID 57 regardless of wiki" |
246 | 21:42:51 <RoanKattouw> That, in my understanding, is the main motivation for globalizing the tables etc |
247 | 21:43:21 <gwicke> if we keep local tables, can't we just keep using those for unattached local users? |
248 | 21:43:22 <RoanKattouw> Oh, sorry, you're telling me how to map user IDs |
249 | 21:43:41 <RoanKattouw> You can't join between the centralauth DB and local wiki DB generally speaking because of sharding |
250 | 21:44:00 <RoanKattouw> kaldari: Yes, we would need to do that |
251 | 21:44:07 <TimStarling> unless you also replicate the user table to the central server |
252 | 21:44:15 <kaldari> lol |
253 | 21:44:42 <matt_flaschen> I guess that mega-user table would only be need to rewrite the user ID at replication time. |
254 | 21:44:53 <gwicke> given a local user, can we figure out the global user id if the user is attached? |
255 | 21:45:10 <TimStarling> but like I say, I think it's a bug, I think user_id mapping should be added to CA |
256 | 21:45:14 <matt_flaschen> Or we can short-circuit this and just make global_watchlist a real mega-table, not a view/binlog thing. |
257 | 21:45:14 <RoanKattouw> gwicke: Yes, but only by using >=2 DB connections at present |
258 | 21:45:23 <TimStarling> I don't think we should bend over backwards trying to accomodate that limitation |
259 | 21:45:47 <RoanKattouw> Or as Tim says we could add that mapping in CA, and then co-locate the global watchlist table with the CA database |
260 | 21:45:59 <bd808> Adding local wiki user_id to CA should be pretty easy I think |
261 | 21:46:05 <DanielK_WMDE_> #info <TimStarling> I think it's a bug, I think user_id mapping should be added to CA |
262 | 21:46:11 <RoanKattouw> That might save kaldari's budget from dedicated hardware purchases :) if the hardware can take it |
263 | 21:46:19 <addshore> And adding user_id to ca makes sense! |
264 | 21:46:24 <kaldari> Which team is in charge of CA? |
265 | 21:46:31 <RoanKattouw> LOL |
266 | 21:46:35 <bd808> kaldari: a fine question |
267 | 21:46:38 <matt_flaschen> RoanKattouw, does that really save any storage, or just change what software does the writes (replication vs. explicit)? |
268 | 21:46:41 <RoanKattouw> We literally had a meeting hours ago where we argued about that |
269 | 21:46:53 <anomie> I'd say add the column to the localuser table. And at the same time possibly add gu_id, instead of having to use the name to join. |
270 | 21:46:59 <bd808> s/argued/discussed/ |
271 | 21:47:00 <ori> last person to say "not it" |
272 | 21:47:05 <bd808> not it |
273 | 21:47:07 <anomie> not it |
274 | 21:47:08 <RoanKattouw> matt_flaschen: It doesn't, I just meant that it would incentivize us to locate CA and global watchlist on the same DB servers, meaning kaldari doesn't have to buy his own servers |
275 | 21:47:10 <robla> not it |
276 | 21:47:14 <RoanKattouw> I was mostly joking |
277 | 21:47:16 <TimStarling> we need a fake team that we can use as a sarcastic answer for questions like this |
278 | 21:47:29 <bd808> "mediawiki-core" |
279 | 21:47:29 <Scott_WUaS> TimStarling: user_id mapping should be added to CA makes much sense |
280 | 21:47:31 <James_F> TimStarling: I know! |
281 | 21:47:34 <ori> Site Reliability |
282 | 21:47:36 <quiddity> Tiger Team! >.> |
283 | 21:47:43 <RoanKattouw> haha |
284 | 21:47:47 <James_F> TimStarling: We could call it the "Admin tools team" which was going to be responsible. |
285 | 21:47:52 <gwicke> Platypus |
286 | 21:47:56 <dannyh> ha |
287 | 21:47:59 <RoanKattouw> Anyway, summing up |
288 | 21:48:00 * James_F coughs. |
289 | 21:48:06 <matt_flaschen> Stability team |
290 | 21:48:08 <quiddity> Does this discussion help answer/refine the question about how many simultaneous watchlists we could support? (i.e. the previously suggested limit of 10 wikis per account. I'm hoping we can raise that to at least 20, and perhaps 40 or more.) |
291 | 21:48:09 <RoanKattouw> It sounds like people are in favor of the following approach, if feasible: |
292 | 21:48:19 <RoanKattouw> 1) Add user ID to the CA tables |
293 | 21:48:21 <robla> yes...so RoanKattouw , can I assign you as shepherd for this one? |
294 | 21:48:34 <matt_flaschen> quiddity, we may not need that limitation if we take the global approach. No promises yet. :) |
295 | 21:48:45 <RoanKattouw> 2) Use Jaime's replication dark magic and additionally have it rewrite/populate wl_user/wl_global_user with the global user ID, which it derives from the CA tables |
296 | 21:48:58 <RoanKattouw> 2b) this probably means the global watchlsit table and the CA tables have to be on the same DB cluster |
297 | 21:48:59 <bd808> quiddity: I *think* if we go for the central storage option then the answer is "all wikis in the farm" |
298 | 21:49:11 <RoanKattouw> Yes my belief is also that we wouldn't need a limit |
299 | 21:49:32 <gwicke> the replication system could just look up the ids from the CA cluster |
300 | 21:49:37 <gwicke> I don't think it needs to be co-located, as long as it can access CA |
301 | 21:49:47 <bd808> Do we limit number of pages watched on a single wiki? |
302 | 21:49:55 <James_F> Currently we don't. |
303 | 21:50:00 <TimStarling> I think we should |
304 | 21:50:00 <addshore> Nope |
305 | 21:50:08 <RoanKattouw> Another alternative that was mentioned was adding wl_global_user to the local watchlist tables, have local MW instances populate it by using existing CA stuff, and have that mindlessly be replicated acros |
306 | 21:50:09 <James_F> E.g. my personal account has ~20k pages watched. |
307 | 21:50:14 <quiddity> there are many users with watchlists in the low tens-of-thousands. |
308 | 21:50:15 <James_F> Which would be… bad to encourage. |
309 | 21:50:23 <TimStarling> I remember there was a guy with 150k pages on his watchlist at a time when enwiki only had 1M pages total |
310 | 21:50:28 <RoanKattouw> Do people feel that's better or worse than doing it in the replications tep? |
311 | 21:50:32 <RoanKattouw> *replication step |
312 | 21:50:39 <TimStarling> there wasn't really a good rationale for it, I asked him |
313 | 21:50:58 <RoanKattouw> TimStarling: Betacommandbot had 200k or 400k or something, when it tried to view Special:Watchlist it caused an OOM |
314 | 21:50:58 <TimStarling> if there was a limit he would have stopped clicking the button |
315 | 21:51:10 <RoanKattouw> Brion ran DELETE FROM watchlist WHERE wl_user=NNN and that query took 14 minutes |
316 | 21:51:15 <quiddity> Any discussion system that allows Topic/Thread watchlisting, is also a prolific incrementer of watchlist numbers... |
317 | 21:51:17 <matt_flaschen> RoanKattouw, I sort of think better in replication step. Otherwise, it's arbitrary why just there. |
318 | 21:51:28 <robla> #info RoanKattouw notes tentative buy-in for 1) Add user ID to the CA tables 2) Jaime's replication strategy + rewrite/populate wl_user/wl_global_user w/global user ID (derived from the CA tables) |
319 | 21:51:33 <RoanKattouw> matt_flaschen: I guess it would pollute the general MW core schema |
320 | 21:51:44 <matt_flaschen> Why not rev_global_user? (Clearly the answer is "we didn't need it", but that's not that satisfying when you're looking at a DB schema) |
321 | 21:51:50 <RoanKattouw> Right : |
322 | 21:51:51 <RoanKattouw> :) |
323 | 21:52:05 <addshore> I think one of the reasons for large watchlists is the difficulty people have to selectively clear them. |
324 | 21:52:12 <RoanKattouw> matt_flaschen: Other answer: "on enwiki the revision table is way too freaking large" ;) |
325 | 21:52:29 <bd808> addshore: weren't you going to fix that? ;) |
326 | 21:52:33 <robla> #info robla plans to assign RoanKattouw as shepherd |
327 | 21:52:36 <addshore> Mhhmhm |
328 | 21:52:41 <RoanKattouw> robla: Sounds reasonable |
329 | 21:53:09 <addshore> bd808: *in progress* |
330 | 21:53:23 <kaldari> A watchlist limit of 50,000 or 100,000 seems like it would be reasonable, and maybe deter some beans up noses. |
331 | 21:53:41 <bd808> for implementation, would this be an extension that replaces the normal Special:Watchlist? |
332 | 21:53:50 <addshore> I wonder how many people would hit that at present! |
333 | 21:53:51 <matt_flaschen> Watchlist expiry would also help with this, especially if it's a prominent option. |
334 | 21:54:03 <kaldari> bd808: probably |
335 | 21:54:03 <tgr> keeping global IDs segregated in the CA database is a good thing since it makes it much easier to switch to a different single-sign-on extension |
336 | 21:54:04 <bd808> or will Special:Watchlist in core have to learn about weird central tables that only exist on some farms? |
337 | 21:54:12 <addshore> matt_flaschen: indeed |
338 | 21:54:17 <matt_flaschen> T100508 |
339 | 21:54:18 <stashbot> T100508: Watchlist expiry: Watch pages for a specified time frame (2013) - https://phabricator.wikimedia.org/T100508 |
340 | 21:54:19 <RoanKattouw> addshore: Challenge accepted, I'll see if that query runs in a reasonable amount of time |
341 | 21:54:30 <matt_flaschen> (That is also a community wishlist request in its own right) |
342 | 21:55:01 <bd808> matt_flaschen: that's what I was taunting addshore about |
343 | 21:55:25 <addshore> On that front we are still waiting for the wl_id field to land on production databases. |
344 | 21:55:41 <kaldari> addshore: any ETA? |
345 | 21:55:58 <bd808> "when jamie gets it done" |
346 | 21:56:06 <addshore> It's getting closer :) |
347 | 21:56:07 <quiddity> https://phabricator.wikimedia.org/T2424 -- patch waiting for more reviews. |
348 | 21:56:11 <quiddity> See also* |
349 | 21:56:23 <addshore> By the end of the year? (: |
350 | 21:56:36 <quiddity> ("Add a direct unwatch link to entries on Special:Watchlist") /me pats stashbot |
351 | 21:56:54 <addshore> quiddity: I think we also looked at that a bit! |
352 | 21:57:18 * quiddity adds addshore to review https://gerrit.wikimedia.org/r/#/c/293892/ >.> |
353 | 21:57:21 <bd808> quiddity: stashbot won't expand urls, just task numbers |
354 | 21:57:29 <robla> we can continue the conversation in another channel. #wikimedia-tech? #wikimedia-operations? (or just move to Phab/email/async) |
355 | 21:57:41 <bd808> next steps? |
356 | 21:57:44 * robla gives 2 minute warning |
357 | 21:57:46 <addshore> Oooh, please do quiddity |
358 | 21:57:52 <matt_flaschen> quiddity, it's marked as a WIP, also is it blocked on design? |
359 | 21:58:03 <kaldari> Do we want to discuss the pluses and minuses of a non-central-recentchanges-table approach? |
360 | 21:58:17 <kaldari> And what that might look like? |
361 | 21:58:17 <Scott_WUaS> kaldari: yes |
362 | 21:58:20 * addshore waves goodbye! |
363 | 21:58:27 <quiddity> matt_flaschen, yes, it needs code and design feedback. |
364 | 21:58:40 <kaldari> although I guess we're out of time for now |
365 | 21:58:42 <gwicke> the big minus of local watchlists is that it's hard to scale to many projects |
366 | 21:58:47 <kaldari> I have to run do an interview |
367 | 21:58:53 <robla> yeah, we should wrap this up. I think next steps are more async conversation, and then RoanKattouw can bring this up for another one of these |
368 | 21:58:54 <addshore> quiddity: I'm sure we (wmde) would be happy to provide that 'maybe' |
369 | 21:59:06 <RoanKattouw> I'll write a summary on the task |
370 | 21:59:12 <Scott_WUaS> robla: great! |
371 | 21:59:14 <robla> RoanKattouw: thank you! |
372 | 21:59:35 <dannyh> thanks, everybody |
373 | 21:59:36 <RoanKattouw> kaldari: Input on that welcome. I have opinions too. So I'll ask that on the task and then reply with my view |
374 | 21:59:39 <robla> #info RoanKattouw plans to write a summary on the task |
375 | 21:59:48 <matt_flaschen> Thanks to kaldari, dannyh, jynus, and everyone |
376 | 21:59:53 <robla> thank you everyone! |
377 | 22:00:00 <robla> #endmeeting |
Architecture meetings | ||
---|---|---|
13:00 PT ArchCom Planning Meetings | upcoming | all since 2016-03-30 |
14:00 PT ArchCom-RFC Meetings | upcoming | all since 2015-09-09 |