Page MenuHomePhabricator

Investigation: Add old and new length columns to cu_changes
Closed, DeclinedPublic5 Estimated Story Points

Description

The cu_changes table has data copied from the recentchanges table, via a hook, but does not copy the rc_old_len and rc_new_len columns. These aren't that important for the CheckUser, but they are very much a "nice to have" for the Special:RangeContributions tool (T145912), so we can show the diff size for each edit, like you see on all other lists of edits (watchlist, recent changes, Special:Contribs). We could also start showing diff sizes in the CheckUser tool itself, which may be a welcomed change.

Because the data is copied via the [[ https://phabricator.wikimedia.org/diffusion/ECHU/browse/master/extension.json;ef4ca3223329cd45ec3db992e257af0792bd8189$35 | RecentChange_save hook ]], I think if we were to ALTER the table we'd have to first remove the hook or else it would block updates to the recentchanges table.

The questions are:

  • Is it even feasible to add the cuc_old_len and cuc_new_len columns to cu_changes given the size of the table (17,705,197 rows on enwiki at the time of writing)?
  • Roughly how much time will the migration take?
  • Will CheckUser's be OK with the downtime, given the benefit of being able to efficiently show diff sizes moving forward?
  • As an alternative, can we compute the diff sizes as needed within the RangeContributions tool without noticeable effect on load time?
  • Would we be able to backfill the old data? What about the CU data that wasn't being copied while the migration was taking place?

Event Timeline

MusikAnimal updated the task description. (Show Details)
MusikAnimal edited projects, added Community-Tech-Sprint; removed Community-Tech.
MusikAnimal set the point value for this task to 5.
MusikAnimal added a subscriber: jcrespo.

@jcrespo Sorry for the ping, hoping this one is a simple yes/no/try-this-instead. For most wikis I don't think altering cu_changes will be much of an issue, but on enwiki there are 17,705,197 rows (at the time of writing). So the main questions are if it is even feasible to alter this table, and if so, how long the migration would take. Many thanks!

This is not a simple question, you are basically asking me what I think on a design of a dataset/schema change I do not know much about. I will have to study the code and structure;,any extra links will speed up the process.

This is not a simple question, you are basically asking me what I think on a design of a dataset/schema change I do not know much about. I will have to study the code and structure;,any extra links will speed up the process.

Thanks for the quick reply! Not sure how much code you need to see, but here's a synopsis:

Is there any other information you need? Let me know :) As I was collecting this for you, I'm noticing other possible implications caused by the cu_changes table being locked. For instance, the AbuseFilter extension writes to cu_changes directly via updateCheckUserData so that information is available to CheckUsers. We may need to temporarily remove that hook as well. Obviously this investigation is in the early stages, but if there is a "not gonna happen" as far as altering the main cu_changes table, I won't need to dive any deeper.

The following may sound like a noob question, but please have into account I wasn't familiar at all with this extension's persistent layer at all- not with the extension inner workings itself.

Basically, the cu_changes table is a copy of recentchanges

Why are some columns duplicated from rc, log, revisions? Is it because rows don't correspond 1:1 to rc/revision/logging rows, because it can have different purge policies or because the queries done require denormalization, or for simplification of the queries/reads?

Why are some columns duplicated from rc, log, revisions? Is it because rows don't correspond 1:1 to rc/revision/logging rows, because it can have different purge policies or because the queries done require denormalization, or for simplification of the queries/reads?

I think as far as edits go they more or less do correspond 1:1, but yes given CU involves sensitive data I imagine the authors kept in mind that for some third party wikis privacy is not as important and they'd rather be able to run checks on older data. On top of that we need the user agent which isn't stored in recentchanges, and finally an efficient way to query for an IP range, which is accomplished with the cuc_ip_hex column (hence why we're piggybacking off of it for T145912)

1978Gage2001 moved this task from Triage to In progress on the DBA board.

I don't think this is well thought, and I see some proposals here are not properly designed, at least from the point of view of WMF. Comment should be store on the comment store, there is a lot of redundant data to recentchanges, etc.- I don't think this should go on without a rethinking- at least from the WMF perspective (I have no opinion on mediawiki for non-WMF setups).

MusikAnimal moved this task from Inbox to Done on the CheckUser board.

This was originally requested as part of T145912, when we were intent on using cu_changes to add public IP range contributions functionality. As I'm sure you recall we ended up creating the ip_changes table with T163562. As such, I'll close this since I'm unaware of any desire for it beyond the original use case. Thanks for your time!

As I'm sure you recall

Don't be so sure :-), but kudos to me from the past, I guess. Thank you, I was just reviewing old requests, it is nice things are ok now.