Page MenuHomePhabricator

Investigation: Add old and new length columns to cu_changes
Closed, DeclinedPublic5 Story Points

Description

The cu_changes table has data copied from the recentchanges table, via a hook, but does not copy the rc_old_len and rc_new_len columns. These aren't that important for the CheckUser, but they are very much a "nice to have" for the Special:RangeContributions tool (T145912), so we can show the diff size for each edit, like you see on all other lists of edits (watchlist, recent changes, Special:Contribs). We could also start showing diff sizes in the CheckUser tool itself, which may be a welcomed change.

Because the data is copied via the RecentChange_save hook, I think if we were to ALTER the table we'd have to first remove the hook or else it would block updates to the recentchanges table.

The questions are:

  • Is it even feasible to add the cuc_old_len and cuc_new_len columns to cu_changes given the size of the table (17,705,197 rows on enwiki at the time of writing)?
  • Roughly how much time will the migration take?
  • Will CheckUser's be OK with the downtime, given the benefit of being able to efficiently show diff sizes moving forward?
  • As an alternative, can we compute the diff sizes as needed within the RangeContributions tool without noticeable effect on load time?
  • Would we be able to backfill the old data? What about the CU data that wasn't being copied while the migration was taking place?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 19 2017, 2:10 PM
MusikAnimal added a project: CheckUser.
MusikAnimal updated the task description. (Show Details)Jan 23 2017, 8:18 PM
MusikAnimal triaged this task as Normal priority.
MusikAnimal edited projects, added Community-Tech-Sprint; removed Community-Tech.
MusikAnimal set the point value for this task to 5.
MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.EditedJan 23 2017, 8:22 PM
MusikAnimal added a subscriber: jcrespo.

@jcrespo Sorry for the ping, hoping this one is a simple yes/no/try-this-instead. For most wikis I don't think altering cu_changes will be much of an issue, but on enwiki there are 17,705,197 rows (at the time of writing). So the main questions are if it is even feasible to alter this table, and if so, how long the migration would take. Many thanks!

This is not a simple question, you are basically asking me what I think on a design of a dataset/schema change I do not know much about. I will have to study the code and structure;,any extra links will speed up the process.

This is not a simple question, you are basically asking me what I think on a design of a dataset/schema change I do not know much about. I will have to study the code and structure;,any extra links will speed up the process.

Thanks for the quick reply! Not sure how much code you need to see, but here's a synopsis:

  • Basically, the cu_changes table is a copy of recentchanges, with some added info like the user agent. What's missing that we would really like to have is a copy of rc_old_len (int) and rc_new_len (int).
  • The data is copied via the RecentChange_save hook, so cu_changes gets written to every time recentchanges gets written to. The code where it does the actual copying (for edits) can be found here.
  • Data in cu_changes that is older than 90 days (as specified by $wgCUDMaxAge) is purged.
    • If $wgPutIPinRC is set, the CheckUser extension will also purge such rows in recentchanges, though I'm not really sure why because it seems this is also handled by the RecentChangesUpdateJob. That aside, I'm not aware of any other core tables that the CheckUser extension modifies.
  • There are also hooks to store logged actions like emails and account creation in cu_changes, but for those obviously old and new lengths of a revision will not apply, and be given a zero value.

Is there any other information you need? Let me know :) As I was collecting this for you, I'm noticing other possible implications caused by the cu_changes table being locked. For instance, the AbuseFilter extension writes to cu_changes directly via updateCheckUserData so that information is available to CheckUsers. We may need to temporarily remove that hook as well. Obviously this investigation is in the early stages, but if there is a "not gonna happen" as far as altering the main cu_changes table, I won't need to dive any deeper.

The following may sound like a noob question, but please have into account I wasn't familiar at all with this extension's persistent layer at all- not with the extension inner workings itself.

Basically, the cu_changes table is a copy of recentchanges

Why are some columns duplicated from rc, log, revisions? Is it because rows don't correspond 1:1 to rc/revision/logging rows, because it can have different purge policies or because the queries done require denormalization, or for simplification of the queries/reads?

MusikAnimal added a comment.EditedJan 23 2017, 11:55 PM

Why are some columns duplicated from rc, log, revisions? Is it because rows don't correspond 1:1 to rc/revision/logging rows, because it can have different purge policies or because the queries done require denormalization, or for simplification of the queries/reads?

I think as far as edits go they more or less do correspond 1:1, but yes given CU involves sensitive data I imagine the authors kept in mind that for some third party wikis privacy is not as important and they'd rather be able to run checks on older data. On top of that we need the user agent which isn't stored in recentchanges, and finally an efficient way to query for an IP range, which is accomplished with the cuc_ip_hex column (hence why we're piggybacking off of it for T145912)

MusikAnimal updated the task description. (Show Details)Jan 23 2017, 11:56 PM
DannyH moved this task from Untriaged to Bug backlog on the Community-Tech board.
1978Gage2001 moved this task from Triage to In progress on the DBA board.Dec 11 2017, 9:45 AM
1978Gage2001 moved this task from Triage to In progress on the DBA board.
Marostegui moved this task from In progress to Triage on the DBA board.Dec 11 2017, 11:07 AM
jcrespo moved this task from Triage to Done on the DBA board.Jun 15 2018, 3:36 PM

I don't think this is well thought, and I see some proposals here are not properly designed, at least from the point of view of WMF. Comment should be store on the comment store, there is a lot of redundant data to recentchanges, etc.- I don't think this should go on without a rethinking- at least from the WMF perspective (I have no opinion on mediawiki for non-WMF setups).

MusikAnimal closed this task as Declined.Jun 15 2018, 4:07 PM
MusikAnimal moved this task from Backlog to Closed on the CheckUser board.

This was originally requested as part of T145912, when we were intent on using cu_changes to add public IP range contributions functionality. As I'm sure you recall we ended up creating the ip_changes table with T163562. As such, I'll close this since I'm unaware of any desire for it beyond the original use case. Thanks for your time!

As I'm sure you recall

Don't be so sure :-), but kudos to me from the past, I guess. Thank you, I was just reviewing old requests, it is nice things are ok now.