Consider renaming columns and/or table to abide by the data modeling guidelines
Open, Needs Triage, Public

Description

Since this table will go to prod after the data modeling guidelines were established, we should strive to abide by them:

I will share some thoughts on naming, for whatever they're worth. I recognize most of these names were used for historical continuity but since table migrations are so rare, I think it's worth taking the opportunity to clarify and standardize things for the next decade of users.

  • wmf_dumps isn't the most useful name, since lots of different datasets get dumped and most of them will be in other databases. What about wmf_content?
  • revision_timestamp, error_timestamp: according to the data modeling guidelines, we should use revision_dt and error_dt instead.
  • wiki_db: according to the data modeling guidelines, we should use wiki_id instead.
  • revision_is_minor_edit: it would be less redundant to use revision_is_minor
  • user_is_visible, revision_comment_is_visible, revision_content_is_visible: "visible" is actually quite a good term for this, so I personally kind of want to keep it, but it's not used elsewhere. The official name for the functionality is revision deletion, although some parts of the interface do use "visible". mediawiki_history provides this as an array named revision_deleted_parts. Maybe it's worth emulating that? 🤷🏽‍♂️
  • page_redirect_title: I think this is a tiny bit confusing (what is the title of a redirect?). Maybe page_redirect_target instead, as "target" seems to be the common term (e.g. on en:w:Wikipedia:Redirect and mw:Help:Redirects).
  • revision_size and content_size: according to the data modeling guidelines, these should be suffixed by the unit (revision_size_bytes and content_size_bytes).
  • row_last_update, row_visibility_last_update: according to the data modeling guidelines, these should be suffixed by _dt (although personally I find that a bit redundant)
  • content_body: the two words seem redundant to me. What about just content?
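
To make the proposals above easier to review in one place, here is a minimal sketch that collects the unambiguous renames into a mapping and prints one rename statement per column. The table name `example_db.example_table` is hypothetical, and the `ALTER TABLE ... RENAME COLUMN ... TO ...` statement shape is an assumption about what the query engine accepts. The visibility columns and the database name are left out because the list above does not settle on a replacement for them.

```python
from typing import Dict, List

# Old column name -> proposed name, taken from the bullets above.
# Columns without a settled proposal (the *_is_visible ones) are omitted.
PROPOSED_RENAMES: Dict[str, str] = {
    "revision_timestamp": "revision_dt",
    "error_timestamp": "error_dt",
    "wiki_db": "wiki_id",
    "revision_is_minor_edit": "revision_is_minor",
    "page_redirect_title": "page_redirect_target",
    "revision_size": "revision_size_bytes",
    "content_size": "content_size_bytes",
    "row_last_update": "row_last_update_dt",
    "row_visibility_last_update": "row_visibility_last_update_dt",
    "content_body": "content",
}


def rename_statements(table: str, renames: Dict[str, str]) -> List[str]:
    """Build one RENAME COLUMN statement per entry, in insertion order."""
    return [
        f"ALTER TABLE {table} RENAME COLUMN {old} TO {new}"
        for old, new in renames.items()
    ]


if __name__ == "__main__":
    # "example_db.example_table" is a placeholder, not the real table.
    for statement in rename_statements("example_db.example_table", PROPOSED_RENAMES):
        print(statement)
```

Keeping the mapping as data rather than hand-written statements means the same dictionary could also drive updates to downstream queries, if the renames are accepted.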

Event Timeline

We should probably add something to the data modeling guidelines about choosing between consistency with existing fields and best practices for new fields.

In MediaWiki state change events, in some cases we decided to stick closely with the MediaWiki db field names, in others we decided to go with newer guidelines.

Generally, I'd say if renaming an existing field will cause significantly more confusion and misunderstanding than keeping it the same, we should keep it the same and document the Modeling Guidelines violation.
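
One way to make "keep it the same and document the violation" checkable is a small lint with an explicit exception list. This is only a sketch: the two rules are the ones quoted in the task description (`_dt` instead of `_timestamp`, a unit suffix on sizes), while the function name, the exception entry, and its stated reason are hypothetical.

```python
from typing import Dict, List

# Hypothetical allow-list: column name -> documented reason for keeping it.
DOCUMENTED_EXCEPTIONS: Dict[str, str] = {
    "revision_timestamp": "illustrative reason: kept for continuity with existing consumers",
}


def guideline_violations(columns: List[str], exceptions: Dict[str, str]) -> List[str]:
    """Return one message per column that breaks a rule and is not a documented exception."""
    problems: List[str] = []
    for name in columns:
        if name in exceptions:
            # Violation is known and documented, so it is not reported.
            continue
        if name.endswith("_timestamp"):
            problems.append(f"{name}: use a _dt suffix instead of _timestamp")
        elif name.endswith("_size"):
            problems.append(f"{name}: add a unit suffix such as _bytes")
    return problems


if __name__ == "__main__":
    columns = ["revision_timestamp", "error_timestamp", "revision_size", "wiki_id"]
    for message in guideline_violations(columns, DOCUMENTED_EXCEPTIONS):
        print(message)
```

With this shape, an undocumented violation shows up in the output, while a deliberate one has to be written down together with its reason.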

While you are at this, consider looking at the event entity fragment schemas we made a year or so ago.

https://schema.wikimedia.org/#!/primary/jsonschema/fragment/mediawiki/state/entity

It would be nice if these could be close, since we did a lot of work bikeshedding names there.

We'd probably like to create a revision change stream one day too, to model things like visibility changes to past revisions. Since dumps is about revisions, it would be nice if whatever we come up with here aligns to what we might do in the future.

decided to stick closely with the MediaWiki db field names

E.g. we went with rev_id and rev_dt instead of revision_id and revision_dt. You can see it's a mix. rev_id is a foreign key for the revision table, so revision_id would be better. We decided that our _dt convention was important, so we decided not to keep _timestamp. But, for field name consistency, we kept the rev_ part, and got rev_dt. ¯\_(ツ)_/¯

xcollazo renamed this task from "Rename columns and/or table to abide by the data modeling guidelines" to "Consider renaming columns and/or table to abide by the data modeling guidelines". Fri, Jun 28, 3:53 PM