
Consider renaming columns and/or table to abide by the data modeling guidelines
Closed, Resolved · Public

Description

Since this table will go to prod after the data modeling guidelines were established, we should strive to abide by them:

I will share some thoughts on naming, for whatever they're worth. I recognize most of these names were used for historical continuity but since table migrations are so rare, I think it's worth taking the opportunity to clarify and standardize things for the next decade of users.

  • wmf_dumps isn't the most useful name, since lots of different datasets get dumped and most of them will be in other databases. What about wmf_content?
  • revision_timestamp, error_timestamp: according to the data modeling guidelines, we should use revision_dt and error_dt instead.
  • wiki_db: according to the data modeling guidelines, we should use wiki_id instead.
  • revision_is_minor_edit: it would be less redundant to use revision_is_minor
  • user_is_visible, revision_comment_is_visible, revision_content_is_visible: "visible" is actually quite a good term for this, so personally I kind of want to keep it, but it's not used elsewhere. The official name for the functionality is revision deletion, although some parts of the interface do use "visible". mediawiki_history provides this as an array named revision_deleted_parts. Maybe it's worth emulating that? 🤷🏽‍♂️
  • page_redirect_title: I think this is a tiny bit confusing (what is the title of a redirect?). Maybe page_redirect_target instead, as "target" seems to be the common term (e.g. on en:w:Wikipedia:Redirect and mw:Help:Redirects).
  • revision_size and content_size: according to the data modeling guidelines, these should be suffixed by the unit (revision_size_bytes and content_size_bytes).
  • row_last_update, row_visibility_last_update: according to the data modeling guidelines, these should be suffixed by _dt (although personally I find that a bit redundant)
  • content_body: the two words seem redundant to me. What about just content?
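For a compact view, the proposals above can be summarized as an old-to-new mapping. This is a sketch for reference only; where the shipped table diverged from the proposal, a comment notes the name it actually kept (per the DDL later in this task):

```python
# Sketch of the column renames proposed above (old -> new).
# Comments note where the final table kept the original name.
PROPOSED_RENAMES = {
    "revision_timestamp": "revision_dt",
    "error_timestamp": "error_dt",                   # errors column was later dropped entirely
    "wiki_db": "wiki_id",
    "revision_is_minor_edit": "revision_is_minor",   # final table kept revision_is_minor_edit
    "page_redirect_title": "page_redirect_target",
    "revision_size": "revision_size_bytes",          # final table kept revision_size
    "content_size": "content_size_bytes",            # final table kept content_size
    "content_body": "content",                       # final table kept content_body
}
```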

Regarding the table name, as per T366542#10150586, we have decided to rename it to:
wmf_content.mediawiki_content_history_v1

Regarding the reconcile event streams, as per T366542#10299725 and Slack, we have decided to rename them to:

mediawiki_content_history_reconcile
and
mediawiki_content_history_reconcile_enriched

Details

Related Changes in GitLab:
Title (Reference, Author, Source Branch -> Dest Branch):
  • Rename all 'Dumps 2.0' DAGs to 'MediaWiki Content' DAGs (repos/data-engineering/airflow-dags!1022, xcollazo, the-big-rename-pt-2 -> main)
  • Rename columns to their final names. (repos/data-engineering/mediawiki-content-pipelines!57, xcollazo, the-big-rename-pt-3 -> main)
  • Remove jobs for backfilling from Dumps 1.0. (repos/data-engineering/mediawiki-content-pipelines!56, xcollazo, the-big-rename-pt-2 -> main)
  • Table and column rename to follow precedent and modeling guidelines. (repos/data-engineering/mediawiki-content-pipelines!55, xcollazo, the-big-rename -> main)

Event Timeline

We should probably add something to data modeling guidelines about making choices between consistency with existing fields, and best practices for new fields.

In MediaWiki state change events, in some cases we decided to stick closely with the MediaWiki db field names, in others we decided to go with newer guidelines.

Generally, I'd say if renaming an existing field will cause significantly more confusion and misunderstanding than keeping it the same, we should keep it the same and document the Modeling Guidelines violation.

While you are at this, consider looking at the event entity fragment schemas we made a year or so ago.

https://schema.wikimedia.org/#!/primary/jsonschema/fragment/mediawiki/state/entity

It would be nice if these could be close, since we did a lot of work bikeshedding names there.

We'd probably like to create a revision change stream one day too, to model things like visibility changes to past revisions. Since dumps is about revisions, it would be nice if whatever we come up with here aligns to what we might do in the future.

decided to stick closely with the MediaWiki db field names

E.g. we went with rev_id and rev_dt instead of revision_id and revision_dt. You can see it's a mix: rev_id is a foreign key to the revision table, so revision_id would be better. We decided that our _dt convention was important, so we did not keep _timestamp. But, for field name consistency, we kept the rev_ part, and got rev_dt. ¯\_(ツ)_/¯

xcollazo renamed this task from Rename columns and/or table to abide by the data modeling guidelines to Consider renaming columns and/or table to abide by the data modeling guidelines.Jun 28 2024, 3:53 PM

Slack thread about this table name.

This table contains mediawiki revision content for multiple content models, not just wikitext.

Is mediawiki_revision_content a good name? Or something like that?

In a meeting yesterday with the Dumps 2.0 team, we agreed that the table name will be:

wmf_dumps.mediawiki_content_history

While a future table with just the current revision for each wiki page will be named:
wmf_dumps.mediawiki_content_current

Thank you!

wmf_dumps.mediawiki_content_current

FWIW, we'd probably bike shed this decision a bit more, buuuuuut wmf_dumps.mediawiki_content_history absolutely! Let's go.

In a meeting yesterday with the Dumps 2.0 team, we agreed that the table name will be:

wmf_dumps.mediawiki_content_history

Just kidding! Name will be:

wmf_dumps.mediawiki_content_history_v1

Regarding the reconcile event streams, we have decided to rename them to:

mediawiki_content_history_reconcile
and
mediawiki_content_history_reconcile_enriched

From MR 55:

In this MR we do the first step towards renaming wmf_dumps.wikitext_raw_rc2 to its final home at wmf_content.mediawiki_content_history_v1. We will rename the file itself separately to keep proper context.

  • Rename columns to better abide by the data modeling guidelines, but also follow the precedent set with the work done for the event entity fragment schemas.
  • Removes the errors column. This never worked as designed, and it would confuse end users while providing no operational benefit, so we are removing it. Table wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 is better suited to track data quality issues.
  • Removes the bloom filter on revision_id. This also did not work as intended. See https://phabricator.wikimedia.org/T375402 if interested in that saga.

@Ottomata for your review.

MR 56:

This MR removes the jobs backfill_create_intermediate_table.py and backfill_merge_into.py, as they are no longer needed, and it will be easier to rename columns elsewhere without worrying about keeping these correct for no benefit.

xcollazo changed the task status from Open to In Progress.Jan 9 2025, 9:09 PM
xcollazo claimed this task.
xcollazo triaged this task as High priority.
xcollazo moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

Ran the following as per latest DDL from MR 57:

hostname -f
an-launcher1002.eqiad.wmnet

sudo -u analytics bash
kerberos-run-command analytics spark3-sql

CREATE TABLE wmf_content.mediawiki_content_history_v1 (
    page_id                     BIGINT    COMMENT 'The (database) page ID of the page.',
    page_namespace_id           INT       COMMENT 'The id of the namespace this page belongs to.',
    page_title                  STRING    COMMENT 'The normalized title of the page. If page_namespace_id = 0, then this is the non namespaced title. If page_namespace_id != 0, then the title is prepended with the localized namespace. Examples for "enwiki": "Main_Page" and "Talk:Main_Page".',
    page_redirect_target        STRING    COMMENT 'title of the redirected-to page, if any. Same rules as page_title.',
    user_id                     BIGINT    COMMENT 'id of the user that made the revision; null if anonymous, zero if old system user, and -1 when deleted or malformed XML was imported',
    user_text                   STRING    COMMENT 'text of the user that made the revision (either username or IP)',
    user_is_visible             BOOLEAN   COMMENT 'Whether the user that made the revision is visible. If this is false, then the user should be redacted when shown publicly. See RevisionRecord->DELETED_USER.',
    revision_id                 BIGINT    COMMENT 'The (database) revision ID.',
    revision_parent_id          BIGINT    COMMENT 'The (database) revision ID of the parent revision.',
    revision_dt                 TIMESTAMP COMMENT 'The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.',
    revision_is_minor_edit      BOOLEAN   COMMENT 'True if the editor marked this revision as a minor edit.',
    revision_comment            STRING    COMMENT 'The comment left by the user when this revision was made.',
    revision_comment_is_visible BOOLEAN   COMMENT 'Whether the comment of the revision is visible. If this is false, then the comment should be redacted when shown publicly. See RevisionRecord->DELETED_COMMENT.',
    revision_sha1               STRING    COMMENT 'Nested SHA1 hash of hashes of all content slots. See https://www.mediawiki.org/wiki/Manual:Revision_table#rev_sha1',
    revision_size               BIGINT    COMMENT 'the sum of the content_size of all content slots',
    revision_content_slots      MAP<
                                    STRING,
                                    STRUCT<content_body:   STRING,
                                           content_format: STRING,
                                           content_model:  STRING,
                                           content_sha1:   STRING,
                                           content_size:   BIGINT
                                    >
                                >         COMMENT 'a MAP containing all the content slots associated to this revision. Typically just the "main" slot, but also "mediainfo" for commonswiki.',
    revision_content_is_visible BOOLEAN   COMMENT 'Whether revision_content_slots is visible. If this is false, then any content should be redacted when shown publicly. See RevisionRecord->DELETED_TEXT.',
    wiki_id                     STRING    COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.',
    row_content_update_dt       TIMESTAMP COMMENT 'Control column. Marks the timestamp of the last content event or backfill that updated this row',
    row_visibility_update_dt    TIMESTAMP COMMENT 'Control column. Marks the timestamp of the last visibility event or backfill that updated this row',
    row_move_update_dt          TIMESTAMP COMMENT 'Control column. Marks the timestamp of the last move event or backfill that updated this row'
)
USING ICEBERG
PARTITIONED BY (wiki_id)                             -- wiki_id partitioning is familiar to users
TBLPROPERTIES (
    'format-version' = '2',                          -- allow merge-on-read
    'write.format.default' = 'parquet',              -- parquet is currently the only format with min/max stats
    'write.target-file-size-bytes' = '134217728',    -- cap files at 128MB so executors with 1 core, 16GB RAM can read the entire table
    'write.metadata.previous-versions-max' = '10',
    'write.metadata.delete-after-commit.enabled' = 'true'
)
COMMENT 'Contains all of the revisions for all pages for all wikis. Updated on a daily basis.'
LOCATION '/wmf/data/wmf_content/mediawiki_content_history_v1';

ALTER TABLE wmf_content.mediawiki_content_history_v1 WRITE ORDERED BY wiki_id, page_id, revision_dt;

Now running the following to backfill all wikis except the usual suspects:

hostname -f
an-launcher1002.eqiad.wmnet

screen -S mw-content-history-backfill

sudo -u analytics bash

kerberos-run-command analytics spark3-sql --driver-cores 8 --driver-memory 32G --master yarn --conf spark.dynamicAllocation.maxExecutors=128 --conf spark.executor.memoryOverhead=3G --conf spark.sql.shuffle.partitions=2048 --executor-memory 16G --executor-cores 2 --name xcollazo-mw-content-history-backfill


INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db NOT IN ('enwiki', 'commonswiki', 'wikidatawiki')
ORDER BY wiki_db, page_id, revision_timestamp
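This same INSERT is rerun below for wikidatawiki, enwiki, and commonswiki, with only the WHERE clause changing. A hypothetical sketch of templating that clause (the helper name is mine, not from the pipeline code):

```python
# Hypothetical sketch: the backfill INSERTs in this task are identical
# except for their WHERE clause, so the clause can be templated per wiki group.
def backfill_where(wikis, exclude=False):
    """Build the WHERE clause for a per-wiki-group backfill INSERT."""
    op = "NOT IN" if exclude else "IN"
    quoted = ", ".join(f"'{w}'" for w in sorted(wikis))
    return f"wiki_db {op} ({quoted})"

# First run: everything except the three biggest wikis.
print(backfill_where(["enwiki", "commonswiki", "wikidatawiki"], exclude=True))
# -> wiki_db NOT IN ('commonswiki', 'enwiki', 'wikidatawiki')
```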

Yarn app: https://yarn.wikimedia.org/proxy/application_1734703658237_837267/

Job failed. Reattempting with:

kerberos-run-command analytics spark3-sql --driver-cores 8 --driver-memory 32G --master yarn --conf spark.dynamicAllocation.maxExecutors=160 --conf spark.executor.memoryOverhead=3G --conf spark.sql.shuffle.partitions=10240 --executor-memory 16G --executor-cores 1 --name xcollazo-mw-content-history-backfill

Yarn app: https://yarn.wikimedia.org/proxy/application_1734703658237_839578/


Done:

Response code
Time taken: 18165.134 seconds

That's 5 hours.

Now running wikidatawiki:

INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db IN ('wikidatawiki')
ORDER BY wiki_db, page_id, revision_timestamp

Same yarn app.


Done:

Response code
Time taken: 12883.735 seconds

That's 3.6 hours.

Now running enwiki:

INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db IN ('enwiki')
ORDER BY wiki_db, page_id, revision_timestamp

Same yarn app.


Done:

Response code
Time taken: 12420.584 seconds

That's 3.5 hours.

OK, for backfilling only commonswiki remains, but I will wait until T382953 is done, as I don't want to reconcile all of that again unnecessarily.

Now running commonswiki:

INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db IN ('commonswiki')
ORDER BY wiki_db, page_id, revision_timestamp

Same Yarn app: https://yarn.wikimedia.org/proxy/application_1734703658237_839578


Done:

Response code
Time taken: 1316.923 seconds

That's 22 mins.
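For reference, a quick sanity computation totaling the backfill wall-clock times reported above:

```python
# Per-group backfill durations reported above, in seconds.
durations_s = {
    "all wikis except the big three": 18165.134,
    "wikidatawiki": 12883.735,
    "enwiki": 12420.584,
    "commonswiki": 1316.923,
}
total_h = sum(durations_s.values()) / 3600
print(round(total_h, 1))  # -> 12.4 (total hours of wall-clock SQL time)
```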

xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1022

Rename all 'Dumps 2.0' DAGs to 'MediaWiki Content' DAGs

This is now merged; I had forgotten to tag it with the phab number.

Mentioned in SAL (#wikimedia-operations) [2025-01-17T17:11:51Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542.

Mentioned in SAL (#wikimedia-operations) [2025-01-17T17:12:24Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542. (duration: 00m 32s)

Mentioned in SAL (#wikimedia-analytics) [2025-01-17T17:13:38Z] <xcollazo> Deployed latest DAGs for 'analytics' Airflow instance. T366542.

There are schema changes to be applied for inconsistent_rows_of_mediawiki_content_history_v1. While we could run multiple ALTERs, we are still running Iceberg 1.2.1 in production, which has some bugs, and the data is small. Thus it seems easier to copy to a temp table and then recreate.
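The copy-and-recreate approach reduces to a fixed sequence of statements. A hypothetical sketch of the order of operations (the helper name and the commented placeholders are mine; the literal commands run are shown below):

```python
# Illustrative only: the temp-table swap used for this schema change,
# expressed as an ordered plan of SQL statements.
def swap_plan(table, temp):
    return [
        f"CREATE TABLE {temp} (/* new schema */)",
        f"INSERT INTO {temp} SELECT /* renamed columns */ FROM {table}",
        f"DROP TABLE {table}",                         # plus hdfs dfs -rm -r of its location
        f"CREATE TABLE {table} (/* new schema */)",
        f"INSERT INTO {table} SELECT * FROM {temp}",
        f"DROP TABLE {temp}",                          # plus hdfs dfs -rm -r of its location
    ]

plan = swap_plan(
    "wmf_content.inconsistent_rows_of_mediawiki_content_history_v1",
    "wmf_content.inconsistent_rows_of_mediawiki_content_history_v2",
)
```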

Get a session:

ssh an-launcher1002.eqiad.wmnet
sudo -u analytics bash

For SQL, I used:

kerberos-run-command analytics spark3-sql --driver-cores 8 --driver-memory 32G --master yarn --conf spark.dynamicAllocation.maxExecutors=128 --conf spark.executor.memoryOverhead=3G --conf spark.sql.shuffle.partitions=2048 --executor-memory 16G --executor-cores 2

Create temp v2 table:

CREATE TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v2 (
    wiki_id                        STRING              COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.',
    page_id                        BIGINT              COMMENT 'The (database) page ID of the page.',
    revision_id                    BIGINT              COMMENT 'The (database) revision ID.',
    revision_dt                    TIMESTAMP           COMMENT 'The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.',
    reasons                        ARRAY<STRING>       COMMENT 'The set of reasons detected that make us think we need to reconcile this revision.',
    computation_dt                 TIMESTAMP           COMMENT 'The logical time at which this inconsistency was calculated. Useful to see trends over time, and also to be able to delete data efficiently.',
    computation_class              STRING              COMMENT 'One of "last-24h" or "all-of-wiki-time". This segregates between runs that cover one day of inconsistencies as of computation_dt, versus runs that retroactively check all revisions as of computation_dt.',
    reconcile_emit_dt              TIMESTAMP           COMMENT 'The time at which this inconsistency was emitted for eventual reconcile. If NULL, it has not been submitted yet.'
)
USING ICEBERG
PARTITIONED BY (wiki_id, computation_class)
TBLPROPERTIES (
    'format-version' = '2',                          -- allow merge-on-read if needed
    'write.format.default' = 'parquet',              -- parquet is currently the only format with min/max stats
    'write.target-file-size-bytes' = '134217728',    -- cap files at 128MB
    'commit.retry.num-retries' = '10'                -- bump retries from default of 4 due to many concurrent INSERTs
)
COMMENT 'We make checks between wmf_content.mediawiki_content_history_v1 and the Analytics replicas to detect inconsistent rows. If we do detect any, we add them here, to be reconciled, alerted, and analyzed.'
LOCATION '/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2';


ALTER TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v2 WRITE ORDERED BY wiki_id, computation_class, computation_dt, revision_dt;

Copy all data to it:

INSERT INTO wmf_content.inconsistent_rows_of_mediawiki_content_history_v2
SELECT
  wiki_db AS wiki_id,
  page_id,
  revision_id,
  revision_timestamp AS revision_dt,
  reasons,
  computation_dt,
  computation_class,
  reconcile_emit_dt
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
ORDER BY wiki_db, computation_class, computation_dt, revision_timestamp;

Verify copy looks ok:

spark-sql (default)> select count(1) from wmf_content.inconsistent_rows_of_mediawiki_content_history_v2;
count(1)
826721216
Time taken: 12.314 seconds, Fetched 1 row(s)
spark-sql (default)> select count(1) from wmf_content.inconsistent_rows_of_mediawiki_content_history_v1;
count(1)
826721216
Time taken: 269.214 seconds, Fetched 1 row(s)

Now let's nuke and recreate inconsistent_rows_of_mediawiki_content_history_v1:

DROP TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1;

hdfs dfs -rmr /wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1
rmr: DEPRECATED: Please use '-rm -r' instead.
25/01/17 18:10:22 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1

Now recreate and copy data back:

CREATE TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 (
    wiki_id                        STRING              COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.',
    page_id                        BIGINT              COMMENT 'The (database) page ID of the page.',
    revision_id                    BIGINT              COMMENT 'The (database) revision ID.',
    revision_dt                    TIMESTAMP           COMMENT 'The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.',
    reasons                        ARRAY<STRING>       COMMENT 'The set of reasons detected that make us think we need to reconcile this revision.',
    computation_dt                 TIMESTAMP           COMMENT 'The logical time at which this inconsistency was calculated. Useful to see trends over time, and also to be able to delete data efficiently.',
    computation_class              STRING              COMMENT 'One of "last-24h" or "all-of-wiki-time". This segregates between runs that cover one day of inconsistencies as of computation_dt, versus runs that retroactively check all revisions as of computation_dt.',
    reconcile_emit_dt              TIMESTAMP           COMMENT 'The time at which this inconsistency was emitted for eventual reconcile. If NULL, it has not been submitted yet.'
)
USING ICEBERG
PARTITIONED BY (wiki_id, computation_class)
TBLPROPERTIES (
    'format-version' = '2',                          -- allow merge-on-read if needed
    'write.format.default' = 'parquet',              -- parquet is currently the only format with min/max stats
    'write.target-file-size-bytes' = '134217728',    -- cap files at 128MB
    'commit.retry.num-retries' = '10'                -- bump retries from default of 4 due to many concurrent INSERTs
)
COMMENT 'We make checks between wmf_content.mediawiki_content_history_v1 and the Analytics replicas to detect inconsistent rows. If we do detect any, we add them here, to be reconciled, alerted, and analyzed.'
LOCATION '/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1';


ALTER TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 WRITE ORDERED BY wiki_id, computation_class, computation_dt, revision_dt;

INSERT INTO wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
SELECT
  *
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v2
ORDER BY wiki_id, computation_class, computation_dt, revision_dt;

spark-sql (default)> select count(1) from wmf_content.inconsistent_rows_of_mediawiki_content_history_v1;
count(1)
826721216
Time taken: 15.62 seconds, Fetched 1 row(s)

Finally, let's nuke the temp table:

DROP TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v2;

analytics@an-launcher1002:/home/xcollazo$ hdfs dfs -rmr /wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2
rmr: DEPRECATED: Please use '-rm -r' instead.
25/01/17 18:18:17 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2

Looks like production is happily catching up: https://airflow-analytics.wikimedia.org/home?tags=mediawiki_content.

I think we can close this task, but I am writing down a few cleanup tasks that perhaps we can do elsewhere:

Moved these tasks to T358375.