Page MenuHomePhabricator

Improvements to the basic MediaWiki event schemas
Closed, ResolvedPublic

Description

During the initial round of creating the schemas for basic MediaWiki events (originally contained in this PR, now residing in its own repo), there were some comments which have not been fully discussed or addressed. This is a follow-up ticket to try to address / clarify any low-hanging fruit. I'll try to summarise the current state of affairs for each of issues below, all of which address the naming conventions adopted for the event schemas.

The current edit schema uses save_dt for storing the time stamp, while the canonical name of the field as referred to it in MediaWiki is timestamp. Renaming the field gives it more clarity and specificity (as in, timestamp as the name itself is too broad). The counterargument is that we shouldn't change field names promoted by MediaWiki if we don't have to. Along the same lines, there is a suggestion to also rename the following fields:

  • summary into comment
  • parent_revision_id into parent_id

Event Timeline

mobrovac raised the priority of this task from to High.
mobrovac updated the task description. (Show Details)

Time stamp

timestamp is really too ambiguous to me. Since we are talking about an edit event where a new revision is created, I would be alright with renaming the field to rev_timestamp. How does that sound?

Summary / Comment

Traditionally, this field is referred to as comment in MediaWiki internals. However, this is really a summary of the edit, and it's portrayed as such in MW's interface when editing a page. Hence, I think it's much cleaner to retain summary as the field name.

Parent Revision ID

parent_id might not be clear enough (the parent of what?), parent_revision_id makes it explicit and easy to understand.

timestamp is really too ambiguous to me. Since we are talking about an edit event where a new revision is created, I would be alright with renaming the field to rev_timestamp. How does that sound?

I agree. If we're changing the name anyway, I'm a little for just going with the dt/ts convention, but if @Halfak prefers rev_timestamp to save_dt, I'm for it.

Traditionally, this field is referred to as comment in MediaWiki internals. However, this is really a summary of the edit, and it's portrayed as such in MW's interface when editing a page. Hence, I think it's much cleaner to retain summary as the field name.

Hm, yeahhhh, but many users of this data are used to MW internals, and will expect things to be consistent with what MW names things. I think this is case where we should stick with the MW name.

parent_id might not be clear enough (the parent of what?), parent_revision_id makes it explicit and easy to understand.

I defer to @Halfak's preference on this one.

timestamp is really too ambiguous to me.

But the dt field in the event isn't? What makes dt less ambiguous than timestamp? How does rev_timestamp make the field less ambiguous when it appears within the PageEdit (aka. "revision" everywhere else in our data sources)?

this is really a summary of the edit

No. It is a comment left by the editor. Summarizing what was done in the edit is often encouraged, but not required. In fact, many comments are not summaries at all, but by and large, the field is left blank (aka "no comment"). It is called "comment" in all other data sources that we provide. Why would we rename it?

(the parent of what?)

The revision. What else? This is like asking "user_id of what?". It appears within the event, so I'd guess it would be relevant to that the facts of the event. E.g. in the dumps, <parentid> appears within the <revision> tag (see the schema) and in the API "parentid" appears within the "revision" objects (see example query).

But the dt field in the event isn't? What makes dt less ambiguous than timestamp?

No, not to me at least, because it's the save_dt field, of type datetime, i.e. the save date and time (stamp) of the edit.

How does rev_timestamp make the field less ambiguous when it appears within the PageEdit (aka. "revision" everywhere else in our data sources)?

I could go for revision_timestamp too. It makes it less ambiguous because it makes it clear to what exactly the time stamp refers.

No. It is a comment left by the editor. Summarizing what was done in the edit is often encouraged, but not required. In fact, many comments are not summaries at all, but by and large, the field is left blank (aka "no comment"). It is called "comment" in all other data sources that we provide. Why would we rename it?

I meant it's meant to be used as a summary. And yes, I am aware of the different usage patterns by editors wrt this field, but that doesn't mean we shouldn't name it based on intended usage.

That said, I don't feel too strong about this.

(the parent of what?)

The revision. What else? This is like asking "user_id of what?". It appears within the event, so I'd guess it would be relevant to that the facts of the event.

Right, but parent_id in the PageEdit does not resonate clearly to me: does it refer to a previous edit? A previous edit event or a previous revision which caused the generation of the event? Conversely, user_id and page_id are rather clear in the context of the event.

And yes, I am aware of the different usage patterns by editors wrt this field, but that doesn't mean we shouldn't name it based on intended usage.

I think this is a case of just being consistent. We shouldn't change the name of a field only because we have a name that is slightly more clear. In this case, we aren't trying to conform to a convention (like we are with dt/ts, etc.). The only reason to call this summary is for a slight aesthetic improvement, and I don't think that's enough to justify the inconsistency.

No, not to me at least, because it's the save_dt field, of type datetime, i.e. the save date and time (stamp) of the edit.

I don't see a justification here. Is a timestamp not a type? Can you find anywhere else where we refer to this field as "save datetime"?

I could go for revision_timestamp too. It makes it less ambiguous because it makes it clear to what exactly the time stamp refers.

Why is it not obvious what the timestamp refers to? Why is it not ambiguous what user_id refers to?

parent_id in the PageEdit does not resonate clearly to me:

That's OK. The term "parent" is standard jargon in MediaWiki. "previous" and "base" are not used. It doesn't really matter whether these terms vibe with your sensibilities. Creating a new standard and further polluting the ecosystem of names used in MediaWiki data should be a strong counter argument to inventing new terms!

I don't see a justification here. Is a timestamp not a type? Can you find anywhere else where we refer to this field as "save datetime"?

timestamp is a type, but it isn't the type of this field. This field is a string datetime. 'timestamp' is also often a keyword, and we'd like to avoid naming fields things that have the likelihood of clashing with tools. Integer timestamps will be 'ts' and string datetimes will be 'dt'.

Creating a new standard and further polluting the ecosystem of names used in MediaWiki data should be a strong counter argument to inventing new terms!

This is a very strong point, and we should seriously consider this when changing the name of a field. I def think we should go with parent_id and comment because of this. Let me see if I can justify sticking with timestamp to myself...

timestamp is the only time field in the event. It is part of the event content, not the meta, so even though I think timestamp is an exceptionally bad name (vs. only kinda bad for the others we are considering here), the 'be consistent' argument may be enough. This is the timestamp of the page edit. I am willing to live with the poor name. We could avoid potential bugs in downstream tools by avoiding a keyword, but we might introduce other ones when users of this data don't find the field name they expect. If we do keep timestamp we should put a special comment about how it is a not an integer timestamp.

I agree that we shouldn't be pulling names out of thin air, but I don't think that's what's we are doing in this instance. Rather, we are picking appropriate names for fields.

We still seem to be stuck on timestamp. To quote @Ottomata (from his comment on PR#5):

that's fair, although just timestamp isn't good for a couple of reasons:

  • it is often a keyword, so not using it might avoid issues down the line
  • The meaning of the field should be imparted in the name. timestamp is too ambiguous. It'd be like calling a string typed field string.

I believe these still stand.

I could go for revision_timestamp too. It makes it less ambiguous because it makes it clear to what exactly the time stamp refers.

Why is it not obvious what the timestamp refers to?

It is not clear if the string (not timestamp, as @Ottomata rightfully points out) refers to the event creation time or action time. IMHO, revision_timestamp (or save_dt, or perhaps revision_dt) would be more accurate.

parent_id in the PageEdit does not resonate clearly to me:

That's OK. The term "parent" is standard jargon in MediaWiki.

My point was that it is not clear the parent of what is that, not that parent by itself is unclear.

It doesn't really matter whether these terms vibe with your sensibilities.

They don't, personally. But that's beside the point. Making it clear and easy for most people is.

we are picking appropriate names for fields.

And creating a new things for people to remember. Also, what's "appropriate" from your point of view might be totally inappropriate from mine (as may be evident). I'm sure that the people who picked the names for the Database, the XML dumps and the API thought hard about them too. I've been developing code against these datasources for more than a decade, so I might have some insights about what is "appropriate" that you do not.

I believe these [the arguments about "timestamp"] stand.

Those same arguments could be applied to the "dt" field in "meta". Further, datetime and timestamp are synonyms. See https://en.wikipedia.org/wiki/Timestamp I think I have thoroughly made my point and I have seen no defense of 'dt' as being less ambiguous.

It is not clear if the string (not timestamp, as @Ottomata rightfully points out) refers to the event creation time or action time.

These times are the same for a revision event. A revision is created when it's action is performed. I'm still waiting for someone to explain to me how rev_timestamp or revision_dt is more explicit than timestamp appearing within the revision event like it does everywhere else in MediaWiki land. This term "PageEdit" is also new and does not appear in any data source that I'm aware of.

parent of what

Of the revision! What else? "user_id" User of what? The revision!

Making it clear and easy for most people is [the point].

I think the best way to make things clear and easy is to either stick with the conventions or make a new convention and change EVERYTHING ELSE. How is introducing yet another naming scheme make working with the data "clear and easy for most people"?

Those same arguments could be applied to the "dt" field in "meta"

Not enitrely. dt is purposefully ambiguous, because it is a standardized part of meta data that every event will use as that event sees fit. Most of the time it will be the datetime of event creation. In Hadoop, we will default to bucketing events by this field. It is important that all events have a standardized datetime or timestamp field.

Further, datetime and timestamp are synonyms.

Yes, but we are explicitly using dt for string datetimes, and ts for integer timestamps.

However, even with that said, I think @Halfak is right here. We are modeling Mediawiki events here, in the mediawiki/event-schema repository. If we were writing Mediawiki code, we would stick with the naming conventions that make sense there. I think that for events like this, where we are purposefully mirroring things happening in Mediawiki, we should do our best to make the events look like what is happening in Mediawiki.

As I said before, timestamp has good reasons for renaming, and it could be justified, but I am more and more thinking that consistency trumps them here. comment and parent_id don't have compelling enough reasons to rename them, imo.

This term "PageEdit" is also new and does not appear in any data source that I'm aware of.

Waiiiit a minute! Really? So um, what should this event be called? revision_save? Do the other event schemas we are designing also need renamed?

Those same arguments could be applied to the "dt" field in "meta"

Not enitrely. dt is purposefully ambiguous, because it is a standardized part of meta data that every event will use as that event sees fit. Most of the time it will be the datetime of event creation. In Hadoop, we will default to bucketing events by this field. It is important that all events have a standardized datetime or timestamp field.

... and it additionally applies to the event itself. timestamp does not.

Further, datetime and timestamp are synonyms.

Yes, but we are explicitly using dt for string datetimes, and ts for integer timestamps.

However, even with that said, I think @Halfak is right here. We are modeling Mediawiki events here, in the mediawiki/event-schema repository. If we were writing Mediawiki code, we would stick with the naming conventions that make sense there. I think that for events like this, where we are purposefully mirroring things happening in Mediawiki, we should do our best to make the events look like what is happening in Mediawiki.

As I said before, timestamp has good reasons for renaming, and it could be justified, but I am more and more thinking that consistency trumps them here.

We are in a deadlock here. We keep recycling the same arguments, so I'm not restating mine.

comment and parent_id don't have compelling enough reasons to rename them, imo.

Agreed.

This term "PageEdit" is also new and does not appear in any data source that I'm aware of.

Waiiiit a minute! Really? So um, what should this event be called? revision_save? Do the other event schemas we are designing also need renamed?

For that the [PageContentSaveComplete](https://github.com/wikimedia/mediawiki/blob/b72a8fdaa66638fb813b013a3d394ba5b409a2a7/docs/hooks.txt#L2173) hook is used, so it's clearly associated with the page, not the revision. Do you have a proposal for a different name?

it additionally applies to the event itself. timestamp does not.

I'm starting to think that we are talking about different things. The timestamp field I have been talking about applies directly to the event. It represents the server time when the revision was saved. It is used as an ordering field in MediaWiki code.

Do you have a proposal for a different name?

How about "revision saved". A revision is produced for a few reasons that have nothing to do with saving content or making an edit. E.g. a page move saves a revision with a structured comment. See this revision with the comment !dea4u moved page [[Talk:Warla]] to [[Talk:Warla, Seti Zone]]: Multiple places exist with same name. as an example.

I like revision_save event. Should any of these others be renamed too?

We currently have

  • page_edit
  • page_delete
  • page_move
  • page_restore
  • revision_visibility_set

Beyond "page edit", "revision_visibility_set" is the only one that doesn't sound like a term I am familiar with. Generally, it seems that the logging table refers to this as a log_type="delete" log_action="revision" whether visibility is being reduced or restored. I'm not aware of anywhere else that this type of event is available in data. So, I don't have a strong opinion here.

Beyond "page edit", "revision_visibility_set" is the only one that doesn't sound like a term I am familiar with.

The name is based on the hook name.

Generally, it seems that the logging table refers to this as a log_type="delete" log_action="revision" whether visibility is being reduced or restored. I'm not aware of anywhere else that this type of event is available in data.

I really think this should be just one event type - something has changed there. What exactly occurred can be inferred by inspecting the contents of the event message.

it additionally applies to the event itself. timestamp does not.

I'm starting to think that we are talking about different things. The timestamp field I have been talking about applies directly to the event. It represents the server time when the revision was saved. It is used as an ordering field in MediaWiki code.

Yup, it seems that we are. For me, there are two different things: the time stamp of when MW thinks the event occurred and the event message creation time stamp. We are discussing the former, while the latter is stored as meta.dt. The two will in most cases only slightly differ, though. So, from that perspective timestamp is really ambiguous :)

So, from that perspective timestamp is really ambiguous :)

What might you mistake "timestamp" for -- other than the timestamp at which the revision was saved?

What might you mistake "timestamp" for -- other than the timestamp at which the revision was saved?

I think that our whole bike-shedding session over the name is proof enough of ambiguity. Basically, because we have two time stamps in the event message, I just want users to be clear which is which (even though, as noted earlier, the time difference between the two will be minimal).

@mobrovac, my disagreement with you about the ambiguity is not evidence of ambiguity. My call for consistency between our datasources is not bikeshedding.

Ok, just had a long convo with @mobrovac and some folks in #wikimedia-services. Have learned some things!

First, timestamp. We looked at some code and the revision table. @Halfak, would you be ok with rev_timestamp? It matches the revision table exactly, but also provides some disambiguation, and won't conflict as a potential keyword in downstream tools. It is a little inconsistent with ourselves since other fields from the revision table have their rev_ prefixes dropped, but timestamp is so exceptionally bad as a field name that I think it's ok to change it.

The other issue was the name of page_edit. @Halfak, you are right that a page is not the primary noun for this event, and that a revision is more primary. However, this event is created from the PageContentSaveComplete hook in MW, not the RevisionInsertComplete hook. There may be revisions that get created that don't show up in the page_edit stream. E.g. a page move will not show up in this stream.

We talked about how having a RevisionSaved stream as you propose here, but that is probably a different stream than page_edit. There's no harm in emitting overlapping streams to eventbus. Perhaps we can add this as a separate topic and schema later?

If you insist, the inconsistency of "rev_timestamp" seems less bad than the inconsistency of "save_dt" to me. So OK. I'm getting awfully tired of asking what makes "timestamp" so confusing and not getting an answer, so I'll give in and let it be as you insist.

Re. "page_edit" vs. "revision_saved", this is very silly to me. Are we not going to emit an event every time that a new revision was saved? Further the event created (PageContentSaveComplete) is meaningless in this discussion because this terminology is not used by anyone consuming our data and that is who I think we should be designing for. So, maybe "this event" *should* be emitted during the RevisionInsertComplete hook. Why are we locking ourselves to some arbitrary past decision with regards to PageContentSaveComplete?

I think it would be very confusing to have a both RevisionSaved stream and a PageEdit stream. Further, these two streams would hardly serve a different purpose. I think it is a little interesting that we are inventing a new data type (from the point of view as a data consumer) with no clear use-case for that new data type. Why don't we serve more general use-cases first and let the special "PageEdit" use-case by served via a filter on the more general RevisionSaved event?

Re. "page_edit" vs. "revision_saved", this is very silly to me. Are we not going to emit an event every time that a new revision was saved? Further the event created (PageContentSaveComplete) is meaningless in this discussion because this terminology is not used by anyone consuming our data and that is who I think we should be designing for. So, maybe "this event" *should* be emitted during the RevisionInsertComplete hook. Why are we locking ourselves to some arbitrary past decision with regards to PageContentSaveComplete?

The idea is to first create an MVP which would serve the internal purposes of the Services team. We want to port our existing code running on the job runners to the event bus. In that regard, we are interested primarily in page edits, not revision creation events, as we need to re-render and store the resulting Parsoid HTML whenever a page changes.

I think it would be very confusing to have a both RevisionSaved stream and a PageEdit stream. Further, these two streams would hardly serve a different purpose. I think it is a little interesting that we are inventing a new data type (from the point of view as a data consumer) with no clear use-case for that new data type. Why don't we serve more general use-cases first and let the special "PageEdit" use-case by served via a filter on the more general RevisionSaved event?

This would be phase two. We are not saying that general revision creation events shouldn't be represented in the event bus, of course they should, but their presence is of lesser priority for now. I for one would be really interested in continuing the discussion for defining the set of events which other users are interested in.

The idea is to first create an MVP which would serve the internal purposes of the Services team.

Whose idea was that? Who made the decision to deprioritize the public data consumption use-cases and prioritize Service's specific needs over more general use-cases? Or maybe this decision was never explicitly made and we're just encountering each others assumptions on the subject? On that subject, I'd like to suggest that the effective support of on-wiki backlogs and synchronized tool support is at least as important as RESTBase performance. Further, that both can be served with little cleverness required (e.g. have a "content_changed" flag on the RevisionSaved event).

I think this is our primary point of contention. I think our MVP should be aimed at our external collaborators in all ways possible and that it is inappropriate to tailor this system to your specific needs to the exclusion of those other use-cases. I have been participating in this discussion with the idea that I have just as much of a claim to the design and use of this system as the Services does and that my point of view as a long-term data consumer (and in many ways, an originator of the pub/sub event scheme proposal) would be considered as though I have a legitimate seat at the table. It seems you had a different impression -- maybe that I'm just a non-user getting in the way of a system design that you own. Maybe we should work that out before we continue discussing any implementation details.

The idea is to first create an MVP which would serve the internal purposes of the Services team.

Whose idea was that?

Politics? :) This system was suggested independently by 3 different sources (you first, of course). Services has a specific short term need. Analytics has a more general longer term need. Our teams decided to contribute resources to the project. It is hard to justify infrastructure building for its own sake, so we needed to choose a product. Effort was made to make the system generic and able to support many use cases, but the initial MVP was chosen to focus support on RESTBase change propagation.

I think our MVP should be aimed at our external collaborators in all ways possible and that it is inappropriate to tailor this system to your specific needs to the exclusion of those other use-cases.

Heh, you might be right, but it's a little too late to redefine the first 'MVP' now.

Re. "page_edit" vs. "revision_saved", this is very silly to me. Are we not going to emit an event every time that a new revision was saved?

If the answer is no, as it is now, 'page_edit' is better than 'revision_save'. But!

Why don't we serve more general use-cases first and let the special "PageEdit" use-case by served via a filter on the more general RevisionSaved event?

This makes a lot of sense to me. Why not just emit all revision creates, and make the change propagation system ignore the few that Services doesn't want? This would help with more general use of this event stream later.

AND NOW FOR MOR TIMESTAMP!

I'm getting awfully tired of asking what makes "timestamp" so confusing and not getting an answer

I think this was answered above, especially concerning the type vs name issue, but maybe not explicitly enough.

timestamp is often a keyword / type in many tools. Naming this field timestamp is similar to naming a string type field string.

We talked about the edit vs revision event stream issue in our weekly EventBus meeting today. If the RevisionInsertComplete (or whatever) hook has enough information to allow differentiation between revisions that change content and those that don't, then we will attempt to create a revision_save (revision_create? :D) event stream, instead of page_edit. Stay tuned!

Ok! Information obtained!

RevisionInsertComplete is not called for unchanged content, because a new revision is not created if the content hasn't changed. ($revision->insertOn is not called.)

RevisionInsertComplete is called twice during a page move, and there doesn't seem to be a good way to indicate from that point that the revisions created are involved in a page move event.

After a Friday afternoon IRC chat, it seems this isn't a big deal. page moves are rare, and if a RESTBase re-render is triggered an extra time or two, things will be fine.

So, the current plan is to replace page_edit with revision_create (or revision_save?). Let's talk about this more next week.

Great! FWIW, I have no opinion about revision_save vs. revision_create -- either seem fine to me.

Change 268858 had a related patch set uploaded (by Mobrovac):
Add the revision_create event

https://gerrit.wikimedia.org/r/268858

PS 268858 adds the revision_create event schema. I think I included all of the things we have been discussing lately:

  • rename title to page_title
  • rename namespace to page_namespace
  • rename parent_revision_id to parent_id
  • rename save_dt to rev_timestamp
  • rename summary to comment

So please review it and let me know if I missed something. Also note that this patch adds the event schema, but the ultimate goal is for it to replace the page_edit event. We cannot do it at once, though, as we first have to introduce the new schema into the HTTP Proxy service and the extension and then phase out the old one.

Looks good.

Noticed an inconsistency though. We have revision_id and rev_timestamp. Shall we change revision_id to rev_id to match the database?

I'd be ok with that. If we do that should we do rev_parent_id too?

I don't think so. I'd actually be more in favor of dropping the "rev" prefix, but since ya'll insisted it be on timestamp, I figured you had it on id for a reason. I don't see a good reason to have it prefix parent_id.

Oh! Actually, I think I do see what you mean. Since the "type" of parent_id is rev_id. I'd be OK with either keeping it as is (since parent is a type of rev) or changing it to rev_parent_id.

Cool. In that case I'll prefix both with rev. Updated patch coming right up!

Done. Complete list of event fields (modulo the meta ones present in all of the schemas):

  • page_title
  • page_id
  • page_namespace
  • rev_id
  • rev_parent_id
  • rev_timestamp
  • user_id
  • user_text
  • comment

Good to go? :)

There are 2 +1's on the patch, so I'll merge it by the end of the day tomorrow if no objections are raised in the meantime.

Change 268858 merged by Mobrovac:
Add the revision_create event

https://gerrit.wikimedia.org/r/268858

mobrovac claimed this task.

And ... resolved!

@Ottomata, could you sync up prod to start supporting this event so that we can start implementing it in the extension and start emitting events?

Mentioned in SAL [2016-02-16T18:27:21Z] <madhuvishy> Deployed and restarted eventlogging with Auto-increment ID and Update config/schemas submodule changes (T124741 and T125135)