
Design Schema for page state and page state with content (enriched) streams
Open, Medium, Public, 3 Estimated Story Points

Description

The original design is done, but we are keeping this ticket open to continue the discussion

User Story
As a platform engineer, I need a common MediaWiki page state change schema that can be used as a 'changelog' of page state. I can then use this to maintain a materialized view of the current state of pages outside of mediawiki.
As a search engineer, I need to be able to easily subscribe to ordered changes to pages to keep search indexes up to date.
Timebox:
  • 2 weeks
Done is:
  • Schema reviewed and agreed with group, including Data Engineering, Research, and Wikimedia Enterprise
  • Schema is merged and deployed

For collaboration on this schema design, please use this MediaWiki Page State Change Event Schema Design google doc.

Details

This event stream addresses the “comprehensiveness” problem described in T291120: MediaWiki Event Carried State Transfer - Problem Statement

How is this different from what we already have?

We do not currently have a way to get real time updates of comprehensive MediaWiki state outside of MediaWiki.

We want to design MediaWiki event streams that can be used to fully externalize MediaWiki state, without involving MediaWiki on the consumer side. That is, we want MediaWiki state to be carried by events to any downstream consumer.

See also: Event Notification vs. Event-Carried State Transfer

We had hoped that MediaWiki entity based changelog streams would be enough to externalize all MediaWiki state. The MediaWiki revision table itself is really just an event log of changes to pages. However, this is not technically true, as past revisions can be updated. On page deletes, revision records are 'archived'. They can be merged back into existing pages, updating the revision record's page id. Modeling this as a page changelog event will be very difficult.

Instead, this page state change data model will support use cases that only care about externalized current state. That is, we will not try to capture modifications to MediaWiki's past in this stream.

This stream will be useful for Wikimedia Enterprise, Dumps, Search updates, cache invalidation, etc, but not for keeping a comprehensive history of all state changes to pages.

We aim to create a new page ‘entity’ based stream that can be used to ‘materialize’ the current state of any MediaWiki page. An entity based stream will have all kinds of changes (creates, updates, deletes, etc.) in a single stream. That is, the mediawiki.page_change stream will have page creates, page edits, page deletes, and possibly other types of changes (page properties changes?).

Decisions made

What is MediaWiki page state? What are the relevant entities?
  • wiki/database
  • page table data: e.g. page_id, page_title, etc.
  • actor: the user making a change to a page
  • revision
    • comment
    • content slots (MCR) (& content body)
    • rendered content slots (for derived/enriched streams)
    • editor (same as actor on edit events).
What is not MediaWiki page state (for now)
  • page properties: these are usually parsing hints, and are not persisted through edits.
  • editing restrictions: these are about edit restrictions on a page, not how the page looks. We could add these state changes later if we change our minds.
  • page links changes: we have this in a different stream already; we can join if this is needed.

page state changes and changelog kinds

What kind of page changes are we going to capture in this stream, and what 'changelog kind' do they map to? A 'changelog kind' is the type of change to apply to a state store: either an 'insert'/'create', an 'update', or a 'delete'. Each page change kind maps to exactly one changelog kind. (In Flink, these will be mapped to a RowKind.)

MediaWiki page change kind → changelog kind
  • create → insert
  • edit → update
  • current revision visibility change* → update
  • move → update
  • delete → delete
  • suppress → delete
  • undelete → insert

*This can happen if the comment or editor's user_text is hidden on the current revision.
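
As a rough illustration, the changelog kinds above would translate to Flink RowKind values roughly like this (the RowKind names are real Flink values; the mapping itself is just a sketch):

# sketch: changelog kind -> Flink RowKind
insert: INSERT
update: UPDATE_AFTER   # Flink pairs this with an UPDATE_BEFORE when the old row is also emitted
delete: DELETE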

Modeling decisions
  • We will make this schema well organized, in that we are not going to force ourselves to stick with previous event data model decisions. E.g. we will have a revision object with revision related data, rather than top level rev_id, rev_timestamp fields. NOTE: This decision is being revisited; see this comment.
  • Every page change event will have ALL of the data needed to represent the current page state (page content will be in a different stream). That is, a page move event will still have all the data about the page's current revision in it, even if only the title has changed (see the sketch below).
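
For illustration, a page move event might look roughly like this (a sketch only; field names are illustrative, not the final schema):

page_change_kind: move
wiki_id: enwiki
page:
  page_id: 123
  page_title: New_Title
  namespace_id: 0
  is_redirect: false
performer:
  user_text: ExampleUser
  is_bot: false
revision:
  rev_id: 456
  rev_parent_id: 455
  comment: example move comment
prior_state:
  page:
    page_title: Old_Title

The full page and revision state is present even though only the title changed; prior_state carries the previous title.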
Outstanding TODOs and unknowns
  • Nested vs flat/top level fields
    • It is difficult to work with nested fields in SQL. Perhaps flat is best. See this comment.
  • Deprecate meta.domain and meta.uri, and put that info top level
  • Message Keys
    • We'll need a message key data model too. Perhaps something like {"database": "enwiki", "page_id": 123} is enough (see the key schema sketch after this list).
    • We haven't yet had to think about message keys in Event Platform.
      • wikimedia-event-utilities (Java client), EventGate (HTTP produce API), EventStreams (HTTP consume API) need to support keyed messages, and likely validation of key schemas too.
  • Compacted Kafka topics
    • Can we maintain just one compacted Kafka topic for each of these streams, or do we need to maintain a non-compacted one (e.g. with suppressions in it), and a separate compacted one (where suppression deletions are null/tombstoned out)?
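
A minimal sketch of what a message key schema fragment could look like (hypothetical; no such fragment exists yet):

title: fragment/mediawiki/page/key
description: Hypothetical key schema for page entity streams (illustrative only)
type: object
properties:
  database:
    type: string
    description: the wiki database name, e.g. enwiki
  page_id:
    type: integer
    minimum: 1
required:
  - database
  - page_id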

Related Objects

Event Timeline


In our design meeting this past Wednesday, we talked about how to model revision content_slots. If we only ever needed to model the 'raw' content that is stored by MediaWiki, we could just do:

content_slots:
  main:
    content_model: wikitext
    content_body: ...
    ...
  other_slot:
    content_model: json
    content_body: ...
    ...

However, in the future, we'd like to support 'rendered' content in streams. It'd be nice if we could have a generic enough model to support that.

In the meeting it was suggested to add a rendered_content_slots field to event schemas that needed it, with the same data model as content_slots, but with different 'rendered' content_models/formats, e.g. 'html'.
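
For comparison, that suggestion (a separate rendered_content_slots field) might look like this (a sketch, reusing the illustrative fields from above):

content_slots:
  main:
    content_model: wikitext
    content_body: <wikitext content here>
rendered_content_slots:
  main:
    content_model: html
    content_body: <parsed html content here>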

This could work, but I'm having a bit of trouble with structuring the schema fragments in a nice way to do this. Here's another possibility that uses the same content_slots field to represent different 'renderings' in the same content slot name:

content_slots:
  main: 
    wikitext:
      content_model: wikitext
      content_body: <wikitext content here>
      content_sha1: ...
    html:
      content_model: html
      content_body: <parsed html content here>
      content_sha1: ...
  other_data_slot:
    json:
      content_model: json
      content_body: <json data here>

This is a double nested map field, the first level key being keyed by the slot name, the second level key being the content_model.

(If we wanted, we could restrict the allowed keys in the second level map to known content models).

In this way, the content_slots field can be used for multiple renderings/parsings of the same content slot.

@Protsack.stephan @fkaelin @Milimetric @gmodena @Mooeypoo, thoughts?

The only question I would ask: is rendered content different enough that we need to put it in a separate rendered_content_slots field?
I would argue that it is, because that content does not react to changes in templates (talking about HTML) and is basically a render for that point in time. A separate field creates the possibility to discern between content_slots and rendered_content_slots on a schema level, instead of documenting that the html version (or any other content type) in the content slot behaves a particular way.

At the same time not having rendered_content_slots makes it a little bit easier to understand the schema in general. So both of them sound good to me.

Thanks @Protsack.stephan. IIUC then, your preference is for the separate rendered_content_slots field, correct?

My opinions on this topic stem from my experience with Graphoid and the problems it ran into. That conversation stalled here: T249419#6295084.

Graphs are defined as json documents, you can find examples here. When people write these kinds of documents on the wikis, they would generally use templates to factor out boilerplate. So the raw wikitext would not have the full graph definition, but at parse time, the parser would construct it by resolving the tree of dependencies. Initially, this constructed document was stored in page properties and a key to it was stored in the wikitext. Since this was not how page properties were intended to be used, this approach created problems.

Ideally, the constructed document would be stored in a content slot. At face value, this looks the same as rendered html. But in practice, there is another layer of complexity. The graph document could dynamically reference another tree of dependencies: the data it visualizes. It's unreasonable for the parser to resolve this tree at parse time, but without resolving these dependencies, the rendered graph document may become useless after some period of time (as the data it refers to becomes unavailable or changes). Conceptually, there should probably be another content slot reserved for resolved dependencies for the graph. This would be filled in asynchronously.

I think we should talk about whether I'm right about that last sentence and whether better alternatives exist. But, if not, then for the purpose of this task, it would be nice if there was a way to capture adding a slot to a revision asynchronously.

This would be filled in asynchronously

AKA wikifunctions AKA Async content fragments?

Hm, @Milimetric I'm mostly asking about data modeling preferences, not so much what and how the data is filled in.

Option A: content_slots field that only represents raw / non-derived content slots. In a different event stream (and schema), when we need rendered / derived content (e.g. html) we add a new field rendered_content_slots (name TBD).

Option B: make our modeling content_slots possible for each slot to have multiple 'versions' of that slot's content.

Option A. is less complex as a schema (especially for querying purposes).
Option B. is more normalized and more future-proof.

Option A: content_slots field that only represents raw / non-derived content slots. In a different event stream (and schema), when we need rendered / derived content (e.g. html) we add a new field rendered_content_slots (name TBD).

Option B: make our modeling content_slots possible for each slot to have multiple 'versions' of that slot's content.

Option A. is less complex as a schema (especially for querying purposes).
Option B. is more normalized and more future-proof.

+1 to @Protsack.stephan reasoning.

I'm leaning towards option B: separate rendered_content_slots.

From my vantage point, if we want this stream to (eventually) become the source of truth, it might make sense to future proof and strive to be generic. To simplify querying, should use cases arise, we could always derive streams with a simpler (ad-hoc) schema.

Thanks @Protsack.stephan. IIUC then, your preference is for the separate rendered_content_slots field, correct?

@Ottomata Yes, you are correct.

Okay, thank you! rendered_content_slots makes organizing and referencing the schema fragments a little more difficult, but I can do it. I think I agree with you as well, especially from the querying standpoint; the double nested map would be pretty annoying to deal with in SQL.

The reason I was trying to spell out the graphs use case is that we need more than option B to model what we want to do. So I vote for option B, but add that we will need a way to say "this revision has a slot that will be updated later, check <<somewhere>> for it"

Adding idea discussed with @Ottomata earlier on. It's probably interesting to separate streams by project, to allow optimal reading for both all-projects readers and single-project readers.

"this revision has a slot that will be updated later, check <<somewhere>> for it"

This sounds a lot like rendered content and wikifunctions too. Slot content is never modified without a bump in revision id (excluding Derived slots, which I was told I should ignore). Rendered content of any kind can change without a bump in revision id. I'm also leaning towards modeling these separately for this reason.

It's probably interesting to separate streams by project

Yes, @gmodena @JAllemandou we should discuss this more. There are some implications here for stream config. Right now, consuming all e.g. revision-create events just to get the events for some small wiki is annoying, but isn't so bad because the events aren't that big. Once we add content, they get much bigger. We should probably make it possible for consumers to choose the project they want to get contentful events for. To do this, each project will need to have its own stream. We could probably also declare a composite stream (or regex stream :/ ewww) that includes all of the substreams, but this would be the first time we'd have multiple streams using the same Kafka topics.

Adding idea discussed with @Ottomata earlier on. It's probably interesting to separate streams by project, to allow optimal reading for both all-projects readers and single-project readers.

Do we already have a set of use cases for this layout? Other than the known data skewness considerations, I'd like to get a feel for how streams would be used.

@Ottomata I'd lean towards not building a composite stream. Depending on the use cases, we could consider materialising derived streams from the main "page-change" one (our source of truth). Maybe smaller wikis could be bucketed together. I don't have a good feel yet for the final data size with content included, but throughput seems reasonable.

This sounds like a separate thread though. Maybe we can spike some work on it?

Do we already have a set of use cases for this layout?

We have the WDQS-Updater: it reads everything but needs only wikidata.

This sounds like a separate thread though. Maybe we can spike some work on it?

+1, just wanted to call it out as something we should think about!

@gmodena, @Milimetric, @dcausse, I need some help with modeling current revision visibility changes.

We need to know when the visibility settings for the current revision of a page are changed. Visibility settings are per revision, not per page, so I am modeling them in the revision entity, which will be included in the page change event. So I've got:

page_change_kind: visibility_change
# ...
revision:
  rev_id: 123 # this is the current revision of the page.
  # ...
  visibility:
    comment: false # in this case the comment is not visible to public
    content: true
    editor: true

We set prior_state when we want to make it easier for the consumer to compare what has actually changed since the last event. In the existing revision-visibility-change stream (schema), we can capture these changes on a revision level. So it is easy to understand what visibility and prior_state.visibility mean; rev_id is the same.

To do the same in page_change, we'd do this:

prior_state:
  revision:
    rev_id: 123
    visibility: 
      comment: true
      # ...

So, by comparing revision.visibility and prior_state.revision.visibility, you can determine what has changed.

However, this is a little weird with our page entity changelog concept. Example:

Let's say a page is at revision rev_id 100. rev_id 100's parent is rev_id 99.
An edit happens that creates a new rev_id 101. A page change edit event will be produced with:

page_change_kind: edit

revision:
  rev_id: 101
  rev_parent_id: 100
  visibility:
    comment: true

prior_state:
  revision:
    rev_id: 100
    rev_parent_id: 99
    visibility:
      comment: true

Now let's say the comment is hidden on rev_id 101. If we model everything with the same prior_state logic, we'd get:

page_change_kind: visibility_change

revision:
  rev_id: 101
  rev_parent_id: 100
  visibility:
    comment: false

prior_state:
  revision:
    rev_id: 101
    rev_parent_id: 100
    visibility:
      comment: true

Perhaps this is okay, but IIRC, we previously used the existence of a field in prior_state to note that it had changed. However, fields like rev_id and rev_parent_id have not changed in this visibility_change page change event. I suppose it doesn't matter? One can still compare revision.rev_id to prior_state.revision.rev_id to know that it hasn't changed...and also one can expect that these will always be the same for visibility_change events.

Should I maybe only set the prior visibility, and skip any other revision details? E.g.

prior_state:
  revision:
    visibility:
      comment: true

Not sure what is best...but after typing all this out...maybe it doesn't matter? Thoughts?

Hm, actually doing ^^ (skipping extraneous revision details in prior_state) makes dealing with the prior state of the hidden properties easier too. If a comment is hidden, we probably should avoid reproducing it in the prior_state object.

Okay, hah, typing this out was a good rubber ducking exercise for me. Let me know what you think but I think I'm going to proceed this way.

One more question for you all.

The existing EventBus code uses an anonymous actor role when retrieving data fields for a revision, e.g. comment, editor, etc. This means that if the revision comment is hidden, it will just not be set.

But, the EventBus code also checks the strlen of the comment and, if it is 0, also avoids setting the comment field.

I'm inclined to change this behavior and just always set the comment field in the event if it is returned, even if it is the empty string. This would let us differentiate between a hidden comment (no comment field present) and an empty comment.
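
A tiny sketch of the proposed convention (illustrative field names):

# empty but visible comment: field present as the empty string
revision:
  comment: ""
  visibility:
    comment: true

# hidden comment: comment field absent entirely
revision:
  visibility:
    comment: false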

Sound okay?

@Ottomata just to clarify: the new page_kind visibility_change will only be emitted for the current revision, which can only happen for hiding comments? If yes, I wonder if this should not be stated explicitly, e.g. page_kind: [un]hide_comment or perhaps comment_visibility_change.
Regarding prior_state I agree with your approach; I think it will be difficult (impossible) to make it consistent if all the previous values of the fields have to be replayed. Its purpose should be to give consumers some hints on what might have changed.

I'm inclined to change this behavior and just always set the comment field in the event if it is returned, even if it is the empty string. This would let us differentiate between a hidden comment (no comment field present) and an empty comment.

If it'll change existing streams I'm unsure of the consequences on existing consumers.

Will this change existing revision_create behaviors or is it just for this new stream? If it's only for the new stream, I think this would match what the action API is returning for revisions: empty string when no comment is set, absence of the field if it's not visible.

comment as empty string

Will this change existing revision_create behaviors or is it just for this new stream?

Only the new stream. Great.

the new page_kind visibility_change will only be emitted for the current revision, which can only happen for hiding comments?

Content...cannot be hidden for the current revision. I just tried and got: "Revision visibility could not be updated: Error hiding the item dated 17:20, 7 September 2022: This is the current revision. It cannot be hidden."

Editor user_text can be hidden. Hm, I need to check how this works, I think right now the code will not emit any user information...but I guess only the user_text needs to be hidden. TODO for me.

I am sort of modeling a more abstract 'revision entity' here. revision.visibility.content will always be true for page_change stream. I guess we could omit it. But if we re-use this revision entity schema elsewhere, we'd probably want to be able to represent this piece of the revision state.

Alternatively, we could make this an array of visible values or a map field instead of an object? You know, I could make this a map field with a restricted set of keys. This would allow us to omit content visibility in this stream, but still set it in another revision entity in a stream if needed? Map fields are worse for schema discovery, but maybe it's okay here?
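
For example, a visibility map with a restricted key set could be sketched in JSONSchema like this (hypothetical fragment, not the merged schema):

visibility:
  type: object
  description: map of revision visibility flags, restricted to known properties
  additionalProperties:
    type: boolean
  propertyNames:
    enum:
      - comment
      - content
      - editor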

Too bad there's not a good way to serialize this in JSON as int with bitfield like MW does.
(I could just serialize the int, but then it'd be hard to use/query the data outside of MW.)
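
For reference, the rev_deleted bitfield values that these boolean flags decompose (MediaWiki RevisionRecord constants, if I recall them correctly):

DELETED_TEXT: 1        # content hidden
DELETED_COMMENT: 2     # comment hidden
DELETED_USER: 4        # editor user_text hidden
DELETED_RESTRICTED: 8  # suppressed, hidden from admins too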


However, in the future, we'd like to support 'rendered' content in streams. It'd be nice if we could have a generic enough model to support that.

Rendered content exists on two levels: per slot, and per revision. All slots are combined to generate the revision rendering. Most of MediaWiki is not aware of per-slot rendering, and there is currently no (efficient) way to access or expose it. Making it a first class citizen would need quite a bit of work in core. We investigated this when working on MCR, and it was dropped as YAGNI.

This might change once we make more use of slots. After all, it's annoying that we have to re-render all slots if one slot changes.

@daniel I am modeling slots in the revision entity and also in the page change event now. Easiest to see in the page_change examples: https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/807565/10/jsonschema/mediawiki/page/change/1.0.0.yaml#621

Schema of revision entity is here: https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/807565/10/jsonschema/fragment/mediawiki/state/entity/revision/1.0.0.yaml

Been meaning to find time to go through all these ideas with you. I'll put a meeting on our calendars for next week :)

I am sort of modeling a more abstract 'revision entity' here. revision.visibility.content will always be true for page_change stream. I guess we could omit it. But if we re-use this revision entity schema elsewhere, we'd probably want to be able to represent this piece of the revision state.

I've been thinking about this as I sketch out an architecture for the new dumps. We have to listen to visibility changes over the whole history, because dumps publishes every revision. So one way to do it is to listen to your stream and compact it with all visibility change events. Then we'd create revision.visibility.content as we produced. So if that's how it ends up being done, then I think you don't need it?

Other thoughts around this are that it seems like we'd be duplicating a lot of work; I'm reading more in depth now to see if that's just my bad intuition.

listen to your stream and compact it with all visibility change events

You mean join with? Yes I think that makes sense. Using the existing mediawiki.revision-visibility-change event stream to fill in historical revision visibility changes.

duplicating a lot of work

How so? say more! :)

In T212482#8294070 @daniel wrote:

Modeling an event as change-of-state-of-entity is intriguing, but I see two problems with it:

  1. you may want additional meta-data about the change itself, things that do not update the resource. E.g. the IP an edit came from.
  2. the state of an entity is potentially huge, e.g. a "page" could potentially be modeled as containing all its revisions, or maybe even all possible renderings... representing "before" and "after" fully in the event itself becomes impossible.

Perhaps it would make sense to model page changes a bit like git commits: they have meta-data, and contain a patch against a base state.

you may want additional meta-data about the change itself, things that do not update the resource

Yes, and drawing the line for what to include is a difficult one. We want the change stream to be as externally useful as possible, so we denormalize it a lot by putting more in it (editor info, etc.) than strictly belongs to a page. But, we don't want the events to include everything, so it is a fuzzy line and requires some guessing (and bikeshedding!)

could potentially be modeled as containing all its revisions

We are trying to model the 'current state' of the page, with some 'before' information where it makes sense. So we only need to include info about the current (and maybe parent) revision.

This has a downside in that we won't get information about changes to past revisions (e.g. revision-tags, old revision suppressions or, cough cough, 'derived' revision slots :p ), but that's a trade off we're willing to make.

model page changes a bit like git commits: they have meta-data, and contain a patch against a base state.

Modeling diffs is pretty hard. Here's how we decided to model state changes in events. The event is supposed to represent the current state. For cases where we want to capture what has changed, we include a prior_state object with the same top level fields/structure as the event. If a field is set in prior_state, you know it has changed since the previous value. Otherwise, it has not changed.

When doing stateful stream processing, this prior_state isn't strictly needed; you could just keep the state and diff yourself when you receive a change. But, for convenience when doing simpler stream consumption, we decided to represent old state like this.

@gmodena @dcausse I talked to @daniel today, and ended up with an interesting question about how we are modeling MediaWiki's content outside of MediaWiki.

I had been modeling content like this:

revision:
  rev_id: 19
  # ...
  content_slots:
    main:
      content_size: 1235
      content_format: wikitext/html
      #...

But, Daniel asked why we were including any content_slot information in our event if we weren't actually emitting content bodies from MediaWiki. Now, we do plan to have content bodies in the enriched streams, but, the question is, does that have anything to do with content_slots? If we only care about 'rendered' content, then the answer apparently is no, because the rendered content is generated from all the combined content slots.

But, are there cases where we do care about the raw content slots? E.g. for commons, do we want streams with the main wikitext and also the 'structured data' content slot?

@dcausse how would we like to model this for other kinds of enriched streams, e.g. wikidata triples or CirrusSearch renderings for ElasticSeach index updates? Do we care about slots, or do we only need the 'rendered' version?

If we only ever need the rendered versions, I think we can ignore the MCR slots stuff altogether. When we emit an enriched event, we can just have e.g.:

revision:
  rev_id: 19
  # ...
  content:
    size: 12345
    format: wikitext/html # (?)  or wikibase triple (?) or cirrussearch/html (?)
    body: <div>example example</div>  # or whatever is provided as the rendered content

@dcausse how would we like to model this for other kinds of enriched streams, e.g. wikidata triples or CirrusSearch renderings for ElasticSeach index updates? Do we care about slots, or do we only need the 'rendered' version?

Reasoning about MCR slots in a generic manner is very ambiguous so I'll stick to the commons usecase:

  • for search we blend some revision metadata, the main and mediainfo slots into a single indexable document.
  • for WDQS we render the RDF output that is only based on the mediainfo slot content (+ some revision metadata).

So indeed as of today for structured data on commons in search and WDQS an enriched stream with a single content field is sufficient. Note that these rendering functions are inside MW; if we were to extract them outside of MW then having raw content slots might make sense, but we don't have such plans and it's unlikely we ever will.

As for raw content slots, the search team does not have such use-cases, but there might be value to it (e.g. json dumps for the mediainfo entities). Does this justify a complex schema that supports emitting all slot content bodies in a single event? I'm not sure; I can't think of a use-case with commons where we'd need both the main wikitext slot and the mediainfo JSON slot in a single document. If we were to emit the raw mediainfo slot content I guess that a separate enriched stream can be created using the very same schema:

revision:
  rev_id: 19
  # ...
  content:
    size: 12345
    format: application/json
    body: {"type":"mediainfo","id":"M76","labels":{},"descriptions":{}}

The sole information that might be missing is the fact that it is the mediainfo slot (other than inferring this from the stream name itself).

Okay, then given that and also the discussion about rendered_content_slots above, let's avoid modeling RevisionSlots at all, for now. I will drop them from the page change schema that will be emitted from MediaWiki.

One note about slot information: if the page-event we're talking about here is envisioned to be the "reference" page state change (including for analytics purposes), I think it's worth adding slot information to it, unless there is a plan to deprecate it soon. The reason I say this is that when there is data representing something (in this case slot information, with its content-type etc), there usually are interesting analytics questions related to it, such as how often different content-types are used, and so on. Therefore if it's not too expensive, and if the event is to be the reference, and if the underlying model is here to stay, I'd rather have that information be present on the schema :)

@tstarling, @daniel advised that I should include more than just is_bot in our modeling of MW users in event data. Here's the WIP user model.

I'm mostly getting confused on how isTemp and isNamed fit in with all the other types/qualities a user can have. Is there any intention to eventually standardize this in MW, maybe similar to how Daniel described here? I could just add more is_temp, is_named, etc. boolean fields to the event model to match with what User.php has. But is that the right thing to do, or should I be more 'future proof' by e.g. adding a user_type (one of registered, anon, or system?) and a user_qualities (any of bot, temp, etc.)?
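
To make the two alternatives concrete (both sketches use hypothetical field names):

# A) boolean flags mirroring User.php predicates
performer:
  user_text: ExampleUser
  is_bot: false
  is_anon: false
  is_temp: false
  is_system: false

# B) a user_type enum plus a list of qualities
performer:
  user_text: ExampleUser
  user_type: registered        # one of registered, anon, system (temp?)
  user_qualities: [bot]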

I don't feel like I have a good understanding of all the ways MediaWiki wants to represent its user types, so I'm likely to do something wrong here. Any pointers you have would be appreciated.

I think is_temp is fine. I don't think it's future-proof to add user_type since if we decide we want such a concept in MW core, the details may differ from what you decide on here.

Daniel backed off from asking for a user type after he saw the code. I implemented it, but he didn't like it.

Your schema is missing imported users. Imported users have no user_id and have a name consisting of two parts separated by ">". The first part identifies the wiki and the second part is the foreign user name.

Temporary users are similar to imported users in that both interpret the user name in a special way. Whether you need to parse the user name and provide details to clients depends on whether the clients need that information. For most purposes, it's acceptable to treat temporary users as normal logged-in users.

After some more discussions with folks, there are needs to have wikitext / raw content in streams, e.g. for incremental dumps, for ML scoring, etc. I will keep the content_slots model, which we can use to enrich the stream with slot content later.

@daniel, Q about page suppress vs delete.

As far as I can tell, a page delete is a delete, and page suppression is just a page delete + setting all revisions 'deleted' (hidden) too, right?

I'm currently distinguishing between page delete and suppress as different state changes.

If so, perhaps I should just add an is_suppression_delete field to the event to indicate that it was a page delete with full suppress, and then make sure that the revision is_*_visible settings are all set to false, and that any included values for things like editor user_text and revision comments are redacted in the delete event. Do you think that would be more correct?

@daniel, Q about page suppress vs delete.

As far as I can tell, a page delete is a delete, and page suppression is just a page delete + setting all revisions 'deleted' (hidden) too, right?

Yes - from MediaWiki's internal perspective, "page suppression" doesn't exist. Things that can be suppressed are revision content, revision comment, and user names (on revisions and elsewhere). These are independent of each other. The revision itself is always "visible", it never vanishes from the history.

I wasn't aware that we had UI for "suppressing a page". Do we? My understanding was that the suppression would happen manually, in a second step, after the deletion. But please ask someone who actually knows the UI :)

By the way: technically, deleted pages do not exist either. Only archived revisions, which are associated with a page title and a page ID (note that the current page with this title may have a different ID). I think this is a design flaw and should change, but that's how it currently works.

I wasn't aware that we had UI for "suppressing a page". Do we?

(Screenshot: Screen Shot 2022-10-22 at 19.59.11.png)

"Suppress data from administrators as well as others" seems to 'delete' all revisions.

Ok, the annoying thing is that when deleting and checking the 'suppress data' box, the revisions are all moved to the revision archive table, and when inserting into it:

'ar_deleted'    => $this->suppress ? $bitfield : $row->rev_deleted,

So, the revision 'deleted' bitfield is set to a different value in the archive table than it was in the revision table.

In the onPageDeleteComplete Hook, the only indication I have that a page was 'suppressed' is in the ManualLogEntry. The RevisionRecord $deletedRev I am given in the hook does not have its own mDeleted (visibility) bitfield updated, so I can't indicate that the revision is 'suppressed' in the event.

It would be better if the Hook gave me a RevisionRecord that has the same data fields that are being inserted into the revision archive table.

Oh well, I guess the right thing to do is to manually update the event's is_*_visible fields if $logEntry->getType() === 'suppress'.

It would be better if the Hook gave me a RevisionRecord that has the same data fields that are being inserted into the revision archive table.

Oh well, I guess the right thing to do is to manually update the event's is_*_visible fields if $logEntry->getType() === 'suppress'.

Updating the RevisionRecord isn't trivial, but it would be easy to simply pass the $this->suppress flag into the hook if only we had hook param objects, per the discussion on T212482: RFC: Evolve hook system to support "filters" and "actions" only. Please keep telling people that we need to overhaul the hook system!

Change 807565 merged by jenkins-bot:

[schemas/event/primary@master] Add new mediawiki state entity and change fragments, and use them in new mediawiki page change schema

https://gerrit.wikimedia.org/r/807565

JArguello-WMF added a subscriber: JArguello-WMF.

@Ottomata Is the decision made something worth documenting in the decision log?

Change 851670 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0

https://gerrit.wikimedia.org/r/851670

Change 851673 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] Declare rc0.mediawiki.page_content_change stream

https://gerrit.wikimedia.org/r/851673

Change 851670 merged by jenkins-bot:

[schemas/event/primary@master] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0

https://gerrit.wikimedia.org/r/851670

Change 851673 merged by jenkins-bot:

[operations/mediawiki-config@master] Declare rc0.mediawiki.page_content_change stream

https://gerrit.wikimedia.org/r/851673

Mentioned in SAL (#wikimedia-operations) [2022-11-01T19:15:30Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Declare rc0.mediawiki.page_content_change stream - T307959 T308017 (duration: 03m 42s)

Ottomata added a subscriber: Tgr.

Re-opening to discuss a schema change.

In https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/853267, @Tgr wrote:

IMO it would make sense to remove formatComment from EventSerializer because the dependencies are onerous and limit when it can be used (CommentFormatter requires the parser, the parser requires among many other things a user context for language preferences, so trying to obtain it will result in an exception in no-session contexts and might result in unexpected behavior).

I think support for the html formatted comment was added in older events as part of T170145 and we've just kept it in.

I propose we remove the comment_html field from mediawiki/page/change altogether. If we want a parsed comment, we can add it in as part of an enrichment step.

Change 855146 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] development/ page change - Remove comment_html fields, bump to 2.0.0

https://gerrit.wikimedia.org/r/855146

2 more questions to answer:

Nested vs flat/top level fields

Right now, this schema uses Rows AKA Structs as nested fields to contain entity specific information, like page, performer, revision, revision.editor, revision.content_slots etc. Querying such nesting can be difficult in SQL. Perhaps it would be better to put as much as we can in top level entity prefixed fields instead, e.g. page_id, page_title, performer_user_id, rev_id, rev_editor_user_id, rev_content_slots, etc.
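
To make the comparison concrete (a sketch; field names are illustrative):

# nested (current approach)
page:
  page_id: 123
  page_title: Example
revision:
  rev_id: 456
  editor:
    user_id: 789

# flat, entity-prefixed alternative
page_id: 123
page_title: Example
rev_id: 456
rev_editor_user_id: 789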

Deprecate meta.domain and meta.uri

Using the meta field for domain specific event data is wrong. My preference would be to fully get rid of meta, but that would be quite an undertaking. Instead I propose:

  • Mark meta.domain and meta.uri as deprecated, but don't do a major meta schema bump to remove those fields.
  • Stop producing meta.domain and meta.uri in our new mediawiki/page/change event, but add top event level fields that contain the same information.

The consumer would need to know that the event doesn't have this info in e.g. meta.domain anymore, but I think that is fine, as the consumer would also need to know that this data WAS in meta.domain in the first place. Old events and consumers will continue to have this data, but new ones should not.

In https://phabricator.wikimedia.org/T317768#8400702 @Isaac wrote:

@Ottomata recognizing that this might be long past the time when you'd want this feedback but a question about an additional field:

Similar to is_redirect, we often use whether an article is a disambiguation / list page as a determination for how to handle it with ML models -- e.g., it's not intended behavior to run many models like add-a-link or the topic model on disambiguation / list pages. While I don't think list article is easy to determine without making a call to Wikidata (I assume that's out of the question), disambiguation pages are tracked by MediaWiki -- e.g., https://en.wikipedia.org/w/api.php?action=query&titles=Albert&prop=pageprops&format=json&ppprop=disambiguation.

What would be the process to consider whether this could be included as part of the page info in the event?

long past the time when you'd want this feedback

It is not! We still want and need your feedback. That's why this is currently an 'rc0' stream, and the schemas are in a '/development' namespace.

disambiguation pages are tracked by Mediawiki
prop=pageprops&format=json&ppprop=disambiguation

Hm, we had made the decision to not include page properties:

page properties: these are usually parsing hints, and are not persisted through edits.

I think that the right thing to do, if you need this kind of stuff, will be to join with other streams that have this information, e.g. mediawiki.page-properties-change (schema). Ooof, but actually, to do that we need T281483: mediawiki/page/properties-change schema should use map type for added and removed page properties.

Ottomata updated the task description.

It is not! We still want and need your feedback. That's why this is currently an 'rc0' stream, and the schemas are in a '/development' namespace.

Yay!

Hm, we had made the decision to not include page properties:

page properties: these are usually parsing hints, and are not persisted through edits.

I think that the right thing to do, if you need this kind of stuff, will be to join with other streams that have this information, e.g. mediawiki.page-properties-change (schema). Ooof, but actually, to do that we need T281483: mediawiki/page/properties-change schema should use map type for added and removed page properties.

I'm not following the aspect about page properties not being persisted through edits. Most of the page props I see (all props of enwiki) are largely-static properties of the page such as its display image or whether it's a disambiguation page. Most of them are irrelevant to our needs though I'm also quite interested in wikibase_item (connecting Wikipedia articles to Wikidata items can be a painful process so having the data present in one place saves that hassle). I can come up with weaker reasons why page_image and wikibase-shortdesc are also interesting but that's really only for specific models.

That said, if the disambiguation property isn't included in the page-change, here's what I imagine our options would be for using this stream as part of the pipeline for running LiftWing models on new edits:

  • We could have a separate job that watches for pages that add the disambiguation property and uses that to, e.g., clear predictions for a page from the Search index, but it wouldn't necessarily help us with the question of whether to run a model on a given edit.
  • We could do the same but maintain a table of e.g., all the pages that are disambiguation pages in a feature store on the ML platform that we could check against. I'd worry about that falling out of sync due to e.g., page moves etc. though so I think that would be a brittle solution.
  • I think we'd be most likely to just make the API calls to pageprops ourselves when it's relevant to whether a model should be triggered or not (which is fine but was hoping to remove that step if easy because I imagine it's an important piece of information for lots of ML models)

I'm not following the aspect about page properties not being persisted through edits

I don't know if I totally follow either, but there is more context in the initial collab design doc; see "Do we want page properties?" and the comment.

here's what I imagine our options would be

What about using the mediawiki.page-properties-change stream? You can join these two streams together on wiki_id and page_id, and keep state about the current page properties for a page_id, and use that to decide what to do. This would be similar to your idea of maintaining a feature store lookup table for page properties, except that the content of the table/feature store is updated from the stream directly.
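
As a sketch, an event enriched by such a join (hypothetical field names) could carry the current page properties alongside the page change:

wiki_id: enwiki
page_id: 123
page_change_kind: edit
page_properties:
  disambiguation: ""      # present only if the page is a disambiguation page
  wikibase_item: Q42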

Re-opening to discuss a schema change.

In https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/853267, @Tgr wrote:

IMO it would make sense to remove formatComment from EventSerializer because the dependencies are onerous and limit when it can be used (CommentFormatter requires the parser, the parser requires among many other things a user context for language preferences, so trying to obtain it will result in an exception in no-session contexts and might result in unexpected behavior).

I think support for the html formatted comment was added in older events as part of T170145 and we've just kept it in.

I propose we remove the comment_html field from mediawiki/page/change altogether. If we want a parsed comment, we can add it in as part of an enrichment step.

Sorry for the slow response! I just want to clarify that my point was about separating the class that does comment formatting (which is complex and has lots of dependencies) from the one that does the rest of the event object creation (which is fairly simple, and needed in some contexts where we don't have enough information to initialize a comment formatter). That could be done without changing anything about the events themselves - it's just that sometimes we have events which don't involve comments in any way, and happen in requests where we can't easily obtain a comment formatter, and we'd need a more lightweight serializer class for those. That could be solved by having two different kinds of serializers, or a serializer and a helper class for comments, for example.

I have no opinion on the proposal here one way or another, as I'm not familiar with how the data is used. It would certainly solve the problem that led to the patch mentioned above, but there are other ways to solve it, too.

Thanks @Tgr! At this point it is easy enough to remove, and we can always add it back in later if/when we need it. I'd prefer to solve this problem by making the event model simpler for now anyway.

I don't know if I totally follow either, but there is more context in the initial collab design doc; see "Do we want page properties?" and the comment.

@Ottomata thanks for that pointer. My summary after reading: Moriel's right that page props are for parser hints. That notwithstanding, they also happen to be useful for knowing when to trigger some ML models (there are certain things that are painful to infer in a consistent manner from wikitext so it's very helpful when we can rely on the parser for that). David raised the point about those properties potentially changing silently in between revisions (like the wikibase_item). For our use-cases, we're probably okay with having slightly stale data -- e.g., not triggering models when the wikibase_id changes but waiting till the actual page is edited. But I understand that other use cases might be less okay with that. Per our discussion last week though, stream enrichment is easy and this is all pretty specific to ML models so I'm now thinking that we likely will just want to work with ML Platform to create a stream that enriches the base page change stream with properties like is_disambiguation and wikibase_item.

What about using the mediawiki.page-properties-change stream? You can join these two streams together on wiki_id and page_id, and keep state about the current page properties for a page_id, and use that to decide what to do. This would be similar to your idea of maintaining a feature store lookup table for page properties, except that the content of the table/feature store is updated from the stream directly.

My concern about approaches like this is that they require us to build increasingly complex code to not fall out of sync with Mediawiki (akin to the heroic scale of what Joseph put together for mediawiki-history). For instance, I assume along with page-properties-change, we'd also have to watch page move logs and possibly others to actually maintain an accurate list of pages that are disambiguation pages. Perhaps there's some middle ground though where we watch page-properties-change to get realtime changes but still call pageprops API on edits to clean up the state for when it falls out of sync. I'll bring this up with ML Platform though.

build increasingly complex code to not fall out of sync with Mediawiki (akin to the heroic scale of what Joseph put together for mediawiki-history)

A goal of this task is to simplify the code that is needed to maintain the current state of MediaWiki pages outside of MediaWiki.

along with page-properties-change, we'd also have to watch page move logs and possibly others

Ideally you'd only need this new mediawiki.page_change stream, plus mediawiki.page-properties-change, unless I'm missing something.

accurate list of pages that are disambiguation pages

Is that known via page properties, or something else?

create a stream that enriches the base page change stream with properties like is_disambiguation and wikibase_item

If this info is in mediawiki.page-properties-change, it will probably be preferable to create this new enriched stream by joining the streams; that way the MediaWiki API is not involved, and backfilling history will not require throttling.