
Design Schema for page state and page state with content (enriched) streams
Open, Medium, Public, 3 Estimated Story Points

Description

The original design is done, but we are keeping this ticket open to continue the discussion

User Story
As a platform engineer, I need a common MediaWiki page state change schema that can be used as a 'changelog' of page state. I can then use this to maintain a materialized view of the current state of pages outside of mediawiki.
As a search engineer, I need to be able to easily subscribe to ordered changes to pages to keep search indexes up to date.
Timebox:
  • 2 weeks
Done is:
  • Schema reviewed and agreed with group, including Data Engineering, Research, and Wikimedia Enterprise
  • Schema is merged and deployed

For collaboration on this schema design, please use this MediaWiki Page State Change Event Schema Design google doc.

Details

This event stream addresses the “comprehensiveness” problem described in T291120: MediaWiki Event Carried State Transfer - Problem Statement

How is this different from what we already have?

We do not currently have a way to get real time updates of comprehensive MediaWiki state outside of MediaWiki.

We want to design MediaWiki event streams that can be used to fully externalize MediaWiki state, without involving MediaWiki on the consumer side. That is, we want MediaWiki state to be carried by events to any downstream consumer.

See also: Event Notification vs. Event-Carried State Transfer

We had hoped that MediaWiki entity based changelog streams would be enough to externalize all MediaWiki state. The MediaWiki revision table itself is really just an event log of changes to pages. However, this is not technically true, as past revisions can be updated. On page deletes, revision records are 'archived'. They can be merged back into existing pages, updating the revision record's page id. Modeling this as a page changelog event will be very difficult.

Instead, this page state change data model will support use cases that only care about externalized current state. That is, we will not try to capture modifications to MediaWiki's past in this stream.

This stream will be useful for Wikimedia Enterprise, Dumps, Search updates, cache invalidation, etc, but not for keeping a comprehensive history of all state changes to pages.

We aim to create a new page ‘entity’ based stream that can be used to ‘materialize’ the current state of any MediaWiki page. An entity based stream will have all kinds of changes (creates, updates, deletes, etc.) in a single stream. That is, the mediawiki.page_change stream will have page creates, page edits, page deletes, and possibly other types of changes (page properties changes?).

Decisions made

What is MediaWiki page state? What are the relevant entities?
  • wiki/database
  • page table data: e.g. page_id, page_title, etc.
  • actor: the user making a change to a page
  • revision
    • comment
    • content slots (MCR) (& content body)
    • rendered content slots (for derived/enriched streams)
    • editor (same as actor on edit events).
What is not MediaWiki page state (for now)
  • page properties: these are usually parsing hints, and are not persisted through edits.
  • editing restrictions: these are about edit restrictions on a page, not how the page looks. We could add these state changes later if we change our minds.
  • page links changes: we have this in a different stream already; we can join if this is needed.

page state changes and changelog kinds

What kind of page changes are we going to capture in this stream, and what 'changelog kind' do they map to? A 'changelog kind' is the type of change to apply to a state store: either an 'insert'/'create', an 'update', or a 'delete'. Each page change kind maps to exactly one changelog kind. (In Flink, these will be mapped to a RowKind.)

MediaWiki page change kind → changelog kind
  • create → insert
  • edit → update
  • current revision visibility change* → update
  • move → update
  • delete → delete
  • suppress → delete
  • undelete → insert

*This can happen if the comment or editor's user_text is hidden on the current revision.
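
As a rough illustration, the changelog kinds above would translate to Flink RowKind values roughly like this (the RowKind names are real Flink values; the mapping itself is just a sketch):

# sketch: changelog kind -> Flink RowKind
insert: INSERT
update: UPDATE_AFTER   # Flink pairs this with an UPDATE_BEFORE when the old row is also emitted
delete: DELETE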

Modeling decisions
  • We will make this schema well organized, in that we are not going to force ourselves to stick with previous event data model decisions. E.g. we will have a revision object with revision related data, rather than top level rev_id, rev_timestamp fields. NOTE: This decision is being revisited; see this comment.
  • Every page change event will have ALL of the data needed to represent the current page state (page content will be in a different stream). That is, a page move event will still have all the data about the page's current revision in it, even if only the title has changed (see the sketch below).
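
For illustration, a page move event might look roughly like this (a sketch only; field names are illustrative, not the final schema):

page_change_kind: move
wiki_id: enwiki
page:
  page_id: 123
  page_title: New_Title
  namespace_id: 0
  is_redirect: false
performer:
  user_text: ExampleUser
  is_bot: false
revision:
  rev_id: 456
  rev_parent_id: 455
  comment: example move comment
prior_state:
  page:
    page_title: Old_Title

The full page and revision state is present even though only the title changed; prior_state carries the previous title.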
Outstanding TODOs and unknowns
  • Nested vs flat/top level fields
    • It is difficult to work with nested fields in SQL. Perhaps flat is best. See this comment.
  • Deprecate meta.domain and meta.uri, and put that info top level
  • Message Keys
    • We'll need a message key data model too. Perhaps something like {"database": "enwiki", "page_id": 123} is enough (see the key schema sketch after this list).
    • We haven't yet had to think about message keys in Event Platform.
      • wikimedia-event-utilities (Java client), EventGate (HTTP produce API), EventStreams (HTTP consume API) need to support keyed messages, and likely validation of key schemas too.
  • Compacted Kafka topics
    • Can we maintain just one compacted Kafka topic for each of these streams, or do we need to maintain a non-compacted one (e.g. with suppressions in it), and a separate compacted one (where suppression deletions are null/tombstoned out)?
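
A minimal sketch of what a message key schema fragment could look like (hypothetical; no such fragment exists yet):

title: fragment/mediawiki/page/key
description: Hypothetical key schema for page entity streams (illustrative only)
type: object
properties:
  database:
    type: string
    description: the wiki database name, e.g. enwiki
  page_id:
    type: integer
    minimum: 1
required:
  - database
  - page_id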

Related Objects

Event Timeline


In our design meeting this past Wednesday, we talked about how to model revision content_slots. If we only ever needed to model the 'raw' content that is stored by MediaWiki, we could just do:

content_slots:
  main:
    content_model: wikitext
    content_body: ...
    ...
  other_slot:
    content_model: json
    content_body: ...
    ...

However, in the future, we'd like to support 'rendered' content in streams. It'd be nice if we could have a generic enough model to support that.

In the meeting it was suggested to add a rendered_content_slots field to event schemas that needed it, with the same data model as content_slots, but with different 'rendered' content_models/formats, e.g. 'html'.
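
For comparison, that suggestion (a separate rendered_content_slots field) might look like this (a sketch, reusing the illustrative fields from above):

content_slots:
  main:
    content_model: wikitext
    content_body: <wikitext content here>
rendered_content_slots:
  main:
    content_model: html
    content_body: <parsed html content here>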

This could work, but I'm having a bit of trouble with structuring the schema fragments in a nice way to do this. Here's another possibility that uses the same content_slots field to represent different 'renderings' in the same content slot name:

content_slots:
  main: 
    wikitext:
      content_model: wikitext
      content_body: <wikitext content here>
      content_sha1: ...
    html:
      content_model: html
      content_body: <parsed html content here>
      content_sha1: ...
  other_data_slot:
    json:
      content_model: json
      content_body: <json data here>

This is a double nested map field, the first level key being keyed by the slot name, the second level key being the content_model.

(If we wanted, we could restrict the allowed keys in the second level map to known content models).

In this way, the content_slots field can be used for multiple renderings/parsings of the same content slot.

@Protsack.stephan @fkaelin @Milimetric @gmodena @Mooeypoo, thoughts?

The only question I would ask: is rendered content different enough that we need to put it in a separate rendered_content_slots field?
I would argue that it is, because that content does not react to changes in templates (talking about HTML) and is basically a render for that point in time. A separate field creates the possibility to discern between content_slots and rendered_content_slots on a schema level, instead of documenting that the html version (or any other content type) in the content slot behaves a particular way.

At the same time not having rendered_content_slots makes it a little bit easier to understand the schema in general. So both of them sound good to me.

Thanks @Protsack.stephan. IIUC then, your preference is for the separate rendered_content_slots field, correct?

My opinions on this topic stem from my experience with Graphoid and the problems it ran into. That conversation stalled here: T249419#6295084.

Graphs are defined as json documents, you can find examples here. When people write these kinds of documents on the wikis, they would generally use templates to factor out boilerplate. So the raw wikitext would not have the full graph definition, but at parse time, the parser would construct it by resolving the tree of dependencies. Initially, this constructed document was stored in page properties and a key to it was stored in the wikitext. Since this was not how page properties were intended to be used, this approach created problems.

Ideally, the constructed document would be stored in a content slot. At face value, this looks the same as rendered html. But in practice, there is another layer of complexity. The graph document could dynamically reference another tree of dependencies: the data it visualizes. It's unreasonable for the parser to resolve this tree at parse time, but without resolving these dependencies, the rendered graph document may become useless after some period of time (as the data it refers to becomes unavailable or changes). Conceptually, there should probably be another content slot reserved for resolved dependencies for the graph. This would be filled in asynchronously.

I think we should talk about whether I'm right about that last sentence and whether better alternatives exist. But, if not, then for the purpose of this task, it would be nice if there was a way to capture adding a slot to a revision asynchronously.

This would be filled in asynchronously

AKA wikifunctions AKA Async content fragments?

Hm, @Milimetric I'm mostly asking about data modeling preferences, not so much what and how the data is filled in.

Option A: content_slots field that only represents raw / non-derived content slots. In a different event stream (and schema), when we need rendered / derived content (e.g. html) we add a new field rendered_content_slots (name TBD).

Option B: make our modeling content_slots possible for each slot to have multiple 'versions' of that slot's content.

Option A. is less complex as a schema (especially for querying purposes).
Option B. is more normalized and more future-proof.

Option A: content_slots field that only represents raw / non-derived content slots. In a different event stream (and schema), when we need rendered / derived content (e.g. html) we add a new field rendered_content_slots (name TBD).

Option B: make our modeling content_slots possible for each slot to have multiple 'versions' of that slot's content.

Option A. is less complex as a schema (especially for querying purposes).
Option B. is more normalized and more future-proof.

+1 to @Protsack.stephan reasoning.

I'm leaning towards option B: separate rendered_content_slots.

From my vantage point, if we want this stream to (eventually) become the source of truth, it might make sense to future proof and strive to be generic. To simplify querying, should use cases arise, we could always derive streams with a simpler (ad-hoc) schema.

Thanks @Protsack.stephan. IIUC then, your preference is for the separate rendered_content_slots field, correct?

@Ottomata Yes, you are correct.

Okay, thank you! rendered_content_slots makes organizing and referencing the schema fragments a little more difficult, but I can do it. I think I agree with you as well, especially from the querying standpoint; the double nested map would be pretty annoying to deal with in SQL.

The reason I was trying to spell out the graphs use case is that we need more than option B to model what we want to do. So I vote for option B, but add that we will need a way to say "this revision has a slot that will be updated later, check <<somewhere>> for it"

Adding idea discussed with @Ottomata earlier on. It's probably interesting to separate streams by project, to allow optimal reading for both all-projects readers and single-project readers.

"this revision has a slot that will be updated later, check <<somewhere>> for it"

This sounds a lot like rendered content and wikifunctions too. Slot content is never modified without a bump in revision id (excluding Derived slots, which I was told I should ignore). Rendered content of any kind can change without a bump in revision id. I'm also leaning towards modeling these separately for this reason.

It's probably interesting to separate streams by project

Yes, @gmodena @JAllemandou we should discuss this more. There are some implications here for stream config. Right now, consuming all e.g. revision-create events just to get the events for some small wiki is annoying, but isn't so bad because the events aren't that big. Once we add content, they get much bigger. We should probably make it possible for consumers to choose the project they want to get contentful events for. To do this, each project will need to have its own stream. We could probably also declare a composite stream (or regex stream :/ ewww) that includes all of the substreams, but this would be the first time we'd have multiple streams using the same Kafka topics.

Adding idea discussed with @Ottomata earlier on. It's probably interesting to separate streams by project, to allow optimal reading for both all-projects readers and single-project readers.

Do we already have a set of use cases for this layout? Other than the known data skewness considerations, I'd like to get a feel for how streams would be used.

@Ottomata I'd lean towards not building a composite stream. Depending on the use cases, we could consider materialising derived streams from the main "page-change" one (our source of truth). Maybe smaller wikis could be bucketed together. I don't have a good feel yet for the final data size with content included, but throughput seems reasonable.

This sounds like a separate thread though. Maybe we can spike some work on it?

Do we already have a set of use cases for this layout?

We have the WDQS-Updater: it reads everything but needs only wikidata.

This sounds like a separate thread though. Maybe we can spike some work on it?

+1, just wanted to call it out as something we should think about!

@gmodena, @Milimetric, @dcausse, I need some help with modeling current revision visibility changes.

We need to know when the visibility settings for the current revision of a page are changed. Visibility settings are per revision, not per page, so I am modeling them in the revision entity, which will be included in the page change event. So I've got:

page_change_kind: visibility_change
# ...
revision:
  rev_id: 123 # this is the current revision of the page.
  # ...
  visibility:
    comment: false # in this case the comment is not visible to public
    content: true
    editor: true

We set prior_state when we want to make it easier for the consumer to compare what has actually changed since the last event. In the existing revision-visibility-change stream (schema), we can capture these changes on a revision level. So it is easy to understand what visibility and prior_state.visibility mean; rev_id is the same.

To do the same in page_change, we'd do this:

prior_state:
  revision:
    rev_id: 123
    visibility: 
      comment: true
      # ...

So, by comparing revision.visibility and prior_state.revision.visibility, you can determine what has changed.

However, this is a little weird with our page entity changelog concept. Example:

Let's say a page is at revision rev_id 100. rev_id 100's parent is rev_id 99.
An edit happens that creates a new rev_id 101. A page change edit event will be produced with:

page_change_kind: edit

revision:
  rev_id: 101
  rev_parent_id: 100
  visibility:
    comment: true

prior_state:
  revision:
    rev_id: 100
    rev_parent_id: 99
    visibility:
      comment: true

Now let's say the comment is hidden on rev_id 101. If we model everything with the same prior_state logic, we'd get:

page_change_kind: visibility_change

revision:
  rev_id: 101
  rev_parent_id: 100
  visibility:
    comment: false

prior_state:
  revision:
    rev_id: 101
    rev_parent_id: 100
    visibility:
      comment: true

Perhaps this is okay, but IIRC, we previously used the existence of a field in prior_state to note that it had changed. However, fields like rev_id and rev_parent_id have not changed in this visibility_change page change event. I suppose it doesn't matter? One can still compare revision.rev_id to prior_state.revision.rev_id to know that it hasn't changed...and also one can expect that these will always be the same for visibility_change events.

Should I maybe only set the prior visibility, and skip any other revision details? E.g.

prior_state:
  revision:
    visibility:
      comment: true

Not sure what is best...but after typing all this out...maybe it doesn't matter? Thoughts?

Hm, actually doing ^^ (skipping extraneous revision details in prior_state) makes dealing with the prior state of the hidden properties easier too. If a comment is hidden, we probably should avoid reproducing it in the prior_state object.

Okay, hah, typing this out was a good rubber ducking exercise for me. Let me know what you think but I think I'm going to proceed this way.

One more question for you all.

The existing EventBus code uses an anonymous actor role when retrieving data fields for a revision, e.g. comment, editor, etc. This means that if the revision comment is hidden, it will just not be set.

But, the EventBus code also checks the strlen of the comment and, if it is 0, also avoids setting the comment field.

I'm inclined to change this behavior and just always set the comment field in the event if it is returned, even if it is the empty string. This would let us differentiate between a hidden comment (no comment field present) and an empty comment.
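
A tiny sketch of the proposed convention (illustrative field names):

# empty but visible comment: field present as the empty string
revision:
  comment: ""
  visibility:
    comment: true

# hidden comment: comment field absent entirely
revision:
  visibility:
    comment: false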

Sound okay?

@Ottomata just to clarify: the new page_kind visibility_change will only be emitted for the current revision, which can only happen for hiding comments? If yes, I wonder if this should not be stated explicitly, e.g. page_kind: [un]hide_comment or perhaps comment_visibility_change.
Regarding prior_state I agree with your approach; I think it will be difficult (impossible) to make it consistent if all the previous values of the fields have to be replayed. Its purpose should be to give consumers some hints on what might have changed.

I'm inclined to change this behavior and just always set the comment field in the event if it is returned, even if it is the empty string. This would let us differentiate between a hidden comment (no comment field present) and an empty comment.

If it'll change existing streams I'm unsure of the consequences on existing consumers.

Will this change existing revision_create behaviors or is it just for this new stream? If it's only for the new stream, I think this would match what the action API is returning for revisions: empty string when no comment is set, absence of the field if it's not visible.

comment as empty string

Will this change existing revision_create behaviors or is it just for this new stream?

Only the new stream. Great.

the new page_kind visibility_change will only be emitted for the current revision, which can only happen for hiding comments?

Content...cannot be hidden for the current revision. I just tried and got: "Revision visibility could not be updated: Error hiding the item dated 17:20, 7 September 2022: This is the current revision. It cannot be hidden."

Editor user_text can be hidden. Hm, I need to check how this works, I think right now the code will not emit any user information...but I guess only the user_text needs to be hidden. TODO for me.

I am sort of modeling a more abstract 'revision entity' here. revision.visibility.content will always be true for page_change stream. I guess we could omit it. But if we re-use this revision entity schema elsewhere, we'd probably want to be able to represent this piece of the revision state.

Alternatively, we could make this an array of visible values or a map field instead of an object? You know, I could make this a map field with a restricted set of keys. This would allow us to omit content visibility in this stream, but still set it in another revision entity in a stream if needed? Map fields are worse for schema discovery, but maybe it's okay here?
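
For example, a visibility map with a restricted key set could be sketched in JSONSchema like this (hypothetical fragment, not the merged schema):

visibility:
  type: object
  description: map of revision visibility flags, restricted to known properties
  additionalProperties:
    type: boolean
  propertyNames:
    enum:
      - comment
      - content
      - editor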

Too bad there's not a good way to serialize this in JSON as int with bitfield like MW does.
(I could just serialize the int, but then it'd be hard to use/query the data outside of MW.)
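
For reference, the rev_deleted bitfield values that these boolean flags decompose (MediaWiki RevisionRecord constants, if I recall them correctly):

DELETED_TEXT: 1        # content hidden
DELETED_COMMENT: 2     # comment hidden
DELETED_USER: 4        # editor user_text hidden
DELETED_RESTRICTED: 8  # suppressed, hidden from admins too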


However, in the future, we'd like to support 'rendered' content in streams. It'd be nice if we could have a generic enough model to support that.

Rendered content exists on two levels: per slot, and per revision. All slots are combined to generate the revision rendering. Most of MediaWiki is not aware of per-slot rendering, and there is currently no (efficient) way to access or expose it. Making it a first class citizen would need quite a bit of work in core. We investigated this when working on MCR, and it was dropped as YAGNI.

This might change once we make more use of slots. After all, it's annoying that we have to re-render all slots if one slot changes.

@daniel I am modeling slots in the revision entity and also in the page change event now. Easiest to see in the page_change examples: https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/807565/10/jsonschema/mediawiki/page/change/1.0.0.yaml#621

Schema of revision entity is here: https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/807565/10/jsonschema/fragment/mediawiki/state/entity/revision/1.0.0.yaml

Been meaning to find time to go through all these ideas with you. I'll put a meeting on our calendars for next week :)

I am sort of modeling a more abstract 'revision entity' here. revision.visibility.content will always be true for page_change stream. I guess we could omit it. But if we re-use this revision entity schema elsewhere, we'd probably want to be able to represent this piece of the revision state.

I've been thinking about this as I sketch out an architecture for the new dumps. We have to listen to visibility changes over the whole history, because dumps publishes every revision. So one way to do it is to listen to your stream and compact it with all visibility change events. Then we'd create revision.visibility.content as we produced. So if that's how it ends up being done, then I think you don't need it?

Other thoughts around this are that it seems like we'd be duplicating a lot of work; I'm reading more in depth now to see if that's just my bad intuition.

listen to your stream and compact it with all visibility change events

You mean join with? Yes I think that makes sense. Using the existing mediawiki.revision-visibility-change event stream to fill in historical revision visibility changes.

duplicating a lot of work

How so? say more! :)

In T212482#8294070 @daniel wrote:

Modeling an event as change-of-state-of-entity is intriguing, but I see two problems with it:

  1. you may want additional meta-data about the change itself, things that do not update the resource. E.g. the IP an edit came from.
  2. the state of an entity is potentially huge, e.g. a "page" could potentially be modeled as containing all its revisions, or maybe even all possible renderings... representing "before" and "after" fully in the event itself becomes impossible.

Perhaps it would make sense to model page changes a bit like git commits: they have meta-data, and contain a patch against a base state.

you may want additional meta-data about the change itself, things that do not update the resource

Yes, and drawing the line for what to include is a difficult one. We want the change stream to be as externally useful as possible, so we denormalize it a lot by putting more in it (editor info, etc.) than strictly belongs to a page. But, we don't want the events to include everything, so it is a fuzzy line and requires some guessing (and bikeshedding!)

could potentially be modeled as containing all its revisions

We are trying to model the 'current state' of the page, with some 'before' information where it makes sense. So we only need to include info about the current (and maybe parent) revision.

This has a downside in that we won't get information about changes to past revisions (e.g. revision-tags, old revision suppressions or, cough cough, 'derived' revision slots :p ), but that's a trade off we're willing to make.

model page changes a bit like git commits: they have meta-data, and contain a patch against a base state.

Modeling diffs is pretty hard. Here's how we decided to model state changes in events. The event is supposed to represent the current state. For cases where we want to capture what has changed, we include a prior_state object with the same top level fields/structure as the event. If a field is set in prior_state, you know it has changed since the previous value. Otherwise, it has not changed.

When doing stateful stream processing, this prior_state isn't strictly needed; you could just keep the state and diff yourself when you receive a change. But, for convenience when doing simpler stream consumption, we decided to represent old state like this.

@gmodena @dcausse I talked to @daniel today, and ended up with an interesting question about how we are modeling MediaWiki's content outside of MediaWiki.

I had been modeling content like this:

revision:
  rev_id: 19
  # ...
  content_slots:
    main:
      content_size: 1235
      content_format: wikitext/html
      #...

But, Daniel asked why we were including any content_slot information in our event if we weren't actually emitting content bodies from MediaWiki. Now, we do plan to have content bodies in the enriched streams, but, the question is, does that have anything to do with content_slots? If we only care about 'rendered' content, then the answer apparently is no, because the rendered content is generated from all the combined content slots.

But, are there cases where we do care about the raw content slots? E.g. for commons, do we want streams with the main wikitext and also the 'structured data' content slot?

@dcausse how would we like to model this for other kinds of enriched streams, e.g. wikidata triples or CirrusSearch renderings for ElasticSeach index updates? Do we care about slots, or do we only need the 'rendered' version?

If we only ever need the rendered versions, I think we can ignore the MCR slots stuff altogether. When we emit an enriched event, we can just have e.g.:

revision:
  rev_id: 19
  # ...
  content:
    size: 12345
    format: wikitext/html # (?)  or wikibase triple (?) or cirrussearch/html (?)
    body: <div>example example</div>  # or whatever is provided as the rendered content

@dcausse how would we like to model this for other kinds of enriched streams, e.g. wikidata triples or CirrusSearch renderings for ElasticSeach index updates? Do we care about slots, or do we only need the 'rendered' version?

Reasoning about MCR slots in a generic manner is very ambiguous so I'll stick to the commons usecase:

  • for search we blend some revision metadata, the main and mediainfo slots into a single indexable document.
  • for WDQS we render the RDF output that is only based on the mediainfo slot content (+ some revision metadata).

So indeed as of today for structured data on commons in search and WDQS an enriched stream with a single content field is sufficient. Note that these rendering functions are inside MW; if we were to extract them outside of MW then having raw content slots might make sense, but we don't have such plans and it's unlikely we ever will.

As for raw content slots, the search team does not have such use-cases, but there might be value to it (e.g. json dumps for the mediainfo entities). Does this justify a complex schema that supports emitting all slot content bodies in a single event? I'm not sure; I can't think of a use-case with commons where we'd need both the main wikitext slot and the mediainfo JSON slot in a single document. If we were to emit the raw mediainfo slot content I guess that a separate enriched stream can be created using the very same schema:

revision:
  rev_id: 19
  # ...
  content:
    size: 12345
    format: application/json
    body: {"type":"mediainfo","id":"M76","labels":{},"descriptions":{}}

The sole information that might be missing is the fact that it is the mediainfo slot (other than inferring this from the stream name itself).

Okay, then given that and also the discussion about rendered_content_slots above, let's avoid modeling RevisionSlots at all, for now. I will drop them from the page change schema that will be emitted from MediaWiki.

One note about slot information: if the page-event we're talking about here is envisioned to be the "reference" page state change (including for analytics purposes), I think it's worth adding slot information to it, unless there is a plan to deprecate it soon. The reason I say this is that when there is data representing something (in this case slot information, with its content-type etc), there usually are interesting analytics questions related to it, such as how often different content-types are used, and so on. Therefore if it's not too expensive, and if the event is to be the reference, and if the underlying model is here to stay, I'd rather have that information be present on the schema :)

@tstarling, @daniel advised that I should include more than just is_bot in our modeling of MW users in event data. Here's the WIP user model.

I'm mostly getting confused on how isTemp and isNamed fit in with all the other types/qualities a user can have. Is there any intention to eventually standardize this in MW, maybe similar to how Daniel described here? I could just add more is_temp, is_named, etc. boolean fields to the event model to match with what User.php has. But is that the right thing to do, or should I be more 'future proof' by e.g. adding a user_type (one of registered, anon, or system?) and a user_qualities (any of bot, temp, etc.)?
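
To make the two alternatives concrete (both sketches use hypothetical field names):

# A) boolean flags mirroring User.php predicates
performer:
  user_text: ExampleUser
  is_bot: false
  is_anon: false
  is_temp: false
  is_system: false

# B) a user_type enum plus a list of qualities
performer:
  user_text: ExampleUser
  user_type: registered        # one of registered, anon, system (temp?)
  user_qualities: [bot]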

I don't feel like I have a good understanding of all the ways MediaWiki wants to represent its user types, so I'm likely to do something wrong here. Any pointers you have would be appreciated.

I think is_temp is fine. I don't think it's future-proof to add user_type since if we decide we want such a concept in MW core, the details may differ from what you decide on here.

Daniel backed off from asking for a user type after he saw the code. I implemented it, but he didn't like it.

Your schema is missing imported users. Imported users have no user_id and have a name consisting of two parts separated by ">". The first part identifies the wiki and the second part is the foreign user name.

Temporary users are similar to imported users in that both interpret the user name in a special way. Whether you need to parse the user name and provide details to clients depends on whether the clients need that information. For most purposes, it's acceptable to treat temporary users as normal logged-in users.

After some more discussions with folks, there are needs to have wikitext / raw content in streams, e.g. for incremental dumps, for ML scoring, etc. I will keep the content_slots model, which we can use to enrich the stream with slot content later.

@daniel, Q about page suppress vs delete.

As far as I can tell, a page delete is a delete, and page suppression is just a page delete + setting all revisions 'deleted' (hidden) too, right?

I'm currently distinguishing between page delete and suppress as different state changes.

If so, perhaps I should just add an is_suppression_delete field to the event to indicate that it was a page delete with full suppress, and then make sure that the revision is_*_visible settings are all set to false, and that any included values for things like editor user_text and revision comments are redacted in the delete event. Do you think that would be more correct?

@daniel, Q about page suppress vs delete.

As far as I can tell, a page delete is a delete, and page suppression is just a page delete + setting all revisions 'deleted' (hidden) too, right?

Yes - from MediaWiki's internal perspective, "page suppression" doesn't exist. Things that can be suppressed are revision content, revision comment, and user names (on revisions and elsewhere). These are independent of each other. The revision itself is always "visible", it never vanishes from the history.

I wasn't aware that we had UI for "suppressing a page". Do we? My understanding was that the suppression would happen manually, in a second step, after the deletion. But please ask someone who actually knows the UI :)

By the way: technically, deleted pages do not exist either. Only archived revisions, which are associated with a page title and a page ID (note that the current page with this title may have a different ID). I think this is a design flaw and should change, but that's how it currently works.

I wasn't aware that we had UI for "suppressing a page". Do we?

(Screenshot: Screen Shot 2022-10-22 at 19.59.11.png)

"Suppress data from administrators as well as others" seems to 'delete' all revisions.

Ok, the annoying thing is that when deleting and checking the 'suppress data' box, the revisions are all moved to the revision archive table, and when inserting into it:

'ar_deleted'    => $this->suppress ? $bitfield : $row->rev_deleted,

So, the revision 'deleted' bitfield is set to a different value in the archive table than it was in the revision table.

In the onPageDeleteComplete Hook, the only indication I have that a page was 'suppressed' is in the ManualLogEntry. The RevisionRecord $deletedRev I am given in the hook does not have its own mDeleted (visibility) bitfield updated, so I can't indicate that the revision is 'suppressed' in the event.

It would be better if the Hook gave me a RevisionRecord that has the same data fields that are being inserted into the revision archive table.

Oh well, I guess the right thing to do is to manually update the event's is_*_visible fields if $logEntry->getType() === 'suppress'.

It would be better if the Hook gave me a RevisionRecord that has the same data fields that are being inserted into the revision archive table.

Oh well, I guess the right thing to do is to manually update the event's is_*_visible fields if $logEntry->getType() === 'suppress'.

Updating the RevisionRecord isn't trivial, but it would be easy to simply pass the $this->suppress flag into the hook if only we had hook param objects, per the discussion on T212482: RFC: Evolve hook system to support "filters" and "actions" only. Please keep telling people that we need to overhaul the hook system!

Change 807565 merged by jenkins-bot:

[schemas/event/primary@master] Add new mediawiki state entity and change fragments, and use them in new mediawiki page change schema

https://gerrit.wikimedia.org/r/807565

JArguello-WMF added a subscriber: JArguello-WMF.

@Ottomata Is the decision made something worth documenting in the decision log?

Change 851670 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0

https://gerrit.wikimedia.org/r/851670

Change 851673 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] Declare rc0.mediawiki.page_content_change stream

https://gerrit.wikimedia.org/r/851673

Change 851670 merged by jenkins-bot:

[schemas/event/primary@master] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0

https://gerrit.wikimedia.org/r/851670

Change 851673 merged by jenkins-bot:

[operations/mediawiki-config@master] Declare rc0.mediawiki.page_content_change stream

https://gerrit.wikimedia.org/r/851673

Mentioned in SAL (#wikimedia-operations) [2022-11-01T19:15:30Z] <otto@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Declare rc0.mediawiki.page_content_change stream - T307959 T308017 (duration: 03m 42s)

Ottomata added a subscriber: Tgr.

Re-opening to discuss a schema change.

In https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/853267, @Tgr wrote:

IMO it would make sense to remove formatComment from EventSerializer because the dependencies are onerous and limit when it can be used (CommentFormatter requires the parser, the parser requires among many other things a user context for language preferences, so trying to obtain it will result in an exception in no-session contexts and might result in unexpected behavior).

I think support for the html formatted comment was added in older events as part of T170145 and we've just kept it in.

I propose we remove the comment_html field from mediawiki/page/change altogether. If we want a parsed comment, we can add it in as part of an enrichment step.

Change 855146 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] development/ page change - Remove comment_html fields, bump to 2.0.0

https://gerrit.wikimedia.org/r/855146

2 more questions to answer:

Nested vs flat/top level fields

Right now, this schema uses Rows AKA Structs as nested fields to contain entity specific information, like page, performer, revision, revision.editor, revision.content_slots etc. Querying such nesting can be difficult in SQL. Perhaps it would be better to put as much as we can in top level entity prefixed fields instead, e.g. page_id, page_title, performer_user_id, rev_id, rev_editor_user_id, rev_content_slots, etc.
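
To make the comparison concrete (a sketch; field names are illustrative):

# nested (current approach)
page:
  page_id: 123
  page_title: Example
revision:
  rev_id: 456
  editor:
    user_id: 789

# flat, entity-prefixed alternative
page_id: 123
page_title: Example
rev_id: 456
rev_editor_user_id: 789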

Deprecate meta.domain and meta.uri

Using the meta field for domain specific event data is wrong. My preference would be to fully get rid of meta, but that would be quite an undertaking. Instead I propose:

  • Mark meta.domain and meta.uri as deprecated, but don't do a major meta schema bump to remove those fields.
  • Stop producing meta.domain and meta.uri in our new mediawiki/page/change event, but add top event level fields that contain the same information.

The consumer would need to know that the event doesn't have this info in e.g. meta.domain anymore, but I think that is fine, as the consumer would also need to know that this data WAS in meta.domain in the first place. Old events and consumers will continue to have this data, but new ones should not.

In https://phabricator.wikimedia.org/T317768#8400702 @Isaac wrote:

@Ottomata recognizing that this might be long past the time when you'd want this feedback but a question about an additional field:

Similar to is_redirect, we often use whether an article is a disambiguation / list page as a determination for how to handle it with ML models -- e.g., it's not intended behavior to run many models like add-a-link or the topic model on disambiguation / list pages. While I don't think list article is easy to determine without making a call to Wikidata (I assume that's out of the question), disambiguation pages are tracked by MediaWiki -- e.g., https://en.wikipedia.org/w/api.php?action=query&titles=Albert&prop=pageprops&format=json&ppprop=disambiguation.

What would be the process to consider whether this could be included as part of the page info in the event?

long past the time when you'd want this feedback

It is not! We still want and need your feedback. That's why this is currently an 'rc0' stream, and the schemas are in a '/development' namespace.

disambiguation pages are tracked by Mediawiki
prop=pageprops&format=json&ppprop=disambiguation

Hm, we had made the decision to not include page properties:

page properties: these are usually parsing hints, and are not persisted through edits.

I think that the right thing to do, if you need this kind of stuff, will be to join with other streams that have this information, e.g. mediawiki.page-properties-change (schema). Ooof, but actually, to do that we need T281483: mediawiki/page/properties-change schema should use map type for added and removed page properties.

Ottomata updated the task description.

It is not! We still want and need your feedback. That's why this is currently an 'rc0' stream, and the schemas are in a '/development' namespace.

Yay!

Hm, we had made the decision to not include page properties:

page properties: these are usually parsing hints, and are not persisted through edits.

I think that the right thing to do, if you need this kind of stuff, will be to join with other streams that have this information, e.g. mediawiki.page-properties-change (schema). Ooof, but actually, to do that we need T281483: mediawiki/page/properties-change schema should use map type for added and removed page properties.

I'm not following the aspect about page properties not being persisted through edits. Most of the page props I see (all props of enwiki) are largely-static properties of the page such as its display image or whether it's a disambiguation page. Most of them are irrelevant to our needs though I'm also quite interested in wikibase_item (connecting Wikipedia articles to Wikidata items can be a painful process so having the data present in one place saves that hassle). I can come up with weaker reasons why page_image and wikibase-shortdesc are also interesting but that's really only for specific models.

That said, if the disambiguation property isn't included in the page-change, here's what I imagine our options would be for using this stream as part of the pipeline for running LiftWing models on new edits:

  • We could have a separate job that watches for pages that add the disambiguation property and uses that to, e.g., clear predictions for a page from the Search index, but it wouldn't necessarily help us with the question of whether to run a model on a given edit.
  • We could do the same but maintain a table of e.g., all the pages that are disambiguation pages in a feature store on the ML platform that we could check against. I'd worry about that falling out of sync due to e.g., page moves etc. though so I think that would be a brittle solution.
  • I think we'd be most likely to just make the API calls to pageprops ourselves when it's relevant to whether a model should be triggered or not (which is fine but was hoping to remove that step if easy because I imagine it's an important piece of information for lots of ML models)

I'm not following the aspect about page properties not being persisted through edits

I don't know if I totally follow either, but there is more context in the initial collab design doc; see "Do we want page properties?" and the comment.

here's what I imagine our options would be

What about using the mediawiki.page-properties-change stream? You can join these two streams together on wiki_id and page_id, and keep state about the current page properties for a page_id, and use that to decide what to do. This would be similar to your idea of maintaining a feature store lookup table for page properties, except that the content of the table/feature store is updated from the stream directly.
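
As a sketch, an event enriched by such a join (hypothetical field names) could carry the current page properties alongside the page change:

wiki_id: enwiki
page_id: 123
page_change_kind: edit
page_properties:
  disambiguation: ""      # present only if the page is a disambiguation page
  wikibase_item: Q42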

Re-opening to discuss a schema change.

In https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/853267, @Tgr wrote:

IMO it would make sense to remove formatComment from EventSerializer because the dependencies are onerous and limit when it can be used (CommentFormatter requires the parser, the parser requires among many other things a user context for language preferences, so trying to obtain it will result in an exception in no-session contexts and might result in unexpected behavior).

I think support for the html formatted comment was added in older events as part of T170145 and we've just kept it in.

I propose we remove the comment_html field from mediawiki/page/change altogether. If we want a parsed comment, we can add it in as part of an enrichment step.

Sorry for the slow response! I just want to clarify that my point was about separating the class that does comment formatting (which is complex and has lots of dependencies) from the one that does the rest of the event object creation (which is fairly simple, and needed in some contexts where we don't have enough information to initialize a comment formatter). That could be done without changing anything about the events themselves - it's just that sometimes we have events which don't involve comments in any way, and happen in requests where we can't easily obtain a comment formatter, and we'd need a more lightweight serializer class for those. That could be solved by having two different kinds of serializers, or a serializer and a helper class for comments, for example.

I have no opinion on the proposal here one way or another, as I'm not familiar with how the data is used. It would certainly solve the problem that led to the patch mentioned above, but there are other ways to solve it, too.

Thanks @Tgr! At this point it is easy enough to remove, and we can always add it back in later if/when we need it. I'd prefer to solve this problem by making the event model simpler for now anyway.

I don't know if I totally follow either, but there is more context in the initial collab design doc; see "Do we want page properties?" and the comment.

@Ottomata thanks for that pointer. My summary after reading: Moriel's right that page props are for parser hints. That notwithstanding, they also happen to be useful for knowing when to trigger some ML models (there are certain things that are painful to infer in a consistent manner from wikitext so it's very helpful when we can rely on the parser for that). David raised the point about those properties potentially changing silently in between revisions (like the wikibase_item). For our use-cases, we're probably okay with having slightly stale data -- e.g., not triggering models when the wikibase_id changes but waiting till the actual page is edited. But I understand that other use cases might be less okay with that. Per our discussion last week though, stream enrichment is easy and this is all pretty specific to ML models so I'm now thinking that we likely will just want to work with ML Platform to create a stream that enriches the base page change stream with properties like is_disambiguation and wikibase_item.

What about using the mediawiki.page-properties-change stream? You can join these two streams together on wiki_id and page_id, and keep state about the current page properties for a page_id, and use that to decide what to do. This would be similar to your idea of maintaining a feature store lookup table for page properties, except that the content of the table/feature store is updated from the stream directly.

My concern about approaches like this is that they require us to build increasingly complex code to not fall out of sync with Mediawiki (akin to the heroic scale of what Joseph put together for mediawiki-history). For instance, I assume along with page-properties-change, we'd also have to watch page move logs and possibly others to actually maintain an accurate list of pages that are disambiguation pages. Perhaps there's some middle ground though where we watch page-properties-change to get realtime changes but still call pageprops API on edits to clean up the state for when it falls out of sync. I'll bring this up with ML Platform though.

build increasingly complex code to not fall out of sync with Mediawiki (akin to the heroic scale of what Joseph put together for mediawiki-history)

A goal of this task is to simplify the code that is needed to maintain the current state of MediaWiki pages outside of MediaWiki.

along with page-properties-change, we'd also have to watch page move logs and possibly others

Ideally you'd only need this new mediawiki.page_change stream, plus mediawiki.page-properties-change, unless I'm missing something.

accurate list of pages that are disambiguation pages

Is that known via page properties, or something else?

create a stream that enriches the base page change stream with properties like is_disambiguation and wikibase_item

If this info is in mediawiki.page-properties-change, it will probably be preferable to create this new enriched stream by joining the streams; that way the MediaWiki API is not involved, and backfilling history will not require throttling.