
Common event data model for data derived from parsed page revision content
Open, Needs Triage, Public

Description

Both T360794: Implement stream of HTML content on mw.page_change event and T331399: Create new mediawiki links change streams based on fragment/mediawiki/state/change/page are about emitting event data derived from the HTML parsed version of MediaWiki page revisions.

Other projects also use events to represent data that is either (or should be) derived from the page revision HTML:

etc.

The output data of all of these depends (directly or indirectly) on the MediaWiki parsed HTML. The parsed HTML (and anything derived from it) can change for reasons other than edits: template or transclusion changes, the passage of time, different parser versions, etc.

E.g., A page's topic prediction might change because a template dependency was edited.

Propagating all changes due to reparsing is out of scope for current externalized (outside of MediaWiki) derived data projects. However, while we may not need to update externally stored parsed HTML derived data for MVPs, getting the data model right now will be important for when we do.

We primarily need a model for a reusable stable identifier for a specific page revision rendering.

This task should follow the precedent set by T308017: Design Schema for page state and page state with content (enriched) streams. Data Engineering and MediaWiki engineers should collaborate on designing a good data model and event JSONSchema fragment that can represent MediaWiki's concept of a 'rendering' with a render_id.

Done is:

Event Timeline

Ottomata updated the task description.

We know that we will need a stable identifier for a specific rendering.

Can/should we access and use MediaWiki Parser's render_id as a stable identifier for rendered content for:

If not, is there another way to detect if the derived data stored elsewhere is stale compared to what ParserCache has? Timestamp perhaps?

I've been doing a little brainstorming. Here are some raw notes from ideas. I hope they aren't too confusing presented like this! If they are we can meet and discuss and brainstorm together.

Here is a stripped down example of a page_change event with the page and revision entity models.

wiki_id: examplewiki
page:
  namespace_id: 1
  page_id: 1
  page_title: examplepage

revision:
  rev_id: 2
  rev_parent_id: 1  
  rev_dt: '2021-01-01T00:00:00.0Z'
  comment: changed a thing
  content_slots:
    main:
      content_format: text/x-wiki
      content_model: wikitext
      content_sha1: 16619839a55cfb5c61bcf520bf9734e0c67f98cc
      content_size: 100
      origin_rev_id: 2
      slot_role: main
      content_body: <body here optional>
  editor:
    user_id: 123
    user_text: example

Ideas

Option A: Add a parsed_data (name TBD) container field to revision

revision:
  rev_id: 2
  # ...
  parsed_data:
    render_id: 1234
    render_dt: '2021-01-01T01:00:00.0Z'    
    content_html: <html body here>

In different event(s), links might be modeled in parsed_data like:

revision:
  rev_id: 2
  # ...
  parsed_data:
    render_id: 1234
    render_dt: '2021-01-01T01:00:00.0Z'    
    page_links: array<page link model>
    external_links: array<string URLs>
    template_links: array<page link model>
    ...

Pros:

  • render_id only once
  • parsed_data associated with revision as it probably should be
  • parsed_data on par with content_slots model in revision model.

Cons:

  • Too much nesting?
  • derived data fields don't refer to render_id or rev_id on their own, so they are not ID-able in isolation.
  • A little strange to have to think about parsed_data?

Option B: Add fields to revision that each include render_id.

revision:
  rev_id: 2
  rev_dt: '2021-01-01T00:00:00.0Z'  
  # ...
  content_html: 
    render_id: 1234
    render_dt: '2021-01-01T01:00:00.0Z'    
    content_body: <html content here>

And in a different event(s) representing page link changes, this might look like:

revision:
  rev_id: 2
  rev_dt: '2021-01-01T00:00:00.0Z'  
  page_links:
    render_id: 1234
    render_dt: '2021-01-01T01:00:00.0Z'  
    links: array<page link model>
  external_links:
    render_id: 1234
    render_dt: '2021-01-01T01:00:00.0Z'  
    links: array<string URLs>
  template_links: 
    render_id: 1234
    render_dt: '2021-01-01T01:00:00.0Z'  
    links: array<page link model>

Pros:

  • derived data fields are associated with revision.
  • a little less nesting
  • each derived/parsed rendered data has its own render_id reference. More useful in isolation, but probably still would want rev_id.
  • derived data fields like content_html and page_links are on par with content_slots in revision model.

Cons:

  • render_id and render_dt duplicated in each derived data field.

Option C: Top level parsed_data container.

Same as Option A, but parsed_data is not nested in revision, so it should include some revision and possibly page ids.

page: ...
revision: ...
parsed_data:
  rev_id: 2
  rev_dt: <dt>    # do we need this here?
  page_id: 1      # might not need this?
  namespace_id: 1 # might not need this?
  render_id: 1234
  render_dt: '2021-01-01T01:00:00.0Z'
  content_html: <html body here>
  page_links: ...
  external_links: ...

Pros:

  • render_id etc. only once
  • less nesting

Cons:

  • derived data fields don't refer to render_id or rev_id on their own, so they are not useful in isolation.
  • duplicated info from top level revision and page fields, but only once.

Option D: Top level derived data fields

page: ...
revision: ...
content_html:
  rev_id: 2
  render_id: 1234
  render_dt: '2021-01-01T01:00:00.0Z'
  # page_id & namespace_id too?
page_links:
  rev_id: 2
  render_id: 1234
  render_dt: '2021-01-01T01:00:00.0Z'  
  # page_id & namespace_id too?
external_links:
  # ... also has repeat ids
template_links:
  # ... also has repeat ids

Pros:

  • Explicit. Each derived data field fully keyed with render_id and rev_id.
  • A derived data field is IDable in isolation.
  • even less nesting

Cons:

  • Lots of repeat ids.

Of all of these, I'm leaning towards something like either Option A or Option C, using a parsed_data
container somewhere. They minimize duplicate fields, but still strongly associate the derived data
fields with a specific rev_id and render_id.

Thoughts?

I think the render_id is useful as a concept but does that ID actually have any practical meaning? Like do we have plans to track it elsewhere in a way where someone might want to compare the data in this stream to a state somewhere else to check if it's from the same render ID or not? Otherwise, I think the ID itself doesn't tell you anything and might confuse folks while the important thing is the time at which it was rendered (render_dt in your model).

Otherwise no strong feelings about the particular schema. I think the reality is that this data is extremely nested in terms of where it comes from -- i.e. a wiki has a page which has a revision+editor which has a rendering which has extracted data. But if it's flattened out or flipped around, I think that's fine too so long as all the data shows up somewhere in the event.

I think the render_id is useful as a concept but does that ID actually have any practical meaning?

It's basically an e-tag. It tells you whether any data you have stored based on an earlier version is still valid, or if you need to re-compute.
The other use case is to fetch the actual HTML, if it's not contained in the stream itself.

The render time alone is insufficient as an identifier, though it may be useful for finding out which rendering is more recent.
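
To make the e-tag comparison concrete, here is a minimal sketch of the staleness check a consumer might run. The `StoredDerivedData` class and `is_stale` function are hypothetical names, and render_id is assumed to be an opaque value that is only ever compared for equality (typed as an integer here only because the examples above use 1234).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredDerivedData:
    """Derived data kept outside MediaWiki, tagged with the rendering it came from."""
    rev_id: int
    render_id: int  # assumption: opaque identifier, compared only for equality
    render_dt: str  # ISO 8601 timestamp string, as in the examples above


def is_stale(stored: Optional[StoredDerivedData], current_render_id: int) -> bool:
    """Return True when the stored data is missing or came from a different rendering."""
    if stored is None:
        return True
    return stored.render_id != current_render_id
```

Note that render_dt plays no part in the check; it would only be used to order two renderings once they are known to differ.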

cc @xcollazo in case you have any thoughts, since you will likely look at this data from a data lake querying perspective.

I think the render_id is useful as a concept but does that ID actually have any practical meaning?

It's basically an e-tag. It tells you whether any data you have stored based on an earlier version is still valid, or if you need to re-compute.
The other use case is to fetch the actual HTML, if it's not contained in the stream itself.

The render time alone is insufficient as an identifier, though it may be useful for finding out which rendering is more recent.

From a data lake point of view, how do we use this e-tag? Are we saying that we would have a process that periodically goes over all revisions, checks the e-tags, and re-fetches any that no longer match (as with an HTTP If-None-Match request)?

Also, if we keep an e-tag, don't we also need to keep a fully qualified URL? (Because we have multiple parsers, and thus a revision can have multiple associated HTML renders at any time?)

Regarding data layout, I favor Option A because it seems to repeat less data. As long as we use structs for the nesting we should be fine performance wise.

No maps please!
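
As a rough illustration of the struct-versus-map distinction, the sketch below writes an Option A style parsed_data field as a JSON Schema fragment with a fixed set of named properties. Everything here is an assumption for illustration: the `PARSED_DATA_FRAGMENT` and `is_struct_like` names, the field types (render_id as an integer simply mirrors the examples above), and the idea that downstream tooling treats an object with fixed `properties` as a struct and one described via `additionalProperties` as a map.

```python
# Hypothetical JSON Schema fragment, written as a Python dict.
# Field names come from the Option A example; the types are assumptions.
PARSED_DATA_FRAGMENT = {
    "type": "object",
    "properties": {
        "render_id": {"type": "integer"},
        "render_dt": {"type": "string", "format": "date-time"},
        "content_html": {"type": "string"},
        "external_links": {"type": "array", "items": {"type": "string"}},
    },
}


def is_struct_like(schema: dict) -> bool:
    """True for an object schema with fixed named properties and no map-style values."""
    return (
        schema.get("type") == "object"
        and "properties" in schema
        and "additionalProperties" not in schema
    )
```

A map-style field, by contrast, would describe its values with `additionalProperties` and leave the keys open, which is the shape being argued against here.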

Re repeating data: Within any use of the fragment/mediawiki/state/change/page model, the relevant ID fields are:

  • wiki_id
  • page.page_id
  • revision.rev_id

All 3 of these are needed to ID a revision (well, strictly page_id isn't). We don't repeat wiki_id or page_id in the revision object. If we follow this convention, perhaps it is not so bad to use render_id on its own outside of revision and not directly referencing rev_id or page_id, since those are in the event elsewhere.

Option C:

page:
  page_id: 1
  ...
revision:
  rev_id: 2
  ...
parsed_data:
  render_id: 123
  content_html: <body here>
  # or
  page_links: ...
  external_links: ...

We could do the same with Option D, but in cases where an event might contain multiple pieces of parsed data (page_links, external_links, etc.) render_id would be repeated.

So perhaps Option C, the top level parsed_data container field, is the most appealing?
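
With the Option C layout, a consumer would assemble the full identifier of a rendering from fields spread across the event. A minimal sketch, assuming the event has already been decoded into nested dicts; `RenderingKey`, `rendering_key` and `example_event` are hypothetical names, and the field paths follow the Option C example above.

```python
from typing import Any, Mapping, NamedTuple


class RenderingKey(NamedTuple):
    """Composite identifier for one rendering of one revision."""
    wiki_id: str
    page_id: int
    rev_id: int
    render_id: int  # assumption: integer, as in the examples above


def rendering_key(event: Mapping[str, Any]) -> RenderingKey:
    """Collect the ID fields from their separate top-level containers."""
    return RenderingKey(
        wiki_id=event["wiki_id"],
        page_id=event["page"]["page_id"],
        rev_id=event["revision"]["rev_id"],
        render_id=event["parsed_data"]["render_id"],
    )


# Stripped-down event in the Option C shape.
example_event = {
    "wiki_id": "examplewiki",
    "page": {"page_id": 1},
    "revision": {"rev_id": 2},
    "parsed_data": {"render_id": 123, "content_html": "<body here>"},
}
```

Calling `rendering_key(example_event)` yields the tuple ("examplewiki", 1, 2, 123), so nothing needs to be repeated inside parsed_data for the rendering to be fully identified within the event.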

Assigning this officially to you @Ottomata for tracking.

Update: we have decided (T351225#11629321) to try and pursue an HTML enriched stream which includes a diff to the parent revision HTML too.

While not directly related to a common data model, we will consider how we might include a diff to the parent HTML (and/or the parent revision HTML itself, if we were to have it) in the model too.