Rework how mediawiki-history differentiates fake page-create from real ones
Open, Medium, Public

Description

Topic discussed with Dan on 2020-10-06, following interesting findings from vetting kafka-events against mediawiki-history:

  • mediawiki-history stores 2 different types of rows related to page-creation: event_entity: page / event_type: create and event_entity: page / event_type: create-page (names are misleading but let's not concentrate on that for now).
  • create-page rows come from the page-creation actions logged in the logging table. They have been available in project since 2018-06.
  • create rows are created artificially by the reconstruction algorithm and are placed at the page's first revision timestamp, unless there is a page-title collision. This can happen, for example, if the page has already been created by a page/create-page event. In that case, we set the timestamp to just before that page title comes into existence. This conforms to the constraint that only one page with a given title can exist at a given time, and it allows the page/create event to be consistently present for every page.
  • The reason we keep create rows even for pages that have a create-page row is to be able to join older revisions, if any, in the page history, as the denormalization join between page and revision rows uses page_id and timestamp. The timestamp aspect of the join enforces that, for a given page row P, any revision happening after P and before the next page row in time uses P for its historical page values.
  • Something else to notice is that we store, when available, both page_first_revision and page_creation timestamps.
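
To make the timestamp part of the join described above concrete, here is a minimal Python sketch. It is not the actual reconstruction code: the field names `ts` and `rev_id` and the helper `denormalize` are hypothetical, and treating a revision at exactly P's timestamp as belonging to P is an assumption.

```python
from bisect import bisect_right
from collections import defaultdict


def denormalize(page_rows, revisions):
    """For each revision, pick the latest page row at or before its timestamp."""
    by_page = defaultdict(list)
    for row in sorted(page_rows, key=lambda r: r["ts"]):
        by_page[row["page_id"]].append(row)

    result = []
    for rev in revisions:
        rows = by_page.get(rev["page_id"], [])
        # Index of the last page row whose timestamp is <= the revision's.
        idx = bisect_right([r["ts"] for r in rows], rev["ts"]) - 1
        matched = rows[idx]["event_type"] if idx >= 0 else None
        result.append((rev["rev_id"], matched))
    return result


# Hypothetical page with an artificial create row and a later logged create-page row.
page_rows = [
    {"page_id": 1, "ts": "2017-03-01T00:00:00", "event_type": "create"},
    {"page_id": 1, "ts": "2019-05-01T00:00:00", "event_type": "create-page"},
]
revisions = [
    {"rev_id": 10, "page_id": 1, "ts": "2017-03-01T00:00:00"},
    {"rev_id": 11, "page_id": 1, "ts": "2019-06-01T00:00:00"},
]

with_create = denormalize(page_rows, revisions)
without_create = denormalize(page_rows[1:], revisions)
```

In this sketch, dropping the artificial create row leaves revision 10 with no page row to join to, which is the reason given above for keeping both rows.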

A less confusing way to deal with having two different events referencing page creation is to use the create-page row when available and the artificially created create row otherwise, both under the same event type create, with a flag letting users know when the event has been artificially created based on our best assumptions. This would give create a clearer semantic for pages. However, it also requires changing the denormalization join between pages and revisions so that all revisions happening before the first create row of a page are denormalized using that first row.
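
A rough Python sketch of this proposal, under stated assumptions: the flag name `inferred`, the field `ts`, and the helpers `unify_create_events` and `page_row_for` are all hypothetical, and the real change would live in the reconstruction job rather than in code like this.

```python
from bisect import bisect_right


def unify_create_events(page_rows):
    """Keep one create row per page: the logged one if present, else the artificial one."""
    logged_pages = {r["page_id"] for r in page_rows if r["event_type"] == "create-page"}
    unified = []
    for row in page_rows:
        if row["event_type"] == "create-page":
            # Real creation from the logging table, exposed as a plain create.
            unified.append({**row, "event_type": "create", "inferred": False})
        elif row["event_type"] == "create":
            # Artificial row: keep it only when no logged creation exists.
            if row["page_id"] not in logged_pages:
                unified.append({**row, "inferred": True})
        else:
            unified.append(dict(row))
    return unified


def page_row_for(rows_sorted, rev_ts):
    """Latest row at or before rev_ts; earlier revisions fall back to the first row."""
    idx = bisect_right([r["ts"] for r in rows_sorted], rev_ts) - 1
    return rows_sorted[max(idx, 0)]


rows = [
    {"page_id": 1, "ts": "2017-03-01T00:00:00", "event_type": "create"},
    {"page_id": 1, "ts": "2019-05-01T00:00:00", "event_type": "create-page"},
    {"page_id": 2, "ts": "2016-01-01T00:00:00", "event_type": "create"},
]
unified = unify_create_events(rows)
page1_rows = [r for r in unified if r["page_id"] == 1]
# A revision older than the first create row still resolves to that row.
early = page_row_for(page1_rows, "2017-03-01T00:00:00")
```

The `max(idx, 0)` clamp is the join change mentioned above: revisions older than the first create row are denormalized using that first row instead of being left unmatched.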

This is a complex issue, I hope I make sense :S

Event Timeline

at the page's first revision timestamp if no page-title collision happens, or at earliest timestamp before collision otherwise (By doing so, we enforce a single page with a specific page-title can exist at any point in time)

How about:

at the page's first revision timestamp, unless there is a page-title collision.  This can happen, for example, if the page is already created by a page/create-page event.  In that case, we set the timestamp to just before that page title existing.  This conforms to the constraint that only one page with a given title can exist at a given time and it allows the page/create event to be consistently present across every page.

However, this also requires to change the denormalization join between pages and revisions so that all revisions happening before the first create row of page should be denormalized using that first row.

Might be useful to add an example here, like with restored pages being created with restored old revisions. Instead of changing the join logic, couldn't we just join based on the page_first_revision timestamp?

couldn't we just join based on the page_first_revision timestamp?

We'd need to choose the timestamp used for the join based on event type: use first-revision-timestamp for create events, and event-timestamp otherwise. We would also need to triple-check that each page has a single, unique create event. This solution is effective and easy to implement, I like it :)
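
A small Python sketch of what this could look like. The field names `ts` and `page_first_revision_ts` and both helper names are hypothetical, and falling back to the event timestamp when no first-revision timestamp is stored is an assumption.

```python
from collections import Counter


def join_timestamp(row):
    """Timestamp used on the page side of the page/revision join."""
    if row["event_type"] == "create":
        # Assumption: fall back to the event timestamp if no first revision is known.
        return row.get("page_first_revision_ts") or row["ts"]
    return row["ts"]


def pages_without_single_create(page_rows):
    """Page ids that do not have exactly one create event."""
    creates = Counter(r["page_id"] for r in page_rows if r["event_type"] == "create")
    page_ids = {r["page_id"] for r in page_rows}
    return sorted(p for p in page_ids if creates[p] != 1)


rows = [
    {"page_id": 1, "event_type": "create", "ts": "2019-05-01T00:00:00",
     "page_first_revision_ts": "2017-03-01T00:00:00"},
    {"page_id": 1, "event_type": "move", "ts": "2020-01-01T00:00:00",
     "page_first_revision_ts": "2017-03-01T00:00:00"},
    {"page_id": 2, "event_type": "move", "ts": "2018-01-01T00:00:00",
     "page_first_revision_ts": None},
]
join_keys = [join_timestamp(r) for r in rows]
suspicious = pages_without_single_create(rows)
```

Here page 2 is reported because it has no create event at all; a page with two create events would be reported the same way.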

Milimetric triaged this task as Medium priority. May 10 2021, 4:18 PM