
Add log entry details to page and user events in EventBus
Closed, Declined · Public

Description

EDIT: after a longer discussion below, we decided to add a new stream of events recording changes to the logging table. These would be very useful to us and to the collection of events in general. We will also add log id to the other events mentioned below. The strategy is basically to do both of these things and use whatever makes it out to production first.

Who: I'll be implementing this as part of our work on incremental updates.
What: We need a solid link between EventBus events and what we get out of the mysql databases. For revisions, this is rev_id, but for page and user events, we think log_id is great, maybe log_timestamp too if it's free.
When: I'm going to get started soon, hopefully we can wrap up reviews over the next couple of months.
Asks: I'd love some help with reviews and pointers if I get lost. Here are the events we need log entries for:

event                  | note
page deleted           | Available! and according to this has id and timestamp?
page restored          | no logEntry available yet
page moved             | no logEntry available yet
page create            | we have a rev_id for this so it's ok
user {create, rename}  | events not yet available, see T262205 and add log info to the schema
user alterblocks       | no logEntry available yet

Event Timeline

Could you expand a bit on what MySQL data you are trying to correlate the events to? It feels conceptually wrong to add 'log_id' to an event, because the log entry is not really a primary part of the action the event represents; it's more of a derivative of the primary event. The page is moved regardless of whether the log about it is generated (conceptually, not in practice).

Btw, we can probably add a log event, similar to a recentchange event.

Could you expand a bit on what MySQL data you are trying to correlate the events to? It feels conceptually wrong to add 'log_id' to an event, because the log entry is not really a primary part of the action the event represents; it's more of a derivative of the primary event. The page is moved regardless of whether the log about it is generated (conceptually, not in practice).

So, essentially, records in the logging table. The way mediawiki history reconstruction works is pretty simple:

  • take the current "state" out of the page, revision, and user tables
  • turn records from the logging table into "events"
  • walk back through these "events", matching them to the state to get each "step" back in history

For example, if the page table has a row with page_title == A, and we find a logging record moving B to A, at timestamp T, then we set page_title = B for the time range [START, T).
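The walk-backwards reconstruction described above can be sketched in a few lines. This is a minimal illustration with hypothetical data shapes (plain tuples, integer timestamps), not the real MediaWiki schema or the actual reconstruction job:

```python
def reconstruct_title_history(current_title, move_events):
    """Walk move events backwards to recover a page's title over time.

    current_title: the title in the page table right now.
    move_events: list of (timestamp, old_title, new_title), oldest first,
                 as derived from logging-table rows.
    Returns a list of (title, start, end) ranges, oldest first, where
    None means "since creation" (start) or "until now" (end).
    """
    history = []
    title = current_title
    end = None  # the current title is held until "now"
    # Each move whose new_title matches the current state tells us what
    # the title was *before* that move, i.e. one "step" back in history.
    for ts, old_title, new_title in reversed(move_events):
        if new_title != title:
            continue  # simplification: ignore moves not in this chain
        history.append((title, ts, end))  # title held during [ts, end)
        title, end = old_title, ts
    history.append((title, None, end))  # original title up to first move
    return list(reversed(history))
```

For the example in the thread (page table says A, one log row moves B to A at time T=10), this yields B for [START, 10) and A for [10, now).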

So as we reconstruct this way, our ultimate record of truth is the logging table, and specifically the log_timestamp. We record these in each state they are responsible for generating. So that's how the logging table data becomes the backbone of the history dataset. And now we're trying to update a snapshot from the EventBus events. It seems that in most cases these events happen after the change is saved to mediawiki, and therefore the logging information would be available.

I agree with you that it's conceptually weird to add this to the event, it should be an implementation detail of a consumer of the event, but alas the reality is a bit different. I do think your next idea (below) could totally work, though! Have to think about it some more, but yeah, if we could get all the log events we care about, totally, I think that's all we'd need!

Btw, We can probably add a log event, similar to a recentchange event.

My 2 cents on this.
For automated usage of data, we want to be able to join representations of the same actions in a (hopefully) non-fuzzy way. For revisions, the database + revision_id should be unique, good. For other events such as page-create, page-rename, etc., the action's representation in the mediawiki database is a row in the logging table. While I agree that in an event-centered world we'd possibly prefer to put the event id in the logging table, pushing toward events as the source of truth, both implementations work for our purpose and I don't really mind one or the other.

just to be clear, I'm not in a strong opposition to adding the log_id either, just wanted to better understand why.

So as we reconstruct this way, our ultimate record of truth is the logging table, and specifically the log_timestamp. We record these in each state they are responsible for generating. So that's how the logging table data becomes the backbone of the history dataset. And now we're trying to update a snapshot from the EventBus events. It seems that in most cases these events happen after the change is saved to mediawiki, and therefore the logging information would be available.

Why not just use the event timestamp instead of the log timestamp in the ongoing update? They should be within milliseconds of each other. Is it because you can't switch from streaming reconstruction to the old way seamlessly since different timestamps are used?

Why not just use the event timestamp instead of the log timestamp in the ongoing update? They should be within milliseconds of each other. Is it because you can't switch from streaming reconstruction to the old way seamlessly since different timestamps are used?

The EventBus event timestamps are up to 30 seconds away from the logging table timestamps in the small samples we looked at; I'd bet there are more extreme outliers. So joining with that much fuzziness seems scary even if we find no mismatches in the samples we have. If we had to do it that way, we would, but the connection to the logging table seems much cleaner.
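To make the fuzziness concrete: with a tolerance wide enough to absorb the observed EventBus lag (~30 seconds), two log rows close together in time can both fall inside the match window, and the join becomes ambiguous. A hypothetical illustration, with made-up log ids and integer second timestamps:

```python
def fuzzy_matches(event_ts, log_rows, tolerance):
    """Return ids of all log rows within `tolerance` seconds of the event.

    log_rows: list of (log_id, log_timestamp). An exact-key join would
    return at most one row; a timestamp join can return several.
    """
    return [log_id for log_id, ts in log_rows if abs(ts - event_ts) <= tolerance]

# Two log entries only 25 seconds apart (e.g. two quick moves of a page):
logs = [(101, 1000), (102, 1025)]
candidates = fuzzy_matches(1010, logs, tolerance=30)
# both 101 and 102 fall inside the 30-second window -> ambiguous match
```

An exact join on log_id has no such failure mode, which is why the logging-table link feels cleaner than timestamp proximity.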

Ok, I see. Do you have the capability to do stream joins in the system that builds the history (e.g. like in Flink)? If so, it could be really cool: we could make a mediawiki.log event and introduce a log_id into all relevant events so that you (and all the other stream consumers) could join the streams on log_id.

That would make the event system much more understandable and complete, since the log_id wouldn't be just some dangling integer pointing to who knows what; it would be clear that it points to the log events stream.

While it is possible to join streams with Flink, we'll start by using log events in Hadoop in batch mode. I agree it makes more sense to add a log_id to events if there is a log stream :) That represents some more work (adding yet another new event type), but I like it :)

mforns triaged this task as High priority.Sep 17 2020, 3:45 PM
mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

I like it as well if we have the log stream. We can separate the log stream into a followup, fine by me as well.

I'm game to do it all together. The new log stream might be easier, considering that changing a hook's signature is involved since T240307. Having the log ids anywhere gets us to minimum viable, so it seems smart to try both and use whichever one is done first. I'll update the task description and get to work :)

Change 651633 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[mediawiki/core@master] logging: add public getter for log entry id

https://gerrit.wikimedia.org/r/651633

Change 651634 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[mediawiki/extensions/EventBus@master] [WIP] Add handler and publish event for new log entries

https://gerrit.wikimedia.org/r/651634

Change 651635 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[schemas/event/primary@master] [WIP] Add log-entry create schema

https://gerrit.wikimedia.org/r/651635

odimitrijevic added a subscriber: odimitrijevic.

Removing from Kanban since there hasn't been any activity since Jan 21.

We are discussing ideas like this, but not exactly this, closing and linking from a strategy document I'm putting together. The work's here if we decide to go back to it.