
Add log entry details to page and user events in EventBus
Closed, Declined · Public

Description

EDIT: after a longer discussion below, we decided to add a new stream of events recording changes to the logging table. These would be very useful to us and to the collection of events in general. We will also add log id to the other events mentioned below. The strategy is basically to do both of these things and use whatever makes it out to production first.

Who: I'll be implementing this as part of our work on incremental updates.
What: We need a solid link between EventBus events and what we get out of the mysql databases. For revisions, this is rev_id, but for page and user events, we think log_id is great, maybe log_timestamp too if it's free.
When: I'm going to get started soon, hopefully we can wrap up reviews over the next couple of months.
Asks: I'd love some help with reviews and pointers if I get lost. Here are the events we need log entries for:

event                  | note
page deleted           | Available! and according to this has id and timestamp?
page restored          | no logEntry available yet
page moved             | no logEntry available yet
page create            | we have a rev_id for this so it's ok
user {create, rename}  | events not yet available, see T262205 and add log info to the schema
user alterblocks       | no logEntry available yet

Event Timeline

Could you expand a bit on what MySQL data you are trying to correlate the events to? It feels conceptually wrong to add 'log_id' to an event, because the log entry is not really a primary part of the action the event represents; it's more of a derivative of the primary event. The page is moved regardless of whether the log about it is generated (conceptually, not in practice).

Btw, we can probably add a log event, similar to a recentchange event.

Could you expand a bit on what MySQL data you are trying to correlate the events to? It feels conceptually wrong to add 'log_id' to an event, because the log entry is not really a primary part of the action the event represents; it's more of a derivative of the primary event. The page is moved regardless of whether the log about it is generated (conceptually, not in practice).

So, essentially, records in the logging table. The way mediawiki history reconstruction works is pretty simple:

  • take the current "state" out of the page, revision, and user tables
  • turn records from the logging table into "events"
  • walk back through these "events", matching them to the state to get each "step" back in history

For example, if the page table has a row with page_title == A, and we find a logging record moving B to A, at timestamp T, then we set page_title = B for the time range [START, T).
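The walk-backwards reconstruction described above can be sketched in a few lines. This is a minimal illustration with hypothetical data shapes (plain tuples, integer timestamps), not the real MediaWiki schema or the actual reconstruction job:

```python
def reconstruct_title_history(current_title, move_events):
    """Walk move events backwards to recover a page's title over time.

    current_title: the title in the page table right now.
    move_events: list of (timestamp, old_title, new_title), oldest first,
                 as derived from logging-table rows.
    Returns a list of (title, start, end) ranges, oldest first, where
    None means "since creation" (start) or "until now" (end).
    """
    history = []
    title = current_title
    end = None  # the current title is held until "now"
    # Each move whose new_title matches the current state tells us what
    # the title was *before* that move, i.e. one "step" back in history.
    for ts, old_title, new_title in reversed(move_events):
        if new_title != title:
            continue  # simplification: ignore moves not in this chain
        history.append((title, ts, end))  # title held during [ts, end)
        title, end = old_title, ts
    history.append((title, None, end))  # original title up to first move
    return list(reversed(history))
```

For the example in the thread (page table says A, one log row moves B to A at time T=10), this yields B for [START, 10) and A for [10, now).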

So as we reconstruct this way, our ultimate record of truth is the logging table, and specifically the log_timestamp. We record these in each state they are responsible for generating. So that's how the logging table data becomes the backbone of the history dataset. And now we're trying to update a snapshot from the EventBus events. It seems that in most cases these events happen after the change is saved to mediawiki, and therefore the logging information would be available.

I agree with you that it's conceptually weird to add this to the event, it should be an implementation detail of a consumer of the event, but alas the reality is a bit different. I do think your next idea (below) could totally work, though! Have to think about it some more, but yeah, if we could get all the log events we care about, totally, I think that's all we'd need!

Btw, We can probably add a log event, similar to a recentchange event.

My 2 cents on this.
For automated usage of data, we want to be able to join representations of the same actions in a (hopefully) non-fuzzy way. For revisions, the database + revision_id should be unique, good. For other events such as page-create, page-rename, etc., the action's representation in the mediawiki database is a row in the logging table. While I agree that in an event-centered world we'd possibly prefer to put the event id in the logging table, pushing toward events as the source of truth, both implementations work for our purpose and I don't really mind one or the other.

just to be clear, I'm not in a strong opposition to adding the log_id either, just wanted to better understand why.

So as we reconstruct this way, our ultimate record of truth is the logging table, and specifically the log_timestamp. We record these in each state they are responsible for generating. So that's how the logging table data becomes the backbone of the history dataset. And now we're trying to update a snapshot from the EventBus events. It seems that in most cases these events happen after the change is saved to mediawiki, and therefore the logging information would be available.

Why not just use the event timestamp instead of the log timestamp in the ongoing update? They should be within milliseconds of each other. Is it because you can't switch from streaming reconstruction to the old way seamlessly since different timestamps are used?

Why not just use the event timestamp instead of the log timestamp in the ongoing update? They should be within milliseconds of each other. Is it because you can't switch from streaming reconstruction to the old way seamlessly since different timestamps are used?

The EventBus event timestamps are up to 30 seconds away from the logging table timestamps in the small samples we looked at; I'd bet there are more extreme outliers. So joining with that much fuzziness seems scary even if we find no mismatches in the samples we have. If we had to do it that way, we would, but the connection to the logging table seems much cleaner.
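To make the fuzziness concrete: with a tolerance wide enough to absorb the observed EventBus lag (~30 seconds), two log rows close together in time can both fall inside the match window, and the join becomes ambiguous. A hypothetical illustration, with made-up log ids and integer second timestamps:

```python
def fuzzy_matches(event_ts, log_rows, tolerance):
    """Return ids of all log rows within `tolerance` seconds of the event.

    log_rows: list of (log_id, log_timestamp). An exact-key join would
    return at most one row; a timestamp join can return several.
    """
    return [log_id for log_id, ts in log_rows if abs(ts - event_ts) <= tolerance]

# Two log entries only 25 seconds apart (e.g. two quick moves of a page):
logs = [(101, 1000), (102, 1025)]
candidates = fuzzy_matches(1010, logs, tolerance=30)
# both 101 and 102 fall inside the 30-second window -> ambiguous match
```

An exact join on log_id has no such failure mode, which is why the logging-table link feels cleaner than timestamp proximity.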

Ok, I see. Do you have the capability to do stream joins in the system that builds the history (e.g. like in Flink)? If so, it could be really cool: we could make a mediawiki.log event and introduce a log_id into all relevant events so that you (and all the other stream consumers) could join the streams on log_id.

That would make the event system much more understandable and complete, since the log_id wouldn't be just some dangling integer pointing to who knows what; it would be clear that it points to the log events stream.

While it is possible to join streams with Flink, we'll start by using log events in Hadoop in batch mode. I agree it makes more sense to add a log_id to events if there is a log stream :) That represents some more work (adding yet another new event type), but I like it :)

mforns triaged this task as High priority.Sep 17 2020, 3:45 PM
mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

I like it as well if we have the log stream. We can separate the log stream into a followup, fine by me as well.

I'm game to do it all together. The new log stream might be easier, considering that changing a hook's signature is involved since T240307. Having the log ids anywhere gets us to minimum viable, so it seems smart to try both and use whichever one is done first. I'll update the task description and get to work :)

Change 651633 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[mediawiki/core@master] logging: add public getter for log entry id

https://gerrit.wikimedia.org/r/651633

Change 651634 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[mediawiki/extensions/EventBus@master] [WIP] Add handler and publish event for new log entries

https://gerrit.wikimedia.org/r/651634

Change 651635 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[schemas/event/primary@master] [WIP] Add log-entry create schema

https://gerrit.wikimedia.org/r/651635

odimitrijevic added a subscriber: odimitrijevic.

Removing from Kanban since there hasn't been any activity since Jan 21.

We are discussing ideas like this, but not exactly this, closing and linking from a strategy document I'm putting together. The work's here if we decide to go back to it.