Page MenuHomePhabricator

Discussion of Event Driven Systems
Open, MediumPublic


Back in January 2021, the Architecture and Data Engineering teams hosted a 3 part book club on Ben Stopford's Designing Event Driven Systems (notes).
In September 2021, the Architecture and Data Engineering teams hosted a watch party and discussion of Martin Kleppmann's Turning the Database Inside Out talk (notes).
In spring of 2021, we had a meeting or two in a 'data infrastructure working group' where we talked about some of these ideas (notes)

If weren't able to make these past events, or you are new to some of these ideas, here's a list of background reading you could check out.

These events started some interesting discussions. We'd like to continue those discussions more asynchronously so we can have time to deep dive into the ideas. This ticket will be used as a discussion forum for interested parties.
I'll copy some of the higher level discussion ideas from the notes from those sessions, and we can expand on them in the comments as we see fit.

  • Event Sourcing / Events as source of truth
  • Turning the database inside out
  • How do deal with eventual consistency?
  • Event Driven Systems to solve runtime decoupling and data integration problems
  • Event Driven Systems to solve organizational problems
  • "Data on the outside" as first class citizen
  • CQRS and single writer principal
  • What event driven systems do we have at WMF now?
  • What would parts of MediaWiki look like we if were to redesign them as event driven?
  • ...

NOTE: This ticket is not meant to reach any conclusions. It is meant for educational and discussion purposes.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Ottomata triaged this task as Medium priority.Sep 1 2021, 9:11 PM

Today, @Andrew and @Cparle said they weren't sure what was meant by 'turning the database inside out', and also weren't sure how caches were related. Here are some quotes from the Designing Event Driven Systems book:

Turning the database inside out: a database has a number of core components—a commit log, a query engine, indexes, and caching—and rather than conflating these concerns inside a single black-box technology like a database does, we can split them into separate parts using stream processing tools and these parts can exist in different places, joined together by the log. So Kafka plays the role of the commit log, a stream processor like Kafka Streams is used to create indexes or views, and these views behave like a form of continuously updated cache, living inside or close to your application.

rather than querying data in a database, then layering caching over the top, we explicitly push data to where it is needed and process it there
“The database inside out” is an analogy for stream processing where the same components we find in a database—a commit log, views, indexes, caches—are not confined to a single place, but instead can be made available wherever they are needed.

stream processing flips the traditional approach to data and code on its head, encouraging us to bring data into the application layer—to create tables, views, and indexes exactly where we need them.

if you squint a bit, you can see the whole of your organization’s systems and data flows as a single distributed database. You can view all the individual query- oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data. You can view the stream processing systems [...] as just a very well-developed trigger and view materialization mechanism.

The “database inside out” idea is important because a replayable log, combined with a set of views, is a far more malleable entity than a single shared database. The different views can be tuned to different people’s needs, but are all derived from the same source of truth: the log.

Here's some of my (somewhat naive) perspective on link tables. I know there are lots of other problems, but this is stuff I've been thinking about over the years:

By *links tables I mean imagelinks, pagelinks, templatelinks, categorylinks, externallinks, langlinks, and iwlinks as shown in the “Link table” section of the db layout. Taken in aggregate, these tables define the dependency graph between different entities in the wiki universe. Files, articles, categories, templates, etc. Storing this conceptual dependency graph in mysql tables in the current format has a variety of shortcomings:

  • Lack of context in time introduces inaccuracies. For example, from the pagelinks docs: “Note that the target page may or may not exist, and due to renames and deletions may refer to different page records as time goes by”.
  • The relationships are often across wikis, but that’s not reflected in the data. For example, the imagelinks table often points to content on Wikimedia Commons. I don’t understand all the details but the images from commons are shadowed on the wikis they’re used on, such as: which is used on but is an image from commons: This tells me there’s application logic that gives additional meaning to the raw data.
  • Recursive updates can cascade and cause problems. For example, if a template changes, and it’s used by other templates, which are used by lots of other templates, and so on, re-rendering correctly becomes complicated. This is compounded with the problems above, especially inaccuracies developed over time.
  • Performance. Categorylinks used to be queried by community tools to enumerate pages with a certain category, or with that category X levels above in the hierarchy. A number of these tools were broken due to performance issues with the queries they used. This means community workflows had to change around limitations in our technology. Backlogs used to track different types of work are a prime example, lots of detail in this abandoned project: T155497.

I made this doodle of an "event driven mediawiki" architecture a while ago. I had forgotten about this, but listening the "db inside out talk" made me remember. I'd be curious to hear in how far this matches other people's ideas:

I made this doodle of an "event driven mediawiki" architecture a while ago. I had forgotten about this, but listening the "db inside out talk" made me remember. I'd be curious to hear in how far this matches other people's ideas

It seems a bit high level, but I think it's the main idea. The magic with these kinds of systems is how exactly you model each kind of event and how you adapt the rest of the system as you get better at that. Obviously a lot goes into your "Dependency Tracking" box, all of what I wrote above and much more. And it would probably feed from other events too, not just render, but again I think that's the basic idea. In the "Extraction" use case, I think that would be a way we process more than just the Render stream of events as well, pulling out stuff that enriches the query service from anywhere relevant. So there's lots to model there.

I'm trying to find something concrete and somewhat isolated that we can kind of all agree needs a different approach, like the link tables thing above but it doesn't have to be that. Just anything that we can start with. Because I think otherwise we risk navel gazing forever. (Sorry, I don't mean to be dismissive of anything, just excited to get going on a collaboration here, of which I think you'd be an awesome part, @daniel) is quite excellent. It focuses on more than just the 'event driven' part, but data products using events as source of truth is a crucial part of the ideas there.