
Eventually Consistent MediaWiki State Change Events
Open, Medium, Public

Description

T116786 introduced MediaWiki Event-Platform production via an extension utilizing hooks. While adequate for the EventBus MVP, this is only an interim solution. Ultimately, we need a mechanism that guarantees event delivery (eventual consistency is OK).

The Event Platform program extended the work started in T116786 to provide standardized event-producing APIs, unified for both production and analytics purposes.

However, in order to build truly reliable new production services with events based on MediaWiki data, we need a single source of truth for MediaWiki data. That source of truth is the MediaWiki MariaDB database, which is only consistently accessible by MediaWiki itself. There is currently no way to consistently expose (real time) MediaWiki state changes to non-MediaWiki applications.

We do have events produced by MediaWiki, but these events are decoupled from the MariaDB writes, and there is no guarantee that e.g. every revision table save results in a mediawiki.revision-create event. This means that as of today, MediaWiki events cannot be relied on as a 'source of truth' for MediaWiki data. They are little more than a best-effort (albeit really good) notification.

Background reading: Turning the database inside out

Why do we need this?

I asked a few stakeholders to explain why this is important to them, and they gave me permission to quote them here. These are a few examples of why consistent events are important.

WikiData Query Service Updater - T244590

@Zbyszko:

... missed events are probably the biggest issue in the system. We have visibility into late and out of order events (and probably mostly buggy events, but there's no way of knowing for sure). Not only that, there are sensible ways of dealing with them, both in general and in our specific situation.

Missed events are, by their nature, invisible to us via standard means and hard to observe in general. Since we also don't really understand the situation when those are dropped, it's hard to assess the impact on WDQS updater. We decided we're ok with it for now, because it's simply still better than the previous solution.

To reiterate - we can deal with lateness and out-of-orderliness - dealing with missed events is an order of magnitude harder challenge.

Image Recommendations project - T254768

@gmodena:

Throughout the month, the state of an article can change. We'll need to track a "revisions events topic" to establish a feedback loop with the model regarding the following state changes (among others):

  1. Previously unillustrated articles that are now illustrated
  2. Articles illustrated algorithmically, that have been reverted
  3. Orthogonal (technically not a MW state change): track which recommendations have been rejected by a client.

Being late in capturing state changes would result in a degraded UX that will fix itself with time.
Missing events would be an order of magnitude harder problem to solve.

HTML wiki content dumps and other public datasets - T182351

@fkaelin:

Another category of tools that depend on the correctness of the events are derived datasets that the foundation could publish. This includes the equivalent of the wikidumps on which the analytics wiki history datasets are based, which could be replaced with a snapshot-less and continuous log of revisions. Another example is the html dumps discussed in T182351: Make HTML dumps available, which the OKAPI team can also relate to, and any number of other datasets that one can think of.

Wikimedia Enterprise AKA Okapi

@Protsack.stephan:

if you don't have consistent events, how else would you get the data you need for your use case? - We heavily rely on events to maintain our dataset. Basically we do CDC from event streams to maintain our dataset. Not having consistent events means that our dataset gets out of sync and we need to engineer something on top of events to make sure that it is consistent. Just FYI, we are acknowledging that events may not be consistent and putting that problem into a box for now, but that's probably going to be our next bridge to cross.

Potential solutions

Event Sourcing is an approach that event driven architectures use to ensure they have a single consistent source of truth that can be used to build many downstream applications. If we were building an application from scratch, this might be a great way to start. However, MediaWiki + MariaDB already exist as our source of truth, and migrating it to an Event Sourced architecture all at once is intractable.

In lieu of completely re-architecting MediaWiki's data source, there are a few possible approaches to solving this problem in a more incremental way.


Change Data Capture (CDC)

CDC uses the MariaDB replication binlog to produce state change events. This is the same source of data used to keep the read MariaDB replicas up to date.

Description
A binlog reader such as Debezium would produce database change events to Kafka. This reader may be able to transform the database change events into a more useful data model (e.g. mediawiki/revision/create), or the transformation may be done later by a Stream Processing framework such as Flink or Kafka Streams.
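To make the transformation step concrete, here is a minimal Python sketch of how a stream processor might map a row-level insert on the revision table to a higher-level revision-create event. The change-event envelope is Debezium-style, and the field names on both sides are illustrative assumptions, not the actual Debezium or WMF schemas.

```python
# Hypothetical sketch only: maps a low-level, Debezium-style change event for
# the revision table to a higher-level mediawiki/revision/create domain event.
# Field names on both sides are illustrative, not the real schemas.

from datetime import datetime, timezone
from typing import Optional


def binlog_change_to_revision_create(change: dict) -> Optional[dict]:
    """Return a domain event for row inserts on the revision table, else None."""
    payload = change.get("payload", {})
    if payload.get("op") != "c":  # "c" = row created (insert)
        return None
    if payload.get("source", {}).get("table") != "revision":
        return None

    row = payload["after"]  # column values after the insert
    return {
        "$schema": "/mediawiki/revision/create/1.0.0",  # illustrative
        "meta": {
            "stream": "mediawiki.revision-create",
            "dt": datetime.now(timezone.utc).isoformat(),
        },
        "rev_id": row["rev_id"],
        "page_id": row["rev_page"],
        "rev_timestamp": row["rev_timestamp"],
    }


if __name__ == "__main__":
    fake_change = {
        "payload": {
            "op": "c",
            "source": {"table": "revision"},
            "after": {"rev_id": 42, "rev_page": 7, "rev_timestamp": "20240101000000"},
        }
    }
    print(binlog_change_to_revision_create(fake_change))
```

A real deployment would run something like this inside a stateful stream processor (Flink, Kafka Streams), because building the full domain event also requires joins against other tables (page, actor, comment), which is exactly the con noted below.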

Pros

  • No MediaWiki code changes needed
  • Events are guaranteed to be produced for every database state change
  • May be possible to guarantee each event is produced exactly once
  • Would allow us to incrementally Event Source MediaWiki (if we wanted to)

Cons

  • Events are emitted (by default?) in a low level database change model, instead of a higher level domain model, and need to be joined together and transformed by something, most likely a stateful stream processing application.
  • WMF's MariaDB replication configuration may not support this (we may need GTIDs).
  • Data Persistence is not excited about maintaining more 'unicorn' replication setups.

Transactional Outbox

This makes use of database transactions and a separate poller process to produce events.

See also: https://microservices.io/patterns/data/transactional-outbox.html

Description
Here's how this might work:

  • MediaWiki wraps MariaDB writes for a web request in one transaction.
  • When an event is to be emitted, it is serialized and inserted into an event_outbox table.
  • Once the web request is finished, MW EventBus attempts to produce the event in a deferred update as it does currently.
  • If successful, the previously inserted row in the event_outbox table is deleted.
  • If failed, the previously inserted row can be updated with a failed_at timestamp and an error message.

A separate maintenance process polls the event_outbox table for rows, produces the events to Kafka, and deletes the row when the produce request succeeds.

NOTE: This example is just one of various ways a Transactional Outbox might be implemented. The core idea is the use of MariaDB transactions and a separate poller to ensure that all events are produced.
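As a rough illustration of the poller half of this pattern: the sketch below assumes a hypothetical event_outbox table (names and columns are made up), uses sqlite3 as a stand-in for MariaDB, and uses print() as a stand-in for the Kafka/EventGate produce call.

```python
# Minimal, self-contained sketch of an outbox poller. Table and column names
# are hypothetical; sqlite3 stands in for MariaDB and print() for Kafka.

import json
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS event_outbox (
    id         INTEGER PRIMARY KEY,
    stream     TEXT NOT NULL,          -- e.g. mediawiki.revision-create
    payload    TEXT NOT NULL,          -- serialized JSON event
    created_at TEXT NOT NULL,
    failed_at  TEXT,                   -- set by MW if the deferred produce failed
    error      TEXT
);
"""


def produce_to_kafka(stream: str, payload: dict) -> None:
    """Placeholder for a real Kafka/EventGate produce call."""
    print(f"producing to {stream}: {payload}")


def poll_outbox(db: sqlite3.Connection, batch_size: int = 100) -> None:
    """Produce any rows still sitting in the outbox, then delete them."""
    rows = db.execute(
        "SELECT id, stream, payload FROM event_outbox ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, stream, payload in rows:
        produce_to_kafka(stream, json.loads(payload))
        # Only delete once the produce has succeeded; a crash before this
        # point means the event is produced again later (at-least-once).
        db.execute("DELETE FROM event_outbox WHERE id = ?", (row_id,))
        db.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute(
        "INSERT INTO event_outbox (stream, payload, created_at) VALUES (?, ?, ?)",
        ("mediawiki.revision-create", json.dumps({"rev_id": 42}), "2024-01-01T00:00:00Z"),
    )
    conn.commit()
    poll_outbox(conn)
```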

Pros

  • Events can be emitted using whatever model we choose
  • Since MW generally wraps all DB writes in a transaction, no MW core change needed. This could be done in an extension.

Cons

  • At-least-once delivery guarantee for events, but this should be fine. There may be ways to easily detect duplicate events.
  • Separate polling process to run and manage.

Hybrid: Change Data Capture via Transactional Outbox

This is a hybrid of the above two approaches. The main difference is instead of using CDC to emit change events on all MariaDB tables, we only emit change events for event outbox tables.

This idea is from Debezium: https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/

Description
MediaWiki would be configured to write all changes in a transaction together with the outbox tables. When a revision is to be inserted into the revision table, a MariaDB transaction is started. A record is inserted into the revision table as well as into the event_outbox table. The event_outbox row has a field containing a JSON string representing the payload of the change event. The transaction is then committed.

A binlog reader such as Debezium would then filter for changes to the event_outbox table (likely extracting only the JSON event payload) and emit only those to Kafka.
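A minimal sketch of the transactional double-write at the heart of this approach is below. Table and column names are hypothetical and heavily simplified, and sqlite3 again stands in for MariaDB.

```python
# Sketch of the hybrid approach's write path: the revision row and its
# pre-serialized event land in the same transaction, so either both exist or
# neither does. Names are hypothetical; sqlite3 stands in for MariaDB.

import json
import sqlite3


def save_revision_with_outbox(db: sqlite3.Connection, rev_id: int, page_id: int) -> None:
    event = {
        "stream": "mediawiki.revision-create",
        "rev_id": rev_id,
        "page_id": page_id,
    }
    with db:  # one transaction: commits on success, rolls back on exception
        db.execute(
            "INSERT INTO revision (rev_id, rev_page) VALUES (?, ?)",
            (rev_id, page_id),
        )
        db.execute(
            "INSERT INTO event_outbox (payload) VALUES (?)",
            (json.dumps(event),),
        )
    # A binlog reader such as Debezium then watches only event_outbox and
    # forwards the JSON payload to Kafka; the revision table itself never
    # needs to be modelled downstream.


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
    conn.execute("CREATE TABLE event_outbox (id INTEGER PRIMARY KEY, payload TEXT)")
    save_revision_with_outbox(conn, 42, 7)
    print(conn.execute("SELECT payload FROM event_outbox").fetchall())
```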

Pros

  • Events can be emitted using whatever model we choose
  • Events are guaranteed to be produced for every database state change
  • May be possible to guarantee each event is produced exactly once
  • No need to transform from low level database changes to high level domain models.
  • Since MW generally wraps all DB writes in a transaction, no MW core change needed. This could be done in an extension.
  • Would allow us to incrementally Event Source MediaWiki (if we wanted to)

Cons

  • WMF's MariaDB replication configuration may not support this (we may need GTIDs).
  • Data Persistence is not excited about maintaining more 'unicorn' replication setups.

Reconciliation Capability

Devise a way to link an event to the database transaction/rev_id/whatever that generated it, and have a system that allows reconciliation. For instance: if a revision is missing from the event log but present in the database, generate the missing event.

e.g. T358373: [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions
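A reconciliation pass could look roughly like the sketch below, assuming we can list the rev_ids stored in MariaDB and the rev_ids observed in the stream for the same time window. All three helper functions are placeholders for real queries and a real produce call.

```python
# Sketch of a reconciliation pass over one time window. The data-access and
# produce functions are placeholders, not real APIs.

from typing import Set


def rev_ids_in_database(start: str, end: str) -> Set[int]:
    """Placeholder: SELECT rev_id FROM revision WHERE rev_timestamp BETWEEN start AND end."""
    return {1, 2, 3, 4, 5}


def rev_ids_in_event_log(start: str, end: str) -> Set[int]:
    """Placeholder: rev_ids observed in mediawiki.revision-create for the window."""
    return {1, 2, 4, 5}


def re_emit(rev_id: int) -> None:
    """Placeholder: rebuild the event from the database row and produce it."""
    print(f"re-emitting revision-create for rev_id={rev_id}")


def reconcile(start: str, end: str) -> None:
    missing = rev_ids_in_database(start, end) - rev_ids_in_event_log(start, end)
    for rev_id in sorted(missing):
        re_emit(rev_id)


if __name__ == "__main__":
    reconcile("2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z")
```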


2 Phase Commit with Kafka Transactions

This may or may not be possible and requires more research if we want to consider it. Implementing it would likely be difficult and error-prone, and could have an adverse effect on MediaWiki performance. If we do need Kafka Transactions, this might be impossible anyway, unless a good PHP Kafka client is written.

Related Objects

Event Timeline


Interesting thanks! So brainstorming how that would work for Debezium, since Debezium is just a slave process consuming a binlog, would it be possible to just stop it, change configs so it points at a new master, and start it? As long as the same binlog position exists on the old and new master, would that work?


I am not sure how this would work with our scripts to automate all this movement (and with orchestrator, which is most likely the tool we'll use in the future to handle replicas movement).

Right, I understand that the extra maintenance this would cause could be too onerous for Debezium to be a good solution to this problem; I'm mostly just trying to understand. Assuming the master swap was done manually, would the procedure I suggested work technically?

Another idea that may not be feasible: Would it be possible to move the event produce call out of the deferred update to before MediaWiki closes the MariaDB transaction? I.e.

  1. open MariaDB transaction
  2. insert into revision, etc.
  3. produce event
  4. close MariaDB transaction

In this way, we might produce spurious events which are not actually persisted in the database, but those should be easier to reconcile than missing events. E.g. most consumers would probably have to ask the MW API later for the revision content anyway, and if MW doesn't have the event's rev_id, the MW API request will let the consumer know.

I'm not sure what happens with the rev_ids in our MariaDB transactions though. Are they created and available to MediaWiki before the transaction is closed?
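For illustration, a consumer-side check for such spurious events might look like the sketch below. The endpoint and the badrevids handling are assumptions about the Action API's behaviour, and the code is not tied to any particular consumer.

```python
# Sketch of a consumer dropping events whose rev_id was never committed.
# Endpoint and response handling are simplified assumptions.

import requests

API = "https://en.wikipedia.org/w/api.php"   # example wiki


def revision_exists(rev_id: int) -> bool:
    resp = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "format": "json",
    }, timeout=10)
    data = resp.json()
    # Missing revisions are reported under query.badrevids (assumption based
    # on the Action API's usual behaviour); treat those as "does not exist".
    return str(rev_id) not in data.get("query", {}).get("badrevids", {})


def handle_event(event: dict) -> None:
    if not revision_exists(event["rev_id"]):
        print(f"dropping spurious event for rev_id={event['rev_id']}")
        return
    print(f"processing rev_id={event['rev_id']}")
```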

In addition to what @aaron said, let me give you a reliability perspective:

This is a really bad idea, it would make the database availability depend on the availability of eventgate, and make the two systems tied to each other, and/or to leave db transactions open for a long time anyways.

Would it make sense to e.g. add to eventgate a local queue on disk for events that have failed to submit? That should all but eliminate errors, down to tolerable levels.

it would make the database availability depend on the availability of eventgate, and make the two systems tied to each other, and/or to leave db transactions open for a long time anyways.

Yeah makes sense.

eventgate a local queue on disk for events that have failed to submit?

Could help, but it seems simpler and more complete to write to an outbox table in a transaction, no?

Drive by comments by yours truly:

  • Do we have estimations (or even better hard data) as to the number of missed events?
  • The few event-driven blogs and articles I've read, regardless of how the implementation is done, push for the idea of a commit log that contains the entire history of all events (allowing e.g. for very easy transformations, aggregations and other operations). I gather from the task that this is NOT what we want to do here, right? We just want to increase the reliability of having some very specific events produced and delivered, is that right?

Do we have estimations (or even better hard data) as to the number of missed events?

See T215001: Revisions missing from mediawiki_revision_create. A patch by Clara and Petr is going out with the wmf.5 train this week that may mitigate the majority of missing events.

the idea of a commit log that contains the entire history of all events [...] I gather from the task that this is NOT what we want to do here, right?

We do want this, but it is not certain (or likely?) that we will keep this entire history sourceable in Kafka. We keep this history in analytics Hadoop now, but one day imagine having a Shared Data Platform from which historical (and other) data can be bootstrapped/sourced from some 'cold storage', and then if desired, continuable from Kafka.

Anyway, this task is not about keeping the entire history in Kafka, but it is about making the events we emit as consistent as possible. Depending on the use case, downstream apps/datastores that have copies of the data may need some consistency reconciliation (i.e. lambda arch) to ensure the data is fully consistent over time, but the stronger we get the event streams to be consistent the better.

I think we'll some day (in a quarter or two?) file a Technical Decision Statement Overview that more broadly describes the problem as you state: making sure MW event data is consistent (as defined by some SLOs?) and (mostly) complete, meaning all relevant MW state change data is captured as events (we're missing things currently that could be very useful, like revision content etc.)

Do we have estimations (or even better hard data) as to the number of missed events?

See T215001: Revisions missing from mediawiki_revision_create. A patch by Clara and Petr is going out with the wmf.5 train this week that may mitigate the majority of missing events.

Ah, so ~1.5% of revision-create events are missing. Assuming this can be generalized to all events, that is quite a bit to be honest; we should indeed find ways to increase the reliability. Let's see if the patch above and similar changes in the past have had success at that.

the idea of a commit log that contains the entire history of all events [...] I gather from the task that this is NOT what we want to do here, right?

We do want this.

Let me just say that it sounds very implausible that it can happen. @aaron and @Ladsgroup, as well as @Joe, have pointed out in this task why that is; I will add that having 2 different distributed systems perfectly synced (to the point where they both contain the entire history of all events) is nigh impossible.

but it is not certain (or likely?) that we will keep this entire history sourceable in Kafka.
We keep this history in analytics Hadoop now, but one day imagine having a Shared Data Platform from which historical (and other) data can be bootstrapped/sourced from some 'cold storage', and then if desired, continuable from Kafka.

Sure, that sounds fine. As long as the expectation is that something might be missing from there, it sounds pretty reasonable to me.

Anyway, this task is not about keeping the entire history in Kafka, but it is about making the events we emit as consistent as possible. Depending on the use case, downstream apps/datastores that have copies of the data may need some consistency reconciliation (i.e. lambda arch) to ensure the data is fully consistent over time, but the stronger we get the event streams to be consistent the better.

+1 to both. Increasing the reliability of event producing is definitely the way to go here.

I think we'll some day (in a quarter or two?) file a Technical Decision Statement Overview that more broadly describes the problem as you state: making sure MW event data is consistent (as defined by some SLOs?) and (mostly) complete, meaning all relevant MW state change data is captured as events (we're missing things currently that could be very useful, like revision content etc.)

+1 again and +1 to the SLOs.

the idea of a commit log that contains the entire history of all events [...] I gather from the task that this is NOT what we want to do here, right?

We do want this.

Let me just say that it sounds very implausible that it can happen. @aaron and @Ladsgroup, as well as @Joe, have pointed out in this task why that is; I will add that having 2 different distributed systems perfectly synced (to the point where they both contain the entire history of all events) is nigh impossible.

Makes sense. I think what I would like to go for is events that are as (or almost as) consistent as MariaDB replication for a single MW database. Events that need cross-DB transactions are likely not worth the effort to improve their consistency.

New developments in this area are of interest: a watermark-based change data capture framework from Netflix that aims to do what this task is about, streaming data from source A to source B while taking into account an initial snapshot: https://arxiv.org/pdf/2010.12597v1.pdf

Debezium 1.7 incorporates some of the ideas from the Netflix paper.

Huh, very interesting paper! @Nuria did you read it? I mostly understand, but had some questions as I worked through the algorithm. In Figure 4, k2 and k4 are part of the final chunk written to the output buffer, but they also appear earlier in the binlog (and output buffer). Are those then duplicates in the output? It should be fine to have duplicates, as the state should be idempotent, but I am just curious since they don't mention it.

In https://phabricator.wikimedia.org/T215001#7523796 @Milimetric did some analysis on missing revision create events and determined that in the month of October mediawiki.revision-create was:

99.9986% reliable, or 0.0014% missing events.

There were only 67 missing events. Ultimately we shouldn't miss these events (especially if MariaDB replicas don't), so we still need to find a solution, but I think the priority of this is probably less than before Petr fixed the envoy timeouts. Great stuff!

This is the kind of thing we need to have a way to reconcile: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage

In the case of outages like this, state changes written to MediaWiki MariaDB still need to eventually make their way into the event streams if we are to use the events to carry state to other datastores and services.

Ottomata renamed this task from Consistent MediaWiki state change events | MediaWiki events as source of truth to Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth. Apr 17 2024, 3:21 PM

There is a lil discussion about this topic in T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable". Moving that discussion to here.

@Ladsgroup wrote:

those data could be regenerated from canonical data on the wiki

@Ottomata wrote:

This is very expensive, and requires complicated logic and maintenance to do.

@Ladsgroup wrote:

if you have a way to find mismatch and redo for a small portion of changes (e.g. even 1% of changes), it should be fine.

Agree, but how?

How should one find out they are missing records, and which records to fetch? If a Wikidata page is deleted, but you miss the page delete event, how does one eventually find out that the entity should be removed from WDQS? If you miss a suppression event, how do you ensure that you are not exposing PII?

Commenting here as well at the request of @Ottomata in T249745#9725953

In what is apparently 8.5 years now, this task has gathered comments from a multitude of people. Some of them have gone to great lengths to investigate and propose solutions to the problem. Much of the conversation has gone into data stores and ways of getting various solutions that already exist in the wider industry to work.

The justification section of this task highlights some use cases that would allegedly benefit from adopting one (or more) of the solutions evaluated in the task.

However:

  • None of those use cases set qualitative requirements or define what 'consistent events' means for them.
  • None of those use cases set quantitative requirements regarding what level of consistent events (the flip side of missed events) would be acceptable.
  • 3 of the listed use cases have their linked tasks resolved. The 4th never had a task. While I didn't dig into each specific ticket, it's possible some satisfied their use case already without needing any of the solutions investigated in this task.
  • In T215001#7523796 it was pointed out that missing events are now at a very low level.

So, I guess the question in 2024 is "Do we know of any (new/old) use cases that require a level of consistent events that is above 99.9999%? Which are those and what budget have they been allocated?"

I think there are two issues to be discussed here: defining qualitative requirements, and how to repair inconsistencies.
Regarding qualitative requirements, for search and WDQS we don't have a good sense of what would be good enough. The only visible criterion we have at the moment is users complaining about stale data, but without a concrete measurement of the instability it is hard to define a number, I guess. Could we do the other way around by starting to measure how consistent the streams are compared to the source of truth? Could this be done for some important streams like revision-create/page-delete/page-undelete/page-state by applying techniques similar to the one used in T215001#7523796? It is probable that missed events are rare in normal conditions but I still see huge spikes in the logs with many events failing to reach eventgate (T362977). Could there be ways to improve the situation at a reasonable cost?
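As a trivial sketch of what measuring stream completeness against the source of truth could look like (data access is stubbed out; in practice this would be a periodic comparison between a replica and the event data, per stream and per time window):

```python
# Toy sketch: per time window, what fraction of revisions in the replica also
# showed up in the stream? The two input sets are stubbed out here.

def completeness(db_rev_ids: set, stream_rev_ids: set) -> float:
    """Fraction of database revisions that were also observed as events."""
    if not db_rev_ids:
        return 1.0
    return len(db_rev_ids & stream_rev_ids) / len(db_rev_ids)


if __name__ == "__main__":
    db = set(range(1, 11))          # rev_ids written to MariaDB in the window
    stream = db - {7}               # one revision-create event was missed
    print(f"{completeness(db, stream):.2%} complete")   # -> 90.00% complete
```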

Regarding how to repair inconsistencies:
Currently we apply different techniques depending on the service.
For search we use an asynchronous repair that slowly re-scans all pages (this is not ideal, but without it we had users complaining about stale data). We also use mariadb to check whether the revision still exists before showing it to the user, but this is only possible because CirrusSearch has access to mariadb.
For wdqs we can't repair missed events, scanning everything even slowly is not an option with blazegraph so the only option we have is to re-import from the dumps.

If there could be a process that detects such missed events it would I think greatly improve the situation for wdqs at least.

There is a lil discussion about this topic in T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable". Moving that discussion to here.

@Ladsgroup wrote:

those data could be regenerated from canonical data on the wiki

@Ottomata wrote:

This is very expensive, and requires complicated logic and maintenance to do.

@Ladsgroup wrote:

if you have a way to find mismatch and redo for a small portion of changes (e.g. even 1% of changes), it should be fine.

Agree, but how?

How should one find out they are missing records, and which records to fetch?

It highly depends on the use case and the tolerance for failure. As was mentioned, global rename has a table to keep track of those. Many places use the page_touched field in the page table to trigger a reparse when a user visits the page, to make sure stale data won't be shown to the user even if the jobs fail to queue.

You mentioned Wikidata: it already queues a removal of the sitelink when an admin deletes articles in Wikipedia (or moves them), and that has worked well without complaints (except in one case where it wouldn't remove the sitelink if the admin didn't have an account in Wikidata, which caused user-facing complaints, but that's not related to this issue).

If a Wikidata page is deleted, but you miss the page delete event, how does one eventually find out that the entity should be removed from WDQS?

The delete logs are publicly available via APIs: you can simply store a "handled via the job" list of ids for a week in the updater service, and check every day against the list on wikidata.org, dropping any entity that has been deleted from wikidata.org but not reflected in WDQS. But again, do you need this? Have people complained about the mismatch? Isn't 99.999% enough? Honestly, in the list of WDQS issues, this mismatch probably won't even make it to the top 20.
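For illustration, the periodic check described above might look something like this sketch (parameters follow the Action API's list=logevents module; response handling is simplified, and the handled-ids bookkeeping is left abstract):

```python
# Sketch of a daily check of the public deletion log against what the updater
# believes it has handled. Simplified and not tied to any particular service.

import requests

API = "https://www.wikidata.org/w/api.php"


def recent_deletions(since_iso: str) -> set:
    resp = requests.get(API, params={
        "action": "query",
        "list": "logevents",
        "letype": "delete",
        "leend": since_iso,      # default direction is newest-to-oldest, stop here
        "lelimit": "max",
        "format": "json",
    }, timeout=10)
    events = resp.json().get("query", {}).get("logevents", [])
    return {e["title"] for e in events}  # titles (or ids) deleted on-wiki


def find_unhandled(handled_titles: set, since_iso: str) -> set:
    """Titles deleted on-wiki that the updater never processed."""
    return recent_deletions(since_iso) - handled_titles
```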

If you miss a suppression event, how do you ensure that you are not exposing PII?

I'm a bit confused on this, we currently don't remove anything from dumps or data lake or other parts of analytics infra. I think removal of 99.999% of those cases is better than 0. (Also obligatory link to part two of T241178#9438384)

Even if you want to implement something to make sure that is fixed, it's not that hard: you can look up suppression log ids in the db and keep track of the ids the service has handled in the service's database, check it once a day, and re-do the ones that are not handled.

In general, I'm not sure a general solution to a problem that varies vastly between different consumer services of mediawiki could make things easier. Every solution for each service needs to be implemented (or tolerated) with regards to the specifics of that service, what data they need and so on.

Ottomata renamed this task from Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth to Eventually Consistent MediaWiki State Change Events. Apr 30 2024, 11:47 AM

^ changed title to remove the controversial 'source of truth' terminology.

highlights some use cases

FWIW, this section was gathered as example use cases, not a comprehensive list. I gathered these quotes as a response to this comment https://phabricator.wikimedia.org/T120242#6919037

in 2024 is "Do we know of any (new/old) use cases that require a level of consistent events that is above 99.9999%?

There are many use cases, and also cases where products are simply not built, or built in less ideal ways, because of the difficulty in consistently replicating the required MediaWiki state outside of MW databases. When I get back from leave, I will collect and document more of these.

In the meantime, here is a list of (non quantified) use cases we made in 2023:
https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Processing/Use_cases

None of those use cases set quantitative requirements

@akosiaris I'd like to turn this question around though and ask: do the MediaWiki based products that use MariaDB replication always have to declare these requirements? Or is it just assumed that MariaDB replicas will be (more or less) consistent?

I ask this because it is possible to use events to carry state in a way that is just as eventually consistent as MariaDB. If we can do this, and it helps engineering and product teams accomplish their goals, why would we not want to do it? We should of course debate how we do this. Perhaps using events to carry the state is not the way we should do it! Is there a better way to consistently externalize MW state in real time? Let's find it!

starting to measure how consistent the streams are compared to the source of truth?

Perhaps T358373: [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions could be amended to provide a regular measure of this. That will work for revision create events at least. Not sure if we'd catch things like suppressions.

probable that missed events are rare in normal conditions but I still see huge spikes in the logs with many events failing

An example of this from 2021:
https://phabricator.wikimedia.org/T120242#7569570
https://wikitech.wikimedia.org/wiki/Incidents/2021-11-25_eventgate-main_outage

simply store a "handled via the job" list of ids for a week in the updater service, and check every day against the list on wikidata.org

look up suppression log ids in the db and keep track of the ids

These ideas are certainly possible to implement, but they add bug-prone complexity to the external service. This is what I mean when I say "This is very expensive, and requires complicated logic and maintenance to do."

This kind of complexity can prevent products from being built in the first place.

But again, do you need this? Have people complained about the mismatch? Isn't 99.999% enough? Honestly, in the list of WDQS issues, this mismatch probably won't even make it to the top 20.

A few missed events probably won't be noticed in the short term. But over the long term, the data will drift from the MediaWiki source-of-truth, and something will need to be done about it. Especially when there are more impactful outages.

What is often done now is a full re-bootstrap from some snapshot, either legacy XML Dumps (going away soon) or MariaDB replica snapshots in Hadoop. Using MariaDB snapshots means that non-MW apps depend directly on MW's own internal data model as a public API. As you know, this leads to problems when DBAs need to make schema changes. Ideally, external applications should never access MW MariaDBs directly. MariaDB is MW's own internal application state.

Re-bootstrapping like this is complex and error prone, and can be computationally expensive. Asking a product team to manage their own reconciliation state and/or implement periodic re-bootstrapping from data in Hadoop is a big ask.

I'm not sure a general solution to a problem that varies vastly between different consumer services of mediawiki could make things easier.

I think I agree here, which is why I'm skeptical of a general purpose reconciliation solution.

But, making the state change events eventually consistent is a general solution to this problem, no?

I'm a bit confused on this [content suppression event consistency], we currently don't remove anything from dumps or data lake or other parts of analytics infra. I think removal of 99.999% of those cases is better than 0.

Dumps 2.0 will be using the mediawiki.revision-visibility-change stream to redact. T351564: Implement enriched revision visibility stream

It isn't just Analytics infra we are talking about though. This is about the ability to build products outside of MW that don't expose PII.

WM Enterprise is an example. If visibility-change events are missed, WM Enterprise will continue serving hidden/PII content to its users.

Also obligatory link to part two of T241178#9438384

FWIW, we'd like to solve this using Kafka compacted topics. We could do this now without solving event consistency to get the 99.999% as you say though. But 100% is better than 99.999% ;)

I think there are two issues to be discussed here: defining qualitative requirements, and how to repair inconsistencies.
Regarding qualitative requirements, for search and WDQS we don't have a good sense of what would be good enough. The only visible criterion we have at the moment is users complaining about stale data, but without a concrete measurement of the instability it is hard to define a number, I guess.

I think we have a proxy for that in Grafana. Choose the stream(s) you want and you'll get the errors for those stream(s).

Could we do the other way around by starting to measure how consistent the streams are compared to the source of truth? Could this be done for some important streams like revision-create/page-delete/page-undelete/page-state by applying techniques similar to the one used in T215001#7523796?

If the approach above is generalizable, I don't see why not. The questions that would remain are who and when.

It is probable that missed events are rare in normal conditions but I still see huge spikes in the logs with many events failing to reach eventgate (T362977). Could there be ways to improve the situation at a reasonable cost?

If there are huge spikes with failures, it's an Incident, and it has a process for being handled. Part of the followup is the Incident review ritual, which should find ways to improve the situation at a reasonable cost. Note that incidents are inevitable. Their frequency might suggest otherwise, but one thing that is a given is that they will eventually happen.

Regarding how to repair inconsistencies:
Currently we apply different techniques depending on the service.

That matches my expectations. After all, every service has different needs.

For search we use an asynchronous repair that slowly re-scans all pages (this is not ideal, but without it we had users complaining about stale data). We also use mariadb to check whether the revision still exists before showing it to the user, but this is only possible because CirrusSearch has access to mariadb.
For wdqs we can't repair missed events, scanning everything even slowly is not an option with blazegraph so the only option we have is to re-import from the dumps.

If there could be a process that detects such missed events it would I think greatly improve the situation for wdqs at least.

Jobs can be retried if failed, maybe we could utilize that as a proxy?

highlights some use cases

FWIW, this section was gathered as example use cases, not a comprehensive list. I gathered these quotes as a response to this comment https://phabricator.wikimedia.org/T120242#6919037

in 2024 is "Do we know of any (new/old) use cases that require a level of consistent events that is above 99.9999%?

There are many use cases, and also cases where products are simply not built, or built in less ideal ways, because of the difficulty in consistently replicating the required MediaWiki state outside of MW databases. When I get back from leave, I will collect and document more of these.

Thanks!

In the meantime, here is a list of (non quantified) use cases we made in 2023:
https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Processing/Use_cases

This is a useful link, but arguably a superset of the things that would belong in the set of use cases I outlined above. E.g. I doubt recommendation use cases need 99.9999%+ of events. Similarly, if some events are lost for e.g. changeprop, RESTBase and CDN purging, it's ok. I haven't quantified it, but there are safeguards built into some of these services (manual purging, caches that have TTLs, retries, etc.)

None of those use cases set quantitative requirements

@akosiaris I'd like to turn this question around though and ask: do the MediaWiki based products that use MariaDB replication always have to declare these requirements? Or is it just assumed that MariaDB replicas will be (more or less) consistent?

I am not sure what the point is of turning the question around, but I'll entertain you. Always is the key word here. Nothing ever happens always. But yes, there are products that use MariaDB replication and react to changes to replication status. For lag, this is documented in https://www.mediawiki.org/wiki/Manual:$wgDBservers. The requirements are practically declared in the code and configuration of these products. They choose when to depool a lagging replica (doing otherwise would give unacceptably inconsistent data to users).

With that out of the way and to turn the question around once more. Can we set requirements for some of the use cases discussed?

I ask this because it is possible to use events to carry state in a way that is just as eventually consistent as MariaDB. If we can do this, and it helps engineering and product teams accomplish their goals, why would we not want to do it?

I never said we should not do it. What I ask is: what level is required per application? As pointed out above, not even MediaWiki products assume 100%, so when something like 99.999% (or more, arguably) is being shown as currently achieved, it's a very logical question, given the law of diminishing returns, to ask: why is this not enough? Why shouldn't we be prioritizing something else instead?

But yes, there are products that use MariaDB replication and react to changes to replication status.

QQ: is there a corresponding status for replica inconsistency, when lag is close to 0? Are there cases where revision records are missing from replicas and MW core or extensions expect and deal with this?

QQ: is there a corresponding status for replica inconsistency, when lag is close to 0? Are there cases where revision records are missing from replicas and MW core or extensions expect and deal with this?

Yes. For instance, RevisionStore has logic that will fall back to querying the primary database when information cannot be found on the replica, e.g. in loadSlotRecordsFromDb. That logic is intended for dealing with replication lag, but it would also work if data was permanently missing from the replica.

But not all our code is written to be defensive against that situation. It's really only done in the "critical bits". We are not consistent about documenting when and why it is needed. Generally, it's introduced as a fix when a problem arises.

FWIW, for each section we have a master and a "candidate master". We always swap them when we need to do maintenance. So there is no real "source of truth". There are two (not to mention we change candidate masters from time to time if there are hw issues, refreshes, etc.). So it is quite possible that we might even lose canonical data due to inconsistencies between master and its candidate. It is extremely rare but not out of the question.

This is the level of data inconsistency that we are willing to tolerate in MW, as Alex put it nicely:

given the law of diminishing returns, to ask: why is this not enough? Why shouldn't we be prioritizing something else instead?

So there is no real "source of truth".

So it is quite possible that we might even lose canonical data due to inconsistencies between master and its candidate. It is extremely rare but not out of the question.

Wow that is very interesting!

Does that mean that it is possible (although unlikely) that PII may be suppressed in one master but not the other? Or a page deleted in one master but not the other?

This is the level of data inconsistency that we are willing to tolerate

IMO this is also the level of data inconsistency we should aim for in externalized state as well.

But not all our code is written to be defensive against that situation.

@daniel what if somehow a page delete is missed in a replica?

So there is no real "source of truth".

So it is quite possible that we might even lose canonical data due to inconsistencies between master and its candidate. It is extremely rare but not out of the question.

Wow that is very interesting!

Does that mean that it is possible (although unlikely) that PII may be suppressed in one master but not the other? Or a page deleted in one master but not the other?

It is possible, but it is so unlikely that we have never had a case of a user reporting such issues since I've been around. I'm not sure arguing over something hypothetical but so unlikely that we have never encountered it in the past decade would make much sense.

Also noting that if such things happen, since it's the canonical data, users notice it and then redo it.

This is the level of data inconsistency that we are willing to tolerate

IMO this is also the level of data inconsistency we should aim for in externalized state as well.

May I ask why? For example, why would an AI system for vandalism detection need the same consistency guarantees as the canonical storage of edits itself?

And again, if you're thinking of removing PII from dumps and not willing to tolerate 0.0001% of cases being missed, maybe fix the existing issues (T241178#9438384)?

IMO this is also the level of data inconsistency we should aim for in externalized state as well.

It's sure nice to have, but the question is how much we are ready to pay for it. Basically, there is no such thing as 100%. All we can do is add 9s to the 99.999%. And every 9 multiplies the cost by n (with n close to 2, if I had to guess).

IMO this is also the level of data inconsistency we should aim for in externalized state as well.

Aside from echoing the above commenters, I'll point out that you probably want to justify that opinion with use cases, requirements, etc., to avoid the risk of running into a pointless opinion war when/if somebody else shows up with a different opinion.

justify that opinion with use cases

Fair point! I will, or fail trying! :) (I probably should have left that "IMO" out of that last comment)

May I ask why?

I'll try again.

When there are problems producing an event (in one of the envoy proxies, because of restarts/timeouts, or because eventgate is just down, etc.), it's not just one event here and there that is missed. It can be batches of events all missed together. When this happens, it is likely that important state change events are missed: suppressions, page deletes or creates, etc.

To avoid this problem we either need to ensure (as best we can) events are not missed, or reconcile the differences. As Amir stated, there are certainly ways that individual services can manage this reconciliation themselves. However, the need to do this is a barrier to entry for developing these services. It is not easy to build and deploy services at WMF, and even more so when those services need to transform or serve state from MW outside of MW.

If we can make it easier to use MW state outside of MW, we will make it easier to develop novel features and products.

I'll re-link T291120: MediaWiki Event Carried State Transfer - Problem Statement too. I hope that explains the problem a little more and lists a few more use cases.

I'm beginning to think that what is needed is a rephrasing of the problem along the lines of "How to make it easier to use MW state outside of MW?". I'll admit this ticket (and T291120) focuses on using events to solve this problem. If there are other and better ways to solve the problem, let's find it.

(NOTE: I still owe more quantified justification, that will take time to gather.)

Ottomata updated the task description.

(^ Updates are clarifications to the Transactional Outbox solution.)

Jobs can be retried if failed, maybe we could utilize that as a proxy?

@akosiaris This sounds similar to the Transactional Outbox solution listed in the description. Or did you mean something different?

As pointed out above, not even MediaWiki products assume 100%

@akosiaris, just so I understand this point: Are you saying that MediaWiki products do not assume that the MariaDB replicas will eventually be in sync with the masters? Is that true?

Jobs can be retried if failed, maybe we could utilize that as a proxy?

@akosiaris This sounds similar to the Transactional Outbox solution listed in the description. Or did you mean something different?

No, I didn't have Transactional Outbox in mind when writing that. I was answering to

If there could be a process that detects such missed events it would I think greatly improve the situation for wdqs at least.

pointing out that Jobs, including their submission from MediaWiki IIRC, can be used as a proxy (in the sense of a surrogate, not the technical proxy sense) to detect and fix such missed events.

As pointed out above, not even MediaWiki products assume 100%

@akosiaris, just so I understand this point: Are you saying that MediaWiki products do not assume that the MariaDB replicas will eventually be in sync with the masters? Is that true?

No, I am saying that MediaWiki products assume that the MariaDB replicas will be inconsistent with regard to the masters at various random points in time, and accept that inconsistency, up to a configured threshold.

pointing out that Jobs, including their submission from MediaWiki IIRC, can be used as a proxy (in the sense of a surrogate, not the technical proxy sense) to detect and fix such missed events.

Ah okay! So this is a reconciliation solution. We should explore this more to see if we can do it generically, and list pros and cons in the task description.

Are you saying that MediaWiki products do not assume that the MariaDB replicas will eventually be in sync with the masters? Is that true?

No, I am saying that MediaWiki products assume that the MariaDB replicas will be inconsistent with regard to the masters at various random points in time, and accept that inconsistency, up to a configured threshold.

Got it, that is my understanding too. They assume eventual consistency. (Right?)

I'm beginning to think that what is needed is a rephrasing of the problem along the lines of "How to make it easier to use MW state outside of MW?". I'll admit this ticket (and T291120) focuses on using events to solve this problem. If there are other and better ways to solve the problem, let's find it.

If that phrasing had been used when this task was opened, I hypothesize that the discussion would have flowed completely differently, as we wouldn't be discussing solutioning but problems and use cases instead.

I wanna point out that externalizing the state of an application is best done on terms that the application controls, e.g. via the APIs that it exposes to end users. Figure out what the clients want and expose that state in a technical way that allows the application to remain in control of, and free to alter, whatever is used to store and represent state internally, avoiding ossified state stores and change grinding to a halt.

In other words, bypassing the application or working around the application should be a no go.

Got it, that is my understanding too. They assume eventual consistency. (Right?)

No. Eventual consistency is so loose and weak that all that it guarantees is that reads across replicas will eventually return the same value. It gives no guarantees about when that will happen. The heat death of the universe is a valid value for when, and so is 1 minute from now or 1 week from now. But arguably worse, there is no guarantee on how conflicts are handled when they arise leading to inconsistent data when there are races between writes. You can very easily get back a value you didn't expect, even after all reasonable eventuality thresholds have been crossed. Thankfully MySQL/MariaDB replication is by default less weak. It is asynchronous (or more correctly semi-/plesiosynchronous) and lags by default, but in the default configuration it won't provide you with inconsistent data when there are races.

By the way, eventual consistency is a very abstract model, it can be misleading to apply it to real applications. It's so abstract, that it's not even listed in Aphyr's Consistency models

state of an application is best done on terms that the application controls, e.g. via the APIs that it exposes to end users.

I agree. In T120242#9757550 I wrote:

Using MariaDB snapshots means that non-MW apps depend directly on MW's own internal data model as a public API. As you know, this leads to problems when DBAs need to make schema changes. Ideally, external applications should never access MW MariaDBs directly. MariaDB is MW's own internal application state.

That is... unless MW and DBAs explicitly decide that they want to use MariaDB as the 'public' data API (cough cough Cloud Replicas). I don't think they want that though ;) But if they did, it would have to be very explicit, with trade-offs documented.

bypassing the application or working around the application should be a no go.

Curious: Do you consider the application producing state change event streams (schemaed and versioned) "bypassing the application"?

Event streams are just a form of data API (or data product), albeit not an HTTP service based one. I agree that consuming application database state directly is akin to using a private programming API. But if the app decides to manage exporting data as events as a data product for outside consumers, this seems fine (and good!) to me.

Got it, that is my understanding too. They assume eventual consistency. (Right?)

No. Eventual consistency is so loose and weak that all that it guarantees is that reads across replicas will eventually return the same value. It gives no guarantees about when that will happen. The heat death of the universe is a valid value for when

Hm, I thought that was the definition of eventual consistency. Where eventual at maximum == heat death of universe... but in most cases it's better than that ;)

guarantee on how conflicts are handled when they arise leading to inconsistent data when there are races between writes

But still, only one of the writes will win, even if it is not defined which one? From a replica perspective, is the replica not still assumed to be eventually consistent with whatever write wins on the master? (Or am I misunderstanding completely?)

You can very easily get back a value you didn't expect, even after all reasonable eventuality thresholds have been crossed.

By "eventuality thresholds" do you mean time passing? If so, I understand and agree. But if you mean e.g. 'offset' in replica binlog, then I'd like understand more. The replica or consumer can use the log offset to calculate lag, and know that it is behind.

eventual consistency is a very abstract model, it can be misleading to apply it to real applications

Perhaps this is because Eventual Consistency != C in CAP, even though they use the same word. IIUC, in CAP, Consistency means "Every read receives the most recent write or an error.", which is absolutely not true for "Eventual Consistency".

So, once again I think I am realizing that wording may be the cause of some tension or disagreement here. Perhaps instead of "Eventually Consistent" we should say "Eventually Complete"? Not sure.

At the risk of being obtuse, let me ask my MW MariaDB question again with different terms:

In normal conditions, does MediaWiki assume that an individual MariaDB replica, up to offset N in the binlog, has the same tables and rows as the master had, when it emitted the corresponding offset N in the binlog?


I'm asking all these questions to try and understand how and why some of the proposed solutions don't meet the criteria desired here: no (or an expected few) missed state changes in external state stores. If they don't, we can throw them away outright. I think it is possible to achieve this goal by leveraging database transactions and using those to ensure events are propagated.

That doesn't mean we should use database transactions and events, but I'm just trying to discern if it is possible to do so.

If that phrasing had been used when this task was opened, I hypothesize that the discussion would have flowed completely differently, as we wouldn't be discussing solutioning but problems and use cases instead.

Hm, perhaps! I will try this approach in a different ticket, and start from this higher level problem. I'll do this as part of the use case quantification we discussed above. That will likely be part of a different ticket then.

However, I think exploring solutions and assessing feasibility and pros and cons is fine and also interesting and helps us learn. I'm learning a lot on the way!