
Rethink Cloud DB replicas
Open, Medium, Public


Cloud use cases can be classified as:

  • real-time updates that tools use to trigger some action
  • analytics-oriented questions spanning longer time periods

Neither of these is easy to meet with a schema designed for OLTP. The actor and comment refactors are giving us lots of headaches, as seen in T215445. Another way to approach these two types of use cases is:

  • real-time updates could work like the Job Queue does in production. We figure out who needs what, design good schemas, and output to Kafka from MediaWiki. We could work through use cases one by one, the same way we handled the RCStream migration. Tools could still be developed in any language that has a Kafka client (most do), or we could put something like EventStreams on top of the events so they're easier to consume. That way tools don't have to look for needles in the replication haystack.
  • analytics-oriented queries can use the same approach we took with MediaWiki history reconstruction. We collected OLAP use cases and restructured the schema to serve them better. Transforming the data in the general case will be easier because we won't need to fix broken old data or guess at inconsistent records. We can move tools to this approach one by one as well.
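As a rough sketch of the real-time path above, a tool would react to change events instead of polling the replicas for new rows. The topic name and event fields here are hypothetical, not a real MediaWiki event schema; any Kafka or EventStreams client would follow the same dispatch pattern:

```python
import json

# Hypothetical shape of a change event a tool would receive from Kafka
# (or an EventStreams-style endpoint layered on top). Field names are
# illustrative only.
SAMPLE_EVENT = json.dumps({
    "topic": "mediawiki.revision-create",
    "wiki": "enwiki",
    "rev_id": 123456,
    "page_title": "Example",
    "comment": "fix typo",
})

def handle_event(raw: str) -> str:
    """Dispatch one event to whatever action the tool cares about."""
    event = json.loads(raw)
    if event["topic"] == "mediawiki.revision-create":
        # A counter-vandalism tool would enqueue the revision for review here.
        return f"review rev {event['rev_id']} on {event['wiki']}"
    # Unknown topics are skipped rather than crashing the consumer loop.
    return "skipped"

print(handle_event(SAMPLE_EVENT))
```

The point is that the consumer sees a curated, well-typed event, so the tool never has to reverse-engineer which replica rows changed.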

For example, our current dilemma with comment could be handled like this:

  • real-time updates get unsanitized data; that already happens today, so we just send it and don't worry about it. Redaction information could be sent out via events as well, and handled as needed.
  • analytics queries would need to hide sanitized content, so we can model comment as a dimension with redaction properties corresponding to rev_deleted, log_deleted, etc. Once all of those are true, the comment text itself should be hidden everywhere it's used, so we run a periodic compact-and-sanitize pass. This won't be as immediate as the current redaction, but immediate redaction is something of an illusion when the same data goes out via other channels anyway. We could run this compaction every hour if we need to, and on demand in serious emergencies. This would dramatically improve on dumps: we could have daily dumps with up-to-the-day sanitization, whereas today unsanitized data sits in dumps forever.
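The compaction step above could look something like the following pass over the comment dimension: once every deletion flag attached to a comment is set, its text is blanked everywhere. This is a minimal sketch under assumed names; the row layout and flag fields are illustrative, not the actual schema:

```python
# Minimal sketch of the periodic compact-and-sanitize pass, assuming a
# comment dimension with per-usage deletion flags mirroring rev_deleted,
# log_deleted, etc. All field names here are hypothetical.
def compact(comments):
    """Blank comment text wherever every usage has been flagged deleted."""
    redacted = 0
    for row in comments:
        flags = row["deleted_flags"]  # e.g. {"rev_deleted": True, ...}
        if flags and all(flags.values()):
            row["text"] = None  # hide the text everywhere it's used
            redacted += 1
    return redacted

rows = [
    {"comment_id": 1, "text": "visible",
     "deleted_flags": {"rev_deleted": False, "log_deleted": False}},
    {"comment_id": 2, "text": "to be hidden",
     "deleted_flags": {"rev_deleted": True, "log_deleted": True}},
]
print(compact(rows))    # count of comments redacted in this pass
print(rows[1]["text"])  # the fully-flagged comment is now blanked
```

Because the pass is idempotent, running it hourly (or on demand in an emergency) only ever tightens what is visible.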

This is basically a scalable way to evolve the way people access MediaWiki data. It will not have everything available by default, as a full replica does. But it will scale by design. The infrastructure needed to run this is something Analytics needs to build anyway, so we could share and operate it together.