
Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema
Open, Medium, Public

Description

Cloud use cases can be classified as:

  • real-time updates that tools need to invoke some action
  • analytics-oriented questions spanning longer time periods

Neither of these is easy to meet with a schema designed for OLTP. The actor and comment refactors are giving us lots of headaches, as seen in T215445. Another way to approach these two types of use cases is:

  • real time updates could work like the Job Queue does in production. We figure out who needs what, make good schemas, and output to Kafka from MediaWiki. We could work with use cases one by one the same way we worked on the RCStream migration. Tools could still be developed in any language that has a Kafka client (most do) or we could put something like EventStreams on top of the events so they're easier to consume. This way tools don't have to look for needles in the replication haystack.
  • analytics-oriented queries can use the same approach we took with MediaWiki history reconstruction. We collected OLAP use cases and restructured the schema so they could be better served. Transforming the data in the general case will be easier because we don't need to fix broken old data or make guesses about inconsistent records. We can move tools to this approach one by one as well.
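As a very rough illustration of what an OLAP-friendly restructuring could look like (table and column names below are hypothetical, not the real MediaWiki or mediawiki-history schema), revisions could be stored denormalized so that common analytical questions need no joins against actor, comment, or page:

-- Hypothetical, denormalized revision table for analytics; each row carries
-- the page and user attributes it needs, so typical OLAP queries avoid joins.
CREATE TABLE revision_denormalized (
  wiki_db        VARCHAR(64),
  rev_id         BIGINT,
  rev_timestamp  DATETIME,
  page_id        BIGINT,
  page_namespace INT,
  page_title     VARCHAR(255),
  user_id        BIGINT,
  user_text      VARCHAR(255),
  comment_text   TEXT,    -- subject to the redaction handling discussed below
  is_minor       BOOLEAN
);

-- Example analytical question answered without any join:
SELECT page_title, COUNT(*) AS edits
FROM revision_denormalized
WHERE wiki_db = 'enwiki'
  AND rev_timestamp >= '2019-01-01'
GROUP BY page_title
ORDER BY edits DESC
LIMIT 10;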

For example, our current dilemma with comment could be handled like this:

  • real-time updates get unsanitized data; this already happens, so we just send it and don't worry about it. Redaction information could be sent out via events as well, and handled as needed.
  • analytics queries would need to hide sanitized content. So we can model comment as a dimension that has redacted properties corresponding to rev_deleted, log_deleted, etc. Once all of these are true, the comment text itself should be hidden everywhere it's used. So we run a periodic compaction and sanitize. This won't be as immediate as the current redaction, but I think immediate redaction is sort of an illusion when the same data goes out via other channels. We could run this compaction every hour if we need to, and in serious emergencies we could run it on demand. This would dramatically improve on dumps, as we could have daily dumps with up-to-the-day sanitization. Currently, unsanitized data sits on dumps forever.
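A minimal sketch of that compaction pass, assuming a hypothetical comment dimension table with per-use redaction flags (names are illustrative, not the production schema):

-- Hypothetical comment dimension: one row per comment, with flags mirroring
-- rev_deleted, log_deleted, etc. for every place the comment is used.
CREATE TABLE comment_dim (
  comment_id   BIGINT PRIMARY KEY,
  comment_text TEXT,
  rev_deleted  BOOLEAN NOT NULL DEFAULT FALSE,
  log_deleted  BOOLEAN NOT NULL DEFAULT FALSE
);

-- Periodic compaction/sanitization: once every use of the comment has been
-- marked deleted, blank the text everywhere it appears.
UPDATE comment_dim
SET comment_text = NULL
WHERE rev_deleted = 1
  AND log_deleted = 1
  AND comment_text IS NOT NULL;

Run hourly (or on demand in emergencies), this would give the up-to-the-day sanitization described above.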

This is basically a scalable way to evolve the way people access MediaWiki data. It will not have everything available by default, as a full replica does. But it will scale by design. The infrastructure needed to run this is something Analytics needs to build anyway, so we could share and operate it together.


Event Timeline

bd808 added a subscriber: Bstorm.
bd808 added a subscriber: bd808.
fdans triaged this task as Medium priority.Feb 18 2019, 4:49 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
bd808 renamed this task from Rethink Cloud DB replicas to Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema.Oct 29 2020, 7:04 PM

Just to add a perspective…

A guarantee that data is always sanitised is an important property for an interface in some contexts, and affects the architecture of the thing that consumes the data.

The details depend on what exactly is being sanitised away, but to take data about a person as an example, if you are subject to the GDPR you generally do not want to touch such data at all because once it crosses the border into your system you are responsible for any processing of that data (which by definition includes deleting it).

The closer to the source of the data you can make this guarantee the less it affects downstream consumers, with one extreme of the scale being that every single end client (tool) needs to deal with this issue independently.

Or put another way, I would pick a guaranteed sanitised interface over one without this guarantee every time, irrespective of whether that unsanitised data also happens to get exposed elsewhere. It reduces my legal exposure, and eliminates whole swathes of stuff I need to handle in my code.

o/

I don't have all of the context here, but in the past, the Analytics eng team had wanted to set up a 'public data lake' using Hive + Presto, mainly to allow SQL querying of the mediawiki history dataset. We canned that project due to lack of first class operational support for things like automated monitoring for clusters hosted in Cloud VPS projects.

However, if the problem you are trying to solve is cross-MySQL-instance joins, Presto still might be able to help you here. Presto has connectors for various downstream SQL(ish) datastores and can do joins between them. The Presto MySQL connector can send single-DB queries to an underlying MySQL instance, but it can also join results from multiple MySQL instances.
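For illustration, a hypothetical Presto query, assuming two MySQL catalogs named enwiki and commonswiki have been configured through the MySQL connector. Presto addresses tables as catalog.schema.table, so a single query can join file metadata across the two instances:

-- Hypothetical cross-instance join in Presto: compare image metadata
-- between enwiki and commons (the catalog names are assumptions).
SELECT e.img_name,
       e.img_sha1 AS enwiki_sha1,
       c.img_sha1 AS commons_sha1
FROM enwiki.enwiki_p.image AS e
JOIN commonswiki.commonswiki_p.image AS c
  ON e.img_name = c.img_name
WHERE e.img_sha1 <> c.img_sha1
LIMIT 100;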

Some thoughts about an OLAP engine for Cloud. Presto feels like a great choice as a computation engine: fast, robust, ANSI SQL compliant. However, Presto performance depends quite a lot on the behind-the-scenes data handling: Presto is data-storage agnostic, it can load data from various backends and then compute.

The trade-off I'm seeing for the labs use case is between query performance and up-to-date data. I think (not tried) that using MariaDB as a backend would slow Presto down too much for OLAP, but that's the only way I can think of currently to provide up-to-date data for queries. Using columnar storage such as Parquet is what should provide the best querying performance, but we are still working on finding a good way to provide regular updates of the data (we currently use monthly snapshots).

In the analytics world we are currently trying to use events from the Modern Event Platform to provide more regular updates of the mediawiki-history dataset, but this has proven difficult, notably because of non-atomic event generation on the MediaWiki application side (some actions are lost). The other concern with using events is whether they would be easily translatable back to the MariaDB schema.

Based on the above, here is my idea of a plan to provide analytics capabilities over MediaWiki data on Cloud:

  • Find a way to have Presto accessible from Cloud, either through labs hosts (which would mean a lot of monitoring work) or through an external service accessed from Cloud.
  • Continue to work toward having a regularly updated mediawiki-history dataset, as it will be useful for analytical queries (denormalized, historified, etc.).
  • Investigate using Change Data Capture from MariaDB and applying it to Parquet versions of the data.
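To make the Presto-over-Parquet idea concrete, here is a hypothetical query against a Hive catalog backed by Parquet snapshots, with a table shaped roughly like the public mediawiki_history dataset (the catalog, schema, and field names are assumptions):

-- Hypothetical query over a Parquet-backed mediawiki-history snapshot in Presto.
SELECT event_user_text,
       COUNT(*) AS revisions
FROM hive.wmf.mediawiki_history
WHERE snapshot = '2020-11'          -- monthly snapshot partition
  AND wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
GROUP BY event_user_text
ORDER BY revisions DESC
LIMIT 20;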

Let me know if you think my views are useful, or if I should keep them to myself :)

From the upcoming WikiReplicas architecture changes we have cataloged T267992, T268240 & T268242, all related to matching file pages and the image name, sha1, and timestamp between wikis and commons, and between wikis themselves. This is something that is very hard to do without cross-db joins, mainly because of the size and number of entries of the commons DB.

T267992#6637250 has some of the approaches we are exploring; I'm posting this here to keep track of the use cases.

A couple of comments from a subtask that I think should continue here:

RE: Bstorm's comment T267992#6631541 about listening to recent changes:

We could start from a known good snapshot (like the last mediawiki history dataset), transform it into a useful shape, and listen to all the relevant page events, which in this case are:

  • page create
  • page move (rename)
  • page delete
  • page restore

The "useful shape" could be something really simple like:

page title      | list of wikis
File:Cat        | enwiki, commonswiki
File:Bat        | commonswiki
File:Aristocrat | enwiki

Each event would trigger a quick update, probably keyed on the page id. The basic idea is that we already have the events we need and we could create a table that's fast and easy to query. We could also evolve this into a more comprehensive page schema that would support other use cases. It obviously requires some work and is not something we get for free; I'm just thinking out loud that some of the pieces are already there from a data point of view.
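A minimal sketch of that table and the event-driven updates, with purely illustrative names and values:

-- Hypothetical table: one row per (wiki, page), indexed by title so the
-- cross-wiki lookup is cheap. Kept current by applying the page events above.
CREATE TABLE page_presence (
  wiki_db    VARCHAR(64)  NOT NULL,
  page_id    BIGINT       NOT NULL,
  page_title VARCHAR(255) NOT NULL,
  PRIMARY KEY (wiki_db, page_id),
  KEY (page_title)
);

-- page create / restore / move events upsert the row (placeholder values):
INSERT INTO page_presence (wiki_db, page_id, page_title)
VALUES ('commonswiki', 12345, 'File:Cat')
ON DUPLICATE KEY UPDATE page_title = VALUES(page_title);

-- page delete events remove it:
DELETE FROM page_presence WHERE wiki_db = 'commonswiki' AND page_id = 12345;

-- the "which wikis have this file page" question becomes a cheap lookup:
SELECT wiki_db FROM page_presence WHERE page_title = 'File:Cat';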

[...]
I think that kind of table/querying would be very useful. AFAICT it wouldn't need to be up to date in real time, so even if the data is a bit stale (updated weekly or monthly, say) it would still be useful, from what we've heard up until now.

Regarding the sub-tasks we have now with the table you propose:

  • T267992: for this it would help to get a subset of pages that could then be queried against commons and enwiki a lot more easily. Hopefully the subsets wouldn't be so big that querying the DBs separately becomes very hard. We would need to check the size of the intersection of file images between enwiki and commons to get a better idea.
  • T268240: this one looks like a variation of the previous T267992, so it would help too.
  • T268242: for this one (syncing license changes or deletions of fair-use images from enwiki to smaller wikis) I'm not sure it would help. This query may be better suited to an incremental approach; I still have to chime in there.
  • T268244: for this one (links to non-existent images) it wouldn't help, as it is based on imagelinks. Although this query may be more doable incrementally anyway, I still have to chime in there too.

[...]

I'm happy to advise someone working on this, but I can't drive the work, we have had to re-focus and trim down a lot of scope. We're struggling with all the changes on our team.

Is it possible to get information on how big the commons/wikidata tables are on disk? (i.e. which tables are problematic from a replication point of view)

Also, is it ruled out that you would mirror a subset of the tables, or custom tables, to each shard, so there would be a defined set of data which could be queried using cross-database joins from all wiki databases?

Is it possible to get information on how big the commons/wikidata tables are on disk? (i.e. which tables are problematic from a replication point of view)

For individual tables I'm not sure; for whole databases, yes. What would that info help with?

The problem is not just the size of the tables; it is also the queries that are performed against the cluster, and other factors.

Also, is it ruled out that you would mirror a subset of the tables, or custom tables, to each shard, so there would be a defined set of data which could be queried using cross-database joins from all wiki databases?

Right now the work of T260389 is avoiding the replicas having a catastrophic failure by setting up the new architecture and upgrading MariaDB. Nothing else is in scope. Other alternatives and future projects are in discussion (for example in this task) but sadly these changes need to happen to keep the replicas operational at all.

Sorry for the disruption @Zache, thanks for listing your queries in the other task.

Is it possible to get information on how big the commons/wikidata tables are on disk? (i.e. which tables are problematic from a replication point of view)

For individual tables I'm not sure; for whole databases, yes. What would that info help with?

It would help to see what the options are. If the tables are big, then they cannot be replicated because of size/network traffic constraints. However, if they are small enough, then it is not a problem to mirror just some of the tables (or, more likely, custom views) to all of the nodes.

The problem is not just the size of the tables; it is also the queries that are performed against the cluster, and other factors.

Do you have any analysis on this?

Based on which queries are reasonably fast, the most problematic ones are the revision/logging tables, which are also the most heavily sanitized. On the other hand, I would be a little bit surprised if queries against the indexed pagelinks/templatelinks/categorylinks tables were a problem.

It would help to see what the options are. If the tables are big, then they cannot be replicated because of size/network traffic constraints. However, if they are small enough, then it is not a problem to mirror just some of the tables (or, more likely, custom views) to all of the nodes.

I'm personally not sure if this is feasible; SREs or DB admins would have to chime in. I'm happy to bring the idea up when we have a chat about this task 👍

Do you have any analysis on this?

I don't have any hard data at this moment. From what I've heard, there are queries that run on huge tables using non-indexed columns, which can take a very, very long time to finish. My guess is that these queries consume resources for a long time, and since all the DBs are on the same hardware, a bad query on one DB will affect the performance of all the others.

The service is on the brink of collapse and in a dangerous position, which is why the project in the subtask is happening. I hope once it is done we have more leeway to analyze how the replicas are used.

Based on which queries are reasonably fast, the most problematic ones are the revision/logging tables, which are also the most heavily sanitized. On the other hand, I would be a little bit surprised if queries against the indexed pagelinks/templatelinks/categorylinks tables were a problem.

I would tend to agree with you, but I think this would be very dependent on the query, the database it is running against, and the existing load on the server. Right now it would be hard to analyze things at that level of detail in the current situation.

Do you have any analysis on this?

Based on which queries are reasonably fast, the most problematic ones are the revision/logging tables, which are also the most heavily sanitized. On the other hand, I would be a little bit surprised if queries against the indexed pagelinks/templatelinks/categorylinks tables were a problem.

The scalability issues of a database are not limited to disk space (disk space is rarely the problem). They include the huge amount of writes coming from Wikidata (which on top of that has to be sanitized), meaning the network would get saturated if we tried to replicate enwiki and wikidatawiki (and other wikis) to one db at the same time (in production they don't get replicated to one cluster, see dbtree.wikimedia.org). Network is one constraint; the InnoDB buffer pool running out of space is another. Also, the revision table is not the biggest table of enwiki; pagelinks is.

A very low-tech workaround for decreasing the current load could be generating unsanitized id/timestamp pair tables where users could query the possible min/max values for the primary ids. This could even be compacted so that it contains only one pair per day.

I.e. the content of the table would be generated with something like:

-- One row per day with the lowest and highest rev_id seen that day;
-- rev_timestamp is YYYYMMDDHHMMSS, so dividing by 1,000,000 keeps the YYYYMMDD part.
CREATE TABLE yearmonthday_revision AS (
  SELECT floor(rev_timestamp / 1000000) AS yearmonthday,
         min(rev_id) AS min_rev,
         max(rev_id) AS max_rev
  FROM revision
  GROUP BY yearmonthday
);

The user could then first query the min and max primary keys for a certain time period from the fast table and use those to set the search window in the slow table. Another solution could be adding indexes to the timestamp columns, but I suspect that would have been done already if it were doable.
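For illustration, a hypothetical use of that fast table: look up the rev_id window for the period of interest first, then constrain the query against the large revision table to that primary-key range (the page id is a placeholder):

-- Get the rev_id bounds for January 2020 from the small lookup table.
SELECT MIN(min_rev), MAX(max_rev) INTO @lo, @hi
FROM yearmonthday_revision
WHERE yearmonthday BETWEEN 20200101 AND 20200131;

-- Use the bounds to keep the scan of the big table within a narrow
-- primary-key range.
SELECT COUNT(*)
FROM revision
WHERE rev_id BETWEEN @lo AND @hi
  AND rev_page = 12345;   -- placeholder page id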