Create materialized views on Wiki Replica hosts for better query performance
Open, HighPublic

Description

Creating a new task to follow-up on today's meeting. We decided that DBAs would look into creating materialized views to help both Analytics and cloud use cases. I suggest we edit this task to keep a list of requirements that the DBAs can work with.

Analytics Requirements

  • Faster access to corresponding actor_name for archive, ipblocks, logging, and revision
  • Faster access to corresponding comment_text for ipblocks, logging, and revision
  • Our access pattern for importing is always: select [subset of fields] from [subset of tables] where timestamp < 'The first day of the current month'.

So, for example, if it's November 2nd, and the materialized views have updated so that they include at least all transactions through November 1st 00:00, that would be great for us. If this would be available but much later, like November 5th or so, then we'd have to do some tests to see how fast we can import from the new infrastructure.

Cloud Requirements

  • Faster access to the agent and comment views for our users because those two tables depend on many subqueries to get a single result because the rules that determine visibility are in other tables.
  • Cloud users access the data in all kinds of ways. From my scan of the definitions, a materialized view that is behind another table could cause a broken reference to a comment or actor record briefly, but it wouldn't result in an exposure because the rules are held in the external tables. However close to real-time updates of those tables that we can come is good, but whatever we determine is possible needs to be communicated to cloud users so that they can adjust things if needed.

The content table also does a subquery to the slots table, but it is a single subquery that is probably not going to be as much of a problem. It's just worth noting in case that is eventually incorrect.

There are a very large number of changes, so older changes are hidden. Show Older Changes
Milimetric triaged this task as High priority.Thu, Nov 29, 3:54 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptThu, Nov 29, 3:54 AM

cc @Nuria so she's in the loop that we're following up here about the materialized views work.

Banyek added a subscriber: Banyek.Thu, Nov 29, 2:41 PM

@Bstorm I can assist if you need it, just talk to me

Banyek moved this task from Triage to In progress on the DBA board.Thu, Nov 29, 2:47 PM

So you don't have to reuse this- but I am pasting this for reference:

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/375349/

This would need some extra tweaks- puppetization on a cron, change of hosts, making it work on both datacenters, a proper temporary storage, making it "better" in code sense (remember this was a quick script at the time coded in 5 minutes), better authentication, using ssl, recording the time of the generation for each table, and the most important part, avoid metadata locking (!) and transitional states with no rows (maybe with some rename work). However, 90% it is there (even if it is rewritten from scratch).

The difficult part is the metadata locking- but maybe we can avoid DDLs completely?

Bstorm updated the task description. (Show Details)Thu, Nov 29, 3:41 PM
bd808 renamed this task from Create materialized views for performance to Create materialized views on Wiki Replica hosts for better query performance.Thu, Nov 29, 8:57 PM
bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.
bd808 added a subscriber: bd808.

@Marostegui, we would like to go over plans for implementation during our Wednesday meeting. Is there anything else you'd like us to define or discuss before then?

Banyek added a comment.Mon, Dec 3, 5:44 PM

@Bstrom when could we talk about the details?

Banyek added a comment.Mon, Dec 3, 6:05 PM

We talked about the kick-off, and tomorrow we'll sync up about what we found.
@Bstorm gives me a few SQL's and I'll check what can I do with those

More details on requirements for Analytics. The queries that we'll be running will be of the form:

select [subset of fields] from [one of the tables listed below] where [timestamp < the first day of this month (if there's a timestamp)]

We import the following tables:

  • archive
  • ipblocks
  • logging
  • page
  • pagelinks
  • redirect
  • revision
  • user
  • user_groups

And from any materialized views created to make it faster to grab actor/comment joins to archive/revision/logging.

Obviously SSDs would be better, but these should all be contiguous reads on disk, requiring not too much memory, because no joins are being done in real-time. I think the biggest memory user will be refreshing materialized views, which will probably join actor/comment to archive/revision/logging. From a speed point of view, here's what we need:

  • 1st of the month: kick off any materialized view refreshes
  • when refreshes are done, import data
  • when import is done, kick off jobs which take at most 1 day

Before the performance problems, this process started on the 5th of the month and it generally finished by the 10th of the month. So if we had the same performance, we'd finish around the 5th of the month. If these boxes are half as fast at performing these tasks, we'd finish on the 10th. I'd say if they're close to half as fast or worse, we should get faster boxes. If not, we're good for now as a temporary solution.

Thanks @Milimetric, very useful information.
Could I ask for an specific and concrete SQL? With the exact subfields and an example timestamp?
Thanks!

@Marostegui while Dan answers your Q, I like his idea: Could the HDDs handle replication if we only replicated the tables Dan mentioned? I'd assume it wouldn't matter that much, as those are the most active tables and probably responsible for most of the replication load.

Milimetric added a comment.EditedMon, Dec 3, 8:23 PM

Yes, all the templates for the queries are here, and they're easy to read, basic python templating: https://github.com/wikimedia/analytics-refinery/blob/7cf62e1a7b9cc65cdf229e8a13ced8f6769f0a14/python/refinery/sqoop.py#L180 (that's archive, but scroll down and you'll see all the tables).

For example, if we were sqooping the revision table, for all data including 2018-10, first sqoop would query for min/max of the column we're splitting by, in this case rev_id:

select min(rev_id), max(rev_id) from revision;

And then, it would split up max - min into $parallelism parts, and grab each one, like this:

select rev_id,
       rev_page,
       rev_text_id,
       convert(rev_comment using utf8) rev_comment,
       rev_user,
       convert(rev_user_text using utf8) rev_user_text,
       convert(rev_timestamp using utf8) rev_timestamp,
       rev_minor_edit,
       rev_deleted,
       rev_len,
       rev_parent_id,
       convert(rev_sha1 using utf8) rev_sha1,
       convert(rev_content_model using utf8) rev_content_model,
       convert(rev_content_format using utf8) rev_content_format
  from revision
 where rev_id >= $min_rev_id
   and rev_id <  $min_rev_id + (($max_rev_id - $min_rev_id) / $parallelism) * $part
   and rev_timestamp >= '20010101000000'
   and rev_timestamp <  '20181101000000'

Happy to provide any other detail. (If you're looking at the templates, that max/min stuff is what gets substituted where you see $CONDITIONS, and the split_by is how we tell sqoop which column we want to use). Oh, and the convert(... using utf8) stuff is there because we export to Avro and it needs those to convert column types like varbinary into strings.

Marostegui added a comment.EditedMon, Dec 3, 8:36 PM

@Marostegui while Dan answers your Q, I like his idea: Could the HDDs handle replication if we only replicated the tables Dan mentioned? I'd assume it wouldn't matter that much, as those are the most active tables and probably responsible for most of the replication load.

I'm not sure, specially with logging, revision, actor, archive tables involved...

There is also the fact that we would have a another role to maintain, as the labsdb role would no longer be usable as we would need different configurations, filters etc.
There is also the fact that as we would need to put more/different filters in and in case another table is needed in the future (for whatever reason) we would need to do an import on 900 wikis, so that's operational a huge effort.
We have already experienced that on labs, when a table that was blacklisted is all of a sudden fine to replicate and we need to get it reimproted.

As discussed in the meeting, I think we need to try to solve this (or at least give a proper try) to from a MySQL/views points of view first.

Throwing hardware in is ok as long as we are really sure it can fix it, which is not the case so far, at least not with the proposed hardware - which implies a few big and potentially becoming tech debt changes. When we talked about maintaining 2 (or whatever number of hosts) it was assumed (or sort of) they would be the same as the current labs hosts which is no longer the case unfortunately :-(.

Let's please try to keep the tasks organized and discuss hardware on the other task and materialised related things otherwise it is going to be hard to keep information on the same subject.

In https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477503/ I have a proposal about how to install the materialized script generator on any of the hosts.
In this case it will be put to one of the wikireplica_analytics (labsdb1010) host, but it is easily movable anywhere

Change 477503 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: materialized view generator for analytics team

https://gerrit.wikimedia.org/r/477503

Banyek added a comment.Tue, Dec 4, 2:51 PM

@Milimetric could you be a little bit more specific please? As I understood currently you run the queries described in T210693#4795470 and what you need from us is to create a few tables which would make those queries run faster, right?
What is not clear to me, is that we - as the Persistence Team - should figure out how to build those tables or you can provide us some SQL queries which will build those tables, and our job is to integrate this process into our current system?
I have a proposal about the integration part in T210693#4796863: that would install a script on any chosen host, and run it via cron. Currently it is aimed to labsdb1010, but it could be easily moved to any future host.

Thanks to @Banyek for the questions and a talk we just had over hangouts, we decided to go in two directions in parallel:

  1. considering the complications that have been coming up here with the new cluster/replication/materialized views work, we think the solution the Analytics team is developing for the short-term might actually be better for the medium-term as well, until a better sanitize pipeline is in place. I'll describe this approach below, and we should consider it in our meeting tomorrow (cloud/dba/sre representatives are doing a hangout tomorrow).
  2. I will send @Banyek the exact query we would have to run for the revision table (after the comment/actor refactor) so he can experiment with materialized views based on that query, and get a feeling for performance hit, disk space, etc.

I will follow up with both of these directions in comments here, so they can be linked independently.

  1. exact query for a materialized view that would allow us to import the revision table into Hadoop. There are two possible ways to do this, depending on where the query runs:

If we're selecting from the revision view, as defined in maintain-views.yaml:

select rev_id,
       rev_page,
       rev_text_id,
       convert(rev_comment using utf8) rev_comment,
       rev_user,
       convert(rev_user_text using utf8) rev_user_text,
       convert(rev_timestamp using utf8) rev_timestamp,
       rev_minor_edit,
       rev_deleted,
       rev_len,
       rev_parent_id,
       convert(rev_sha1 using utf8) rev_sha1,
       convert(rev_content_model using utf8) rev_content_model,
       convert(rev_content_format using utf8) rev_content_format
  from revision
           left join
       actor           a   on a.actor_id = rev_actor
           left join
       comment         c   on c.comment_id = rev_comment_id

If we're selecting from the revision table, this is currently in the middle of a complex refactor and migration, so this query would change slightly as the temp tables go away:

select rev_id,
       rev_page,
       rev_text_id,
       convert(rev_comment using utf8) rev_comment,
       rev_user,
       convert(rev_user_text using utf8) rev_user_text,
       convert(rev_timestamp using utf8) rev_timestamp,
       rev_minor_edit,
       rev_deleted,
       rev_len,
       rev_parent_id,
       convert(rev_sha1 using utf8) rev_sha1,
       convert(rev_content_model using utf8) rev_content_model,
       convert(rev_content_format using utf8) rev_content_format
  from revision
           left join
       revision_actor_temp rat     on rat.revactor_rev = rev_id
           left join
       actor           a           on a.actor_id = revactor_actor
           left join
       revision_comment_temp rct   on rct.revcomment_rev = rev_id
           left join
       comment         c           on c.comment_id = revcomment_comment_id
Bstorm added a comment.Tue, Dec 4, 7:42 PM

To sum up discussion with @Banyek this morning:
From the Cloud perspective this is a problem with the replica views that Analytics happens to be a sophisticated and well-connected reporter on. We'd like any solutions to be something that can be used by more than one user of the wiki replicas since this will start to generate increasing tickets in time from other users who notice the same problems. As is, I can grab 10K page records in 0.04 sec. It takes over 4 sec for the same query on comment with no filtering. I can't help but imagine more interesting queries are going to run into problems in some other cases. The type of materialization we are talking about, triggered on schedule rather than some kind of hook, would only fit limited uses, but folks might be happy to have the option (such as Analytics).

So I'm interested in how often a materialized view could be created for the comment and actor tables. It wouldn't fit everyone's use cases, but it is likely to fit some more than just Analytics.

So we are all on the same sheet of music, the query that creates the comment view in question (with a particular definer set as well) is:

CREATE VIEW `comment` AS select `enwiki`.`comment`.`comment_id` AS `comment_id`,`enwiki`.`comment`.`comment_hash` AS `comment_hash`,`enwiki`.`comment`.`comment_text` AS `comment_text`,`enwiki`.`comment`.`comment_data` AS `comment_data` from `enwiki`.`comment` where (exists(select 1 from `enwiki`.`image` where (`enwiki`.`image`.`img_description_id` = `enwiki`.`comment`.`comment_id`)) or exists(select 1 from `enwiki`.`filearchive` where (`enwiki`.`filearchive`.`fa_deleted_reason_id` = `enwiki`.`comment`.`comment_id`)) or exists(select 1 from `enwiki`.`filearchive` where ((`enwiki`.`filearchive`.`fa_description_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`filearchive`.`fa_deleted` & 2) = 0))) or exists(select 1 from `enwiki`.`ipblocks` where ((`enwiki`.`ipblocks`.`ipb_reason_id` = `enwiki`.`comment`.`comment_id`) and (`enwiki`.`ipblocks`.`ipb_deleted` = 0))) or exists(select 1 from `enwiki`.`oldimage` where ((`enwiki`.`oldimage`.`oi_description_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`oldimage`.`oi_deleted` & 2) = 0))) or exists(select 1 from `enwiki`.`protected_titles` where (`enwiki`.`protected_titles`.`pt_reason_id` = `enwiki`.`comment`.`comment_id`)) or exists(select 1 from `enwiki`.`recentchanges` where ((`enwiki`.`recentchanges`.`rc_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`recentchanges`.`rc_deleted` & 2) = 0))) or exists(select 1 from (`enwiki`.`revision` join `enwiki`.`revision_comment_temp` on((`enwiki`.`revision_comment_temp`.`revcomment_rev` = `enwiki`.`revision`.`rev_id`))) where ((`enwiki`.`revision_comment_temp`.`revcomment_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`revision`.`rev_deleted` & 2) = 0))) or exists(select 1 from `enwiki`.`logging` where ((`enwiki`.`logging`.`log_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`logging`.`log_deleted` & 2) = 0) and (`enwiki`.`logging`.`log_type` in ('abusefilter','articlefeedbackv5','block','campus','close','contentmodel','course','create','delete','eparticle','gather','gblblock','gblrename','gblrights','globalauth','gwtoolset','import','institution','instructor','interwiki','liquidthreads','lock','managetags','massmessage','merge','moodbar','move','mwoauthconsumer','newsletter','newusers','notifytranslators','online','pagelang','pagetranslation','pagetriage-curation','pagetriage-deletion','patrol','protect','renameuser','review','rights','spamblacklist','stable','student','tag','thanks','timedmediahandler','translationreview','upload','usermerge')))))

And for actor:

CREATE VIEW `actor` AS select `enwiki`.`actor`.`actor_id` AS `actor_id`,`enwiki`.`actor`.`actor_user` AS `actor_user`,`enwiki`.`actor`.`actor_name` AS `actor_name` from `enwiki`.`actor` where (exists(select 1 from `enwiki`.`user` where (`enwiki`.`user`.`user_id` = `enwiki`.`actor`.`actor_user`)) or exists(select 1 from `enwiki`.`archive` where ((`enwiki`.`archive`.`ar_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`archive`.`ar_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`ipblocks` where ((`enwiki`.`ipblocks`.`ipb_by_actor` = `enwiki`.`actor`.`actor_id`) and (`enwiki`.`ipblocks`.`ipb_deleted` = 0))) or exists(select 1 from `enwiki`.`image` where (`enwiki`.`image`.`img_actor` = `enwiki`.`actor`.`actor_id`)) or exists(select 1 from `enwiki`.`oldimage` where ((`enwiki`.`oldimage`.`oi_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`oldimage`.`oi_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`filearchive` where ((`enwiki`.`filearchive`.`fa_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`filearchive`.`fa_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`recentchanges` where ((`enwiki`.`recentchanges`.`rc_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`recentchanges`.`rc_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`logging` where ((`enwiki`.`logging`.`log_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`logging`.`log_deleted` & 4) = 0))) or exists(select 1 from (`enwiki`.`revision` join `enwiki`.`revision_actor_temp` on((`enwiki`.`revision_actor_temp`.`revactor_rev` = `enwiki`.`revision`.`rev_id`))) where ((`enwiki`.`revision`.`rev_deleted` & 4) = 0)))

If we were just using enwiki.

Because re-creating these tables on a regular basis is likely to not be a small operation that might run into table locking problems, I am not all that hopeful that it will be able to be done more often than once a week. I can dream.

Anyway, we discussed running a test tomorrow to see what kind of resources this sort of operation takes up, doing a table create on one of the replica servers depooled in a wiki_p database, I presume. I'm not sure the tables are completely filled at this time or what kind of backfill is planned for them, so our mileage may vary from the test.

Change 477624 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1010 for testing materialized view

https://gerrit.wikimedia.org/r/477624

+1 to @Bstorm's framing of the problem. I was going to say the same thing, and withdraw my revision query from above from @Banyek's consideration. We can work around our performance problem with or without more hardware (more is better), but everyone wins if we get faster queries. What the DBAs should consider, with our help, is just improving the performance in general as @Bstorm outlined.

Milimetric added a comment.EditedTue, Dec 4, 8:12 PM

And to follow up on my first bullet from before:

  1. after more closely looking at our temporary solution, we think it's solid enough to work for us going forward, and takes pressure away from DBAs to create any special cases for us. As @Bstorm points out, it seems more and more necessary to improve performance. But if others accept what I describe below, we don't have to be consulted how exactly to improve it.

Basically, we are importing actor and comment from production replicas, and these include comment text that might have been redacted via a rev_deleted, log_deleted, etc. flag. But we import revision, archive, and logging from the cloud replicas. When I took a closer look at those view definitions, I realized that the columns we would use to join to the actor or comment tables are already sanitized according to the flags. For example, rev_comment_id is null if rev_deleted says it should be hidden: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#664

Now, the only thing we do with actor and comment is to join to the rows we grab from the cloud replicas. So we don't have to do any sanitizing on our side, the join will sanitize for us, because the keys have been nulled or zeroed out. If everyone agrees this is not a new implementation of the privacy pipeline, but just a nice side-effect, then we can go with this solution for the time being and revisit once we have a different sanitize pipeline.

From my chat with @Banyek the approach now is to create tables populated with the following queries and then Analytics will query those directly with their own queries sort of defined at T210693#4795364 and T210693#4797967 ?:

So we are all on the same sheet of music, the query that creates the comment view in question (with a particular definer set as well) is:

CREATE VIEW `comment` AS select `enwiki`.`comment`.`comment_id` AS `comment_id`,`enwiki`.`comment`.`comment_hash` AS `comment_hash`,`enwiki`.`comment`.`comment_text` AS `comment_text`,`enwiki`.`comment`.`comment_data` AS `comment_data` from `enwiki`.`comment` where (exists(select 1 from `enwiki`.`image` where (`enwiki`.`image`.`img_description_id` = `enwiki`.`comment`.`comment_id`)) or exists(select 1 from `enwiki`.`filearchive` where (`enwiki`.`filearchive`.`fa_deleted_reason_id` = `enwiki`.`comment`.`comment_id`)) or exists(select 1 from `enwiki`.`filearchive` where ((`enwiki`.`filearchive`.`fa_description_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`filearchive`.`fa_deleted` & 2) = 0))) or exists(select 1 from `enwiki`.`ipblocks` where ((`enwiki`.`ipblocks`.`ipb_reason_id` = `enwiki`.`comment`.`comment_id`) and (`enwiki`.`ipblocks`.`ipb_deleted` = 0))) or exists(select 1 from `enwiki`.`oldimage` where ((`enwiki`.`oldimage`.`oi_description_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`oldimage`.`oi_deleted` & 2) = 0))) or exists(select 1 from `enwiki`.`protected_titles` where (`enwiki`.`protected_titles`.`pt_reason_id` = `enwiki`.`comment`.`comment_id`)) or exists(select 1 from `enwiki`.`recentchanges` where ((`enwiki`.`recentchanges`.`rc_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`recentchanges`.`rc_deleted` & 2) = 0))) or exists(select 1 from (`enwiki`.`revision` join `enwiki`.`revision_comment_temp` on((`enwiki`.`revision_comment_temp`.`revcomment_rev` = `enwiki`.`revision`.`rev_id`))) where ((`enwiki`.`revision_comment_temp`.`revcomment_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`revision`.`rev_deleted` & 2) = 0))) or exists(select 1 from `enwiki`.`logging` where ((`enwiki`.`logging`.`log_comment_id` = `enwiki`.`comment`.`comment_id`) and ((`enwiki`.`logging`.`log_deleted` & 2) = 0) and (`enwiki`.`logging`.`log_type` in ('abusefilter','articlefeedbackv5','block','campus','close','contentmodel','course','create','delete','eparticle','gather','gblblock','gblrename','gblrights','globalauth','gwtoolset','import','institution','instructor','interwiki','liquidthreads','lock','managetags','massmessage','merge','moodbar','move','mwoauthconsumer','newsletter','newusers','notifytranslators','online','pagelang','pagetranslation','pagetriage-curation','pagetriage-deletion','patrol','protect','renameuser','review','rights','spamblacklist','stable','student','tag','thanks','timedmediahandler','translationreview','upload','usermerge')))))

And for actor:

CREATE VIEW `actor` AS select `enwiki`.`actor`.`actor_id` AS `actor_id`,`enwiki`.`actor`.`actor_user` AS `actor_user`,`enwiki`.`actor`.`actor_name` AS `actor_name` from `enwiki`.`actor` where (exists(select 1 from `enwiki`.`user` where (`enwiki`.`user`.`user_id` = `enwiki`.`actor`.`actor_user`)) or exists(select 1 from `enwiki`.`archive` where ((`enwiki`.`archive`.`ar_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`archive`.`ar_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`ipblocks` where ((`enwiki`.`ipblocks`.`ipb_by_actor` = `enwiki`.`actor`.`actor_id`) and (`enwiki`.`ipblocks`.`ipb_deleted` = 0))) or exists(select 1 from `enwiki`.`image` where (`enwiki`.`image`.`img_actor` = `enwiki`.`actor`.`actor_id`)) or exists(select 1 from `enwiki`.`oldimage` where ((`enwiki`.`oldimage`.`oi_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`oldimage`.`oi_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`filearchive` where ((`enwiki`.`filearchive`.`fa_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`filearchive`.`fa_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`recentchanges` where ((`enwiki`.`recentchanges`.`rc_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`recentchanges`.`rc_deleted` & 4) = 0))) or exists(select 1 from `enwiki`.`logging` where ((`enwiki`.`logging`.`log_actor` = `enwiki`.`actor`.`actor_id`) and ((`enwiki`.`logging`.`log_deleted` & 4) = 0))) or exists(select 1 from (`enwiki`.`revision` join `enwiki`.`revision_actor_temp` on((`enwiki`.`revision_actor_temp`.`revactor_rev` = `enwiki`.`revision`.`rev_id`))) where ((`enwiki`.`revision`.`rev_deleted` & 4) = 0)))

If we were just using enwiki.

Will this need to eventually happen on _all_ wikis? If that is the case, I don't think we'd have enough space to do so. But the test for enwiki will say.

Because re-creating these tables on a regular basis is likely to not be a small operation that might run into table locking problems, I am not all that hopeful that it will be able to be done more often than once a week. I can dream.

Anyway, we discussed running a test tomorrow to see what kind of resources this sort of operation takes up, doing a table create on one of the replica servers depooled in a wiki_p database, I presume. I'm not sure the tables are completely filled at > this time or what kind of backfill is planned for them, so our mileage may vary from the test.

There is also the fact that this operation will most likely generate metadata locking if the host isn't depooled. If Analytics gets their own host, that shouldn't be a problem but for Cloud Replicas that would be a no go, as we'd need a manual pooling/depooling of the host when we have to regenerate the materialized views.

Marostegui added a comment.EditedWed, Dec 5, 6:47 AM

And to follow up on my first bullet from before:

  1. after more closely looking at our temporary solution, we think it's solid enough to work for us going forward, and takes pressure away from DBAs to create any special cases for us. As @Bstorm points out, it seems more and more necessary to improve performance. But if others accept what I describe below, we don't have to be consulted how exactly to improve it.

    Basically, we are importing actor and comment from production replicas, and these include comment text that might have been redacted via a rev_deleted, log_deleted, etc. flag. But we import revision, archive, and logging from the cloud replicas. When I took a closer look at those view definitions, I realized that the columns we would use to join to the actor or comment tables are already sanitized according to the flags. For example, rev_comment_id is null if rev_deleted says it should be hidden: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#664

    Now, the only thing we do with actor and comment is to join to the rows we grab from the cloud replicas. So we don't have to do any sanitizing on our side, the join will sanitize for us, because the keys have been nulled or zeroed out. If everyone agrees this is not a new implementation of the privacy pipeline, but just a nice side-effect, then we can go with this solution for the time being and revisit once we have a different sanitize pipeline.

I am not really sure I get the idea.
Let me rephrase to see if I got it right:

  • You have actor and comment directly imported from production
  • You import revision archive and logging from cloud replicas already existing views which have sanitized data
  • You join actor and comment production rows with the rows from revision archive and logging which already have sanitized data as they come from the cloud views?

Questions:

  • Given that you are importing actor and comment directly from production and not from Cloud Replicas, you would be bypassing any sanitized data done on a view level to those two tables?:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#223
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#288

So you'd be joining sanitized data from revision archive and logging with non sanitized data for actor and comment?

Mentioned in SAL (#wikimedia-operations) [2018-12-05T10:10:09Z] <banyek> depooling labsdb1010 for testing materialized views - T210693

Change 477624 merged by Banyek:
[operations/puppet@production] wiki replicas: depool labsdb1010 for testing materialized view

https://gerrit.wikimedia.org/r/477624

For testing I created the view as comment_view_temp with the query @Bstorm wrote in T210693#4798638 and the table is being created with

create table comment_mat_view as select * from comment_view_temp;

it will take time :/

On which host and on which database?

Ah, I see, labsdb1010 but why is the view comment_view_temp on enwiki and not on enwiki_p where the views live? Is there any specific reason?

No, there is no any real reson, but as this is a test only I wanted to keep this 'clean' and enwiki feels more suitable for testing than enwiki_p.
Don't ask why, I can't answer it.
But at the end I can move the test table to enwiki_p is that is preferable

Thanks for the clarification - I just wanted to know if there was some specific reason for it that I might have missed.
Once the test is done make sure to either clean it up or move it to the views DB, as that is where the labsdbuser has access to and to keep it consistent. (views on one db and underlying core tables on the other)

Banyek added a comment.Wed, Dec 5, 1:48 PM

Thanks for the clarification - I just wanted to know if there was some specific reason for it that I might have missed.
Once the test is done make sure to either clean it up or move it to the views DB, as that is where the labsdbuser has access to and to keep it consistent. (views on one db and underlying core tables on the other)

aye

Bstorm added a comment.Wed, Dec 5, 3:45 PM

Basically, we are importing actor and comment from production replicas, and these include comment text that might have been redacted via a rev_deleted, log_deleted, etc. flag. But we import revision, archive, and logging from the cloud replicas.

Strictly speaking, if you aren't using the whole comment or actor table, the definition of whether it is sanitized or not is always dependent on whether the row in the external table is marked somehow deleted in the original table. So I imagine that if you only join to the comment or actor table without making those tables available in some way in the same "channel", it's functionally sanitized, yes (with the current definitions of the tables--should that change this may not be true). That means that if you scoop out the views for just revision, archive and logging, I believe that you could join that off prod versions of comment and actor and get sanitized data if you don't miss something that is applied at the sanitarium level for those tables (Is anything? @Marostegui ).

Bstorm added a comment.Wed, Dec 5, 3:46 PM

That's relying on a very particular special case, but I think it is actually true. 😁

We don't sanitize anything on actor or comment on a triggers level on sanitarium if that is what you ask.

*************************** 1. row ***************************
             Trigger: abuse_filter_log_insert
               Event: INSERT
               Table: abuse_filter_log
           Statement: SET NEW.afl_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 2. row ***************************
             Trigger: abuse_filter_log_update
               Event: UPDATE
               Table: abuse_filter_log
           Statement: SET NEW.afl_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 3. row ***************************
             Trigger: archive_insert
               Event: INSERT
               Table: archive
           Statement: SET NEW.ar_comment = '', NEW.ar_comment_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 4. row ***************************
             Trigger: archive_update
               Event: UPDATE
               Table: archive
           Statement: SET NEW.ar_comment = '', NEW.ar_comment_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 5. row ***************************
             Trigger: ep_courses_insert
               Event: INSERT
               Table: ep_courses
           Statement: SET NEW.course_token = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 6. row ***************************
             Trigger: ep_courses_update
               Event: UPDATE
               Table: ep_courses
           Statement: SET NEW.course_token = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 7. row ***************************
             Trigger: recentchanges_insert
               Event: INSERT
               Table: recentchanges
           Statement: SET NEW.rc_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 8. row ***************************
             Trigger: recentchanges_update
               Event: UPDATE
               Table: recentchanges
           Statement: SET NEW.rc_ip = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 9. row ***************************
             Trigger: revision_insert
               Event: INSERT
               Table: revision
           Statement: SET NEW.rev_text_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 10. row ***************************
             Trigger: revision_update
               Event: UPDATE
               Table: revision
           Statement: SET NEW.rev_text_id = 0
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 11. row ***************************
             Trigger: user_insert
               Event: INSERT
               Table: user
           Statement: SET NEW.user_password = '', NEW.user_newpassword = '', NEW.user_email = '', NEW.user_options = '', NEW.user_token = '', NEW.user_email_authenticated = '', NEW.user_email_token = '', NEW.user_email_token_expires = '', NEW.user_newpass_time = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
*************************** 12. row ***************************
             Trigger: user_update
               Event: UPDATE
               Table: user
           Statement: SET NEW.user_password = '', NEW.user_newpassword = '', NEW.user_email = '', NEW.user_options = '', NEW.user_token = '', NEW.user_email_authenticated = '', NEW.user_email_token = '', NEW.user_email_token_expires = '', NEW.user_newpass_time = ''
              Timing: BEFORE
             Created: NULL
            sql_mode: IGNORE_BAD_TABLE_OPTIONS
             Definer: root@localhost
character_set_client: utf8
collation_connection: utf8_general_ci
  Database Collation: binary
This comment was removed by Milimetric.

Following up on a good problem that @Bstorm raised with my approach. I would love @Anomie to take a look at my comment as well and see if it makes sense. Ok, so to confirm @Marostegui's understanding, yes, we are importing archive, logging, and revision from the views in the cloud replicas, and actor, comment unsanitized from the production replicas. And the only way we use rows from actor and comment is to join them to rows from the three views, which are sanitized. And I believe what @Bstorm was pointing out was that if rev_deleted is set to sanitize rev_actor, that would be fine for the revision table. But if the archive table references the same actor_id through ar_actor, but the ar_deleted flag is not set, then that would not be sanitized. According to the actor view definition: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#222, a particular actor_id is not present if it's sanitized by *any* log/rev/ar/etc_deleted flag.

But when building archive_compat, we join to the actor table, right? And only sanitize if ar_deleted & 4. That would mean:

  • if ar_actor and rev_actor point to the same actor_id, say 123
  • and the corresponding ar_deleted is 0 but rev_deleted is 4, then
    • in archive_compat we get the actor_name as ar_user_text
    • in revision or revision_compat that actor is sanitized

But as I see it, that's ok, because no id would show up in the revision table that would allow us to find that actor. Not the rev_user nor the rev_actor. I'm guessing this is also a case where the views for actor and comment are forced to do too much. If a particular comment or actor name should disappear from everywhere on the wiki, someone would have to set all the corresponding deleted flags to make it so. But the views have to sanitize using an OR instead of an AND just to be safe.

In any case, I think in our use case we're safe, as we're still sanitizing according to the deleted flags, which is what the _compat views are doing. @Bstorm, @Anomie, what do you think?

Anomie added a comment.Wed, Dec 5, 4:57 PM

According to the actor view definition: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#222, a particular actor_id is not present if it's sanitized by *any* log/rev/ar/etc_deleted flag.

You seem to have the logic backwards. The actor view will expose the row if any log/rev/ar/etc_deleted flag doesn't "sanitize" it, or if it's referenced from the user or image tables which don't even have an xx_deleted flag.

According to the actor view definition: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#222, a particular actor_id is not present if it's sanitized by *any* log/rev/ar/etc_deleted flag.

You seem to have the logic backwards. The actor view will expose the row if any log/rev/ar/etc_deleted flag doesn't "sanitize" it, or if it's referenced from the user or image tables which don't even have an xx_deleted flag.

I think you're just saying the contrapositive of what I said. I said "not present if sanitized" and you said "present if not sanitized". Is there some subtlety I'm missing?

Anomie added a comment.EditedWed, Dec 5, 5:14 PM

You said "any reference is sanitized → not present". The contrapositive of that would be "present → not (any reference is sanitized)" which is equivalent to "present → all references are unsanitized". Note that production allows for there to be no references at all.

What I said was "there exists at least one reference AND any reference is unsanitized → present". The contrapositive of that would be "not present → not (there exists at least one reference AND any reference is unsanitized)" which is equivalent to "not present → there are no references OR all references are sanitized".

EDIT: Fixed some typos that completely changed the meaning.

Bstorm added a comment.Wed, Dec 5, 5:14 PM

@Anomie I think you actually answered the question in that. As it currently stands, the logic there already effectively shows the row if it shows up in any of those other tables, not the other way around. I was confused at first there.

@Milimetric would like to try using the sanitized views of the revision, archive and logging tables to determine if he can expose data from the actor or comment tables via join (dodging the use of the really slow view on the replicas for that join). Since the condition is an OR, and the check is whether it is NOT deleted in an external table, I think you are effectively saying that he could do that "safely".

Right?

Bstorm added a comment.Wed, Dec 5, 5:16 PM

To be clear, I'm deliberately trying to find problems with this because it sounds correct to me. I want to be sure, though.

Anomie added a comment.EditedWed, Dec 5, 5:23 PM

@Milimetric would like to try using the sanitized views of the revision, archive and logging tables to determine if he can expose data from the actor or comment tables via join (dodging the use of the really slow view on the replicas for that join). Since the condition is an OR, and the check is whether it is NOT deleted in an external table, I think you are effectively saying that he could do that "safely".

Right?

I'm not sure what exactly you're asking, so I'll give several answers.

You can safely make a view of actor that includes only the "OR EXISTS( ... )" clauses for revision, archive, and/or logging. It'll be a subset of the current actor view.

You can safely make a view of actor that checks against the sanitized revision, archive, logging, etc views (all of them or any subset). In that case you don't even need to check the xx_deleted fields, because those sanitized views already have xx_actor zeroed out.

You can safely make revision_compat, archive_compat, and logging_compat views that join with the unsanitized actor table and consider only the current row's rev_deleted/ar_deleted/log_deleted. That's exactly what the existing revision_compat, archive_compat, and logging_compat views already do, I believe.

Thanks very much @Anomie, I understand my misunderstanding, and your third answer is what I was asking.

Banyek moved this task from Backlog to In progress on the User-Banyek board.Wed, Dec 5, 7:39 PM
Banyek added a comment.Wed, Dec 5, 7:50 PM

The 'materialized view' for comments is completed. I moved it into enwiki_p with name comment_mat.
It took almost six hours, and the file is 31G

MariaDB [enwiki_p]> create table comment_mat as select * from comment_view_temp;
Query OK, 248688864 rows affected (5 hours 59 min 50.11 sec)
Records: 248688864  Duplicates: 0  Warnings: 0
MariaDB [enwiki_p]> exit
root@labsdb1010:/srv/sqldata/enwiki_p# ls -lh comment_mat.ibd
-rw-rw---- 1 mysql mysql 31G Dec  5 17:14 comment_mat.ibd

I do the same now with the actors

Banyek added a comment.Wed, Dec 5, 8:01 PM

The actor view was empty, and the empty actor_mat too.

Anomie added a comment.EditedWed, Dec 5, 8:06 PM

If you want to fake up an actor_mat for a size estimate, I think something like this would do it. Obviously it won't work for a speed estimate though.

INSERT INTO actor_mat (actor_user, actor_name) SELECT user_id, user_name FROM user;
 -- All the rest have "xx_user = 0" because any that aren't 0 should have been done above
INSERT INTO actor_mat (actor_user, actor_name) 
 SELECT DISTINCT rev_user, rev_user_text FROM revision WHERE rev_user = 0
 UNION SELECT DISTINCT ar_user, ar_user_text FROM archive WHERE ar_user = 0
 UNION SELECT DISTINCT ipb_by, ipb_by_text FROM ipblocks WHERE ipb_by = 0
 UNION SELECT DISTINCT img_user, img_user_text FROM image WHERE img_user = 0
 UNION SELECT DISTINCT oi_user, oi_user_text FROM oldimage WHERE oi_user = 0
 UNION SELECT DISTINCT fa_user, fa_user_text FROM filearchive WHERE fa_user = 0
 UNION SELECT DISTINCT log_user, log_user_text FROM logging WHERE log_user = 0;

Mentioned in SAL (#wikimedia-operations) [2018-12-05T20:08:07Z] <banyek> repooling labsdb1010 - T210693

Banyek added a comment.Wed, Dec 5, 8:09 PM

If you want to fake up an actor_mat for a size estimate, I think something like this would do it. Obviously it won't work for a speed estimate though.

INSERT INTO actor_mat (actor_user, actor_name) SELECT user_id, user_name FROM user;
 -- All the rest have "xx_user = 0" because any that aren't 0 should have been done above
INSERT INTO actor_mat (actor_user, actor_name) 
 SELECT DISTINCT rev_user, rev_user_text FROM revision WHERE rev_user = 0
 UNION SELECT DISTINCT ar_user, ar_user_text FROM archive WHERE ar_user = 0
 UNION SELECT DISTINCT ipb_by, ipb_by_text FROM ipblocks WHERE ipb_by = 0
 UNION SELECT DISTINCT img_user, img_user_text FROM image WHERE img_user = 0
 UNION SELECT DISTINCT oi_user, oi_user_text FROM oldimage WHERE oi_user = 0
 UNION SELECT DISTINCT fa_user, fa_user_text FROM filearchive WHERE fa_user = 0
 UNION SELECT DISTINCT log_user, log_user_text FROM logging WHERE log_user = 0;

Thank you! I'll look at it

Banyek added a comment.Wed, Dec 5, 8:09 PM

also we need to create indices for comment_mat

I am not sure it makes sense to continue that approach.
We have more than 900 wikis, even if it takes 5GB average per wiki to for that materialized table...we don't have enough space for creating all those tables.

Defining all the triggers might be a pain, but https://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html makes it sound like we could add an extra column for a "show" flag to the comment and actor tables on the sanitarium without breaking replication. That flag would default to false, and then triggers (like in T209031#4740258) would update it based on changes to the referring tables. Then the views could show rows based on the flag column rather than having to do all those subqueries.

Or we could do more or less the same with added comment_show and actor_show tables, which would each have two columns (the actor_id/comment_id and the "show" flag). One join in the view would still be better than a bunch of subqueries.

On the issue of storage, the average per wiki would definitely not come out to 5GB. If you plot all wikis by number of revisions, the curve has a very long tail, with big wikis like enwiki, commons, wikidata being more outliers than the norm. I just did a basic count(*) in hadoop and enwiki has 801325542 out of a total of 3704431563 revisions across all wikis across all time. If the comment and actor table scale roughly in line with this*, and enwiki has a 31GB materialized comment view as @Banyek mentioned, then my estimate for the overall size of the materialized comment view is around 150GB. Actor should be smaller because it benefits from sharing. I'm not sure if that's still too big, but just wanted to say I have good reason to think it's not as bad as the 4500GB estimate.

*comments should be distributed in the same way across wikis, because they mostly come out of revision and logging which are similar enough in size

Banyek added a comment.Thu, Dec 6, 4:57 PM

150 Gb seems acceptable to me, but

  • we still have some time factor to think about, because those tables have to be created.
  • if the analytics will have a separate instance that shouldn't be a problem, but until/if the labsdb hosts will be used, we need to depool one of the hosts while the mat.views getting generated
  • we also need to calculate with sizes (and creation time) of indexes, because the new tables will not have those
  • even if we can afford depooling one of the labsdb hosts for building the new indexes, we still have to solve it to do automatically, because now it involves puppet deploy
Banyek added a comment.Thu, Dec 6, 4:57 PM

@Bstorm if you prepare a depool patch for me for tomorrow I can start create the mat. view on the other schemas along enwiki, and we can evaluate time and disk space

Thanks very much @Anomie, I understand my misunderstanding, and your third answer is what I was asking.

Whilst we still discuss if this JOIN "feature" can do what we think I would like also to bring to the table the fact that we should think about a check to make sure the JOIN (or whatever is underlying it) isn't or won't expose private data. Same way we have a check_private_data on our sanitarium and labs hosts, to make sure we are not exposing anything considered private.
How could we do that?
Maybe some sort of checking that runs a few days before the actual import/export runs, so if something is detected, the process won't start. That would make me feel sleep better.

Whilst we still discuss if this JOIN "feature" can do what we think

I believe we settled this part, the join I proposed is safe, in that it exposes no more than the cloud views.

I would like also to bring to the table the fact that we should think about a check to make sure the JOIN (or whatever is underlying it) isn't or won't expose private data. Same way we have a check_private_data on our sanitarium and labs hosts, to make sure we are not exposing anything considered private.
How could we do that?

The check_private_data script checks three things:

  • whether private databases exist
  • whether private tables exist
  • whether columns that are filtered out by filtered_tables.txt have data in them

To do this properly, we would write a script that pulls in these definitions and lists of what's private and what's not, and build Hive queries that check the resulting data. We would want to do this as late as possible in the publishing process, because once the dataset is published, it's out there. In Oozie, we could apply this check as a step after the JOINs. And if there's a problem, we can throw an error that will stop the job and alert us.

But I think this effort would be focusing our energies in the wrong place. So, right now, our data flow looks like this if you're thinking about private data:

  1. private data leaks to cloud dbs
  2. private data is copied to Hadoop
    • a. datasets are crunched and published from Hadoop (the JOINs happen here)
  3. private data is found
    • a. found on cloud dbs by the check_private_data script, and fixed
    • b. found on the wikis by editors or WMF staff, and fixed, which is reflected by replication to cloud dbs
  4. on the next snapshot, all fixes are propagated to Hadoop

So what you're proposing is that we copy step 3a and run it in Hadoop as part of 2a. We could do it as I mentioned above, but we would still leak the vast majority of private data, because of the snapshot-in-time nature of what we're doing here. So any problems found from the 2nd of the month to the last day of the month, or 93% of the time, would not be known to us when we build that month's snapshot. So we're just like dumps, we leak most of the private data that's fixed during the month. The main thing that affects what private data we leak is what point in time we cut the snapshot, weighed against how quickly or slowly private data is found and fixed via either 3a or 3b.

So, if we're serious about making a dataset that has the least amount of private data in it within the timeframe that we want to publish it, the way to go about it would be to graph incoming changes to log_deleted, rev_deleted, etc. flags on rows from last month, over time. If there's a big spike of them at the beginning of this month, say on the 3rd of the month, and we delay our snapshot until the 4th, then that would be much more efficient than duplicating the logic of check_private_data and running it when it doesn't have much hope of finding any issues.

What I've said so far kind of sidesteps that check_private_data is more of a check to make sure triggers and MariaDB work the way they're supposed to work. This is not really an issue in Hadoop, we have never seen Hive or sqoop fail in the sense that they bring in the wrong data in the wrong column or something like that. Our queries could be dynamically written to take into account filtered_tables and private_tables and private_databases, and that would be just as good as duplicating check_private_data. But our queries are also very simple and easily checked against the views for even better assurances of privacy.

To me, the best place to put our effort is to recognize that sometimes we have serious privacy leaks that stay around in dumps or analytics datasets for at least a month if not forever. And to start to design and build the next sanitize pipeline that would allow us to propagate fixes in real-time or more regularly to these datasets.

Thanks for the detailed analysis.
My proposal was more in the line to check that the JOIN is always safe.
On the sanitarium+labs side we do have 4 kind of "security" (note the " "!) layers

  • Replication filters
  • Triggers
  • GRANTs
  • Views

And yet, we still have check_private_data script just to make sure there is something that would alert us if there is some data, column, database that made it (somehow) thru all this layers. Some sort of safety net.
While the JOIN might be safe, I still think we should have some sort of alert mechanism to let us know when the JOIN isn't safe or has changed or might be querying the wrong things.
How to do that? I don't know - but I think we need to spend a bit of time thinking about how could we do that.

To do this properly, we would write a script that pulls in these definitions and lists of what's private and what's not, and build Hive queries that check the resulting data. We would want to do this as late as possible in the publishing process, because once the dataset is published, it's out there. In Oozie, we could apply this check as a step after the JOINs. And if there's a problem, we can throw an error that will stop the job and alert us.

How hard do you think that would take to build?
I am really afraid of just leaving all safety to a JOIN based on views and not having another check which lives outside of the clouds replicas one :-)

Thank you - really appreciate your clear answer!

To duplicate something like check_private_data in Hadoop, I'd guess a day to write it and a couple of days to review and test it. So probably like a week to get everything deployed and integrated with the current job. We have to change some other jobs too to make them depend on this check.

Unlike the cloud replicas, we don't replicate and publish everything, we only pull a specific set of fields from a specific set of tables. So I don't see how something would sneak by, but if it makes you feel better, I'm happy to spend the week and get it done :)

I should still say that I think we would get much more out of spending a couple of days to understand the data better and then adding a small delay. I'm willing to bet that would sanitize many more private records than the check_private_data approach.

To duplicate something like check_private_data in Hadoop, I'd guess a day to write it and a couple of days to review and test it. So probably like a week to get everything deployed and integrated with the current job. We have to change some other jobs too to make them depend on this check.

Unlike the cloud replicas, we don't replicate and publish everything, we only pull a specific set of fields from a specific set of tables. So I don't see how something would sneak by, but if it makes you feel better, I'm happy to spend the week and get it done :)

Hehe thanks - it is not really my call - I just want to bring awareness that whilst we have checks on the cloud replicas - we won't have any on the other side of the equation.
How we do that, is obviously up for discussion!

I should still say that I think we would get much more out of spending a couple of days to understand the data better and then adding a small delay. I'm willing to bet that would sanitize many more private records than the check_private_data approach.

Sure! You know the Hadoop environment better than me! :)
Again, my point is that we are relatively safe from the cloud replicas side with the checks we have, and I want to feel we are also covered from the other side.
This is not really my call - I am not the one who has to say we have to spend time on that, but I want to make sure this is discussed among everyone and whether it signed off to go "live" without any check or not.

Change 478932 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] labsdb: depool labsdb1010

https://gerrit.wikimedia.org/r/478932

Mentioned in SAL (#wikimedia-operations) [2018-12-11T14:49:17Z] <banyek> depooling labsdb1010 - T210693

Change 478932 merged by Banyek:
[operations/puppet@production] labsdb: depool labsdb1010

https://gerrit.wikimedia.org/r/478932

on labsdb1010 I stared to create materialized views from all the comment views, except enwiki - as we know how much time and space it demands. The process is running on the host in a screen named create_mat also it creates a create_mat.loglogfile about the time it takes

The materialized view generation completed.
The total size of the materialized views are ~150 G all together.

root@labsdb1010:~# find /srv -iname comment_mat.ibd -ls | awk '{size_in_g +=$7} END {print "Total size: " size_in_g/1024/1024/1024}'
Total size: 149.266

The total time view generation take is ~18 hours (!This run excluded enwiki_p.comments_mat as it was created earlier and took ~5hrs!)

root@labsdb1010:~# cat create_mat.log | egrep  "Starting|Completed"
Tue Dec 11 15:07:11 UTC 2018 - Starting
Wed Dec 12 04:18:08 UTC 2018 - Completed

nb. there were no indexes added for the newly created tables, and that also would take disk space and time.

Mentioned in SAL (#wikimedia-operations) [2018-12-12T09:40:08Z] <banyek> repooling labsdb1010 - T210693

We didn't talked about this so far, but these views doesn't ask for having proper indexes?

@Banyek are you asking whether the views should have indices? I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide. For the Analytics use case, which shouldn't matter, we just get everything without joining, so an index wouldn't help much.

I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide.

I don't think that there honestly is a Cloud wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide.

I don't think that there honestly is a Cloud wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

Yeah, then at least in this incarnation these views are not a good solution. Honestly this is a really unique problem. The use cases for this cluster are both OLTP and OLAP. The OLTP ones suffer because this replica is not tuned the same as the production one. And the OLAP ones suffer because the schema is made with OLTP in mind. And the scale of the data makes a single solution very unlikely. If we could do two things, we could have:

  • a near-real-time denormalized-on-replication schema to cover the OLAP use cases
  • a real-time replica that serves a "recent changes" version of all tables

Talking to consumers to even get the requirements for that kind of split seems impossible, but solving this problem with a single approach is also beginning to seem impossible. Maybe the "talk to people" route is long but at least it's relatively easy.

I'm not sure for the cloud users use case, you'd have to look at common or slow queries against the _compat views to decide.

I don't think that there honestly is a Cloud wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.

Agreed - I don't think this solution is for every user case and not sure it is worth all the orchestration work (there is a lot) that will need to happen behind the scenes