
Not able to scoop comment table in labs for mediawiki reconstruction process [EPIC]
Closed, Resolved · Public · 0 Estimated Story Points

Description

Tasks to follow about the actor refactor:

Event Timeline


@Krenair: we are looking at how best to import the public dataset from labs. We have already looked into sqooping data from the non-public data hosts, and the sanitization is a lot harder than you might think (by no means as simple as "running your own views"), so we need to come up with a strategy to sqoop data from labs efficiently.

Can you point to the ticket where it was decided to use labs for this purpose?

Nuria renamed this task from Add index to `comment_id` field in `comment` table (all wikis) to Not able to scoop coment table in labs for mediawiki reconstruction process. .Nov 9 2018, 5:28 PM
Nuria updated the task description. (Show Details)
Reedy renamed this task from Not able to scoop coment table in labs for mediawiki reconstruction process. to Not able to scoop comment table in labs for mediawiki reconstruction process.Nov 9 2018, 7:31 PM

@Krenair: there were a handful of discussions, but the gist of the reasoning is that we want to pull data *after* it's sanitized. If we pull it before, we have to duplicate the sanitizing process, or analyze the results to make sure we're not leaking anything. So it's just easier to pull from cloud-dbs. Let me know if you have further questions and I can try to dig up the discussions.

@Marostegui and @jcrespo, I want to touch base with one or both of you on this, it turns out to be a tricky problem from an analytics point of view.

Once we understood this change and the comment view [1] defined in the cloud db, we tried to sqoop from it, joining with logging. The query would boil down to something like:

select log_id, ... comment_text as log_comment, ...
  from logging
           left join
       comment        on log_comment_id = comment_id
 where log_timestamp between x and y
   and ...

Basically re-creating the old logging table. We would need to do this for revision as well. If I understand the view correctly, though, this kind of join is basically hopeless (because of how many tables it has to search through to even build a list of comment ids to join to). Indeed, we ran a trial sqoop and it's still not finished over a day later, for just the logging table. We figured this would be the case, so to get our datasets done this month we also sqooped in parallel from dbstore1002, applying the same sanitizing rule (log_deleted & 2 = 0). But long term, we'd like to sqoop from cloud-dbs, to keep sanitizing in one place as we all agreed, and to avoid depending on dbstore1002. So, that's where we'd love your opinion. Some possible solutions we thought of:

  1. sqoop all of comment and join once it gets into Hadoop (a Hive sketch follows this list). This seems close to optimal, because revision and logging probably make up the majority of records in that view, and the sanitizing logic would have to be applied at some point no matter how we pulled the data out.
  2. if the assumption above is incorrect, we could make another comment view covering just revision and logging, or separate views for each. This would allow us to sqoop exactly what we need.
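
For option 1, the post-import join in Hadoop would look roughly like this. This is only a sketch in Hive SQL; the wmf_raw table names and the exact column list are assumptions for illustration, not our actual jobs:

-- Hypothetical Hive query: rebuild an old-style logging table after
-- sqooping `logging` and the full `comment` table separately into Hadoop.
CREATE TABLE wmf_raw.mediawiki_logging_with_comment AS
SELECT
    l.log_id,
    -- re-apply the view's sanitizing rule in Hadoop: null out suppressed comments
    IF((l.log_deleted & 2) = 0, c.comment_text, NULL) AS log_comment,
    l.log_timestamp
FROM wmf_raw.mediawiki_logging l
LEFT JOIN wmf_raw.mediawiki_comment c
    ON l.log_comment_id = c.comment_id;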

Is there any other solution you see, or problems with the above proposals? Maybe we can chat next week if you're available. If you have other urgent work, we are ok; we have backup plans and ways to get the data. But we would like to work on the long-term approach in parallel.

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml#300

Pinging @JAllemandou and @elukey. Let's please follow up with the DBAs early next week on this ticket.

This is not something we handle: we don't decide on the table structure (this refactoring, comment storage, was owned by the Platform team), while the actual view structure changes are handled by Cloud. I don't even know which structure there is on the wikireplicas; we only handle the production filtering. When the views were set up we weren't very happy about it, and we only "accepted" it on the condition that Cloud would handle wikireplicas filtering on its own. They have been working together on T181650; you should comment your needs there and one of the two teams may serve you better :-)

@Nuria, I'm catching up on the task that Jaime recommended and will comment there.

I have taken the liberty of removing "Cloud Services" as a subscriber to this ticket, as I do not think every toollabs user wants to receive notifications about this.

Why am I getting emails to this task?

Nuria apparently subscribed 120 cloud users to this task by mistake; please be careful when using Phabricator not to annoy (with spam) our valuable contributors just to get the desired attention.

fdans triaged this task as High priority.
fdans added a project: Analytics-Kanban.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Note the actor view will likely turn out to have similar issues.

As suggested in T209031#4736006, one solution would be to create specialized views like comment_revision, comment_logging, comment_image, and so on that only expose comment rows for one comment-using table and so only need one EXISTS clause.
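
For illustration, a specialized view of that shape might look like the following. This is just a sketch; the real views are generated from maintain-views.yaml:

-- Hypothetical comment_logging view: expose only comment rows referenced
-- from logging, and only where the log entry's comment is not suppressed.
CREATE VIEW comment_logging AS
SELECT comment_id, comment_hash, comment_text, comment_data
FROM comment
WHERE EXISTS (
    SELECT 1 FROM logging
    WHERE log_comment_id = comment_id
      AND (log_deleted & 2) = 0
);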

Or the cloud replicas could materialize the filtered comment view. Something like this seems like it should do it.

CREATE TABLE materialized_comment LIKE comment;

delimiter //
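-- Returns 1 if at least one comment-using table still references this comment visibly, 0 otherwise.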
CREATE FUNCTION materializeCommentShow(id BIGINT) RETURNS INT NOT DETERMINISTIC READS SQL DATA
BEGIN
 RETURN COALESCE(
  (SELECT 1 FROM image WHERE img_description_id = id LIMIT 1) OR
  (SELECT 1 FROM oldimage WHERE oi_description_id = id AND (oi_deleted & 2) = 0 LIMIT 1) OR
  -- ... and so on ...
 , 0);
END //
delimiter ;

delimiter //
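-- Copy a comment into the materialized table when a visible reference to it appears.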
CREATE PROCEDURE materializeCommentRowInsert(id BIGINT, del INT) NOT DETERMINISTIC MODIFIES SQL DATA
BEGIN
 IF (del & 2) = 0 THEN 
  INSERT IGNORE INTO materialized_comment SELECT * FROM comment WHERE comment_id = id;
 END IF;
END //
delimiter ;

delimiter //
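-- When a visible reference goes away, re-check whether anything still shows the comment and
-- remove it if not; a hidden reference (del & 2 set) never contributed, so no check is needed.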
CREATE PROCEDURE materializeCommentRowDelete(id BIGINT, del INT) NOT DETERMINISTIC MODIFIES SQL DATA
BEGIN
 IF (del & 2) = 0 AND NOT materializeCommentShow(id) THEN
  DELETE FROM materialized_comment WHERE comment_id = id;
 END IF;
END //
delimiter ;

delimiter //
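-- When a reference moves to a different comment or flips its comment-deleted bit,
-- remove under the old state and re-add under the new one.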
CREATE PROCEDURE materializeCommentRowUpdate(oldId BIGINT, oldDel INT, newId BIGINT, newDel INT) NOT DETERMINISTIC MODIFIES SQL DATA
BEGIN
 IF oldId != newId OR (oldDel & 2) != (newDel & 2) THEN
  CALL materializeCommentRowDelete(oldId, oldDel);
  CALL materializeCommentRowInsert(newId, newDel);
 END IF;
END //
delimiter ;

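-- Wire the procedures up with one insert/update/delete trigger per comment-using table.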
CREATE TRIGGER image_materializeCommentRowInsert AFTER INSERT ON image FOR EACH ROW CALL materializeCommentRowInsert( NEW.img_description_id, 0 );
CREATE TRIGGER image_materializeCommentRowUpdate AFTER UPDATE ON image FOR EACH ROW CALL materializeCommentRowUpdate( OLD.img_description_id, 0, NEW.img_description_id, 0 );
CREATE TRIGGER image_materializeCommentRowDelete AFTER DELETE ON image FOR EACH ROW CALL materializeCommentRowDelete( OLD.img_description_id, 0 );
CREATE TRIGGER oldimage_materializeCommentRowInsert AFTER INSERT ON oldimage FOR EACH ROW CALL materializeCommentRowInsert( NEW.oi_description_id, NEW.oi_deleted );
CREATE TRIGGER oldimage_materializeCommentRowUpdate AFTER UPDATE ON oldimage FOR EACH ROW CALL materializeCommentRowUpdate( OLD.oi_description_id, OLD.oi_deleted, NEW.oi_description_id, NEW.oi_deleted );
CREATE TRIGGER oldimage_materializeCommentRowDelete AFTER DELETE ON oldimage FOR EACH ROW CALL materializeCommentRowDelete( OLD.oi_description_id, OLD.oi_deleted );
-- ... and so on ...

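-- One-time backfill of all currently visible comments.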
INSERT IGNORE INTO materialized_comment SELECT * FROM comment WHERE materializeCommentShow(comment_id);

Thanks @Anomie. We (the analytics team) had also thought of a third potential solution. I list the 3 solutions below in increasing order of complexity (IMO):

  • Specialized views - Views for comments from each of revision, archive, and logging, separately. We have to test whether or not sqooping from these views would be fast enough, but it seems they would be useful for cloud db users in general.
  • Access to underlying tables - We could query the underlying tables, and that would bypass any performance problems we have with the views. We would duplicate the sanitizing logic from the views, and maintain it to be always the same as it is in cloud db. This would require special permissions to the cloud db.
  • Materialized views - This sounds like the best choice, as suggested by @Anomie. We thought they were discouraged by DBAs due to the implied slow-downs in replication. But if that's not a concern, let's do it!

Not being MySQL experts ourselves, could @Marostegui, @jcrespo, and @daniel give us their views on the ideas above?
We are not expecting the DBAs to implement the chosen solution, but their expertise in picking the best one will be greatly appreciated.
Obviously, any other comments are welcome!
Many thanks

Pinging @Bstorm on the above comment, as she's the one who set up the views.
Thanks

@Krenair: we are looking at how best to import the public dataset from labs. We have already looked into sqooping data from the non-public data hosts, and the sanitization is a lot harder than you might think (by no means as simple as "running your own views"), so we need to come up with a strategy to sqoop data from labs efficiently. Let's keep this ticket about the ways we can make the sqooping task easier.

Can you explain exactly what kind of sanitization is needed? It seems to me that the only thing you'd need to do is mask the comment text (and data) and the user name based on the rev_deleted flags, assuming your process is based on the revision table. If this is not the case, you'd also have to filter out anything that is not referenced from the revision table.
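
To make that concrete, the masking would amount to something like this when pulling from revision. A sketch only: bit 2 of rev_deleted hides the comment and bit 4 hides the user, and I'm assuming direct rev_comment_id / rev_actor columns rather than the temporary linking tables:

SELECT
    rev_id,
    -- null out suppressed fields instead of filtering whole rows
    IF((rev_deleted & 2) = 0, comment_text, NULL) AS rev_comment,
    IF((rev_deleted & 4) = 0, actor_name, NULL) AS rev_user_text
FROM revision
LEFT JOIN comment ON rev_comment_id = comment_id
LEFT JOIN actor ON rev_actor = actor_id;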

What are your requirements wrt latency? Is it ok if it takes a day for the data to make it into your system?

In any case, I'm wondering: doesn't labs also offer a sanitized view of the new schema? That should be a lot more efficient than the compat views.

  • Access to underlying tables - We could query the underlying tables, and that would bypass any performance problems we have with the views. We would duplicate the sanitizing logic from the views, and maintain it to be always the same as it is in cloud db. This would require special permissions to the cloud db.

At that point you wouldn't be using the provided views so there'd be no point connecting out to the cloud DB replicas. And I don't think you'd get special permissions.

In any case, I'm wondering: doesn't labs also offer a sanitized view of the new schema? That should be a lot more efficient than the compat views.

That's what they're talking about, the sanitised views in labs.

  • Access to underlying tables - We could query the underlying tables, and that would bypass any performance problems we have with the views. We would duplicate the sanitizing logic from the views, and maintain it to be always the same as it is in cloud db. This would require special permissions to the cloud db.

At that point you wouldn't be using the provided views so there'd be no point connecting out to the cloud DB replicas. And I don't think you'd get special permissions.

My understanding is that the CloudDB replicas have a sanitized version of the data, sanitization being done through replication. Being able to connect directly to the underlying datasource prevents the views' performance cost from impacting us while still accessing already-sanitized data (we would still reimplement the equivalent of the view sanitization in the cluster if we chose this solution).

In any case, I'm wondering: doesn't labs also offer a sanitized view of the new schema? That should be a lot more efficient than the compat views.

That's what they're talking about, the sanitised views in labs.

As far as I can see, the discussion was about the sanitized compat view, which maps comment text and user names back into the revision table. I'm asking about using the sanitized view of the new comment and actor tables directly. That would have much less of a performance penalty. Depending on how sanitization works on these tables, this may or may not be the same thing as bypassing the views.

By the way, if you are also looking at things like content model or content size, you will also need to consider the new slots and content tables.

@daniel: We were using the sanitized comment view (named comment), not the compat one (named revision_compat).
We tried two things:

  • export the comment view (select comment_id, comment_text from comment) - but this is very slow due to the way the comment view is built.
  • export a manually built equivalent of the logging_compat view - we were doing the join in the query, and it was also very slow as it took more than a day for the logging table.

@daniel: We were using the sanitized comment view (named comment), not the compat one (named revision_compat).

Ah, ok.

  • export a manually built equivalent of the logging_compat view - we were doing the join in the query, and it was also very slow as it took more than a day for the logging table.

In my experience, when you need to combine information from two very large tables, the best approach is: 1) paged access to one of the tables, based on a unique key. If that doesn't resolve the issue, 2) use a sub-query instead of a join. With small-ish batches, sub-queries are often faster than joins.
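
Concretely, those two approaches might look like this (a sketch with an assumed batch size of 10000, using placeholder session variables):

-- 1) Paged access on a unique key: walk logging in fixed-size batches,
--    remembering the last log_id seen instead of using OFFSET.
SELECT log_id, log_timestamp, log_comment_id
FROM logging
WHERE log_id > @last_seen_log_id
ORDER BY log_id
LIMIT 10000;

-- 2) Sub-query instead of a join: resolve comments for one id-range batch at a time.
SELECT comment_id, comment_text
FROM comment
WHERE comment_id IN (
    SELECT log_comment_id FROM logging
    WHERE log_id BETWEEN @batch_start AND @batch_end
);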

  • Specialized views - Views for comments from each of revision, archive, and logging, separately. We have to test whether or not sqooping from these views would be fast enough, but it seems they would be useful for cloud db users in general.
  • Access to underlying tables - We could query the underlying tables, and that would bypass any performance problems we have with the views. We would duplicate the sanitizing logic from the views, and maintain it to be always the same as it is in cloud db. This would require special permissions to the cloud db.
  • Materialized views - This sounds like the best choice, as suggested by @Anomie. We thought they were discouraged by DBAs due to the implied slow-downs in replication. But if that's not a concern, let's do it!

I've been implementing specialized views wherever possible, since that works with current tooling fairly easily (though it probably doesn't help load issues). If we took that approach, we would certainly need to do it for actors as well, making tool/cloud schemas somewhat odd compared to MediaWiki schemas. As is, I think a documentation refresh is needed around that to help developers navigate the quirks of the views.

Access to underlying tables dodges much of the reason for the replicas in some ways. That's outside my area :)

Materialized views in MariaDB require a plugin that basically adds some kind of hooks. It's going to cause drag on replication and increased load (possibly too much). I love the idea in theory for so many reasons, but I must defer to the DBAs on that, because I can easily see how it could just make nothing work.

In the absence of other people's input, I'd probably jump on option one. However, this too could have performance/load consequences, so I'd be interested to hear input from @jcrespo.

Materialized views in MariaDB require a plugin that basically adds some kind of hooks.

I guess you're referring to https://mariadb.com/kb/en/library/flexviews/? I don't think the plugin is required, although using it might be easier than writing all the hooks manually, particularly for materializing views that are more complicated than the simple filtering we'd be doing here.

I recently came across mention (T209048#4731971) that there's already some sort of filtering going on in the setup for the replicas, so possibly something could be done at that level too.

Note: I'm not done reading back yet, but yeah, that's what I was thinking of. I didn't think anybody would want to do all the hooks by hand. I'm thinking of comparing to, say, Oracle :)

Thanks again @Anomie, @daniel, @Bstorm for chiming in; your questions and suggestions have helped us move forward.

  • Short term solution: We have tested exporting data from the logging_compat view and it's fast enough to get us by for now; a lot faster than getting the comment view on its own, or than joining the logging and comment views. But we still can't export the revision table using the revision_compat view; it's too slow.
  • Long term solution: Having invested more time in understanding how the sanitization of data happens for labs, we think it would be a better solution to use the analytics replicas to get the data from MariaDB, and then sanitize it in Hadoop. This will be a long project, and I will reuse the parent task to coordinate. Ideally, we'll just use your maintain-views.yaml logic directly, so we stay in sync with any changes made to the sanitization of the currently existing views.

I don't want to muddy the waters as I have not been involved here :). But it's worth noting there is more than the views at play for sanitization. To mimic the end state of labsdb you would need the equivalent of the triggers and such on sanitarium, at least, as well.

Thanks @chasemp for raising the point. We are aware of the two steps of sanitization.

For us, the code for sanitarium is mostly available here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb . I say mostly because the table filters, which define which tables are not replicated at all in labs (text, for instance), are stored in another place (as per IRC chat with @Marostegui).
And the code for views is here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml

We're currently reviewing both thoroughly, and will document and present the approach we think is best in the parent task.
Ok for you, @chasemp?

Makes sense; again, sorry for the drive-by comment. Let me know if I can be helpful :)

It seems filtered_tables.txt has not been updated to work with the new columns; I asked about it here: rOPUP188a50fa82a005563b2400493eb2fe182ca878fd#6408393.

To keep everyone in the loop on our current plan: short term, we've tested the compat views even more, and those seem to work, so we're going ahead with that for now.

Long term, we're implementing sanitization on our end while using as much of the logic from the existing process as possible. This seems to me to help everyone:

  • Cloud dbs get rid of our heavy monthly jobs, since we'll be importing from our analytics replicas
  • We can run the import earlier and faster because we have no competition on our own replicas, so data will be available sooner
  • We don't duplicate logic, as explained in our plan below
  • We'll have more data, like change_tag and ipblocks_reason in hadoop, making some analysts happy

So the process will reuse the two main artifacts that define sanitizing, as follows:

  • Step 1: use filtered_tables.txt to generate sqoop selects, selecting all the K fields and nulling all the F fields
  • Step 2: use maintain-views.py [0] to generate queries that would do the actual sanitizing. These are the same queries that would run to create the views. The idea, which would need @Bstorm's approval, is:
    • update the yaml file to reduce MariaDB-specific syntax as much as possible. For example, instead of if(log_deleted&2, ...) we could change it to if(log_deleted&2 != 0, ...)
    • factor out the query-building parts of [0], for example by adding a new method create_customview_select alongside the existing create_customview [1] (a rough sketch of the intended output follows this list)
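
To make the idea concrete, the output we'd want from such a create_customview_select (a method name we're proposing here, not something that exists in [0] yet) for logging would be a standalone SELECT mirroring the view's nulling logic, roughly:

SELECT
    log_id,
    log_type,
    log_action,
    log_timestamp,
    -- same sanitizing condition as the view, in the portable syntax proposed above
    IF(log_deleted & 2 != 0, NULL, log_comment_id) AS log_comment_id,
    log_deleted
FROM logging;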

Of course we'll carefully test any changes we make as we go, and try to minimize the time we ask of Brooke or Manuel. We figure we'll get everything tested and worked out by the end of this year and submit something in early January. It's not a desperate rush, but we'd like to get to a more stable situation for everyone involved.

[0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/labs/db/views/maintain-views.py
[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/labs/db/views/maintain-views.py#259

  • Step 1: use filtered_tables.txt to generate sqoop selects, selecting all the K fields and nulling all the F fields

It seems like this would be more easily solved by setting up a MariaDB server inside the Analytics realm and making it a replica of the existing sanitarium servers, just like the Wiki Replica servers in Cloud Services (labsdb10{09,10,11}).

  • Step 2: use maintain-views.py [0] to generate queries that would do the actual sanitizing. These are the same queries that would run to create the views.

Or you could use maintain-views unaltered to install the same view layer that is used on the Wiki Replica servers. This, plus the replication from sanitarium, would give you a functional clone of a Wiki Replica server reserved only for Analytics use, rather than sharing the existing resources with the larger technical community.

This of course assumes that your main difficulty today is getting exclusive use of enough CPU and RAM to extract the data that you are interested in from the Wiki Replicas and into your Hadoop target. Did I miss something in your plan to recreate the sanitarium+views logic that would lead to a drastically different result?

  • Step 1: use filtered_tables.txt to generate sqoop selects, selecting all the K fields and nulling all the F fields

It seems like this would be more easily solved by setting up a MariaDB server inside the Analytics realm and making it a replica of the existing sanitarium servers, just like the Wiki Replica servers in Cloud Services (labsdb10{09,10,11}).

No, I don't think so. We would still need our production replicas for analysis that requires data not available in the cloud db, so this would double our infrastructure. This step costs us almost nothing: it's very easy to generate the selects we need based on filtered_tables.txt, and we have to do it anyway as we expand our sqoop. So this is more convenient for us than I can explain here.

  • Step 2: use maintain-views.py [0] to generate queries that would do the actual sanitizing. These are the same queries that would run to create the views.

Or you could use maintain-views unaltered to install the same view layer that is used on the Wiki Replica servers. This, plus the replication from sanitarium, would give you a functional clone of a Wiki Replica server reserved only for Analytics use, rather than sharing the existing resources with the larger technical community.

This of course assumes that your main difficulty today is getting exclusive use of enough CPU and RAM to extract the data that you are interested in from the Wiki Replicas and into your Hadoop target. Did I miss something in your plan to recreate the sanitarium+views logic that would lead to a drastically different result?

Same as above: unsanitized data is useful in Hadoop. Once the data is in Hadoop, we could create Hive views, but I think tables stored as Parquet would be better optimized for the jobs that need them. It's not just exclusive access to a replica; we also need to get data both ways, separate from the issue discussed here. Sorry, we're sort of solving both things at once, just trying to make good use of minimal infrastructure (our ops are already too busy).

We might apply the column filtering once the data is in Hadoop; that might be even better. So: sqoop everything, then apply the two sanitize steps.

I'm aiming to write tests for this script shortly because it is too complex not to have them. Overloading the script's functionality with something it wasn't written for makes me a bit nervous. It is already very easy to introduce mistakes, requiring very careful review and manual QA in test dbs, when I make updates.

What you're aiming for is printed-out queries that can be consumed by sqoop/hive/drill/etc., generated from the yaml, which currently serves as the record of the sanitization on the replicas, correct?

This seems like it would almost be a better job for another script that uses similar logic but relies on the yaml's structure and content (neither of which has been static this year, but comments can be added). Changing the yaml content to suit this particular case doesn't sound awful as long as it doesn't impact the logic and performance of the statements. Though an external script could also strip such things as MariaDB-isms out on the fly, that sounds like a mess.

I'm aiming to write tests for this script shortly because it is too complex not to have them. Overloading the script's functionality with something it wasn't written for makes me a bit nervous. It is already very easy to introduce mistakes, requiring very careful review and manual QA in test dbs, when I make updates.

Got it, I'll stay away from the script then. We have a bunch of experience with SQLAlchemy and doing local unit tests with SQLite; happy to help with code review if you need.

What you're aiming for is printed-out queries that can be consumed by sqoop/hive/drill/etc., generated from the yaml, which currently serves as the record of the sanitization on the replicas, correct?

Yes, exactly.

This seems like it would almost be a better job for another script that uses similar logic but relies on the yaml's structure and content (neither of which has been static this year, but comments can be added).

Totally ok with me. Would you like the new script to be in the same place as maintain-views.py or in our refinery repository? (I have no preference.)

Changing the yaml content to suit this particular case doesn't sound awful as long as it doesn't impact the logic and performance of the statements. Though an external script could also strip such things as MariaDB-isms out on the fly, that sounds like a mess.

Ok, after I write the yaml-to-SQL-query logic, I'll find all the incompatible syntax. If any of it would change the yaml significantly, I'll find another way. But if it's all like the if-statement example I gave above, I'll submit a gerrit change and you can decide then.

Great! Thank you so much. I'll keep going with this on Monday; do let me know if you have any other concerns.

Note the actor view will likely turn out to have similar issues.

As suggested in T209031#4736006, one solution would be to create specialized views like comment_revision, comment_logging, comment_image, and so on that only expose comment rows for one comment-using table and so only need one EXISTS clause.

Or the cloud replicas could materialize the filtered comment view. [quoted materialization SQL omitted; it repeats the earlier comment verbatim]

Alternatively, the queries:

select comment_id from comment order by comment_id asc limit 1;
and
select comment_id from comment order by comment_id desc limit 1;

can get the same thing efficiently on tool labs.

@Bawolff: You quoted my comment, but I can't see how your reply is relevant. How would fetching the minimum and maximum comment ID be equivalent to either method of more efficiently fetching comment rows?

I think @Bawolff was referring to the automatic query that Sqoop generates against the table you point it at, usually something like select min(id_you_split_by), max(id_you_split_by) from table_to_sqoop. I would hope the MariaDB optimizer knows to treat that the same as the order-by-comment_id approach you mentioned, but you never know :) In any case, we don't have much control over how Sqoop does these queries. We could choose to import those tables without splitting (in a single mapper), which would skip that inefficient query, but that would slow the import down as well. Basically it's a tricky problem; we tried a few different solutions, but ultimately the new views are just a bit too slow.

I was assuming, based on this comment

By the way, the query discussed on IRC was SELECT MIN(comment_id), MAX(comment_id) FROM (select comment_id, convert(comment_text using utf8) as comment_text from comment where (1 = 1) ) AS t1 which simplifies to SELECT MIN(comment_id), MAX(comment_id) FROM comment;

that the problem the materialized view was trying to solve was fetching the min and max id numbers. I may have skimmed this task too fast.

I was assuming, based on this comment

By the way, the query discussed on IRC was SELECT MIN(comment_id), MAX(comment_id) FROM (select comment_id, convert(comment_text using utf8) as comment_text from comment where (1 = 1) ) AS t1 which simplifies to SELECT MIN(comment_id), MAX(comment_id) FROM comment;

that the problem the materialized view was trying to solve was fetching the min and max id numbers. I may have skimmed this task too fast.

The materialized view was meant to address the more general fact that using the comment table can be slow, since that slowness could be relevant to other users of the WMCS (e.g. Toolforge) as well.

JAllemandou set the point value for this task to 0.
JAllemandou moved this task from Parent Tasks to In Progress on the Analytics-Kanban board.

Change 486203 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Sqoop actor and comment from production monthly

https://gerrit.wikimedia.org/r/486203

Change 486203 merged by Elukey:
[operations/puppet@production] Sqoop actor and comment from production monthly

https://gerrit.wikimedia.org/r/486203

Nuria renamed this task from Not able to scoop comment table in labs for mediawiki reconstruction process to Not able to scoop comment table in labs for mediawiki reconstruction process [EPIC].Feb 14 2019, 5:07 AM
Nuria closed this task as Resolved.