
Update the sqoop configuration for MediaWiki to obtain linktarget from the production replicas instead of the wikireplicas
Closed, ResolvedPublic

Description

This is a follow-up from this ops week incident investigation.

We recently changed the view definition for the linktarget table on the wikireplicas. The change was deployed to all wikireplicas, including clouddb1021, which is the special wikireplica used primarily for sqoop.

During the first monthly sqoop after this change, we noticed that sqooping the linktarget tables has become a very time-consuming operation, which we attribute to the additional complexity of computing the view.

We can avoid this by updating the sqoop configuration so that it sqoops this table from the analytics production MariaDB replicas, instead of the sanitized wikireplicas.

Event Timeline

BTullis triaged this task as High priority.Dec 6 2023, 3:53 PM
BTullis moved this task from Incoming to Watching on the Data-Platform-SRE board.
BTullis added a subscriber: JAllemandou.

Sqooping from the production replicas would mean applying the same sanitization rules on our side. I see the filter here is:

where:
exists(
  select 1 from templatelinks
  where tl_target_id = lt_id
) or
exists(
  select 1 from pagelinks
  where pl_target_id = lt_id
)

so just changing the config to sqoop from the linktarget table wouldn't apply this filter. We'd also have to change the query to add this logic, and then we'd have duplicated it. @Ladsgroup, can we get confirmation that this filter is necessary in this view? The table definition doesn't seem to expose any private information in any field, so maybe this was just an attempt at reducing the number of rows that are available? But it seems to have caused an even worse performance situation, so let's discuss.
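To make the filter's semantics concrete, here is a toy sketch using an in-memory SQLite database as a stand-in for MariaDB. The table and column names follow the MediaWiki schema; the sample rows are invented for illustration:

```python
import sqlite3

# Toy reproduction of the wikireplica view's sanitization filter.
# SQLite stands in for MariaDB; the data rows are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE linktarget (lt_id INTEGER PRIMARY KEY, lt_title TEXT);
    CREATE TABLE templatelinks (tl_target_id INTEGER);
    CREATE TABLE pagelinks (pl_target_id INTEGER);

    -- lt_id 1 is referenced by a pagelink, lt_id 2 by a templatelink,
    -- lt_id 3 is orphaned (e.g. the offending link was removed).
    INSERT INTO linktarget VALUES (1, 'Foo'), (2, 'Template:Bar'), (3, 'Removed_PII');
    INSERT INTO pagelinks VALUES (1);
    INSERT INTO templatelinks VALUES (2);
""")

# The view's WHERE clause: only rows still referenced from templatelinks
# or pagelinks survive; orphaned rows (potential PII) are hidden.
rows = cur.execute("""
    SELECT lt_id FROM linktarget
    WHERE EXISTS (SELECT 1 FROM templatelinks WHERE tl_target_id = lt_id)
       OR EXISTS (SELECT 1 FROM pagelinks WHERE pl_target_id = lt_id)
    ORDER BY lt_id
""").fetchall()
print(rows)  # [(1,), (2,)] -- the orphaned row 3 is filtered out
```

The correlated EXISTS subqueries are what make the view expensive on the real tables: each linktarget row triggers lookups against templatelinks and pagelinks, which is cheap here but costly at production scale.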

Repeating from IRC for the sake of documentation:

It's for PII protection: imagine someone adds [[Real name of User:Foo is Foo Bar]] to a wiki; now we have a linktarget row with that text. If the link is then removed, the filter ensures the row won't show up in the wikireplicas.
But also, does the analytics sqoop need only absolutely public info? Isn't the whole thing private?

I guess the filtering in analytics can happen somewhere else. I need to automate the linktarget cleanup as well.

This may sound stupid, but maybe just create a temp table (it's rather small) and sqoop that? Would that be doable? That would basically leave the filtering to MySQL.
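A minimal sketch of that temp-table idea, again with SQLite standing in for MariaDB (the `linktarget_sanitized` name and the sample data are hypothetical): the filter is materialized once on the database side, and sqoop would then read the materialized table instead of evaluating the view row by row.

```python
import sqlite3

# Sketch: materialize the sanitized subset of linktarget into a temp
# table, so a subsequent sqoop reads a plain table rather than paying
# for the EXISTS filter on every row. Names and data are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE linktarget (lt_id INTEGER PRIMARY KEY, lt_title TEXT);
    CREATE TABLE templatelinks (tl_target_id INTEGER);
    CREATE TABLE pagelinks (pl_target_id INTEGER);
    INSERT INTO linktarget VALUES (1, 'Foo'), (2, 'Template:Bar'), (3, 'Orphaned');
    INSERT INTO pagelinks VALUES (1);
    INSERT INTO templatelinks VALUES (2);

    -- Evaluate the sanitization filter once, up front.
    CREATE TEMPORARY TABLE linktarget_sanitized AS
    SELECT * FROM linktarget
    WHERE EXISTS (SELECT 1 FROM templatelinks WHERE tl_target_id = lt_id)
       OR EXISTS (SELECT 1 FROM pagelinks WHERE pl_target_id = lt_id);
""")
count = cur.execute("SELECT COUNT(*) FROM linktarget_sanitized").fetchone()[0]
print(count)  # 2 -- the orphaned row is excluded
```

The trade-off is that the temp table is a snapshot: it has to be rebuilt before each sqoop run, and it consumes space on the replica for the duration of the import.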

Alternatively, just move it to the private one and read from production.

Change 980936 had a related patch set uploaded (by Milimetric; author: Milimetric):

[operations/puppet@production] maintain-views: add note on linktarget sanitization

https://gerrit.wikimedia.org/r/980936

Change 980937 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Sanitize the linktarget sqoop

https://gerrit.wikimedia.org/r/980937

Change 980936 merged by Ladsgroup:

[operations/puppet@production] maintain-views: add note on linktarget sanitization

https://gerrit.wikimedia.org/r/980936

Change 980939 had a related patch set uploaded (by Milimetric; author: Milimetric):

[operations/puppet@production] sqoop: move where we get the linktarget from

https://gerrit.wikimedia.org/r/980939

Change 980939 merged by Ladsgroup:

[operations/puppet@production] sqoop: move where we get the linktarget from

https://gerrit.wikimedia.org/r/980939

Change 981351 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Move linktarget table to the private folder

https://gerrit.wikimedia.org/r/981351

Change 980937 merged by Milimetric:

[analytics/refinery@master] Sanitize the linktarget sqoop

https://gerrit.wikimedia.org/r/980937

Change 981351 merged by Milimetric:

[analytics/refinery@master] Move linktarget table to the private folder

https://gerrit.wikimedia.org/r/981351

xcollazo subscribed.

Assigning this task as per activity and email threads.

The underlying cause: T343198#9391318

Fixing this as we speak.

Hi, if someone is around, please stop the ongoing sqoop so the schema changes can go through.

This is fixed. Thanks Dan for finding the underlying issue.