Page MenuHomePhabricator

Add MariaDB replicas to Superset
Closed, DeclinedPublic

Description

Investigate the feasibility of adding MariaDB replicas to Superset including identifying any potential performance issues. Providing this access would allow users to directly query database tables in MariaDB directly from Superset and create charts updated in real-time.

Per discussions in the Product Analytics ⇄ Data Engineering meeting on 15 September 2021, each database shard can be added individually if needed.

Use Cases:

Event Timeline

We talked about this in the meeting, but good reminder from @elukey that Superset is accessible by everyone with an LDAP nda account, so we should openly discuss whether that's sufficient to access PII in the production replicas. For context, we are indeed talking about dbstore100[3-5]. He also suggests that we add 2FA to Superset, and I agree. I forget if we talked about making yubi keys necessary for access, shall we just do that?

Without 2FA anybody with a weak password can be problematic, I would prefer to use clouddb1020 since it contains only sanitized data (if possible).

odimitrijevic subscribed.

@MNeisler can you please add the use case(s) that this is requested for so that we can explore other possible solutions.

@odimitrijevic - Sure, I've updated the task subscription with one current open request from the Editing team. I'll reach out to my team to see if there are others that would be useful to include.

@MNeisler Hi :) I'd be very interested to know if any table with PII data will be needed, or if something like the mariadb sanitized replicas could be fine for your use case (https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs for more info)

@elukey: Megan's most pressing use case is discussiontools_subcription table, which is in frwiki but not frwiki_p.

@Milimetric: do you have a sense for whether it would be realistic for the Editing-team to think y'all have the capacity to implement what this task is describing in the next two weeks? [i]

I ask the above wondering whether it would be worthwhile for us – Editing / Product Analytics – to implement an interim solution (T290516) so that we can have as close-to-real-time data about how people are using Topic Subscriptions as possible before Automatic Topic Subscriptions mare made available (T282636).

We think the real-time nature of this data is especially important in this context because we think there is a possible scenario wherein Automatic Topic Subscriptions generate more notifications than people have the tools to manage themselves at this stage. Thus, we'd like to be able to detect said scenario and intervene quickly should it arise.


Note: if there is a better person/place to ask about prioritization and capacity please let me know! My instinct was to review wikitech:Analytics Engineering#Prioritization and when I didn't see anything there, I thought I'd stop by here to comment.


i. Of course, if there is any information we can provide that could help make answering this question more straightforward, please let me know.

@elukey: Megan's most pressing use case is discussiontools_subcription table, which is in frwiki but not frwiki_p.

This is useful. I think we could satisfy this use case easier/faster in a direct way, doing something like sqooping incrementally very often. Otherwise, if we made a general connection to the replicas, we'd open up too big a hole into tables like user, with emails, passwords, and other troubles.

@Milimetric: do you have a sense for whether it would be realistic for the Editing-team to think y'all have the capacity to implement what this task is describing in the next two weeks? [i]

I ask the above wondering whether it would be worthwhile for us – Editing / Product Analytics – to implement an interim solution (T290516) so that we can have as close-to-real-time data about how people are using Topic Subscriptions as possible before Automatic Topic Subscriptions mare made available (T282636).

I think we could and should collaborate on a solution, but we don't have bandwidth to drive this right now.

We think the real-time nature of this data is especially important in this context because we think there is a possible scenario wherein Automatic Topic Subscriptions generate more notifications than people have the tools to manage themselves at this stage. Thus, we'd like to be able to detect said scenario and intervene quickly should it arise.

We should talk more about this, because there may be other solutions, like ways for people to let you know about this problem directly. In any case, let's make time and discuss, feel free to schedule a meeting and we'll bring this up at our team meeting today (in a few minutes).

@Milimetric: do you have a sense for whether it would be realistic for the Editing-team to think y'all have the capacity to implement what this task is describing in the next two weeks? [i]

I ask the above wondering whether it would be worthwhile for us – Editing / Product Analytics – to implement an interim solution (T290516) so that we can have as close-to-real-time data about how people are using Topic Subscriptions as possible before Automatic Topic Subscriptions mare made available (T282636).

I think we could and should collaborate on a solution, but we don't have bandwidth to drive this right now.

Sounds great and understood about bandwidth. In the interim, can you see any risks with us moving forward with the job approach T290516 is proposing?

We think the real-time nature of this data is especially important in this context because we think there is a possible scenario wherein Automatic Topic Subscriptions generate more notifications than people have the tools to manage themselves at this stage. Thus, we'd like to be able to detect said scenario and intervene quickly should it arise.

We should talk more about this, because there may be other solutions, like ways for people to let you know about this problem directly. In any case, let's make time and discuss, feel free to schedule a meeting and we'll bring this up at our team meeting today (in a few minutes).

Great call and agreed. We'll be proactively seeking qualitative feedback about this feature as well.

But yes, talking more sounds like a good idea. I've proposed a time for you, @MNeisler, and I to talk on 19-October.

Looking forward to the meeting. Until then the team's general consensus is that we can't drive this, it just won't fit anywhere with more and more stuff falling on our plate. I haven't read T290516 carefully but experimentation is encouraged. Worst case is we'll have to migrate later when we have a more general solution, but do what you need now. And we can brainstorm more here and during our meeting.

Looking forward to the meeting. Until then the team's general consensus is that we can't drive this, it just won't fit anywhere with more and more stuff falling on our plate.

Understood and I appreciate you being explicit about this :)

I haven't read T290516 carefully but experimentation is encouraged. Worst case is we'll have to migrate later when we have a more general solution, but do what you need now. And we can brainstorm more here and during our meeting.

This sounds great; thank you, @Milimetric.

Declining this to reflect the discussion above. The use cases will be handled separately for now. I'm helping Megan with the one we discussed above.