Add MariaDB replicas to Superset
Closed, Declined (Public)

Assigned To

None

Authored By

	MNeisler
	Sep 16 2021, 3:27 PM

Description

Investigate the feasibility of adding MariaDB replicas to Superset, including identifying any potential performance issues. Providing this access would allow users to query MariaDB tables directly from Superset and create charts that update in real time.

Per discussions in the Product Analytics ⇄ Data Engineering meeting on 15 September 2021, each database shard can be added individually if needed.

Use Cases:

T290516: Request for real-time updated charts of topic subscription status in Superset, based on the discussiontools_subscription table. Additional context is in T287126.

Related Objects

Status	Assigned	Task
Open	None	T276497 Scale DiscussionTools to all projects
Open	None	T251207 [Epic] Scale DiscussionTools to all Wikipedias
Resolved	ppelberg	T284489 Deploy topic subscriptions (desktop) as opt-out feature at all projects
Resolved	ppelberg	T284490 Offer topic subscriptions as a Beta Feature at all projects
Resolved	ppelberg	T287277 Offer automatic topic subscriptions as Beta Feature at all projects
Open	None	T233447 [OKR] Increase response rates
Open	None	T273920 [Interventions] Notifications
Open	None	T263819 [Intervention] Automatic topic subscriptions
Resolved	ppelberg	T282636 [Release Ticket] Offer Automatic Topic Subscriptions as Beta Feature at Initial Wikis
Resolved	ppelberg	T290516 Enable topic subscription dashboard to update regularly
Declined	None	T291195 Add MariaDB replicas to Superset

Event Timeline

MNeisler created this task. Sep 16 2021, 3:27 PM

MNeisler mentioned this in T290516: Enable topic subscription dashboard to update regularly. Sep 16 2021, 3:36 PM

elukey subscribed. Sep 16 2021, 3:44 PM

We talked about this in the meeting, but good reminder from @elukey that Superset is accessible by everyone with an LDAP NDA account, so we should openly discuss whether that's sufficient to access PII in the production replicas. For context, we are indeed talking about dbstore100[3-5]. He also suggests that we add 2FA to Superset, and I agree. I forget if we talked about making YubiKeys necessary for access; shall we just do that?

Without 2FA, any account with a weak password is a risk. I would prefer to use clouddb1020 if possible, since it contains only sanitized data.

@MNeisler can you please add the use case(s) that this is requested for, so that we can explore other possible solutions?

odimitrijevic triaged this task as Medium priority. Sep 16 2021, 4:45 PM

@odimitrijevic - Sure, I've updated the task description with one current open request from the Editing team. I'll reach out to my team to see if there are others that would be useful to include.

@MNeisler Hi :) I'd be very interested to know if any table with PII data will be needed, or if something like the MariaDB sanitized replicas could be fine for your use case (see https://wikitech.wikimedia.org/wiki/MariaDB/Sanitarium_and_Labsdbs for more info).

nshahquinn-wmf subscribed. Sep 16 2021, 6:18 PM

ppelberg added a project: Editing-team (Tracking). Sep 16 2021, 7:39 PM

ppelberg updated the task description.

ppelberg subscribed.

ppelberg added a parent task: T290516: Enable topic subscription dashboard to update regularly. Sep 16 2021, 7:43 PM

@elukey: Megan's most pressing use case is the discussiontools_subscription table, which is in frwiki but not frwiki_p.

@Milimetric: do you have a sense for whether it would be realistic for the Editing-team to think y'all have the capacity to implement what this task is describing in the next two weeks? [i]

I ask the above wondering whether it would be worthwhile for us – Editing / Product Analytics – to implement an interim solution (T290516) so that we can have data that is as close to real time as possible about how people are using Topic Subscriptions before Automatic Topic Subscriptions are made available (T282636).

We think the real-time nature of this data is especially important in this context because we think there is a possible scenario wherein Automatic Topic Subscriptions generate more notifications than people have the tools to manage themselves at this stage. Thus, we'd like to be able to detect said scenario and intervene quickly should it arise.

Note: if there is a better person/place to ask about prioritization and capacity please let me know! My instinct was to review wikitech:Analytics Engineering#Prioritization and when I didn't see anything there, I thought I'd stop by here to comment.

i. Of course, if there is any information we can provide that could help make answering this question more straightforward, please let me know.

In T291195#7363033, @mpopov wrote:

@elukey: Megan's most pressing use case is the discussiontools_subscription table, which is in frwiki but not frwiki_p.

This is useful. I think we could satisfy this use case easier/faster in a direct way, doing something like sqooping incrementally very often. Otherwise, if we made a general connection to the replicas, we'd open up too big a hole into tables like user, with emails, passwords, and other troubles.
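To make the "import incrementally, very often" idea concrete, here is a minimal sketch of a watermark-based incremental copy. It is only an illustration under stated assumptions: the table name `subs`, its columns, and the helper functions are hypothetical, and in-memory SQLite stands in for both the source replica and the analytics-side copy. It is similar in spirit to Sqoop's incremental import mode, not a reproduction of any actual job.

```python
import sqlite3

# Hypothetical schema; the real table's columns are not described in this task.
SCHEMA = "CREATE TABLE subs (id INTEGER PRIMARY KEY, state INTEGER, updated_at INTEGER)"


def make_db(rows=()):
    """Create an in-memory database standing in for a replica or its copy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO subs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def incremental_copy(src, dst, watermark):
    """Copy only rows newer than the watermark and return the new watermark."""
    new_rows = src.execute(
        "SELECT id, state, updated_at FROM subs "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Upsert so that re-running a batch does not create duplicates.
    dst.executemany("INSERT OR REPLACE INTO subs VALUES (?, ?, ?)", new_rows)
    dst.commit()
    return new_rows[-1][2] if new_rows else watermark


source = make_db([(1, 1, 100), (2, 1, 110)])
copy = make_db()

mark = incremental_copy(source, copy, 0)     # first run copies both rows
source.execute("INSERT INTO subs VALUES (3, 0, 120)")
source.commit()
mark = incremental_copy(source, copy, mark)  # second run copies only the new row

copied = copy.execute("SELECT COUNT(*) FROM subs").fetchone()[0]
```

Because the job names one table and an explicit column list, nothing from tables like user is ever exposed, which is the advantage over a general connection to the replicas.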

In T291195#7363517, @ppelberg wrote:

@Milimetric: do you have a sense for whether it would be realistic for the Editing-team to think y'all have the capacity to implement what this task is describing in the next two weeks? [i]

I ask the above wondering whether it would be worthwhile for us – Editing / Product Analytics – to implement an interim solution (T290516) so that we can have data that is as close to real time as possible about how people are using Topic Subscriptions before Automatic Topic Subscriptions are made available (T282636).

I think we could and should collaborate on a solution, but we don't have bandwidth to drive this right now.

We think the real-time nature of this data is especially important in this context because we think there is a possible scenario wherein Automatic Topic Subscriptions generate more notifications than people have the tools to manage themselves at this stage. Thus, we'd like to be able to detect said scenario and intervene quickly should it arise.

We should talk more about this, because there may be other solutions, like ways for people to let you know about this problem directly. In any case, let's make time and discuss, feel free to schedule a meeting and we'll bring this up at our team meeting today (in a few minutes).

matmarex subscribed. Sep 27 2021, 8:06 PM

ldelench_wmf moved this task from Triage to Tracking on the Product-Analytics board. Sep 28 2021, 5:23 PM

In T291195#7380492, @Milimetric wrote:

In T291195#7363517, @ppelberg wrote:

@Milimetric: do you have a sense for whether it would be realistic for the Editing-team to think y'all have the capacity to implement what this task is describing in the next two weeks? [i]

I ask the above wondering whether it would be worthwhile for us – Editing / Product Analytics – to implement an interim solution (T290516) so that we can have data that is as close to real time as possible about how people are using Topic Subscriptions before Automatic Topic Subscriptions are made available (T282636).

I think we could and should collaborate on a solution, but we don't have bandwidth to drive this right now.

Sounds great and understood about bandwidth. In the interim, can you see any risks with us moving forward with the job approach T290516 is proposing?

We think the real-time nature of this data is especially important in this context because we think there is a possible scenario wherein Automatic Topic Subscriptions generate more notifications than people have the tools to manage themselves at this stage. Thus, we'd like to be able to detect said scenario and intervene quickly should it arise.

We should talk more about this, because there may be other solutions, like ways for people to let you know about this problem directly. In any case, let's make time and discuss, feel free to schedule a meeting and we'll bring this up at our team meeting today (in a few minutes).

Great call and agreed. We'll be proactively seeking qualitative feedback about this feature as well.

But yes, talking more sounds like a good idea. I've proposed a time for you, @MNeisler, and me to talk on 19 October.

Looking forward to the meeting. Until then the team's general consensus is that we can't drive this, it just won't fit anywhere with more and more stuff falling on our plate. I haven't read T290516 carefully but experimentation is encouraged. Worst case is we'll have to migrate later when we have a more general solution, but do what you need now. And we can brainstorm more here and during our meeting.

In T291195#7396005, @Milimetric wrote:

Looking forward to the meeting. Until then the team's general consensus is that we can't drive this, it just won't fit anywhere with more and more stuff falling on our plate.

Understood and I appreciate you being explicit about this :)

I haven't read T290516 carefully but experimentation is encouraged. Worst case is we'll have to migrate later when we have a more general solution, but do what you need now. And we can brainstorm more here and during our meeting.

This sounds great; thank you, @Milimetric.

Declining this to reflect the discussion above. The use cases will be handled separately for now. I'm helping Megan with the one we discussed above.
