Page MenuHomePhabricator

Enable topic subscription dashboard to update regularly
Closed, ResolvedPublic

Description

T287126 extended the Talk Pages Project Superset Dashboard to include the information about topic subscriptions listed below. This data comes from the discussiontools_subscription table and currently requires manual updates. This task represents the work with iterating upon the Talk Pages Project Superset Dashboard to make it so the data shown within the Topic Notifications tab updates is automatically updated each time someone views the dashboard.

Requirements

The requirements below were finalized in the 19 October meeting between @Milimetric, @MNeisler, and @ppelberg.

  • The Topic Notifications tab within the Talk Pages Project Superset Dashboard [i] is automatically updated hourly
  • All data within the discussions_subscriptions tables should be "replicated" (right word?) within Hadoop
  • discussions_subscriptions from all Wikimedia wikis should be "replicated" within Hadoop, and subsequently be made available within Superset

Open questions

  • 1. How might we ensure that the data presented in Superset remains accurate while Superset is attempting into incorporate new data? Asked another way: How can Superset atomically read while Sqoop writes to the data behind the superset dashboard?

Done

  • The answers to all ===Open questions are documented within this ticket
  • The data within the Topic Notifications tab within the Talk Pages Project Superset Dashboard is automatically updated hourly.

Event Timeline

  • Frequency of Updates These charts are unable to be updated in real-time as the data source for these charts comes from a query that currently needs to be manually run and then uploaded into Superset. This process does not take long but will require some planning and discussion on the needed frequency of updates since it's not automated. If needed frequently, a job scheduler can eventually be set-up to automate this process.

Note: The current query and notebook require some reformatting so that it can be automated and run by a job scheduler. I can update this ticket with the updated query once complete.

ppelberg renamed this task from Enable topic subscription dashboard to update in real-time to Enable topic subscription dashboard to update regularly .Sep 9 2021, 9:23 PM
ppelberg updated the task description. (Show Details)

Analytics Engineering is looking into the feasibility of adding access to the MariaDB replicas in Superset. This would allow us to directly query the discussiontools_subscription database table within Superset and create a chart that is updated in real-time.

We will decide whether we need to create a job scheduler to update the topic subscriptions charts depending on the estimated timeline and confirmation of feasibility in T291195.

Meta: per T291195#7396005, Analytics Engineering is fine with us implementing what this ticket is describing in the time between now and when work on T291195 can be prioritized.

ppelberg updated the task description. (Show Details)
ppelberg added a subscriber: Milimetric.

(@Milimetric + @MNeisler: I've updated the task description with the ===Requirements and ===Open questions we talked about during today's meeting; please comment here if see you anything unexpected.)

Just a quick note that I remember you saying you needed this in one or two weeks, but I haven't seen a patch yet (I'm super sorry if you sent it somewhere and I've missed it). Since it's already been a week, it would be good to get something out sooner than later, even if it's not perfect, so we can start to plug it in.

@Milimetric Thanks for following up on this! My work on this was delayed due to a couple of other priorities that came up. I plan on working on this today/tomorrow and hope to have a patch for you soon to review. Regarding timing, we'd still like to ideally have something working by the end of next week but we have some leeway if that's not possible given the short turnaround now.

Also, do you mind resending me the sqoop file for me to modify for this patch? Thank you!

Also, do you mind resending me the sqoop file for me to modify for this patch? Thank you!

No problem, this is the dictionary where we define the queries that our sqoop job can use to pull different tables:

https://gerrit.wikimedia.org/g/analytics/refinery/+/0d79e188649f93f7b9228a95d93ae873ae0f685b/python/refinery/sqoop.py#234

If you add an entry there for your table, I can help you test it. Once it works, we can merge and deploy, and then I'll set up a job in puppet that will run your query hourly. We'll just overwrite everything for now and see how that works. If it makes the dashboard perform poorly, we can change it to an incremental query, but for that we'd have to change some of our sqoop framework since this is the only table that would work like that.

MNeisler moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Change 736021 had a related patch set uploaded (by MNeisler; author: MNeisler):

[analytics/refinery@master] Add discussiontools_subscription query to sqoop

https://gerrit.wikimedia.org/r/736021

@Milimetric I took a first attempt at adding an entry for the discussiontools_subscription table in the sqoop dictionary. Here's a link to the patch for review: https://gerrit.wikimedia.org/r/736021

Let me know if you have any questions or suggested changes. I used the other entries as examples in how to format the query but let me know if there are any adjustments that need to be made to meet the requirements of this job (As mentioned in the task description, we'll need the query to be run hourly, across all Wikis, and need all data within the discussiontools_subscription table). Thank you!

Change 736021 merged by Milimetric:

[analytics/refinery@master] Add discussiontools_subscription query to sqoop

https://gerrit.wikimedia.org/r/736021

A one-off job is executing with @MNeisler's query. I put milimetric.mediawiki_private_discussiontools_subscription on top of the data just for now so we can test. If everything's ok, I'll create a puppet job that pulls the data on a daily basis. If that all works great and we want to try hourly we can do that too.

Confirming I tested the one-off job was able to easily connect to, query and explore the data in Superset. @Milimetric is going to create a puppet job to pull the data on a daily basis.

Note: The job will be set up so that the new data should overwrite the old data. Once ready, we'll plan to test how this affects the dashboard.

Change 739922 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Create discussiontools_subscription table in Hive

https://gerrit.wikimedia.org/r/739922

Change 739923 had a related patch set uploaded (by Milimetric; author: Milimetric):

[operations/puppet@production] analytics/systemd/sqoop: Add daily sqoop

https://gerrit.wikimedia.org/r/739923

The last two patches should be everything we need to get this running daily.

Change 739922 merged by Mforns:

[analytics/refinery@master] Create discussiontools_subscription table in Hive

https://gerrit.wikimedia.org/r/739922

Thanks for submitting the patches @Milimetric . One note - Given the anticipated unavailability of data while the job is running, I'd recommend scheduling the daily job for around midnight east coast time to avoid disruption to the primary anticipated users of the dashboard.

Change 739923 merged by Ottomata:

[operations/puppet@production] analytics/systemd/sqoop: Add daily sqoop

https://gerrit.wikimedia.org/r/739923

Ok, with the job merged and the table created, this is good to go for our initial experiment. The job will run nightly and get fresh data. This should be accessible in the wmf_raw.mediawiki_private_discussiontools_subscription table. Whenever there's a new wiki, we need to call msck repair table mediawiki_private_discussiontools_subscription to add that new partition. Remember when querying to either specify a wiki partition as in wiki_db='enwiki' or the general snapshot='latest' when querying the whole table.

Thanks @Milimetric!

I've updated the topic subscriptions dashboard to pull data from the wmf_raw.mediawiki_private_discussiontools_subscription table and added a note that the dashboard will be inaccessible for one hour during the daily updates scheduled at 05:00:00 UTC daily. I'll recheck it tomorrow just to make sure everything went ok following the update and then we should be able to resolve this task.

I rechecked the dashboard this morning and confirmed it was updated appropriately. All looks good.

@ppelberg - Reassigning to you for sign-off. Note: The Topic Notifications charts may take a little while to load. If we run into any timeout issues as the dataset grows, I can work on making some adjustments to the query and/or charts to help reduce the load time.

I rechecked the dashboard this morning and confirmed it was updated appropriately. All looks good.

Excellent! I'm already putting the updated dashboard to use: T296048#7541914

@ppelberg - Reassigning to you for sign-off. Note: The Topic Notifications charts may take a little while to load. If we run into any timeout issues as the dataset grows, I can work on making some adjustments to the query and/or charts to help reduce the load time.

Sounds great to me, @MNeisler. I figure we can file a ticket for making the improvements you described above if/when we encounter a scenario that demands it.