
Investigate pulling in page view data to the fr-tech version of superset
Open, Needs Triage, Public

Description

@EYener has use cases here:
https://docs.google.com/document/d/1Y5uyYJqFZuDBi2GcmFyRZpg_8UAmArYTxxg-QO9BquE/edit#

The top section about pageview data is what to pay attention to.

Event Timeline

DStrine renamed this task from "Investigate pulling in page view data to our version of superset" to "Investigate pulling in page view data to the fr-tech version of superset". May 6 2020, 5:53 PM

@Jgreen pointed out that we should be able to connect fr-superset to Hive without pulling data into Fundraising from the data lake itself - this might be a great option

Can someone describe the use case?

> pointed out that we should be able to connect fr-superset to Hive without pulling data into Fundraising from the data lake itself - this might be a great option

This does not work for large datasets; superset will time out. The consumable pageview data on superset lives on Druid, not Hive. Now, before getting into technical solutions, let's please describe the use case on the ticket: what is it that we are trying to achieve?

> Can someone describe the use case?
>
> pointed out that we should be able to connect fr-superset to Hive without pulling data into Fundraising from the data lake itself - this might be a great option
>
> This does not work for large datasets; superset will time out. The consumable pageview data on superset lives on Druid, not Hive. Now, before getting into technical solutions, let's please describe the use case on the ticket: what is it that we are trying to achieve?

I don't know about the use case question so I will leave that for others. But to clarify, this was out of a discussion about importing/exporting data to/from Fundraising. I suggested that it may be feasible to consume production Analytics data sources directly from the Fundraising Superset instance, rather than import data into a separate Fundraising data store.

> I suggested that it may be feasible to consume production Analytics data sources directly from the Fundraising Superset instance, rather than import data into a separate Fundraising data store.

Let's look at use cases, because in the absence of those I really see no benefit in surfacing the data in dashboards that duplicate the many available (and for the most part well maintained) dashboards that already exist.

@Nuria Is there documentation for us to understand the effort required to pull data from Druid?

For use cases: the Advancement analytics team will need to get page view data into the fundraising version of superset for a bunch of different visualizations. It might be hard to list all possibilities at this time.

> For use cases: the Advancement analytics team will need to get page view data into the fundraising version of superset for a bunch of different visualizations. It might be hard to list all possibilities at this time.

Let me understand: what prevents us from doing those visualizations in the current instance of superset (https://superset.wikimedia.org)? Superset is just a UI, and anyone from FR has access to it.

We can't pull any fundraising data into the regular version of superset. That sort of project, if it is at all possible, might be a long way off and the analytics team would be blocked for far too long. It's easier to pull data into our version for now.

> It's easier to pull data into our version for now.

Let me explain: pulling data from the analytics domain requires infra that FR (I think) lacks, for example, kerberos. So issue #1: data is secured via kerberos access for all datasources. The superset instance would need to be retrofitted to authenticate using kerberos for all FR users that have access to it.

Issue #2: network domains, vlans and firewalls; the analytics domain is not the prod domain (how easy it is to work around those I do not know, as I know little about how the FR network is set up).

Issue #3: we update superset and druid together so they play well (while superset is custom-made for druid, it does not automagically play well with any one version). With this in mind, and given your use cases are around "building dashboards with pageview data", I think it would be a lot easier to create those dashboards on the analytics instance of superset.

There is nothing additional that you would be able to do with that data in the FR superset instance as you cannot cross join data sources on the UI with this type of access.

Hopefully this makes sense, otherwise I can try to explain further.

@Jgreen can probably speak of engineering efforts of #1 and #2

Some links:

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide

https://superset.incubator.apache.org/druid.html

Also please see ::profile::kerberos::client on puppet

> @Jgreen can probably speak of engineering efforts of #1 and #2

#1 For now the fundraising superset instance is secured with a 2FA combination using SSL client certificates plus superset's mysql backed passwords. I have not looked at integrating it with our existing kerberos service or with the production one, but I'll take a look at these docs and production puppet to see if I can figure out what would be required.

#2 Should just be a matter of firewall policy on both sides. Otherwise, traffic is already routeable between these WMF private subnets via the core routers.

#3 We're using the production superset deployment project/branch and we're planning to keep the FR instance as close as possible to the production one. Does that mean we're already getting the coordinated updates you're referring to?

#3.5 Regarding where dashboards get created, one of the use case issues we're trying to accommodate is a unified user experience, rather than having people jump around to different interfaces handled by different teams to get the reports they need. At the same time we're trying to avoid data replication and inconsistency. This is why I suggested the possibility of viewing production data via the fundraising superset instance. As you said, it doesn't solve the issue of creating reports that merge data from production and fundraising, and it doesn't resolve the issue of which data can be stored in different places, but it would keep fundraising reports in one place rather than fragmenting that between two.

Also, I would encourage the FR team to explore making non PCI data available in the analytics environment. That data can have restricted access. Access to hive does not have to mean access to all datasets.

@Jgreen sounds good, let us know of your findings.

> Also, I would encourage the FR team to explore making non PCI data available in the analytics environment. That data can have restricted access. Access to hive does not have to mean access to all datasets.

@Nuria does Analytics have any plans for 2FA at this point?

@Jgreen yes, it is a project we will do with the main SRE team next year, but 2FA will not substitute for kerberos; it's an add-on to the LDAP-based auth, similar to how it works on wikitech.

Thanks all for the info. This is just an investigation and I appreciate all the work to understand the effort involved in this.

I want to recognize the interest in avoiding duplication of effort or increasing complexity, and I want to make sure everyone feels that feedback is always welcome. However, the topic of donor data being displayed in the regular superset is being discussed at a very high level at this point and will take a long time to resolve.

To the point of maintenance of the fundraising superset: Fr-tech has already understood that we will be incurring maintenance cost but if we can do this better or if we are missing something, let's spin off other tasks.

I am interested in finding ways to unblock Advancement analytics as quickly as possible. I am convinced that the fastest way to do that is to try pulling data into the fundraising version of superset. Let's use this space to discuss the effort for that work. Thanks all!

Related tasks/projects on authentication to keep in mind as we explore options here: T251890 and T233921.

After looking at how Analytics uses ldap/krb5 now and considering their plan to switch to CAS/SSO next year, I think it makes sense to hold off until CAS/SSO is ready.

To be super clear we will still use kerberos, just SSO will be added at the UI layer, hopefully this makes sense.

To add to the context and background on this: it is often important in Fundraising to report on impression rate (impressions / pageviews) for a particular campaign, to ensure that there are no technical issues during campaigns. Viewing this metric for a live campaign alongside other health metrics in a Fundraising dashboard would be of great value to our Creative team. @AndyRussG had a great example of monitoring impression rates by country during our last Big English - and we were just discussing this use case today - so I will let him discuss in more detail.

Hi all! Thanks for all the info @EYener, @Nuria, @Jgreen and @DStrine!

Impression rates have mostly been surfaced by combining data from two existing Druid datasets, banner_activity_minutely and pageviews_hourly.

Queries are normally run using the centralnotice_analytics library inside a Jupyter notebook. The library makes it easy to set up queries for user segments targeted by specific CentralNotice campaigns. Basically, it abstracts away the details of filtering the two datasets.

The library was written with the intention that it could also run on a backend to generate dashboards on impression rates.

If we'd like to try that, we could set up a job to run either on the FR cluster or the Analytics cluster, and then surface the combined data somehow. My first choice for that might be the Analytics cluster, because that way it'd be easier for non-fundraising campaign users to access. There'd be some adaptation of the library needed, though, and some thought would have to go into just how to set things up.

So, maybe we should create a new task for the specific use case of impression rate dashboards?

What are the other possible use cases? Maybe we should set up separate tasks for those use cases too?

(I can think of many possible interesting ways to join FR data and pageview data, though I imagine most would be deep-dive investigations using full Hive data.)

For providing a unified dashboard experience to FR Creative, if we just want to surface existing Analytics Druid-based datasets, maybe an option would be a front-end-only solution, like a single page with a couple of iframes?

Thanks again!! :)

Just to clarify, even given the above, it does seem like a good idea that we be able to query Druid datasets from the FR cluster... Also very understandable that we'd want a detailed use case to justify the work involved...

> My first choice for that might be the Analytics cluster, because that way it'd be easier for non-fundraising campaign users to access.

Agreed; given that 100% of the data for the report at hand is on the analytics cluster, there seems to be little benefit to building the reporting anywhere else.

> maybe we should create a new task for the specific use case of impression rate dashboards?

+1, that work does not seem related to the premise of this ticket

@Nuria, the approach we're taking is that we want to move data into Advancement's version of superset. Can you please designate someone to work with Jeff to pull the relevant data from Druid to make that happen? Thanks.

@EBjune :

@Jgreen is already working on that on this ticket: {T251890}. There is no action needed on analytics, as all the work is on the FR infra to set up kerberos (strong authentication layer for data access).

@Nuria @EBjune: To clarify, as I understand it Advancement and FR Tech have been talking about three possibilities.

  1. Push Advancement data to a data store in the Analytics cluster so it can be synthesized with production data, and build reports in Analytics tools including Superset. [no investigation tasked yet]
  2. Pull production data into an Advancement data store so it can be synthesized there with Advancement data, and build reports in the Advancement Superset instance. [T253050]
  3. As a stopgap/interim idea we've talked about connecting Advancement Superset to production/Druid as an additional data source. [T252049 - this task] This does not permit data synthesis but does put all the Advancement reports in one place. After some investigation I recommended we stall on this until integrating with SRE's future SSO is an option, some time next year.

@EBjune please correct me if I'm wrong, but I believe you were talking about #1? If that's the case can we continue this on T253050?

This comment was removed by Jgreen.

Can someone give me permission to see this ticket: T253050: Bring Banner History data into Fundraising infrastructure?

Done! (Not sure why it's private...)

@Nuria you should be able to see that task now. I gave you access to private FR-tech tasks

@Jgreen #1 is not something we have the capacity to discuss at least until Q3 or later. So for the foreseeable future no donor data will leave our space. Thus we need to unblock fundraising analytics through other means.

#2 and #3: the task about banner history is different from a product perspective. If it takes the same amount of effort to pull both data sets in, that's great. However pageview data is slightly higher priority.

@Nuria I just spoke with @Jgreen and he doesn't understand how T251890 will help our superset pull in data. Can you explain or point us to documentation?

This might warrant a quick meeting to clear up these 2 requests and get a general understanding of the setup in analytics engineering.

> #2 and #3: the task about banner history is different from a product perspective. If it takes the same amount of effort to pull both data sets in, that's great. However pageview data is slightly higher priority.

Maybe we could set up a meeting to clarify? Pulling data and connecting FR superset to druid/presto are two different tasks: you do not need to "pull data" to connect the FR instance of superset to presto/druid. The data is not moving; it is hosted in the analytics druid/presto storages, and the UI just connects to those storages. Makes sense?

Connecting superset to druid/presto in the analytics infrastructure requires setting up a kerberos client that fr-superset has access to. This was discussed earlier on this ticket by @Jgreen and myself (T252049#6117295, T252049#6118724), and I thought it was part of the work on {T251890}. Sorry, I misunderstood.

Pertinent puppet code:
https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/analytics_cluster/superset.pp

@Jgreen

> @EBjune please correct me if I'm wrong, but I believe you were talking about #1? If that's the case can we continue this on T253050?

No, I'm actually thinking about #2, but just for pageview data to get us started while we figure out the bigger-picture approach.

> Maybe we could set up a meeting to clarify? Pulling data and connecting FR superset to druid/presto are two different tasks: you do not need to "pull data" to connect the FR instance of superset to presto/druid. The data is not moving; it is hosted in the analytics druid/presto storages, and the UI just connects to those storages. Makes sense?

A meeting would be good!

@Nuria Yes, this makes sense. I think I understand what it would take to connect FR superset to druid/presto. Could you help us understand the other option, setting up pipelines for "banner history" and for "page views" to a data store in Advancement?

> @Jgreen
>
> > @EBjune please correct me if I'm wrong, but I believe you were talking about #1? If that's the case can we continue this on T253050?
>
> No, I'm actually thinking about #2, but just for pageview data to get us started while we figure out the bigger-picture approach.

Oh ha, that's actually what I thought you meant but I got 1 and 2 confused! Ok thanks!

> Could you help us understand the other option, setting up pipelines for "banner history" and for "page views" to a data store in Advancement?

Banner history is a small dataset; to "pull" it you need (again) a kerberos client and the ability to read it. Sqoop could be an option there.

"pageviews" (so-called pageview_hourly) is a gigantic dataset. To be honest, I do not see how you would pull that data anywhere in the FR cluster in a way that would be query-able, unless you want to run your own instance of druid or hadoop. We get as many as 6000 pageviews per second, and the data can occupy as much as 1G per hour (unreplicated). Superset would not be able to query that data unless it is on a columnar datastore similar to druid, or an ultrafast datastore for large scale like clickhouse. Even if you just want to query a few months, you would need a tailored storage solution to be able to query the data somewhat fast.
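
To put those figures in perspective, here is a quick back-of-envelope calculation using only the rates quoted above (both figures are approximate, as stated):

```python
# Back-of-envelope scale check, using only the figures quoted above:
# ~6000 pageviews per second and ~1 GB per hour, unreplicated.
PAGEVIEWS_PER_SECOND = 6000
GB_PER_HOUR = 1

rows_per_day = PAGEVIEWS_PER_SECOND * 60 * 60 * 24
gb_per_year = GB_PER_HOUR * 24 * 365

print(f"{rows_per_day:,} raw pageview rows per day")   # 518,400,000
print(f"~{gb_per_year:,} GB per year, unreplicated")   # ~8,760 GB
```

Half a billion rows a day is far beyond what a conventional relational store behind Superset could scan interactively, which is why a columnar store like druid matters here.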

> Could you help us understand the other option, setting up pipelines for "banner history" and for "page views" to a data store in Advancement?
>
> Banner history is a small dataset; to "pull" it you need (again) a kerberos client and the ability to read it. Sqoop could be an option there.

We would use a kerberos client for an automated system process? Or are you talking about a user logging in to download the data manually?

> "pageviews" (so-called pageview_hourly) is a gigantic dataset. To be honest, I do not see how you would pull that data anywhere in the FR cluster in a way that would be query-able, unless you want to run your own instance of druid or hadoop. We get as many as 6000 pageviews per second, and the data can occupy as much as 1G per hour (unreplicated). Superset would not be able to query that data unless it is on a columnar datastore similar to druid, or an ultrafast datastore for large scale like clickhouse. Even if you just want to query a few months, you would need a tailored storage solution to be able to query the data somewhat fast.

Ok, I'll need more info from FR analytics folks on exactly what they mean here--clearly importing the raw dataset would not be feasible.

> We would use a kerberos client for an automated system process? Or are you talking about a user logging in to download the data manually?

It depends on how frequently the data is updated. If you want just a one-off (data you are not going to update), you can do a one-off Sqoop import with your kerberos user. If the banner data is going to be used for anything that is somewhat real-timeish, you of course need kerberos-aware automation. We can continue this on T253050: Bring Banner History data into Fundraising infrastructure. The historic banner dataset is small, a few gigs (parquet compressed).
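
For the "kerberos-aware automation" case, the usual pattern is a scheduled job that first obtains a ticket non-interactively from a keytab and then does its import work. A minimal sketch (the keytab path and principal below are hypothetical placeholders, not real FR credentials):

```python
import subprocess

# Hypothetical keytab and principal -- placeholders, not real FR credentials.
KEYTAB = "/etc/security/keytabs/fr-analytics.keytab"
PRINCIPAL = "fr-analytics@WIKIMEDIA"

def kinit_command(keytab, principal):
    """Build the kinit invocation that obtains a ticket from a keytab."""
    return ["kinit", "-kt", keytab, principal]

def ensure_ticket():
    """Obtain (or renew) a Kerberos ticket before the scheduled job runs."""
    subprocess.run(kinit_command(KEYTAB, PRINCIPAL), check=True)

# A cron-driven job would call ensure_ticket() first, then run its
# Sqoop import or Hive query with the resulting credentials.
```

This is the difference Nuria is pointing at: a one-off pull can ride on a human's interactive `kinit`, while anything recurring needs its own keytab-based identity.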

Here's a task with a detailed proposal for how to create impression rate dashboards (following the idea described above): T254792.

I think this is the main use case folks are interested in?

Thanks!!!

@Nuria, thank you so so much for the explanations here, hugely appreciated!!!! Thanks also @Jgreen, @DStrine, @EBjune and @EYener for digging in!

Following are a few more thoughts on this... Really hope they're useful!

  • The pageview data for the production cluster is not only vast, but also extremely complex. In addition to infrastructure, Analytics invests a huge amount of time and energy in curating this data. This includes not only the raw data in Hive, but also the aggregated data in Druid. They do an absolutely fantastic job! This incredible resource is used in countless projects. Thank you all so so much for all this work.
  • For research and analytics, it is important to have a single canonical source of data, with agreed-upon, consistent definitions and processes.
  • The above are additional reasons for not duplicating pageview data storage on the FR cluster. That is, if we did so, in addition to duplicating infrastructure and maintenance work, we would also end up duplicating curation effort, and would be forking a fundamental data source, complicating comparisons and collaboration in research.
  • The above holds not only for any attempt to copy the raw data, but also for any work on storage of general aggregated data on the FR cluster. If in FR we made and stored our own aggregated version of this data, we would still be needlessly duplicating infrastructure and effort and complicating future research. (That is, exactly the same issues as if we were to store our own raw version.)
  • So, while I agree that it is important that we be able to query Analytics datasets from the FR cluster, and copy over data as necessary and appropriate, and I also agree that it's preferable to keep donor data only ever on the FR cluster, I would strongly recommend that any queries of Analytics' datasets that we run from the FR cluster, and any data that we copy over, be as specific and tailored to FR needs as possible. This is just a straightforward result of the nature of the pageview data, and I think it would be evident to folks who have worked closely with it. Stated differently: all general work, including storage of raw data, general-purpose aggregation, and storage of general-purpose aggregated data, should be done by Analytics, on their infrastructure. If the datasets provided by Analytics somehow do not suit Advancement's needs, our first option should be to suggest ways to update this upstream source, not fork and do things separately.
  • In light of the above, @Nuria's initial question about specific use cases is eminently reasonable and necessary.
  • Also to note, once use cases are identified and described in detail, the route to implementation (including where data should be stored) is in large measure an engineering decision and should be studied mostly as such.
  • As for specific use cases, the only one I have heard about is impression rates (impressions/pageviews for a targeted user segment). Significant work has been done previously on this. Again, see T254792 for a detailed proposal.
  • The only other use case that I can think of is exploratory research. For this, I would suggest that initially, exploratory queries be run on Analytics infrastructure. This should be fine because there are very few joinable columns shared between donor data and pageview data. I'm sure any such exploratory queries would not expose donor data outside the FR cluster, especially if the queries of Analytics systems are made from the FR cluster. If, after such explorations, we find queries we'd like to run repeatedly to join results with donor data, we can find ways to productionize them.
  • If there are further use cases that have not been described, could they be described in detail? Simply "joining datasets" is not detailed enough, in view of the nature of this data.

Thanks so so much again, everyone!

DStrine updated the task description. (Show Details)

(Please see the newly linked doc in the description for more on use cases. Thanks so much @DStrine for adding that!! :) )

Hi!!! :) Some salient points from yesterday's meeting:

  1. FR-Tech cannot dedicate any substantial time to this, because of other priorities. (I hope I'm accurately representing this point; please correct me, @DStrine, @EBjune, if I'm not.)
  2. A direct connection from Fundraising's Superset instance to Analytics' Druid instance is not simple to enable, because of authentication setup issues and a changing authentication landscape on the wider cluster. (Please let me know if this is wrong, @Jgreen, @elukey!)

Please also allow me to share some thoughts about other, very lightweight options here, at least for placekeeping! (In accordance with the first point, above, this should not be taken to imply that FR-Tech can allot dev time to this...)

So... While it'd be great to have the Analytics-Druid <=> FR-Superset connection, there would still be other ways to move small amounts of data, say, the results of specific queries, from the Analytics cluster over to the FR Cluster, no?

For example, it might be possible to query Analytics' Druid using a script running on a server in the FR cluster? Or, barring that, we could run a Druid query from a script running on the Analytics cluster, save the results as a csv file, and then copy the file to the FR cluster, maybe over ssh?

Either way, we could get a csv file with Druid query results onto the FR cluster, and that could then be accessed by FR Superset, I think?

A quick-and-dirty solution like this would have many limitations. You wouldn't get anything like the querying options and flexibility available when Superset works directly with Druid. But you could at least surface limited pageview and impression rate data in the FR Superset instance this way.
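
For the script itself, Druid's native JSON-over-HTTP query API could be used with just the standard library. This is only a sketch: the broker URL, datasource name, and dimension/metric names below are assumptions that would need checking against the real pageview datasource schema.

```python
import csv
import json
import urllib.request

# Assumed broker endpoint -- the real analytics broker host would differ.
BROKER = "http://druid-broker.example.wmnet:8082/druid/v2/"

def build_pageview_query(interval, country):
    """Native Druid timeseries query for hourly pageview counts.
    Datasource, dimension, and metric names here are assumptions."""
    return {
        "queryType": "timeseries",
        "dataSource": "pageviews_hourly",
        "granularity": "hour",
        "intervals": [interval],
        "filter": {"type": "selector", "dimension": "country_code", "value": country},
        "aggregations": [{"type": "longSum", "name": "views", "fieldName": "view_count"}],
    }

def query_to_csv(query, path):
    """POST the query to the broker and write the result rows as CSV."""
    req = urllib.request.Request(
        BROKER,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        rows = json.load(resp)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "views"])
        for row in rows:
            writer.writerow([row["timestamp"], row["result"]["views"]])

# e.g. query_to_csv(build_pageview_query("2020-06-10/2020-06-17", "US"),
#                   "/path/pageviews_us.csv")
```

(Authentication against the kerberized broker is the part this sketch glosses over; that's the setup discussed earlier in the ticket.)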

You could also set up a dead-simple job to run the queries at regular intervals (and another simple job to copy csv files, if the queries couldn't be run from the FR cluster). You'd have to manually adjust the query parameters to surface data for whatever FR campaigns are happening at the time... but it would work. (This could also be automated, but doing so would require a bit more dev time.)

Here is an example of how simple the code could be (using the library we made for impression rates). To create a csv with the last week of impression rates for the low-level campaign on desktop English Wikipedia in the US, you could run this Python script:

import centralnotice_analytics as cna

c = cna.CampaignSpec(
    name_regex = 'C1920_enUS_dsk_FR',
    projects = [ 'wikipedia' ],
    languages = [ 'en' ],
    countries = [ 'US' ],
    devices = [ 'desktop' ]
)

r = cna.RatesQuery( c, '2020-06-10T00:00Z/P7D', 'hour' )

r.pandas_df().to_csv( '/path/filename.csv' )

So again, while not nearly as good as what's described in T254792, still better than nothing, and much, much less work, eh?

Really hope this is helpful!!! :)

> Or, barring that, we could run a Druid query from a script running on the Analytics cluster, save the results as a csv file, and then copy the file to the FR cluster, maybe over ssh?

The ability to run queries and create csv files on the analytics cluster exists via a framework called reportupdater. Here is an example of some queries that are run on a monthly basis: https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/structured-data

Results from these queries are publicly available via http; an example from the before-mentioned queries: https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/structured-data/
These queries access hadoop directly.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
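
Since the published results are plain files over http, pulling one onto the FR cluster needs nothing beyond the standard library. A sketch (the report area and file name below are illustrative, and the .tsv extension is an assumption about the published format):

```python
import csv
import io
import urllib.request

# Base URL taken from the published-datasets example above.
BASE = "https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics"

def report_url(area, report):
    """Build the URL of one published report file (extension is an assumption)."""
    return f"{BASE}/{area}/{report}.tsv"

def fetch_report(area, report):
    """Download a published report and parse it into a list of rows."""
    with urllib.request.urlopen(report_url(area, report)) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

# e.g. rows = fetch_report("structured-data", "some_report")  # names illustrative
```

The resulting rows could then be loaded into a table the FR Superset instance can already read, with no kerberos involvement on the FR side.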

> Can someone describe the use case?
>
> pointed out that we should be able to connect fr-superset to Hive without pulling data into Fundraising from the data lake itself - this might be a great option
>
> This does not work for large datasets; superset will time out. The consumable pageview data on superset lives on Druid, not Hive. Now, before getting into technical solutions, let's please describe the use case on the ticket: what is it that we are trying to achieve?

I think I have a clearer understanding of Hive, Druid, Superset, and how Analytics is using Kerberos. I think everything FR folks would want to report within the FR Superset instance would be with Druid as the backend. I'm not clear on how authentication works in this case. I think we're using production LDAP to authenticate users to Superset, but I'm not seeing where the Superset->Druid connection and authentication is configured. @Nuria or @elukey could either of you help clarify that?

> ... but I'm not seeing where the Superset->Druid connection and authentication is configured

It is not yet merged in puppet, as the druid upgrade we are hoping to do next week is the one that will include the kerberos connection to the new version of druid. Can we touch base on this next week?

> ... but I'm not seeing where the Superset->Druid connection and authentication is configured
>
> It is not yet merged in puppet, as the druid upgrade we are hoping to do next week is the one that will include the kerberos connection to the new version of druid. Can we touch base on this next week?

Ok that helps, I'll keep an eye out for the upgrade--thanks!

Luckily, the banner impression dataset seems to not contain anything more sensitive than pageviews. It makes sense that banner and pageview would live together because they can be usefully joined. My outsider opinion is that JGreen's suggestion #1 to move banner data out of fr-tech infrastructure and on to the analytics cluster is the soundest way forward.

Of course donor data should not be mixed with analytics data, I don't think anyone in this thread was suggesting that should be moved out of FR's private cluster? That's far more sensitive than the normal NDA that regulates access to analytics.