
Aggregate ReadingDepth data in a form suitable for interactive visualization
Open, High, Public

Description

Create a job that generates a regularly updated table suitable for ingestion into Druid, with at least the following measures derived from
the ReadingDepth schema:

  • median visibleLength
  • average visibleLength
  • other quartiles and the 95th percentile of visibleLength
  • average logarithm of visibleLength

aggregated per hour or day (TBD). (A rough sketch of one possible aggregation is included after the notes below.)

(This selection is based on @Groceryheist's recommendations from the exploration of this data in the Reading time project and may still be refined. The purpose is to make these metrics available to be viewed as dashboards/charts in Turnilo - see T205562 - and Superset.)

The same metrics for the totalLength field would be nice to have as well. On the other hand, we can ignore the schema's other quantitative fields (firstPaintTime etc.) for this purpose.

Dimensions to be carried over:

  • isAnon
  • skin
  • namespaceId
  • all relevant ones from the event capsule, including:
    • browser (family, major, minor)
    • OS (family, major, minor)
    • webhost and wiki (project)
    • country (but no smaller subdivisions)

Background: In T205562, we found that Druid does not provide the capability to calculate these aggregates directly while ingesting the original EventLogging table (event.readingdepth), hence the need for this task.

Other notes:

  • Ignore/drop the high-cardinality fields: pageTitle (and pageID and revisionID once they are added), pageToken, sessionToken
  • This is based on the action = "pageUnloaded" events only (i.e. we can ignore pageLoaded events).
  • Only use the data from the default sample (default_sample = true)
  • ...
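
For illustration only, here is a rough PySpark sketch of how such an aggregation could look against the EventLogging Hive table. The table and column paths (event.readingdepth, event.visibleLength, event.default_sample, dt, useragent, geocoded_data, webhost, wiki) are assumptions based on the schema and the usual EventLogging layout, the quantile function is approximate, and the output path is a placeholder; this is not a final implementation.

```lang=python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("readingdepth-daily-agg").getOrCreate()

# Assumed source table and the filters from the notes above:
# pageUnloaded events from the default sample only.
events = (
    spark.table("event.readingdepth")
    .where(F.col("event.action") == "pageUnloaded")
    .where(F.col("event.default_sample") == True)
)

# Dimensions to carry over (column paths are assumptions about the Hive layout).
dims = [
    F.to_date("dt").alias("day"),                      # or per hour, TBD
    F.col("event.isAnon").alias("is_anon"),
    F.col("event.skin").alias("skin"),
    F.col("event.namespaceId").alias("namespace_id"),
    F.col("useragent.browser_family").alias("browser_family"),
    F.col("useragent.browser_major").alias("browser_major"),
    F.col("useragent.os_family").alias("os_family"),
    F.col("useragent.os_major").alias("os_major"),
    F.col("webhost"),
    F.col("wiki"),
    F.col("geocoded_data")["country"].alias("country"),
]

# Measures: mean, mean of log, and approximate quartiles / 95th percentile.
aggregated = events.groupBy(*dims).agg(
    F.count("*").alias("events"),
    F.avg("event.visibleLength").alias("mean_visible_length"),
    F.avg(F.log1p("event.visibleLength")).alias("mean_log_visible_length"),
    F.expr(
        "percentile_approx(event.visibleLength, array(0.25, 0.5, 0.75, 0.95))"
    ).alias("visible_length_quantiles"),
)

# Output location is a placeholder; the real job would write wherever the
# Druid ingestion (or other downstream consumer) expects its input.
aggregated.write.mode("overwrite").parquet("/tmp/readingdepth_daily")
```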

Event Timeline

@ovasileva and @Groceryheist, feel free to weigh in if there is anything missing or off.

ovasileva triaged this task as High priority. Nov 15 2018, 5:02 PM

> @ovasileva and @Groceryheist, feel free to weigh in if there is anything missing or off.

looks good

Nuria added a subscriber: Nuria. Edited Nov 15 2018, 10:53 PM

I think you need to flesh out a bit more what questions you want to answer, and evaluate whether Druid is the best tool to answer them.

@Tbayer: please describe the dataset you expect (column-wise), as it will help decide whether Druid is actually the best choice here.

Some thoughts:

If you want to load into Druid a "percentile series", say, "the 1st, 50th and 90th percentiles of visibleLength for anonymous users per wiki per day", you really cannot import dimensions like browser family, as they are not related to the other aggregated dimensions (visibleLength, wiki, day). They do not belong in the same dataset because they cannot be aggregated (like visibleLength) per day.

If you want all the dimensions you listed above, you would need to calculate percentiles per day, per wiki, per browser family, per country. That is probably not what you want, as it would mean one time series per combination of day, wiki, browser family and country, which amounts to an enormous number of time series per day.

I am guessing that what you want is data shaped something like this: (day, percentile, country), (day, percentile, wiki) and (day, percentile, browserFamily). These would be 3 independent datasets, none of which would be well suited for Turnilo, but once ingested into Druid you could make dashboards in Superset.
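
To make that shape concrete, here is a hypothetical sketch of the three independent per-day percentile datasets; as in the sketch in the task description, the table and column names are assumptions and the percentile function is approximate.

```lang=python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same assumed source table and filters as in the task description's sketch.
events = (
    spark.table("event.readingdepth")
    .where(F.col("event.action") == "pageUnloaded")
    .where(F.col("event.default_sample") == True)
)

QUANTILES = "percentile_approx(event.visibleLength, array(0.01, 0.5, 0.9))"

def percentiles_by(events, dim_col, dim_name):
    # One percentile series per day per value of a single dimension.
    return (
        events
        .groupBy(F.to_date("dt").alias("day"), dim_col.alias(dim_name))
        .agg(F.expr(QUANTILES).alias("visible_length_percentiles"))
    )

# Three independent datasets; the percentiles in one cannot be sliced by the
# dimensions of the others.
by_country = percentiles_by(events, F.col("geocoded_data")["country"], "country")
by_wiki    = percentiles_by(events, F.col("wiki"), "wiki")
by_browser = percentiles_by(events, F.col("useragent.browser_family"), "browser_family")
```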

As I said, let's please describe the expected dataset so we can evaluate which tool is best suited to produce it.

@Nuria I thought the requirements from the user perspective were evident from the task, but to clarify it a bit more:

The given dimensions and measures are what we would like to have available in an interactive data visualization such as the existing ones on Turnilo (e.g. for pageissues) or Superset, where a data analyst or product manager can, for example, select a combination of values in the different dimensions (say Safari users on desktop in India and Pakistan) and plot the given measure (say median visibleLength) over a timespan. It seems one wouldn't get that from these three separate time series.

I may have misunderstood your earlier remarks in T205562 and our subsequent offline discussion about the limitations of Druid and the missing pieces required here. After reading up a bit more on its technical background and having some other conversations I can see how it's doubtful whether Druid (or any tool limited to summing up data cubes) is capable of this at all. I'll give the task a more technology-neutral title. Hopefully @jlinehan and others with advanced expertise can figure out a solution. I recall that the algebird library you mentioned in T205562 has already been used in production for the app sessions metrics jobs, although in a very limited fashion.
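
To illustrate the underlying issue: exact percentiles of subgroups cannot be combined into percentiles of their union, so either every dimension combination has to be precomputed, or each cell has to store a mergeable summary (which is what something like algebird's QTree or a quantiles sketch provides). The toy Python below only demonstrates the merge-then-query idea; the class and the numbers in it are made up, and it is not a proposed implementation.

```lang=python
import random

class SampleSketch:
    """Crude mergeable summary: keeps at most `capacity` sampled values."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.seen = 0
        self.values = []

    def add(self, x):
        # Reservoir sampling, so the kept values stay roughly representative.
        self.seen += 1
        if len(self.values) < self.capacity:
            self.values.append(x)
        elif random.random() < self.capacity / self.seen:
            self.values[random.randrange(self.capacity)] = x

    def merge(self, other):
        # Merging two summaries gives a summary of the union (only roughly
        # unbiased here; real sketches do this exactly within error bounds).
        merged = SampleSketch(self.capacity)
        for x in self.values + other.values:
            merged.add(x)
        return merged

    def quantile(self, q):
        ordered = sorted(self.values)
        return ordered[int(q * (len(ordered) - 1))]

# One summary per (day, country) cell; merging cells still yields a usable
# approximate median for the combined slice, which storing only per-cell
# medians would not allow.
india, pakistan = SampleSketch(), SampleSketch()
for _ in range(10000):
    india.add(random.expovariate(1 / 40.0))      # fake visibleLength values
    pakistan.add(random.expovariate(1 / 55.0))
print(india.merge(pakistan).quantile(0.5))
```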

Also CCing @mpopov who has been acquiring Druid expertise as well and may have additional ideas.

Tbayer renamed this task from Aggregate ReadingDepth data for ingestion into Druid to Aggregate ReadingDepth data in a form suitable for interactive visualization. Nov 20 2018, 10:41 PM

For a quick stab at this data, I will submit counts for visibleLength and similar measures to Graphite and visualize those in a dashboard (you will have percentiles, but not the other dimensions).

The EL code already supports submitting data to Graphite, and the Reading team uses it for data such as this:
https://grafana.wikimedia.org/dashboard/db/reading-web-dashboard?panelId=15&fullscreen&orgId=1
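
For reference, Graphite's plaintext protocol is a single line per data point; a minimal sketch follows, with the host and metric path as placeholders rather than the actual EventLogging-to-Graphite path.

```lang=python
import socket
import time

GRAPHITE_HOST = "graphite.example.org"  # placeholder, not the production host
GRAPHITE_PORT = 2003                    # Graphite's plaintext listener

def send_metric(path, value, timestamp=None):
    # Plaintext protocol: "<metric.path> <value> <unix timestamp>\n"
    line = f"{path} {value} {int(timestamp or time.time())}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical metric name; real names would follow the existing EL conventions.
send_metric("readingdepth.visibleLength.p50", 42.5)
```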

kzimmerman moved this task from Triage to Tracking on the Product-Analytics board. Feb 19 2019, 6:20 PM