
Ingest aggregate ReadingDepth data into Druid
Closed, ResolvedPublic5 Estimated Story Points

Description

Ingest data from the intermediate aggregate table to be created in T209598.
(Per the discussion below, the required aggregation steps can't be done in Druid itself.)
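
(As a rough illustration of the aggregation steps in question, a hypothetical Spark sketch of what T209598's job might resemble; the table and column names here are assumptions, not the final schema. The point is that per-hour sums and event counts are pre-computed outside Druid, so that means and percentiles can be derived downstream.)

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  object ReadingDepthHourlyAggregate {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("ReadingDepthHourlyAggregate")
        .enableHiveSupport()
        .getOrCreate()

      // Refined ReadingDepth EventLogging events in Hive (source table name is an assumption).
      val events = spark.table("event.readingdepth")

      // Pre-aggregate per wiki and hour: event counts plus sums of the time measures,
      // which Druid's simple aggregators (longSum, count) can then roll up further.
      val hourly = events
        .groupBy(col("wiki"), col("year"), col("month"), col("day"), col("hour"))
        .agg(
          count(lit(1)).alias("event_count"),
          sum(col("event.visibleLength")).alias("visible_length_sum"),
          sum(col("event.totalLength")).alias("total_length_sum")
        )

      // Hypothetical destination table for the Druid ingestion job.
      hourly.write.mode("overwrite").saveAsTable("wmf.readingdepth_hourly")
    }
  }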

Event Timeline

Nuria created this task. Sep 26 2018, 5:45 PM
Restricted Application added a subscriber: Aklapper. Sep 26 2018, 5:45 PM
fdans moved this task from Next Up to In Progress on the Analytics-Kanban board. Sep 27 2018, 11:36 AM
fdans added a comment. Sep 27 2018, 5:00 PM

Hi @Tbayer, just wanted to let you know that at the moment we can't index the time values stored in this dataset due to their high cardinality. So I wanted to ask: does the dataset have value to you in Druid without them, or would you prefer to wait for the completion of T205641 (referenced here as a subtask), which would put the time values into buckets and make it possible to ingest them into Druid?

Nuria added a comment. Sep 27 2018, 5:12 PM

The other option is to bucketize the time measurements when they are sent from the client, in buckets like [0-50ms], [50-100ms], etc. That would lower the cardinality of the integers and allow for easier ingestion.
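
(For illustration, a minimal sketch of that idea, with made-up bucket boundaries; the real instrumentation would live in the client-side JavaScript, but the logic is the same.)

  // Illustrative only: map a raw millisecond measurement to a coarse range label,
  // so that only a handful of distinct values ever reach EventLogging.
  def bucketizeMs(ms: Long): String = {
    val boundaries = Seq(50L, 100L, 250L, 500L, 1000L, 5000L, 10000L, 60000L)
    boundaries.find(ms < _) match {
      case Some(upper) =>
        val lower = boundaries.takeWhile(_ < upper).lastOption.getOrElse(0L)
        s"$lower-${upper}ms"
      case None => ">=60000ms"
    }
  }

  // bucketizeMs(42)    == "0-50ms"
  // bucketizeMs(75)    == "50-100ms"
  // bucketizeMs(90000) == ">=60000ms"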

mforns claimed this task. Sep 28 2018, 4:43 PM
mforns added a subscriber: fdans.

Tbayer added a comment.

does the dataset have value to you in Druid without them, or would you prefer to wait for the completion of T205641, which would put the time values into buckets?

Oh, those time fields (visibleLength and totalLength are particularly relevant) are to be understood as measures, not as dimensions, to use the terms from https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines and the Druid documentation. This is actually very similar to the draft guidelines' examples about "time spent" and "time since last action".

A dashboard for ReadingDepth in Superset or Turnilo should be able to display at least the mean (average) for both of these metrics, and ideally the median as well. (@Groceryheist is currently running custom queries for both the mean and the median as part of the ongoing research project involving this data.)

Nuria added a comment. Sep 28 2018, 9:46 PM

@Tbayer these are the numeric aggregations supported by Druid, and the only ones available for this data: http://druid.io/docs/latest/querying/aggregations
Druid's strength is exploring timeseries with n dimensions, but it is not designed as a tool for exploring numeric data, so it does not calculate percentiles or similar. (There is some experimental work on approximate percentiles being done by the Druid team at this time, but it has many limitations.)

Tbayer added a comment.

@Nuria I figured that percentiles including the median might be more demanding, but I didn't expect that the mean would be a problem too. Considering that Druid's aggregators include sums and counts, is there a way to calculate their quotient (i.e. the mean) later in Turnilo or Superset?

@Tbayer:

is there a way to calculate their quotient (i.e. the mean) later in Turnilo or Superset?

sum(blah) / count(blah) should be possible, I think, but in my opinion it is really inadvisable from a performance perspective.
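
(To make the sum/count idea concrete: if per-hour sums and counts are pre-aggregated, as in the hypothetical table sketched in the description, the exact mean can be recovered downstream as a weighted quotient. A Spark sketch, assuming an active SparkSession `spark` and the hypothetical wmf.readingdepth_hourly table:)

  import org.apache.spark.sql.functions._

  // Hypothetical pre-aggregated table: one row per wiki and hour,
  // with visible_length_sum and event_count columns.
  val hourly = spark.table("wmf.readingdepth_hourly")

  // The overall mean is sum-of-sums divided by sum-of-counts,
  // NOT the average of the per-hour means.
  val dailyMean = hourly
    .groupBy(col("wiki"), col("year"), col("month"), col("day"))
    .agg(
      (sum(col("visible_length_sum")) / sum(col("event_count")))
        .alias("mean_visible_length_ms")
    )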

I figured that percentiles including the median might be more demanding,

Right. In general, percentile calculation is not something you do over massive amounts of data. You can 1) do an approximate calculation using something like https://twitter.github.io/algebird/datatypes/approx/q_tree.html (this library is available and you can use it in Spark),
or 2) use the approach of monitoring tools like Graphite: calculate percentiles over windows of data. The second is a far better approach performance-wise, see: https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-continent?orgId=1
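
(A minimal sketch of option 1 in Spark. This uses Spark's built-in approxQuantile, a Greenwald-Khanna sketch, rather than Algebird's QTree, but it illustrates the same approximate-percentile idea; table and field names are assumptions.)

  import org.apache.spark.sql.functions.col

  // Assumes an active SparkSession `spark` (e.g. in spark2-shell) and refined
  // ReadingDepth events in Hive.
  val oneDay = spark.table("event.readingdepth")
    .where("year = 2018 AND month = 9 AND day = 30")
    .select(col("event.visibleLength").cast("double").alias("visible_length_ms"))

  // Approximate p50 / p95 / p99 with a 1% relative error bound.
  val Array(p50, p95, p99) =
    oneDay.stat.approxQuantile("visible_length_ms", Array(0.5, 0.95, 0.99), 0.01)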

fdans triaged this task as Medium priority. Oct 1 2018, 4:00 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
mforns added a comment. Oct 2 2018, 2:44 PM

@Tbayer @Nuria

I loaded 1 month (Sept 2018) of ReadingDepth to Druid as a test. You can see it in Turnilo:
https://turnilo.wikimedia.org/#event_ReadingDepth/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ASnGwBzYkryqQUNLXoEATAAYAjAAcpD4AnMF+Ij4+eFExPgB0UT4AWpLE2AAm3L6BpH4+4ZHRsVGJUakAvgC61UxQmkhoTi4a2twWTBkQbNhQWLgEZh0gdjS2MAi0EBrcAAoifgASklCYdPighsaD5t36IF2G5Bg43HBQ5Old9iAVTEgs0/jYEwi1IGznMIZOoNAAshMMPgpIgoMQ6hB7AgdCByJgYNh6EwWECICo4QikWkwOk0GIsfQakxNFCSBkACJ7Xr9ZpVElk4gZADKa08mMRyMIxAcmWeryYlAgdkoSBFnheCDeQA==

To load this test data set, I modified the EventLoggingToDruid.scala job to transform time measure fields into bucketized dimensions, as suggested earlier in this task, so that we can discuss their use with some tangible examples. If you split by a time measure dimension, you'll see various branches that indicate orders of magnitude, like O(10^3). In this case, O(10^3) is equivalent to the interval [1000, 9999], and O(10^5) to [100000, 999999]. Since ReadingDepth measurements are in milliseconds, O(10^5) means 100 to 999 seconds. The value of each branch at any data point corresponds to the number of events belonging to that range. Note that the raw millisecond values will still be available in Hive for deep analysis.

Please @Tbayer, have a look and let's discuss whether this is of any value.
Thanks!
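
(For readers following along, a rough sketch of the kind of transformation described above; this is not the actual EventLoggingToDruid.scala code, just an illustration of turning a millisecond measure into an order-of-magnitude dimension.)

  // Map a raw millisecond value to an order-of-magnitude bucket label,
  // e.g. 3200 -> "O(10^3)", i.e. the interval [1000, 9999].
  def orderOfMagnitudeBucket(ms: Long): String =
    if (ms < 1) "O(10^0)"  // sub-millisecond values lumped into the lowest bucket
    else s"O(10^${math.log10(ms.toDouble).toInt})"

  // orderOfMagnitudeBucket(3200L)   == "O(10^3)"
  // orderOfMagnitudeBucket(250000L) == "O(10^5)"  // i.e. 100 to 999 seconds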

mforns added a comment. Oct 2 2018, 2:47 PM

My take on this bucketization:

Disadvantages:

  • buckets are less precise than percentiles
  • buckets are subject to event throughput variations and seasonality

Advantages:

  • we can calculate buckets in Druid now, as opposed to percentiles
  • buckets give more information about the distribution than averages do
mforns added a comment (edited). Oct 2 2018, 8:25 PM

@Tbayer
After some discussion in the Analytics stand-up meeting, we found the order-of-magnitude notation quite uncomfortable to read.
I applied some changes to the EL2Druid job to improve readability. The buckets now read: ..., "100ms-1sec", "1sec-10sec", "10sec-1min", etc.
I loaded the first 15 days of September 2018 with the new labels, so we can compare.
Cheers
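
(Roughly, the readability change amounts to swapping the exponent notation for human-oriented labels; an illustration only, the label list is an assumption and not the job's actual code.)

  // Same decimal (order-of-magnitude) buckets as before, but with readable labels.
  def readableBucket(ms: Long): String = {
    val labels = Seq("1ms-10ms", "10ms-100ms", "100ms-1sec",
                     "1sec-10sec", "10sec-1min", "1min-10min")
    if (ms < 1) labels.head
    else {
      val exponent = math.log10(ms.toDouble).toInt
      if (exponent < labels.size) labels(exponent) else "10min+"  // anything at or above 10^6 ms
    }
  }

  // readableBucket(3200L)  == "1sec-10sec"
  // readableBucket(45000L) == "10sec-1min"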

Change 464833 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add druid_load.pp to refinery jobs

https://gerrit.wikimedia.org/r/464833

mforns set the point value for this task to 5. Oct 8 2018, 4:09 PM

Change 464833 abandoned by Mforns:
Add druid_load.pp to refinery jobs

Reason:
already taken care of by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/

https://gerrit.wikimedia.org/r/464833

Just a note that I'm deleting the Druid dataset temporarily, to apply some renames and productionize the final job.
Will be back up within 1 day hopefully.

Milimetric raised the priority of this task from Medium to High. Oct 18 2018, 5:17 PM
Nuria added a comment. Oct 22 2018, 9:37 PM

Data is back up, let us know if anything is missing: https://turnilo.wikimedia.org/#event_readingdepth

I backfilled the last 3 months of data. This is now productionized!
Data will continue to be imported automatically every hour
(with a 5-hour lag to allow for the prior collection and refinement of EL events into Hive).
Next steps are:

  • Write comprehensive documentation about EventLoggingToDruid ingestion.
  • Remove the confusing Count metric from the datasource in Turnilo, or at least uncheck it by default (and make the default the actual eventCount).
  • Try to add a new metric to the datasource, eventCountPercentage, that normalizes eventCount splits by the total aggregate, so that time measure buckets become percentage-of-total values instead of raw frequencies. That way they will not vary with throughput changes or seasonality, and will be a lot easier to follow. (Not sure if this will be possible, though; see the sketch below for the general idea.)
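
(To illustrate the eventCountPercentage idea from the last bullet: each bucket's count is divided by the total across buckets at the same data point. A sketch of the normalization in Spark, with a hypothetical table of per-hour, per-bucket counts; whether this can be expressed as a Druid/Turnilo post-aggregation is the open question.)

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Hypothetical per-hour, per-bucket counts, as they exist in the Druid datasource.
  val bucketCounts = spark.table("wmf.readingdepth_hourly_buckets")

  // Normalize each bucket's eventCount by the total at the same wiki/hour, so the
  // values become share-of-total percentages instead of raw frequencies.
  val perPoint = Window.partitionBy(col("wiki"), col("dt"))
  val withPercentage = bucketCounts.withColumn(
    "event_count_percentage",
    col("event_count") / sum(col("event_count")).over(perPoint) * 100
  )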

In any case these items will not be part of this task, I will tackle them as part of T206342.
Will move this task to Done in Analytics-Kanban.
Cheers!

mforns moved this task from In Code Review to Done on the Analytics-Kanban board. Oct 23 2018, 2:02 PM
Nuria closed this task as Resolved. Oct 25 2018, 4:26 PM
Tbayer reopened this task as Open. Nov 15 2018, 3:53 PM

Tbayer added a comment.

(Quoting @mforns's comment above from Oct 2 2018, about the bucketized test data set loaded into Turnilo.)

(Already discussed this with @Nuria offline some weeks ago, but to record it here too:)

I appreciate the effort that went into this, but unfortunately the result has very little value from a data analysis / product perspective. The time variations one can glean from the resulting charts in Turnilo (cf. example screenshot below) basically just show the well-known variations in pageviews; the same goes for comparisons between projects or along other dimensions. And it is very hard to notice shifts in the proportions of the bucket sizes (i.e. to detect growing or decreasing dwell times) with the naked eye there.

As noted early on in this task (T205562#4626349), the MVP here should at least yield the average reading time, but we also need the median and other measures derived from the visibleLength field. I am not sure why we went ahead with implementing the bucketing idea without verifying that the result is actually something data users want and need. But at least the work on this task so far has been valuable for unearthing/clarifying the limitations of Druid for such aggregate metrics. (Thanks also for updating the documentation in that regard. Note that the example schema there is about "time spent" measures, i.e. a very similar use case to this task.)

Anyway, per the discussion with Nuria I have now opened a subtask about the creation of an intermediate aggregate table suitable for ingestion into Druid: T209598
Once that is available, we can proceed with this task again.

Tbayer renamed this task from "Ingest data into druid for readingDepth schema" to "Ingest aggregate ReadingDepth data into Druid". Nov 15 2018, 3:58 PM
Tbayer updated the task description.
Nuria closed this task as Resolved. Dec 10 2018, 5:59 PM