Description
Ingest data from the intermediate aggregate table to be created in T209598.
(Per the discussion below, the required aggregation steps can't be done in Druid itself.)
Details
| Project | Branch | Lines +/- | Subject |
|---|---|---|---|
| operations/puppet | production | +61 -0 | Add druid_load.pp to refinery jobs |

| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | mforns | T205562 Ingest aggregate ReadingDepth data into Druid |
| Resolved | | mforns | T205641 Add ability to bucketize integers as part of event ingestion |
| Invalid | | jlinehan | T209598 Aggregate ReadingDepth data in a form suitable for interactive visualization |
Event Timeline
Hi @Tbayer, just wanted to let you know that at the moment we can't index the time values stored in this dataset, due to their high cardinality. So I wanted to ask: does the dataset have value to you in Druid without them, or would you prefer to wait for the completion of T205641 (referenced here as a subtask), which would put the time values into buckets and make it possible to ingest them into Druid?
The other option is to bucketize the time measurements when they are sent from the client, into buckets like [0-50ms], [50-100ms], etc. That would lower the cardinality of the integers and allow for easier ingestion.
Oh, those time fields (visibleLength and totalLength are particularly relevant) are to be understood as measures, not as dimensions, to use the terms from https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines and the Druid documentation. It's actually very similar to the examples in the draft guidelines about "time spent" and "time since last action".
A dashboard for ReadingDepth in Superset or Turnilo should be able to display at least the mean (average) for both of these metrics, and ideally also the median. (@Groceryheist is currently running custom queries for both mean and median as part of the ongoing research project involving this data.)
@Tbayer these are the numeric aggregations supported by Druid, and thus the ones available for this data: http://druid.io/docs/latest/querying/aggregations
Druid's strength is exploring timeseries with n dimensions, but it is not designed as a tool to explore numeric data, so it does not calculate percentiles or similar. (There is some experimental work on approximate percentile calculation being done by the Druid team at this time, but it has many limitations.)
@Nuria I figured that percentiles including the median might be more demanding, but I didn't expect that the mean would be a problem too. Considering that Druid's aggregators include sums and counts, is there a way to calculate their quotient (i.e. the mean) later in Turnilo or Superset?
> is there a way to calculate their quotient (i.e. the mean) later in Turnilo or Superset?
sum(blah) / count(blah) should be possible, I think, but in my opinion it is really inadvisable from a performance perspective.
> I figured that percentiles including the median might be more demanding,
Right. In general, percentile calculation is not something you do over massive amounts of data. You can either 1) do an approximate calculation using something like https://twitter.github.io/algebird/datatypes/approx/q_tree.html (this library is available and you can use it in Spark),
or 2) use the approach of monitoring tools like Graphite: calculate percentiles over windows of data. The second is a far better approach performance-wise, see: https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-continent?orgId=1
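For illustration only, here is a minimal Spark sketch of what computing the mean and an approximate median upstream of Druid could look like. The table and column names are hypothetical, and it uses Spark SQL's percentile_approx rather than Algebird's QTree:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, expr, lit}

val spark = SparkSession.builder().appName("ReadingDepthSummary").getOrCreate()

// Hypothetical Hive table holding refined ReadingDepth events.
val events = spark.read.table("event.readingdepth")

// The mean is just sum/count; the median is approximated per group with
// Spark SQL's percentile_approx (instead of Algebird's QTree).
val summary = events
  .groupBy("wiki")
  .agg(
    avg("event.visibleLength").as("mean_visible_length"),
    expr("percentile_approx(event.visibleLength, 0.5)").as("median_visible_length"),
    count(lit(1)).as("event_count")
  )
```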
I loaded 1 month (Sept 2018) of ReadingDepth to Druid as a test. You can see it in Turnilo:
https://turnilo.wikimedia.org/#event_ReadingDepth/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ASnGwBzYkryqQUNLXoEATAAYAjAAcpD4AnMF+Ij4+eFExPgB0UT4AWpLE2AAm3L6BpH4+4ZHRsVGJUakAvgC61UxQmkhoTi4a2twWTBkQbNhQWLgEZh0gdjS2MAi0EBrcAAoifgASklCYdPighsaD5t36IF2G5Bg43HBQ5Old9iAVTEgs0/jYEwi1IGznMIZOoNAAshMMPgpIgoMQ6hB7AgdCByJgYNh6EwWECICo4QikWkwOk0GIsfQakxNFCSBkACJ7Xr9ZpVElk4gZADKa08mMRyMIxAcmWeryYlAgdkoSBFnheCDeQA==
To load this test data set, I modified the EventLoggingToDruid.scala job to transform time measure fields into bucketized dimensions, as suggested earlier in this task, so that we can discuss its use with some tangible examples. If you split by a time measure dimension, you'll see various branches that indicate orders of magnitude, like O(10^3). An order of magnitude of O(10^3), in this case, is equivalent to the interval [1000, 9999], and O(10^5) to [100000, 999999]. In the specific case of ReadingDepth, where measurements are in milliseconds, O(10^5) would mean 100 to 999 seconds. The value of each branch at any data point corresponds to the number of events belonging to that range. Note that raw millisecond values will still be available in Hive for deep analysis.
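For reference, a rough sketch (not the actual EventLoggingToDruid code) of how a raw millisecond value could be mapped to such an order-of-magnitude bucket:

```scala
// Rough sketch only, not the actual EventLoggingToDruid implementation:
// map a raw millisecond value to an order-of-magnitude bucket label.
def orderOfMagnitudeBucket(millis: Long): String = {
  if (millis <= 0) "O(0)"
  else {
    val exponent = math.log10(millis.toDouble).toInt // e.g. 1234 -> 3
    s"O(10^$exponent)"
  }
}

// orderOfMagnitudeBucket(1234L)   => "O(10^3)", i.e. the interval [1000, 9999]
// orderOfMagnitudeBucket(250000L) => "O(10^5)", i.e. the interval [100000, 999999]
```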
Please @Tbayer, have a look and let's discuss whether this is of any value.
Thanks!
My take on this bucketization:
Disadvantages:
- buckets are less precise than percentiles
- buckets are subject to event throughput variations and seasonality
Advantages:
- we can calculate buckets in Druid now, as opposed to percentiles
- buckets give more information about distribution than averages
@Tbayer
After some discussion in the Analytics stand-up meeting, we found the order-of-magnitude notation quite hard to read.
I applied some changes to the EL2Druid job to improve readability. The buckets now read: ..., "100ms-1sec", "1sec-10sec", "10sec-1min", etc.
I loaded the first 15 days of September 2018 with the new labels, so we can compare.
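For illustration, a hypothetical sketch of how millisecond values could map to these human-readable labels (the exact bucket boundaries shown here are assumptions, not the actual job code):

```scala
// Hypothetical sketch of the human-readable labels; boundaries are assumed
// for illustration and may differ from the real EL2Druid bucketing.
def readableTimeBucket(millis: Long): String = millis match {
  case m if m < 100    => "0ms-100ms"
  case m if m < 1000   => "100ms-1sec"
  case m if m < 10000  => "1sec-10sec"
  case m if m < 60000  => "10sec-1min"
  case m if m < 600000 => "1min-10min"
  case _               => "10min+"
}
```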
Cheers
Change 464833 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add druid_load.pp to refinery jobs
Change 464833 abandoned by Mforns:
Add druid_load.pp to refinery jobs
Reason:
already taken care of by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/
Just a note that I'm deleting the Druid dataset temporarily, to apply some renames and productionize the final job.
Will be back up within 1 day hopefully.
Data is back up, let us know if anything is missing: https://turnilo.wikimedia.org/#event_readingdepth
I backfilled the last 3 months of data. This is now productionized!
Data will continue to be imported automatically every hour
(with a 5-hour lag to allow for the collection and refinement of EL events into Hive).
Next steps are:
- Write comprehensive documentation about EventLoggingToDruid ingestion.
- Remove the confusing Count metric from the datasource in Turnilo, or at least uncheck it by default (and make the actual eventCount the default instead).
- Try to add a new metric to the datasource, eventCountPercentage, that normalizes eventCount splits by the total aggregate, so that time measure buckets become percentage-of-total values instead of raw frequencies. That way they would not vary with throughput changes or seasonality, and would be a lot easier to follow. (Not sure if this will be possible, though; a rough sketch of the idea follows below.)
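To make that last idea concrete, here is a hypothetical Spark sketch of the normalization, computed upstream. Whether Druid or Turnilo can express this directly as a post-aggregation is a separate question, and all table and column names here are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical sketch of the eventCountPercentage idea:
// normalize each time-measure bucket's eventCount by the total for its hour,
// so the metric reads as a percentage instead of a raw frequency.
// Table and column names (readingdepth_druid, dt_hour, eventCount) are made up.
val spark = SparkSession.builder().appName("ReadingDepthPercentage").getOrCreate()

val perBucket = spark.read.table("event.readingdepth_druid")

val hourlyTotal = Window.partitionBy("dt_hour")

val withPercentage = perBucket.withColumn(
  "eventCountPercentage",
  col("eventCount") / sum(col("eventCount")).over(hourlyTotal) * 100
)
```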
In any case, these items will not be part of this task; I will tackle them as part of T206342.
Will move this task to Done in Analytics-Kanban.
Cheers!
(Already discussed this with @Nuria offline some weeks ago, but to record it here too:)
I appreciate the effort that went into this, but unfortunately the result has very little value from a data analysis / product perspective. The time variations one can glean from the resulting charts in Turnilo (cf. example screenshot below) basically just show the well-known variations in pageviews. The same goes for comparisons between projects or along other dimensions. And it's very hard to notice shifts in the proportions of the bucket sizes (i.e. to detect growing or decreasing dwell times) with the naked eye there.
As noted early on in this task (T205562#4626349), the MVP here should at least yield the average reading time, but we also need the median and other measures derived from the visibleLength field. I am not sure why we went ahead with implementing the bucketing idea without verifying that the result is actually something wanted and needed by data users. But at least the work on this task so far has been valuable for unearthing/clarifying the limitations of Druid for such aggregate metrics. (Thanks also for updating the documentation in that regard. Note that the example schema there is about "time spent" measures, i.e. a very similar use case to this task.)
Anyway, per the discussion with Nuria I have now opened a subtask about the creation of an intermediate aggregate table suitable for ingestion into Druid: T209598
Once that is available, we can proceed with this task again.