
Explore NavigationTiming by faceted properties - EventLogging refine
Closed, ResolvedPublic

Description

We've sliced NavigationTiming by a few criteria in Grafana (browser, location, etc.) but combining different criteria is impossible, making investigation of NavigationTiming improvements or regressions tedious and often fruitless. We also often need to slice things by new criteria and it's cumbersome to set up a new set of metrics for that.

Overall it seems like Grafana isn't the right tool for the task. Pivot and its ability to add filters, break down by facets, etc. seems to be much closer to what we would need. @Nuria do you think Pivot would be a tool appropriate for this job? If so, what would it take to feed NavigationTiming data into a "data cube"?

Event Timeline

Pivot will work dimension-wise.
The catch is that you need this data to be real-time-ish, correct? Let's talk a bit more about it, because we can do that too, but we need to set up some tooling we do not have.

Note: using EventLogging refine, this data could be loaded into Druid easily.

We don't need the data to be updated in real time, this would be used to investigate performance changes after the fact. Having it updated once a day would be acceptable, hourly would be great.

Then (cc @Ottomata and @Joseph for confirmation) we can get it done now in the same fashion that we load pageviews. There is an issue with "merging" data from schemas: rather than one dataset per schema in Pivot, you have one dataset that "merges" all your schemas. This is a bit tricky and we have been working on it as of this quarter. We can use NavigationTiming as our PoC for the EL-to-Druid pipeline (Druid is the storage behind Pivot).

OK, we have a plan to fix some issues with NavigationTiming and its schema: T104902: Refactor Navigation Timing gathering to produce reliable stackable measures (aka "frontend.navtiming2"). We have that work scheduled for next quarter. I think it'll be better if the data sent to Druid is the cleaner, more consistent new version. Should we ping you when we're done with that next quarter?

Gilles triaged this task as Medium priority. May 29 2017, 6:25 PM

@Gilles: please do, but our work can start earlier; we will just scrape the data once you call it good.

This is a good test case for our EventLogging refine. cc @Ottomata

Nuria renamed this task from Explore NavigationTiming by faceted properties to Explore NavigationTiming by faceted properties - EventLogging refine. Jul 13 2017, 4:12 PM

It seems that plainly importing NavigationTiming into Druid is the first step towards doing what Gilles is requesting.

This comment was removed by Nuria.

Ping @Gilles: added some work in this regard for next quarter.

Change 386882 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add scala-spark core class and job to import data sets to Druid

https://gerrit.wikimedia.org/r/386882

We encountered a couple of difficulties between the way Pivot works and the nature of NavigationTiming measures:

  • NavigationTiming's metrics are time measures in milliseconds. They are "inverted", because the lower the value the better, and also "bounded", because the minimum is 0. The problem is that NavigationTiming's fields are not required, so they can have NULL values. Druid ingestion transforms NULL values for numerical metrics into 0s. For time measures we cannot count the absence of a metric as 0, because 0 is not a neutral value (it's the "best" value a metric can have). There are ways to work around this for Druid, but when it comes to Pivot, it cannot be easily solved (see the NULL-handling sketch after this list).
  • NavigationTiming's raw metrics are not of much value in Pivot. Pivot will show you the sum of all time measures for a given metric in a given time range, but it's not the absolute sum we'd be interested in, rather a percentile value. There are ways to configure average metrics in Pivot using the YAML config file, but those are not scalable and would only provide averages, which are probably not interesting for performance measures.
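
To make the NULL-to-zero issue concrete, here is a minimal Spark sketch (table, field and output names are hypothetical, not the actual refinery job) of how NULL time measures could be filtered out before handing rows to Druid, so that absent measurements are not counted as 0 ms:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch only: the table, field and output names are hypothetical,
// not the actual refinery ingestion job.
object NavTimingNullHandling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-null-handling").getOrCreate()

    // Refined EventLogging NavigationTiming events in Hive (hypothetical table name).
    val events = spark.table("event.navigationtiming")

    // Druid ingestion would coerce NULL numeric metrics to 0, which for a time
    // measure is the "best" possible value, so keep only rows where the measure exists.
    val withResponseStart = events.filter(col("event.responsestart").isNotNull)

    withResponseStart
      .select(col("dt"), col("useragent.browser_family"), col("event.responsestart"))
      .write
      .mode("overwrite")
      .parquet("/tmp/navtiming_responsestart")  // staging output before Druid indexing

    spark.stop()
  }
}
```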

A couple of things we can try in order to solve those issues:

  • The latest development version of Druid has "approximate histograms" (http://druid.io/docs/latest/development/extensions-core/approximate-histograms.html), which might help in ingesting percentile metrics that can be displayed in Pivot.
  • We could pre-compute percentiles in the Scala ingestion job, so that Druid would be able to display them as regular metrics (see the sketch after this list). One drawback of this approach is that we'd be forced to choose a granularity for pre-computation (e.g. minutely) and the metric would be frozen at that granularity. That would be somewhat contradictory, since Druid is all about aggregating.
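
As a rough illustration of the pre-computation idea (not the actual ingestion job; table, field and output names are assumptions), the following sketch computes per-minute p50/p75/p95 values for one measure with Spark SQL, which Druid could then ingest as plain numeric metrics:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_trunc, expr}

// Sketch only: names are hypothetical, and the chosen granularity (minutely) is
// frozen into the output, which is the drawback mentioned above.
object NavTimingPercentiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-percentiles").getOrCreate()

    val events = spark.table("event.navigationtiming")
      .filter(col("event.responsestart").isNotNull)

    // Pre-compute per-minute percentiles for one measure.
    val perMinute = events
      .groupBy(date_trunc("minute", col("dt").cast("timestamp")).as("minute"))
      .agg(
        expr("percentile_approx(event.responsestart, 0.5)").as("responsestart_p50"),
        expr("percentile_approx(event.responsestart, 0.75)").as("responsestart_p75"),
        expr("percentile_approx(event.responsestart, 0.95)").as("responsestart_p95")
      )

    // Druid can ingest these rows as plain numeric metrics, but they cannot be
    // re-aggregated correctly at coarser granularities (percentiles do not sum).
    perMinute.write.mode("overwrite").parquet("/tmp/navtiming_percentiles_minutely")

    spark.stop()
  }
}
```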

To conclude, we can look at "approximate histograms", update our Druid version, etc. to get NavigationTiming into Pivot. But before that we'd like to have a simpler schema being ingested periodically from Hive to Druid to Pivot. We'll pause this task until we've successfully finished the pipeline that ingests simple Hive schemas, and then resume it to fix these problems.

Change 386882 merged by jenkins-bot:
[analytics/refinery/source@master] Add core class and job to import EL hive tables to Druid

https://gerrit.wikimedia.org/r/386882

fdans reopened this task as Open.
fdans moved this task from Wikistats to Blocked on the Analytics board.

After our changes to bucketize numeric dimensions, we think we can load this data into Turnilo and it will actually be pretty useful. Stay tuned; we will probably get this done this week.
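
For reference, a minimal sketch of what bucketizing a numeric time measure into string dimension buckets could look like (bucket boundaries, labels and names are hypothetical, not the actual refinery configuration):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Sketch only: bucket boundaries and labels are hypothetical, not the actual
// refinery configuration. The idea is to turn a millisecond measure into a
// low-cardinality string dimension that Turnilo can split and filter on.
object NavTimingBucketize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-bucketize").getOrCreate()

    val events = spark.table("event.navigationtiming")

    val bucketized = events.withColumn(
      "responsestart_bucket",
      when(col("event.responsestart").isNull, "undefined")
        .when(col("event.responsestart") < 100, "0-100ms")
        .when(col("event.responsestart") < 500, "100-500ms")
        .when(col("event.responsestart") < 2000, "500-2000ms")
        .otherwise("2000ms+")
    )

    bucketized.write.mode("overwrite").parquet("/tmp/navtiming_bucketized")
    spark.stop()
  }
}
```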

Change 464833 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add druid_load.pp to refinery jobs

https://gerrit.wikimedia.org/r/464833

Change 464833 abandoned by Mforns:
Add druid_load.pp to refinery jobs

Reason:
already taken care of by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/

https://gerrit.wikimedia.org/r/464833

Just a note that I'm deleting the Druid dataset temporarily, to apply some renames and productionize the final job.
Will be back up within 1 day hopefully.

Milimetric raised the priority of this task from Medium to High. Oct 18 2018, 5:17 PM

Looks great! Already I'm finding interesting facts about Chrome 69 vs Chrome 70.

I backfilled the last 3 months of data. This is now productionized!
Data will continue to be imported automatically every hour
(with a 5 hour lag to allow for previous collection and refinement of EL events into Hive).
Next steps are:

  • Write comprehensive documentation about EventLoggingToDruid ingestion.
  • Remove the confusing Count metric from the datasource in Turnilo, or at least uncheck it by default (and make the default the actual eventCount).
  • Try to add a new metric to the datasource, eventCountPercentage, that normalizes eventCount splits by the total aggregate, so that time-measure buckets become percentage-of-total values instead of frequencies. This way they will not vary with throughput changes or seasonality, and will be a lot easier to follow (not sure if this will be possible, though; a rough sketch follows below).
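
As a sketch of that normalization idea, done at ingestion time rather than as a Druid/Turnilo post-aggregation (all names below are hypothetical), per-bucket counts could be divided by the total count for the same time slice:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, sum}

// Sketch only: input and column names are hypothetical. Computes, per hour, the
// share of events that fall into each time-measure bucket, so the value becomes
// a percentage of total rather than a raw frequency.
object NavTimingBucketShares {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-bucket-shares").getOrCreate()

    // Hypothetical input: already bucketized events with an "hour" column.
    val bucketized = spark.table("event.navigationtiming_bucketized")

    val perBucket = bucketized
      .groupBy(col("hour"), col("responsestart_bucket"))
      .agg(count("*").as("eventCount"))

    // Normalize each bucket's count by the hourly total.
    val hourWindow = Window.partitionBy(col("hour"))
    val withShare = perBucket.withColumn(
      "eventCountPercentage",
      col("eventCount") * 100.0 / sum(col("eventCount")).over(hourWindow)
    )

    withShare.write.mode("overwrite").parquet("/tmp/navtiming_bucket_shares")
    spark.stop()
  }
}
```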

In any case, these items will not be part of this task; I will tackle them as part of T206342.
Will move this task to Done in Analytics-Kanban.
Cheers!