
Explore NavigationTiming by faceted properties - EventLogging refine
Closed, ResolvedPublic

Description

We've sliced NavigationTiming by a few criteria in Grafana (browser, location, etc.) but combining different criteria is impossible, making investigation of NavigationTiming improvements or regressions tedious and often fruitless. We also often need to slice things by new criteria and it's cumbersome to set up a new set of metrics for that.

Overall it seems like Grafana isn't the right tool for the task. Pivot and its ability to add filters, break down by facets, etc. seems to be much closer to what we would need. @Nuria do you think Pivot would be a tool appropriate for this job? If so, what would it take to feed NavigationTiming data into a "data cube"?

Event Timeline

Pivot will work dimension-wise.
The catch is that you need this data to be real-time-ish, correct? Let's talk a bit more about it, because we can do that too, but we need to set up some tooling we do not have.

Note: using EventLogging refine, this data could be loaded into Druid easily.

We don't need the data to be updated in real time, this would be used to investigate performance changes after the fact. Having it updated once a day would be acceptable, hourly would be great.

Then (cc @Ottomata and @Joseph for confirmation) we can get it done now in the same fashion that we load pageviews. There is an issue with "merging" data from schemas: rather than one dataset per schema in Pivot, you have one dataset that "merges" all your schemas. This is a bit tricky and we have been working on it as of this quarter. We can use NavigationTiming as our PoC for the EL-to-Druid pipeline (Druid is the storage behind Pivot).

OK, we have a plan to fix some issues with NavigationTiming and its schema: T104902: Refactor Navigation Timing gathering to produce reliable stackable measures (aka "frontend.navtiming2"). We have that work scheduled for next quarter. I think it'll be better if the data sent to Druid is the cleaner, more consistent new version. Should we ping you when we're done with that next quarter?

Gilles triaged this task as Medium priority. May 29 2017, 6:25 PM

@Gilles: please do, but our work can start earlier; we will just scrape the data once you call it good.

This is a good test case for our EventLogging refine. cc @Ottomata

Nuria renamed this task from Explore NavigationTiming by faceted properties to Explore NavigationTiming by faceted properties - EventLogging refine. Jul 13 2017, 4:12 PM

It seems that plainly importing NavigationTiming into Druid is the first step towards doing what Gilles is requesting.

This comment was removed by Nuria.

Ping @Gilles: added some work in this regard for next quarter.

Change 386882 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add scala-spark core class and job to import data sets to Druid

https://gerrit.wikimedia.org/r/386882

We encountered a couple of difficulties between the way Pivot works and the nature of NavigationTiming measures:

  • NavigationTiming's metrics are time measures in milliseconds. They are "inverted", because the lower the value the better, and also "bounded", because the minimum is 0. The problem is that NavigationTiming's fields are not required, so they can have NULL values. Druid ingestion transforms NULL values for numerical metrics into 0s. For time measures we cannot count the absence of a metric as 0, because 0 is not a neutral value (it's the "best" value a metric can have). There are ways to work around this for Druid, but when it comes to Pivot, it cannot be easily solved (see the NULL-handling sketch after this list).
  • NavigationTiming's raw metrics are not of much value in Pivot. Pivot will show you the sum of all time measures for a given metric in a given time range, but it's not the absolute sum we'd be interested in, rather a percentile value. There are ways to configure average metrics in Pivot using the YAML config file, but those are not scalable and would only provide averages, which are probably not interesting for performance measures.
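
To make the NULL-to-zero issue concrete, here is a minimal Spark sketch (table, field and output names are hypothetical, not the actual refinery job) of how NULL time measures could be filtered out before handing rows to Druid, so that absent measurements are not counted as 0 ms:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Minimal sketch only: the table, field and output names are hypothetical,
// not the actual refinery ingestion job.
object NavTimingNullHandling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-null-handling").getOrCreate()

    // Refined EventLogging NavigationTiming events in Hive (hypothetical table name).
    val events = spark.table("event.navigationtiming")

    // Druid ingestion would coerce NULL numeric metrics to 0, which for a time
    // measure is the "best" possible value, so keep only rows where the measure exists.
    val withResponseStart = events.filter(col("event.responsestart").isNotNull)

    withResponseStart
      .select(col("dt"), col("useragent.browser_family"), col("event.responsestart"))
      .write
      .mode("overwrite")
      .parquet("/tmp/navtiming_responsestart")  // staging output before Druid indexing

    spark.stop()
  }
}
```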

A couple of things we can try in order to solve those issues:

  • The latest development version of Druid has "approximate histograms" (http://druid.io/docs/latest/development/extensions-core/approximate-histograms.html), which might help in ingesting percentile metrics that can be displayed in Pivot.
  • We could pre-compute percentiles in the Scala ingestion job, so that Druid would be able to display them as regular metrics (see the sketch after this list). One drawback of this approach is that we'd be forced to choose a granularity for pre-computation (e.g. minutely) and the metric would be frozen at that granularity. That would be somewhat contradictory, since Druid is all about aggregating.
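
As a rough illustration of the pre-computation idea (not the actual ingestion job; table, field and output names are assumptions), the following sketch computes per-minute p50/p75/p95 values for one measure with Spark SQL, which Druid could then ingest as plain numeric metrics:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_trunc, expr}

// Sketch only: names are hypothetical, and the chosen granularity (minutely) is
// frozen into the output, which is the drawback mentioned above.
object NavTimingPercentiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-percentiles").getOrCreate()

    val events = spark.table("event.navigationtiming")
      .filter(col("event.responsestart").isNotNull)

    // Pre-compute per-minute percentiles for one measure.
    val perMinute = events
      .groupBy(date_trunc("minute", col("dt").cast("timestamp")).as("minute"))
      .agg(
        expr("percentile_approx(event.responsestart, 0.5)").as("responsestart_p50"),
        expr("percentile_approx(event.responsestart, 0.75)").as("responsestart_p75"),
        expr("percentile_approx(event.responsestart, 0.95)").as("responsestart_p95")
      )

    // Druid can ingest these rows as plain numeric metrics, but they cannot be
    // re-aggregated correctly at coarser granularities (percentiles do not sum).
    perMinute.write.mode("overwrite").parquet("/tmp/navtiming_percentiles_minutely")

    spark.stop()
  }
}
```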

To conclude, we can look at "approximate histograms", update our Druid version, etc. to get NavigationTiming into Pivot. But before that we'd like to have a simpler schema being ingested periodically from Hive to Druid to Pivot. We'll pause this task until we've successfully finished the pipeline that ingests simple Hive schemas, and then resume it to fix these problems.

Change 386882 merged by jenkins-bot:
[analytics/refinery/source@master] Add core class and job to import EL hive tables to Druid

https://gerrit.wikimedia.org/r/386882

fdans reopened this task as Open.
fdans moved this task from Wikistats to Blocked on the Analytics board.

After our changes to bucketize numeric dimensions, we think we can load this data into Turnilo and it will actually be pretty useful. Stay tuned; we will probably get this done this week.
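
For reference, a minimal sketch of what bucketizing a numeric time measure into string dimension buckets could look like (bucket boundaries, labels and names are hypothetical, not the actual refinery configuration):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Sketch only: bucket boundaries and labels are hypothetical, not the actual
// refinery configuration. The idea is to turn a millisecond measure into a
// low-cardinality string dimension that Turnilo can split and filter on.
object NavTimingBucketize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-bucketize").getOrCreate()

    val events = spark.table("event.navigationtiming")

    val bucketized = events.withColumn(
      "responsestart_bucket",
      when(col("event.responsestart").isNull, "undefined")
        .when(col("event.responsestart") < 100, "0-100ms")
        .when(col("event.responsestart") < 500, "100-500ms")
        .when(col("event.responsestart") < 2000, "500-2000ms")
        .otherwise("2000ms+")
    )

    bucketized.write.mode("overwrite").parquet("/tmp/navtiming_bucketized")
    spark.stop()
  }
}
```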

Change 464833 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add druid_load.pp to refinery jobs

https://gerrit.wikimedia.org/r/464833

Change 464833 abandoned by Mforns:
Add druid_load.pp to refinery jobs

Reason:
already taken care of by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/

https://gerrit.wikimedia.org/r/464833

Just a note that I'm deleting the Druid dataset temporarily, to apply some renames and productionize the final job.
Will be back up within 1 day hopefully.

Milimetric raised the priority of this task from Medium to High. Oct 18 2018, 5:17 PM

Looks great! Already I'm finding interesting facts about Chrome 69 vs Chrome 70.

I backfilled the last 3 months of data. This is now productionized!
Data will continue to be imported automatically every hour
(with a 5 hour lag to allow for previous collection and refinement of EL events into Hive).
Next steps are:

  • Write comprehensive documentation about EventLoggingToDruid ingestion.
  • Remove the confusing Count metric from the datasource in Turnilo, or at least uncheck it by default (and make the default the actual eventCount).
  • Try to add a new metric to the datasource, eventCountPercentage, that normalizes eventCount splits by the total aggregate, so that time-measure buckets become percentage-of-total values instead of frequencies. This way they will not vary with throughput changes or seasonality, and will be a lot easier to follow (not sure if this will be possible, though; a rough sketch follows below).
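
As a sketch of that normalization idea, done at ingestion time rather than as a Druid/Turnilo post-aggregation (all names below are hypothetical), per-bucket counts could be divided by the total count for the same time slice:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, sum}

// Sketch only: input and column names are hypothetical. Computes, per hour, the
// share of events that fall into each time-measure bucket, so the value becomes
// a percentage of total rather than a raw frequency.
object NavTimingBucketShares {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("navtiming-bucket-shares").getOrCreate()

    // Hypothetical input: already bucketized events with an "hour" column.
    val bucketized = spark.table("event.navigationtiming_bucketized")

    val perBucket = bucketized
      .groupBy(col("hour"), col("responsestart_bucket"))
      .agg(count("*").as("eventCount"))

    // Normalize each bucket's count by the hourly total.
    val hourWindow = Window.partitionBy(col("hour"))
    val withShare = perBucket.withColumn(
      "eventCountPercentage",
      col("eventCount") * 100.0 / sum(col("eventCount")).over(hourWindow)
    )

    withShare.write.mode("overwrite").parquet("/tmp/navtiming_bucket_shares")
    spark.stop()
  }
}
```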

In any case, these items will not be part of this task; I will tackle them as part of T206342.
Will move this task to Done in Analytics-Kanban.
Cheers!