Add breakdown of zero results rate by language/project pair to dashboard
Closed, Resolved · Public · 4 Estimated Story Points

Assigned To

Authored By

	• Deskana
	Aug 27 2015, 8:22 PM

Description

What it says on the tin, please.

Details

	Subject	Repo	Branch	Lines +/-
	Deploy all bunch of cool stuff	wikimedia/discovery/dashboard	master	+12 -4
	Adds ZRR breakdown by language/project (Beta)	wikimedia/discovery/rainbow	master	+155 -1


Related Objects

Status	Assigned	Task
Resolved	mpopov	T110590 Add breakdown of zero results rate by language/project pair to dashboard
Resolved	• csteipp	T109384 Security review of apache/avro and nmred/kafka-php
Resolved	bd808	T111851 Package the Avro PHP library for easier Composer usage
Resolved	Ironholds	T112295 Design and agree on an Avro schema for cirrus search request logging to hadoop
Resolved	mpopov	T126244 Add data collection for getting zero results rate by language/project

Event Timeline

• Deskana created this task. Aug 27 2015, 8:22 PM

• Deskana raised the priority of this task from to Needs Triage.

• Deskana updated the task description.

• Deskana added projects: Discovery-Analysis (Current work), Discovery-ARCHIVED.

• Deskana subscribed.

Restricted Application added a subscriber: Aklapper. Aug 27 2015, 8:22 PM

• Deskana triaged this task as Medium priority. Aug 27 2015, 8:23 PM

Ironholds set Security to None. Sep 1 2015, 8:09 PM

Ironholds edited a custom field.

This will be significantly easier when T109384 is complete, so I am marking this as being blocked by that.

• csteipp closed subtask T109384: Security review of apache/avro and nmred/kafka-php as Resolved. Sep 9 2015, 5:14 PM

mpopov added a subtask: T112295: Design and agree on an Avro schema for cirrus search request logging to hadoop. Sep 11 2015, 11:12 PM

• Deskana removed a project: Discovery-Analysis (Current work). Sep 15 2015, 8:22 PM

• Deskana moved this task from Needs triage to Analysis on the Discovery-ARCHIVED board.

Moving this out of the sprint to reflect reality, as it's been bumped in priority several times. I've placed this card at the top of the backlog for now.

• Deskana closed subtask T112295: Design and agree on an Avro schema for cirrus search request logging to hadoop as Resolved. Sep 23 2015, 5:00 AM

Practically speaking:

This is definitely something we want to work on after the Cirrus->Kafka work is done;
We still have no idea how we're going to visualise that many pairings in a satisfactory way (other than infinitely long sets of dropdowns).

In T110590#1743064, @Ironholds wrote:

Practically speaking:

This is definitely something we want to work on after the Cirrus->Kafka work is done;

This is done. We need to get a UDF that marks a search request as concluding in zero or nonzero results. Then we can just aggregate by wikiid and zero_results.

We still have no idea how we're going to visualise that many pairings in a satisfactory way (other than infinitely long sets of dropdowns).

I was thinking of doing something similar to the "Tile by zoom level" (http://discovery.wmflabs.org/maps/#tiles_total_by_zoom) where the user can choose an arbitrary combination of zoom levels to visualize simultaneously. So we could have two of those and let the user select arbitrary pairs.

P.S. Extracting 'Language' and 'Project' from wikiid will be trivial after this PR is merged https://github.com/Ironholds/wmf/pull/5 :)
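For illustration only, here is a minimal Python sketch of what splitting a wikiid into a language and a project could look like. It is not the code from the wmf package or the linked PR. The suffix table, the `split_wikiid` name, and the handling of language-less wikis are all assumptions, and the table only covers a subset of projects (an ID such as metawiki would be misclassified).

```python
# Hypothetical suffix table: longer suffixes come first so that "wiki"
# only matches when nothing more specific does. Illustrative subset only.
PROJECT_SUFFIXES = [
    ("wiktionary", "Wiktionary"),
    ("wikibooks", "Wikibooks"),
    ("wikinews", "Wikinews"),
    ("wikiquote", "Wikiquote"),
    ("wikisource", "Wikisource"),
    ("wikiversity", "Wikiversity"),
    ("wikivoyage", "Wikivoyage"),
    ("wiki", "Wikipedia"),
]

# Assumed special cases: projects whose wikiid carries no language prefix.
LANGUAGELESS = {"commonswiki": "Commons", "wikidatawiki": "Wikidata"}


def split_wikiid(wikiid):
    """Return a (language, project) pair; language is None for language-less wikis."""
    if wikiid in LANGUAGELESS:
        return (None, LANGUAGELESS[wikiid])
    for suffix, project in PROJECT_SUFFIXES:
        if wikiid.endswith(suffix) and len(wikiid) > len(suffix):
            # Assumption: underscores in the ID stand for hyphens in the language code.
            language = wikiid[: -len(suffix)].replace("_", "-")
            return (language, project)
    raise ValueError(f"unrecognised wikiid: {wikiid!r}")
```

With a mapping like this, each row of a per-wikiid aggregate can be labelled with its language/project pair before it reaches the dashboard.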

Okay. So I've got a query that works and gets what we want. Problem: we have A LOT of wikis. Specifically, for 2015-11-10, the query returns nonzero/zero results counts for 840 wikis! That means the dataset containing these aggregates is going to grow by ~840 rows every day. That's...not good.

Do we want to limit this to specific wikis? Daily top 100? Daily top 10? Here are the top 20 wikis for 2015-11-10 by # of nonzero-result queries:

wikiid	nonzero	zero
enwiki	36098192	13447949
dewiki	7758329	3313155
eswiki	5458995	4478491
ruwiki	3731576	1997517
frwiki	3256820	1848667
ptwiki	2362003	1625427
itwiki	1863855	1555819
jawiki	1856930	1569374
nlwiki	1048732	666615
plwiki	887922	609215
arwiki	829061	631924
zhwiki	796523	1194192
trwiki	670233	361888
cswiki	646477	407798
svwiki	582628	312900
commonswiki	526404	1104818
enwiktionary	452353	357523
idwiki	436878	316690
wikidatawiki	414617	1001767
fawiki	390492	370154
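As a rough sketch of the "daily top n" option, the snippet below ranks wikis by nonzero-result count and derives a zero results rate as zero / (nonzero + zero). The counts are copied from four rows of the table above. The function names are hypothetical, and ranking by nonzero count is just one possible choice.

```python
# (nonzero, zero) counts for 2015-11-10, copied from four rows of the table.
counts = {
    "enwiki": (36098192, 13447949),
    "dewiki": (7758329, 3313155),
    "zhwiki": (796523, 1194192),
    "commonswiki": (526404, 1104818),
}


def zero_results_rate(nonzero, zero):
    """Share of request sets that ended with zero results."""
    total = nonzero + zero
    return zero / total if total else float("nan")


def top_n_by_nonzero(counts, n):
    """Keep the n wikis with the most nonzero-result queries, paired with their rate."""
    ranked = sorted(counts.items(), key=lambda item: item[1][0], reverse=True)
    return [(wikiid, zero_results_rate(*pair)) for wikiid, pair in ranked[:n]]


for wikiid, rate in top_n_by_nonzero(counts, 3):
    print(f"{wikiid}\t{rate:.4f}")
```

Capping the list this way keeps the stored dataset at n rows per day instead of roughly 840, at the cost of dropping the long tail of smaller wikis.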

Thoughts, @Ironholds & @Deskana?

See this is why I said it was a hard problem ;p

P.S. Query for posterity/future ref:

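-- Per-wiki counts of search request sets that ended in nonzero vs. zero results.
USE ebernhardson;
SELECT wikiid,
  SUM(results.outcome) AS nonzero,
  COUNT(*)-SUM(results.outcome) AS zero
FROM (
  -- outcome = 1 when the last request in the set returned at least one hit, else 0
  SELECT wikiid, IF(requests.hitstotal[SIZE(requests.querytype)-1] > 0, 1, 0) AS outcome
  FROM cirrussearchrequestset
  -- restrict to 2015-11-10
  WHERE year = 2015 AND month = 11 AND day = 10
) AS results
GROUP BY wikiid;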
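To make the query's logic easier to follow, here is the same aggregation as a small Python sketch. Each request set is assumed to be a (wikiid, list of hit totals) pair. That input shape, the sample data, and the `aggregate` name are invented for illustration and are not the real table schema.

```python
from collections import defaultdict


def aggregate(request_sets):
    """Count request sets per wikiid by whether the last request returned any hits."""
    totals = defaultdict(lambda: [0, 0])  # wikiid -> [nonzero, zero]
    for wikiid, hits_totals in request_sets:
        # Mirrors IF(hitstotal[last] > 0, 1, 0); an empty list is treated as zero here.
        outcome = 1 if hits_totals and hits_totals[-1] > 0 else 0
        totals[wikiid][0 if outcome else 1] += 1
    return {wikiid: tuple(pair) for wikiid, pair in totals.items()}


# Made-up sample data, not real log rows.
sample = [
    ("enwiki", [0, 12]),   # second attempt found hits: nonzero
    ("enwiki", [0]),       # zero
    ("dewiki", [5, 0]),    # last request found nothing: zero
    ("dewiki", []),        # no requests recorded: zero
]
print(aggregate(sample))
```

Only the final request in each set decides the outcome, so a set that finds results on a later attempt counts as nonzero.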

USE ebernhardson;

I laughed more than I should've done at this.

To answer the actual question I was asked, it might be good to have the top n projects (for, say, n=3) on the dashboard somewhere, but the question is... where? Clutter is bad, so we need to be careful about throwing more data in because we can.

This would be a good topic of discussion for the Analysis meeting this afternoon.

debt added a project: Discovery-Analysis (Current work). Jan 26 2016, 9:11 PM

debt edited a custom field.

• Deskana moved this task from Analysis to On Sprint Board on the Discovery-ARCHIVED board. Feb 2 2016, 9:15 PM

We've solved some of the questions about visualisations here, because we did something very similar with the portal dashboards in T123347: Include geolocation data in portal dashboards. Given that, this can be reprioritised because there aren't as many outstanding product questions.

This still represents a not insubstantial amount of engineering work, though.

matmarex reopened subtask T126244: Add data collection for getting zero results rate by language/project as Open. Feb 8 2016, 6:32 PM

mpopov claimed this task. Feb 13 2016, 12:22 AM

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

mpopov edited a custom field.

Change 270449 had a related patch set uploaded (by Bearloga):
Adds ZRR breakdown by language/project
- Depends on Iad9600b11443d9bed6aafcc1dd0e11ce5eda0e8e
- Dynamically populates list of languages and projects
- Lets user select arbitrary combinations to visualize

https://gerrit.wikimedia.org/r/270449

gerritbot added a project: Patch-For-Review. Feb 13 2016, 12:25 AM

Progress update:

Screen Shot 2016-02-16 at 4.28.56 PM.png (547×1 px, 125 KB)

Now waiting for CR from @Ironholds before we deploy to the beta instance. Here's how it looks:

Screen Shot 2016-02-17 at 11.40.49 AM.png (625×1 px, 189 KB)

mpopov moved this task from In progress to Needs review on the Discovery-Analysis (Current work) board. Feb 17 2016, 8:08 PM

Change 270449 merged by OliverKeyes:
Adds ZRR breakdown by language/project (Beta)

https://gerrit.wikimedia.org/r/270449

Ironholds moved this task from Needs review to Done on the Discovery-Analysis (Current work) board. Feb 18 2016, 12:52 PM

Live on beta instance for testing: http://discovery-beta.wmflabs.org/metrics/#failure_langproj

@Deskana have fun and let us know if you run into any problems. If you're happy with it after a few days (or a week?) we'll push it out to production.

I'm personally not satisfied with the performance hit at startup (caused by reading in the 2 new datasets which are substantially larger than the others we have) but there's also not much we can do about that. It's just going to be a slow initial experience for whoever is the first person to open the dashboard on any given day. I wonder if we should move this out of the metrics dashboard and into its own "experimental" dashboard (where the forecasting dash lives). That way Dan and others can still use it but without it having an impact on the main dashboard. @Ironholds, thoughts?

I'm not sure if moving is necessarily the solution. Like, this should eventually live in those dashboards.

Do we gain anything if we do all the processing server-side? Like, we could output both a flat TSV for transparency/reproducibility purposes, and a serialised .RData file in which all the computations have already been done, and have the dashboard rely on the RData. It should be much faster to load.
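A rough sketch of that idea follows, with Python's pickle standing in for .RData. The file names, the `write_outputs` and `load_cache` helpers, and the choice of precomputing a per-wiki rate are all assumptions made for illustration; this is not how the dashboard is actually built.

```python
import csv
import pickle
from pathlib import Path


def write_outputs(rows, out_dir):
    """Write raw counts as a TSV and the precomputed rates as a serialised cache."""
    out_dir = Path(out_dir)
    tsv_path = out_dir / "zrr_langproj.tsv"    # hypothetical file name
    cache_path = out_dir / "zrr_langproj.pkl"  # hypothetical file name

    # Flat, human-readable copy for transparency and reproducibility.
    with tsv_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["wikiid", "nonzero", "zero"])
        writer.writerows(rows)

    # Do the computation once, server-side, and serialise the result.
    processed = {w: z / (n + z) for w, n, z in rows if n + z > 0}
    with cache_path.open("wb") as handle:
        pickle.dump(processed, handle)
    return tsv_path, cache_path


def load_cache(path):
    """What the dashboard would do at startup: read the ready-made object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

The dashboard would then only deserialise an already-processed object at startup, while the TSV stays available for anyone who wants to check the numbers.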

Change 271822 had a related patch set uploaded (by Bearloga):
Deploy all bunch of cool stuff

https://gerrit.wikimedia.org/r/271822

Change 271822 merged by OliverKeyes:
Deploy all bunch of cool stuff

https://gerrit.wikimedia.org/r/271822

Very nice work!

• Deskana closed subtask T126244: Add data collection for getting zero results rate by language/project as Resolved. Mar 31 2016, 10:27 PM

debt moved this task from Done to Resolved on the Discovery-Analysis (Current work) board. Jul 20 2016, 4:23 PM

	F3366836: Screen Shot 2016-02-17 at 11.40.49 AM.png
	Feb 17 2016, 7:44 PM

	F3365115: Screen Shot 2016-02-16 at 4.28.56 PM.png
	Feb 17 2016, 12:31 AM
