
Add breakdown of zero results rate by language/project pair to dashboard
Closed, ResolvedPublic4 Estimated Story Points

Description

What it says on the tin, please.

Event Timeline

Deskana raised the priority of this task from to Needs Triage.
Deskana updated the task description. (Show Details)
Deskana subscribed.
Ironholds edited a custom field.

This will be significantly easier when T109384 is complete, so I am marking this as being blocked by that.

Moving this out of the sprint to reflect reality, as it's been bumped in priority several times. I've placed this card at the top of the backlog for now.

Practically speaking:

  1. This is definitely something we want to work on after the Cirrus->Kafka work is done;
  2. We still have no idea how we're going to visualise that many pairings in a satisfactory way (other than infinitely long sets of dropdowns).

> Practically speaking:
>
>   1. This is definitely something we want to work on after the Cirrus->Kafka work is done;

This is done. We need to get a UDF that marks a search request as concluding in zero or nonzero results. Then we can just aggregate by wikiid and zero_results.

>   2. We still have no idea how we're going to visualise that many pairings in a satisfactory way (other than infinitely long sets of dropdowns).

I was thinking of doing something similar to the "Tile by zoom level" graph (http://discovery.wmflabs.org/maps/#tiles_total_by_zoom), where the user can choose an arbitrary combination of zoom levels to visualize simultaneously. So we could have two of those selectors and let the user pick arbitrary language/project pairs.
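A minimal Shiny sketch of that two-selector idea (the toy dataset, column names, and input IDs are made up for illustration; this is not the dashboard's actual code):

library(shiny)

# Toy stand-in for the pre-aggregated dataset; in practice this would be the
# per-day zero results rate by language and project.
zrr <- expand.grid(date = seq(as.Date("2015-11-01"), as.Date("2015-11-10"), by = "day"),
                   language = c("en", "de", "es"),
                   project = c("wiki", "wiktionary"),
                   stringsAsFactors = FALSE)
zrr$zero_rate <- runif(nrow(zrr), 0.2, 0.4)

ui <- fluidPage(
  selectizeInput("languages", "Languages", choices = sort(unique(zrr$language)), multiple = TRUE),
  selectizeInput("projects", "Projects", choices = sort(unique(zrr$project)), multiple = TRUE),
  plotOutput("zrr_plot")
)

server <- function(input, output, session) {
  output$zrr_plot <- renderPlot({
    selected <- zrr[zrr$language %in% input$languages & zrr$project %in% input$projects, ]
    if (nrow(selected) == 0) return(NULL)
    selected$pair <- paste0(selected$language, selected$project)
    # One line per selected language-project pair:
    plot(range(selected$date), range(selected$zero_rate), type = "n",
         xlab = "Date", ylab = "Zero results rate")
    for (p in unique(selected$pair)) {
      with(selected[selected$pair == p, ], lines(date, zero_rate))
    }
  })
}

shinyApp(ui, server)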

P.S. Extracting 'Language' and 'Project' from the wikiid will be trivial once this PR is merged: https://github.com/Ironholds/wmf/pull/5 :)
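In the meantime, a hypothetical illustration of what that extraction looks like (this is not the wmf package's API, and multilingual wikis such as commonswiki or wikidatawiki would need special-casing):

parse_wikiid <- function(wikiid) {
  # Project suffixes to try, most specific first; "wiki" (Wikipedia) goes last.
  projects <- c("wiktionary", "wikibooks", "wikinews", "wikiquote",
                "wikisource", "wikiversity", "wikivoyage", "wiki")
  for (project in projects) {
    if (grepl(paste0(project, "$"), wikiid)) {
      return(c(language = sub(paste0(project, "$"), "", wikiid), project = project))
    }
  }
  c(language = NA, project = wikiid)
}

parse_wikiid("enwiktionary")  # language "en", project "wiktionary"
parse_wikiid("dewiki")        # language "de", project "wiki"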

Okay. So I've got a query that works and gets what we want. Problem: we have A LOT of wikis. Specifically, for 2015-11-10, the query returns nonzero/zero results counts for 840 wikis! That means the dataset containing these aggregates is going to grow by ~840 rows every day. That's...not good.

Do we want to limit this to specific wikis? Daily top 100? Daily top 10? Here are the top 20 wikis for 2015-11-10 by # of nonzero-result queries:

wikiid         nonzero    zero
enwiki         36098192   13447949
dewiki         7758329    3313155
eswiki         5458995    4478491
ruwiki         3731576    1997517
frwiki         3256820    1848667
ptwiki         2362003    1625427
itwiki         1863855    1555819
jawiki         1856930    1569374
nlwiki         1048732    666615
plwiki         887922     609215
arwiki         829061     631924
zhwiki         796523     1194192
trwiki         670233     361888
cswiki         646477     407798
svwiki         582628     312900
commonswiki    526404     1104818
enwiktionary   452353     357523
idwiki         436878     316690
wikidatawiki   414617     1001767
fawiki         390492     370154

Thoughts, @Ironholds & @Deskana?
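For reference, a "daily top N" cap would be easy to apply downstream once the aggregates are in R. A rough dplyr sketch, using a toy stand-in for the per-day, per-wiki counts produced by the query below (column names assumed to match it):

library(dplyr)

# Toy stand-in for the per-day, per-wiki aggregates:
zrr_daily <- data.frame(
  date = as.Date("2015-11-10"),
  wikiid = c("enwiki", "dewiki", "eswiki"),
  nonzero = c(36098192, 7758329, 5458995),
  zero = c(13447949, 3313155, 4478491)
)

zrr_daily %>%
  group_by(date) %>%
  top_n(2, nonzero) %>%   # keep only the top N (here 2; e.g. 100 in production) wikis per day
  ungroup() %>%
  mutate(zero_rate = zero / (zero + nonzero))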

See, this is why I said it was a hard problem ;p

P.S. Query for posterity/future ref:

USE ebernhardson;
-- For each search request set, mark whether its final query returned any
-- results (1 = nonzero, 0 = zero), then count both outcomes per wiki.
SELECT wikiid,
  SUM(results.outcome) AS nonzero,
  COUNT(*) - SUM(results.outcome) AS zero
FROM (
  -- requests.hitstotal and requests.querytype project those fields across the
  -- array of requests; the last element corresponds to the final query in the set.
  SELECT wikiid, IF(requests.hitstotal[SIZE(requests.querytype)-1] > 0, 1, 0) AS outcome
  FROM cirrussearchrequestset
  WHERE year = 2015 AND month = 11 AND day = 10
) AS results
GROUP BY wikiid;

I laughed more than I should've done at this.

To answer the actual question I was asked: it might be good to have the top n projects (for, say, n=3) on the dashboard somewhere, but the question is... where? Clutter is bad, so we need to be careful about throwing in more data just because we can.

This would be a good topic of discussion for the Analysis meeting this afternoon.

We've solved some of the questions about visualisations here, because we did something very similar with the portal dashboards in T123347: Include geolocation data in portal dashboards. So, given that, this can be reprioritised because there aren't as many outstanding product questions.

This still represents a not-insubstantial amount of engineering work, though.

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov edited a custom field.

Change 270449 had a related patch set uploaded (by Bearloga):
Adds ZRR breakdown by language/project
- Depends on Iad9600b11443d9bed6aafcc1dd0e11ce5eda0e8e
- Dynamically populates list of languages and projects
- Lets user select arbitrary combinations to visualize

https://gerrit.wikimedia.org/r/270449

Now waiting for CR from @Ironholds before we deploy to the beta instance. Here's how it looks:

Screen Shot 2016-02-17 at 11.40.49 AM.png (189 KB)

Change 270449 merged by OliverKeyes:
Adds ZRR breakdown by language/project (Beta)

https://gerrit.wikimedia.org/r/270449

Live on beta instance for testing: http://discovery-beta.wmflabs.org/metrics/#failure_langproj

@Deskana have fun and let us know if you run into any problems. If you're happy with it after a few days (or a week?) we'll push it out to production.

I'm personally not satisfied with the performance hit at startup (caused by reading in the 2 new datasets, which are substantially larger than the others we have), but there's also not much we can do about that. It's just going to be a slow initial experience for whoever is the first person to open the dashboard on any given day.

I wonder if we should move this out of the metrics dashboard and into its own "experimental" dashboard (where the forecasting dash lives). That way Dan and others can still use it, but without it having an impact on the main dashboard. @Ironholds, thoughts?

I'm not sure if moving is necessarily the solution. Like, this should eventually live in those dashboards.

Do we gain anything if we do all the processing server-side? Like, we could output both a flat TSV for transparency/reproducibility purposes and a serialised .RData file on which all the computations have already been done, and have the dashboard rely on the .RData. It should be much faster to load.
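A minimal sketch of that split, assuming illustrative file and object names (not the dashboards' actual code):

# -- server-side processing job --
# Toy stand-in for the real aggregation step:
zrr_langproj <- data.frame(date = as.Date("2015-11-10"),
                           language = "en", project = "wiki",
                           nonzero = 36098192, zero = 13447949)
write.table(zrr_langproj, "zrr_langproj.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE)    # flat TSV, for transparency/reproducibility
save(zrr_langproj, file = "zrr_langproj.RData")  # serialised, pre-computed object

# -- dashboard startup --
load("zrr_langproj.RData")  # restores `zrr_langproj` directly; much faster than re-parsing a large TSV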

Change 271822 had a related patch set uploaded (by Bearloga):
Deploy all bunch of cool stuff

https://gerrit.wikimedia.org/r/271822

Change 271822 merged by OliverKeyes:
Deploy all bunch of cool stuff

https://gerrit.wikimedia.org/r/271822