
Create 1-off tsv files that dashiki would source with standard metrics from datalake
Closed, Resolved | Public | 13 Estimated Story Points



Event Timeline

  • Write queries to run metrics for all wikis
  • Create one massive TSV file, keyed by wiki in the 1st column
  • Split the file per wiki per metric
  • Manually rsync the per-wiki, per-metric files to datasets
  • Update dashiki (keep the URL naming convention, e.g. PagesCreated/enwiki)
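
The splitting step in the list above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the task: the combined file is assumed to have four tab-separated columns (wiki, metric, date, value) with no header, the function name `split_tsv` is hypothetical, and the output layout `<metric>/<wiki>.tsv` is a guess modelled on the PagesCreated/enwiki naming.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_tsv(combined_path, out_dir):
    """Split a combined TSV into one file per metric per wiki.

    Assumed (hypothetical) input columns: wiki, metric, date, value.
    Assumed output layout: <out_dir>/<metric>/<wiki>.tsv
    """
    out_dir = Path(out_dir)
    rows_by_target = defaultdict(list)

    # Group rows in memory by (metric, wiki). Fine for a sketch; a truly
    # massive file would need to be pre-sorted and streamed instead.
    with open(combined_path, newline="") as f:
        for wiki, metric, date, value in csv.reader(f, delimiter="\t"):
            rows_by_target[(metric, wiki)].append((date, value))

    written = []
    for (metric, wiki), rows in sorted(rows_by_target.items()):
        target = out_dir / metric / f"{wiki}.tsv"
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(["date", "value"])
            writer.writerows(sorted(rows))  # chronological for ISO dates
        written.append(target)
    return written
```

Calling `split_tsv("all_metrics.tsv", "out")` (both names hypothetical) would leave files such as `out/PagesCreated/enwiki.tsv`, which could then be rsynced as one directory tree.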
Nuria changed the point value for this task from 0 to 8.

Let's update points on completion.


In this mode, partitions are created when data is harvested. Since we have a strong data model, we can confidently create partitions when selecting data.
If running with one reducer, files will be created per metric per wiki, the way dashiki expects them.

  • Run queries
  • Sync files over
  • Set up dashiki dashboards that source metrics
Nuria changed the point value for this task from 8 to 13. Dec 15 2016, 5:03 PM

Path: run for 1 small wiki, verify that dynamic partitioning is working, set up dashiki, run for all wikis, make sure that dashiki can source all wikis.

I found a small bug in event_user_is_anonymous, which is that it's always false. So I couldn't run the metrics that look at that flag, but I ran all the other 7. The latest script code for those is up in gerrit and the script that's running them is doing it one-by-one in a screen on stat1002. I tested all of them on small data and they worked, so I'm pretty confident that we'll have nice clean data Monday. Which is why I'm jinxing myself... hm...

@Milimetric: Bug found and corrected in scala code, new dataset computation launched...
Multi jinx !

:) the jinxes worked, some of the metrics didn't work when I tried to run them one after the other in a shell script. I'll try again one by one. Thanks for starting the job, I think it's ok even if we can't compute some of these metrics. The list we came up with is somewhat arbitrary anyway, we can always add to it.

Aha! Found the tricky bug. When trying to re-insert some data with dynamic partitions, it's not enough to delete the old files from hdfs; you have to rebuild the partitions for the table, otherwise new inserts will do nothing.

Wow ! That's a tricky one ! Thanks @Milimetric for finding that !
For the one-off we can probably just drop the table and recreate it when needed.
But for prod time, this won't work (except if we find a way to swap table names).

even dropping the table and re-creating didn't work. How strange. I'll try making a different table with a different name and partition order, and then if that fails it means something else is going on.

@Milimetric: corrected data is at usual place (/user/joal/wmf/data/wmf/mediawiki/history).
I checked it using: SELECT event_user_is_anonymous, count(1) group by event_user_is_anonymous and got:

true       364861763
null        67953846
false     3079873688

Seems credible :)

Awesome. I moved your new data to /wmf, deleted the old mediawiki_metric table and made a new mediawiki_metric_result table, and I did not test inserting anything into it. This time I partitioned it by (metric, wiki_db) instead of (wiki_db, metric). That allows us to see the problem better and is easier to copy for dashiki to use. And the problem happened again:

ll /mnt/hdfs/wmf/data/wmf/mediawiki/metric_result/metric\=daily_edits/ | wc -l

I don't know what's going on, and why some of the metrics only give results for some of the wikis. It looks almost like it goes in alphabetic order and gets tired somewhere halfway through. Maybe .... oh! maybe one of the limits we didn't set? Because the monthly metrics all complete, and those have less data! The test would be to divide the total number of rows generated by 30 and see if that's higher than what the monthly data generates and close to some round number that could be a limit we're missing.
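
A rough way to see which metrics stop partway through the wikis is to compare the wiki partitions present under each metric. The sketch below assumes the standard Hive directory layout `metric=<name>/wiki_db=<name>` under the result root (only the `metric=daily_edits` level appears in the command above; the `wiki_db=` level is an assumption), and the function names are hypothetical.

```python
from pathlib import Path


def wikis_per_metric(result_root):
    """Map each metric=<name> directory to the set of wiki_db values under it.

    Assumes Hive-style partition directories: metric=<m>/wiki_db=<w>.
    """
    found = {}
    for metric_dir in Path(result_root).glob("metric=*"):
        metric = metric_dir.name.split("=", 1)[1]
        found[metric] = {
            p.name.split("=", 1)[1]
            for p in metric_dir.glob("wiki_db=*")
            if p.is_dir()
        }
    return found


def missing_wikis(found):
    """For each metric, list wikis seen under another metric but absent here."""
    all_wikis = set().union(*found.values())
    return {
        metric: sorted(all_wikis - wikis)
        for metric, wikis in found.items()
        if all_wikis - wikis
    }
```

Running `missing_wikis(wikis_per_metric("/mnt/hdfs/wmf/data/wmf/mediawiki/metric_result"))` would list, per metric, the wikis that the complete metrics have but the truncated ones lack. If the missing wikis are all late in the alphabet, that is consistent with a row cap cutting the output off.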

So that's my theory right now, will pick up again tomorrow. Fortunately, the jobs run really fast, most metrics complete in under 5 minutes. So we can test ideas fairly quickly.

@Milimetric : Just had a quick look at the queries, I think we can confirm the LIMIT theory :)
Something to notice: since those queries are run in non-strict mode, they can actually run without a LIMIT even when they have an ORDER BY!
Some advantages of bypassing the rules :)

Let's please talk about putting these metrics in a more "findable" location

I'm all for it, but they should be vetted first and then we have a little name collision problem with Dashiki:CategorizedMetrics to solve.