Create 1-off tsv files that dashiki would source with standard metrics from datalake
|Resolved||None||T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics|
|Resolved||None||T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days|
|Resolved||odimitrijevic||T130256 Wikistats 2.0.|
|Resolved||None||T143924 Replacing standard edit metrics in dashiki with data from new edit data depot|
|Resolved||Milimetric||T152034 Create 1-off tsv files that dashiki would source with standard metrics from datalake|
- Write queries to run metrics for all wikis
- Create a massive TSV file, split per wiki on the 1st column
- Split the file per wiki per metric
- Manually rsync the per-wiki-per-metric files to datasets
- Update dashiki (keep the URL naming convention, e.g. PagesCreated/enwiki)
- Create a dynamically partitioned table (https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions), partitioned by wiki and metric.
In this mode, partitions are created as data is inserted; since we have a strong data model, we can confidently create partitions when selecting data.
If run with one reducer, one file is created per metric per wiki, which is the way dashiki expects them.
- Run queries
- Sync files over
- Set up dashiki dashboards that source metrics
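The steps above can be sketched end-to-end in shell. Everything here is an assumption for illustration: the column layout, the table name, the `staging_metrics` source table, the dynamic-partition settings, and the target paths are all guesses, not the real schema.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Part 1: sketch of the dynamically partitioned table from the last bullet.
# The HQL is written to a file rather than run, since Hive isn't available
# here; it would be submitted with `hive -f metrics.hql`.
cat > metrics.hql <<'HQL'
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- With a single reducer, each (wiki_db, metric) partition ends up as one
-- file, which is the shape dashiki expects.
SET mapred.reduce.tasks = 1;

CREATE TABLE IF NOT EXISTS mediawiki_metric (dt STRING, value BIGINT)
PARTITIONED BY (wiki_db STRING, metric STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Partition columns come last in the SELECT; partitions are created as
-- rows are inserted. staging_metrics is a hypothetical source table.
INSERT OVERWRITE TABLE mediawiki_metric PARTITION (wiki_db, metric)
SELECT dt, value, wiki_db, metric
FROM   staging_metrics;
HQL

# Part 2: the manual variant -- split one massive TSV
# (wiki_db<TAB>metric<TAB>date<TAB>value) into a file per wiki per metric.
# A tiny inline sample stands in for the real query output.
printf 'enwiki\tdaily_edits\t2017-01-09\t100\n'  >  all_metrics.tsv
printf 'dewiki\tdaily_edits\t2017-01-09\t50\n'   >> all_metrics.tsv
printf 'enwiki\tdaily_pages\t2017-01-09\t7\n'    >> all_metrics.tsv

awk -F'\t' '{
    dir = "split/" $2                  # one directory per metric
    if (!(dir in made)) { system("mkdir -p " dir); made[dir] = 1 }
    f = dir "/" $1 ".tsv"              # one file per wiki
    print $3 "\t" $4 >> f
    close(f)
}' all_metrics.tsv

# The split files would then be rsynced to the datasets host by hand,
# e.g.: rsync -av split/ <datasets-host>:<path>   (host and path unknown)
```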
I found a small bug in event_user_is_anonymous: it's always false. So I couldn't run the metrics that look at that flag, but I ran all the other 7. The latest script code for those is up in gerrit, and the script running them is doing it one by one in a screen on stat1002. I tested all of them on small data and they worked, so I'm pretty confident we'll have nice clean data Monday. Which is why I'm jinxing myself... hm...
:) the jinxes worked, some of the metrics didn't work when I tried to run them one after the other in a shell script. I'll try again one by one. Thanks for starting the job, I think it's ok even if we can't compute some of these metrics. The list we came up with is somewhat arbitrary anyway, we can always add to it.
aha! Found the tricky bug. When trying to re-insert data with dynamic partitions, it's not enough to delete the old files from HDFS; you also have to rebuild the partitions for the table, otherwise new inserts will do nothing.
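For reference, a sketch of the metastore-side fix this implies. Whether an explicit DROP PARTITION or an MSCK REPAIR is the right tool depends on the Hive version, and the partition values below are just examples; the statements are written out rather than executed, since Hive isn't available here.

```shell
# Deleting partition directories from HDFS leaves their entries in the Hive
# metastore; an INSERT OVERWRITE with dynamic partitions can then behave as
# if the old data were still there.
cat > repair.hql <<'HQL'
-- Remove stale metadata for partitions whose files were deleted by hand:
ALTER TABLE mediawiki_metric DROP IF EXISTS PARTITION (wiki_db='enwiki', metric='daily_edits');
-- Re-register any partitions that exist on HDFS but not in the metastore:
MSCK REPAIR TABLE mediawiki_metric;
HQL
cat repair.hql   # would be run with: hive -f repair.hql
```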
Even dropping the table and re-creating it didn't work. How strange. I'll try making a different table with a different name and partition order; if that fails too, something else is going on.
Awesome. I moved your new data to /wmf, deleted the old mediawiki_metric table, and made a new mediawiki_metric_result table; I did not test inserting anything into it. This time I partitioned it by metric, wiki_db instead of wiki_db, metric. That lets us see the problem better and makes the files easier for dashiki to copy. And the problem happened again:
ls -l /mnt/hdfs/wmf/data/wmf/mediawiki/metric_result/metric\=daily_edits/ | wc -l
I don't know what's going on, or why some of the metrics only give results for some of the wikis. It looks almost like it goes in alphabetical order and gets tired somewhere halfway through. Maybe... oh! Maybe it's a limit we didn't set? Because the monthly metrics all complete, and those have less data! The test would be to divide the total number of rows generated by 30 and see if that's higher than what the monthly data generates, and close to some round number that could be a limit we're missing.
So that's my theory right now, will pick up again tomorrow. Fortunately, the jobs run really fast, most metrics complete in under 5 minutes. So we can test ideas fairly quickly.
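The limit theory can be checked with quick arithmetic on the output files. The counts below are made-up placeholders for illustration; the real numbers would come from `wc -l` on the daily and monthly result files.

```shell
# If daily output is being truncated at some round-number limit, then
# total_daily_rows / 30 should land near that limit and sit above what the
# monthly queries produce per metric. Placeholder numbers, not real results.
total_daily_rows=3000000      # e.g. wc -l < daily_edits.tsv
monthly_rows=800              # rows the monthly metric produced
per_day=$((total_daily_rows / 30))
echo "rows per day: $per_day (monthly produced $monthly_rows)"
if [ "$per_day" -gt "$monthly_rows" ]; then
    echo "consistent with a per-query row limit around $per_day"
fi
```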