Thanks Luca for the pastes.
I changed the syntax a bit to adapt it to Python 3,
and tested that everything is ok :].
Luckily, the checksums match the ones generated by python2,
so we won't need to change all the checksums.
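The usual Python 2 → 3 pitfall with checksums is hashing `str` instead of bytes. As a minimal sketch of the kind of hashing that stays stable across both (the function name is hypothetical; the real refinery scripts may differ):

```python
import hashlib

def checksum(payload):
    """Hash raw bytes so the digest is identical under Python 2 and Python 3."""
    if isinstance(payload, str):
        # On Python 3, encode str to bytes explicitly before hashing.
        payload = payload.encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

As long as the bytes fed to the hash are the same, the digest is the same regardless of interpreter version.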
Here's the change:
Tue, Sep 17
@Nuria yep, makes sense.
Yes! If we manage to use that UDF to populate a Hive table, I think it would be easy to configure a reportupdater job to generate the desired reports.
Mon, Sep 16
@elukey OK thanks!
Will add those as a TODO for me.
The following scripts should be removable after migration to drop-older-than:
- refinery-drop-banner-activity-partitions:#!/usr/bin/env python → seems unused in puppet
- refinery-drop-eventlogging-partitions:#!/usr/bin/env python → seems unused in puppet
- refinery-drop-hourly-partitions:#!/usr/bin/env python → seems unused in puppet
And these two should be removable too, I think. I wonder why they are still in puppet; did I forget to migrate them?
- refinery-drop-hive-partitions:#!/usr/bin/env python
- refinery-drop-webrequest-partitions:#!/usr/bin/env python
Fri, Sep 13
See the final format of the dumps, chosen after the community survey, here: T224459#5491080
So, this is the final format of the MediaWiki history dumps:
Thu, Sep 12
Can you Product Analytics please, as Greg suggests, request a repository in Gerrit to store your team's Oozie jobs?
I can do that as well, but I thought you'd like to choose a repo name, owner, and type.
Can you please reply to the previous comment? Thanks! (Also, what is a "user" in which specific system?)
Wed, Sep 11
That patch should do the trick,
but we should wait about 2 months before merging.
Netflow data from 90 days ago still has the old schema and would produce useless and confusing data.
In 60 days we can merge this, and it will hopefully work.
Mon, Sep 9
I'm not exactly sure how to fix it, but perhaps the Turnilo config needs an explicit wmf_netflow dataCube declared?
Wed, Sep 4
I thought the 'hourly' in pageview_hourly meant aggregated hourly, not updated hourly.
In general, I would name a data set after what it contains, rather than how it is processed or when it is updated.
Now, edit_hourly is partitioned by snapshot, not by hour. So it's structurally different from pageview_hourly.
We could mirror that in the name. Maybe edit_history_hourly? To be a bit shorter than edits_history_aggregated_hourly?
Question: Should we have the 's' at the end of edits or not? I didn't put it there because other Hive data sets seem to lean towards the singular word.
I reviewed the survey responses yesterday, both on survey results and comments in the Phabricator task(s).
There were good insights!
Mon, Sep 2
Wed, Aug 28
We can do it but let's tackle that once we have done the dataset releases we have as high priority for this quarter. Does that sound good?
Sure! Makes sense.
@MNeisler the count metric was removed from Turnilo, let me know if there are any problems. Thanks!
@Nuria, who would be responsible for migrating netflow data set to be ingested into the event pipeline?
I'm not sure if we can change the granularity of the data within a single data set, say have the latest 3 months be minutely, and the rest be 5-minutely. I assume not.
But I will start testing how Druid/Turnilo behave when overriding existing data with new data that does not contain the fields you mentioned.
Tue, Aug 27
So, the idea for solution #2 (see task description) is the following:
Please, review this task and see if it makes sense.
I assume from what Luca told me, unless you tell me otherwise, that keeping the data in Hive/HDFS for only 3 months is not enough.
Please, review this task and let us know how long would you like to keep the netflow data in Druid/Turnilo.
Or, put another way: how valuable is it to you to have the netflow data accessible in Druid/Turnilo for a long time?
This task is not about the netflow data in Hive/HDFS, I'll create another one for that :-)
I created this page in Wikitech; it explains a bit how data_purge.pp works and how the retention period and timer interval interact.
Please, feel free to modify!
Mon, Aug 26
@cchen You should be able to access Hue now.
Please, reach out if you have any problems.
Fri, Aug 23
- IIRC the retention policy is about keeping AT MOST 90 days. So I'd rather keep 65 days, making sure we always have 2 months of data when the geoeditors job runs, and try not to go over, instead of guaranteeing a full 90 days and only deleting once there are at most 90+31 = 121 days of data.
Thu, Aug 22
We should ensure that we keep at least the last 90 days.
And delete the data as soon as possible after that.
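For illustration, a drop-older-than style job boils down to computing a cutoff timestamp and dropping everything older. A minimal sketch (the function name and signature are hypothetical; the real refinery script has its own flags):

```python
from datetime import datetime, timedelta

def drop_cutoff(retention_days, now=None):
    """Return the datetime before which partitions should be dropped."""
    now = now or datetime.utcnow()
    return now - timedelta(days=retention_days)

# With a 65-day retention (a safe 2 months for the geoeditors job),
# each run would delete everything older than `cutoff`:
cutoff = drop_cutoff(65)
```

Running this on a timer keeps the retained window close to 65 days, rather than letting data pile up to 90+31 days between deletions.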
Aug 19 2019
Aug 14 2019
Aug 8 2019
Is that in general or only for longer retention times? Maybe we can keep aggregated data (without source/dest IPs) for long-term storage?
We can also use IP prefixes instead of IP addresses, less ideal for us and I don't know if it helps much with the cardinality issue.
Aug 7 2019
- ip_src - ip address cardinality
- ip_dst - ip address cardinality
Aug 6 2019
Aug 2 2019
Aug 1 2019
Jul 31 2019
Jul 30 2019
Both daily and monthly unique devices per project family are backfilled now.
It remains to merge and deploy changes to the queries and restart the bundle.
Jul 29 2019
Jul 25 2019
The limn-edit-data/edit/ folder contains several RU queries and config; however, they are not currently scheduled for execution in puppet.
I think those are the reports used in the old compare Wikitext vs VisualEditor dashboard that we disabled a couple years ago.
I don't think we'll use those again, but let @Jdforrester-WMF decide.
The limn-ee-data/ee-migration folder contains several RU queries and config, but they are not currently scheduled for execution in puppet.
I don't know if we can delete them though, maybe @Catrope knows?
Just for the record, @elukey and I looked into this, and we confirmed that we can merge the puppet patch that will launch the druid loading job.
Jul 23 2019
Jun 28 2019
Jun 27 2019
The README says that if Prometheus itself doesn't see a metric for 5 minutes, it'll consider it stale. However, a metric pushed to the pushgateway will stay there until deleted, so Prometheus will never consider the metric stale when it pulls metrics from the pushgateway. With Graphite / statsd you push the metric and that's it: if there are no datapoints, the metric will have holes where there haven't been pushes.
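A toy model of the two behaviors described above (the staleness window matches the 5-minute default the README mentions; the function names are just for illustration):

```python
import time

STALENESS_SECONDS = 5 * 60  # default staleness window mentioned in the README

def scraped_value(last_sample_ts, last_value, now=None):
    """Direct pull model: a series with no sample inside the staleness
    window reads as stale (None)."""
    now = now if now is not None else time.time()
    if now - last_sample_ts > STALENESS_SECONDS:
        return None  # Prometheus marks the series stale
    return last_value

def pushgateway_value(last_value):
    """Pushgateway model: the last pushed value is re-exposed on every
    scrape until explicitly deleted, so it never goes stale."""
    return last_value
```

This is why a one-off push to the pushgateway lingers as a flat line, while a Graphite/statsd series simply shows gaps between pushes.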
Oh, I see! Thanks for the clarification.
Jun 26 2019
Should we do these changes for the dashboard as well?
I think we should! And everywhere in Wikistats, no?
Maybe we could factor this out into a single place that affects the whole app?
Going by https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html, I think 50.3M should probably be 50M? And 50.6M gets shown as 51M?
I think, philosophically, 3 significant digits (50.3M) is more coherent with the fact that we are already simplifying big numbers by way of K, M, etc. abbreviations.
Right now, we simplify 534208 to 534K (3 significant digits).
If we did only 2 significant digits, 534805 would rather be simplified to 530K, right?
So following this rule, we can apply the same to numbers that acquire a decimal part, no? 50345719 -> 50.3M, 4378452 -> 4.38M
That said... practically, I think both 2-significant-digits and 3-significant-digits are good for the Wikistats2 case.
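To make the rule above concrete, here's a small sketch of 3-significant-digit abbreviation in Python (Wikistats 2 itself is a JavaScript app, and this is not its actual code, just the rounding logic being discussed):

```python
def abbreviate(n, sig=3):
    """Format n with `sig` significant digits and a K/M/G suffix."""
    for suffix, scale in (("G", 1e9), ("M", 1e6), ("K", 1e3)):
        if abs(n) >= scale:
            value = n / scale
            # How many decimal places are left after the integer digits.
            decimals = sig - len(str(int(abs(value))))
            return f"{round(value, decimals):g}{suffix}"
    return str(n)
```

With sig=3 this reproduces the examples in the thread: 534208 → 534K, 50345719 → 50.3M, 4378452 → 4.38M; with sig=2, 534805 collapses to 530K.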
@fgiunchedi thanks a lot for the help!