While we're waiting for a proper solution with Dashiki, I made a simple, public-readable spreadsheet that presents all of this:
@Amire80 : Just clarifying: we are not waiting on any Dashiki work. As I mentioned before, Superset would be a better alternative for dashboarding; an example from CX translation (per https://www.mediawiki.org/wiki/Content_translation/analytics/queries)
Fri, Jul 12
Let's figure out how to deploy only what we need to notebook hosts
Thu, Jul 11
My mistake, I had refined from the 17th onward. All data should be there by now.
pinging @Asaf so he knows we are rerunning data for June 2019
Per @ezachte's criteria of "live" wikipedias, this data should not include dead/un-editable wikipedias. Erik's words on this are below (a rough sketch of applying these criteria follows the list):
- let's release data for editors with 5+ edits per country (regardless of size of bucket) per wiki; let's not release the 5+ and 100+ buckets separately
- some countries in which surveillance is prevalent will be blacklisted and no data will be released. See: https://dash.harvard.edu/bitstream/handle/1/32741922/Wikipedia_Censorship_final.pdf
- let's not release data for countries whose population is below a threshold, regardless of size of bucket.
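To make this concrete, here is a rough, hypothetical sketch of applying the criteria above to per-country rows; the field names, country codes, and thresholds are placeholders I made up, not the actual sanitization job:

```python
# Placeholder values only; the real blacklist and population threshold are not decided here.
SURVEILLANCE_BLACKLIST = {"XX"}      # countries excluded due to surveillance risk
MIN_COUNTRY_POPULATION = 1_000_000   # hypothetical population threshold

def releasable(row):
    if row["country"] in SURVEILLANCE_BLACKLIST:
        return False
    if row["population"] < MIN_COUNTRY_POPULATION:
        return False
    # Only the single 5+ edits bucket is released per wiki/country; no separate 100+ bucket.
    return row["editors_with_5plus_edits"] > 0

rows = [
    {"wiki": "eswiki", "country": "AR", "population": 44_000_000, "editors_with_5plus_edits": 120},
    {"wiki": "eswiki", "country": "XX", "population": 9_000_000, "editors_with_5plus_edits": 40},
]
print([r["country"] for r in rows if releasable(r)])   # only "AR" survives the filters
```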
Wed, Jul 10
The bucket size requested is 10 for editing data.
This might be counterintuitive, but the bucket size of the country has little to do with the privacy associated with being able to geo-locate an editor.
Tested that with beeline --verbose -f select.hql > out.txt; the stack trace is like the one Hive would provide, so closing!
when at least some should go through if it was only enforcing the 100 req/sec limit.
Let's see: rate limiting is enforced per IP for public APIs. Once you go over the limit of what we think is sustainable, your IP will be throttled for a bit (limit enforcement does not stop automatically when you stop making connections, but a bit after), so there is no guarantee that 100 of your connections per second are going to make it once you go above that limit. These (to be clear) are connections from the browser, correct?
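A toy illustration of the behavior described above (not the actual production limiter; the limit and cooldown values are made up): once an IP exceeds the per-second limit it stays throttled for a cooldown window, so even requests sent at a lower rate fail until the window expires.

```python
import time
from collections import defaultdict

LIMIT_PER_SEC = 100   # assumed per-IP request budget
COOLDOWN_SEC = 30     # made-up cooldown; throttling outlives the burst itself

window_start = defaultdict(float)     # start of the current one-second window, per IP
counts = defaultdict(int)             # requests seen in that window, per IP
throttled_until = defaultdict(float)  # time until which the IP stays throttled

def allow(ip, now=None):
    now = time.time() if now is None else now
    if now < throttled_until[ip]:
        return False                          # still cooling down, regardless of current rate
    if now - window_start[ip] >= 1.0:         # new one-second window
        window_start[ip], counts[ip] = now, 0
    counts[ip] += 1
    if counts[ip] > LIMIT_PER_SEC:
        throttled_until[ip] = now + COOLDOWN_SEC
        return False
    return True
```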
Also re-refined June; the sanitized data will get adjusted when the 2nd sweep of sanitization runs. I will keep this ticket open, as we need to deploy the code fix, but all the data needed should now be available.
Tue, Jul 9
Ok, data for July is there, onto data for June now:
@Ottomata , +1 to that idea
I am going to:
Ping @Milimetric: do we need to update the docs on Wikitech after the last refactor? https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history, because the page_first_edit_timestamp field does not appear in the docs
@chelsey: we will need to whitelist external domains, because most EL traffic that comes from 3rd parties is "fake", that is, we do not really want to count it as legitimate actions performed by Wikipedia users. I am going with 'translate.google*' for now unless I hear otherwise.
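For illustration only, a sketch of the kind of host check being discussed, assuming 'translate.google*' ends up on the allowlist; the regexes and project-domain list are my assumptions, not the actual refine configuration:

```python
import re

WIKIMEDIA_HOST = re.compile(r"\.(wikipedia|wikimedia|wiktionary|wikidata)\.org$")  # partial list
EXTERNAL_WHITELIST = re.compile(r"^translate\.google")  # matches e.g. translate.googleusercontent.com

def keep_event(host):
    # Keep events from true project domains plus explicitly whitelisted external hosts;
    # everything else is treated as "fake" third-party traffic and dropped.
    return bool(WIKIMEDIA_HOST.search(host) or EXTERNAL_WHITELIST.match(host))

print(keep_event("es.wikipedia.org"))                  # True
print(keep_event("translate.googleusercontent.com"))   # True (whitelisted external domain)
print(keep_event("wikipedia-clone.example.com"))       # False (third-party clone)
```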
Please try to log in to http://superset.wikimedia.org
Mon, Jul 8
I think these events are being filtered because their host is translate.googleusercontent.com, which does NOT map to ANY true wikimedia project domain; should have thought about this before! It looks like "fake" data coming from a third-party clone running our code (similar to the "fake" eventlogging data we filter)
Never mind, I was able to re-refine (needed to remove the REFINED flags as well as the SUCCESS ones) but there are still no changes. I think we just need to debug this on the Spark command line.
Rerun refine for 07/01 to see if anything changes (doesn't seem like it might, as the latest refine hours are missing a lot)
Okay - so basically anyone who registers on Wikitech and knows the account password for our analytics instance can get access
Definitely something going on here for 2019-07-01: there are 25 "init" events ("recorded"), but if I get one of the raw EL files for that day:
Then they should have access already using "Deb_Zierten"
The LDAP account is associated with Wikitech; the username is normally one word
Please add your LDAP username here
Please add your LDAP username (it should be one word)
@DLynch For the record: are you a permanent employee of the Foundation and thus have an NDA on file?
Talked to Product-Analytics at sync-up about best practices for keeping code backed up in Gerrit and data that needs to be backed up in HDFS; closing.
Sat, Jul 6
@ljon: that is correct, only about 5% of the anonymous "entities" have more than 5 edits.
Fri, Jul 5
Some recent work on this.
@JFishback_WMF is working on a risk assessment framework with Legal that we can apply to data releases such as this one. I took a look at the data harvested after the major refactor of the data-harvesting jobs. As far as I can see, on the daily edits tally of eswiki and arwiki about half of the daily edits (aggregated for all countries) come from anonymous editors. Once we aggregate the data monthly, this ratio changes dramatically: there are about 5% authenticated editors and 95% anonymous editors on the monthly tally (again, aggregated per country). Pinging @Asaf: in the per-country releases requested, are you also thinking about anonymous editors?
Ping @Milimetric can we merge the monthly job?
I saw this ticket go by; much has changed since it was filed. The ActionAPI table has not been updated for a while; a much more reliable flow of data can be found in mediawiki_api_request
Ping @Gilles: work on this to start second week of July
At this point I think we can close this ticket and re-open it if it happens again?
Wed, Jul 3
Tue, Jul 2
Nice, seems a fit for data governance (cc @chasemp), but for stream config? How would, say, sample rates be represented in the system?
Can Release Engineering chime in as to whether scap config settings should also delete artifacts from the target of the deploy?
Moving to kanban to take care of this in Q1 2019
Per our conversation at standup, we should probably have limits per host; does the IP limit need to be at the master process?
Mon, Jul 1
And versions of macOS?
Now, I would expect events that do not validate (against the latest version of the schema) to be logged in eventerror, right? Pinging @Ottomata
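To illustrate the expectation (purely a toy sketch; the mini-schema, field names, and "eventerror" routing here are assumptions, not the actual EventLogging pipeline):

```python
from jsonschema import Draft7Validator

# Hypothetical mini-schema standing in for the latest schema revision.
SCHEMA = {"type": "object", "required": ["action"], "properties": {"action": {"type": "string"}}}
validator = Draft7Validator(SCHEMA)

def route(event):
    # Events failing validation would be expected to land in an error stream/table.
    errors = [e.message for e in validator.iter_errors(event)]
    return ("eventerror", errors) if errors else ("valid", event)

print(route({"action": "init"}))        # ('valid', ...)
print(route({"unexpected_field": 1}))   # ('eventerror', ["'action' is a required property"])
```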