Code is merged now; when the entropy counts are re-run, alarms for May 18th will be re-sent.
Tue, Oct 27
@Dzahn : I do not think so; he should be removed from LDAP
Fri, Oct 23
NDA is signed now, but I do not have access to https://phabricator.wikimedia.org/L2?
Also, @Rmaung please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines and ask any questions you might have about it on the task
Thu, Oct 22
If this is not super urgent, I can work on it in my volunteer capacity.
Wed, Oct 21
I would implement the daily "top" first, and once that is in place I would add the monthly job. Given the very different amounts of data for the two, a different strategy might be needed for the second one.
A daily release to provide quick information for editors interested in very targeted editing. I suspect that this could even be just a ranking of the most popular articles that meet the privacy thresholds, without including any raw count data
Nice, +1 to this idea
I found differences of <0.1% for recent months and <0.3% for older months. I think that's acceptable.
Tue, Oct 20
For faster resolution of permissions issues, add SRE-Access-Requests to the ticket; that way the person on clinic duty will get to work on it soon after the ticket is filed. I understand that the process is a bit confusing, but permissions to access the prod infra (including analytics clusters) are handled by the SRE team at large.
If the intent is to decide whether all errors are from the same user, you can send the number of errors of that type for that session, and that would tell you the piece of info you want to know.
Fleshing this out a bit more. The number of errors for a device does not need to be per session; it can instead be a tally:
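A minimal sketch of what that tally could look like, assuming a hypothetical event shape with `device_id` and `error_type` fields (both names are illustrative, not the actual schema): instead of emitting per-session error events, the client keeps one aggregated counter per device.

```python
from collections import Counter

def tally_errors(events):
    """Aggregate error events into a per-device tally of counts by error type.

    events: iterable of dicts like {"device_id": ..., "error_type": ...}
    (hypothetical field names; the real schema may differ).
    """
    tallies = {}
    for e in events:
        tallies.setdefault(e["device_id"], Counter())[e["error_type"]] += 1
    return tallies

events = [
    {"device_id": "d1", "error_type": "TypeError"},
    {"device_id": "d1", "error_type": "TypeError"},
    {"device_id": "d2", "error_type": "ReferenceError"},
]
print(tally_errors(events))
```

Sending only these aggregated counts carries enough information to tell whether the errors come from one device, without retaining per-session detail.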
We can keep data that has no identifying fields for longer than 90 days. You just need to submit a changeset that lists those fields. Please take a look at the docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention
Mon, Oct 19
A hashed IP would still tell you how many IPs are involved, without revealing any individual IP
For it to be truly non-revealing over a 2^32 space, it will probably need to be salted.
@elukey to create Kerberos credentials
Fri, Oct 16
So, if you just say "this number is too low to be displayed", I don't think that anyone will complain
This is actually very useful info, thank you.
Thu, Oct 15
Keep in mind that per-population data is not necessarily needed (it would be great to have at some point, but it feels like scope creep in this task).
I think that @CKoerner_WMF's example works fine with editors as well: "When Bethany started editing Malagasy Wikipedia in 2014, there were no Wikipedia editors in her home country of Madagascar", so I do not really see a strong use case for edits versus editors in this case
Adding @Isaac because I think he can be a good person to help explore whether more than a simple bucketization solution might be needed.
Wei talked about doing some data analysis to quantify the issues with privacy and country splits. As we spoke, we need to quantify the identification risk: an article with 1 pageview on the Greenlandic-language Wikipedia might carry an identification risk of 1/55,000 (55,000 being the population of Greenland), while an article in Malagasy in San Marino might have an identification risk of 1/5 (5 citizens with Malagasy names in San Marino). So it is not the "number of pageviews" that defines the identification risk, but rather the "possible population from which these pageviews are drawn".
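As a worked version of the two examples above, under the (simplifying, assumed) model that risk is the number of pageviews divided by the plausible reader population:

```python
def identification_risk(pageviews: int, plausible_population: int) -> float:
    """Toy risk model: pageviews drawn from a plausible reader population.

    This is a simplifying assumption for illustration, not the team's
    actual risk formula; it caps the ratio at 1.0.
    """
    return min(1.0, pageviews / plausible_population)

# 1 pageview drawn from Greenland's ~55,000 inhabitants
print(identification_risk(1, 55_000))  # ≈ 1.8e-05
# 1 pageview drawn from ~5 plausible readers in San Marino
print(identification_risk(1, 5))       # 0.2
```

The point the numbers make: the same raw count (1 pageview) yields risks four orders of magnitude apart, so any threshold keyed to pageview counts alone misses the real risk driver.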
Moving to kanban and @razzi to work on this.
Wed, Oct 14
Pinging @JAllemandou in case he can think of any reason why we should keep these fields, given their precision.
@CKoerner_WMF just so you know, this data has been publicly available for about a year now; the task in question is to visualize it via Wikistats.
Tue, Oct 13
This is scheduled to be added to wikistats Q2 2020 (Sep to Dec)
Mon, Oct 12
To sum up: the timeseries of the entropy of os_family per access_method works well as a data-quality timeseries for 'mobile web' (see the green line in the plot above) and 'mobile app' (orange line). For desktop, the timeseries is a lot better from April onwards, when filtering of automated agents was deployed (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Why_do_we_need_more_sophisticated_bot_detection). The blue line in the graph above starts to clearly oscillate with a weekly cadence from April onwards. Now, as can also be seen in the graph above, there are still spikes due to undetected bots. Those are bots that elude our detection for a number of reasons (they are really well spread geographically, or their effect on pageviews is not as high as our thresholds).
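For readers unfamiliar with the metric: a minimal sketch of the underlying computation, assuming we have the list of os_family values seen for one access method on one day (input shape is an assumption). Tracking this number per day gives the timeseries discussed above; bot traffic concentrated on a single os_family pulls the entropy down, which is what makes spikes visible.

```python
import math
from collections import Counter

def shannon_entropy(os_families):
    """Shannon entropy (bits) of the os_family distribution for one day."""
    counts = Counter(os_families)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A bot spike concentrated on one os_family lowers the entropy:
normal_day = ["Android"] * 40 + ["iOS"] * 35 + ["Windows"] * 25
bot_spike  = ["Android"] * 90 + ["iOS"] * 7 + ["Windows"] * 3
print(shannon_entropy(normal_day) > shannon_entropy(bot_spike))  # True
```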
ping on announcement e-mail to wikitech-l (cc @fdans )
Thu, Oct 8
+1 to the active/active plan
Expiry contact will be @Ottomata and the end date is April 1, 2021
The raw data is broadly available per language project, on the API or in the dumps.
You are right, I forgot these files are available per project. My reasoning above breaks down where project (more or less) == country (per @lexnasser's example), as that data is already public. The example of users in Kyrgyzstan that "speak French and have access to the internet" still stands.
Do not disagree, just mentioning this as something to think about.
@lexnasser Nice, yes, the same considerations apply to your example. That such a low count is available points to a bug here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/monthly/pageview_top_articles.hql
Is it possible for Nigeria, Mali, Kenya, India, Philippines, Romania, Kyrgyzstan?
Wed, Oct 7
@JAllemandou this is a good example of why we should keep as many mediawiki history dumps as possible; correcting this metric can only go as far back as the snapshots we keep
Top 100 in each language.
@AMuigai @Amire80 @lexnasser is going to start working on this project this quarter. I propose we use a top 100 of articles per country, where these articles can be in *any* language. Thus you could request "the most popular articles in French in Spain" in the month of 2020-01, and that list might have, say, 5 articles, because the top 100 includes 95 articles in Spanish and 5 in French. Makes sense?
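A toy sketch of the proposed semantics, with invented data: the stored ranking is a single top-100 per country across all languages, and a language-specific query just filters that list, so it can legitimately return far fewer than 100 rows.

```python
# Toy per-country ranking: (rank, article, language). Invented data,
# not the real API shape; illustrates the proposed semantics only.
top_100_spain = [
    (1, "España", "es"),
    (2, "Madrid", "es"),
    (3, "France", "fr"),
]

def top_articles(ranking, language=None):
    """Return the country's top list, optionally filtered to one language."""
    if language is None:
        return ranking
    return [row for row in ranking if row[2] == language]

# "Most popular articles in French in Spain": only the French entries
# of the country-wide top 100 are returned, possibly just a handful.
print(top_articles(top_100_spain, language="fr"))
```

The design consequence is that ranks are country-global, not per-language: the French articles keep their positions within the overall top 100 rather than being re-ranked 1..n.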
@lexnasser As a reminder, here is the design document for the prior AQS endpoint: https://drive.google.com/drive/u/0/folders/1bcy6Iyb_bLwD1jcfjhL4vtKZvD-CN22L
After talking with the team, we chose to back up all data except for logs, raw data (unprocessed webrequest, events, and dumps), 2 months of webrequest, and processed wikitext (heavy).
You mean "unprocessed events"? Because we need to back up both the sanitized and unsanitized versions
Maybe leaning towards using a transform function, because the code would be shorter and there would be fewer moving pieces?
I think having very specific code in refine that applies to just one job is an anti-pattern; it is (you are right) shorter, but in my opinion much more brittle.
Tue, Oct 6
@mforns: it could also be a second job run after the refine one (similar to how we do virtual pageviews), as we probably do not want to create special refine functions for just one dataset
Thanks for reporting; data for new wikis is not sqooped automatically, we will add this one to our list.
Mon, Oct 5
Sat, Oct 3
mmm, no, scratch that, it is not fixed but will fail non-deterministically
Fri, Oct 2
My opinion on this request is that having not-thoroughly-supervised contributors accessing the data introduces too big a risk of a data leak. I think we should strive to make this data available publicly with a differentially private strategy.
FYI, I re-ran this timer as it is now deployed on an-launcher1002 (with only the 'order' fix but not the subsequent fix) and it worked. More comments on the patch.
@MMiller_WMF we missed this month's deploy of this change; will it be OK to wait for the run of November 1st, or do you need it sooner?
Thu, Oct 1
I think these should be fine to export, agreed.
ping @nurdinjaelani: do you want to add some info here about your request?
Entropy of os_family per method of access; the orange line shows the big oscillations in mobile apps. The oscillation is not present in the desktop or mobile web interface
Thanks, and agreed on solution 3). I have assigned it to @razzi and we can take this up as part of regular development
Wed, Sep 30
Next quarter as in Q2
Let's close when @fdans sends announcement e-mail
I think it will be fine to archive to HDFS alone. We use those files, but sparingly, so I do not think there is an issue with them being available only in HDFS.
Assigning to @mforns for final CR.
Given that the missing pid package is now Debianized, let's deploy this code?
Can @mforns confirm that reportupdater can use this package as-is?
OK, let's approve access until 10 March 2021, and when the collaboration is extended, access can be extended as well. Approved on my end.