Jan 21 2021
in a setting like the one you describe, what would the attacker know, and what would they be trying to find out?
Sorry let me clarify: what would be known to an attacker is the exact pageviews per project per article, see: https://dumps.wikimedia.org/other/pageview_complete/readme.html
An attacker might try to remove the noise in order to recover the pageviews per article, per country.
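For intuition on why repeated noisy releases are risky, here is a minimal sketch (toy numbers, not our actual noise mechanism): if the same true count is published repeatedly with fresh, independent Laplace noise, averaging the releases recovers the true count.

```python
import numpy as np

rng = np.random.default_rng(42)

true_views = 1000   # hypothetical per-country, per-article daily count
noise_scale = 50    # hypothetical Laplace scale parameter

# One year of independently-noised releases of the same underlying count.
releases = true_views + rng.laplace(0, noise_scale, size=365)

print(releases[:3].round(1))     # individual releases are well off from 1000
print(releases.mean().round(1))  # but the mean converges to the true count
```

This is the basic reason that independent noise per release does not compose safely over time.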
Jan 20 2021
Parking some thoughts from my conversation with @Isaac after his good work these past couple of weeks.
Jan 19 2021
cc @Slaporte the blogpost about technical measures to detect censorship has been published
Jan 16 2021
@srodlund I see, how about (probably a reworked version of)
Jan 15 2021
@srodlund On mobile especially, the initial paragraph: "The act of detecting anomalous events in a series of events (in this case a time series of Wikipedia pageviews) is called anomaly detection. The anomalies we are looking for are sudden drops in pageviews on a per-country basis." looks, I think, much too prominent. Can we remove it entirely so the blogpost starts at "About four years ago"?
"derivative of logo" sounds good. No rush on publishing it whenever works for you.
Jan 12 2021
Jan 6 2021
@srodlund I think it is almost final! I accepted all your corrections and elaborated a bit on the conclusion. Please take a second look. Let me know if the tables are to be translated into images (or HTML tables), or how you prefer to handle that.
Jan 5 2021
Thanks for the fast response!
Jan 4 2021
Dec 18 2020
@srodlund perfect, that gives me next week to finalize the text. The new year sounds great.
Dec 2 2020
@Aklapper I assigned this to myself again after my account was re-activated
Nov 27 2020
Done both things, many thanks @Reedy
Nov 18 2020
To keep archives happy: WMF did the work of productionizing these scripts: https://wikitech.wikimedia.org/wiki/Analytics/Data_quality/Traffic_per_city_entropy
Nov 12 2020
@JAllemandou Given that user fingerprinting on pageview_hourly data is not effective (and if it were, that would be a problem) I *think* I am going to center my efforts - when, ahem, I can get to this - on other privacy 'units'
Nov 7 2020
This is WIP, please see: T207171: Have a way to show the most popular pages per country
Nov 6 2020
Thanks @TedTed for all these pointers. On my end, I need to digest all this info before I can get back to you; others here might have more questions.
We have IPs in a temporary dataset, called pageview_actor, that feeds into pageview_hourly, so that's where we'd get the fingerprint Joseph is talking about. We could insert two steps in between these datasets.
Nov 5 2020
Say that field has value 5: does it mean that the page had 5 different views, potentially from 5 different users (but all with the same country, user_agent_map, etc.)?
Yes, exactly, same country, same (broadly) user agent and same article.
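To illustrate those aggregation semantics, a toy sketch (simplified column names, not the actual pageview_actor/pageview_hourly schemas):

```python
import pandas as pd

# One row per request in the (hypothetical, simplified) per-actor data.
requests = pd.DataFrame({
    "country":    ["FR"] * 5,
    "user_agent": ["Chrome/86"] * 5,
    "page":       ["Chess"] * 5,
})

# Aggregation collapses the 5 requests into one row with view_count = 5.
# Those 5 views may come from 5 different users; all we know is that they
# share the same country, (broad) user agent, and article.
hourly = (requests
          .groupby(["country", "user_agent", "page"])
          .size()
          .reset_index(name="view_count"))
print(hourly)
```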
These approvals are now handled by @Ottomata
@TedTed Super thanks for chiming in
Oct 29 2020
Oct 28 2020
Code is merged now; when the entropy counts are re-run, alarms for May 18th will be re-sent.
Oct 27 2020
@Dzahn: I do not think so, he should be removed from LDAP
Oct 23 2020
NDA signed now but I do not have access to https://phabricator.wikimedia.org/L2?
Also, @Rmaung please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines and ask any questions you might have about it on the task
Oct 22 2020
If this is not super urgent I can work on it in my volunteer capacity.
Oct 21 2020
I would implement the daily "top" first, and once that is in place I would add the monthly job. Given the very different amounts of data needed for the two, a different strategy might be needed for the second one.
A daily release to provide quick information for editors interested in very targeted editing. I suspect this could even be just a ranking of the most popular articles that meet the privacy thresholds, without including any raw count data.
Nice, +1 to this idea
I found differences of <0.1% for recent months and <0.3% for older months. I think that's acceptable.
Oct 20 2020
For faster resolution of permission issues, add SRE-Access-Requests to the ticket; that way the person on clinic duty will get to work on it soon after the ticket is filed. I understand the process is a bit confusing, but permissions to access the prod infra (including the analytics clusters) are handled by the SRE team at large.
If the intent is to decide whether all errors are from the same user, you can send the number of errors of that type for that session; that would tell you the piece of info you want to know.
Fleshing this out a bit more. The number of errors for a device does not need to be per session but can instead be a tally:
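For example (a hypothetical client-side shape, not an actual instrument):

```python
from collections import Counter

# Per-device tally keyed by error type; no session identifiers involved.
error_tally = Counter()

def record_error(error_type: str) -> None:
    error_tally[error_type] += 1

for e in ["TypeError", "TypeError", "ReferenceError"]:
    record_error(e)

# The payload says "this device saw 2 TypeErrors" without linking errors
# to any particular session.
print(dict(error_tally))  # {'TypeError': 2, 'ReferenceError': 1}
```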
We can keep data for longer than 90 days that has no identifying fields. Just need to submit a changeset that lists those fields. Please take a look at docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention
Oct 19 2020
A hashed IP would still tell you how many IPs are involved, without revealing any individual IP
For it to be truly non-revealing over a 2^32 space it will probably need to be salted.
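A sketch of the difference (illustrative only; key management details omitted):

```python
import hashlib
import hmac
import secrets

ip = "203.0.113.7"

# Unsalted hash: an attacker can precompute hashes for all ~2^32 IPv4
# addresses and invert this by table lookup.
weak_token = hashlib.sha256(ip.encode()).hexdigest()

# Keyed hash (HMAC) with a secret: without the key the brute-force table
# cannot be built. Distinct IPs still map to distinct tokens, so counting
# "how many IPs" still works without revealing any individual IP.
secret_key = secrets.token_bytes(32)  # kept private, rotated as needed
strong_token = hmac.new(secret_key, ip.encode(), hashlib.sha256).hexdigest()
```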
@elukey to create kerberos credentials
Oct 16 2020
So, if you just say "this number is too low to be displayed", I don't think that anyone will complain
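In code terms it would be as simple as (a toy sketch; the cutoff value is made up):

```python
MIN_DISPLAYABLE = 100  # made-up threshold, not an actual policy value

def display_count(views: int) -> str:
    # Suppress rather than perturb: below the threshold we reveal nothing
    # about the value except that it is too low to show.
    if views < MIN_DISPLAYABLE:
        return "this number is too low to be displayed"
    return str(views)
```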
This is actually very useful info, thank you.
Oct 15 2020
Keep in mind that per-population data is not necessarily needed (it would be great to have at some point, but it feels like scope creep in this task).
I think that @CKoerner_WMF's example works fine with editors as well: "When Bethany started editing Malagasy Wikipedia in 2014, there were no Wikipedia editors in her home country of Madagascar", so I do not really see a strong use case for edits versus editors in this case.
Adding @Isaac because I think he can probably be a good person to help explore whether more than a simple bucketization solution might be needed.
Wei talked about doing some data analysis to quantify the issues with privacy and country splits. As we spoke, we need to quantify the identification risk: an article with 1 pageview on Greenlandic-language Wikipedia might carry an identification risk of 1/55,000 (55,000 being the population of Greenland), while an article in Malaysian viewed in San Marino might have an identification risk of 1/5 (say, 5 citizens with Malaysian names in San Marino). So it is not the "number of pageviews" that defines the identification risk but rather the "possible population from which these pageviews are drawn".
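As back-of-the-envelope arithmetic (using the illustrative population figures above):

```python
def identification_risk(pageviews: int, candidate_population: int) -> float:
    # Risk is driven by the size of the plausible reader pool,
    # not by the raw pageview count.
    return min(1.0, pageviews / candidate_population)

# 1 view on Greenlandic-language Wikipedia, ~55,000 plausible readers:
print(identification_risk(1, 55_000))  # ~0.000018
# 1 view of a Malaysian-language article from San Marino, ~5 plausible readers:
print(identification_risk(1, 5))       # 0.2
```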
Moving to kanban and @razzi to work on this.
Oct 14 2020
Pinging @JAllemandou in case he can think of any reason why we should keep these fields, given the precision they provide.
@CKoerner_WMF just so you know, this data has been publicly available for about a year now; the task in question is to visualize it via Wikistats.
Oct 13 2020
This is scheduled to be added to Wikistats in Q2 2020 (Sep to Dec)
Oct 12 2020
Summing up: the timeseries of entropy of os_family per access_method works well as a data-quality timeseries for 'mobile web' (see the green line in the plot above) and 'mobile app' (orange line). For desktop, the timeseries is a lot better from April onwards, when filtering of automated agents was deployed (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Why_do_we_need_more_sophisticated_bot_detection). The blue line in the graph above starts to clearly oscillate with a weekly cadence from April onwards. Now, as can also be seen in the graph above, there are still spikes due to undetected bots. Those are bots that elude our detection for a number of reasons (they are really well spread geographically, or their effect on pageviews is not as high as our thresholds).
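For reference, the underlying signal is just the Shannon entropy of the os_family distribution within each access_method, computed per time bucket (a minimal sketch; the real job runs over pageview data in the cluster):

```python
import math
from collections import Counter

def shannon_entropy(counts: Counter) -> float:
    # Entropy in bits of a categorical distribution given raw counts.
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

# Toy os_family counts for one access_method in one hour. A bot flood
# concentrated on a single OS skews the distribution, so entropy drops
# sharply relative to a normal hour.
normal_hour = Counter({"Android": 500, "iOS": 450, "Windows": 50})
bot_hour    = Counter({"Android": 500, "iOS": 450, "Windows": 5000})
print(shannon_entropy(normal_hour))  # ~1.23 bits
print(shannon_entropy(bot_hour))     # ~0.79 bits
```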
ping on announcement e-mail to wikitech-l (cc @fdans )
Oct 8 2020
+1 to the active/active plan
Expiry contact will be @Ottomata and the end date is April 1 2021
The raw data is broadly available per language-project on the API or in the dumps.
You are right, I forgot these files are available per project. My reasoning above breaks down where project (more or less) == country (per @lexnasser's example), as that data is already public. The example of users in Kyrgyzstan that "speak French and have access to the internet" still stands.
Do not disagree, just mentioning this as something to think about.
@lexnasser Nice, yes, the same considerations apply to your example. That such a low count is available points to a bug here: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/monthly/pageview_top_articles.hql
Is it possible for Nigeria, Mali, Kenya, India, Philippines, Romania, Kyrgyzstan?
Oct 7 2020
@JAllemandou this is a good example of why we should keep as many mediawiki history dumps as possible: correcting this metric can only go as far back as the snapshots we keep
Top 100 in each language.
@AMuigai @Amire80 @lexnasser is going to start working on this project this quarter. I propose we use a top 100 of articles per country, where these articles can be in *any* language. Thus you could request "the most popular articles in French in Spain" for the month of 2020-01, and that list might have, say, 5 articles, because the top 100 includes 95 articles in Spanish and 5 in French. Makes sense? (See the sketch below.)
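A sketch of those semantics (hypothetical data shape, not the actual AQS response format):

```python
# Hypothetical top-100 list for Spain for 2020-01; each entry carries the
# language of the article. A language filter just subsets the list, so a
# "French in Spain" request may legitimately return far fewer than 100 rows.
top_100_spain = [
    {"rank": 1, "article": "Pandemia", "lang": "es"},
    {"rank": 7, "article": "Paris",    "lang": "fr"},
    # ... up to rank 100
]

french_in_spain = [row for row in top_100_spain if row["lang"] == "fr"]
```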
@lexnasser As a reminder, here is the design document for the prior AQS endpoint: https://drive.google.com/drive/u/0/folders/1bcy6Iyb_bLwD1jcfjhL4vtKZvD-CN22L
After talking with the team, we chose to back up all data except for logs, raw data (unprocessed webrequest, events, and dumps), 2 months of webrequest, and processed wikitext (heavy).
You mean "unprocessed events"? Because we need to back up both the sanitized and unsanitized versions
Maybe leaning towards using a transform function, because the code would be shorter and there would be fewer moving pieces?
I think having very specific code in refine that applies to just one job is an anti-pattern; it is (you are right) shorter, but in my opinion much more brittle.
Oct 6 2020
@mforns: it could also be a second job that runs after the refine one (similar to how we do virtual-pageviews), as we probably do not want to create special refine functions for just one dataset
Thanks for reporting; data for new wikis is not sqooped automatically, we will add this one to our list.
Oct 5 2020