User Details
- User Since
- Apr 5 2021, 8:13 PM (158 w, 4 d)
- Availability
- Available
- LDAP User
- Htriedman
- MediaWiki User
- HTriedman (WMF)
Wed, Apr 17
going to investigate the feasibility of this at the WMF Hackathon in a few weeks
I believe that we *do* actually publish the data about Singapore editor numbers — when I query wmf.geoeditors_public_monthly internally, I get ~2000 rows from Singapore, including from March 2024. And Singapore is considered a "Lower risk" country on the Country and Territory Protection List (you can open this tsv which is the official source of truth and ctrl-F "Singapore")...
Thu, Mar 28
@VirginiaPoundstone should this be considered under the same auspices as T289532? It may be worthwhile to consider wrapping this up as part of that other task just to limit one-off work that might need to be repeated.
Fri, Mar 22
Yes — there's no further work to be done.
Mar 20 2024
@Milimetric FWIW about the weekly dataset: folks from product analytics told me that maintaining and publishing the monthly dataset is important for continuity with the existing dataset.
@kzimmerman @Milimetric happy to set up a meeting next week to discuss the differences between the DP and non-DP versions of the geoeditors monthly/weekly datasets
Hi @Ogiermaitre! Thanks for bringing this to our attention, and sorry it's taken so long to respond to you about this — I didn't know this ticket existed until 20 min ago. The US data problem has been fixed.
Mar 18 2024
@Ladsgroup is correct about this — this is already happening on an ad hoc basis in some cases where there may be concerns about editor safety for sensitive material.
Feb 27 2024
Feb 15 2024
thanks for awareness around your capacity, @DannyS712!
Jan 31 2024
There's no huge rush — we've deployed a short-term fix, but it requires some manual updating from WMF developers on a regular basis. If you can get to this in a few weeks that would be fine.
Hi @DannyS712! Have you made any progress on this?
Jan 24 2024
This task is high priority — the new country and territory protection list policy is out, and we'd like the internal list to reflect that as soon as possible.
Jan 23 2024
Jan 9 2024
I'm assuming that this is all meant to be oversight-level suppression, rather than admin-level (unless we want both as options).
Correct!
Jan 8 2024
@leila Hi! Would love to hear more about what you're thinking with regard to privacy :) Feel free to schedule something with me or continue the conversation here!
Nov 13 2023
Any updates on this?
Oct 31 2023
Update (very late but still necessary): As of Feb 2023, this data request has been completed!
Oct 19 2023
Hi @VirginiaPoundstone! Thanks for the detailed questions! I'll try to answer them one by one :)
Oct 13 2023
@JFishback_WMF I'll invite you to a meeting about this next week!
Oct 12 2023
@JAllemandou Thanks for the kind words! For the moment, yes — let's try to standardize use of the country protection list and try to avoid keeping multiple versions of the list hard-coded in jobs. I will work on the following:
- getting my proposed schema reviewed by legal and human rights
- implementing the new schema in hive
- updating documentation on wikitech
- getting this data release onto a DP framework (cc: @Isaac)
subscribing @Cleo_Lemoisson for visibility
Oct 11 2023
Hi! Thanks for flagging this, @Isaac! Definitely agree that this dataset is a great candidate for differential privacy (DP), which would also likely reduce the minimum publication threshold to <500. I'm happy to start working on that with you — it's a somewhat independent process from the discussion of the country protection list (CPL) and I think this dataset could benefit from it.
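For readers unfamiliar with the idea, here is a minimal sketch of how a DP count release with a publication threshold could look. It is an illustration only, not the pipeline discussed in this thread: the function names, the epsilon value, the threshold, and the example counts are all assumptions, and a production release would need a vetted DP library and a cryptographically secure noise source.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_release(counts, epsilon, threshold, sensitivity=1.0, seed=None):
    """Add Laplace noise to each count and keep only noisy counts >= threshold.

    All parameters are hypothetical; `sensitivity` is the maximum amount
    one contributor can change any single count.
    """
    rng = random.Random(seed)  # NOT suitable for a real release
    scale = sensitivity / epsilon
    released = {}
    for key, true_count in counts.items():
        noisy = true_count + laplace_noise(scale, rng)
        if noisy >= threshold:
            released[key] = round(noisy)
    return released


# Hypothetical per-country editor counts.
example_counts = {"Country A": 10_000, "Country B": 1}
released = dp_release(example_counts, epsilon=1.0, threshold=100, seed=42)
```

Because the published number is noisy, the threshold only needs to screen out counts dominated by noise, which is why it can usually sit lower than a threshold applied to raw counts.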
Oct 4 2023
@Eevans In that case, I'll change the data model to drop it! Will update this thread when it's done.
@Eevans Understood! I'll make that change to the schema soon.
Oct 3 2023
Hi all! I've made updates to the codebase to better comply with @Eevans' feedback, resulting in a greatly simplified interface. I've listed the following design changes below:
Sep 21 2023
@sbassett tagging you in this for visibility
Sep 20 2023
Some of them are just artifacts of starting from a fork of one of the legacy services. For example, we'll want to adopt a new (better) convention for keyspace and table naming; names like "local_group_default_T_dp_pageviews".data were generated by the RESTBase codebase. Likewise, the "_domain" attribute (which is always set to analytics.wikimedia.org for these services) was done to appease RESTBase, and isn't something we should be perpetuating. Easy changes, mostly cosmetic.
Sep 19 2023
I like this idea! Makes a lot of sense and covers more edge cases than my simpler solution was proposing. Feel free to implement this, and if you do, please write it up in a separate document and share it with me — could be very useful in future cases where we're considering releasing similar sensitive data with a relatively small number of raw data entries in the underlying dataset.
Sep 15 2023
As for reporting percentages, you can take an example from the new data publication guidelines. We considered how to report percentages in the "Threshold table" section of the policy: https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines#Threshold_table
Hi @ifried! Thanks for bringing this up — I wrote the initial set of recommendations for obfuscating event data in these contexts, and know that there are many contexts in which showing "<5" to an event organizer will leak the exact number of responses in that category. It is, at best, a partial fix that will be effective at deterring non-malicious people who have access to reports.
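One way the "<5" leak can happen, sketched under the assumption that the report also shows the overall total and that only one category is masked: the hidden value falls out by subtraction. The category names, counts, and function names below are hypothetical.

```python
def mask_small_counts(counts, threshold=5):
    """Replace every count below `threshold` with the string '<threshold'."""
    return {
        key: (value if value >= threshold else f"<{threshold}")
        for key, value in counts.items()
    }


def recover_masked(report, total):
    """Recover a masked count by subtraction.

    Works only when exactly one category is masked and the total is known;
    otherwise returns an empty dict.
    """
    masked = [key for key, value in report.items() if isinstance(value, str)]
    if len(masked) != 1:
        return {}
    visible = sum(value for value in report.values() if isinstance(value, int))
    return {masked[0]: total - visible}


# Hypothetical survey responses shown to an event organizer.
responses = {"Option A": 12, "Option B": 7, "Option C": 3}
report = mask_small_counts(responses)
leaked = recover_masked(report, total=sum(responses.values()))
```

Here `report` shows "<5" for Option C, yet anyone who sees the total of 22 can work out that exactly 3 people chose it.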
Sep 6 2023
Aug 22 2023
Hi all! It's been a few weeks without activity, so I'm following up on this request.
Aug 21 2023
Hi @BTullis! All of these Tumult Labs folks were working in more of an advisory role — even if their directories contain some uncommitted changes, you can delete them and remove their user profiles.
Aug 8 2023
Aug 3 2023
Aug 1 2023
Hi @odimitrijevic! Here's the gitlab repo I worked on during the documentathon :) https://gitlab.wikimedia.org/htriedman/documentathon-eventstream
Jul 25 2023
^^agree with the above analysis — if we can selectively remove the performer of suppressions, then this should be considered resolved.
Jul 24 2023
Thanks for your comments, @fkaelin! I'll get back to you about the topN pages once we meet about it.
Jul 21 2023
@Flomeier85 if you have any questions at all feel free to post them here or reach out to me via email at htriedman@wikimedia.org :)
Jul 20 2023
Jul 17 2023
@Vgutierrez this feature has been working as expected, and this ticket can be closed!
Jul 14 2023
Isaac and I spent some time brainstorming about this last month. Here's a google doc with a bunch of existing ideas in it!
Jul 11 2023
I would be strongly in favor of using mock data over synthetic data, at least for the moment. We should only have an explicit preference for synthetic data if there's a real need for the underlying statistical distribution of the fake data to mirror that of the real data. If it's just for performance testing, that shouldn't be necessary.
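To make the distinction concrete, here is a rough sketch of what mock data could look like for performance testing: rows that match a schema but are drawn from arbitrary ranges, without ever reading the real data. The column names, value ranges, and function name are invented for illustration and are not the actual AQS schema.

```python
import random
import string


def mock_rows(n, seed=0):
    """Generate `n` schema-shaped rows with arbitrary values.

    No real data is consulted, so the rows carry no privacy risk, but
    their statistical distribution will not resemble production data.
    """
    rng = random.Random(seed)
    projects = ["en.wikipedia", "ro.wikipedia", "de.wikipedia"]  # hypothetical
    rows = []
    for _ in range(n):
        rows.append({
            "project": rng.choice(projects),
            "page_title": "".join(rng.choices(string.ascii_letters, k=12)),
            "views": rng.randint(0, 10_000),
        })
    return rows


sample = mock_rows(100)
```

Synthetic data would instead be fitted to the real dataset so that its distributions match, which costs more and only pays off when the test depends on those distributions.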
Jul 5 2023
Definitely would be pro-overriding the user-agent for fontcdn (and cdnjs) — that would make it significantly easier to argue that they should be considered ok to allowlist for third-party resources.
Jun 30 2023
agree! we can close this out
Jun 29 2023
@elukey Feel free to remove it from Lift Wing for the moment! Thanks for letting me know.
Jun 27 2023
Jun 16 2023
May 30 2023
- What do you consider our next steps would be with this approach (using sample data as the initial source)? I ask that because you mention we shouldn’t use it yet in production environment with private or sensitive data so I guess we need to work more on it (to anonymize, for example). It’s not the case at this moment but it’s something we should explore for the future
See this task instead: https://phabricator.wikimedia.org/T318863
May 23 2023
May 22 2023
I don't quite get this. If I query this URL I thought I get views from Romania drilled-down per project and page (see "FCV_Farul_Constanța", present on both enwiki and rowiki). Is this not true or am I missing the definition of the splits?
Following up on this as the primary person working on this project for the past 18 months with some details of how this dataset is different from the existing API data:
Hi @Strainu! I was the primary person who worked on implementing this data release for the past 18 months and can describe how this data is different from the API.
May 16 2023
@Sfaci I ran this on stat1006 using the conda-created-stacked, conda-activate-stacked, and conda-deactivate-stacked built-in scripts. Are you using stat machines and conda?
May 10 2023
May 9 2023
Did some basic experimentation on this front here: https://gitlab.wikimedia.org/htriedman/synth-data/
May 5 2023
If I have understood right, the library you mentioned would use source data to create synthetic data, and that way we could reuse your script to synthesize data from other data sources, so it sounds really interesting and useful.
May 4 2023
@Eevans Thanks so much for the clarification! This rationale makes a great deal of sense to me, and I can focus on trying to provide your team with a simple and repeatable script that can do this across a variety of underlying data sources.
+1, I don't know exactly who maintains the analytics.wikimedia.org domain. There are also two other data releases with more historical data that hopefully can be linked here:
May 3 2023
From a columnar perspective, how large will these datasets need to be? The computational resources required to generate good synthetic data scale nonlinearly with the number of columns. Are we talking about datasets with 40 columns, or 4?
Currently, with AQS 2.0, we are talking about 10 columns. But who knows whether, in the future with new projects, we could need to create larger datasets. Anyway, as I'll mention below, we won't need to run the process frequently, so data size shouldn't be a big issue.
May 2 2023
Hi all! This is a really interesting problem, and I think that there are definitely some data privacy techniques that seem like they could be useful here — primarily, differentially-private data synthesis.
Apr 17 2023
Hi all! Any updates on this? I'd love to be able to publish the DP data that is currently stuck in the hdfs:///tmp folder :)
Apr 11 2023
I think I'd also prefer option B! Seems straightforward enough from a usability perspective.
Apr 6 2023
Hi all! I've read this thread and I want to weigh in on this with a perspective from the Privacy Engineering team. I think that there are two primary facts to consider here:
- The Product Analytics team (PA) has an organizational mandate from WMF to be doing this work, and they have been doing this work, despite the organizational constraint that they cannot share the outputs of their analyses in the same place as the code that produces those outputs. This is unlikely to change any time soon.
- The main issue here is data that is sensitive (i.e. it could potentially be used in harmful ways by a malicious actor) but not confidential (i.e. not certain to be used in harmful ways / not defined as PII in the WMF Privacy Policy) — what @mpopov alludes to with the Turkish editors example above.
Mar 28 2023
@Ottomata: @Milimetric and I have talked about adding this data to AQS at some point in the short-/mid-term future, but I think we're going to wait for AQS 2.0 to be released before we start work on that
+1 to prioritizing this. My usecase for publishing data from HDFS is the following:
Resolving this ticket and adding my use case to T317167
@Dzahn Thank you so much for the help explaining this! Makes a ton of sense, and I'll create that ticket soon.
Mar 27 2023
@MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as analytics-platform-eng on stat machines by using sudo -u analytics-platform-eng <cmd>... and am being prompted for my user password — I don't recall ever having used a password to access my stat machines, and it's not any password I can remember. Do you know where I might be able to go for those credentials?
Mar 21 2023
@Jcross asking for approval from you — I need these rights in order to deploy DP scripts that will run on a schedule on airflow
Mar 20 2023
Hi @MatthewVernon! We're currently running into some weird errors with Aranya's permissions, specifically regarding access to Turnilo and Superset. Is there any way of addressing that on this thread? Or should we start a new ticket? Thanks so much.
Mar 16 2023
just bumping this!
Mar 8 2023
@elukey not exactly sure what's going on here, but I can check into it and get back to you!
Mar 2 2023
@JArguello-WMF nope! I chatted with @Milimetric a couple of days ago and he said that we're good to go (as an initial MVP release, at least). Waiting on him to feel better to give the final approval and merge. I'll follow up on a new ticket if there's anything else I need besides that.
Feb 28 2023
Feb 21 2023
Feb 9 2023
@fgiunchedi I just signed up via lists.wikimedia.org! Thanks for getting back to me.
Feb 8 2023
Feb 6 2023
@Vgutierrez thanks so much! taking a look now
Jan 31 2023
Up and running! thanks for the help
My SQL Lab on superset has also not been working for the past week or so!