User Details
- User Since
- Apr 5 2021, 8:13 PM (103 w, 3 h)
- Availability
- Available
- LDAP User
- Htriedman
- MediaWiki User
- HTriedman (WMF)
Today
@MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as analytics-platform-eng on stat machines using sudo -u analytics-platform-eng <cmd>..., and I'm being prompted for my user password. I don't recall ever having used a password to access my stat machines, and it's not any password I can remember. Do you know where I might be able to go for those credentials?
Tue, Mar 21
@Jcross asking for approval from you — I need these rights in order to deploy DP scripts that will run on a schedule on airflow
Mon, Mar 20
Hi @MatthewVernon! We're currently running into some weird errors with Aranya's permissions, specifically regarding access to Turnilo and Superset. Is there any way of addressing that on this thread? Or should we start a new ticket? Thanks so much.
Thu, Mar 16
just bumping this!
Wed, Mar 8
@elukey not exactly sure what's going on here, but I can check into it and get back to you!
Thu, Mar 2
@JArguello-WMF nope! I chatted with @Milimetric a couple of days ago and he said that we're good to go (as an initial MVP release, at least). Waiting on him to feel better to give the final approval and merge. I'll follow up on a new ticket if there's anything else I need besides that.
Tue, Feb 28
Feb 21 2023
Feb 9 2023
@fgiunchedi I just signed up via lists.wikimedia.org! Thanks for getting back to me.
Feb 8 2023
Feb 6 2023
@Vgutierrez thanks so much! taking a look now
Jan 31 2023
Up and running! thanks for the help
SQL Lab on Superset has also not been working for me for the past week or so!
Nov 28 2022
@andrea.denisse that is correct! 2023-06-30 is the expiry date for @dasm
Nov 15 2022
This model card has already been created!
Nov 9 2022
@fgiunchedi the expiry dates for the other @tmlt.io folks are correct!
Nov 3 2022
@Ottomata I think the above is the right approach (if we decide to do it)
Oct 12 2022
Hi all! Just wanted to come back to this thread (even though it's been more than a month or two) with some updates —
Just wanted to jump in here — I'm an engineer on the Privacy Engineering team, and I've been working on releasing this data (pageviews aggregated by country and project) safely for about a year now!
Sep 29 2022
The specific error we're running into now is (I believe) a Hive configuration issue, which emerges when we use a combination of Skein + Spark3 + a custom virtualenv
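For reference, here's the rough shape of the kind of setup that triggers this. All paths, names, and config values below are illustrative (and I've written it as a plain spark3-submit for simplicity, though we actually launch through Skein):

```shell
# Pack the custom virtualenv so YARN executors can unpack it locally
# (venv-pack / conda-pack are the usual tools for this).
venv-pack -p ./my-venv -o my-venv.tar.gz

# Hypothetical submit: ship the venv archive and point the workers at
# its python. Hive access generally also needs hive-site.xml visible
# to Spark (or spark.sql.catalogImplementation=hive), which seems to
# be the piece that breaks under the custom environment.
PYSPARK_PYTHON=./environment/bin/python \
spark3-submit \
  --master yarn \
  --archives my-venv.tar.gz#environment \
  --conf spark.sql.catalogImplementation=hive \
  job.py
```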
Aug 29 2022
Aug 12 2022
woohoo!! it works! so excited to see this up in production; I just tried it myself, and all seems to be working correctly!
Aug 9 2022
@Aklapper it's the output of a project during Innovation Week to 1) retrain an image model for classifying nude, pornographic, gory, etc. imagery and 2) deploy it on the new ML infrastructure (Liftwing) to figure out pain points/issues with non-MLE people and the broader community deploying models.
Jul 25 2022
Hi @awight! I'm the privacy engineer in charge of reviewing data releases at WMF, and I'll try my best to take a look at this request over the course of the next week.
Jul 21 2022
Current iteration of the model is on GitHub here: https://github.com/htried/Image-Content-Filtration/tree/statbox-retrain-test.
Jun 22 2022
Hi all!
May 11 2022
@Milimetric Thanks for the pointers on this process! I also just talked to @gmodena and think that we're starting to come to a good set of solutions for how we might put together the disparate pieces of this project. I'll be sure to keep you all in the loop.
May 2 2022
@Ottomata Thanks for this update! The differential privacy project is currently using a jerry-rigged version of Spark 3 to run our software packages, so please let me know (either in this thread on phab or via slack) when you've been able to install Spark 3 on anaconda-wmf.
Apr 29 2022
Apr 20 2022
Apr 4 2022
Hi @Addshore! Hope you're well — I'm done with my privacy review and am hoping to share it with you soon (I just need your email)
Hi @Addshore working on this now, hopefully I'll have it done in the next 24h!
Mar 29 2022
Great! I'll be sure to circle back to this in a month or two with some updates.
Mar 28 2022
Thanks so much for getting back to me on this with some more information. We're currently in the middle of establishing protocols and processes around the use of differential privacy (and configuring software!), which should be done by the end of Q4 (June 2022). This data release is definitely possible within certain privacy bounds — if we can wait until then. If not, I can also potentially suggest some other mitigation heuristics.
Mar 25 2022
Hi @JAllemandou — does this pageview data exist in a private table somewhere stripped of the actor_signature field? Or is it preaggregated somehow? This could be a case where differential privacy (which we are currently piloting on similar data) could come in handy.
Mar 22 2022
The privacy team has conducted a review of the proposed data collection scheme — without mitigations, it would be deemed medium risk. However, after privacy-protecting mitigations like automated data deletion after 90 days, bucketing, and restricting access to WMF/NDA'd people, collecting this data was deemed low risk.
Mar 15 2022
Hi @Addshore! I'm Hal, a privacy engineer at WMF, and I'll be taking a look at this (rerunning the notebook, assessing potential harms, writing up a formal privacy review, etc.) in the next few days.
Feb 28 2022
Kiron is leaving Tumult Labs soon, so this task doesn't need to be completed.
Feb 15 2022
Update: they should all be in the NDA and MOU document now
@Dzahn checking in with Legal now
Feb 14 2022
@RhinosF1 just checked, the expiry date is 13 September 2022
I believe the expiry should be roughly 6 months from now — let's say (for the moment, at least) 31 August 2022.
Feb 12 2022
Hi SRE team! Just a couple of clarifications here — the approving party is actually @JBennett, rather than myself.
Jan 26 2022
Also subbing my direct supervisor @Jcross
LDAP and SUL accounts are both linked to this account, let me know if anything else needs to be done on this front!
Jan 19 2022
Dec 14 2021
Nov 3 2021
Totally understand. Thanks for the tips!
Hi @Urbanecm, thanks for the quick response and the helpful pointer. I've been able to get into centralauth by running analytics-mysql centralauth, and can query centralauth.globaluser. I must've been mistaken in thinking that I needed access to mwmaint — that came up in a discussion with one of my peers who had access to mwmaint, and I didn't realize the same data was accessible with my current user permissions. You can deny this request and close this ticket.
Oct 28 2021
Got it. In that case I don't see it adding any new privacy risk — I'll just make sure to bump my investigation of the frequency and severity of these incidents up on my todo list.
I know that this is the same theoretical attack vector as revision create, e.g. someone creates a page with a title like "Hal Triedman's SSN is XX-XXX-XXXX" that is quickly removed and suppressed, but the revision create event remains publicly consumable in the event stream for 7 days.
Hi @Milimetric! Thanks for commenting on this task — lots has happened in the last 3-4 weeks and this served as a good reminder to update this thread as to where we currently are on this project.
Oct 22 2021
I have some more updates after working on the algo-accountability repo for another week:
Oct 13 2021
Hi all! Just wanted to post a quick update on some of these ORES transparency efforts — I have (mostly) compiled a repository of datasets (~10GB), model binaries (~0.5 GB), model architectures, model training performance, etc. that are used in ORES. You can check it out at the ores-data repo on Gitlab. There were some holes where datasets/models didn't compile or were otherwise corrupted somehow, but I did my best to document what didn't work for whatever reason.
Sep 30 2021
Just checked the datasets that are going to be made available for public release. Everything is in order and you're all set to share them outside of the NDA group. With the mitigations taken, the residual risk level that this data poses to editors is low.
Hi everyone — I know it's been several months since this ticket has been updated, but work on implementing DP at scale in production has continued over the last several months, and I wanted to post publicly with some updates on our process:
Sep 28 2021
Just took a look. Those files are alright to share with people who have signed NDA with the Foundation. They are not ok to share publicly, since they contain exact counts of editors and edits, rather than aggregated buckets of counts (11-20 editors instead of 14, 100-200 edits instead of 151, etc.).
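For anyone curious what the bucketing mitigation looks like mechanically, here's a toy helper (my own illustration, not the actual code used for the release):

```python
def bucket(count: int, size: int = 10) -> str:
    """Map an exact count to a coarse range like '11-20'.

    Buckets are "0", "1-size", "size+1 - 2*size", and so on, so an
    exact editor or edit count can't be read back out of the
    published data.
    """
    if count <= 0:
        return "0"
    lo = ((count - 1) // size) * size + 1
    hi = lo + size - 1
    return f"{lo}-{hi}"

print(bucket(14))        # 14 editors  -> "11-20"
print(bucket(151, 100))  # 151 edits   -> "101-200"
```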
Hi @GoranSMilovanovic — apologies for the confusion. I understand that you are intending to remove all information from countries on the Country Protection List, and I was trying to respond to @Manuel's follow-up question, just for the sake of learning:
Hi @Manuel — so sorry for the late response; my phabricator account was misconfigured and I didn't get a notification email. Thanks so much for getting back to me with all this information.
Sep 27 2021
Hi again @Isaac! Just wanted to re-respond to your post about protected classes with a strong measure of agreement. The only feature I might add to the content analysis is geography among anonymous IP editors. At the same time, I think a lot of the features we want to measure depend in large part upon the size of the test set to prevent bad stats and data leaks (it really doesn't have to be very large; 200-400 evenly-distributed samples would likely do just fine).
Sep 24 2021
End of the week update: I have officially been able to run a full model card through the pipeline! (yay) Example here (sorry for the sparsity of data, it's only running on ~50 revisions to keep testing relatively quick).
Sep 21 2021
Hi @GoranSMilovanovic and @Manuel! My name is Hal — I'm a privacy engineer on the Privacy Engineering team. There is some precedent for releasing data of this variety, but I still have a couple of questions:
Sep 16 2021
@calbon you're making a good point and this is definitely a conversation worth having.
Sep 13 2021
Sep 10 2021
Quick update — I spent some time visually charting out exactly what the infrastructure/workflow for this system might look like. I've attached an image of my proposed design, and you can make comments on the design on Google Drawings here.
Jul 13 2021
@calbon good first question — yes, the model card bot should overwrite manual edits. In my mind, the canonical way to edit model cards should be by pushing some change to a config file for the card hosted on Gerrit/Gitlab. Then, changes will show up on the next run of the card generator.
Hi all! I just put together a memo synthesizing what dataset/model/service documentation could look like on WMF resources. For generalizability, I've called the documentation "algorithmic accountability sheets". The memo addresses what questions we should be asking for each component of an algorithm, as well as some thoughts on governance, metrics, deprecation, etc.
Jun 24 2021
Jun 4 2021
Part of this task is to make data releases of this type part of WMF's regular cycle of data releases, so I don't think we should treat this project as a one-off data release; rather, running it like any other data flow should be a core requirement.
Thanks @Isaac and @Nuria for the in-depth discussion of the relative pros and cons of these two approaches, and for the deep dive on user-side filtering. I wanted to chime in with some more context that I recently learned about putting DP into production, regardless of how we filter/limit pageviews. These considerations may be relevant as we move toward creating this as a service.
Jun 3 2021
It's working perfectly! Thanks so much for the responsiveness.
Hi all — reopening this task so that I can get access to https://superset.wikimedia.org as a Hive GUI.
Jun 2 2021
@JBennett tagging you to flag that you need to sign off on this
May 21 2021
May 3 2021
Hi all — just finished updating the demo to get it into a good place. You can see the finished product (UI, user- and pageview-level privacy, etc.) at https://diff-privacy-beam.wmcloud.org. Please let me know what you think, and if there are any next steps that any of you can see toward getting this into a production prototype. Thanks for all the help so far :)
Apr 30 2021
@TedTed, thanks for explaining thresholding and why δ is necessary, even with Laplace noise. Really useful to know what's happening under the hood of Privacy on Beam.
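For anyone following along, here's a toy sketch of the idea as I understand it (my own illustration, not Privacy on Beam internals): when users can contribute to previously-unseen partitions, you noise each partition's count and only release partitions whose noisy count clears a threshold, and δ bounds the chance that a lone user's partition sneaks past that threshold anyway.

```python
import math
import random

def laplace(scale: float) -> float:
    # Inverse-CDF sampling of Laplace(0, scale)
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count_with_threshold(counts, epsilon, delta, max_partitions=1):
    """Release noisy counts only for partitions above a threshold.

    Assumes one user contributes count 1 to at most `max_partitions`
    partitions, so L1 sensitivity is max_partitions. The threshold is
    set so a single user's partition is released w.p. <= delta:
    P(1 + Lap(b) >= T) = 0.5 * exp(-(T - 1)/b) <= delta.
    """
    scale = max_partitions / epsilon
    threshold = 1 + scale * math.log(1 / (2 * delta))
    released = {}
    for key, count in counts.items():
        noisy = count + laplace(scale)
        if noisy >= threshold:
            released[key] = noisy
    return threshold, released

random.seed(0)
t, out = dp_count_with_threshold(
    {"enwiki": 5000, "tiny": 1}, epsilon=1.0, delta=1e-5
)
# "enwiki" clears the threshold; the single-user "tiny" partition
# almost certainly does not.
```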
Apr 21 2021
- You mention processing 500,000 rows in the README. Am I correct in assuming this is the process:
  1) gather the top-50 viewed articles from the API for that language,
  2) de-aggregate the data and load it into a database so that, e.g., an article with 50,000 pageviews becomes 50,000 separate rows,
  3) extract the data and run it through the diff-privacy framework (any filtering + addition of noise),
  4) return privacy-aware counts?
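Concretely, here's how I picture steps 2-4 in miniature (a toy sketch with made-up article names and counts, not the actual tool; it also makes the simplifying assumption of sensitivity 1, i.e. each row counts as one bounded contribution):

```python
import math
import random

def deaggregate(article_counts):
    """Step 2: turn aggregate counts into one row per pageview."""
    rows = []
    for article, views in article_counts.items():
        rows.extend([article] * views)
    return rows

def dp_counts(rows, epsilon=1.0):
    """Steps 3-4: re-count the rows and add Laplace(1/epsilon) noise."""
    counts = {}
    for article in rows:
        counts[article] = counts.get(article, 0) + 1
    scale = 1.0 / epsilon
    noisy = {}
    for article, count in counts.items():
        u = random.random() - 0.5
        noise = -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
        noisy[article] = round(count + noise)
    return noisy

random.seed(42)
rows = deaggregate({"Earth": 50000, "Moon": 1200})
print(len(rows))  # 51200 rows after de-aggregation
result = dp_counts(rows)  # privacy-aware counts, close to the originals
```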
Apr 20 2021
Just wanted to give you a quick status update — I have a somewhat functional re-implementation of @Isaac's tool using Golang/Beam up and running locally. I'm still working on getting it hosted in Toolforge (which doesn't yet play very nicely with Go as a service), but I'm hoping that should be done this week.
Apr 16 2021
Hi all — I'm Hal Triedman, the new Privacy Engineering intern. Over the last few days, I've been working on re-implementing the tool that @Isaac made (https://diff-privacy.toolforge.org) using Go's Apache Beam SDK and Google's Privacy on Beam package, rather than Python, Flask, and hand-coded DP functions.