User Details
- User Since
- Apr 5 2021, 8:13 PM (201 w, 1 d)
- Availability
- Available
- LDAP User
- Htriedman
- MediaWiki User
- Htriedman
Jan 9 2025
It's been a minute since I did an update on this — my apologies for that!
Nov 1 2024
Weekly update for the week ending on November 1 2024:
Decided to look at top 10 projects by active user size (as detailed on this page), which yields the following projects:
Oct 30 2024
Initial global (not per-project) analysis:
Oct 17 2024
Oct 8 2024
@SLyngshede-WMF Hello! Today is my first day back as a contractor. As of right now, I'm unable to log into Okta or officewiki — is there any process for reinstating access to those services? Happy to start a new ticket if needed.
Aug 26 2024
Hi! I am no longer a member of the privacy team. I would suggest you direct this request to @mpopov or someone on the privacy legal team at WMF.
Aug 20 2024
As of a few weeks ago I am no longer an employee of WMF, so feel free to change it over whenever @Dzahn!
Aug 13 2024
Got it — thanks for letting me know!
@KFrancis email sent!
Aug 6 2024
@Dzahn Got it — yeah, the one I recall signing a few years ago was through the phab UI. I'll wait for Kate to weigh in and send me the email!
@SLyngshede-WMF that sounds like a good plan! Can you link me to the volunteer NDA? I may have already signed it when I was an intern back in 2021, but I'd love to check. As for my personal email, I've updated wikitech and phab to be associated with it (still waiting on gitlab to update), as detailed higher up in this thread. Anything else I need to do there?
Aug 2 2024
@Dzahn Great questions! I'm planning on rejoining the Foundation as a contractor under WME in early October. I'll be working on data products in and around the analytics infrastructure.
Aug 1 2024
Yep, done!
I would change your email address on the wikitech account to your personal email.
I'm about to leave WMF, and I wanted to leave a comment here summarizing the design spec for this desired functionality.
Jun 4 2024
sounds good to me!
May 29 2024
Reading up on this thread now! I think that idea 1 sounds good and shouldn't be privacy-breaking if we report counts/percentages that are above 250 views. Given that we get something like 600m views per day, that lower threshold accounts for 0.000042% of our traffic.
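The percentage quoted above can be checked with a quick calculation. This is only a sketch of the arithmetic: both inputs are the rough figures from the comment (a 250-view threshold and roughly 600 million views per day), and the function name is made up for illustration.

```python
# Figures taken from the comment above; the daily total is an estimate.
THRESHOLD_VIEWS = 250
DAILY_VIEWS = 600_000_000  # "something like 600m views per day"


def share_of_traffic_percent(threshold: int, total: int) -> float:
    """Return `threshold` as a percentage of `total`."""
    return threshold / total * 100


share = share_of_traffic_percent(THRESHOLD_VIEWS, DAILY_VIEWS)
print(f"{share:.6f}%")  # prints 0.000042%
```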
May 28 2024
@AndrewTavis_WMDE this looks good to me! As for T365700, like I said there, it seems like the features you're publishing are pertaining to the underlying dataset, not user/editor/reader activity, so there are no direct privacy concerns.
May 24 2024
up to you! the only requirement is to redact/filter out that data
May 23 2024
Hi @AndrewTavis_WMDE! Taking a look at this in conjunction with the existing Data Publication Guidelines.
May 14 2024
I've been actively working on parsing on-wiki discussions in the context of the request a query archive, and I took a few hours to adapt that (hacky but mostly working) code to this task!
May 10 2024
I've been doing a lot of wikitext parsing work for the SPARQL dataset, including parsing on-wiki conversations. If I can figure that out for (say) the request a query archive, I may take a crack at this by adapting the same script to parse RfCs. Will keep everyone updated on this phab task!
Apr 30 2024
Got it! Since this is downstream of an existing event schema, and collects no user identifiers, granular geographic identifiers, or page identifiers, this data collection activity is lower risk. You can go ahead and proceed with building this.
Apr 26 2024
Hi @WMDE-Fisch! I'll be conducting this review on phab (at the request of the WMF Legal team until there's a formal agreement between WMF and WMDE). Here's what I originally posted on the L3SC ticket:
Apr 25 2024
Apr 17 2024
going to investigate the feasibility of this at the WMF Hackathon in a few weeks
I believe that we *do* actually publish the data about Singapore editor numbers — when I query wmf.geoeditors_public_monthly internally, I get ~2000 rows from Singapore, including from March 2024. And Singapore is considered a "Lower risk" country on the Country and Territory Protection List (you can open this tsv which is the official source of truth and ctrl-F "Singapore")...
Mar 28 2024
@VirginiaPoundstone should this be considered under the same auspices as T289532? It may be worthwhile to consider wrapping this up as part of that other task just to limit one-off work that might need to be repeated.
Mar 22 2024
Yes — there's no further work to be done.
Mar 20 2024
@Milimetric FWIW, about the weekly dataset: folks from product analytics told me that maintaining and publishing the monthly dataset is important for continuity with the existing dataset.
@kzimmerman @Milimetric happy to set up a meeting next week to discuss the differences between the DP and non-DP versions of the geoeditors monthly/weekly datasets
Hi @Ogiermaitre! Thanks for bringing this to our attention, and sorry it's taken so long to respond to you about this — I didn't know this ticket existed until 20 min ago. The US data problem has been fixed.
Mar 18 2024
@Ladsgroup is correct about this — this is already happening on an ad hoc basis in some cases where there may be concerns about editor safety for sensitive material.
Feb 27 2024
Feb 15 2024
thanks for awareness around your capacity, @DannyS712!
Jan 31 2024
There's no huge rush — we've deployed a short-term fix, but it requires some manual updating from WMF developers on a regular basis. If you can get to this in a few weeks that would be fine.
Hi @DannyS712! Have you made any progress on this?
Jan 24 2024
This task is high priority — the new country and territory protection list policy is out, and we'd like the internal list to reflect that as soon as possible.
Jan 23 2024
Jan 9 2024
I'm assuming that this is all meant to be oversight-level suppression, rather than admin-level (unless we want both as options).
Correct!
Jan 8 2024
@leila Hi! Would love to hear more about what you're thinking with regard to privacy :) Feel free to schedule something with me or continue the conversation here!
Nov 13 2023
Any updates on this?
Oct 31 2023
Update (very late but still necessary): As of Feb 2023, this data request has been completed!
Oct 19 2023
Hi @VirginiaPoundstone! Thanks for the detailed questions! I'll try to answer them one by one :)
Oct 13 2023
@JFishback_WMF I'll invite you to a meeting about this next week!
Oct 12 2023
@JAllemandou Thanks for the kind words! For the moment, yes — let's try to standardize use of the country protection list and try to avoid keeping multiple versions of the list hard-coded in jobs. I will work on the following:
- getting my proposed schema reviewed by legal and human rights
- implementing the new schema in hive
- updating documentation on wikitech
- getting this data release onto a DP framework (cc: @Isaac)
subscribing @Cleo_Lemoisson for visibility
Oct 11 2023
Hi! Thanks for flagging this, @Isaac! Definitely agree that this dataset is a great candidate for differential privacy (DP), which would also likely reduce the minimum publication threshold to <500. I'm happy to start working on that with you — it's a somewhat independent process from the discussion of the country protection list (CPL) and I think this dataset could benefit from it.
Oct 4 2023
@Eevans In that case, I'll change the data model to drop it! Will update this thread when it's done.
@Eevans Understood! I'll make that change to the schema soon.
Oct 3 2023
Hi all! I've made updates to the codebase to better comply with @Eevans' feedback, resulting in a greatly simplified interface. The design changes are listed below:
Sep 21 2023
@sbassett tagging you in this for visibility
Sep 20 2023
Some of them are just artifacts of starting from a fork of one of the legacy services. For example, we'll want to adopt a new (better) convention for keyspace and table naming; names like "local_group_default_T_dp_pageviews".data were generated by the RESTBase codebase. Likewise, the "_domain" attribute (which is always set to analytics.wikimedia.org for these services) was added to appease RESTBase, and isn't something we should be perpetuating. Easy changes, mostly cosmetic.
Sep 19 2023
I like this idea! Makes a lot of sense and covers more edge cases than my simpler solution was proposing. Feel free to implement this, and if you do, please write it up in a separate document and share it with me — could be very useful in future cases where we're considering releasing similar sensitive data with a relatively small number of raw data entries in the underlying dataset.
Sep 15 2023
As for reporting percentages, you can take an example from the new data publication guidelines. We considered how to report percentages in the "Threshold table" section of the policy: https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines#Threshold_table
Hi @ifried! Thanks for bringing this up — I wrote the initial set of recommendations for obfuscating event data in these contexts, and know that there are many contexts in which showing "<5" to an event organizer will leak the exact number of responses in that category. It is, at best, a partial fix that will be effective at deterring non-malicious people who have access to reports.
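To make the leak concrete, here is a minimal sketch of how a "<5" cell can be recovered by subtraction. It assumes the report also shows the total number of responses and that only one category is masked; the category names, counts, and function names are all hypothetical, not taken from any real report or codebase.

```python
def mask_small_counts(counts, threshold=5):
    """Replace any count below `threshold` with the string '<threshold>'."""
    return {
        category: (n if n >= threshold else f"<{threshold}")
        for category, n in counts.items()
    }


def recover_masked(report, total):
    """If exactly one cell is masked, recover it from the published total."""
    masked = [c for c, v in report.items() if isinstance(v, str)]
    if len(masked) != 1:
        return None  # more than one unknown: subtraction alone is not enough
    known = sum(v for v in report.values() if isinstance(v, int))
    return {masked[0]: total - known}


# Hypothetical survey responses for one event.
raw = {"A": 12, "B": 9, "C": 3}
report = mask_small_counts(raw)
print(report)                                      # {'A': 12, 'B': 9, 'C': '<5'}
print(recover_masked(report, sum(raw.values())))   # {'C': 3}
```

When two or more categories are masked, simple subtraction only bounds their sum, which is part of why the masking works as a partial deterrent rather than a guarantee.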
Sep 6 2023
Aug 22 2023
Hi all! It's been a few weeks without activity, so I'm following up on this request.
Aug 21 2023
Hi @BTullis! All of these Tumult Labs folks were working in more of an advisory role — even if their directories contain some uncommitted changes, you can delete them and remove their user profiles.
Aug 8 2023
Aug 3 2023
Aug 1 2023
Hi @odimitrijevic! Here's the gitlab repo I worked on during the documentathon :) https://gitlab.wikimedia.org/htriedman/documentathon-eventstream
Jul 25 2023
^^agree with the above analysis — if we can selectively remove the performer of suppressions, then this should be considered resolved.
Jul 24 2023
Thanks for your comments, @fkaelin! I'll get back to you about the topN pages once we meet about it.
Jul 21 2023
@Flomeier85 if you have any questions at all feel free to post them here or reach out to me via email at htriedman@wikimedia.org :)
Jul 20 2023
Jul 17 2023
@Vgutierrez this feature has been working as expected, and this ticket can be closed!
Jul 14 2023
Isaac and I spent some time brainstorming about this last month. Here's a google doc with a bunch of existing ideas in it!
Jul 11 2023
I would be strongly in favor of using mock data over synthetic data, at least for the moment. We should only have an explicit preference for synthetic data if there's a real need for the underlying statistical distribution of the fake data to mirror that of the real data. If it's just for performance testing, that shouldn't be necessary.
Jul 5 2023
Definitely would be pro-overriding the user-agent for fontcdn (and cdnjs) — that would make it significantly easier to argue that they should be considered ok to allowlist for third-party resources.
Jun 30 2023
agree! we can close this out
Jun 29 2023
We definitely should run a hadoop query (or a set of queries) to get a sense of access over the past 90 days. I pulled database codes / domain names from canonical_data.wikis where status = "open" and visibility = "private" and got the following list:
@elukey Feel free to remove it from Lift Wing for the moment! Thanks for letting me know.
Jun 27 2023
Jun 16 2023
May 30 2023
- What do you consider our next steps would be with this approach (using sample data as the initial source)? I ask that because you mention we shouldn’t use it yet in production environment with private or sensitive data so I guess we need to work more on it (to anonymize, for example). It’s not the case at this moment but it’s something we should explore for the future
See this task instead: https://phabricator.wikimedia.org/T318863
May 23 2023
May 22 2023
I don't quite get this. If I query this URL, I thought I would get views from Romania drilled down per project and page (see "FCV_Farul_Constanța", present on both enwiki and rowiki). Is this not true, or am I missing the definition of the splits?
Following up on this as the primary person working on this project for the past 18 months with some details of how this dataset is different from the existing API data:
Hi @Strainu! I was the primary person who worked on implementing this data release for the past 18 months and can describe how this data is different from the API.
May 16 2023
@Sfaci I ran this on stat1006 using the conda-create-stacked, conda-activate-stacked, and conda-deactivate-stacked built-in scripts. Are you using stat machines and conda?