
Htriedman (Hal Triedman)
Privacy Engineer

User Details

User Since
Apr 5 2021, 8:13 PM (158 w, 6 d)
Availability
Available
LDAP User
Htriedman
MediaWiki User
HTriedman (WMF)

Recent Activity

Wed, Apr 17

Htriedman added a project to T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page: Wikimedia-Hackathon-2024.

going to investigate the feasibility of this at the WMF Hackathon in a few weeks

Wed, Apr 17, 4:46 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams
Htriedman added a project to T362805: Build a tool (or tools) to easily visualize DP datasets: Toolforge.
Wed, Apr 17, 4:44 PM · Technical-Tool-Request, Wikimedia-Hackathon-2024
Htriedman moved T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page from Backlog to Hacking projects on the Wikimedia-Hackathon-2024 board.
Wed, Apr 17, 4:44 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams
Htriedman moved T362805: Build a tool (or tools) to easily visualize DP datasets from Backlog to Hacking projects on the Wikimedia-Hackathon-2024 board.
Wed, Apr 17, 4:41 PM · Technical-Tool-Request, Wikimedia-Hackathon-2024
Htriedman created T362805: Build a tool (or tools) to easily visualize DP datasets.
Wed, Apr 17, 4:39 PM · Technical-Tool-Request, Wikimedia-Hackathon-2024
Htriedman updated subscribers of T344624: Missing contributor stats for Singapore.

I believe that we *do* actually publish the data about Singapore editor numbers — when I query wmf.geoeditors_public_monthly internally, I get ~2000 rows from Singapore, including from March 2024. And Singapore is considered a "Lower risk" country on the Country and Territory Protection List (you can open this tsv which is the official source of truth and ctrl-F "Singapore")...

Wed, Apr 17, 3:50 PM · Data-Engineering, Data-Engineering-Wikistats

Thu, Mar 28

Htriedman added a comment to T327982: Add cawiki to clickstream dataset.

@VirginiaPoundstone should this be considered under the same auspices as T289532? It may be worthwhile to consider wrapping this up as part of that other task just to limit one-off work that might need to be repeated.

Thu, Mar 28, 10:05 PM · Data Products, Analytics

Mar 22 2024

Htriedman added a comment to T341139: project-title-country missing US data in recent data, and double quote escaping.

Yes — there's no further work to be done.

Mar 22 2024, 3:49 PM · Data Products, Data-Engineering

Mar 20 2024

Htriedman added a comment to T360073: Wikistats "Active Editors by Country" does not follow definition for active editors.

@Milimetric FWIW, about the weekly dataset: folks from product analytics told me that maintaining and publishing the monthly dataset is important for continuity with the existing dataset.

Mar 20 2024, 4:31 PM · Data Products, Data-Engineering, Movement-Insights, Data-Platform
Htriedman added a comment to T360073: Wikistats "Active Editors by Country" does not follow definition for active editors.

@kzimmerman @Milimetric happy to set up a meeting next week to discuss the differences between the DP and non-DP versions of the geoeditors monthly/weekly datasets

Mar 20 2024, 4:27 PM · Data Products, Data-Engineering, Movement-Insights, Data-Platform
Htriedman added a comment to T341139: project-title-country missing US data in recent data, and double quote escaping.

Hi @Ogiermaitre! Thanks for bringing this to our attention, and sorry it's taken so long to respond to you about this — I didn't know this ticket existed until 20 min ago. The US data problem has been fixed.

Mar 20 2024, 4:09 PM · Data Products, Data-Engineering

Mar 18 2024

Htriedman added a comment to T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page.

@Ladsgroup is correct about this — this is already happening on an ad hoc basis in some cases where there may be concerns about editor safety for sensitive material.

Mar 18 2024, 6:05 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams

Feb 27 2024

Htriedman added a project to T358601: Fix CI/CD issues in the differential-privacy repository: Privacy Engineering.
Feb 27 2024, 4:41 PM · Privacy Engineering, Data-Engineering
Htriedman created T358601: Fix CI/CD issues in the differential-privacy repository.
Feb 27 2024, 4:40 PM · Privacy Engineering, Data-Engineering

Feb 15 2024

Htriedman added a comment to T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page.

thanks for awareness around your capacity, @DannyS712!

Feb 15 2024, 4:04 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams

Jan 31 2024

Htriedman added a comment to T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page.

There's no huge rush — we've deployed a short-term fix, but it requires some manual updating from WMF developers on a regular basis. If you can get to this in a few weeks that would be fine.

Jan 31 2024, 10:03 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams
Htriedman added a comment to T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page.

Hi @DannyS712! Have you made any progress on this?

Jan 31 2024, 8:17 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams

Jan 24 2024

Htriedman added a comment to T355696: Update canonical_data.countries to reflect new country protection policy.

This task is high priority — the new country and territory protection list policy is out, and we'd like the internal list to reflect that as soon as possible.

Jan 24 2024, 10:08 PM · Movement-Insights

Jan 23 2024

Htriedman created T355696: Update canonical_data.countries to reflect new country protection policy.
Jan 23 2024, 5:07 PM · Movement-Insights

Jan 9 2024

Htriedman added a comment to T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page.

I'm assuming that this is all meant to be oversight-level suppression, rather than admin-level (unless we want both as options).

Correct!

Jan 9 2024, 5:10 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams

Jan 8 2024

Htriedman created T354577: Create Mediawiki "oversightprotect" action that suppresses usernames of all edits of a page.
Jan 8 2024, 11:32 PM · Wikimedia-Hackathon-2024, MediaWiki-Revision-deletion, MediaWiki-Page-protection, User-DannyS712, Privacy Engineering, Data-Engineering, Security, Event-Platform, EventStreams
Htriedman added a comment to T353306: [request] consultation for a whitepaper.

@leila Hi! Would love to hear more about what you're thinking with regard to privacy :) Feel free to schedule something with me or continue the conversation here!

Jan 8 2024, 6:17 PM · SecTeam-Processed, Privacy Engineering, Research

Nov 13 2023

Htriedman added a comment to T343855: AQS 2.0 differentially private pageviews deploy API.

Any updates on this?

Nov 13 2023, 8:11 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Oct 31 2023

Htriedman closed T207171: Have a way to show the most popular pages per country as Resolved.
Oct 31 2023, 11:24 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Htriedman closed T299627: Investigate releasing historical top-pageview-per-country data as Resolved.

Update (very late but still necessary): As of Feb 2023, this data request has been completed!

Oct 31 2023, 11:11 PM · Privacy Engineering, Data-Engineering

Oct 19 2023

Htriedman added a comment to T343855: AQS 2.0 differentially private pageviews deploy API.

Hi @VirginiaPoundstone! Thanks for the detailed questions! I'll try to answer them one by one :)

Oct 19 2023, 8:43 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Oct 13 2023

Htriedman added a comment to T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one.

@JFishback_WMF I'll invite you to a meeting about this next week!

Oct 13 2023, 9:06 PM · Data Engineering and Event Platform Team (Sprint 3)

Oct 12 2023

Htriedman added a comment to T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one.

@JAllemandou Thanks for the kind words! For the moment, yes — let's try to standardize use of the country protection list and try to avoid keeping multiple versions of the list hard-coded in jobs. I will work on the following:

  1. getting my proposed schema reviewed by legal and human rights
  2. implementing the new schema in hive
  3. updating documentation on wikitech
  4. getting this data release onto a DP framework (cc: @Isaac)
Oct 12 2023, 6:30 PM · Data Engineering and Event Platform Team (Sprint 3)
Htriedman updated subscribers of T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one.

subscribing @Cleo_Lemoisson for visibility

Oct 12 2023, 4:41 PM · Data Engineering and Event Platform Team (Sprint 3)

Oct 11 2023

Htriedman added a comment to T348504: [Data Platform] Update referer job to use global country deny list instead of a hard-coded one.

Hi! Thanks for flagging this, @Isaac! Definitely agree that this dataset is a great candidate for differential privacy (DP), which would also likely reduce the minimum publication threshold to <500. I'm happy to start working on that with you — it's a somewhat independent process from the discussion of the country protection list (CPL) and I think this dataset could benefit from it.
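
As a rough illustration of why DP can permit a lower publication threshold, here is a minimal sketch of a Laplace mechanism followed by a release threshold. It is not the pipeline discussed in this thread: the function name `dp_release`, the parameter values, and the assumptions that each contributor changes one count by at most 1 and that the set of keys is public are all hypothetical.

```python
import random


def dp_release(counts, epsilon, threshold, rng=None):
    """Add Laplace noise to each count and release only values >= threshold.

    Assumes sensitivity 1 (one contributor changes one count by at most 1)
    and a publicly known key set. Both are illustrative assumptions.
    """
    rng = rng or random.Random()
    released = {}
    for key, true_count in counts.items():
        # The difference of two i.i.d. exponentials with rate epsilon is a
        # Laplace sample with scale 1 / epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy = round(true_count + noise)
        if noisy >= threshold:
            released[key] = noisy
    return released


if __name__ == "__main__":
    example = {"Page_A": 1200, "Page_B": 130, "Page_C": 4}
    print(dp_release(example, epsilon=1.0, threshold=90, rng=random.Random(7)))
```

Because the noise already masks any single contributor, the threshold here mainly filters out values dominated by noise, which is why it can plausibly sit well below a threshold chosen for un-noised counts.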

Oct 11 2023, 6:29 PM · Data Engineering and Event Platform Team (Sprint 3)

Oct 4 2023

Htriedman added a comment to T343855: AQS 2.0 differentially private pageviews deploy API.

@Eevans In that case, I'll change the data model to drop it! Will update this thread when it's done.

Oct 4 2023, 10:06 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE
Htriedman added a comment to T343855: AQS 2.0 differentially private pageviews deploy API.

@Eevans Understood! I'll make that change to the schema soon.

Oct 4 2023, 7:52 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Oct 3 2023

Htriedman added a comment to T343855: AQS 2.0 differentially private pageviews deploy API.

Hi all! I've made updates to the codebase to better comply with @Eevans' feedback, resulting in a greatly simplified interface. I've listed the following design changes below:

Oct 3 2023, 11:50 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Sep 21 2023

Htriedman updated subscribers of T347104: Application Security Review Request : Fundraise Up scripts for Donatewiki.

@sbassett tagging you in this for visibility

Sep 21 2023, 9:19 PM · secscrum, Security, Application Security Reviews

Sep 20 2023

Htriedman added a comment to T343855: AQS 2.0 differentially private pageviews deploy API.

Some of them are just artifacts of starting from a fork of one of the legacy services. For example, we'll want to adopt a new (better) convention for keyspace and table naming; names like "local_group_default_T_dp_pageviews".data were generated by the RESTBase codebase. Likewise, the "_domain" attribute (which is always set to analytics.wikimedia.org for these services) was done to appease RESTBase, and isn't something we should be perpetuating. Easy changes, mostly cosmetic.

Sep 20 2023, 8:10 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Sep 19 2023

Htriedman added a comment to T346329: Update visibility rules of aggregated participant responses.

I like this idea! Makes a lot of sense and covers more edge cases than my simpler solution was proposing. Feel free to implement this, and if you do, please write it up in a separate document and share it with me — could be very useful in future cases where we're considering releasing similar sensitive data with a relatively small number of raw data entries in the underlying dataset.

Sep 19 2023, 6:16 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Campaign-Tools (Campaign-Tools-Current-Sprint), Campaign-Registration, CampaignEvents

Sep 15 2023

Htriedman added a comment to T346329: Update visibility rules of aggregated participant responses.

As for reporting percentages, you can take an example from the new data publication guidelines. We considered how to report percentages in the "Threshold table" section of the policy: https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines#Threshold_table

Sep 15 2023, 4:32 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Campaign-Tools (Campaign-Tools-Current-Sprint), Campaign-Registration, CampaignEvents
Htriedman added a comment to T346329: Update visibility rules of aggregated participant responses.

Hi @ifried! Thanks for bringing this up — I wrote the initial set of recommendations for obfuscating event data in these contexts, and know that there are many contexts in which showing "<5" to an event organizer will leak the exact number of responses in that category. It is, at best, a partial fix that will be effective at deterring non-malicious people who have access to reports.
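
The leak described above can be sketched as a simple differencing calculation. This is a hypothetical illustration, assuming the organizer also sees the total number of responses; the function name `recover_suppressed` and the example numbers are invented for the sketch.

```python
def recover_suppressed(total, reported):
    """Infer a masked cell from the total and the visible cells.

    `reported` maps category -> count, with None standing in for a cell
    that was masked as below the threshold. Assumes `total` is known to
    the viewer (an assumption of this sketch).
    """
    hidden = [key for key, value in reported.items() if value is None]
    visible_sum = sum(value for value in reported.values() if value is not None)
    if len(hidden) == 1:
        # Exactly one masked cell: its value is fully determined.
        return {hidden[0]: total - visible_sum}
    # With several masked cells only their combined sum is known.
    return {}


if __name__ == "__main__":
    report = {"18-24": 12, "25-34": 5, "35+": None}
    print(recover_suppressed(20, report))
```

When two or more cells are masked the exact values are not determined, but their sum still is, so masking alone only deters casual readers.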

Sep 15 2023, 4:29 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Campaign-Tools (Campaign-Tools-Current-Sprint), Campaign-Registration, CampaignEvents

Sep 6 2023

mfossati awarded T337258: Enable libmamba by default for conda environment solving a 100 token.
Sep 6 2023, 3:52 PM · Data-Platform-SRE, Data Engineering and Event Platform Team, Data-Engineering, Data Pipelines

Aug 22 2023

Htriedman updated subscribers of T343855: AQS 2.0 differentially private pageviews deploy API.

Hi all! It's been a few weeks without activity, so I'm following up on this request.

Aug 22 2023, 7:14 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Aug 21 2023

Htriedman added a comment to T340942: Check home/HDFS leftovers of tmtl.io contractors.

Hi @BTullis! All of these Tumult Labs folks were working in more of an advisory role — even if their directories contain some uncommitted changes, you can delete them and remove their user profiles.

Aug 21 2023, 4:41 PM · Data-Engineering
Htriedman added a comment to T344617: Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError.

Thanks for taking care of this @xcollazo and @BTullis! really appreciate you catching this while I was OOO

Aug 21 2023, 4:38 PM · Data-Platform-SRE

Aug 8 2023

Htriedman created T343855: AQS 2.0 differentially private pageviews deploy API.
Aug 8 2023, 8:13 PM · Cassandra, serviceops, AQS2.0, Service-deployment-requests, Services, SRE

Aug 3 2023

Htriedman updated the task description for T343304: MakeItSPARQL! - build a UI for the LLM that translates natural language into SPARQL queries for Wikidata.
Aug 3 2023, 5:51 PM · Wikimania-Hackathon-2023, Wikidata Query UI, patch-welcome, Wikidata
Htriedman updated the task description for T343304: MakeItSPARQL! - build a UI for the LLM that translates natural language into SPARQL queries for Wikidata.
Aug 3 2023, 5:39 PM · Wikimania-Hackathon-2023, Wikidata Query UI, patch-welcome, Wikidata

Aug 1 2023

Htriedman added a comment to T318863: [Event Platform] Event Platform and DataHub Integration.

Hi @odimitrijevic! Here's the gitlab repo I worked on during the documentathon :) https://gitlab.wikimedia.org/htriedman/documentathon-eventstream

Aug 1 2023, 5:48 PM · Data Engineering and Event Platform Team (Sprint 3), Data-Engineering, Data-Catalog, Event-Platform

Jul 25 2023

Htriedman added a comment to T342487: [Event Platform] Actor performing suppression revealed publicly.

^^agree with the above analysis — if we can selectively remove the performer of suppressions, then this should be considered resolved.

Jul 25 2023, 5:16 PM · Data-Engineering (Sprint 6), MW-1.42-notes (1.42.0-wmf.7; 2023-11-28), SecTeam-Processed, Privacy Engineering, Event-Platform, Vuln-Infoleak, Security

Jul 24 2023

Htriedman added a comment to T340149: Review and provide feedback to Guidelines for Data Publication.

Thanks for your comments, @fkaelin! I'll get back to you about the topN pages once we meet about it.

Jul 24 2023, 4:58 PM · Research

Jul 21 2023

Htriedman added a comment to T207171: Have a way to show the most popular pages per country.

@Flomeier85 if you have any questions at all feel free to post them here or reach out to me via email at htriedman@wikimedia.org :)

Jul 21 2023, 10:07 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews

Jul 20 2023

Htriedman added a comment to T340149: Review and provide feedback to Guidelines for Data Publication.

I know @Isaac has given feedback on this doc — @fkaelin any additional comments?

Jul 20 2023, 7:00 PM · Research

Jul 17 2023

Htriedman added a comment to T315676: Add DP cookie for pageview filtering.

@Vgutierrez this feature has been working as expected, and this ticket can be closed!

Jul 17 2023, 2:50 PM · SRE, Traffic

Jul 14 2023

Htriedman added a comment to T341907: Release datasets in support of Wikimedia-related AI modeling.

Isaac and I spent some time brainstorming about this last month. Here's a google doc with a bunch of existing ideas in it!

Jul 14 2023, 9:07 PM · Research, Epic

Jul 11 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

I would be strongly in favor of using mock data over synthetic data, at least for the moment. We should only have an explicit preference for synthetic data if there's a real need for the underlying statistical distribution of the fake data to mirror that of the real data. If it's just for performance testing, that shouldn't be necessary.
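
For context, mock data in this sense can be generated without touching any real records. The following is a minimal sketch under that assumption; the schema (`project`, `page_title`, `views`), the value ranges, and the function name `make_mock_rows` are hypothetical and not taken from any actual test environment.

```python
import random
import string


def make_mock_rows(n, seed=0):
    """Generate n fake pageview-like rows from a hypothetical schema.

    No real data is read, so the output carries no information about
    actual users; it also makes no attempt to mirror real distributions.
    """
    rng = random.Random(seed)
    projects = ["en.wikipedia", "de.wikipedia", "fr.wikipedia"]
    rows = []
    for _ in range(n):
        suffix = "".join(rng.choices(string.ascii_lowercase, k=8))
        rows.append(
            {
                "project": rng.choice(projects),
                "page_title": "Page_" + suffix,
                "views": rng.randint(0, 10_000),
            }
        )
    return rows


if __name__ == "__main__":
    for row in make_mock_rows(3):
        print(row)
```

A fixed seed keeps the rows reproducible across test runs, which is usually enough for load or performance testing.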

Jul 11 2023, 4:30 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0

Jul 5 2023

Htriedman added a comment to T335892: Get stats on Gadgets and Users scripts loading third-party resources.

Definitely would be pro-overriding the user-agent for fontcdn (and cdnjs) — that would make it significantly easier to argue that they should be considered ok to allowlist for third-party resources.

Jul 5 2023, 6:37 PM · WMF-General-or-Unknown, affects-Miraheze, SecTeam-Processed, Privacy Engineering, tech-decision-forum

Jun 30 2023

Htriedman added a comment to T316600: Broken DAG Error when trying to import Gitlab .tgz file into airflow.

agree! we can close this out

Jun 30 2023, 4:26 PM · Data Pipelines

Jun 29 2023

Htriedman added a comment to T331416: The nsfw model hangs in predict() after moving to Kserve 0.10.

@elukey Feel free to remove it from Lift Wing for the moment! Thanks for letting me know.

Jun 29 2023, 4:25 PM · Machine-Learning-Team

Jun 27 2023

nettrom_WMF awarded T337258: Enable libmamba by default for conda environment solving a Yellow Medal token.
Jun 27 2023, 8:45 PM · Data-Platform-SRE, Data Engineering and Event Platform Team, Data-Engineering, Data Pipelines

Jun 16 2023

nshahquinn-wmf awarded T337258: Enable libmamba by default for conda environment solving a Yellow Medal token.
Jun 16 2023, 2:14 AM · Data-Platform-SRE, Data Engineering and Event Platform Team, Data-Engineering, Data Pipelines

May 30 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.
  1. What do you consider our next steps would be with this approach (using sample data as the initial source)? I ask that because you mention we shouldn’t use it yet in production environment with private or sensitive data so I guess we need to work more on it (to anonymize, for example). It’s not the case at this moment but it’s something we should explore for the future
May 30 2023, 5:29 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0
Htriedman closed T337321: Automating pulling schemas from eventschema to datahub as Invalid.

See this task instead: https://phabricator.wikimedia.org/T318863

May 30 2023, 5:03 PM · Data-Engineering
Htriedman closed T280385: Apache Beam go prototype code for DP evaluation, a subtask of T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data, as Resolved.
May 30 2023, 4:28 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Htriedman closed T280385: Apache Beam go prototype code for DP evaluation as Resolved.
May 30 2023, 4:28 PM · Research-Freezer, Data-Engineering, Privacy Engineering, Privacy, Data-release
Htriedman closed T282195: ApacheBeam prototype for DP noise addition with pageview privacy units on top of Spark, a subtask of T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data, as Resolved.
May 30 2023, 4:27 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Htriedman closed T282195: ApacheBeam prototype for DP noise addition with pageview privacy units on top of Spark as Resolved.
May 30 2023, 4:27 PM · Research-Freezer, Data-Engineering-Radar, Privacy Engineering, Privacy, Data-release

May 23 2023

Htriedman created T337321: Automating pulling schemas from eventschema to datahub.
May 23 2023, 3:21 PM · Data-Engineering

May 22 2023

Htriedman added a comment to T207171: Have a way to show the most popular pages per country.

I don't quite get this. If I query this URL, I thought I would get views from Romania drilled down per project and page (see "FCV_Farul_Constanța", present on both enwiki and rowiki). Is this not true, or am I missing the definition of the splits?

May 22 2023, 9:54 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Htriedman added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Following up on this as the primary person working on this project for the past 18 months with some details of how this dataset is different from the existing API data:

May 22 2023, 7:00 PM · Data-Engineering, Research, Privacy Engineering, Privacy, Data-release
Htriedman added a comment to T207171: Have a way to show the most popular pages per country.

Hi @Strainu! I was the primary person who worked on implementing this data release for the past 18 months and can describe how this data is different from the API.

May 22 2023, 6:59 PM · Data-Engineering, Data-Engineering-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews
Htriedman added a project to T337258: Enable libmamba by default for conda environment solving: Data-Engineering.
May 22 2023, 5:55 PM · Data-Platform-SRE, Data Engineering and Event Platform Team, Data-Engineering, Data Pipelines
Htriedman created T337258: Enable libmamba by default for conda environment solving.
May 22 2023, 5:36 PM · Data-Platform-SRE, Data Engineering and Event Platform Team, Data-Engineering, Data Pipelines

May 16 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

@Sfaci I ran this on stat1006 using the conda-created-stacked, conda-activate-stacked, and conda-deactivate-stacked built-in scripts. Are you using stat machines and conda?

May 16 2023, 4:00 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0

May 10 2023

Htriedman updated the task description for T333853: [Session] Self-hosting ML models on Cloud Services.
May 10 2023, 5:47 PM · Wikimedia-Hackathon-2023

May 9 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

Did some basic experimentation on this front here: https://gitlab.wikimedia.org/htriedman/synth-data/

May 9 2023, 7:25 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0

May 5 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

If I have understood right, the library you mentioned would use source data to create synthetic data, and that way we could reuse your script to synthesize data from other data sources, so it sounds really interesting and useful.

May 5 2023, 6:37 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0

May 4 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

@Eevans Thanks so much for the clarification! This rationale makes a great deal of sense to me, and I can focus on trying to provide your team with a simple and repeatable script that can do this across a variety of underlying data sources.

May 4 2023, 6:12 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0
Htriedman added a comment to T335958: The soon-to-be-released pageview datasets should be linked from dumps page .

+1, I don't know exactly who maintains the analytics.wikimedia.org domain. There are also two other data releases with more historical data that hopefully can be linked here:

May 4 2023, 6:02 PM · Privacy Engineering, Data-Engineering

May 3 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

From a columnar perspective, how large will these datasets need to be? The computational resources required to generate good synthetic data scale nonlinearly with the number of columns. Are we talking about datasets with 40 columns, or 4?

Currently, with AQS 2.0, we are talking about 10 columns. But who knows whether, in the future with new projects, we might need to create larger datasets. Anyway, as I mention below, we won't need to run the process frequently, so data size shouldn't be a big issue.

May 3 2023, 9:37 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0

May 2 2023

Htriedman added a comment to T334851: Define a procedure/pattern to populate test environments.

Hi all! This is a really interesting problem, and I think that there are definitely some data privacy techniques that seem like they could be useful here — primarily, differentially-private data synthesis.

May 2 2023, 4:35 PM · Catalyst (Prototype leftovers 🍱), SecTeam-Processed, Privacy Engineering, serviceops-radar, WMF-Architecture-Team, Platform Engineering, Release-Engineering-Team, API Platform, AQS2.0

Apr 17 2023

Htriedman added a comment to T317167: Support for moving data from HDFS to public http file server.

Hi all! Any updates on this? I'd love to be able to publish the DP data that is currently stuck in the hdfs:///tmp folder :)

Apr 17 2023, 3:34 PM · Data Pipelines (Sprint 12), Data-Engineering-Planning

Apr 11 2023

Htriedman added a comment to T333001: Setup for allowing Airflow deployment via Git Repository.

I think I'd also prefer option B! Seems straightforward enough from a usability perspective.

Apr 11 2023, 4:27 PM · Data Pipelines (Sprint 12)

Apr 6 2023

Htriedman added a comment to T305082: Request for Private repos to be enabled.

Hi all! I've read this thread and I want to weigh in on this with a perspective from the Privacy Engineering team. I think that there are two primary facts to consider here:

  1. The Product Analytics team (PA) has an organizational mandate from WMF to be doing this work, and they have been doing this work, despite the organizational constraint that they cannot share the outputs of their analyses in the same place as the code that produces those outputs. This is unlikely to change any time soon.
  2. The main issue here is data that is sensitive (i.e. it could potentially be used in harmful ways by a malicious actor) but not confidential (i.e. not certain to be used in harmful ways / not defined as PII in the WMF Privacy Policy) — what @mpopov alludes to with the Turkish editors example above.
Apr 6 2023, 9:10 PM · Privacy Engineering, Release-Engineering-Team (Priority Backlog 📥), Privacy, User-brennen, GitLab (Administration, Settings & Policy), Product-Analytics

Mar 28 2023

Htriedman added a comment to T317167: Support for moving data from HDFS to public http file server.

@Ottomata: @Milimetric and I have talked about adding this data to AQS at some point in the short-/mid-term future, but I think we're going to wait for AQS 2.0 to be released before we start work on that

Mar 28 2023, 6:12 PM · Data Pipelines (Sprint 12), Data-Engineering-Planning
Htriedman added a comment to T317167: Support for moving data from HDFS to public http file server.

+1 to prioritizing this. My usecase for publishing data from HDFS is the following:

Mar 28 2023, 4:12 PM · Data Pipelines (Sprint 12), Data-Engineering-Planning
Htriedman closed T333264: Add analytics-platform-eng-admins on stat* hosts as Resolved.

Resolving this ticket and adding my use case to T317167.

Mar 28 2023, 3:55 PM · Data-Engineering
Htriedman added a comment to T333264: Add analytics-platform-eng-admins on stat* hosts.

@Ottomata Any of these options would work for me:

  1. enabling analytics-platform-eng on stat machines
  2. enabling data publication to /srv/published from airflow machines
  3. enabling data publication directly from HDFS using hdfs-rsync (as mentioned in T317167)
Mar 28 2023, 3:44 PM · Data-Engineering
Htriedman created T333264: Add analytics-platform-eng-admins on stat* hosts.
Mar 28 2023, 1:00 AM · Data-Engineering
Htriedman added a comment to T331647: Grant Hal deployment rights.

@Dzahn Thank you so much for the help explaining this! Makes a ton of sense, and I'll create that ticket soon.

Mar 28 2023, 12:51 AM · SRE, SRE-Access-Requests

Mar 27 2023

Htriedman added a comment to T331647: Grant Hal deployment rights.

@MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as analytics-platform-eng on stat machines by using sudo -u analytics-platform-eng <cmd>... and am being prompted for my user password — I don't recall ever having used a password to access my stat machines, and it's not any password I can remember. Do you know where I might be able to go for those credentials?

Mar 27 2023, 11:24 PM · SRE, SRE-Access-Requests

Mar 21 2023

Htriedman updated subscribers of T331647: Grant Hal deployment rights.

@Jcross asking for approval from you — I need these rights in order to deploy DP scripts that will run on a schedule on airflow

Mar 21 2023, 3:27 PM · SRE, SRE-Access-Requests

Mar 20 2023

Htriedman added a comment to T331067: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for AranyaP.

Hi @MatthewVernon! We're currently running into some weird errors with Aranya's permissions, specifically regarding access to Turnilo and Superset. Is there any way of addressing that on this thread? Or should we start a new ticket? Thanks so much.

Mar 20 2023, 8:15 PM · SRE, SRE-Access-Requests

Mar 16 2023

Htriedman added a comment to T331647: Grant Hal deployment rights.

just bumping this!

Mar 16 2023, 7:35 PM · SRE, SRE-Access-Requests

Mar 8 2023

Htriedman added a comment to T331416: The nsfw model hangs in predict() after moving to Kserve 0.10.

@elukey not exactly sure what's going on here, but I can check into it and get back to you!

Mar 8 2023, 6:17 PM · Machine-Learning-Team

Mar 2 2023

Htriedman added a comment to T330234: Differential privacy airflow-dags merge request.

@JArguello-WMF nope! I chatted with @Milimetric a couple of days ago and he said that we're good to go (as an initial MVP release, at least). Waiting on him to feel better to give the final approval and merge. I'll follow up on a new ticket if there's anything else I need besides that.

Mar 2 2023, 7:29 PM · Data Pipelines (sprint 10), Data-Engineering

Feb 28 2023

Htriedman created T330793: Address dataset frontrunning attack in dumps.
Feb 28 2023, 8:46 PM · Dumps-Generation

Feb 21 2023

Htriedman created T330234: Differential privacy airflow-dags merge request.
Feb 21 2023, 9:17 PM · Data Pipelines (sprint 10), Data-Engineering

Feb 9 2023

Htriedman added a comment to T329209: add Hal Triedman (htriedman) to ops-l mailing list.

@fgiunchedi I just signed up via lists.wikimedia.org! Thanks for getting back to me.

Feb 9 2023, 5:12 PM · SRE

Feb 8 2023

Htriedman created T329209: add Hal Triedman (htriedman) to ops-l mailing list.
Feb 8 2023, 5:46 PM · SRE

Feb 6 2023

Htriedman added a comment to T315676: Add DP cookie for pageview filtering.

@Vgutierrez thanks so much! taking a look now

Feb 6 2023, 4:01 PM · SRE, Traffic

Jan 31 2023

Htriedman added a comment to T328152: Some users' presto queries are no longer working in Superset.

Up and running! thanks for the help

Jan 31 2023, 5:15 PM · Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)), Data-Engineering
Htriedman added a comment to T328152: Some users' presto queries are no longer working in Superset.

My SQL Lab on superset has also not been working for the past week or so!

Jan 31 2023, 5:05 PM · Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-07)), Data-Engineering