Page MenuHomePhabricator

Schemas Event Secondary: Enhancements for Review
Closed, ResolvedPublic2 Estimated Story Points

Description

Two opportunities identified by Security during threat modeling were:

  1. Deterring instrumentation users' over-collectiong of data, including PII in experiment definitions and via JS components
  1. Deterring schema changes out of the data collection policy of the originally vetted schema

Here we can apply some simple mitigations in service of these opportunities. Here's the suggested approach:

  1. Enhancement of https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary with the two approver rule. The Airflow DAGs repo as shown at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/933/diffs has this. The two approver rule means that the merge requester may be one approver, but the other approver must be someone else.
  1. Enhancement of https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary with a default merge request template ( https://docs.gitlab.com/user/project/description_templates/#create-a-merge-request-template ) requesting confirmation that
  • The commit message contains Phabricator task(s) associated with the change if it creates/updates/deletes schemas or otherwise changes software function in this repo.
  1. Review and possible update to permissions in https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary to ensure that the effective permissions for independent review makes sense.
  1. Update to the README.md in https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary to reflect things people should know for 1-3. The verbiage presently indicates a lowered set of expectations for review and approval, but obviously here we're increasing those expectations.

Details

Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Changing 'repo' to 'repository'repos/data-engineering/schemas-event-secondary!67dr0ptp4ktREADME-updatemaster
Edit README.mdrepos/data-engineering/schemas-event-secondary!63dr0ptp4ktdr0ptp4kt-master-patch-17135master
README and MR template updates for review enhancementsrepos/data-engineering/schemas-event-secondary!61dr0ptp4kttest-approvalsmaster
Require additional processes for MRsrepos/data-engineering/schemas-event-secondary!60dr0ptp4ktT395395-review-enhancementsmaster
Customize query in GitLab

Event Timeline

Milimetric set the point value for this task to 2.Jun 12 2025, 3:21 PM

For posterity's sake, here are the branch merge rules in the repo at the moment. In other words, it's possible for a developer entitled user to be the person doing the merge operation with the ▶️ button (they don't have to be a maintainer), so long as someone else in the repo's user list approved first. And it can go the other way around, where the submitter approves and then someone else does the merge operation.

{F62434186 size=full}

TODO from meeting this morning: @dr0ptp4kt make little communication draft for @JVanderhoop-WMF review and then communicate out further.

Deterring schema changes out of the data collection policy of the originally vetted schema

Q: What are the schema changes that can alter data collection?

IIRC, the only way a schema change can affect data collection are by adding the http.client_ip field. Are there any others?

Ideally, we'd stop setting this (and any data?) based on presence of schema field too, just like we did in T382173 for user-agent.

user-agent collection should be opt-out after T382173: Enable Event Platform streams to opt out of collecting User-Agent data, and if we do T384964: [Event Platform] Disable default collection of user agent for analytics streams, it will be opt-in instead.(except for PHP/server sent stuff, which needs code changes).

But, in T382173#10504959 (see linked Slack thread), we decided that really, the right thing to do is T385180: Implement agent.ua_string as contextual attribute.

Anyway, I'm not opposed to this change! But, I have a hunch that if we were to do all the 'right' things and not rely on schemas to vary how data is collected, then we wouldn't gain any extra data collection deterrence by restricting merge rights.

Deterring schema changes out of the data collection policy of the originally vetted schema

Q: What are the schema changes that can alter data collection?

IIRC, the only way a schema change can affect data collection are by adding the http.client_ip field. Are there any others?

As broad as https://foundation.wikimedia.org/wiki/Legal:Data_Collection_Guidelines#Data_collection_risk_tiering_grid . I think in practice the stuff that might be somewhat more likely to be (re)introduced would be IP address (as you rightly note), geo info falling below a coarseness threshold, user-provided demographic data, and things tied together with app install ID.

(Hopefully for the edge unique case, we actually have very little of any of those...this makes me wonder if we would do well in the README.md and the checklist to also provide a pointer to happy paths to follow. @JVanderhoop-WMF and I have a sync tomorrow and maybe we can see if there's some verbiage we want to add in the message I'll send out...and tee up for another MR.)

Ideally, we'd stop setting this (and any data?) based on presence of schema field too, just like we did in T382173 for user-agent.

Agreed! Of course there can be legitimate needs for it, but that hopefully is less common.

user-agent collection should be opt-out after T382173: Enable Event Platform streams to opt out of collecting User-Agent data, and if we do T384964: [Event Platform] Disable default collection of user agent for analytics streams, it will be opt-in instead.(except for PHP/server sent stuff, which needs code changes).

But, in T382173#10504959 (see linked Slack thread), we decided that really, the right thing to do is T385180: Implement agent.ua_string as contextual attribute.

Yeah, that continues to seem sensible a lot of the time. I understand that it's typical enough to want to be able to drill down into UA info a little, though, so I think we'll need to see. Of course browser vendors are bolting things down a bit more, although there continue to be ways to draw on the info that's available.

Anyway, I'm not opposed to this change! But, I have a hunch that if we were to do all the 'right' things and not rely on schemas to vary how data is collected, then we wouldn't gain any extra data collection deterrence by restricting merge rights.

I think that's largely the case, too. I guess we can expect there will continue to be non-A/B test things and other product health metrics and detection capabilities where it will still be useful for folks to quickly check assumptions (probably more for initial schema drafting than changes, although same idea, either way)...I do suspect in that future state, though, that's more the 20% case and not the 80% case.

This has now been communicated with some individuals as well as on the DPE Slack channel and cross-posted to another couple channels plus noted on one MR on the repo itself. It'll be shared in a couple days on a larger Slack channel.

As for GitLab UI settings updates, here they are, see screenshots in this here comment.

  • Changed 'Allowed to push and merge' on the master branch to 'No one'.
  • Switching the 'Merge method' from 'Merge commit' to 'Merge commit with semi-linear history'. This was mainly to encourage that folks are branching from up-to-date code.
  • Enabling 'Pipelines must succeed' in 'Merge Checks', so that the approval requirement needs to be met (unless using the break-glass option)
  • Addition of a read_api token, and its value (that would be in the 'value' text field below) included as a CI/CD project variable that's used in the approval GitLab step.

Screenshot 2025-06-26 at 1.13.14 PM.png (553×1 px, 106 KB)

Screenshot 2025-06-26 at 1.14.02 PM.png (920×1 px, 235 KB)

Screenshot 2025-06-26 at 1.15.55 PM.png (510×1 px, 95 KB)

Screenshot 2025-06-26 at 1.24.03 PM.png (1×3 px, 727 KB)

This in addition to the files updated within the repo for CI and the default MR template.