
Automate Wikimedia IDM ("Bitu") and GrowthBook role synchronization
Open, Needs Triage, Public

Description

This task is for implementation of the Enforcement section of T419622: Verify GrowthBook access approach parts 1-3.

This should be done first for growthbook-next.wikimedia.org to confirm proper function, then for growthbook.wikimedia.org.

For economy, this depends on completion of both T420690: Create Project in GrowthBook, then migrate materials and access to it and T420688: Create new Wikimedia IDM ("Bitu") LDAP groups for GrowthBook. Technically it would be possible to try the sync out without the Project first, and then add the Project confinement notion afterward.

This is being dropped into Sprint 21 (it will be DP SRE work) for work tracking, but may be dragged into a subsequent sprint.

Details

Other Assignee
RKemper
Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
growthbook: instrument ldap-sync via airflow | repos/data-engineering/airflow-dags!2208 | ryankemper | T420691-ldap-sync-dag | main
ldap-sync: enable ruff&mypy in CI + fix format/imports | repos/data-engineering/growthbook!44 | ryankemper | T420691-lint-typecheck | main
ldap-sync: drop threshold feature & config module | repos/data-engineering/growthbook!43 | ryankemper | T420691-3.5-trim | main
ldap-sync: GrowthBook API client | repos/data-engineering/growthbook!42 | ryankemper | T420691-5-growthbook-client | main
ldap-sync: fix LDAP_CA_CERT_PATH default to system CA bundle | repos/data-engineering/growthbook!41 | ryankemper | T420691-10-ca-cert-default | T420691-9-spec
ldap-sync: add README.md | repos/data-engineering/growthbook!40 | ryankemper | T420691-9-spec | main
ldap-sync: applier and CLI (full pipeline) | repos/data-engineering/growthbook!39 | ryankemper | T420691-8-applier-cli | main
ldap-sync: GrowthBook member collector and scope filter | repos/data-engineering/growthbook!38 | ryankemper | T420691-7-growthbook-collector | main
ldap-sync: LDAP collector | repos/data-engineering/growthbook!37 | ryankemper | T420691-6-ldap-collector | main
ldap-sync: GrowthBook API client | repos/data-engineering/growthbook!36 | ryankemper | T420691-5-growthbook-client | T420691-4-data-yaml-collector
ldap-sync: resolver and plan (reconciliation core) | repos/data-engineering/growthbook!34 | ryankemper | T420691-3-resolver-plan | main
ldap-sync: config, logging, and metrics modules | repos/data-engineering/growthbook!33 | ryankemper | T420691-2-config-logging | main
ldap-sync: project scaffolding | repos/data-engineering/growthbook!32 | ryankemper | T420691-1-scaffolding | main
[WIP] ldap-sync: Scaffold package and reconciliation core | repos/data-engineering/growthbook!31 | ryankemper | T420691-ldap-sync | main

Event Timeline

Change #1270558 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] growthbook: Fix env var indent in job template

https://gerrit.wikimedia.org/r/1270558

High-level Spec

What this spec establishes

A Kubernetes CronJob that reconciles GrowthBook user role assignments from authoritative sources (Bitu LDAP groups + puppet data.yaml). Runs every 10–30 minutes.

Current state of GrowthBook (prod)

~20 active users, all @wikimedia.org. Since mpopov's migration earlier today (https://phabricator.wikimedia.org/T420690#11817235), everyone has globalRole = NoAccess and a projectRoles[Wikimedia] entry carrying their actual tier. New SSO logins auto-provision with No Access — exactly the fail-closed posture we want.

3 "orphan" users also exist (rows in the users table but not in any org); these are invisible to the REST API and the sync can't see or affect them.

Sources & boundaries

Reads (sources of truth):

  • Bitu LDAP — three groups (GrowthBook-Admin, GrowthBook-CustomElevatedAccess, GrowthBook-ReadOnly)
  • puppet data.yaml — POSIX group membership (analytics_privatedata_users gates all access; analytics-product-users, analytics-wmde-users, deployment qualify CustomElevatedAccess)

Writes (only):

  • GrowthBook REST API: POST /members/{id}/role to set the Wikimedia project role. Never DELETE /members/{id} — that removes users from the entire org, which would wipe any future other-project roles they have. (DELETE also leaves the users-table row behind as an "orphan" — an internal admin-UI concept, cosmetic — but the real reason to avoid it is cross-project scope.)
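
To make the read side concrete, here is a minimal sketch of how the two sources might combine into a target role for one user. The precedence between groups, and the idea that CustomElevatedAccess needs both the Bitu group and a qualifying POSIX group, are assumptions for illustration only; the real rules are in the detailed spec. The function name resolve_role and the role strings are hypothetical.

```python
from typing import Optional, Set

POSIX_GATE = "analytics_privatedata_users"
ELEVATED_POSIX = {"analytics-product-users", "analytics-wmde-users", "deployment"}

# Assumed role labels; the actual GrowthBook role ids may differ.
ADMIN, ELEVATED, READONLY = "admin", "customElevatedAccess", "readonly"


def resolve_role(bitu_groups: Set[str], posix_groups: Set[str]) -> Optional[str]:
    """Return the target Wikimedia-project role, or None for no access."""
    # The POSIX gate applies to everyone: without it, no Bitu group counts.
    if POSIX_GATE not in posix_groups:
        return None
    if "GrowthBook-Admin" in bitu_groups:
        return ADMIN
    # Assumption: elevated access needs the Bitu group AND a qualifying POSIX group.
    if "GrowthBook-CustomElevatedAccess" in bitu_groups and posix_groups & ELEVATED_POSIX:
        return ELEVATED
    # Assumption: an elevated-group member without a qualifying POSIX group
    # falls back to read-only rather than losing access entirely.
    if bitu_groups & {"GrowthBook-CustomElevatedAccess", "GrowthBook-ReadOnly"}:
        return READONLY
    return None
```

Returning None for "no access" keeps the fail-closed default explicit: a user who matches no rule ends up with nothing rather than with a leftover role.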

Pipeline

1. Collect    — read LDAP, data.yaml, GrowthBook state (parallel, read-only)
2. Resolve    — compute target role per user (POSIX gate + Bitu rules)
3. Plan       — diff current vs. target, categorize actions
4. Safety     — circuit breaker: abort if deletes OR downgrades > max(5, 10% of pop)
5. Apply      — in order (deletions → downgrades → upgrades → grants)
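
Steps 3 to 5 can be sketched roughly as below. This is an illustration under stated assumptions, not the ldap-sync code: the tier ordering, the Action type, and the function names are hypothetical, and the breaker is read as "either count alone exceeding the limit aborts the run". A "delete" here means revoking the Wikimedia project role, not calling DELETE /members/{id}.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed tier ordering, lowest to highest; None means "no access".
RANK = {None: 0, "readonly": 1, "customElevatedAccess": 2, "admin": 3}
APPLY_ORDER = ["delete", "downgrade", "upgrade", "grant"]


@dataclass(frozen=True)
class Action:
    user: str
    kind: str  # one of APPLY_ORDER
    current: Optional[str]
    target: Optional[str]


def plan(current: Dict[str, Optional[str]], target: Dict[str, Optional[str]]) -> List[Action]:
    """Diff current vs. target roles and return actions in apply order."""
    actions = []
    for user in sorted(set(current) | set(target)):
        cur, tgt = current.get(user), target.get(user)
        if cur == tgt:
            continue
        if tgt is None:
            kind = "delete"
        elif cur is None:
            kind = "grant"
        elif RANK[tgt] < RANK[cur]:
            kind = "downgrade"
        else:
            kind = "upgrade"
        actions.append(Action(user, kind, cur, tgt))
    # sorted() is stable, so users stay alphabetical within each kind and
    # revocations always come before any new privilege is handed out.
    return sorted(actions, key=lambda a: APPLY_ORDER.index(a.kind))


def breaker_tripped(actions: List[Action], population: int,
                    floor: int = 5, fraction: float = 0.10) -> bool:
    """True if deletes or downgrades exceed max(floor, fraction * population)."""
    limit = max(floor, population * fraction)
    deletes = sum(a.kind == "delete" for a in actions)
    downgrades = sum(a.kind == "downgrade" for a in actions)
    return deletes > limit or downgrades > limit
```

With roughly 20 users the 10% term is only 2, so the 5-user floor is what actually applies at the current population.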

Safety invariants

  • Fail-closed on any source error (LDAP, gitiles, GrowthBook API) — never apply with stale/missing data.
  • Global role pinned to readonly on every write so a bug in the sync cannot accidentally grant someone cross-project admin.
  • Threshold circuit breaker: a run proposing too many revocations aborts with exit code 2 unless explicitly overridden.
  • Dry-run mode always available; shows the full proposed diff.
  • Audit logging: one JSON line per decision with rule number, source groups, result, sync_run_id for Logstash correlation.
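
As a rough illustration of the audit-logging invariant, one decision could be serialized as below. The fields rule, source_groups, result and sync_run_id come from the bullet above; the "user" field, the function name audit_line, and the sample values are assumptions, and the real logging schema is in the detailed spec.

```python
import json
import uuid


def audit_line(sync_run_id, user, rule, source_groups, result):
    """Serialize one sync decision as a single JSON line."""
    record = {
        "sync_run_id": sync_run_id,
        "user": user,  # assumed field name
        "rule": rule,
        "source_groups": sorted(source_groups),  # sorted for stable output
        "result": result,
    }
    return json.dumps(record, sort_keys=True)


# One id per run, so every line from the same run can be correlated later.
run_id = str(uuid.uuid4())
print(audit_line(run_id, "example-user", 3, {"GrowthBook-ReadOnly"}, "readonly"))
```

Generating the run id once and passing it into every call is what lets all decisions from a single run be grouped together downstream.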

Semi-Open Questions

  1. Does the role mapping match the team's mental model?
  2. Our sync's authority is scoped to the Wikimedia project specifically (not the whole GB org). That way a future WMDE-scoped or Test-Kitchen-scoped project wouldn't be inadvertently disrupted by our revocations. OK to proceed on that assumption?
  3. Any concerns with the safety threshold defaults (5-user floor or 10% of population)?

Detailed spec (rules, algorithm, logging schema, metrics, validation checklist) available on request. It's very large, so I figure getting buy-in on the simple high-level spec is more immediately useful.

@BTullis raised the question of whether we should stick with the k8s cronjob approach or just integrate this into airflow. I can see valid arguments for both sides, and I'm currently leaning airflow. The underlying script will be the same either way, so for now I'm just flagging this as a deferred decision to revisit later; the immediate priority is getting the full spec posted, getting team buy-in, and beginning to test out the script.

Uploaded initial implementation (WIP, and commit messages need serious cleanup) -> https://gitlab.wikimedia.org/repos/data-engineering/growthbook/-/merge_requests/31

Current state: exercised a subset of codepaths locally and confirmed they work, but I'll need to do a follow-up test on staging from a wmf host to fully test the functionality (my initial local smoke test omitted the LDAP part of the codebase) before we're ready to proceed to actually deploying in k8s via cronjob (or alternatively, airflow).

Eventually the full spec.md will live as a first-class citizen in the growthbook repo, but for now, here's a phab paste: P91247 (it's several hundred lines so recommend a nice markdown viewer for readability's sake)

Change #1270558 abandoned by Ryan Kemper:

[operations/deployment-charts@master] growthbook: Bump vendored job templ 1.0.1 → 2.0.0

Reason:

This bug is real but non-exercised, and we're not going the k8s-cronjob route that was going to necessitate this change. Bumping the vendored job template is worth it down the road, but there are other charts that have the same issue, so it likely warrants doing them all at once.

https://gerrit.wikimedia.org/r/1270558

Haven't posted here in a while. Current state: the actual ldap-sync script is fully done (which was the bulk of the work).

Airflow deployment is ongoing. Patch 1 is ready now; it sets up the initial DAG skeleton and some basic validation testing. I tested it with airflow-devenv and verified it's working; a log is attached at the end of this comment.

Steps remaining

Remaining steps AFAIK until this project is more-or-less wrapped:

  1. Patch 2a (deployment-charts): chart changes
  2. Patch 2b (private puppet): add the API key value (already exists in private puppet under the growthbook chart's namespace, where the GB backend reads it; need to duplicate under the airflow-test-k8s namespace so the DAG's task pod can read it too)
  3. helmfile apply on airflow-test-k8s -> Secret gets the key
  4. Patch 2c (airflow-dags, one-line): pin IMAGE tag
  5. Trigger DAG manually via Airflow UI to test
  6. Patch 3 (airflow-dags, one-line): flip schedule=None -> "*/30 * * * *"

airflow-devenv log

Validated the secret-name resolution end-to-end via airflow-devenv

Spun up a dev instance; the DAG-serialization step succeeded, confirming the scheduler imports the DAG cleanly (no KeyError on os.environ["SERVICE_IDENTIFIER"]):

$ airflow-devenv create --branch T420691-ldap-sync-dag --dags-folder test_k8s
- Creating PG database dev_ryankemper ✅
- Installing airflow dev environment dev-ryankemper ✅
- Waiting for the kerberos pod to start ✅
- Kerberos credentials successfully setup ✅
- Waiting for the scheduler pod to be ready ✅
- Forcing DAG serialization ✅
- Waiting for the webserver pod to be ready ✅

Confirmed SERVICE_IDENTIFIER resolves to the per-instance value via the chart's app.airflow.env helper, and the matching <SERVICE_IDENTIFIER>-secret-config Secret exists in the namespace:

$ kubectl -n airflow-dev exec airflow-dev-ryankemper-scheduler-6b76b64bb8-p7s6t -- printenv SERVICE_IDENTIFIER
airflow-dev-ryankemper

$ kubectl -n airflow-dev get secrets | grep ryankemper
airflow-dev-ryankemper-secret-config           Opaque    2    11m

Inspected the DAG's serialized operator: the secretKeyRef.name resolves to <SERVICE_IDENTIFIER>-secret-config, matching the pattern the chart already uses for its own internal keys:

$ kubectl -n airflow-dev exec airflow-dev-ryankemper-scheduler-6b76b64bb8-p7s6t -- \
    python3 -c 'import json; from kubernetes.client import ApiClient; from airflow.models import DagBag; dag = DagBag().get_dag("growthbook_ldap_sync"); print(json.dumps(ApiClient().sanitize_for_serialization(dag.get_task("sync").env_vars), indent=2))'
[
  {
    "name": "GROWTHBOOK_LDAP_SYNC_API_KEY",
    "valueFrom": {
      "secretKeyRef": {
        "key": "GROWTHBOOK_LDAP_SYNC_API_KEY",
        "name": "airflow-dev-ryankemper-secret-config"
      }
    }
  },
  ...
]

Note that the actual GROWTHBOOK_LDAP_SYNC_API_KEY key isn't in the Secret yet; that's ongoing work.
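
For reference, the secret-name resolution checked above boils down to something like the following sketch. It builds the serialized shape as a plain dict using only the standard library; the actual DAG presumably uses the Kubernetes client models, and the helper name secret_env_var is hypothetical.

```python
import os


def secret_env_var(key: str, environ=os.environ) -> dict:
    """Build an env var entry that reads `key` from the per-instance Secret."""
    # Direct indexing is deliberate: a missing SERVICE_IDENTIFIER raises
    # KeyError when the DAG is imported, instead of yielding a wrong Secret name.
    secret_name = f"{environ['SERVICE_IDENTIFIER']}-secret-config"
    return {
        "name": key,
        "valueFrom": {"secretKeyRef": {"key": key, "name": secret_name}},
    }
```

Called with SERVICE_IDENTIFIER=airflow-dev-ryankemper and the key GROWTHBOOK_LDAP_SYNC_API_KEY, this yields the same structure as the first entry in the serialized output above.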