Paste P91247

Growthbook LDAP sync initial spec
Authored by RKemper on Apr 21 2026, 8:00 AM.
# T420691 — Automate Bitu/LDAP and GrowthBook role synchronization
## Overview
A Python sync script that reconciles Bitu-managed LDAP group membership
with GrowthBook user role assignments every 10–30 minutes. The sync
cross-references two independent source systems:
- **LDAP** (Bitu group membership) — the user-facing access-control
groups. Owners add or remove users here; the sync enforces the
result in GrowthBook.
- **puppet `data.yaml`** (POSIX group membership) — used as a hard
precondition. A user must be in the required POSIX group(s) before
any Bitu-driven access is granted.
Data flow: LDAP + `data.yaml` (read-only) → reconciliation logic →
GrowthBook REST API (read-write).
WMF-specific terms used throughout:
- **Bitu**: WMF's IDM / LDAP group-management interface. Provides the
user-facing UI for adding/removing LDAP group members; the groups
themselves live in LDAP.
- **gitiles**: WMF's git-repo browser, serves raw file contents over
HTTP. Used here to fetch `data.yaml` from the `operations/puppet`
repo without needing a local checkout.
- **`data.yaml`**: `modules/admin/data/data.yaml` in `operations/puppet`
— the authoritative source for POSIX group membership across WMF
infrastructure.
- **Blubber**: WMF's container build-config tool (one YAML file per
image, declarative variant definitions).
The sync script lives in
[gitlab.wikimedia.org/repos/data-engineering/growthbook](https://gitlab.wikimedia.org/repos/data-engineering/growthbook),
the same repo that builds the upstream GrowthBook image.
**Deployment target**: current implementation targets a k8s CronJob in
the existing GrowthBook Helm chart. Airflow is under consideration as
an alternative execution environment, deferred pending team input. The
collect/resolve/apply pipeline is scheduler-agnostic; the operational
plumbing in this spec (NetworkPolicy, Pushgateway, ConfigMap mounts,
etc.) assumes CronJob and would need revisiting if Airflow wins.
## Key concepts (referenced throughout)
- **projectRoles[]**: GrowthBook's per-member per-project role
assignment array. Each entry maps a project ID to a role string.
This sync writes exclusively into the managed project's entry; the
user's GrowthBook *global* role is always pinned to `readonly`.
- **POSIX gate**: a hard precondition before role evaluation — the
user must belong to `analytics-privatedata-users` (in `data.yaml`).
Non-members get `NO_ACCESS` regardless of their Bitu memberships.
- **Managed project**: the single GrowthBook project this sync has
authority over (configured via `GROWTHBOOK_PROJECT_ID`). Roles in
other projects are left untouched.
- **Threshold circuit breaker**: the sync aborts if a single run
would delete or downgrade more than `max(5, 10% of managed
population)` members — where managed population counts only users
the sync would actively touch (excludes `SKIPPED` and `PENDING_SSO`;
precise definition in Phase 2.5). Prevents mass-revocation
accidents.
## Open questions pending staging verification
- Does LDAP anonymous bind actually work for `GrowthBook-*` groups,
or do we need `proxyagent` credentials (WMF's read-only LDAP
service account)? If proxyagent is needed, the LDAP collector
will need a credentialed-bind code path and the secret delivery
plumbing for the bind password.
- Is the WMF CA bundle actually available at `/etc/ssl/wmf-ca.pem`
in the target container image, or does the chart need a
ConfigMap mount at a different path?
- Is `50/min` sustained an appropriate rate limit against the actual
GrowthBook instance under typical load, or should it be
tighter/looser?
- CronJob interval: the design said "10–30 min" but we never pinned
a value. Start conservative (every 30 min) and tighten based on
staging run durations.
- Do the startup Prometheus metrics render correctly in the existing
Grafana setup, or does dashboard work need to happen before rollout?
---
## Role-resolution rules (T419622 design, with POSIX-gate hard precondition)
**Identifier note — "CustomElevatedAccess" means two different things**
in this spec:
- **Bitu group (LDAP)**: `GrowthBook-CustomElevatedAccess` — the LDAP
group users are added to for elevated access.
- **GrowthBook role (application)**: `CustomElevatedAccess` — the
role string the sync writes into `projectRoles[]`, configurable via
`GROWTHBOOK_ELEVATED_ROLE_NAME`.
Each user has two identifiers in play: `user_uid` (from LDAP and
`data.yaml` POSIX groups) and `user_email` (from GrowthBook members,
resolved from LDAP via the Identity mapping section). The pseudocode
below uses both — POSIX checks key off uid, Bitu membership keys off
email (post-resolution):
```python
if user_uid not in posix_groups["analytics-privatedata-users"]:
    target = Role.NO_ACCESS  # POSIX gate: hard precondition
else:
    rules_matched = []
    if user_email in bitu_groups["GrowthBook-Admin"]:
        rules_matched.append(Role.ADMIN)
    if user_email in bitu_groups["GrowthBook-CustomElevatedAccess"]:
        if (user_uid in posix_groups["analytics-product-users"]
                or user_uid in posix_groups["analytics-wmde-users"]
                or user_uid in posix_groups["deployment"]):
            rules_matched.append(Role.CUSTOM_ELEVATED)
    if user_email in bitu_groups["GrowthBook-ReadOnly"]:
        rules_matched.append(Role.READ_ONLY)
    target = max(rules_matched, default=Role.NO_ACCESS)
```
Tier ranks: `NO_ACCESS=0, READ_ONLY=1, CUSTOM_ELEVATED=2, ADMIN=3`.
Highest matching wins via `max()`. **POSIX gate is NOT a participant
in `max()`** — it's a hard precondition.
All access scoped to the designated GrowthBook Project via
`projectRoles[]`, never via global role.
Rules are numbered for log traceability (surfaced as `rule_number`
in the per-decision log line):
| # | Rule |
|---|---|
| 1 | POSIX gate (hard precondition — user not in `analytics-privatedata-users`) |
| 2 | Admin match (`GrowthBook-Admin`) |
| 3 | CustomElevatedAccess match (`GrowthBook-CustomElevatedAccess` + one of `analytics-product-users` / `analytics-wmde-users` / `deployment`) |
| 4 | ReadOnly match (`GrowthBook-ReadOnly`) |
| 5 | No rule matched (target = NO_ACCESS) |
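
The rule table can be paired with the tier ranks so that each decision returns both the target role and the `rule_number` for the log line. The following is a minimal sketch under stated assumptions, not the actual resolver: the function name `resolve`, the constants `POSIX_GATE` and `ELEVATED_POSIX`, and the `(Role, rule_number)` return shape are all hypothetical.

```python
from enum import IntEnum
from typing import Dict, Set, Tuple


class Role(IntEnum):
    # Tier ranks as listed in the spec; IntEnum makes max() pick the highest.
    NO_ACCESS = 0
    READ_ONLY = 1
    CUSTOM_ELEVATED = 2
    ADMIN = 3


# Hypothetical constant names; the group names themselves come from the spec.
POSIX_GATE = "analytics-privatedata-users"
ELEVATED_POSIX = ("analytics-product-users", "analytics-wmde-users", "deployment")


def resolve(
    uid: str,
    email: str,
    bitu: Dict[str, Set[str]],
    posix: Dict[str, Set[str]],
) -> Tuple[Role, int]:
    """Return (target role, rule number) for one user."""
    # Rule 1: the POSIX gate is checked first and never takes part in max().
    if uid not in posix[POSIX_GATE]:
        return Role.NO_ACCESS, 1
    matched = []
    if email in bitu["GrowthBook-Admin"]:
        matched.append((Role.ADMIN, 2))
    if email in bitu["GrowthBook-CustomElevatedAccess"] and any(
        uid in posix[group] for group in ELEVATED_POSIX
    ):
        matched.append((Role.CUSTOM_ELEVATED, 3))
    if email in bitu["GrowthBook-ReadOnly"]:
        matched.append((Role.READ_ONLY, 4))
    # Rule 5 is the default when nothing matched.
    return max(matched, default=(Role.NO_ACCESS, 5))
```

Direct dictionary indexing (rather than `.get()`) is deliberate in this sketch: a missing group key raises instead of being read as an empty group.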
---
## Architecture
Three-phase pipeline with explicit data structures between phases for
testability:
```
1. Collect (parallel where possible)
     GrowthBookCollector.fetch_members()       → List[Member]
     LDAPCollector.fetch_groups()              → Dict[str, Set[Email]]
     DataYamlCollector.fetch()                 → Dict[str, Set[Uid]]
2. Resolve
     Resolver.compute_targets(...)             → Dict[Email, TargetRole]
     Resolver.diff(current, target)            → Plan(actions=List[Action])
2.5. Safety check
     Plan.evaluate_thresholds(...)             → ok | abort
3. Apply
     Applier.apply(plan, client, dry_run=bool) → ApplyResult
```
`Plan` is categorized (deletes, downgrades, upgrades, grants, no-ops)
and deterministically ordered. `Action` encodes a single operation
with full provenance (rule fired, source groups, current/target role).
**Apply order (fail-closed):** deletions → downgrades → upgrades →
grants. Within each category, sub-sort by `current_role` descending
(most over-privileged first), then by email. If interrupted mid-apply,
system biases toward less access.
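
The fail-closed ordering above can be expressed as a single sort key. This is a sketch only: the `Action` field names, the `CATEGORY_ORDER` mapping, and the `apply_order` helper are illustrative assumptions rather than the real `plan.py` types.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Role(IntEnum):
    NO_ACCESS = 0
    READ_ONLY = 1
    CUSTOM_ELEVATED = 2
    ADMIN = 3


# Category precedence from the spec: deletions, downgrades, upgrades, grants.
CATEGORY_ORDER = {"delete": 0, "downgrade": 1, "upgrade": 2, "grant": 3}


@dataclass(frozen=True)
class Action:
    # Hypothetical minimal shape; the real Action also carries provenance.
    email: str
    category: str
    current_role: Role
    target_role: Role


def apply_order(actions: Iterable[Action]) -> List[Action]:
    """Return only the mutating actions, in fail-closed apply order."""
    mutating = [a for a in actions if a.category in CATEGORY_ORDER]
    return sorted(
        mutating,
        # Category first, then most over-privileged first, then email.
        key=lambda a: (CATEGORY_ORDER[a.category], -int(a.current_role), a.email),
    )
```

Because the key is a plain tuple, the order is deterministic for a given plan, which keeps dry-run output and real runs comparable.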
**"Grant" terminology**: the word "grant" is used instead of
"addition" because we never *create* members — we only grant
managed-project access to existing org members. New users always come
from SSO auto-provision.
---
## CLI surface
```
ldap-sync sync [--dry-run] [--max-revocations N] [--max-downgrades N]
               [--max-changes-per-run N] [--force]
ldap-sync report
```
- **`sync`** performs the real work: applies (or, with `--dry-run`,
would-apply) the computed plan. `--dry-run` prints the full diff —
per-user actions, counts by category, threshold math — and exits
non-zero if the threshold would have tripped.
- **`report`** is a read-only status view — current per-role seat
counts, count of Bitu-resolved target users, skip-list size, and
source-reachability booleans (LDAP, `data.yaml`, GrowthBook).
Does NOT compute or display the would-be plan; use
`sync --dry-run` for that.
**Exit codes** (distinct per failure class so operators / schedulers
can react):
- `1` = config error or unexpected exception
- `2` = safety threshold exceeded (no `--force`)
- `3` = apply phase completed with per-user failures
- `4` = GrowthBook auth rejected (401/403)
- `5` = source collection failure (LDAP down, gitiles down, GB API down)
- `6` = run-level deadline exceeded (`SYNC_DEADLINE_SECONDS`)
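
One way to keep these codes distinct in practice is to attach them to exception classes and translate at the top of the CLI. The sketch below assumes hypothetical names throughout (`ExitCode`, `SyncError` and its subclasses, `exit_code_for`); only the numeric values and their meanings come from the list above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    CONFIG_OR_UNEXPECTED = 1
    THRESHOLD_EXCEEDED = 2
    APPLY_PARTIAL_FAILURE = 3
    AUTH_REJECTED = 4
    SOURCE_COLLECTION_FAILURE = 5
    DEADLINE_EXCEEDED = 6


class SyncError(Exception):
    """Base class; anything not more specific maps to exit code 1."""
    exit_code = ExitCode.CONFIG_OR_UNEXPECTED


class ThresholdExceeded(SyncError):
    exit_code = ExitCode.THRESHOLD_EXCEEDED


class ApplyPartialFailure(SyncError):
    exit_code = ExitCode.APPLY_PARTIAL_FAILURE


class AuthRejected(SyncError):
    exit_code = ExitCode.AUTH_REJECTED


class SourceCollectionFailure(SyncError):
    exit_code = ExitCode.SOURCE_COLLECTION_FAILURE


class DeadlineExceeded(SyncError):
    exit_code = ExitCode.DEADLINE_EXCEEDED


def exit_code_for(exc: BaseException) -> int:
    """Map any exception to a process exit code; unknown errors become 1."""
    return int(getattr(exc, "exit_code", ExitCode.CONFIG_OR_UNEXPECTED))
```

A top-level handler could then call `sys.exit(exit_code_for(exc))` in a single `except` clause instead of scattering literal exit codes through the pipeline.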
**Negative values on `--max-revocations`/`--max-downgrades`/`--max-changes-per-run`
are rejected** with a message pointing to `--force`.
Both subcommands run startup validation and source collection: API key
works, project ID exists, LDAP reachable, data.yaml parseable, all
three Bitu groups queryable.
**Not validated at startup**: the configured `GROWTHBOOK_ELEVATED_ROLE_NAME`
(default `CustomElevatedAccess`). The GrowthBook public REST API has no
`/roles` endpoint, so a misspelled role name surfaces as per-user apply
failures rather than a startup error. Operators who change the name must
verify correctness out-of-band.
### Config (env vars)
| Var | Purpose | Default |
| ------------------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------- |
| `GROWTHBOOK_LDAP_SYNC_API_KEY` | API auth (Admin scope) | required |
| `GROWTHBOOK_API_BASE_URL` | e.g. `https://growthbook.wikimedia.org/api/v1` | required |
| `GROWTHBOOK_PROJECT_ID` | Designated project ID | required |
| `GROWTHBOOK_ELEVATED_ROLE_NAME` | Custom role string in GB | `CustomElevatedAccess` |
| `GROWTHBOOK_SYNC_SKIP_USERS` | Comma-separated emails to ignore | empty |
| `LDAP_URI` | Primary LDAP server | `ldaps://ldap-ro.eqiad.wikimedia.org:636` |
| `LDAP_URI_FALLBACK` | Failover | `ldaps://ldap-ro.codfw.wikimedia.org:636` |
| `LDAP_CA_CERT_PATH` | WMF CA bundle | `/etc/ssl/wmf-ca.pem` |
| `DATA_YAML_URL` | Gitiles URL for puppet `data.yaml` | `https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml?format=TEXT` |
| `PROMETHEUS_PUSHGATEWAY_URL` | Pushgateway URL | optional |
| `SYNC_DEADLINE_SECONDS` | Run-level wall-time deadline (cleanly abort before CronJob `activeDeadlineSeconds` SIGKILLs mid-write) | `900` |
| `LOG_LEVEL` | Log level | `INFO` |
---
## Identity mapping (DN → email)
LDAP `groupOfNames.member` returns distinguished names (DNs).
GrowthBook identifies users by email (assigned at first SSO login).
1. For each Bitu group, query and get `member` DNs.
2. Extract uids using a proper DN parser (NOT regex / string-split
— DNs can contain escaped commas).
3. Batch-query `ou=people,dc=wikimedia,dc=org` with
`(|(uid=jdoe)(uid=asmith)…)` requesting `mail`. **Chunk filter at
100 uids** to defend against future filter-length limits.
4. Normalize all emails to lowercase.
5. Build `{uid: email}` map.
6. **Fail-closed** if any DN can't be resolved (missing entry,
missing `mail`).
7. **Fail-closed** if two LDAP uids resolve to the same lowercase
email with conflicting Bitu memberships, or if GrowthBook contains
duplicate normalized emails.
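
Steps 3 to 7 can be sketched with the standard library alone. DN parsing (step 2) and the LDAP query itself are left out here. The helper names `uid_filters`, `build_email_map`, and `IdentityError` are hypothetical, and the duplicate-email check is deliberately stricter than the spec: it rejects any two uids sharing an email rather than only those with conflicting Bitu memberships.

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_FILTER_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


class IdentityError(Exception):
    """Raised to fail closed when the uid-to-email mapping is unsafe."""


def escape_filter_value(value: str) -> str:
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def uid_filters(uids: Iterable[str], chunk_size: int = 100) -> Iterator[str]:
    """Yield OR-filters of at most chunk_size uids each."""
    ordered = sorted(set(uids))
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        yield "(|" + "".join(f"(uid={escape_filter_value(u)})" for u in chunk) + ")"


def build_email_map(
    requested_uids: Set[str],
    entries: Iterable[Tuple[str, Optional[str]]],
) -> Dict[str, str]:
    """Build {uid: lowercase email} from (uid, mail) pairs, failing closed."""
    uid_to_email: Dict[str, str] = {}
    email_to_uid: Dict[str, str] = {}
    for uid, mail in entries:
        if not mail:
            raise IdentityError(f"uid {uid!r} has no mail attribute")
        email = mail.lower()
        owner = email_to_uid.setdefault(email, uid)
        if owner != uid:
            raise IdentityError(f"{email!r} is shared by {owner!r} and {uid!r}")
        uid_to_email[uid] = email
    missing = set(requested_uids) - uid_to_email.keys()
    if missing:
        raise IdentityError(f"unresolved uids: {sorted(missing)}")
    return uid_to_email
```

Each string yielded by `uid_filters` would be passed as the search filter for one query against `ou=people,dc=wikimedia,dc=org`, with the combined results fed to `build_email_map`.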
---
## SKIPPED semantics
Users listed in `GROWTHBOOK_SYNC_SKIP_USERS` are **opted out of
mutation** but are still *considered* for audit:
- If a skip-listed user exists in GB, a `SKIPPED` Action is emitted
recording both their current role AND the would-be target role, so
audit logs capture "this admin would have been downgraded to
ReadOnly but was opted out."
- Their role is **not changed**. The skip list is an opt-out from
mutation, NOT a revocation trigger. If you want to revoke a user,
remove them from the skip list and let the rules evaluate.
- Skip-listed users with a Bitu target but no GB record also emit
`SKIPPED` (with current=NO_ACCESS) — same audit intent.
- Skip-listed users contribute nothing to threshold population.
Primary use case: service accounts / bot users that should not be
touched by the sync.
## PENDING_SSO semantics
Users who appear in a Bitu `GrowthBook-*` group but have no
GrowthBook member record yet (they haven't logged in via SSO, so
GrowthBook has never auto-provisioned them) emit a `PENDING_SSO`
Action:
- Their intended target role is logged but no API mutation is
attempted (no member ID to target).
- They do NOT count toward threshold math (same treatment as
SKIPPED).
- They are picked up automatically after their first SSO login to
GrowthBook plus the next sync run.
Distinct from SKIPPED: PENDING_SSO is a "not here yet" state for
users the sync will eventually manage; SKIPPED is an explicit opt-out
for users the sync must never manage (service accounts, bots).
---
## Sync algorithm in detail
### Phase 1 — Collect
1. **GrowthBook**: paginated `GET /members`, loop on
`hasMore`/`nextOffset`. On count/total mismatch, retry the full
traversal once (replacing — never appending — the first pass) and
abort if the retry is itself internally inconsistent. Malformed
pagination metadata (`hasMore=true` with `nextOffset=null`, or
non-increasing offset) aborts immediately.
2. **LDAP**: query each of 3 Bitu groups (`GrowthBook-Admin`,
`GrowthBook-CustomElevatedAccess`, `GrowthBook-ReadOnly`),
normalize group common name case before comparison. Then
batch-resolve uids → emails (see Identity mapping).
3. **data.yaml**: HTTP GET from gitiles (`operations/puppet` repo,
`production` branch, path `modules/admin/data/data.yaml` —
full URL configurable via `DATA_YAML_URL`), base64 decode
(gitiles wraps raw content), YAML parse. Extract REQUIRED keys:
- `groups.analytics-privatedata-users.members`
- `groups.analytics-product-users.members`
- `groups.analytics-wmde-users.members`
- `groups.deployment.members`
If any key missing or malformed: **abort** (fail-closed; never
interpret missing as empty).
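
The pagination rules in step 1 can be sketched independently of the HTTP client by injecting a page-fetching callable. This is an illustration under assumptions: `fetch_page` is a hypothetical callable that returns one decoded response for a given offset, and the `members` and `total` field names are assumed; only `hasMore` and `nextOffset` are named in the spec.

```python
from typing import Any, Callable, Dict, List, Tuple


class PaginationError(Exception):
    """Raised when the member listing cannot be trusted."""


def _traverse(fetch_page: Callable[[int], Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], Any]:
    members: List[Dict[str, Any]] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        members.extend(page["members"])
        if not page.get("hasMore"):
            return members, page.get("total")
        next_offset = page.get("nextOffset")
        # Malformed metadata aborts immediately, with no retry.
        if next_offset is None or next_offset <= offset:
            raise PaginationError(f"bad nextOffset {next_offset!r} at offset {offset}")
        offset = next_offset


def fetch_all_members(fetch_page: Callable[[int], Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect every member; on a count/total mismatch, retry the traversal once."""
    for _attempt in range(2):
        # Each pass starts from an empty list, so a retry replaces the first pass.
        members, total = _traverse(fetch_page)
        if len(members) == total:
            return members
    raise PaginationError("count/total mismatch persisted after one retry")
```

Keeping the traversal free of network code means the mismatch and malformed-metadata paths can be tested with plain dictionaries.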
### Phase 2 — Resolve
4. For each user appearing in any Bitu GB-* group: compute target
role per the rules above (POSIX gate as precondition, then
`max()` over Bitu rules).
5. For each GB member: lookup by lowercase email.
- Found in target map → compare current vs target role.
- Not found AND not in skip list → target is NO_ACCESS.
- In skip list → skipped, logged with `skipped: true` field.
6. For each user in Bitu groups but NOT in GB (no member record):
emit `PENDING_SSO` action (see PENDING_SSO semantics above).
7. Build `Plan: List[Action]`, categorized as
`{delete, downgrade, upgrade, grant, noop}` plus the non-mutating
`{skipped, pending_sso}` annotations. Skipped and PENDING_SSO
users are excluded from drift counts.
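
The categorization in steps 5 to 7 reduces to a small pure function over the current role, the target role, and the skip list. The sketch below is an assumption about how that could look; the function name `classify` and the convention of passing `None` for "no GrowthBook member record" are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Role(IntEnum):
    NO_ACCESS = 0
    READ_ONLY = 1
    CUSTOM_ELEVATED = 2
    ADMIN = 3


def classify(current: Optional[Role], target: Role, skip_listed: bool) -> str:
    """Return the plan category for one user.

    `current` is None when the user has no GrowthBook member record.
    """
    # The skip list wins over everything, including missing member records.
    if skip_listed:
        return "skipped"
    if current is None:
        return "pending_sso"
    if current == target:
        return "noop"
    if target == Role.NO_ACCESS:
        return "delete"
    if current == Role.NO_ACCESS:
        return "grant"
    return "upgrade" if target > current else "downgrade"
```

For example, an existing member holding `ADMIN` whose Bitu memberships now resolve to `READ_ONLY` is classified as `downgrade`, while the same user on the skip list is classified as `skipped` and left untouched.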
### Phase 2.5 — Safety check (circuit breaker)
**Scope-filter contract (population definition)**: threshold math
integrity depends on the `current_members` input to `diff()` being
pre-filtered to "members of interest". A member appears in
`current_members` if and only if at least one of:
- They have a `projectRoles` entry for the managed project (any role).
- Their email is in at least one Bitu `GrowthBook-*` group (via the
target map built in Phase 2).
- Their email is in `GROWTHBOOK_SYNC_SKIP_USERS`.
Out-of-scope GB members (no managed-project role, no Bitu membership,
not skip-listed) MUST NOT appear in `current_members`. If they did,
they would appear as NOOP actions where both current and target role
are `NO_ACCESS` — inflating the population count and letting
legitimate mass-revocations slip past the safety net. Example: 8
deletes among 50 managed users exceeds the limit of `max(5, 5) = 5`
and trips the threshold. The same 8 deletes among 100 users (50 of
whom are out-of-scope NOOPs) stays under `max(5, 10) = 10` and slips
past.
8. Evaluate thresholds against the **full plan**:
- Deletes > `max(5, 10% of pop)`?
- Downgrades > `max(5, 10% of pop)`?
where `pop` = count of actions in the plan that are NOT
`PENDING_SSO` AND NOT `SKIPPED`. Equivalently: the number of
in-scope GrowthBook members the sync is actively considering
for mutation this run.
9. **`threshold_exceeded{kind}` metric is set for `delete` and
`downgrade` whenever each limit is crossed, regardless of
`--force`.** This way forced runs still leave a trace in
dashboards.
10. If breached AND no `--force`/`--max-revocations`/`--max-downgrades`
override: log full categorized summary, exit non-zero. **Abort by
default; partial-application requires explicit override.**
11. `--force` skips the abort-on-exceed check only; the thresholds
are still computed, logged, and emitted as metrics.
12. `--dry-run` always shows the full diff; if threshold would have
triggered, exit non-zero (signals to automation).
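
The threshold arithmetic in step 8 is short enough to sketch directly. The names `exceeds` and `evaluate_thresholds` are hypothetical, and the sketch takes only the list of action categories as input. Integer arithmetic is used so that "more than 10% of pop" needs no rounding decision.

```python
from collections import Counter
from typing import Dict, Iterable

# Categories that never count toward the managed population.
NON_POPULATION = {"skipped", "pending_sso"}


def exceeds(count: int, pop: int, floor: int = 5, percent: int = 10) -> bool:
    """True when count > max(floor, percent% of pop), without float rounding."""
    return count > floor and count * 100 > pop * percent


def evaluate_thresholds(categories: Iterable[str]) -> Dict[str, bool]:
    """Return {kind: exceeded} for the delete and downgrade limits."""
    counts = Counter(categories)
    pop = sum(n for cat, n in counts.items() if cat not in NON_POPULATION)
    return {
        "delete": exceeds(counts["delete"], pop),
        "downgrade": exceeds(counts["downgrade"], pop),
    }
```

With 8 deletes and 42 no-ops the population is 50, the limit is 5, and the delete threshold trips. Adding 20 `pending_sso` entries changes nothing, because they are excluded from the population.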
### Phase 3 — Apply
13. Log `{"action": "applying_plan_from_snapshot", "snapshot_ts": …}`
at start (for staleness correlation).
14. Apply per the Architecture "Apply order (fail-closed)" sequence
(deletions → downgrades → upgrades → grants; within category,
sub-sort by `current_role` desc then by email).
15. **Revocation method**: `POST /members/{id}/role` with the managed
project entry **removed from the full `projectRoles` array**,
other entries preserved verbatim. NEVER `DELETE /members/{id}` —
that creates orphans (per source code analysis) and would lose
other-project access.
16. **Read-modify-write on `projectRoles`**: immediately before each
POST, re-fetch the target member's current `projectRoles` (not
the Collect-phase snapshot), modify only the managed-project
entry, send the full array back. The Collect-phase snapshot is
used for diff computation, NOT as the payload source — this
keeps the concurrent-UI-edit clobber window to the per-call
round-trip (milliseconds) rather than the full collect-to-apply
window (seconds to minutes). Any UI edit that lands between
re-fetch and POST is still lost, but that race is small enough
to be an accepted v1 risk.
17. **Drift detection** compares ONLY `projectRoles[managed_project_id]`,
NOT the global `role` field or the full member object (full-deep-
equal would falsely "drift" on irrelevant changes elsewhere).
18. **Global role**: when updating `projectRoles`, explicitly set the
global `role` to the minimum (`readonly`) to avoid accidentally
leaving someone with `admin` global. Note: since global `role`
is NOT part of drift detection (step 17), an out-of-band global-
role promotion will persist until the user's managed-project
role triggers a mutation for some other reason. Accepted for v1;
revisit if global-role abuse becomes a concern.
19. **`--max-changes-per-run N`**: only kicks in AFTER threshold
passes. Caps **successful real mutations** at N — failed
attempts and dry-run "applies" do NOT count against the budget,
so a burst of API failures can't silently exhaust it. NOT a
threshold override. When the cap is reached, the apply loop
still continues so trailing NOOP/SKIPPED/PENDING_SSO actions
get counted in the summary.
20. Per-user API errors: log + continue. Track in counters.
21. Rate limit: 50/min sustained + 10-burst for paginated GETs.
Exponential backoff on 429, honoring `Retry-After` as a minimum
wait (our own backoff can exceed it, but never wait less).
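
The read-modify-write in steps 15 and 16 comes down to rebuilding the `projectRoles` array from a fresh fetch while touching only the managed entry. The sketch below assumes each entry is a dict with `project` and `role` keys; those field names and the helper name `updated_project_roles` are assumptions to be checked against the GrowthBook API, not confirmed by this spec.

```python
from typing import Any, Dict, List, Optional


def updated_project_roles(
    fresh_roles: List[Dict[str, Any]],
    managed_project_id: str,
    target_role: Optional[str],
) -> List[Dict[str, Any]]:
    """Return a new projectRoles array for the POST payload.

    fresh_roles must come from the re-fetch immediately before the POST,
    not from the Collect-phase snapshot. target_role=None means revoke.
    """
    result: List[Dict[str, Any]] = []
    replaced = False
    for entry in fresh_roles:
        if entry.get("project") != managed_project_id:
            # Other projects are copied through unchanged.
            result.append(dict(entry))
        elif target_role is not None and not replaced:
            # Keep any extra fields on the managed entry, change only the role.
            result.append({**entry, "role": target_role})
            replaced = True
        # Otherwise: revocation (or a duplicate managed entry) is dropped.
    if target_role is not None and not replaced:
        result.append({"project": managed_project_id, "role": target_role})
    return result
```

Returning a new list rather than mutating the fetched one keeps the function pure, so the revocation and grant paths can be checked without a GrowthBook instance.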
### Phase 4 — Report
22. Push Prometheus metrics (see table below).
23. Emit end-of-run JSON summary log line with categorized counts.
24. **Exit non-zero if any apply action failed** so operators see
partial failures.
---
## Logging
stdlib `logging`, JSON formatter. Per-decision line at INFO:
```json
{
  "ts": "2026-…",
  "sync_run_id": "<uuid>",
  "actor": "ldap-sync",
  "user_email": "...",
  "user_uid": "...",
  "growthbook_member_id": "...",
  "current_role": "ReadOnly",
  "target_role": "CustomElevatedAccess",
  "action": "upgrade",
  "action_result": "success",
  "rule_number": 3,
  "source_bitu_groups": ["GrowthBook-CustomElevatedAccess"],
  "source_posix_groups": ["analytics-privatedata-users", "analytics-product-users"],
  "drift_detected": false,
  "skipped": false,
  "dry_run": false
}
```
End-of-run summary at INFO with categorized counts. No-ops at DEBUG.
PII note: `user_email` is acceptable for internal audit log per WMF
convention (emails publicly derivable from staff names).
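
A per-decision line like the one above can be produced with a small stdlib `logging.Formatter` subclass. This is a sketch, not the actual `log.py`: the class name `JsonFormatter` and the convention of passing decision fields through `extra={"decision": {...}}` are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "actor": "ldap-sync",
        }
        # Structured fields arrive via logger.info(..., extra={"decision": {...}}).
        decision = getattr(record, "decision", None)
        if decision:
            payload.update(decision)
        else:
            payload["message"] = record.getMessage()
        return json.dumps(payload, sort_keys=True)


def build_logger(level: str = "INFO") -> logging.Logger:
    logger = logging.getLogger("ldap_sync")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(level)
    return logger
```

A call such as `build_logger().info("decision", extra={"decision": {"action": "upgrade", "rule_number": 3}})` would then emit a single JSON line carrying those fields alongside `ts` and `actor`.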
---
## Prometheus metrics
| Name | Type | Labels | Description |
| --------------------------------------- | --------- | --------------- | ------------------------------------------------------------------------------------ |
| `growthbook_sync_seats_total` | Gauge | `role` | Current seats per role |
| `growthbook_sync_drift_total` | Gauge | `action` | Per-run desired actions |
| `growthbook_sync_actions_applied_total` | Counter | `action,result` | Cumulative results |
| `growthbook_sync_errors_total` | Counter | `source` | Per-source errors |
| `growthbook_sync_run_duration_seconds` | Histogram | — | Wall time |
| `growthbook_sync_last_success_unixtime` | Gauge | — | Staleness |
| `growthbook_sync_last_failure_unixtime` | Gauge | — | Staleness |
| `growthbook_sync_threshold_exceeded` | Gauge | `kind` | 1 if threshold crossed (set even with `--force`; does not imply the run was blocked) |
| `growthbook_sync_source_healthy` | Gauge | `source` | 0/1 reachable |
Metrics are pushed to Pushgateway at the end of each run (the sync
is short-lived, so Prometheus can't scrape it directly). Stale data
from prior runs is prevented by two mechanisms: each push replaces
the entire job group atomically (so label values that don't reappear
are dropped), and all known label combinations are explicitly set to
0 at the start of each run before real values are written (so a role
with 0 seats shows up as 0, not as a stale value from the last run).
---
## Error handling
- HTTP requests: connect timeout 5s, read timeout 20s.
- LDAP: connect timeout 10s, receive timeout 30s. Failover across
eqiad → codfw (primary/secondary pool). LDAPS with mandatory CA
validation (CA bundle path from `$LDAP_CA_CERT_PATH`).
- 429/5xx: exponential backoff with jitter, honoring `Retry-After`
as a minimum wait. Per-call retries are capped at 5 attempts.
- 401/403: hard-fail (don't continue with partial state).
- LDAP/gitiles failures: log + abort run (don't apply with stale source).
- `data.yaml` schema/parse errors: log + abort (fail-closed).
- Per-user API errors: log + continue (counted toward
`growthbook_sync_actions_applied_total{result="failed"}`).
- **Run-level deadline**: `SYNC_DEADLINE_SECONDS` (default 900s /
15 min) caps total wall-time. Checked in the HTTP request loop
and before each apply iteration. On exceed, the run logs a
summary, pushes metrics, and exits with code 6. This cleanly
aborts the run before the CronJob's `activeDeadlineSeconds`
would SIGKILL mid-write, preserving the metric push and the
end-of-run audit log.
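
The 429/5xx retry policy above (exponential backoff with jitter, `Retry-After` as a floor, at most 5 attempts) can be sketched as a pure delay function. The base delay, the cap, the jitter range, and the name `backoff_delay` are assumptions; the spec fixes only the attempt limit and the minimum-wait rule.

```python
import random
from typing import Callable, Optional

MAX_ATTEMPTS = 5  # per-call retry cap from the spec


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,   # assumed base delay in seconds
    cap: float = 60.0,   # assumed upper bound on the exponential term
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    exponential = min(cap, base * (2 ** attempt))
    # Jitter spreads the wait over [0.5 * exponential, exponential).
    delay = exponential * (0.5 + 0.5 * rng())
    # Retry-After is a minimum: we may wait longer, never shorter.
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay
```

A caller would loop `for attempt in range(MAX_ATTEMPTS)`, sleep for `backoff_delay(attempt, retry_after)` after each 429 or 5xx response, and check the run-level deadline before sleeping so a long wait cannot overrun `SYNC_DEADLINE_SECONDS`.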
Idempotent: safe to re-run anytime.
---
## Repo & file layout
The sync code lives in a new `ldap-sync/` subdirectory of the
`data-engineering/growthbook` repo, alongside the existing upstream
GrowthBook image build config. The `ldap_sync/` Python package has
subpackages for collectors (one per source: LDAP, data.yaml,
GrowthBook), a resolver (rules → target role), a plan/applier split
(plan builds + threshold-checks, applier executes), a paginated
rate-limited GrowthBook client, metrics (Pushgateway), and JSON
logging. Tests mirror the package structure and include an exhaustive
truth-table test for the resolver.
Build: Poetry-managed dependencies (runtime: `ldap3`, `requests`,
`pyyaml`, `prometheus-client`, `click`). Container built via a new
root-level `blubber-ldap-sync.yaml` using
`pip install --require-hashes -r requirements.txt` exported from
`poetry.lock`; the runtime image doesn't need Poetry. The existing
`.gitlab-ci.yml` is extended with build/publish jobs for the new
image.
Tree for operator orientation:
```
data-engineering/growthbook/ (existing repo root)
├── blubber.yaml                 # existing: prod growthbook image
├── blubber-next.yaml            # existing: staging growthbook image
├── .gitlab-ci.yml               # existing: extended with new jobs
├── blubber-ldap-sync.yaml       # NEW (root, matches convention)
└── ldap-sync/                   # NEW
    ├── pyproject.toml
    ├── poetry.lock
    ├── ldap_sync/
    │   ├── cli.py
    │   ├── config.py            # env var validation
    │   ├── collectors/
    │   │   ├── ldap_collector.py
    │   │   ├── data_yaml_collector.py
    │   │   └── growthbook_collector.py
    │   ├── resolver.py          # rules → target role
    │   ├── plan.py              # Plan, Action, threshold check
    │   ├── applier.py           # executes plan
    │   ├── growthbook_client.py # paginated, rate-limited
    │   ├── metrics.py           # pushgateway
    │   └── log.py               # JSON formatter
    └── tests/
        ├── test_resolver.py     # exhaustive truth table
        ├── test_plan.py         # diff + threshold logic
        ├── test_collectors.py   # mock LDAP/HTTP
        ├── test_applier.py      # mock GB API, rate limit
        └── fixtures/
```
---
## Test strategy
- **pytest with mocks** for collectors, resolver, plan, applier.
- **Exhaustive truth-table test for `Resolver.compute_targets`** —
every combination of Bitu group × POSIX group membership.
- Manual acceptance checklist in README for staging deploy:
1. `ldap-sync report` pointed at staging (via env vars).
2. `ldap-sync sync --dry-run` pointed at staging, review the diff.
3. Manually create a test user, add to a test Bitu group, run
sync, verify role.
4. Remove from Bitu group, run sync, verify revocation.
---
## Validate-before-first-deploy
Operational prerequisites (not code; gate first prod deploy):
1. **GrowthBook SSO config**: confirm default role for new
auto-provisioned users is minimal/no-access. Otherwise a newly
provisioned user could hold unintended access for up to one sync
interval (≤30 min) after first SSO login.
2. **WMF LDAP anonymous bind**: confirm reads work for `GrowthBook-*`
groups under `ou=groups,dc=wikimedia,dc=org` (after T420688
creates the groups). If anonymous bind is unavailable, fall back
to `proxyagent` credentials.
3. **`data.yaml` authority (hard constraint)**: the four POSIX groups
consumed by this sync MUST be defined only in
`modules/admin/data/data.yaml`, never via layered Hiera overrides
or role-generated membership. (Hiera is Puppet's hierarchical data
lookup, where values from per-role, per-site, or per-datacenter
YAML files merge at lookup time.) The script parses the raw YAML
and is structurally blind to Hiera composition; any membership
defined outside the flat file would be invisible and the POSIX
gate would misjudge affected users. This is already the de facto
state for admin-module groups (which resolve via direct fixed-path
lookup rather than hierarchical composition); the constraint just
formalizes it.
4. **NetworkPolicy egress** from deployment target: LDAP eqiad+codfw,
gitiles, GrowthBook API, Pushgateway.
5. **WMF CA bundle** available to mount as ConfigMap.
6. **Custom role `CustomElevatedAccess`** exists in GrowthBook
(after T420690).
7. **Pushgateway reachability** + `prometheus-client` push API
compatibility verified.
---
## Deferred to follow-ups (with comments in code)
- **SIGTERM handler**: spec'd behavior for v2: complete current user
operation, log partial-run marker, exit 0. For v1, next run
re-converges; pod evictions during apply are rare for a 10-30min
CronJob.
- **`--rollback` / snapshot subcommand**: for v1, manual fix via
GrowthBook admin UI is acceptable. Document.
- **Stricter, more aggressive revocation policies**, if observation shows they are needed.
---
## Appendix: Airflow vs k8s CronJob deltas
The sync's collect/resolve/apply core is scheduler-agnostic (env-var
config, push-based metrics, no Airflow-specific imports). The current
reference implementation assumes k8s CronJob execution; this appendix
enumerates what would change if the execution environment moved to
Airflow. Decision deferred; this list is just to make things tangible.
| # | Area | CronJob | Airflow | Change scope |
|---|---|---|---|---|
| 1 | Metrics delivery | Push to Pushgateway (pod is short-lived) | Scrape the worker, or push to StatsD/OpenTelemetry | Config (`PROMETHEUS_PUSHGATEWAY_URL` becomes optional/null) |
| 2 | Retry / idempotency | `concurrencyPolicy: Forbid` + next-schedule retry | Per-task `retries`/`retry_delay`, `max_active_runs=1` | Config |
| 3 | Exit-code mapping | `kube_job_status_failed`, exit codes 1–6 surface via Prometheus | Non-zero exit = "failed" in Airflow UI; exit codes 2/3/4/5/6 look identical unless adapted | **Code (adapter)** — map exit codes to `AirflowFailException` vs `AirflowSkipException` vs custom alert routes |
| 4 | Timeouts | `activeDeadlineSeconds` (we also have `SYNC_DEADLINE_SECONDS` as a graceful cap) | `execution_timeout` | Config |
| 5 | Network egress | k8s NetworkPolicy on GrowthBook namespace (LDAP, gitiles, GB API, Pushgateway) | Airflow worker firewall rules — worker already reaches LDAP + gitiles; GB API egress would be new | Infra |
| 6 | Secrets | k8s Secret mounted as env via existing `config.private` → helmfile flow | Airflow Connections/Variables, or a separate Vault/k8s-secrets mount for the worker | **Code (secret-source)** — adds a secret-loading layer unless we read env from an Airflow-supplied shim |
| 7 | CA bundle | ConfigMap mount at `/etc/ssl/wmf-ca.pem` | System CA store on the Airflow worker (likely already present) | Infra |
| 8 | Lifecycle / SIGTERM | Pod eviction rare for 10–30min CronJob; v1 accepts it | Task kill (manual clear, timeout) is more common | Pushes the deferred SIGTERM handler to higher priority |
| 9 | Logs | stdout → k8s log capture → Logstash | Task log files on worker + optional remote (S3/GCS) | Infra |
| 10 | Operator UX | `kubectl get cronjob` + `kubectl logs` | DAG-run UI | Infra |
**Items #3 and #6 are the only real code/chart work.** Everything else
is config or infra plumbing.
### Default recommendation: CronJob
Shape-of-work argument: collect → resolve → apply is linear, stateless,
and short (seconds to low minutes), with no multi-step dependencies.
This is structurally Cron-shaped, not DAG-shaped. The GrowthBook Helm
chart and its secret-delivery plumbing already exist; a CronJob entry
is minimal incremental infra. Airflow's value is multi-step pipelines
with dependencies, fan-out/fan-in, backfills, and per-step retries —
none of which this workload needs.
The case for Airflow is less about technical shape and more about
operational integration: if the DPE daily on-call already monitors
Airflow DAG failures and triages broken tasks as part of normal
responsibilities, then a standalone CronJob is one more thing to
wire up separately — its own Prometheus alerting rules, its own
runbook, outside the team's existing failure-handling patterns.
Scheduler-uniformity has real value when the team's tooling and
habits are built around one system.
Current recommendation is CronJob (technical fit + existing Helm
plumbing), but the final call depends on whether Airflow monitoring
is already part of DPE's daily operational workflow — a conversation
still pending with the team.