We're using this in our CI jobs but it'd be cleaner if it were installed upstream in the production image
Description
Event Timeline
Change #1296682 had a related patch set uploaded (by Jforrester; author: Jforrester):
[operations/docker-images/production-images@master] abstractwiki-rust: Bake in semgrep, cargo-chef, clang, and clippy
SRE's review of my initial baked-in patch (change 1296682) highlighted the policy that everything in a production image must come from Debian upstream or our own apt.wikimedia.org, and must not be fetched from PyPI or GitHub. For builds too unwieldy to do ourselves, the accepted fallback is to store and serve the binary ourselves via a Debian package.
We're using semgrep in our Rust linting step on a recommendation from Security, but it isn't specific to our usage. Unfortunately it's definitely "unwieldy": its engine (semgrep-core) is written in OCaml (opam/dune plus tree-sitter grammars), so building from source is a serious, ongoing maintenance commitment. The upstream Linux wheels already bundle a pre-built, static semgrep-core, so realistically we'd be serving that binary, not compiling it.
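As a sanity check on the claim that the wheel ships the engine, one could list the contents of a downloaded wheel, since wheels are ordinary zip archives. A minimal sketch, assuming the bundled binary's file name contains the string `semgrep-core` (the exact path inside the wheel is an assumption, and `find_bundled_core` is a hypothetical helper name):

```python
import zipfile
from pathlib import Path


def find_bundled_core(wheel_path, needle="semgrep-core"):
    """Return the archive members of a wheel whose file name contains `needle`.

    Wheels are plain zip archives, so this needs neither pip nor network access.
    The default needle is an assumption about how the bundled binary is named.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return [
            name
            for name in wheel.namelist()
            if needle in Path(name).name
        ]
```

An empty result for a given wheel would suggest that platform's wheel does not carry the binary, which would change what we'd actually be serving.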
So, which compliant delivery mechanism should we pick? Some options:
- Thirdparty .deb on apt.wikimedia.org. Build a self-contained venv once on a WMF package-builder (pip-install a pinned semgrep==X from a controlled index), bundle the resulting tree into a .deb, serve it via reprepro, and install in the image with a plain apt_install — the same shape as images/ceph. The network/pip step happens once on our infra, exactly analogous to cargo-chef's vendoring; the image build itself stays pure-apt and offline.
- (+) Matches SRE's named fallback; no pip in the image; reproducible (we serve it) and outage-proof.
- (−) Packaging + reprepro upload effort; must rebuild/re-upload on every bump, and semgrep releases often.
- WMF-served wheelhouse + offline pip. Mirror the pinned wheels (semgrep + Python deps, hash-locked) somewhere WMF-controlled and pip install --no-index --find-links at build time.
- (+) Lighter than full deb packaging; reproducible via pinned hashes.
- (−) Still "pip", which SRE flagged — needs their explicit OK; muddier policy fit than a .deb.
- Build semgrep-core from source.
- (+) The gold standard.
- (−) Wikimedia only finally got rid of production OCaml in Math a few years ago; let's not go back to that.
- Reconsider placement. semgrep isn't Rust-specific — it's wanted in abstractwiki-rust only because our function-evaluator CI runs it there as a single-point-of-truth. Maybe it belongs in a more general CI image?
- Reconsider need. semgrep partially duplicates the cargo audit / OSV testing; does it add enough to justify the packaging investment at all?
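To make the wheelhouse option above more concrete, here is a rough sketch of what "hash-locked" could look like in practice: check every wheel in the mirrored directory against a pinned SHA-256 before building the offline install command. The pip flags (`--no-index`, `--find-links`, `--require-hashes`, `-r`) are real; the helper names, the `pinned` mapping, and any paths passed in are hypothetical and only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheelhouse(wheelhouse, pinned):
    """Compare every *.whl in `wheelhouse` against `pinned` (filename -> sha256).

    Returns a list of problems; an empty list means the directory
    matches the pins exactly (nothing missing, extra, or altered).
    """
    problems = []
    found = {p.name: p for p in Path(wheelhouse).glob("*.whl")}
    for name in sorted(set(found) | set(pinned)):
        if name not in found:
            problems.append(f"missing: {name}")
        elif name not in pinned:
            problems.append(f"unpinned: {name}")
        elif sha256_of(found[name]) != pinned[name]:
            problems.append(f"hash mismatch: {name}")
    return problems


def offline_pip_command(wheelhouse, requirements):
    """Build the pip invocation for an index-free, hash-checked install."""
    return [
        "pip", "install",
        "--no-index",                     # never contact PyPI
        "--find-links", str(wheelhouse),  # resolve only from the local mirror
        "--require-hashes",               # requirements file must carry --hash entries
        "-r", str(requirements),
    ]
```

Whether a check like this is enough to satisfy the policy is exactly the question for SRE; it only shows that the build-time step can stay offline and deterministic.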
Particular questions for SRE:
- Is semgrep already packaged anywhere in WMF, or is there an existing python-tooling thirdparty component we'd reuse rather than create?
- Are you comfortable blessing the upstream semgrep-core binary into the narrow set of trusted prebuilt binaries?
- If we go the .deb route: which component name, and what's the reprepro upload + ownership process?
- Given semgrep's frequent releases, is the rebuild-and-upload cadence acceptable, or does that push us toward option 2 or 4 above (the wheelhouse, or reconsidering placement)?