I just drafted what an implementation of option 3 could look like: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/a996b2a6ae2d9c3c2b094ae1ae3a39b4afe0433d/components/kyverno-policies/policies/toolforge-base-policy.yaml
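For reference, a minimal sketch of what such a Kyverno validation policy could look like. The field names follow the Kyverno ClusterPolicy API; the policy name and rule contents here are illustrative, not a copy of the linked file:

```yaml
# Illustrative sketch only: a Kyverno ClusterPolicy that validates
# (rather than mutates) pod security settings, similar in spirit to
# the linked toolforge-base-policy.yaml.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-base-policy   # placeholder name
spec:
  validationFailureAction: Enforce   # reject non-conforming resources
  background: true                   # also report on pre-existing resources
  rules:
    - name: require-drop-all-capabilities
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers must drop all capabilities."
        pattern:
          spec:
            containers:
              - securityContext:
                  capabilities:
                    drop:
                      - ALL
```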
Yesterday
Given the new data, I think I'm now more in favor of option 2: mutation.
Another data point.
More thoughts on validation vs mutation:
Tue, May 7
Before patch https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/278, with only PSP in place, a Pod resource would have:
- at container level:
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsGroup: 54005
    runAsUser: 54005
- at pod level:
  securityContext:
    fsGroup: 54005
    seccompProfile:
      type: RuntimeDefault
    supplementalGroups:
      - 1
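Put together, the two levels above would land in a Pod manifest roughly like this. This is a sketch assembled from the values listed above; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-tool-pod          # placeholder name
spec:
  securityContext:                # pod-level settings
    fsGroup: 54005
    seccompProfile:
      type: RuntimeDefault
    supplementalGroups:
      - 1
  containers:
    - name: tool
      image: example-image:latest # placeholder image
      securityContext:            # container-level settings
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        runAsGroup: 54005
        runAsUser: 54005
```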
TODO: webservice
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a project tag more specific to this task. Thanks!
Mon, May 6
Updated T362050: toolforge: review pod templates for PSP replacement to make sure our pod templates are updated accordingly.
Linking T277778: Toolforge: consider decoupling user & accounts from CloudVPS accounts for reference, in case it is relevant.
the plan could be this:
Fri, May 3
the proposed patch LGTM.
I updated the quotas, but the administrator documents we have are a bit confusing.
Please, check the quotas and report back if you cannot operate as expected.
Tue, Apr 30
In T362872#9752551, @dcaro wrote:The decision about committing to drop the extra component on the upgrade to k8s 1.26 might become way more relevant [..]
I think kind in particular has some issues working with AppArmor. The kind known-issues doc suggests just disabling it: https://kind.sigs.k8s.io/docs/user/known-issues/#apparmor
Fri, Apr 26
Thu, Apr 25
The problem was we were using a deprecated apiVersion field in the embedded kubeadm configuration.
in your opinion, should we decline this task and focus on the other angle you mention?
scheduled discussion meeting for 2024-04-30.
Wed, Apr 24
In T362872#9740880, @dcaro wrote:Is there a way for us to see how many objects are currently not meeting the policy? If there are not many, going with option 1 might be doable; otherwise it might require some time to first get everything valid against the policies, and then move to option 1 (if we eventually want to).
Reopening, as I just noticed an important data point: as of today, PodSecurityPolicy works in mutation mode. It transparently modifies the resources being defined in the cluster.
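To illustrate the difference: replicating that PSP behavior with Kyverno would mean a mutate rule instead of a validate rule, along these lines. This is a sketch only; the policy name and the chosen default are illustrative:

```yaml
# Illustrative sketch: a Kyverno mutate rule that, like PSP's mutating
# behavior, transparently injects a security default into incoming Pods.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-mutate-security-context   # placeholder name
spec:
  rules:
    - name: add-default-security-context
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              # (name): "*" is a Kyverno anchor: apply to every container
              - (name): "*"
                securityContext:
                  # +() add-if-absent anchor: only set when not already defined
                  +(allowPrivilegeEscalation): false
```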
reopening -- we might want to take a look at this soon.
I'm fine with option 1 too, so I'm declaring this decision request done.
In T356164#9621026, @dcaro wrote:In T356164#9559316, @aborrero wrote:Maybe an idea: have a per-tool network quota for concurrent connections. We don't have any semantics in kubernetes/calico for implementing this though.
Can you open a task with more details if you have a clear idea? I'm going to close this one for now, but it would be nice to have something more than us manually looking at the limits.
In T329327#9738585, @bd808 wrote:Based on this explanation of the rate limiting implementation I am very much wondering if EventStreams is seeing all traffic from Cloud VPS as coming from a single IP, specifically 185.15.56.1 (nat.cloudgw.eqiad1.wikimediacloud.org). If so, EventStreams would be mostly unusable by Toolforge tools and other Cloud VPS projects with potentially hundreds of tools fighting over 16 slots.
Tue, Apr 23
Patch https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/119 seems to work as expected.
In T362872#9734680, @taavi wrote:In T362872#9734678, @aborrero wrote:In both cases, they don't mention anything about mutation. I assume this means that they cannot backfill (backmutate?) resources that made it into the system before the policy was updated.
I would assume the policy is applied when a Pod is created, and not when a Deployment/Job/etc is? In that case backfill support doesn't seem that important to me.
In T362872#9734666, @aborrero wrote:In T362872#9726419, @fnegri wrote:What do we do with the already defined Pod?
This is something to verify in Option 1 as well, I think. Does validation apply to existing resources, or only to newly created ones?
Kyverno has well-established behavior for this (see https://kyverno.io/docs/policy-reports/background/): it will report if already-existing resources no longer conform to the new policies.
In a similar fashion, OPA Gatekeeper can perform audits of existing resources, see https://open-policy-agent.github.io/gatekeeper/website/docs/audit
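Concretely, on the Kyverno side this is controlled by the `background` field on the policy spec: with it enabled, Kyverno periodically scans existing resources and records violations in PolicyReport objects. A sketch, with field names per the Kyverno ClusterPolicy API and an illustrative rule:

```yaml
# Illustrative sketch: a policy that audits (does not block) and scans
# pre-existing resources in the background.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-audited-policy   # placeholder name
spec:
  validationFailureAction: Audit   # report violations, do not block
  background: true                 # scan already-existing resources too
  rules:
    - name: check-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods should set runAsNonRoot."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```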
In T362872#9726419, @fnegri wrote:What do we do with the already defined Pod?
This is something to verify in Option 1 as well, I think. Does validation apply to existing resources, or only to newly created ones?
Fri, Apr 19
The speeds you reported are perfectly normal.
cross linking: T327087: Decision request: python source code line length
Thu, Apr 18
Could you please go to https://network-tests.toolforge.org/, download the 1GB file, and report the speed you get?
Wed, Apr 17
Tue, Apr 16
Also, the request doesn't use a TLS certificate on the client side. Looking at the nginx deployment, it has ssl_verify_client on; I would expect the request to fail if it doesn't use a client cert?
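For context, nginx's `ssl_verify_client` directive has several modes, and only `on` hard-fails requests that don't present a valid client certificate. A minimal nginx config sketch (server name and file paths are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name example.internal;                    # placeholder

    ssl_certificate        /etc/nginx/tls/server.crt;  # placeholder paths
    ssl_certificate_key    /etc/nginx/tls/server.key;
    ssl_client_certificate /etc/nginx/tls/ca.crt;      # CA to verify clients against

    # "on"       -> reject requests without a valid client certificate
    # "optional" -> request a certificate but still allow requests without one
    ssl_verify_client on;
}
```

So if the deployment really has `ssl_verify_client on` (rather than `optional`), a request without a client cert should be rejected at the TLS/HTTP layer.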