
Add network policies to the ML k8s clusters
Open, Needs Triage, Public

Description

The current GlobalNetworkPolicies settings for the ml-serve-{eqiad,codfw} clusters are empty, allowing any traffic to flow in and out of the clusters without restrictions. This was fine for initial testing, but now that we have reached a more stable phase we should add a base set of restrictions for ingress/egress and for traffic flowing between pods.

Event Timeline

It seems that the GlobalNetworkPolicies are split into two parts:

  • global ones (per cluster), which cover things like allowing egress between pods, allowing DNS traffic to kube-system, etc.
  • per-service ones, which cover ingress filtering for each service (exposed ports, etc.) and egress to specific services if needed
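As a rough illustration of the first kind (a cluster-wide rule such as "allow DNS traffic to kube-system"), here is a minimal sketch that builds such a manifest as a Python dict and prints it as JSON. It assumes the policies are Calico `GlobalNetworkPolicy` resources; the policy name, the `all()` selector and the `projectcalico.org/name` namespace label are assumptions for illustration, not taken from the actual cluster configuration.

```python
import json


def allow_dns_egress(name="allow-dns-egress"):
    """Build a cluster-wide policy letting every pod reach DNS in kube-system.

    Hypothetical sketch: the name, selector and namespace label are
    assumptions, not the values used on the ml-serve clusters.
    """
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "GlobalNetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            "selector": "all()",  # apply to every workload in the cluster
            "types": ["Egress"],
            "egress": [
                {
                    "action": "Allow",
                    "protocol": protocol,
                    "destination": {
                        "namespaceSelector": 'projectcalico.org/name == "kube-system"',
                        "ports": [53],
                    },
                }
                for protocol in ("UDP", "TCP")  # DNS uses both
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(allow_dns_egress(), indent=2))
```

Per-service rules would instead live next to each service's chart and select only that service's pods.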

I think that we should proceed with T286791 first, because we can currently only add per cluster rules.

List of ports used by various containers (got via nsenter):

  • istio ingress gateway:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN      17095/envoy
tcp        0      0 0.0.0.0:15021           0.0.0.0:*               LISTEN      17095/envoy
tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      17095/envoy
tcp        0      0 0.0.0.0:15090           0.0.0.0:*               LISTEN      17095/envoy
tcp        0      0 127.0.0.1:15000         0.0.0.0:*               LISTEN      17095/envoy
tcp6       0      0 :::15020                :::*                    LISTEN      17046/pilot-agent
  • istio webhook (istiod)
tcp6       0      0 :::9090                 :::*                    LISTEN      61904/webhook       
tcp6       0      0 :::8008                 :::*                    LISTEN      61904/webhook       
tcp6       0      0 :::8443                 :::*                    LISTEN      61904/webhook
  • istio cluster local gateway
tcp        0      0 0.0.0.0:15090           0.0.0.0:*               LISTEN      6482/envoy          
tcp        0      0 127.0.0.1:15000         0.0.0.0:*               LISTEN      6482/envoy          
tcp        0      0 0.0.0.0:15021           0.0.0.0:*               LISTEN      6482/envoy          
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      6482/envoy          
tcp6       0      0 :::15020                :::*                    LISTEN      6444/pilot-agent
  • knative activator
tcp6       0      0 :::8012                 :::*                    LISTEN      2436/activator      
tcp6       0      0 :::8013                 :::*                    LISTEN      2436/activator      
tcp6       0      0 :::9090                 :::*                    LISTEN      2436/activator      
tcp6       0      0 :::8008                 :::*                    LISTEN      2436/activator
  • knative autoscaler
tcp6       0      0 :::8080                 :::*                    LISTEN      2333/autoscaler     
tcp6       0      0 :::9090                 :::*                    LISTEN      2333/autoscaler     
tcp6       0      0 :::8008                 :::*                    LISTEN      2333/autoscaler
  • knative controller
tcp6       0      0 :::9090                 :::*                    LISTEN      2183/controller     
tcp6       0      0 :::8008                 :::*                    LISTEN      2183/controller
  • knative webhook
tcp6       0      0 :::8443                 :::*                    LISTEN      2048/webhook        
tcp6       0      0 :::9090                 :::*                    LISTEN      2048/webhook        
tcp6       0      0 :::8008                 :::*                    LISTEN      2048/webhook
  • istio networking
tcp6       0      0 :::9090                 :::*                    LISTEN      36626/controller    
tcp6       0      0 :::8008                 :::*                    LISTEN      36626/controller
  • kfserving-controller
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      40307/manager       
tcp6       0      0 :::9443                 :::*                    LISTEN      40307/manager
  • revscoring
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      41670/python3       
tcp6       0      0 :::9090                 :::*                    LISTEN      42058/queue         
tcp6       0      0 :::9091                 :::*                    LISTEN      42058/queue         
tcp6       0      0 :::8012                 :::*                    LISTEN      42058/queue         
tcp6       0      0 :::8080                 :::*                    LISTEN      41670/python3       
tcp6       0      0 :::8022                 :::*                    LISTEN      42058/queue
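To turn listings like the ones above into something usable for writing ingress rules, a small helper can group the listening ports by program. This is only a sketch: the function name and the choice to skip loopback-only listeners (which other pods cannot reach anyway) are my assumptions, and the sample input is a subset of the lines above.

```python
from collections import defaultdict


def listening_ports(netstat_output):
    """Map each program name to the set of non-loopback ports it listens on."""
    ports = defaultdict(set)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue  # skips headers, bullet lines and non-listening sockets
        host, _, port = fields[3].rpartition(":")
        if host.startswith("127."):
            continue  # loopback-only listeners are not reachable from other pods
        _, _, program = fields[6].partition("/")
        ports[program or fields[6]].add(int(port))
    return dict(ports)


SAMPLE = """\
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      41670/python3
tcp6       0      0 :::9090                 :::*                    LISTEN      42058/queue
tcp6       0      0 :::9091                 :::*                    LISTEN      42058/queue
tcp        0      0 127.0.0.1:15000         0.0.0.0:*               LISTEN      6482/envoy
"""

if __name__ == "__main__":
    for program, program_ports in sorted(listening_ports(SAMPLE).items()):
        print(program, sorted(program_ports))
```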

A lot of ports and complexity, but overall this is what should happen:

  • no pod should have rules restricting outgoing traffic (same as we do for the main cluster)
  • the istio gateway pods need to be reachable by all pods in order to route traffic
  • the knative pods should be able to talk to each other, and they should accept traffic from kfserving's controller and from istio (not from the ML service pods)
  • no pod should really need to contact kfserving

I had a chat with Janis this morning:

  • the GlobalNetworkPolicies that we define should cover generic settings that are not tailored to a specific namespace, i.e. only what is common to all pods.
  • we should add network policies to the knative and kserve charts, and probably to the kserve-inference chart as well.
  • the istio use case is a little different, since we don't really have a chart; more will come in collaboration with ServiceOps.
  • when adding policies for a namespace, setting even one egress or ingress rule turns the outbound or inbound traffic, respectively, into deny-all-except.
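The last point is the easiest one to trip over, so here is a small sketch of it. It builds a standard `networking.k8s.io/v1` `NetworkPolicy` as a Python dict and models, in a deliberately simplified way, how ingress to a pod is decided once at least one policy selects it. The policy name, namespace and labels are hypothetical, and the model ignores traffic sources and protocols: it only looks at ports.

```python
def ingress_policy(name, namespace, pod_labels, ports):
    """Build a NetworkPolicy that allows ingress to the selected pods on `ports` only."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {"ports": [{"protocol": "TCP", "port": port} for port in ports]}
            ],
        },
    }


def ingress_allowed(policies_selecting_pod, port):
    """Simplified model of the ingress decision for a single pod.

    `policies_selecting_pod` holds only the policies whose podSelector
    matches the pod. With none, everything is allowed; with at least one
    Ingress policy, only explicitly listed traffic gets through.
    """
    ingress_policies = [
        p for p in policies_selecting_pod if "Ingress" in p["spec"]["policyTypes"]
    ]
    if not ingress_policies:
        return True  # no policy selects the pod: default allow
    for policy in ingress_policies:
        for rule in policy["spec"].get("ingress", []):
            rule_ports = rule.get("ports")
            # A rule without a "ports" key matches every port.
            if rule_ports is None or any(e["port"] == port for e in rule_ports):
                return True
    return False  # selected by a policy, but no rule matched: denied


if __name__ == "__main__":
    # Hypothetical names; 8080 is the port the revscoring container listens on.
    policy = ingress_policy("revscoring-ingress", "revscoring", {"app": "revscoring"}, [8080])
    print(ingress_allowed([], 9090))        # no policy yet
    print(ingress_allowed([policy], 8080))  # explicitly allowed port
    print(ingress_allowed([policy], 9090))  # everything else is now denied
```

In other words, the first rule added to a namespace has to list every port that legitimately needs to be reachable, otherwise previously working traffic is dropped.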

In theory we can add network policies in batches rather than all at once, which reduces the complexity of the task.

Change 732677 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add network policies for the kserve-inference chart deployments

https://gerrit.wikimedia.org/r/732677

Change 732677 merged by Elukey:

[operations/deployment-charts@master] Add network policies for the kserve-inference chart deployments

https://gerrit.wikimedia.org/r/732677

Change 732939 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve: add network policies

https://gerrit.wikimedia.org/r/732939