
Clarify multi-service instance concepts in helm charts and enable canary releases
Closed, Resolved · Public

Description

Our new Helm chart templates were not originally developed to handle multi-service deployment charts. When I created the eventgate chart, I used wmf.releasename, which evaluates to .Chart.Name + .Release.Name, to identify a service instance. The release name for e.g. the eventgate-analytics service is 'analytics'.

But this isn't quite right. Every time we deploy we get a new 'release'; we just happen to force the release name to stay the same.


Alex and I recently needed to support live canary releases of a service. We want to be able to deploy a new image and/or configs to a limited number of pods and have them serve live traffic. This would allow us to first do the canary release, and then the main production release once the canary works fine. To do this, we'd add an extra release in the service's helmfile.yaml, like:

releases:
  - name: production
    values:
      - "values.yaml"
      - "private/secrets.yaml"
  - name: canary
    values:
      - "values.yaml"
      - "values-canary.yaml"
      - "private/secrets.yaml"

Values defined in values-canary.yaml would override the ones in values.yaml. values-canary.yaml would set things like replicas: 1 to ensure the canary release only deploys one pod.
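
For illustration, values-canary.yaml could be as small as the following (replicas is the key named above; the commented-out image override is purely hypothetical and chart-specific):

# values-canary.yaml - minimal override sketch
replicas: 1                    # run only a single canary pod
# main_app:
#   version: <new image tag>   # hypothetical: canary a new image before the main rollout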

This approach works fine in single-app-instance charts, where wmf.releasename makes sense, e.g. mathoid-production. For multi-app-instance charts, where we are currently abusing .Release.Name to ID the service, there is no real way to ID the app instance in the chart. If we kept things as they are now and added canary releases to both eventgate-main and eventgate-analytics, both the eventgate-main and eventgate-analytics canaries' wmf.releasename would evaluate to 'eventgate-canary'.

I propose we add a new main_app.name concept into our Helm charts, set in values.yaml. The chart's main values.yaml file doesn't know about multiple service instances, so it would just set this default to the chart's name (or whatever is appropriate). If a chart doesn't have multiple app instances, no change would be needed in the helmfile values.yaml. We'd also change wmf.releasename to evaluate to .Values.main_app.name + .Release.Name. Example:

mathoid service template variables

.Chart.Name: mathoid
.Values.main_app.name: mathoid
.Release.Name: production # (or canary)
wmf.releasename: mathoid-production # (or mathoid-canary)

eventgate-main template variables

.Chart.Name: eventgate
.Values.main_app.name: eventgate-main
.Release.Name: production # (or canary)
wmf.releasename: eventgate-main-production # (or eventgate-main-canary)
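
For reference, a minimal sketch of what the adjusted helper could look like (the printf/trunc pattern mirrors the existing wmf.* helpers; the exact definition may differ):

{{/* _helpers.tpl -- hypothetical sketch of the proposed wmf.releasename */}}
{{- define "wmf.releasename" -}}
{{- printf "%s-%s" .Values.main_app.name .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}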

We'd also want to use the main_app.name in a label for all of the k8s resources. e.g.

labels:
  chart: {{ template "wmf.chartname" . }}   # mathoid or eventgate
  app: {{ .Values.main_app.name }}          # mathoid or eventgate-main
  release: {{ .Release.Name }}              # production or canary

For resources that need matchLabels to select exactly their own release's resources, we'd match on all of chart, app, and release.
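
A sketch of such a Deployment selector, assuming the label names above (wmf.chartname resolves to just the chart name):

spec:
  selector:
    matchLabels:
      chart: {{ template "wmf.chartname" . }}
      app: {{ .Values.main_app.name }}
      release: {{ .Release.Name }}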


Independent of the above, we need a way for the production k8s nodePort Service resource to route to multiple releases. Alex and I have two different approaches to this.

Alex's approach uses a new service label (currently set conditionally to either the release name or a special addressed_by value; I think we should always set it to .Values.main_app.name as described above) to have the k8s Service match EITHER the app or the release, depending on another setting on the k8s Service resource, .Values.service.address_other_releases. This approach has the advantage of allowing a more complex hierarchy of k8s Service resources per release: you could potentially have several active releases in a deployment, with each one targeting one or multiple releases.

Otto's approach (for which the author of this ticket is biased :p) just adds a new routing_tag label that the k8s Service uses to select which pods it should route to. The value of routing_tag is arbitrary and defaults to .Release.Name, which makes the Service route only to pods in its own release. To enable canary releases, we want the production k8s Service to route to the production release pods as well as the canary release pods. To do this, we set service.routing_tag in values.yaml and values-canary.yaml to a common value (a good choice is just the app name, e.g. 'eventgate-main') shared by both the production and canary releases. Since the production release's k8s Service will now route to the canary release pods, the canary release does not need a k8s Service resource of its own. This is accomplished by setting service.deployment: none in values-canary.yaml.
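
For illustration, the Service selector under this approach would look roughly like this (a sketch; label and value names as described above, not the exact template):

type: NodePort
selector:
  app: {{ .Values.main_app.name }}
  routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}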

Event Timeline

Our new Helm chart templates were not originally developed to handle multi-service deployment charts

On the contrary, the entire idea was to be able to have multiple instances (and thus services) easily deployed from one chart. It was a goal to begin with, and in fact it's really easy to deploy a chart multiple times (even in the same namespace) and obtain multiple services. That being said, there are indeed a number of shortcomings here and there. I at least see the following.

  • Powering a service by more than one Helm release is not currently supported. Our canary-related work tries to address that (however, my goal is not to limit it to just canarying, but a somewhat broader concept of being able to power a service by as many releases of a chart as desired).
  • The metrics emitted by service-runner via statsd are not well tied to helm chart/release controllables, and some ambiguity and loss of control has crept in there. That is indeed a limitation that should be fixed.
  • A release currently can't explicitly define that it powers a specific service, as a kubernetes Service is bundled in the release. That is, in essence, the flip side of the coin of the 1st item in this list, with some consequences for the 2nd item as well (metrics differentiation/aggregation).

Every time we deploy we get a new 'release'; we just happen to force the release name to stay the same.

Technically speaking, what happens is that a release is an identifier. So when we deploy we upgrade an existing release. The name stays the same (if it varied all hell would break loose). If it's the very first time a release gets deployed, only then do we get a new one.

If we kept things as they are now and added canary releases to both eventgate-main and eventgate-analytics, both the eventgate-main and eventgate-analytics canaries' wmf.releasename would evaluate to 'eventgate-canary'.

Well, if we kept things as is, we would probably call the releases main-canary and analytics-canary to work around this, but point taken.

I propose we add a new 'service name' concept into our Helm charts, set in values.yaml.

Agreed, it does make sense. It would probably solve most of these issues.

The chart's main values.yaml file doesn't know about multiple service instances, so it would just set this default to the chart's name.

Defaulting to the chart name would cause the side effect of not being able to have 2 totally unrelated helm releases side by side (which is pretty useful when testing/debugging), which would mean a dev would have to jump through some unnecessary hoops. I think defaulting to the release's name is a better approach.

We'd change wmf.releasename to evaluate to .Values.service.name + .Release.Name

Why? We have the user-supplied value .service.name and can use it whenever we need it. Why mess with wmf.releasename?

Why mess with wmf.releasename

Because wmf.releasename doesn't currently consider the service's name, only the chart name and the release name. If we make this change, I'd stop using .Release.Name to ID the service instance, it would be for IDing releases (e.g. production, canary) of a service. If we don't add the service name to wmf.releasename, it will evaluate to eventgate-production and eventgate-canary in all of the eventgate service instances.

Unless, are you suggesting that we call the releases e.g. main-production and main-canary? This would be compatible with wmf.releasename now and have it eval to what I want (eventgate-main-production), but it seems more natural to me to make the service instance name a top level concept in all the charts, with the release name IDing a specific 'release' of the service instance, not the chart. wmf.releasename would then be used to fully qualify the release of the service instance.

Defaulting to the chart name would cause the side effect [...] I think defaulting to the release's name is a better approach.

Hm, why? Wouldn't you just do helm install --name canary ./mathoid (or dev or whatever)? With default service.name = chart name, you'd then get

chart name: mathoid
service name: mathoid
release name: canary #(or dev or whatever)

If you defaulted to the release name, you'd get:

chart name: mathoid
service name: canary #(or dev or whatever, or if not specified, random-wolf-12345)
release name: canary #(or dev or whatever, or if not specified, random-wolf-12345)

when we deploy we upgrade an existing release. The name stays the same

AH, this makes more sense. I think I knew that once and then forgot it.

On the contrary, the entire idea was to be able to have multiple instances

Sorry, didn't mean to imply otherwise. When I was developing eventgate it was difficult to figure out how to do this, so I abused release name to id the service.

Why mess with wmf.releasename

Because wmf.releasename doesn't currently consider the service's name, only the chart name and the release name. If we make this change, I'd stop using .Release.Name to ID the service instance, it would be for IDing releases (e.g. production, canary) of a service. If we don't add the service name to wmf.releasename, it will evaluate to eventgate-production and eventgate-canary in all of the eventgate service instances.

My question is more on the line of why use wmf.releasename to identify the service to begin with. We can just use service.name in the places where wmf.releasename is abused to identify the service.

Unless, are you suggesting that we call the releases e.g. main-production and main-canary?

No no, far from that. I've gone far enough to name the releases powering most services as "production". I'd much rather deployers named the releases whatever suits them best and not have to live by weird naming rules.

This would be compatible with wmf.releasename now and have it eval to what I want (eventgate-main-production), but it seems more natural to me to make the service instance name a top level concept in all the charts, with the release name IDing a specific 'release' of the service instance, not the chart.

It seems more natural to me as well.

wmf.releasename would then be used to fully qualify the release of the service instance.

This is what doesn't seem natural. Why use this when you have the top level concept ready in a variable?

Defaulting to the chart name would cause the side effect [...] I think defaulting to the release's name is a better approach.

Hm, why? Wouldn't you just do helm install --name canary ./mathoid (or dev or whatever)? With default service.name = chart name, you'd then get

chart name: mathoid
service name: mathoid
release name: canary #(or dev or whatever)

And implicitly you have the exact same service name for all releases, no matter how many, which ties the releases together in the default case. Which should be a conscious decision, not happen automatically.

If you defaulted to the release name, you'd get:

chart name: mathoid
service name: canary #(or dev or whatever, or if not specified, random-wolf-12345)
release name: canary #(or dev or whatever, or if not specified, random-wolf-12345)

Which doesn't have the above side-effect of tying the releases together implicitly. Each release is distinct in every way, unless specifically asked by the dev/deployer.

In other words, when someone wants a release to explicitly power a specific service, they should be required to do helm install --name myrelease --set service.name=myservice. Otherwise the release implicitly powers its own service.

On the contrary, the entire idea was to be able to have multiple instances

Sorry, didn't mean to imply otherwise. When I was developing eventgate it was difficult to figure out how to do this, so I abused release name to id the service.

Sure, it was one of the shortcomings in that part of the implementation that we identified back then. Truth be told, we should have had this task back then, but hindsight is always 20/20 :-)

My question is more on the line of why use wmf.releasename to identify the service to begin with. We can just use service.name in the places where wmf.releasename is abused to identify the service.

I can do that (and am mostly doing that already), although I think I am using wmf.releasename to ID service-release-specific resources, e.g. ConfigMap names. I could construct those names myself from .Values.service.name + .Release.Name, or just keep them named after .Values.service.name only.

But then what is the point of wmf.releasename at all? It will never be anything useful. 'eventgate-production' doesn't really refer to any useful k8s resource name or label.

And implicitly you have the exact same service name for all releases, no matter how many, which ties the releases together in the default case. Which should be a conscious decision, not happen automatically.

Right, when developing the chart you rarely need multiple services, but if you did, you'd do helm install --set service.name=eventgate-main --name dev (like you said).

Which doesn't have the above side-effect of tying the releases together implicitly. Each release is distinct in every way, unless specifically asked by the dev/deployer.

Ah, I see. This is not just about the service name, but about the k8s service resource addressing of releases. The default should be that a service resource is always deployed and addresses its release. That does make sense. I don't like the idea of setting the default service.name to .Release.Name (pretty unnatural to call the service 'dev' or 'production'), but I see why you would do it here. Hm.

Ok, this is a downside of my k8s service resource canary implementation then. I've got the service only installed when .Release.Name == "production", which is clearly not going to work in dev. Hm. This is a good reason to do it your addressed_by way. There's something about your patch I still find a little confusing to reason about though. I'll have a go at mine to do addressed_by and see what comes out.

Too bad set-based selectors aren't supported for the k8s Service resource; otherwise we could explicitly specify which releases a Service should target, like:

# values.yaml (hypothetical)
service:
  address_releases: [production, canary]

# Service selector (pseudocode; set-based selectors aren't actually supported)
type: NodePort
selector:
  app: {{ template "wmf.chartname" . }}
  service: {{ .Values.service.name }}
  release in {{ .Values.service.address_releases }}

...still exploring...

I just updated my patch; I'll explain my new idea below. But first, I think a big source of confusion in our patches is the conflation of the word 'service'. I'm using service.name really to mean deployment instance name, e.g. eventgate-analytics. k8s uses Service to mean a Service (e.g. NodePort) resource used for routing external traffic to internal pods & containers.

Perhaps we shouldn't use the word 'service' for our new top level concept? Would 'instance_name' be better? It might be less confusing to get rid of our service stuff in values.yaml altogether, and just rename it to instance (unless we have a better name). E.g.:

instance:
  name: eventgate # if you have multiple deployments, change this to the full instance name, e.g. eventgate-main
  # Set this to false for releases for which you don't want a Service installed.
  deployment: minikube # valid values are "production" and "minikube"
  port: null # you need to define this if "production" is used. In minikube environments let it autoallocate

But, perhaps that is too disruptive, and just using service.instance_name instead of service.name is clarifying enough.

On the other hand, we do call these 'services' in helmfile.d...sigh. Perhaps service.name is ok, and we can just be very explicit when talking about service instance vs k8s Service resource? To avoid confusion, I'll always refer to the k8s Service resource as Service with a capital 'S'.

Anyway, here's how my latest patch works.

A Service is only deployed in a release if service.declare_service_routing == true (This is set to true in the chart's default values.yaml).

The Service's NodePort selector now looks like:

type: NodePort
selector:
  app: {{ template "wmf.chartname" . }}
  service: {{ .Values.service.name }} # TBD change this to instance: {{ .Values.instance_name }} ?
  service_selector: {{ .Values.service.service_selector | default .Release.Name }}

The service_selector label can be set to an arbitrary value to make the Service target all pods that also have the same service_selector value. Then, in deployment.yaml, the pod template gets

template:
  metadata:
    labels:
      app: {{ template "wmf.chartname" . }}
      service: {{ .Values.service.name }}  # TBD change this to instance.name
      release: {{ .Release.Name }}
      service_selector: {{ .Values.service.service_selector | default .Release.Name }}

service_selector defaults to .Release.Name, so the default behavior holds: (a) a Service is deployed for each release, and (b) that Service only addresses pods in its own release.

For our canary use case, in the helmfile.d service's values.yaml we'd have:

service:
  name: eventgate-main
  service_selector: eventgate-main

And in canary.yaml

service:
  declare_service_routing: false
  service_selector: eventgate-main

In this way, both eventgate-main production and canary releases would have service_selector: eventgate-main, but a Service would only be deployed as part of the production release, and that Service would select all pods in both releases.
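
Rendered with those values, the production Service's selector and the pod labels of the two releases would look roughly like this (a sketch based on the templates above):

# production release Service selector (rendered)
selector:
  app: eventgate
  service: eventgate-main
  service_selector: eventgate-main

# production pod labels (rendered)        # canary pod labels (rendered)
#   app: eventgate                        #   app: eventgate
#   service: eventgate-main               #   service: eventgate-main
#   release: production                   #   release: canary
#   service_selector: eventgate-main      #   service_selector: eventgate-main

Since the selector omits the release label, it matches both sets of pods.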

(Oof, service.service_selector is extra confusing, since the first 'service' means 'helmfile service instance' and the 'service_selector' service means k8s Service. We should rename something here, eh?)

I find this idea a little bit easier to understand than yours, mainly because we aren't conditionally applying labels and varying label values based on what is addressing what. In my patch, label values are always consistent and explicit.

But first, I think a big source of confusion in our patches is the conflation of the word 'service'.

I think you are right. So indeed let's clarify that first.

I'm using service.name really to mean deployment instance name, e.g. eventgate-analytics.

I am not sure what a deployment instance is exactly in this context. A set of deployed code + configuration? Because that's a helm release in our current context. Or is it something else?

k8s uses Service to mean a Service (e.g. NodePort) resource used for routing external traffic to internal pods & containers.

I don't think we should use that definition. It's very specific as you point out.

It might be less confusing to get rid of our service stuff in values.yaml altogether, and just rename it to instance (unless we have a better name). E.g.:

My kneejerk reaction to this is "instance of what"? of eventgate? This is just like saying X instances of apache. But an apache being installed/running isn't descriptive at all about what needs it is serving. In discussions it could be clarified, but then we've already lost clarity to begin with.

To me a service is something a bit more generic. It's:

  1. One or more deployed codebase versions alongside their configurations.
  2. The ability to receive traffic (internal/external/synthetic/whatever) and answer it
  3. An owner that sets (or delegates that to someone else) the "rules" (e.g. intended audiences, expected budget, capacity planning, performance, incident response expectations, etc)
  4. Something that is (hopefully) unequivocally named.
  5. Other stuff I forget.

Of those I don't think that 3 (which is more like an entire book) applies at all to our conversation and 4 would probably just be the value we would put in service.name. Note that nowhere above do I stick to specific technologies or implementations.

Now, with that in mind, let's actually do that for our context.

(1) is accomplished by helm releases. Multiple codebases + configurations can be deployed easily (we didn't have the easy part in the past). Those need to be somehow identified uniquely in order to avoid confusion and other interesting issues. Which answers the following question.

But then what is the point of wmf.releasename at all? It will never be anything useful. 'eventgate-production' doesn't really refer to any useful k8s resource name or label.

That ID part is the role of wmf.releasename. It is used throughout the charts tree in place of the name for almost all resource types, e.g. deployments, pods, configmaps, networkpolicies. To avoid that ID being too weird due to dev/deployer input (people can be inventive), it is purposefully truncated to 63 chars and the chart name is prepended to make it abundantly clear which chart/app it is about. But it's wrong to use it to identify the service (it's OK, though, to use it to identify the k8s Service resource; I follow your example here, read more below for why), because the service is more than the helm releases.

(2) There are many ways to implement this, even in our context. However, let's stick to what we currently have, that is the kubernetes Service resource. All it does is select pods and route traffic to them. There are a number of ways to instantiate such a resource. What we currently do is bundle one with the release. That keeps things simple and easy to reason about. We could go out of our way and instantiate it in different ways (e.g. via some kubectl command, via a different helm chart that only does that, or even some of the newer tools like jsonnet), but everything I can think of would break the ease of locally testing/developing/debugging the chart. So in the default case we need to ship a Service resource, and for that reason the internal name metadata (NOT label, mind you) of that Service resource needs to be wmf.releasename. This is how it currently is in all charts and it has no repercussions for our discussion of service.name; I note it here just to avoid misunderstandings. This is the 'why' I alluded to above.
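
To illustrate the distinction (a rough sketch, not the exact chart template): the Service's metadata.name is the release-scoped wmf.releasename ID, while the labels carry the chart/release identity.

apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}   # e.g. eventgate-production
  labels:
    app: {{ template "wmf.chartname" . }}
    release: {{ .Release.Name }}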

Now to your approach

A Service is only deployed in a release if service.declare_service_routing == true (This is set to true in the chart's default values.yaml).

This accomplishes more or less what my addressed_by does so we agree on premise. It's a bit long as a variable, but I expect rather few charts using it at the beginning, so I guess it's fine.

I have to say I prefer the service.name approach; it's more explicit. And service.instance_name causes the exact same kneejerk reaction described above for me.

Now to the yaml at hand. Note how both service.name AND service_selector above have the same value in "main" and "canary". That means to me that one of the 2 is redundant. We could work with just service.name, default it to .Release.Name, and allow anyone who cares enough (e.g. you) to override it and set the values they care about. The rest of the charts haven't really expressed the need for a canary yet, so e.g. having a default of service.name == production is probably fine for them.

My kneejerk reaction to this is "instance of what"? of eventgate?

K cool, let's figure out a different name. I like service.name best, just don't want to confuse it with k8s Service.

That ID part is the role of wmf.releasename [...] It is used throughout the charts tree in place of the name for almost all resource types
So in the default case we need to ship a Service resource, and for that reason the internal name metadata (NOT label, mind you) of that Service resource needs to be wmf.releasename. This is how it currently is in all charts and it has no repercussions for our discussion of service.name

But, wouldn't we want the resource names to be unique per 'service'? If we don't include the service.name in wmf.releasename (and keep using it for resource names) then all the resources will be named chart+release. For eventgate, chart is 'eventgate' and release will be either 'production' or 'canary'. Do we want multiple (there will be a total of 4!) k8s Services all named exactly 'eventgate-production'?

This accomplishes more or less what my addressed_by does so we agree on premise

Ya indeed, we agree on the final deployed result. Just trying to reduce the number of knobs and cognitive load :)

It's a bit long as a variable

I'm having trouble naming these, mainly due to our ongoing discussion here about service.name and k8s Service. I'm sure we can find a better name once we solve this.

That means to me that one of the 2 is redundant.

Hm, yes I suppose. I did it this way because I wanted the label values to remain consistent for all resources that use that label. Since we are trying to add service.name as a top level chart concept, it seemed natural to me to add it to all resources as a label. That way, all resources for a service could be selected by querying for the service label (hm, maybe we should be extra explicit and call the label service_name to avoid confusion with k8s Service!?), independent of whether the release is e.g. 'production' or 'canary'.

Maybe this would be less confusing if we called service.service_selector service.routing_tag instead? The default routing tag IS .Release.Name, but in our case here we just set it to the service name explicitly, because we want the k8s Service to route to all pods of this 'service'. However, the routing_tag value could be anything; it only exists to specify which pods the Service should route to. It is redundant in my eventgate canary use case, where I am setting it to just e.g. 'eventgate-main', but keeping it as a separate label allows you to be just as flexible as your addressed_by idea.

I don't need this extra flexibility though. I'd also be fine with something simpler like:

# values.yaml
service:
  name: eventgate-main
  deploy_service_routing: true
  route_to_release_only: false # Naming TBD, we can find something better than 'route_to_release_only'

# canary.yaml
service:
  deploy_service_routing: false

Then in service.yaml

type: NodePort
selector:
  app: {{ template "wmf.chartname" . }}
  service_name: {{ .Values.service.name }} # Perhaps service_name is a less confusing label than just service?
  {{- if .Values.service.route_to_release_only }}
  release: {{ .Release.Name }}
  {{- end }}

In this way, we don't need the new routing_tag label. We lose some flexibility, but gain some simplicity.

Either approach works for me!

K cool, let's figure out a different name. I like service.name best, just don't want to confuse it with k8s Service.

Should we use main_app.name instead of service.name?

Change 564052 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate - Use main_app.name as primary resource grouping, not wmf.releasename

https://gerrit.wikimedia.org/r/564052

Should we use main_app.name instead of service.name?

I think yes is the answer. I just updated my patch with this, and it makes much more sense now. service.* stuff is reserved wholly for k8s Service related stuff. main_app.name makes a lot more sense; we are naming the 'app'.
I've then changed the value of the app label to main_app.name, and I've added a new label called chart that has the value of wmf.chartname.

I've also been able to do without the service.declare_service_routing thing. Instead, I just don't deploy a Service if service.deployment == "none".
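
A sketch of how that guard could look in service.yaml (the rest of the Service spec is unchanged and omitted here):

{{- if ne .Values.service.deployment "none" }}
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}
# ... rest of the Service spec ...
{{- end }}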

Now, enabling a canary release would look like this:

# values.yaml (production release)
service:
  deployment: production
  port: 31192
  # This release's k8s Service should route to all pods that have this routing_tag
  routing_tag: eventgate-analytics

# canary-values.yaml
service:
  # Don't deploy a k8s Service for this canary release
  deployment: none
  # the production release sets its deployed k8s Service routing_tag
  # to this value, causing its Service to also route to pods that are
  # part of this canary release.
  routing_tag: eventgate-analytics

My patch does still alter the value of wmf.releasename, for the reasons stated above.

I'm moving forward with this. Let's continue this discussion later and refactor if needed when you have time. Thanks!

Change 564052 merged by Ottomata:
[operations/deployment-charts@master] eventgate - Use main_app.name as primary resource grouping, not wmf.releasename

https://gerrit.wikimedia.org/r/564052

Change 571801 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate - Name ConfigMaps using full wmf.releasename

https://gerrit.wikimedia.org/r/571801

Change 571801 merged by Ottomata:
[operations/deployment-charts@master] eventgate - Name ConfigMaps using full wmf.releasename

https://gerrit.wikimedia.org/r/571801

Change 571814 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Update eventgate/README.md

https://gerrit.wikimedia.org/r/571814

Change 571815 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate - fix names of volume mounts

https://gerrit.wikimedia.org/r/571815

Change 571814 merged by Ottomata:
[operations/deployment-charts@master] Update eventgate/README.md

https://gerrit.wikimedia.org/r/571814

Change 571815 merged by Ottomata:
[operations/deployment-charts@master] eventgate - fix names of volume mounts

https://gerrit.wikimedia.org/r/571815

Ok, applied for staging eventgate-analytics. I think it works!

First, because the 'analytics' release already existed, my first attempt to deploy the 'production' one failed: both were trying to use the same nodePorts for their Service(s). In this case, I destroyed the analytics release and was able to apply. This is fine for staging, but won't work for eqiad or codfw, as the 'analytics' release is actively used. We might have to pick new ports for 'production', then switch LVS once all this works, and THEN destroy the 'analytics' release.

I do see some warning Events from when the canary pod was created. I'm not sure what they mean:

Events:
  Type     Reason                  Age                From                                Message
  ----     ------                  ----               ----                                -------
  Normal   Scheduled               29m                default-scheduler                   Successfully assigned eventgate-analytics/eventgate-analytics-canary-7477b599f9-jv8rj to kubestage1002.eqiad.wmnet
  Warning  FailedCreatePodSandBox  29m                kubelet, kubestage1002.eqiad.wmnet  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9c202b88e20b2d08367aecc03a707679c353e8b1e6bf447f63902ed84da1da03" network for pod "eventgate-analytics-canary-7477b599f9-jv8rj": NetworkPlugin cni failed to set up pod "eventgate-analytics-canary-7477b599f9-jv8rj_eventgate-analytics" network: failed to get IPv6 addresses for host side of the veth pair
  Warning  FailedCreatePodSandBox  29m                kubelet, kubestage1002.eqiad.wmnet  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1511c81c2b2f3841594e9c5d965b8053345a8d83c4ff3126e2c899ea3de88071" network for pod "eventgate-analytics-canary-7477b599f9-jv8rj": NetworkPlugin cni failed to set up pod "eventgate-analytics-canary-7477b599f9-jv8rj_eventgate-analytics" network: failed to get IPv6 addresses for host side of the veth pair
  Normal   SandboxChanged          29m (x3 over 29m)  kubelet, kubestage1002.eqiad.wmnet  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  29m                kubelet, kubestage1002.eqiad.wmnet  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9e9dc8e1f8ca35672ec55843c3b7d83fc9b036a06da9ce03741e1a256148a23c" network for pod "eventgate-analytics-canary-7477b599f9-jv8rj": NetworkPlugin cni failed to set up pod "eventgate-analytics-canary-7477b599f9-jv8rj_eventgate-analytics" network: failed to get IPv6 addresses for host side of the veth pair
  Normal   Pulled                  29m                kubelet, kubestage1002.eqiad.wmnet  Container image "docker-registry.wikimedia.org/wikimedia/eventgate-wikimedia:2020-01-10-185555-production" already present on machine
  Normal   Pulled                  29m                kubelet, kubestage1002.eqiad.wmnet  Container image "docker-registry.wikimedia.org/prometheus-statsd-exporter:latest" already present on machine
  Normal   Created                 29m                kubelet, kubestage1002.eqiad.wmnet  Created container
  Normal   Started                 29m                kubelet, kubestage1002.eqiad.wmnet  Started container
  Normal   Created                 29m                kubelet, kubestage1002.eqiad.wmnet  Created container
  Normal   Started                 29m                kubelet, kubestage1002.eqiad.wmnet  Started container
  Normal   Pulled                  29m                kubelet, kubestage1002.eqiad.wmnet  Container image "docker-registry.wikimedia.org/envoy-tls-local-proxy:1.11.2-1" already present on machine
  Normal   Created                 29m                kubelet, kubestage1002.eqiad.wmnet  Created container
  Normal   Started                 29m                kubelet, kubestage1002.eqiad.wmnet  Started container
  Warning  FailedSync              29m                kubelet, kubestage1002.eqiad.wmnet  error determining status: rpc error: code = Unknown desc = Error: No such container:

Perhaps these are just temporary warnings that happened when spawning the first pod? @akosiaris any idea? (don't spend time finding out, just wanted to ask in case you've seen this before).

Change 572093 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate - For consistency, add common labels to networkpolicy

https://gerrit.wikimedia.org/r/572093

Change 572093 merged by Ottomata:
[operations/deployment-charts@master] eventgate - For consistency, add common labels to networkpolicy

https://gerrit.wikimedia.org/r/572093

Mentioned in SAL (#wikimedia-operations) [2020-02-18T14:53:59Z] <ottomata> deploying new 'canary' and 'production' releases for eventgate-analytics. (These releases use a new nodePort, and so will not be active until LVS is modified. The old 'analytics' release and nodePort is left as is.) - T242861

Change 572897 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - Bump image version to for readiness probe schema

https://gerrit.wikimedia.org/r/572897

Change 572897 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - Bump image version to for readiness probe schema

https://gerrit.wikimedia.org/r/572897

Change 572900 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics Use primary and secondary schema repos

https://gerrit.wikimedia.org/r/572900

Change 572900 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics Use primary and secondary schema repos

https://gerrit.wikimedia.org/r/572900

Mentioned in SAL (#wikimedia-operations) [2020-02-18T16:02:35Z] <ottomata> deploying new 'canary' and 'production' releases for eventgate-main. (These releases use a new nodePort, and so will not be active until LVS is modified. The old 'main' release and nodePort is left as is.) - T242861

Updated the task description with details of the way eventgate is now doing this.

@akosiaris for my purposes I'm satisfied, but I'm not sure we settled this totally for other charts. Most controversial was my modification to wmf.releasename.

Should we close this ticket, or keep it open to figure out servicesops' stance?

akosiaris changed the task status from Open to Stalled. (Mar 13 2020, 12:21 PM)

@akosiaris for my purposes I'm satisfied, but I'm not sure we settled this totally for other charts. Most controversial was my modification to wmf.releasename.

Should we close this ticket, or keep it open to figure out servicesops' stance?

I'll mark it as stalled and assign it to me so I can resume work on it when time permits.