
EventGate Helm chart should POST test event for readinessProbe
Closed, Resolved · Public · 5 Story Points

Description

Right now, an eventgate pod is considered ready to serve if it responds to GET /?spec. It'd be more correct to say it is 'ready' if it can actually POST and produce a test event to Kafka.

Event Timeline

Ottomata created this task. Mar 19 2019, 2:38 PM
Restricted Application added a subscriber: Aklapper. Mar 19 2019, 2:38 PM

This should be doable by adding x-amples to the service's spec. While technically it won't achieve exactly that, having a POST example that sends a test event will allow the service-checker utility to identify a problematic pod, which should result in k8s restarting it.

Good idea! The test x-amples are there. We should add a custom spec for the wikimedia-eventgate implementation with our x-ample event.

The readiness probe can't really be a POST. Per the reference at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#probe-v1-core, a probe only allows httpGet, tcpSocket and exec. Exec could be used for this to call service-checker, but it adds a dependency on having service-checker on the nodes (it isn't there currently) and on instrumentation (essentially getting the pod IP; I am not sure it's exposed there and will need to check) to make it happen. It might also be a tad heavy to run service-checker every 10 secs for every pod.

Ya, was thinking it'd have to be exec, and then if we can/should use service_checker that'd be fine. @akosiaris do you think we shouldn't do this?

Well, I have my reservations for sure. As I said, we are talking about a service-checker run every 10s (tunable, but it's a sensible default), so the command should not take long. The timeout is also tunable, but the default is 1s. I think it's a sensible value for a web service, and I don't think service-checker doing all the endpoints + a POST that will end up in kafka is going to be THAT fast. It also requires having service-checker on every node out there. While that's fine in production, the development environments are definitely not gonna have it. We could have the probe in values.yaml (we do for all other services, only eventgate is an exception) and override it in production and that's probably fine, but it should be documented.
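To make the timing concern concrete, here is a rough Python sketch of how an exec probe is judged: the command must exit with status 0 before the timeout expires, otherwise the attempt counts as a failure. The function name `run_probe` is hypothetical and this only approximates the kubelet's behaviour; it is not how Kubernetes is implemented.

```python
import subprocess
import sys


def run_probe(command, timeout_s=1.0):
    """Return True if `command` exits 0 within `timeout_s` seconds.

    Hypothetical helper: it mimics the success rule of an exec probe
    (exit status 0 before the timeout), nothing more.
    """
    try:
        result = subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The command was too slow; the child is killed and the probe fails.
        return False
    except OSError:
        # The binary is missing or not executable, e.g. not present in the image.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # A trivial command that exits 0 quickly.
    print(run_probe([sys.executable, "-c", "pass"], timeout_s=30))
    # A command that sleeps longer than the timeout is reported as a failure.
    print(run_probe([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.2))
```

With the default 1s timeout, anything the probe command does (starting an interpreter, POSTing, waiting for the Kafka produce to be acknowledged) has to fit inside that budget, which is the worry raised above.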

If anything, it might make more sense to create a specialized GET /healthz endpoint that just produces (and deletes if required/prudent?) a hardcoded test event in kafka.

If anything, it might make more sense to create a specialized GET /healthz endpoint that just produces (and deletes if required/prudent?) a hardcoded test event in kafka.

I like this as a temporary work-around to the problem (we could perhaps reuse this idea for change-prop too).

Nuria triaged this task as High priority.
Ottomata moved this task from Backlog to Next Up on the EventBus board. Mar 25 2019, 5:18 PM

it might make more sense to create a specialized GET /healthz endpoint that just produces (and deletes if required/prudent?) a hardcoded test event in kafka.

I don't like this idea so much, mainly because it requires that we include WMF specific schemas in the API routing code. Thus far I've been able to keep any WMF specific stuff in the wikimedia-eventgate specific implementation.

I don't think service-checker doing all the endpoints + a POST that will end up in kafka is going to be THAT fast

There are only /events, /_info and /?spec endpoints. But, I really only want a POST to /events, I don't want the readiness probe to check all events. I don't even need service_checker here really.

I'd prefer a custom exec command with the test event data built into the image that could do this. Something like:

readinessProbe:
  exec:
    command: [curl, -X, POST, -H, 'Content-Type: application/json', -d@~/test_event.json, http://localhost:8192/v1/events]
  initialDelaySeconds: 2

Hm, I just noticed that there is an eventgate-analytics-staging-service-checker (pod/hook?) deployed along with eventgate-analytics. Perhaps I should just add a POST x-ample to the wikimedia-eventgate spec? Would this then cause that pod to trigger an alert if the POST fails?

it might make more sense to create a specialized GET /healthz endpoint that just produces (and deletes if required/prudent?) a hardcoded test event in kafka.

I don't like this idea so much, mainly because it requires that we include WMF specific schemas in the API routing code. Thus far I've been able to keep any WMF specific stuff in the wikimedia-eventgate specific implementation.

How about it just counts the number of failures to produce, or even better records the timestamp of the last failure to produce, and reports yes/no based on some threshold for that? It does add some complexity to the software, namely some shared data structures between processes, which I am not sure service-runner can handle, but it would be non WMF specific.
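A minimal Python sketch of the timestamp variant of this idea, for illustration only. The class name `ProduceHealth`, the `quiet_period_s` threshold and its default are all hypothetical, and the sketch is single-process: it does not address the cross-process sharing problem mentioned above.

```python
import time


class ProduceHealth:
    """Report readiness from the time of the most recent produce failure.

    Hypothetical helper: the service is considered ready if it has never
    failed to produce, or if the last failure is at least `quiet_period_s`
    seconds old.
    """

    def __init__(self, quiet_period_s=30.0, clock=time.monotonic):
        self.quiet_period_s = quiet_period_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_failure = None    # no failure seen yet

    def record_failure(self):
        """Call this whenever producing an event fails."""
        self.last_failure = self.clock()

    def is_ready(self):
        """Return True if no failure occurred within the quiet period."""
        if self.last_failure is None:
            return True
        return self.clock() - self.last_failure >= self.quiet_period_s


if __name__ == "__main__":
    health = ProduceHealth(quiet_period_s=30.0)
    print(health.is_ready())   # no failures recorded yet
    health.record_failure()
    print(health.is_ready())   # a failure was just recorded
```

A plain GET endpoint could then answer with a success or error status depending on `is_ready()`, which keeps the probe an httpGet and needs no test event or schema in the routing code.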

I don't think service-checker doing all the endpoints + a POST that will end up in kafka is going to be THAT fast

There are only /events, /_info and /?spec endpoints. But, I really only want a POST to /events, I don't want the readiness probe to check all events. I don't even need service_checker here really.

I'd prefer a custom exec command with the test event data built into the image that could do this. Something like:

readinessProbe:
  exec:
    command: [curl, -X, POST, -H, 'Content-Type: application/json', -d@~/test_event.json, http://localhost:8192/v1/events]
  initialDelaySeconds: 2

And that makes the assumption now that curl is around on the kubernetes nodes (and there's your WMF specificness entering the picture).

Hm, I just noticed that there is an eventgate-analytics-staging-service-checker (pod/hook?) deployed along with eventgate-analytics.

That's for helm test. It won't ever be instantiated unless helm test is run. Implementation-wise, it's a helm hook used to test the software in the CI/CD pipeline. You can have a look at the helm hooks at https://github.com/helm/helm/blob/master/docs/charts_hooks.md but I fail to see how they would help you for this.

Perhaps I should just add a POST x-ample to the wikimedia-eventgate spec?

Definitely. It would sure help overall (but not with this specific issue).

Would this then cause that pod to trigger an alert if the POST fails?

No, not at the pod level. But it would trigger an alert at the service level (because it's icinga that runs those checks).

Ottomata added a comment. Edited Mar 27 2019, 6:54 PM

Thanks for the tips. Petr clued me in to the fact that JSONSchema itself has an examples field, so we can add example events in schemas. I'm going to try to do something with this. I'm writing a nodejs script that will exist in the EventGate code that will take a schema URI, get it, get the examples from it, and POST them to the service. Could we exec this script for the readinessProbe? This would be like:

readinessProbe:
  exec:
    command: [/srv/service/scripts/post-events examples /test/event/0.0.3 http://localhost:8192/v1/events]
  initialDelaySeconds: 2
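The idea behind such a script can be sketched as follows. This is a Python illustration only, not the nodejs script mentioned above: the function names and the sample schema are hypothetical, and sending all examples as one JSON array in a single POST is an assumption about what the service accepts.

```python
import json
import urllib.request


def examples_from_schema(schema):
    """Return the example events listed under the schema's `examples` keyword."""
    examples = schema.get("examples", [])
    if not isinstance(examples, list):
        raise ValueError("`examples` must be an array")
    return examples


def build_post(url, events):
    """Build (but do not send) a JSON POST request carrying the events."""
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical schema; a real one would be fetched from the schema URI.
    schema = {
        "title": "test/event",
        "type": "object",
        "examples": [{"test": "example value"}],
    }
    request = build_post("http://localhost:8192/v1/events", examples_from_schema(schema))
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request, timeout=1) would actually send it;
    # a probe script would then exit non-zero on an error response.
```

Keeping the example events in the schema means the probe command only needs a schema URI and the service URL, so no test payload has to be baked into the chart.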

And that makes the assumption now that curl is around on the kubernetes nodes (and there's your WMF specificness entering the picture).

? curl isn't WMF specific? The image is WMF specific already (it clones our event-schemas and uses our wikimedia-eventgate implementation), so I don't mind if the image does specific things. I just don't really want to add WMF specific bits to the API.

Change 499576 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] [WIP] POST test event to service for readinessProbe

https://gerrit.wikimedia.org/r/499576

Ottomata moved this task from Next Up to In Progress on the EventBus board. Mar 27 2019, 7:52 PM
Ottomata moved this task from Next Up to In Code Review on the Analytics-Kanban board.
Ottomata claimed this task.

Thanks for the tips. Petr clued me in to the fact that JSONSchema itself has an examples field, so we can add example events in schemas. I'm going to try to do something with this. I'm writing a nodejs script that will exist in the EventGate code that will take a schema URI, get it, get the examples from it, and POST them to the service. Could we exec this script for the readinessProbe? This would be like:

readinessProbe:
  exec:
    command: [/srv/service/scripts/post-events examples /test/event/0.0.3 http://localhost:8192/v1/events]
  initialDelaySeconds: 2

It should be command: ['/srv/service/scripts/post-events examples', '/test/event/0.0.3', 'http://localhost:8192/v1/events'] if I read the docs correctly. Just noting that it is exec'd as is, that is, with no shell.

It does look like it would work.
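Because there is no shell, nothing splits the command on spaces: every list element reaches the program as exactly one argument, and the first element is the program to run. The Python sketch below uses `shlex.split` to show how a shell would have tokenized the same line. The helper name `to_exec_form` is hypothetical, and whether `examples` is meant to be a separate argument to post-events is an assumption that this thread does not settle.

```python
import shlex

# The command as one shell-style line, as it appears in the proposal above.
SHELL_STYLE = (
    "/srv/service/scripts/post-events examples "
    "/test/event/0.0.3 http://localhost:8192/v1/events"
)


def to_exec_form(command_line):
    """Split a shell-style command line into the list an exec-style call expects."""
    return shlex.split(command_line)


if __name__ == "__main__":
    # One list element: the whole string, spaces included, is treated as the
    # path of the program to run.
    single_element = [SHELL_STYLE]
    print(len(single_element))

    # Shell-like tokenization: program first, then one element per argument.
    for part in to_exec_form(SHELL_STYLE):
        print(part)
```

The same rule applies to quoted values: `to_exec_form("curl -H 'Content-Type: application/json'")` keeps the header as a single element, which is why it is quoted as one item in the YAML list.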

? curl isn't WMF specific?

The presence of it on the image is specific though.

The image is WMF specific already (it clones our event-schemas and uses our wikimedia-eventgate implementation), so I don't mind if the image does specific things. I just don't really want to add WMF specific bits to the API.

Fine by me

Change 499576 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - POST test event to service for readinessProbe

https://gerrit.wikimedia.org/r/499576

Ottomata moved this task from In Code Review to Done on the Analytics-Kanban board. Apr 9 2019, 4:09 PM
Ottomata set the point value for this task to 5. Apr 10 2019, 4:01 PM
Nuria closed this task as Resolved. Apr 19 2019, 1:30 PM