Currently, there is a race condition for eventgate clusters that don't dynamically load new schemas from the remote schema repos hosted at schema.wikimedia.org.
When we merge a schema, it is automatically deployed to schema.wikimedia.org. The ProduceCanaryEvents job will look up the latest version of a schema for a stream and produce a canary event for it.
eventgate-main and eventgate-logging-external are configured to only load schemas from checkouts of the schema repos that are bundled into the eventgate-wikimedia Docker image. This was done to avoid an unnecessary coupling between the production eventgate service and a remote schema service.
When a new schema version is merged, ProduceCanaryEvents will try to produce a canary event using the new version to e.g. eventgate-main, and will fail until a new eventgate-wikimedia Docker image with the latest commit from the schema repo is built and deployed.
eventgate-wikimedia can be configured to load schemas from both the local filesystem and remote URLs. We should configure eventgate-main and eventgate-logging-external to do this. Whichever URL loads first will be used.
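To illustrate why adding the remote repo closes the gap, here is a minimal Python sketch of "first successful load wins" across several schema sources. It is not eventgate's actual implementation: `dict_loader`, `load_first`, the `/test/event/...` schema URIs, and the in-memory stores standing in for the bundled checkout and schema.wikimedia.org are all hypothetical names made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

# A loader takes a relative schema URI and returns the schema,
# or raises LookupError if that source does not have it.
Loader = Callable[[str], dict]


def dict_loader(store: Dict[str, dict]) -> Loader:
    """Build a loader backed by an in-memory dict (stand-in for a file read or HTTP GET)."""
    def load(relative_uri: str) -> dict:
        try:
            return store[relative_uri]
        except KeyError:
            raise LookupError(relative_uri) from None
    return load


def load_first(relative_uri: str, loaders: List[Loader]) -> dict:
    """Query every source concurrently and return the first successful result."""
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = [pool.submit(loader, relative_uri) for loader in loaders]
        for future in as_completed(futures):
            if future.exception() is None:
                return future.result()
    raise LookupError(f"{relative_uri} not found in any schema repo")


# Hypothetical schemas: 1.1.0 was merged after the image was built.
v1_0 = {"$id": "/test/event/1.0.0"}
v1_1 = {"$id": "/test/event/1.1.0"}

# Checkout bundled into the image: only knows the old version.
bundled = dict_loader({"/test/event/1.0.0": v1_0})
# Remote repo: updated as soon as the schema change is merged.
remote = dict_loader({"/test/event/1.0.0": v1_0, "/test/event/1.1.0": v1_1})
```

With only `bundled` configured, `load_first("/test/event/1.1.0", [bundled])` raises `LookupError`, which corresponds to the failing canary event. With `[bundled, remote]`, the same call succeeds via the remote source, while versions present in the image can still be served locally.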
This does not solve the case where a new stream is declared to use a brand new schema (not just a new schema version), because these services still only request stream config at startup. In these cases, however, there is no race condition: the schema can be merged first, and the stream config that declares the stream can be merged later. The service can then be bounced as described here.
- Configure eventgate-main and eventgate-logging-external helmfile values to use both local and remote schema repos, just like eventgate-analytics does.
- Documentation describing eventgate clusters is updated.