From an email I wrote in response to an alert:
Oct 25 14:30:05 an-launcher1002 produce_canary_events[3123]: 2020-10-25T14:30:05.933 ERROR ProduceCanaryEvents Some canary events failed to be produced
Oct 25 14:30:05 an-launcher1002 produce_canary_events[3123]: POST https://eventgate-analytics-external.svc.eqiad.wmnet:4692/v1/events =>
EventGate returned a 207 response: canary events POSTed to 3 streams (eventlogging_SearchSatisfaction, test.instrumentation, and ios.edit_history_compare) failed with errors such as 'ios.edit_history_compare does not have a schema_title setting.'
eventgate-analytics-external is the only EventGate deployment that uses 'dynamic stream config', in that it refreshes its stream config from the MW API source every 5 minutes. I checked the logs, and I can't see any failed EventStreamConfig MW API requests in any of the eventgate-analytics-external pods. But even if a request had failed, I don't see how this 'does not have a schema_title setting' error could result. That error only occurs if the stream IS configured but does not have a schema_title setting. I don't see how this could happen intermittently like this! Sure, stream config could be misconfigured, or HTTP API requests could fail, but how could an incorrect stream config appear and then disappear?
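To make the distinction above concrete, here is a minimal sketch of the two failure cases. It is not eventgate-wikimedia's actual code: the function name, the exception class, and the 'is not a configured stream' message are all made up for illustration. Only the 'does not have a schema_title setting' message comes from the alert.

```python
class StreamConfigError(Exception):
    """Hypothetical error type, for illustration only."""


def schema_title_for(stream_name, stream_configs):
    """Hypothetical lookup of a stream's schema_title in cached stream config."""
    if stream_name not in stream_configs:
        # Assumed wording; this is what a missing stream would look like.
        raise StreamConfigError(f"{stream_name} is not a configured stream.")
    settings = stream_configs[stream_name]
    if "schema_title" not in settings:
        # The message seen in the alert: the stream exists, the setting does not.
        raise StreamConfigError(
            f"{stream_name} does not have a schema_title setting."
        )
    return settings["schema_title"]
```

Under these assumptions, a stream config that is empty as a whole would produce the first error, not the second. The second error needs an entry for the stream that lacks schema_title.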
I just checked the cached stream configs for each of the eqiad eventgate-analytics-external pods, and all the streams have schema_titles. For this error to happen, an eventgate-analytics-external pod in eqiad must have had, at 2020-10-25T14:30:05.933, cached stream config for these 3 streams without a schema_title setting. I don't know how this could happen!
@JAllemandou suggested that perhaps a failed HTTP request to the MW EventStreamConfig API is being cached in the eventgate service as an empty object. I just checked the eventgate-wikimedia code, and I don't see how this could happen. The HTTP request is ultimately made by preq, which should throw an HTTPError on any HTTP response with status >= 400. Errors are not caught by the caching code, so an error should propagate up to eventgate, be logged, and not be cached. Also, preq uses requestretry with a default maxAttempts of 2, so it should be retrying once on any 5xx response or network error.
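The behaviour I expect from the description above can be sketched roughly as follows. This is an illustrative Python model, not the preq, requestretry, or eventgate-wikimedia code: HTTPError, request_with_retry, and RefreshingCache are hypothetical names, and keeping the previous cached value when a refresh fails is an assumption of the sketch.

```python
class HTTPError(Exception):
    """Hypothetical stand-in for the error thrown on a status >= 400."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def request_with_retry(do_request, max_attempts=2):
    """Call do_request() -> (status, body), retrying on 5xx or network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = do_request()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the network error propagate
            continue
        if status >= 500 and attempt < max_attempts:
            continue  # retry server errors while attempts remain
        if status >= 400:
            raise HTTPError(status)  # never return an error response as data
        return body


class RefreshingCache:
    """Caches the result of fetch(); does not catch errors from fetch()."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._value = None

    def refresh(self):
        # No try/except here: if fetch() raises, the assignment never runs,
        # so a failed request is not stored as an empty value.
        self._value = self._fetch()
        return self._value

    def get(self):
        return self._value
```

In this model a failed request can only surface as an exception from refresh(), never as an empty object in the cache.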
I'm adding some tests and extra logging to eventgate-wikimedia, but all signs point to a failed HTTP EventStreamConfig API request not being the cause of this.