EventGate is requesting the config for too many streams at once. By default, the MediaWiki Action API will limit the number of values for a multi-valued parameter to 50. The streamconfigs API doesn't alter that limit.
See this comment from @phuedx for a more detailed analysis.
See Event_Platform/EventGate_occasionally_fails_to_ingest_specific_schemas for a summary and post mortem.
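One way to stay under the limit is to split the stream names into batches of at most 50 and issue one request per batch. The following is a minimal sketch of that idea, not EventGate's actual implementation: the helper names `chunk` and `build_streamconfigs_queries` are hypothetical, and the query parameters (`action=streamconfigs`, `streams` as a pipe-separated list) are assumed from the usual Action API conventions.

```python
from typing import Iterator, List
from urllib.parse import urlencode

# Default Action API limit on values in a multi-valued parameter (per the text above).
MAX_VALUES_PER_REQUEST = 50


def chunk(names: List[str], size: int = MAX_VALUES_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `names`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


def build_streamconfigs_queries(
    stream_names: List[str], size: int = MAX_VALUES_PER_REQUEST
) -> List[str]:
    """Build one query string per batch (hypothetical helper; parameter names assumed)."""
    return [
        urlencode({
            "action": "streamconfigs",
            "format": "json",
            # Multi-valued Action API parameters are pipe-separated.
            "streams": "|".join(batch),
        })
        for batch in chunk(stream_names, size)
    ]
```

With 120 stream names this yields three query strings carrying 50, 50 and 20 names, so no single request exceeds the limit.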
2023-11-07 update
- stream config fetching has been refactored and simplified. Hopefully this fixes a subtle race condition bug.
- stream config fetches are not retried.
- It looks like some schema fetches still fail for reasons similar to the stream config fetch failures. We might want to look into retrying schema fetches (not sure why the Node.js HTTP client doesn't already do this?)
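The retry idea from the last point could look roughly like the sketch below. It is a generic retry-with-exponential-backoff wrapper, not code from EventGate: `fetch_with_retry` and its parameters are hypothetical, and it assumes transient failures surface as `OSError` subclasses (such as `ConnectionError` or `TimeoutError`).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` up to `attempts` times, doubling the delay after each failure.

    Only OSError (which includes ConnectionError and TimeoutError) is retried;
    any other exception propagates immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without actually waiting.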
Success criteria
- Merge and deploy @phuedx's EventStreamConfig patch. eventgate-analytics-external supports (mostly) legacy instrumentation, so hopefully its config is not expected to grow too much.
- Add alerting on HTTP error status codes for all eventgate instances.
- Document this incident with a post mortem and impact analysis.
- bug in eventgate-wikimedia stream config is fixed.
Original Bug report
Every week we receive several alerts with error messages similar to this:
2022-12-26T16:00:14.224 ERROR ProduceCanaryEvents Some canary events failed to be produced:
POST https://eventgate-analytics-external.svc.eqiad.wmnet:4692/v1/events => BasicHttpResult(failure) status: 500 message: Internal Server Error. Response body:
{"invalid":[],"error":[{"status":"error","event":{"$schema":"/analytics/legacy/quicksurveyinitiation/1.1.0","client_dt":"2020-04-02T19:11:20.942Z","dt":"2020-04-02T19:11:20.942Z","event":{"surveyCodeName":"perceived-performance-survey"},"meta":{"id":"b0caf18d-6c7f-4403-947d-2712bbe28610","stream":"eventlogging_QuickSurveyInitiation","domain":"canary","dt":"2022-12-26T16:00:14.217Z","request_id":"d13a220c-cd07-4397-8777-477622cb64fb"},"schema":"QuickSurveyInitiation","http":{"request_headers":{"user-agent":"Apache-HttpClient/4.5.12 (Java/1.8.0_342)"},"client_ip":"127.0.0.1"}},"context":{"message":"event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/quicksurveyinitiation/1.1.0 destined to stream eventlogging_QuickSurveyInitiation is not allowed in stream; eventlogging_QuickSurveyInitiation is configured but does not have any settings."}}]}
It seems the job is having problems collecting the config settings for a given stream at a time.
The failing stream is not the same each time; it is a different one on every occurrence, which indicates it's probably a connection issue.
It's also not critical: the job runs (IIRC) every 15 minutes, so each stream will still produce several canary events per hour, even if it sees 1 or 2 failures per hour.
Nevertheless we should fix this to clear the alert space :-)
Maybe we can wait until this job is migrated to Airflow?
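To check the observation that a different stream fails each time, the response bodies from the alerts could be tallied by stream name. The snippet below is a rough sketch under assumptions: `count_failed_streams` is a hypothetical helper, and it relies only on the response shape visible in the log above (an `error` list whose entries carry `event.meta.stream`).

```python
import json
from collections import Counter
from typing import Iterable


def count_failed_streams(response_bodies: Iterable[str]) -> Counter:
    """Count errored events per stream across several EventGate response bodies."""
    counts: Counter = Counter()
    for body in response_bodies:
        payload = json.loads(body)
        for entry in payload.get("error", []):
            # Fall back to a placeholder if the event carries no stream name.
            stream = entry.get("event", {}).get("meta", {}).get("stream", "<unknown>")
            counts[stream] += 1
    return counts
```

If the failures really are spread evenly across streams, the resulting counts should be roughly flat rather than dominated by one stream name.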