Page MenuHomePhabricator

[Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors
Closed, ResolvedPublic

Description

EventGate is requesting the config for too many streams at once. By default, the MediaWiki Action API will limit the number of values for a multi-valued parameter to 50. The streamconfigs API doesn't alter that limit.

See this comment from @phuedx for a more detailed analysis.

See Event_Platform/EventGate_occasionally_fails_to_ingest_specific_schemas for a summary and post mortem.

2023-11-07 update

  • stream config fetching has been refactored and simplified. Hopefully this fixes a subtle race condition bug.
  • stream config fetches are not retried.
  • It looks like some schema fetches still fail for similar reasons as stream config fetching. We might want to look into retrying schema fetching (not sure why this doesn't just happen with nodejs http client?)
Success criteria
  • Merge and deploy @phuedx EventStreamConfig patch. eventgate-analytics-external supports (mostly) legacy instrumentation, and hopefully its config is not expected to grow too much.
  • Add alerting on HTTP error status codes for all eventgate instances.
  • Document this incident with a post mortem and impact analysis.
  • bug in eventgate-wikimedia stream config is fixed.

Original Bug report

Every week we receive several alerts with error messages similar to this:

2022-12-26T16:00:14.224 ERROR ProduceCanaryEvents Some canary events failed to be produced:
POST https://eventgate-analytics-external.svc.eqiad.wmnet:4692/v1/events => BasicHttpResult(failure)  status: 500 message: Internal Server Error. Response body:
  {"invalid":[],"error":[{"status":"error","event":{"$schema":"/analytics/legacy/quicksurveyinitiation/1.1.0","client_dt":"2020-04-02T19:11:20.942Z","dt":"2020-04-02T19:11:20.942Z","event":{"surveyCodeName":"perceived-performance-survey"},"meta":{"id":"b0caf18d-6c7f-4403-947d-2712bbe28610","stream":"eventlogging_QuickSurveyInitiation","domain":"canary","dt":"2022-12-26T16:00:14.217Z","request_id":"d13a220c-cd07-4397-8777-477622cb64fb"},"schema":"QuickSurveyInitiation","http":{"request_headers":{"user-agent":"Apache-HttpClient/4.5.12 (Java/1.8.0_342)"},"client_ip":"127.0.0.1"}},"context":{"message":"event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/quicksurveyinitiation/1.1.0 destined to stream eventlogging_QuickSurveyInitiation is not allowed in stream; eventlogging_QuickSurveyInitiation is configured but does not have any settings."}}]}

It seems the job is having problems collecting the config settings for a given stream at a time.
It doesn't repeat the stream, it's a different one every time. Which indicates it's probably a connection issue.
It's also not critical, since the job runs IIRC every 15 minutes, so each stream will still produce several canary events per hour, even if it sees 1 or 2 failures per hour).
Nevertheless we should fix this to clear the alert space :-)
Maybe, we can wait till we migrate this job to Airflow?

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Following up after some investigation. I think we have two problems:

  • an issue with eventgate's config refresh system.
  • lack of alerting for eventgate-analytics-external.

@phuedx thanks for the patch and for providing context!

That said, in the case where there's a cache miss, EventGate should just fetch the stream config for the requested stream. Hrrm… Could there be a race condition here?

That could explain why the config lookup does not necessarily _always_ fail. There are cases where canary events are produced through this service (which should mean the target stream config has been resolved).

Next steps

Unless there's any objection I'd like to do the following since this phab has been open for a long time:

  • Merge and deploy the EventStreamConfig patch. eventgate-analytics-external supports (mostly) legacy instrumentation, and hopefully its config is not expected to grow too much.
  • Add alerting on HTTP error status codes for all eventgate instances.
  • Document this incident with a post mortem and impact analysis.

As follow up work (beyond the scope of this phab):

@lbowmaker @mforns thoughts?

Next steps

  • Add alerting on HTTP error status codes for all eventgate instances.

Do you mean HTTP error status codes that EventGate is emitting or that the MediaWiki API is emitting? If the latter, then you'll need to consider that the MediaWiki API tends to return HTTP 200 OK even when an error has occurred. See https://www.mediawiki.org/wiki/API:Errors_and_warnings for more detail.

Change 957949 merged by jenkins-bot:

[mediawiki/extensions/EventStreamConfig@master] API: Increase streams param values limit to 250

https://gerrit.wikimedia.org/r/957949

Change 959039 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] data-engineering: eventgate: standardize alerts

https://gerrit.wikimedia.org/r/959039

Do you mean HTTP error status codes that EventGate is emitting or that the MediaWiki API is emitting? If the latter, then you'll need to consider that the MediaWiki API tends to return HTTP 200 OK even when an error has occurred. See https://www.mediawiki.org/wiki/API:Errors_and_warnings for more detail.

HTTP error status codes that EventGate is emitting. We have monitoring in place, but alerts are only fired for the eventage-logging-external instance.

Merge and deploy @phuedx EventStreamConfig patch

Heads up: The patch should be live on metawiki by 2100 UTC tomorrow.

Merge and deploy @phuedx EventStreamConfig patch

Heads up: The patch should be live on metawiki by 2100 UTC tomorrow.

Ack.

We should fix:
eventgate-wikimedia's recaching should not fail in this way

I'm claiming this task. Will try to understand and capture TODOs in task description...

FWIW failures to fetch stream configs has disappeared since @phuedx patch was rolled out:

https://logstash.wikimedia.org/goto/3051979e2e2d4790c2975c3403282d72

I wonder if this is an async race bug introduced by https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/643519/.

See my comment from that patch.

The code uses setInterval to call a function that flushes all cached configs every stream_config_ttl seconds, and then recaches them. This is where the huge param list was coming from.

This line asks for the relevant stream config settings, and below is where the message @JAllemandou posted is logged.

(mediawiki.client.session_tick is a pretty high volume stream, which is why there are so many of these errors in a second when this happens. There are other streams that this happens to.

Perhaps there is a race condition here? Hm, does the call to precacheStreamConfigs needs to be awaited inside of the setInterval function?

Perhaps there is a race condition here? Hm, does the call to precacheStreamConfigs needs to be awaited inside of the setInterval function?

The cache is flushed before the asynchronous request for the data to warm it is made. Because that request is non-blocking, execution continues with the cache empty. While the request for the data to warm the cache is ongoing, the next N requests will cause more asynchronous requests for specific stream configs to be made, e.g. see https://logstash.wikimedia.org/goto/c2e6c5a59aeafc0e9723520e191f774b. I'm not sure if this is the same/the cause of the bug that we're seeing though…

I suggest updating precacheStreamConfigs() to accept an optional second parameter called flushAll, which, when truthy, causes the cache to be flushed immediately before the cache is updated:

const precacheStreamConfigs = async (streams, flushAll) => {
  const streamConfigs = await getConfigsForStreams(streams);
  if (!!flushAll) {
    cachedStreamConfigs.flushAll();
  }
  _.forEach(streamConfigs, (streamConfig, stream) => {
    cachedStreamConfigs.set(stream, streamConfig);
  });
};

and updating the function that flushes/refreshes the cache to something like:

if (options.stream_config_recache_on_expire) {
  logger.info(`Recaching stream configs for ${streams}.`);
  precacheStreamConfigs(streams, true);
} else {
  cachedStreamConfigs.flushAll();
}

Thanks Sam! Also wondering how setInterval is working with the non-awaited async call to precacheStreamConfigs right now. Maybe https://www.npmjs.com/package/set-interval-async might help? Not sure.

I'm trying to write some tests for this before I change code, but I'm not quite sure how to do that yet.


Thinking about T346899: eventgate: cache refreshes should fetch stream configs in batches, I wonder if it might be better to just change the way we lookup stream 'dynamic' configs for eventgate-analytics-external.

The way it works now: stream_config_uri is a template and eventgate-wikimedia fetches the config for specific streams on demand.

What if instead: we dropped stream_config_is_dyanmic and stream_config_uri template support, and instead just offered periodic recaching of all defined stream configs. That is: every stream_config_ttl, clear all and request all stream configs with desired constraints, e.g. api.php?action=streamconfigs&constraints=destination_event_service=eventgate-analytics-external.

This would mean that if a new stream is added, it would not be known by eventgate for at the most stream_config_ttl seconds later. Streams are not added that often. stream_config_ttl: 60 would probably be more than fine.

Ottomata renamed this task from EventGate occasionally fails to ingest specific schemas to eventgate-wikimedia occasionally fails to produce events due to stream config errors.Oct 18 2023, 2:27 PM
Ottomata updated the task description. (Show Details)
Ottomata added a subscriber: Krinkle.

Also wondering how setInterval is working with the non-awaited async call to precacheStreamConfigs right now.

The precacheStreamConfigs() function will initiate the HTTP request and return. The precacheStreamConfigs() call is the last line to be executed in that branch, so the setInterval-ized function will also return. The event loop will continue on.

Maybe https://www.npmjs.com/package/set-interval-async might help? Not sure.

I don't think that the above is an issue/we need to use that package to make sure that the setInterval-ized function is being executed at standard intervals as the expected execution time (milliseconds) is significantly shorter than the interval (5 minutes).


Thinking about T346899: eventgate: cache refreshes should fetch stream configs in batches, I wonder if it might be better to just change the way we lookup stream 'dynamic' configs for eventgate-analytics-external.

What if instead: we dropped stream_config_is_dyanmic and stream_config_uri template support, and instead just offered periodic recaching of all defined stream configs. That is: every stream_config_ttl, clear all and request all stream configs with desired constraints, e.g. api.php?action=streamconfigs&constraints=destination_event_service=eventgate-analytics-external.

I like the simplicity of this proposal.

Change 966922 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] [WIP] - Remove support for dynamic stream config

https://gerrit.wikimedia.org/r/966922

I like the simplicity of this proposal.

Done in https://gerrit.wikimedia.org/r/966922

This makes the code much simpler, but it does change some behavior so we should deploy this very carefully.

The one thing I'm not sure about is the loss of caching of regex matched stream names. This might matter for eventgate-main, where IIRC this feature is used by EventBus+JobQueue code to produce mediawiki\.job\..* events to undetermined stream names. Since we aren't caching the matched stream names anymore, the regexes will be evaluated for every event.

I might be able to add caching of matched stream names, just not sure if we should.

I might be able to add caching of matched stream names, just not sure if we should.

Added this in the latest patch.

Ahoelzl renamed this task from eventgate-wikimedia occasionally fails to produce events due to stream config errors to [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors.Oct 20 2023, 4:50 PM

Change 965662 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Retry produceCanaryEvents

https://gerrit.wikimedia.org/r/965662

Change 966922 merged by jenkins-bot:

[eventgate-wikimedia@master] Remove support for dynamic stream config

https://gerrit.wikimedia.org/r/966922

Deployed the above change for eventgate in beta.

Will watch this over the next few days before moving to production.

Change 968710 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Fix bug where stream config is being fetched constantly!

https://gerrit.wikimedia.org/r/968710

Change 968710 merged by jenkins-bot:

[eventgate-wikimedia@master] Fix bug where stream config is being fetched constantly!

https://gerrit.wikimedia.org/r/968710

just fyi there have been 10 FAIL: produce_canary_events in the last 8 hours - 7 of them in the last hour

Thank you, looking. Possibly related to my https://phabricator.wikimedia.org/T347477#9298947, I deployed eventgate-analytics-external with upgraded NodeJS today. Looking.

I just ran a manual produce_canary_events job to see output:

sudo -u analytics kerberos-run-command analytics produce_canary_events
...
2023-11-01T19:47:29.359 INFO ProduceCanaryEvents Succeeded producing canary events to  153 / 153 streams.

So, it is still working sometimes, as long as it runs successfully once for each stream in an hour we are okay. I'm trying to push along the eventgate nodejs upgrade so we can deploy a fix for this very bug!

Change 972046 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-* - bump image and remove now defunct stream config settings

https://gerrit.wikimedia.org/r/972046

Change 972046 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-* - bump image and remove now defunct stream config settings

https://gerrit.wikimedia.org/r/972046

Mentioned in SAL (#wikimedia-operations) [2023-11-06T21:27:14Z] <ottomata> eventgate-analytics-external - deploy change to remove 'dynamic' stream config support, instead just re-cache stream configs every 60s - https://phabricator.wikimedia.org/T326002

Alright, just deployed what we are hoping is a 'fix' to this problem.

eventgate-analytics-external no longer supports 'dynamic' stream config lookup at runtime, instead it just refetches its stream configs every 60 seconds.

Let's keep an aye on on logs:

https://logstash.wikimedia.org/goto/343c44eee5b2e0610f340f28bbe32696

If all goes well we shouldn't see any more of these "is configured but does not have any settings" errors.

Docs about eventgate-analytics-external updated.

Change 972419 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - set stream_config_retries to 3

https://gerrit.wikimedia.org/r/972419

Change 972419 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - set stream_config_retries to 3

https://gerrit.wikimedia.org/r/972419

Ottomata renamed this task from [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors to [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors.Nov 7 2023, 5:20 PM
Ottomata updated the task description. (Show Details)
Ottomata renamed this task from [Event Platform] eventgate-wikimedia occasionally fails to produce events due to remote HTTP fetch errors to [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors.Nov 7 2023, 6:19 PM
Ottomata updated the task description. (Show Details)
Ottomata updated the task description. (Show Details)

The stream config fetch errors seem to have been fixed.

I still see many 'Failed loading schema at', which now that I am looking into it, is perhaps is a symptom of the bug.

For the sake of context digestion, I've created T350713 to work on those.

I think we can resolve this ticket.

I wonder if we should add retries to the envoy proxy to mw-api-int-async? This would affect all services that use that listener proxy though. @Joe ?

Change 959039 merged by jenkins-bot:

[operations/alerts@master] data-engineering: eventgate: standardize alerts

https://gerrit.wikimedia.org/r/959039

Change 973835 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] envoy.yaml - Add retries for mw-api-int-async-ro

https://gerrit.wikimedia.org/r/973835

Change 973836 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-* - use mw-api-int-async-ro for EventStreamConfig

https://gerrit.wikimedia.org/r/973836

Change 973835 merged by Ottomata:

[operations/puppet@production] envoy.yaml - Add retries for mw-api-int-async-ro

https://gerrit.wikimedia.org/r/973835

Change 973836 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-* - use mw-api-int-async-ro for EventStreamConfig

https://gerrit.wikimedia.org/r/973836

Mentioned in SAL (#wikimedia-operations) [2023-11-13T17:05:13Z] <ottomata> deploying eventgates to pick up change to use mw-api-int-async-ro with retries - T326002

Change 973846 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventstreams* - use mw-api-int-async-ro for EventStreamConfig:

https://gerrit.wikimedia.org/r/973846

Change 973846 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams* - use mw-api-int-async-ro for EventStreamConfig:

https://gerrit.wikimedia.org/r/973846

Change 965662 merged by Ottomata:

[analytics/refinery/source@master] Retry produceCanaryEvents

https://gerrit.wikimedia.org/r/965662

Change 974600 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] canary_events - bump refinery-job to version to pick up retry logic

https://gerrit.wikimedia.org/r/974600

Change 974600 merged by Ottomata:

[operations/puppet@production] canary_events - bump refinery-job to version to pick up retry logic

https://gerrit.wikimedia.org/r/974600

We should file an incident report and have a little write up about this, and publicize it with users of eventgate-analytics-external.

Reopening to track.

See: https://logstash.wikimedia.org/goto/3bb4d1ceaf5233c07581d3ddc3fcca5a

We should do a little analysis to see if we can quantify how many events we lost over the time period when this bug affected eventgate-analytics-external.