Page MenuHomePhabricator

eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventStreamConfig API
Closed, ResolvedPublic

Description

From an email I wrote in response to an alert:

Oct 25 14:30:05 an-launcher1002 produce_canary_events[3123]: 2020-10-25T14:30:05.933 ERROR ProduceCanaryEvents Some canary events failed to be produced
Oct 25 14:30:05 an-launcher1002 produce_canary_events[3123]: POST https://eventgate-analytics-external.svc.eqiad.wmnet:4692/v1/events =>

With the a 207 response from EventGate, for which canary events POSTed to 3 streams, eventlogging_SearchSatisfaction, test.instrumentation, and ios.edit_history_compare, failed due to: e.g. 'ios.edit_history_compare does not have a schema_title setting.'

eventgate-analytics-external is the only EventGate deployment that uses 'dynamic stream config', in that it refreshes its stream config from the MW API source every 5 minutes. I checked logs, and I can't see any failed EventStreamConfig MW API requests in any of the eventgate-analytics-external pods, but even if it did fail, I don't see how this 'does not have a schema_title setting' error could result. That only happens if the stream IS configured but it does not have a schema_title setting. I don't see how this could happen intermittently like this! Sure, stream config could be misconfigured, or HTTP API requests could fail, but how could an incorrect stream config appear and then disappear?

I just checked the cached stream configs for each of the eqiad eventgate-analytics-external pods and all the streams have schema_titles. In order for this error to happen, an eventgate-analytics-external pod in eqiad at 2020-10-25T14:30:05.933 had cached stream config for these 3 streams without a schema_title setting. I don't know how this could happen!

@JAllemandou suggested that perhaps a failed http request to MW EventStreamConfig API is being cached in the eventgate service as an empty object. I just checked eventgate-wikimedia code, and I don't see how this could happen. The HTTP request is ultimately made by preq, which should throw an HTTPError on any HTTP response >= 400. Errors are not caught by the caching code, so it should be thrown up to eventgate and logged and not cached. Also, preq uses requestretry with a default maxAttempts of 2, so this should be retrying once on any 5xx or network errors.

I'm adding some tests and extra logging to eventgate-wikimedia, but all signs point to a failed HTTP EventStreamConfig API request to not be the cause of this.

Event Timeline

Change 636678 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Add tests and logging around fetching dynamic stream config

https://gerrit.wikimedia.org/r/636678

Change 636678 merged by Ottomata:
[eventgate-wikimedia@master] Add tests and logging around fetching dynamic stream config

https://gerrit.wikimedia.org/r/636678

Change 636722 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - bump to image 2020-10-27-173311-production

https://gerrit.wikimedia.org/r/636722

Change 636722 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics-external - bump to image 2020-10-27-173311-production

https://gerrit.wikimedia.org/r/636722

fdans moved this task from Incoming to Event Platform on the Analytics board.

Change 638116 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Produce canary events every 15 minutes

https://gerrit.wikimedia.org/r/638116

Change 638116 merged by Ottomata:
[operations/puppet@production] Produce canary events every 15 minutes

https://gerrit.wikimedia.org/r/638116

Ottomata renamed this task from eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventtStreamConfig API to eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventStreamConfig API.Nov 2 2020, 7:27 PM

Change 641256 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Throw UnauthorizedSchemaForStreamError with specific message on empty stream config

https://gerrit.wikimedia.org/r/641256

Change 641256 merged by Ottomata:
[eventgate-wikimedia@master] Throw UnauthorizedSchemaForStreamError with specific message on empty stream config

https://gerrit.wikimedia.org/r/641256

Change 641566 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Make dynamic stream-config aware of MW action API response

https://gerrit.wikimedia.org/r/641566

Change 641566 merged by Ottomata:
[eventgate-wikimedia@master] Make dynamic stream-config aware of MW action API response

https://gerrit.wikimedia.org/r/641566

Change 641745 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analyitcs-external - bump to 2020-11-18-143227-production

https://gerrit.wikimedia.org/r/641745

Change 641745 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analyitcs-external - bump to 2020-11-18-143227-production

https://gerrit.wikimedia.org/r/641745

Parking some new log messages:

[2020-11-19T16:28:28.490Z]  WARN: eventgate-wikimedia/141 on eventgate-analytics-external-production-74bc7c7ddb-m8l7c: Got undefined stream config for stream eventlogging_NewcomerTask when fetching dynamic stream config from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=eventlogging_NewcomerTask.
[2020-11-19T16:28:28.493Z]  WARN: eventgate-wikimedia/141 on eventgate-analytics-external-production-74bc7c7ddb-m8l7c: Got undefined stream config for stream test.instrumentation when fetching dynamic stream config from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=test.instrumentation.
...
[2020-11-19T16:30:03.848Z] ERROR: eventgate-wikimedia/141 on eventgate-analytics-external-production-74bc7c7ddb-m8l7c: event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/newcomertask/1.0.0 destined to stream eventlogging_NewcomerTask encountered an error: event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/newcomertask/1.0.0 destined to stream eventlogging_NewcomerTask is not allowed in stream; eventlogging_NewcomerTask is configured but does not have any settings.
    Error: event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/newcomertask/1.0.0 destined to stream eventlogging_NewcomerTask is not allowed in stream; eventlogging_NewcomerTask is configured but does not have any settings.
        at /srv/service/eventgate-wikimedia.js:740:19

[2020-11-19T16:30:03.849Z] ERROR: eventgate-wikimedia/141 on eventgate-analytics-external-production-74bc7c7ddb-m8l7c: event a1d0e22c-ee9f-4e8c-894b-ef6b33095eb6 of schema at /analytics/test/1.1.0 destined to stream test.instrumentation encountered an error: event a1d0e22c-ee9f-4e8c-894b-ef6b33095eb6 of schema at /analytics/test/1.1.0 destined to stream test.instrumentation is not allowed in stream; test.instrumentation is configured but does not have any settings.
    Error: event a1d0e22c-ee9f-4e8c-894b-ef6b33095eb6 of schema at /analytics/test/1.1.0 destined to stream test.instrumentation is not allowed in stream; test.instrumentation is configured but does not have any settings.

I didn't get anything out of MW Action API error or warnings in response body. I guess I'll try to add more logging.

For reference, eventgate-analytics-external-production-74bc7c7ddb-m8l7c is running on kubernetes1008.eqiad.wmnet. In case this happens again and there is some pattern.

Change 642074 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Log the original stream config result when it is undefined for debugging

https://gerrit.wikimedia.org/r/642074

Change 642074 merged by Ottomata:
[eventgate-wikimedia@master] Log the original stream config result when it is undefined for debugging

https://gerrit.wikimedia.org/r/642074

Change 642082 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - bump to 2020-11-19-183059-production

https://gerrit.wikimedia.org/r/642082

Change 642082 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics-external - bump to 2020-11-19-183059-production

https://gerrit.wikimedia.org/r/642082

Ok, got some pretty whacky logs finally.

[2020-11-23T15:40:05.985Z]  INFO: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Loading dynamic stream config for stream ios.edit_history_compare from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=ios.edit_history_compare
[2020-11-23T15:40:10.987Z]  INFO: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Loading dynamic stream config for stream test.instrumentation from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=test.instrumentation
[2020-11-23T15:40:10.989Z]  INFO: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Loading dynamic stream config for stream eventlogging_Test from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=eventlogging_Test
[2020-11-23T15:40:10.990Z]  INFO: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Loading dynamic stream config for stream test.instrumentation.sampled from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=test.instrumentation.sampled
[2020-11-23T15:40:10.990Z]  INFO: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Loading dynamic stream config for stream eventlogging_NewcomerTask from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=eventlogging_NewcomerTask
[2020-11-23T15:40:16.255Z]  WARN: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Got undefined stream config for stream test.instrumentation when fetching dynamic stream config from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=test.instrumentation.
    stream_configs_result: {
      "streams": {
        "eventlogging_NewcomerTask": {
          "stream": "eventlogging_NewcomerTask",
          "schema_title": "analytics/legacy/newcomertask",
          "topic_prefixes": null,
          "canary_events_enabled": true,
          "destination_event_service": "eventgate-analytics-external",
          "topics": [
            "eventlogging_NewcomerTask"
          ]
        }
      }
    }
[2020-11-23T15:40:16.258Z]  WARN: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Got undefined stream config for stream test.instrumentation.sampled when fetching dynamic stream config from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=test.instrumentation.sampled.
    stream_configs_result: {
      "streams": {
        "eventlogging_NewcomerTask": {
          "stream": "eventlogging_NewcomerTask",
          "schema_title": "analytics/legacy/newcomertask",
          "topic_prefixes": null,
          "canary_events_enabled": true,
          "destination_event_service": "eventgate-analytics-external",
          "topics": [
            "eventlogging_NewcomerTask"
          ]
        }
      }
    }
[2020-11-23T15:40:16.263Z]  WARN: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: Got undefined stream config for stream eventlogging_Test when fetching dynamic stream config from https://api-ro.discovery.wmnet/w/api.php?format=json&action=streamconfigs&all_settings=true&constraints=destination_event_service=eventgate-analytics-external&streams=eventlogging_Test.
    stream_configs_result: {
      "streams": {
        "eventlogging_NewcomerTask": {
          "stream": "eventlogging_NewcomerTask",
          "schema_title": "analytics/legacy/newcomertask",
          "topic_prefixes": null,
          "canary_events_enabled": true,
          "destination_event_service": "eventgate-analytics-external",
          "topics": [
            "eventlogging_NewcomerTask"
          ]
        }
      }
    }

Then a few minutes later when the produce_canary_events job runs:

[2020-11-23T15:45:04.862Z] ERROR: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/test/1.2.0 destined to stream eventlogging_Test encountered an error: event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/test/1.2.0 destined to stream eventlogging_Test is not allowed in stream; eventlogging_Test is configured but does not have any settings.
    Error: event b0caf18d-6c7f-4403-947d-2712bbe28610 of schema at /analytics/legacy/test/1.2.0 destined to stream eventlogging_Test is not allowed in stream; eventlogging_Test is configured but does not have any settings.
        at /srv/service/eventgate-wikimedia.js:740:19
[2020-11-23T15:45:04.862Z] ERROR: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: event ee0e2c3d-027c-490d-93ab-46ac3201487d of schema at /analytics/test/1.1.0 destined to stream test.instrumentation encountered an error: event ee0e2c3d-027c-490d-93ab-46ac3201487d of schema at /analytics/test/1.1.0 destined to stream test.instrumentation is not allowed in stream; test.instrumentation is configured but does not have any settings.
    Error: event ee0e2c3d-027c-490d-93ab-46ac3201487d of schema at /analytics/test/1.1.0 destined to stream test.instrumentation is not allowed in stream; test.instrumentation is configured but does not have any settings.
        at /srv/service/eventgate-wikimedia.js:740:19
[2020-11-23T15:45:04.862Z] ERROR: eventgate-wikimedia/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: event f17d3eac-d3ef-4d11-b306-48e000751a64 of schema at /analytics/test/1.1.0 destined to stream test.instrumentation.sampled encountered an error: event f17d3eac-d3ef-4d11-b306-48e000751a64 of schema at /analytics/test/1.1.0 destined to stream test.instrumentation.sampled is not allowed in stream; test.instrumentation.sampled is configured but does not have any settings.
    Error: event f17d3eac-d3ef-4d11-b306-48e000751a64 of schema at /analytics/test/1.1.0 destined to stream test.instrumentation.sampled is not allowed in stream; test.instrumentation.sampled is configured but does not have any settings.
        at /srv/service/eventgate-wikimedia.js:740:19
[2020-11-23T15:45:04.979Z]  WARN: eventgate-analytics-external/141 on eventgate-analytics-external-production-cb9949dbf-wjphv: 6 out of 9 events were accepted, but 3 failed (0 invalid and 3 errored). (levelPath=warn/events, request_id=09c55fb6-7a19-466f-9b69-90970d661ba4)

Here, I'm logging the actual result body given back from the EventStreamConfig API. Or, at least what I'm getting back via the node http client I'm using (preq). This isn't an issue with NodeCache or my CachedMap wrapper of it. The function I'm logging this in is the function that fetches the result from the API, not the higher level function that gets it from cache OR then calls the fetch function.

In this occurrence, somehow the stream config for eventlogging_NewcomerTask is being returned when the config for e.g. test.instrumentation is being requested. Notably, the requests to EventStreamConfig API are going out at 15:40:10, but the warning logs about undefined stream config are being logged at 15:40:16, 6 seconds after the request is sent.

Still at a loss here. Either my code or preq is crossing its wires and handing me back results from async function calls, or EventStreamConfig is doing totally weird stuff when it receives a bunch of requests all at once. Neither of those seem very likely!

Change 643519 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] [WIP] Refactor stream-configs.js to support fetching multiple stream configs

https://gerrit.wikimedia.org/r/643519

Change 643519 merged by Ottomata:
[eventgate-wikimedia@master] Refactor stream-configs.js to support fetching multiple stream configs

https://gerrit.wikimedia.org/r/643519

Change 644912 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - use refactored batch based stream config

https://gerrit.wikimedia.org/r/644912

Change 644912 merged by jenkins-bot:
[operations/deployment-charts@master] eventgate-analytics-external - use refactored batch based stream config

https://gerrit.wikimedia.org/r/644912

Mentioned in SAL (#wikimedia-analytics) [2020-12-02T20:49:49Z] <ottomata> deployed eventgate-analytics-external with refactored stream config, hopefully this will work around the canary events alarm bug - T266573

Change 644922 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-{main,analytics,logging-external} - bump to 2020-12-02-151648-production

https://gerrit.wikimedia.org/r/644922

Change 644922 abandoned by Ottomata:
[operations/deployment-charts@master] eventgate-{main,analytics,logging-external} - bump to 2020-12-02-151648-production

Reason:
Will do individually.

https://gerrit.wikimedia.org/r/644922