
[EventGate] Failures when getting stream config from MediaWiki API
Open, HighPublic

Description

See explanation of the issue in the following email thread:


This issue seems to be the same as the one mentioned in last week's email alerts with a similar name.
See Andrew's comments from those below:

Hm, very strange. These seem to be occasionally failing because eventgate-analytics-external in codfw fails to look up stream config? Like:

  • produce_canary_events runs on an-launcher1002
  • looks up all streams to produce canary events for
  • iterates over them and produces a canary event one at a time for each stream to eventgate in both eqiad and codfw
  • eventgate in codfw receives the produce request, and looks up stream config for the event
  • This usually succeeds, but 5 times in the last 2 days eventgate got an error when fetching stream config: HTTPError: getaddrinfo ENOTFOUND api-ro.discovery.wmnet api-ro.discovery.wmnet:443

When this has happened, it only happened for 1 out of the 70+ requests produce_canary_events makes, and only when POSTing to codfw. It doesn't seem to be for any particular stream, although I do see a few repeats.
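The failure pattern described above — one failing POST out of 70+ per run — can be sketched roughly like this. The endpoints, stream names, and canary payload here are illustrative assumptions, not the actual refinery job:

```javascript
// Rough sketch of the produce_canary_events flow described above.
// Endpoint URLs and the payload shape are illustrative assumptions.
const DATACENTER_ENDPOINTS = [
  'https://eventgate.eqiad.example/v1/events', // hypothetical URLs
  'https://eventgate.codfw.example/v1/events',
];

function makeCanaryEvent(stream) {
  // Minimal canary payload; real events carry a full schema'd body.
  return { meta: { stream, domain: 'canary' } };
}

async function produceCanaryEvents(streams, postFn) {
  const failures = [];
  for (const stream of streams) {
    for (const endpoint of DATACENTER_ENDPOINTS) {
      try {
        await postFn(endpoint, makeCanaryEvent(stream));
      } catch (err) {
        // One failing POST out of many is exactly the pattern seen here.
        failures.push({ stream, endpoint, error: String(err) });
      }
    }
  }
  return failures;
}
```

With ~70 streams and two datacenters per run, a single transient DNS or routing error in one datacenter surfaces as exactly one failed produce attempt, matching the observed logs.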

Anyway, let's keep watching this; if it is still happening next week I'll file a task and investigate more.

FYI, to see these errors, run this on an-launcher1002:

grep -A 2 'ERROR ProduceCanaryEvents Some canary events failed to be produced' /var/log/refinery/produce_canary_events/produce_canary_events.log

I executed this convenient log grep (thanks!) and the error message is a bit different from last week:

HTTPError: connect EHOSTUNREACH 10.2.1.22:443

The IP 10.2.1.22 belongs to the MediaWiki API cluster (https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml#L87)
Maybe the MediaWiki API is throttling EventGate? Or maybe EventGate's cached configs for inactive schemas all expire at the same time, so it tries to re-fetch them all at once, because all canaries are sent together? Or maybe the MediaWiki API is simply unavailable for a couple of seconds; after looking at the EventGate code, I think we are not retrying when getting the StreamConfig, but I might be wrong. If so, we could add a retry?
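If a retry is indeed missing, a minimal sketch of retrying the stream config fetch with exponential backoff could look like this. The function and option names are assumptions for illustration, not EventGate's actual API:

```javascript
// Sketch: retry a stream-config fetch a few times with exponential backoff.
// `fetchFn` stands in for whatever HTTP call EventGate makes; the names
// here are assumptions, not EventGate's real code.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchStreamConfigWithRetry(fetchFn, { retries = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchFn();
    } catch (err) {
      lastError = err;
      // Transient errors like ENOTFOUND / EHOSTUNREACH are worth retrying;
      // back off 100ms, 200ms, 400ms, ... between attempts.
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

A short backoff like this would ride out an API cluster that is unreachable for only a couple of seconds.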

Also, there are a couple of errors of another kind:

Cannot read property 'eventlogging_TemplateDataEditor' of undefined
Cannot read property 'eventlogging_TranslationRecommendationUserAction' of undefined

Not sure what those are.
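(For the record, errors of that shape are the TypeError Node.js throws when code indexes into `undefined` — for instance, if a config object failed to load and was indexed anyway. A minimal illustration, with a hypothetical lookup function:)

```javascript
// Minimal repro of the error shape seen above: indexing into `undefined`.
// The lookup function and the failed-fetch scenario are hypothetical.
function lookupStreamSetting(streamConfigs, streamName) {
  // If an earlier fetch failed, `streamConfigs` may be undefined, and this
  // line throws. In older Node versions the message reads:
  //   TypeError: Cannot read property '<streamName>' of undefined
  return streamConfigs[streamName];
}
```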
I think my EventGate ops troubleshooting abilities end here.
Since the alert has recovered, and canaries seem fine, I will create a Phab task for this,
and pick it up again on Monday with Andrew's help :].

Event Timeline

mforns triaged this task as High priority.
mforns moved this task from Incoming to Operational Excellence on the Analytics board.

Looking a bit more into this, I just saw that there's a large number of requests to streamconfigs per hour:

select uri_query, count(1) as freq
from wmf_raw.webrequest
where year=2021 and month=7 and day=16 and hour=13
  and uri_query like '%streamconfigs%'
group by uri_query
order by freq desc
limit 1000;

uri_query	freq
?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external	955883
?action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external	91491
?format=json&formatversion=2&errorformat=plaintext&action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external	48063
?format=json&formatversion=2&errorformat=html&action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external	12535
?action=help&modules=streamconfigs	1
Time taken: 96.272 seconds, Fetched: 5 row(s)

If EventGate caching is working well, should we see 1M+ requests to streamconfigs per hour?
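For reference, a well-behaved per-process TTL cache would collapse repeated lookups for the same key into one upstream request per expiry window. A sketch (the cache shape and parameters are illustrative, not EventGate's implementation):

```javascript
// Sketch of a per-key TTL cache: repeated lookups within `ttlMs` hit the
// cache instead of the upstream API. Parameters are illustrative.
function makeTtlCache(fetchFn, ttlMs) {
  const entries = new Map(); // key -> { value, expiresAt }
  return async function get(key) {
    const now = Date.now();
    const hit = entries.get(key);
    if (hit && hit.expiresAt > now) return hit.value;
    const value = await fetchFn(key);
    entries.set(key, { value, expiresAt: now + ttlMs });
    return value;
  };
}
```

With caching like this in place, upstream request volume is bounded by (number of keys × processes) per TTL window, which is why 1M+ requests per hour would be surprising if they came from EventGate itself.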

> Looking a bit more into this I just saw that there's a big amount of requests to streamconfigs per hour:

These requests are from recent versions of the Android app (released since last month):

?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external 955883

These are from the iOS app:

?action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external 91491

And these are from older versions of the Android app:

?format=json&formatversion=2&errorformat=plaintext&action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external 48063
?format=json&formatversion=2&errorformat=html&action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external 12535

The Android app has between 900k and 1.1m daily active users, and fetches a fresh copy of stream configs on startup, so that looks like the level of traffic we'd expect to see over the course of a day, but it does seem rather high for one hour.
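Summing the four app-related rows from the query backs up the "high for one hour" impression — it is roughly a day's expected volume in a single hour:

```javascript
// Totaling the app-related streamconfigs rows from the Hive query above
// (hour 13 on 2021-07-16):
const counts = [955883, 91491, 48063, 12535];
const totalPerHour = counts.reduce((a, b) => a + b, 0); // 1107972

// ~1M daily active users fetching stream configs once per day would
// predict on the order of 1107972 / 24, i.e. roughly 46k requests per
// hour, not 1.1M in a single hour.
const impliedHourlyIfDaily = Math.round(totalPerHour / 24);
```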

Thanks a lot @Mholloway for the clarifications!
So, this volume is not related to the EventGate issues.
I wonder where the EventGate request logs are?
Will continue looking.
Cheers!