
eventgate-analytics k8s pods occasionally can't produce to kafka
Closed, Resolved · Public

Description

There seems to be a problem where occasionally a pod cannot produce to Kafka. I can't predict which pod will be affected, but it happens regularly enough during a chart upgrade that at least one pod usually is. If I delete the pod, a new one is spawned and usually works fine.

Event Timeline

Ottomata created this task. Mar 13 2019, 9:40 PM
Restricted Application added a project: Analytics. Mar 13 2019, 9:40 PM
Restricted Application added a subscriber: Aklapper.

@akosiaris let's try to figure this out tomorrow. :)

Change 496329 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Disable api-request eventgate logging for group0 wikis

https://gerrit.wikimedia.org/r/496329

Change 496329 merged by Ottomata:
[operations/mediawiki-config@master] Disable api-request eventgate logging for group0 wikis

https://gerrit.wikimedia.org/r/496329

Mentioned in SAL (#wikimedia-operations) [2019-03-13T22:00:08Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Disabling api-request logging to eventgate-analytics for group0 wikis until we solve T218268 (duration: 00m 56s)

Do we have logs of this happening?

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.03.13/eventgate?id=AWl4sxguNBo9dX1kfcii&_g=(refreshInterval:('$$hashKey':'object:1664',display:Off,pause:!f,section:0,value:0),time:(from:'2019-03-13T18:00:00.000Z',mode:absolute,to:'2019-03-13T21:00:00.000Z'))

{
  "_index": "logstash-2019.03.13",
  "_type": "eventgate",
  "_id": "AWl4sxguNBo9dX1kfcii",
  "_score": 1,
  "_source": {
    "err_errno": -1,
    "source_host": "10.64.64.93",
    "level": "ERROR",
    "err_code": -1,
    "pid": 140,
    "err_origin": "local",
    "type": "eventgate",
    "message": "event 32d56482-45cc-11e9-be6d-1418776134a1 of schema at /mediawiki/api/request/0.0.1 destined to stream mediawiki.api-request encountered an error: message timed out",
    "version": "1.0",
    "normalized_message": "event 32d56482-45cc-11e9-be6d-1418776134a1 of schema at /mediawiki/api/request/0.0.1 destined to stream mediawiki.api-request encountered an error: message timed out",
    "tags": [
      "input-gelf-12201",
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "err_message": "message timed out",
    "@timestamp": "2019-03-13T20:16:36.049Z",
    "err_name": "Error",
    "host": "eventgate-analytics-production-5d866bc9dd-nc2qk",
    "@version": "1",
    "gelf_level": "3",
    "err_stack": "Error: Local: Message timed out"
  },
  "fields": {
    "@timestamp": [
      1552508196049
    ]
  }
}

What is especially strange here is that this only happens on one(ish?) random pod at a time after an upgrade. However, the service won't be 'ready' until it has initiated its Kafka producer connection, so it must have been able to talk to Kafka on startup. Given that, a random pod network problem seems unlikely, unless the network stops working after service startup. Perhaps this is too much load, and the local producer queues fill up while waiting for Kafka to ACK messages? Possible, but I believe Petr and I saw this happen even when only the low-volume group0 wikis were enabled.

I'm going to poke around with some service upgrade/restarts and see if I can reproduce without any MW logging turned on.

I cannot rule out a networking issue, but it seems improbable; after all, the logs did make it to logstash. Also, the destination host is missing from the log, making it more difficult to double-check things.

Milimetric triaged this task as High priority. Mar 18 2019, 3:34 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.
Milimetric added a project: Analytics-Kanban.
Ottomata claimed this task. Mar 20 2019, 8:31 PM
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

I don't know much more, but I have a lot more data!

Here is a staging pod with trace logging enabled reproducing the problem:
https://logstash.wikimedia.org/goto/2c7aa00c769053662687249881871c91

The pod starts, and then I send the first POST at 2019-03-21T22:29:33. rdkafka then gets stuck in a loop of metadata requests and failures for about 1 minute 48 seconds. Logs here include:

Topic staging.test.event [0]: leader is down: re-query
kafka-jumbo1002.eqiad.wmnet:9092/1002: Sent MetadataRequest (v2, 67 bytes @ 0, CorrId 25)
kafka-jumbo1002.eqiad.wmnet:9092/1002: Received MetadataResponse (v2, 345 bytes, CorrId 25, rtt 57.27ms)
Topic staging.test.event partition 0 Leader 1003
kafka-jumbo1002.eqiad.wmnet:9092/1002: 1/1 requested topic(s) seen in metadata
Topic staging.test.event [0]: leader is down: re-query

...
and so on. It does this until 2019-03-21T22:31:20 when it logs

kafka-jumbo1003.eqiad.wmnet:9092/1003: failed: err: Local: Broker transport failure: (errno: Connection timed out)

From what I can tell, Local timeouts like this happen because the broker closes an idle connection. All of the open broker connections time out and are disconnected at about this time. Then we see reconnects for all the brokers at 2019-03-21T22:31:21:

kafka-jumbo1005.eqiad.wmnet:9092/1005: Connecting to ipv6#[2620:0:861:103:10:64:32:160]:9092 (plaintext) with socket 28

Once they all reconnect, rdkafka finally sends the produce request:

kafka-jumbo1003.eqiad.wmnet:9092/1003: Sent ProduceRequest (v3, 326 bytes @ 0, CorrId 2)
kafka-jumbo1003.eqiad.wmnet:9092/1003: staging.test.event [0]: Produce MessageSet with 1 message(s) (239 bytes, ApiVersion 3, MsgVersion 2)
kafka-jumbo1003.eqiad.wmnet:9092/1003: staging.test.event [0] 1 message(s) in xmit queue (1 added from partition queue)
kafka-jumbo1003.eqiad.wmnet:9092/1003: staging.test.event [0]: MessageSet with 1 message(s) delivered
kafka-jumbo1003.eqiad.wmnet:9092/1003: Received ProduceResponse (v3, 54 bytes, CorrId 2, rtt 94.33ms)
All 1 out of 1 events were accepted.

Now we just have to figure out why rdkafka sometimes gets stuck in this Topic staging.test.event [0]: leader is down: re-query loop for 1-2 minutes....

Ottomata moved this task from Backlog to In Progress on the EventBus board. Mar 22 2019, 3:46 PM
Ottomata added a comment (edited). Mar 22 2019, 5:23 PM

I ran tcpdump while reproducing this on kubestage1001 and the kafka brokers. I could see the looped metadata requests going through fine, but the topic partition leader on kafka-jumbo1003 did not receive any traffic between the initial producer connection and when the produce request was finally sent, about 2 minutes after the HTTP POST. Here are the tcpdump lines, 2 minutes apart, from kubestage1001:

17:05:39.985011 IP kubernetes-pod-10-64-75-18.eqiad.wmnet.57250 > kafka-jumbo1003.eqiad.wmnet.9092: Flags [.], ack 557, win 62, options [nop,nop,TS val 1666777933 ecr 4257697771], length 0
E..4M.@.?.~.
@K.
@.c..#...N...1D...>p......
cY.M..O.

17:07:50.725284 IP kubernetes-pod-10-64-75-18.eqiad.wmnet.57718 > kafka-jumbo1003.eqiad.wmnet.9092: Flags [S], seq 2270592047, win 29200, options [mss 1460,sackOK,TS val 1666810618 ecr 0,nop,wscale 9], length 0
E..<..@.?...
@K.
@.c.v#..V|/......r.p#.........
cY.........

Topic staging.test.event [0]: leader is down: re-query happens in the main rdkafka thread, while all the other metadata requests happen in either the broker or broker/bootstrap-specific threads. It's as if the main thread sees kafka-jumbo1003 as down and tells the bootstrap threads to re-query for leadership, assuming that Kafka will have rebalanced the leader. Since Kafka hasn't (there's nothing wrong with 1003), the partition leadership metadata returned is the same (1003 is still the leader), and the main thread is told to produce to 1003 again. It still sees 1003 as down (we don't know why). Hence the loop between the main thread's leader is down and the metadata re-query.
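As a toy model of the loop described above (NOT librdkafka's actual code; `produce_with_requery` and its parameters are invented for illustration): the metadata re-query keeps naming the same, actually-healthy leader, so nothing changes until the client's stale connection state is cleared by a reconnect.

```python
# Toy model of the "leader is down: re-query" loop. The main thread's
# view of the leader connection stays "down" until some external event
# (here: a reconnect on attempt `connection_up_at_attempt`) clears it;
# until then, every metadata re-query returns the same leader and the
# loop makes no progress.
def produce_with_requery(connection_up_at_attempt, max_attempts=100):
    for attempt in range(max_attempts):
        if attempt >= connection_up_at_attempt:
            return ("produced", attempt)  # ProduceRequest finally sent
        # "leader is down: re-query" -- metadata names the same leader,
        # so the next attempt hits the same apparently-down connection.
    return ("gave_up", max_attempts)

# Connection state clears on attempt 42, e.g. after a reconnect succeeds.
print(produce_with_requery(42))  # ('produced', 42)
```

The point of the sketch is that the re-query itself can never break the loop; only fixing (or timing out) the underlying connection can.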

Also, examining the logs again, and given that no traffic is sent to kafka-jumbo1003 in this period: 1003 is never sent a metadata request. And suspiciously, neither is kafka-jumbo1006. (Just checked: they are in different rows; 1003 is in Row B and 1006 is in Row D.)

So: WHY does the main thread see 1003 (and possibly 1006) as down?

Interesting fact: kafka-jumbo1003 and kafka-jumbo1006 are the only brokers in this cluster that are not in Row A or Row B. kubestage1001 is in Row A and kubestage1002 is in Row B. Could there be some very sneaky and intermittent network troubles between k8s nodes and hosts in different rows?

Ohooohooooo! Could this be related to IPv6? On a successful run, here's the connect message for kafka-jumbo1003:

kafka-jumbo1003.eqiad.wmnet:9092/bootstrap: Connected to ipv4#10.64.16.99:9092

But for an unsuccessful run, the startup connect attempt is:

kafka-jumbo1003.eqiad.wmnet:9092/1003: Connecting to ipv6#[2620:0:861:102:10:64:16:99]:9092 (plaintext) with socket 27

And then two minutes later, the successful connect attempt:

kafka-jumbo1003.eqiad.wmnet:9092/1003: Connected to ipv4#10.64.16.99:9092

From https://github.com/edenhill/librdkafka/wiki/FAQ

librdkafka will use the system resolver to resolve the broker hostname. [...] librdkafka will, in a round-robin fashion, attempt to connect to all addresses the hostname resolves to. If the broker is only listening to the IPv4 address then the clients connection attempt to the IPv6 address will fail.
To limit the address families the clients connects to, set the broker.address.family configuration property to v4 or v6.

Kafka brokers are listening on IPv4 and IPv6 addresses:

[@kubestage1002:/home/otto] $ dig kafka-jumbo1003.eqiad.wmnet  ANY | grep -A 2 ';; ANSWER SECTION'
;; ANSWER SECTION:
kafka-jumbo1003.eqiad.wmnet. 1618 IN	AAAA	2620:0:861:102:10:64:16:99
kafka-jumbo1003.eqiad.wmnet. 1343 IN	A	10.64.16.99
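The same dual-family resolution can be cross-checked from Python's system resolver (a sketch; `resolve_families` is a hypothetical helper, not part of eventgate). With both A and AAAA records present, librdkafka round-robins across them, so attempts can land on the IPv6 address.

```python
import socket

# List the address families a hostname resolves to, mirroring what the
# system resolver hands librdkafka before it round-robins across them.
def resolve_families(host, port=9092):
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({family.name for family, *_ in infos})

# e.g. resolve_families("kafka-jumbo1003.eqiad.wmnet") would show both
# AF_INET and AF_INET6 given the A and AAAA records above.
print(resolve_families("localhost"))
```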

[@kubestage1002:/home/otto] $ telnet 10.64.16.99 9092
Trying 10.64.16.99...
Connected to 10.64.16.99.
Escape character is '^]'.

[@kubestage1002:/home/otto] $ telnet 2620:0:861:102:10:64:16:99 9092
Trying 2620:0:861:102:10:64:16:99...
Connected to 2620:0:861:102:10:64:16:99.
Escape character is '^]'.

I tried logging into the container as root and installing telnet there to test this from the pod directly, but:

root@eventgate-analytics-staging-7bbc67c749-tfwwv:/srv/service# apt-get install telnet
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package telnet
root@eventgate-analytics-staging-7bbc67c749-tfwwv:/srv/service# apt-get update
0% [Connecting to mirrors.wikimedia.org (2620:0:861:1:208:80:154:15)] [Connecting to security.debian.org (2610:148:1f10:3::73)] [Connecting to apt.wikimedia.org (2620:0:861:1:208:80:154:22)]

apt-get update couldn't connect to the apt source IPv6 addresses.

I just tried the suggested librdkafka workaround of setting broker.address.family: v4, and I can't reproduce the problem anymore.
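In effect, broker.address.family: v4 restricts name resolution to A records so the client never attempts the (blocked) IPv6 addresses. A sketch of that restriction, again with Python's resolver rather than librdkafka's actual code:

```python
import socket

# Resolve only IPv4 (A record) addresses for a broker hostname --
# roughly what librdkafka does internally when broker.address.family
# is set to v4.
def resolve_v4_only(host, port=9092):
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
    return sorted({sockaddr[0] for *_, sockaddr in infos})

print(resolve_v4_only("localhost"))
```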

@akosiaris _joe_ suggested that perhaps there is some IPv6 Calico issue here?

Change 498455 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - set broker.address.family: v4 to workaround k8s IPv6 issue

https://gerrit.wikimedia.org/r/498455

Change 498455 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - set broker.address.family: v4 to workaround k8s IPv6 issue

https://gerrit.wikimedia.org/r/498455

Change 498873 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Re-enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/498873

Change 498873 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/498873

apt-get update couldn't connect to the apt source IPv6 addresses.

Nor IPv4, for that matter. The apt repos are not in the whitelist[1] (on purpose; aside from debugging, why would they be?).

I just tried the suggested librdkafka workaround of setting broker.address.family: v4, and I can't reproduce the problem anymore.
@akosiaris _joe_ suggested that perhaps there is some IPv6 Calico issue here?

From the looks of it, we never added the IPv6 kafka broker IPs in 29f615c779, so I think this is it. I'll add them and let's see if we've found the root cause.

[1] https://github.com/wikimedia/puppet/blob/production/modules/calico/data/default-kubernetes-policy.yaml

OF COURSE! Thank you Alex.

Since (IIUC) the preferred default is that all servers get IPv6 addresses, we should remember, in general, to include both IPv4 and IPv6 in all networking rules too, ya?
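A minimal sketch of that takeaway, assuming a hypothetical rule-building helper (not any real policy tooling): when whitelisting a host, collect its addresses from both families so neither the A nor the AAAA records get left out.

```python
import socket

# Gather every address of every family for a host, grouped by family,
# so a generated rule set never silently covers only IPv4.
def rule_addresses(host):
    addrs = {"v4": set(), "v6": set()}
    for family, *_, sockaddr in socket.getaddrinfo(
            host, None, proto=socket.IPPROTO_TCP):
        key = "v6" if family == socket.AF_INET6 else "v4"
        addrs[key].add(sockaddr[0])
    return {k: sorted(v) for k, v in addrs.items()}

print(rule_addresses("localhost"))
```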

Change 499223 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Kubernetes default policy: Add Kafka IPv6 addresses

https://gerrit.wikimedia.org/r/499223

After ^:

kafka-jumbo1003.eqiad.wmnet:9092/bootstrap: Connected to ipv6#[2620:0:861:102:10:64:16:99]:9092

Looks good!

Change 499223 merged by Alexandros Kosiaris:
[operations/puppet@production] Kubernetes default policy: Add Kafka IPv6 addresses

https://gerrit.wikimedia.org/r/499223

akosiaris closed this task as Resolved. Mar 26 2019, 9:39 PM

Resolving, feel free to reopen

Change 502292 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/502292

Change 502292 merged by Ottomata:
[operations/mediawiki-config@master] Enable eventgate-analytics api-request logging for group0 wikis

https://gerrit.wikimedia.org/r/502292