There seems to be a problem where occasionally a pod cannot produce to Kafka. I can't predict which pod it will be, but it seems to happen regularly enough during a chart upgrade that at least one pod will be affected. If I delete the pod, a new one is spawned and then usually works fine.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ottomata | T218255 Enabling api-request eventgate to group1 caused minor service disruptions
Resolved | | Ottomata | T218268 eventgate-analytics k8s pods occasionally can't produce to kafka
Event Timeline
Change 496329 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Disable api-request eventgate logging for group0 wikis
Change 496329 merged by Ottomata:
[operations/mediawiki-config@master] Disable api-request eventgate logging for group0 wikis
Mentioned in SAL (#wikimedia-operations) [2019-03-13T22:00:08Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Disabling api-request logging to eventgate-analytics for group0 wikis until we solve T218268 (duration: 00m 56s)
{ "_index": "logstash-2019.03.13", "_type": "eventgate", "_id": "AWl4sxguNBo9dX1kfcii", "_score": 1, "_source": { "err_errno": -1, "source_host": "10.64.64.93", "level": "ERROR", "err_code": -1, "pid": 140, "err_origin": "local", "type": "eventgate", "message": "event 32d56482-45cc-11e9-be6d-1418776134a1 of schema at /mediawiki/api/request/0.0.1 destined to stream mediawiki.api-request encountered an error: message timed out", "version": "1.0", "normalized_message": "event 32d56482-45cc-11e9-be6d-1418776134a1 of schema at /mediawiki/api/request/0.0.1 destined to stream mediawiki.api-request encountered an error: message timed out", "tags": [ "input-gelf-12201", "es", "gelf", "normalized_message_untrimmed" ], "err_message": "message timed out", "@timestamp": "2019-03-13T20:16:36.049Z", "err_name": "Error", "host": "eventgate-analytics-production-5d866bc9dd-nc2qk", "@version": "1", "gelf_level": "3", "err_stack": "Error: Local: Message timed out" }, "fields": { "@timestamp": [ 1552508196049 ] } }
What is especially strange here is that this only happens on one(ish?) random pod at a time after an upgrade. However, the service won't be 'ready' until it has initiated its Kafka producer connection, so it must have been able to talk to Kafka on startup. Now that I think about it, it seems unlikely that this is a random pod network problem, unless the network stops working after service startup. Perhaps this is too much load and the local producer queues are too full waiting for Kafka to ACK messages? Possible, but I believe that Petr and I saw this happen even when we only had the low-volume group0 wikis enabled.
I'm going to poke around with some service upgrade/restarts and see if I can reproduce without any MW logging turned on.
I can't rule out a networking issue, but it seems improbable; after all, the logs did make it to logstash. Also, the destination host is missing from the log, making it more difficult to double-check things.
I don't know much more, but I have a lot more data!
Here is a staging pod with trace logging enabled reproducing the problem:
https://logstash.wikimedia.org/goto/2c7aa00c769053662687249881871c91
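(For reference, this is roughly how that trace logging is enabled; a minimal sketch assuming a node-rdkafka producer, which is the librdkafka wrapper EventGate builds on. The broker list and debug contexts below are illustrative, not the actual chart config.)

```typescript
// Sketch: turn on librdkafka's internal debug logging via node-rdkafka and
// forward it to stdout. The 'broker,topic,metadata' contexts cover the
// MetadataRequest / "leader is down" messages quoted in the logs below.
import * as Kafka from 'node-rdkafka';

const producer = new Kafka.Producer({
  'metadata.broker.list': 'kafka-jumbo1001.eqiad.wmnet:9092', // illustrative
  'debug': 'broker,topic,metadata',
});

// node-rdkafka surfaces librdkafka log lines as 'event.log' events.
producer.on('event.log', (log) => {
  console.log('[rdkafka]', log);
});

producer.connect();
```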
The pod starts, and then I send the first POST at 2019-03-21T22:29:33. rdkafka then gets stuck in a loop of metadata requests and failures for about 1 minute 48 seconds. Logs here include:
```
Topic staging.test.event [0]: leader is down: re-query
kafka-jumbo1002.eqiad.wmnet:9092/1002: Sent MetadataRequest (v2, 67 bytes @ 0, CorrId 25)
kafka-jumbo1002.eqiad.wmnet:9092/1002: Received MetadataResponse (v2, 345 bytes, CorrId 25, rtt 57.27ms)
  Topic staging.test.event partition 0 Leader 1003
kafka-jumbo1002.eqiad.wmnet:9092/1002: 1/1 requested topic(s) seen in metadata
Topic staging.test.event [0]: leader is down: re-query
```
...
and so on. It does this until 2019-03-21T22:31:20 when it logs
kafka-jumbo1003.eqiad.wmnet:9092/1003: failed: err: Local: Broker transport failure: (errno: Connection timed out)
From what I can tell, Local timeouts like this happen because the broker closes an idle connection. All of the open broker connections time out and are disconnected at about this time. Then we get reconnects for all the brokers at 2019-03-21T22:31:21:
kafka-jumbo1005.eqiad.wmnet:9092/1005: Connecting to ipv6#[2620:0:861:103:10:64:32:160]:9092 (plaintext) with socket 28
Once they all reconnect, rdkafka finally sends the produce request:
```
kafka-jumbo1003.eqiad.wmnet:9092/1003: Sent ProduceRequest (v3, 326 bytes @ 0, CorrId 2)
kafka-jumbo1003.eqiad.wmnet:9092/1003: staging.test.event [0]: Produce MessageSet with 1 message(s) (239 bytes, ApiVersion 3, MsgVersion 2)
kafka-jumbo1003.eqiad.wmnet:9092/1003: staging.test.event [0] 1 message(s) in xmit queue (1 added from partition queue)
kafka-jumbo1003.eqiad.wmnet:9092/1003: staging.test.event [0]: MessageSet with 1 message(s) delivered
kafka-jumbo1003.eqiad.wmnet:9092/1003: Received ProduceResponse (v3, 54 bytes, CorrId 2, rtt 94.33ms)
All 1 out of 1 events were accepted.
```
Now we just have to figure out why rdkafka sometimes gets stuck on `Topic staging.test.event [0]: leader is down: re-query` for 1-2 minutes....
I ran tcpdump while reproducing this on kubestage1001 and the Kafka brokers. I could see the looped metadata requests going through fine, but the topic partition leader on kafka-jumbo1003 did not receive any traffic between the initial producer connection and when the produce request was finally sent, about 2 minutes after the HTTP POST. Here are the tcpdump lines from kubestage1001, 2 minutes apart:
```
17:05:39.985011 IP kubernetes-pod-10-64-75-18.eqiad.wmnet.57250 > kafka-jumbo1003.eqiad.wmnet.9092: Flags [.], ack 557, win 62, options [nop,nop,TS val 1666777933 ecr 4257697771], length 0
17:07:50.725284 IP kubernetes-pod-10-64-75-18.eqiad.wmnet.57718 > kafka-jumbo1003.eqiad.wmnet.9092: Flags [S], seq 2270592047, win 29200, options [mss 1460,sackOK,TS val 1666810618 ecr 0,nop,wscale 9], length 0
```
`Topic staging.test.event [0]: leader is down: re-query` happens in the main rdkafka thread, while all the other metadata requests happen in either the broker or broker/bootstrap specific threads. It's as if the main thread sees kafka-jumbo1003 as down and tells the bootstrap threads to re-query for leadership, assuming that Kafka will have rebalanced the leader. Since Kafka hasn't (there's nothing wrong with 1003), the partition leadership metadata returned is the same (1003 is still the leader), and the main thread is told to produce to 1003 again. It still sees 1003 as down (why, we don't know). Hence the loop between the main thread's `leader is down` and the metadata re-query.
Also, examining the logs again, and given that no traffic is sent to kafka-jumbo1003 in this period, 1003 is never sent a metadata request. And suspiciously, neither is kafka-jumbo1006. (Just checked: they are in different rows; 1003 is in Row B and 1006 is in row D.)
So so so, WHY does the main thread see 1003 (and possibly 1006) as down???
Interesting fact: kafka-jumbo1003 and kafka-jumbo1006 are the only brokers in this cluster that are not in Row A or Row B. kubestage1001 is in Row A and kubestage1002 is in Row B. Could there be some very sneaky and intermittent network troubles between k8s nodes and hosts in different rows?
Ohooohooooo! Could this be related to IPv6? On a successful run, here's the connect message for kafka-jumbo1003:
kafka-jumbo1003.eqiad.wmnet:9092/bootstrap: Connected to ipv4#10.64.16.99:9092
But for an unsuccessful run, the startup connect attempt is:
kafka-jumbo1003.eqiad.wmnet:9092/1003: Connecting to ipv6#[2620:0:861:102:10:64:16:99]:9092 (plaintext) with socket 27
And then two minutes later, the successful connect attempt:
kafka-jumbo1003.eqiad.wmnet:9092/1003: Connected to ipv4#10.64.16.99:9092
From https://github.com/edenhill/librdkafka/wiki/FAQ
librdkafka will use the system resolver to resolve the broker hostname. [...] librdkafka will, in a round-robin fashion, attempt to connect to all addresses the hostname resolves to. If the broker is only listening to the IPv4 address then the clients connection attempt to the IPv6 address will fail.
To limit the address families the client connects to, set the broker.address.family configuration property to v4 or v6.
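(As a minimal client-level sketch of what that looks like, again assuming a node-rdkafka producer with an illustrative broker list; broker.address.family is a stock librdkafka global property with values any, v4, or v6:)

```typescript
// Sketch: restrict librdkafka to IPv4 so it never round-robins onto an AAAA
// record that the pod cannot actually reach.
import * as Kafka from 'node-rdkafka';

const producer = new Kafka.Producer({
  'metadata.broker.list': 'kafka-jumbo1001.eqiad.wmnet:9092', // illustrative
  'broker.address.family': 'v4', // the workaround applied in the chart change below
});

producer.connect();
```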
Kafka brokers are listening on IPv4 and IPv6 addresses:
```
[@kubestage1002:/home/otto] $ dig kafka-jumbo1003.eqiad.wmnet ANY | grep -A 2 ';; ANSWER SECTION'
;; ANSWER SECTION:
kafka-jumbo1003.eqiad.wmnet. 1618 IN AAAA 2620:0:861:102:10:64:16:99
kafka-jumbo1003.eqiad.wmnet. 1343 IN A 10.64.16.99

[@kubestage1002:/home/otto] $ telnet 10.64.16.99 9092
Trying 10.64.16.99...
Connected to 10.64.16.99.
Escape character is '^]'.

[@kubestage1002:/home/otto] $ telnet 2620:0:861:102:10:64:16:99 9092
Trying 2620:0:861:102:10:64:16:99...
Connected to 2620:0:861:102:10:64:16:99.
Escape character is '^]'.
```
I tried logging into the container as root and installing telnet there to test this from the pod directly, but:
```
root@eventgate-analytics-staging-7bbc67c749-tfwwv:/srv/service# apt-get install telnet
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package telnet

root@eventgate-analytics-staging-7bbc67c749-tfwwv:/srv/service# apt-get update
0% [Connecting to mirrors.wikimedia.org (2620:0:861:1:208:80:154:15)] [Connecting to security.debian.org (2610:148:1f10:3::73)] [Connecting to apt.wikimedia.org (2620:0:861:1:208:80:154:22)]
```
apt-get update couldn't connect to the apt source IPv6 addresses.
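(Since telnet can't be installed in the image, the same reachability check could be scripted with the node runtime that is already in the container; a sketch, using the A and AAAA addresses from the dig output above, everything else illustrative:)

```typescript
// Sketch: test TCP reachability of a Kafka broker over IPv4 and IPv6 from
// inside the pod, using only node's standard library.
import * as net from 'net';

const targets = [
  { host: '10.64.16.99', family: 4 },                 // kafka-jumbo1003 A record
  { host: '2620:0:861:102:10:64:16:99', family: 6 },  // kafka-jumbo1003 AAAA record
];

for (const t of targets) {
  const socket = net.connect({ host: t.host, port: 9092, family: t.family });
  socket.setTimeout(5000);
  socket.on('connect', () => {
    console.log(`IPv${t.family} ${t.host}:9092 connected`);
    socket.end();
  });
  socket.on('timeout', () => {
    console.log(`IPv${t.family} ${t.host}:9092 timed out`);
    socket.destroy();
  });
  socket.on('error', (err) => {
    console.log(`IPv${t.family} ${t.host}:9092 error: ${err.message}`);
  });
}
```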
I just tried the suggested librdkafka workaround of setting broker.address.family: v4, and I can't reproduce this problem anymore.
@akosiaris _joe_ suggested that perhaps there is some IPv6 Calico issue here?
Change 498455 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - set broker.address.family: v4 to workaround k8s IPv6 issue
Change 498455 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - set broker.address.family: v4 to workaround k8s IPv6 issue
Change 498873 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Re-enable eventgate-analytics api-request logging for group0 wikis
Change 498873 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable eventgate-analytics api-request logging for group0 wikis
Nor IPv4 for that matter. The apt repos are not in the whitelist[1] (on purpose; aside from debugging, why would they be?).
From the looks of it, we never added the IPv6 kafka broker IPs in 29f615c779, so I think this is it. I'll add them and let's see if we found the root cause.
OF COURSE! Thank you Alex.
Since (IIUC) the preferred default is that all servers get IPv6 addies, we should remember in general to include both IPv4 and IPv6 in all networking rules too, ya?
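(To make that concrete, a hypothetical helper, not the actual puppet change, that enumerates every address, IPv4 and IPv6, a rule for the Kafka brokers would need to allow; broker list is illustrative:)

```typescript
// Sketch: resolve both A and AAAA records for each broker, since both sets
// need matching allow rules, or clients that round-robin across address
// families (like librdkafka) will intermittently hang.
import { promises as dns } from 'dns';

const brokers = [
  'kafka-jumbo1003.eqiad.wmnet',
  'kafka-jumbo1006.eqiad.wmnet',
];

async function main(): Promise<void> {
  for (const host of brokers) {
    const v4 = await dns.resolve4(host).catch(() => [] as string[]);
    const v6 = await dns.resolve6(host).catch(() => [] as string[]);
    console.log(host, { v4, v6 });
  }
}

main();
```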
Change 499223 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Kubernetes default policy: Add Kafka IPv6 addresses
After ^:
kafka-jumbo1003.eqiad.wmnet:9092/bootstrap: Connected to ipv6#[2620:0:861:102:10:64:16:99]:9092
Looks good!
Change 499223 merged by Alexandros Kosiaris:
[operations/puppet@production] Kubernetes default policy: Add Kafka IPv6 addresses
Change 502292 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Enable eventgate-analytics api-request logging for group0 wikis
Change 502292 merged by Ottomata:
[operations/mediawiki-config@master] Enable eventgate-analytics api-request logging for group0 wikis