Page MenuHomePhabricator

ChangeProp logging KafkaConsumer is not connected
Closed, ResolvedPublic

Description

Sometimes ChangeProp gets into the mode when it starts logging KafkaConsumer is not connected in a tight loop. The logs normally disappeare after a full restart. The service was operating normally though.

Sample log entry:

{
  "_index": "logstash-2018.07.12",
  "_type": "cpjobqueue",
  "_id": "AWSNgQkkKHrb1-g-EgFk",
  "_version": 1,
  "_score": null,
  "_source": {
    "err_executor": "RetryExecutor",
    "err_errno": -172,
    "source_host": "10.192.0.33",
    "err_levelPath": "error/consumer",
    "level": "ERROR",
    "err_rule_name": "low_traffic_jobs",
    "line": "260",
    "err_code": -172,
    "pid": 9774,
    "err_origin": "kafka",
    "type": "cpjobqueue",
    "message": "KafkaConsumer is not connected",
    "version": "1.0",
    "normalized_message": "KafkaConsumer is not connected",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "err_message": "KafkaConsumer is not connected",
    "file": "/srv/deployment/cpjobqueue/deploy-cache/revs/ba672a32e1d9642f5dad2008a6f05b8b48ee89e6/node_modules/node-rdkafka/lib/error.js",
    "@timestamp": "2018-07-12T07:59:59.974Z",
    "err_name": "cpjobqueue",
    "host": "scb2003",
    "@version": "1",
    "gelf_level": "3",
    "err_stack": "Error: KafkaConsumer is not connected\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/cpjobqueue/deploy-cache/revs/ba672a32e1d9642f5dad2008a6f05b8b48ee89e6/node_modules/node-rdkafka/lib/error.js:260:10)\n    at /srv/deployment/cpjobqueue/deploy-cache/revs/ba672a32e1d9642f5dad2008a6f05b8b48ee89e6/node_modules/node-rdkafka/lib/kafka-consumer.js:442:29"
  },
  "fields": {
    "@timestamp": [
      1531382399974
    ]
  },
  "sort": [
    1531382399974
  ]
}

I believe the problem is that driver can't properly reconnect automatically when it's been disconnected. We need to figure out whether it's a driver bug or it's something we should fix/workaround in our code.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Pchelolo renamed this task from After switchover to codfw ChangeProp logged KafkaConsumer is not connected to After switchover to codfw Job queue ChangeProp logged KafkaConsumer is not connected.Jul 12 2018, 1:50 PM
Pchelolo renamed this task from After switchover to codfw Job queue ChangeProp logged KafkaConsumer is not connected to ChangeProp logging KafkaConsumer is not connected.Oct 1 2018, 4:51 PM
Pchelolo updated the task description. (Show Details)
mobrovac added a subscriber: mobrovac.

The interesting part is that this doesn't happen for CP-JQ, only CP.

It does. Yesterday I've restarted JobQueue for that.

I found the reason for this to be happening. At least, one possible reason. And to be honest, I'm embarrassed.

To support regex topics and automatic resubscription in case a new job/rule/topic is added we have a MetadataWatch class - from time to time it fetches the complete topic list from Kafka, reapplies topic filters for each rule and if we the set of topics we need to subscribe to changes, we call this code:

this.unsubscribe();
this._subscribeTopics(newFilteredTopics);

Looks harmless, but the unsubscribe is asynchronous. So, sometimes we end up in a situation when connects and disconnects gets mixed up and we end up with executors which think they're connected while the client itself is disconnected.

I'm not sure this is the only case when this can happen, but this particular one is reproducible in labs, so I will start with it.

The previous comment also explains why we started seeing the errors after DC switchover. Topics are created on demand and while codfw was not a primary DC a lot of job types did not exist there, so when we switched and new job types started being emitted in codfw the faulty codepath was executed and we run into the race condition.

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:51:02Z] <ppchelko@deploy1001> Started deploy [changeprop/deploy@5d00448]: Proper reconnect on topics change T199444

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:52:42Z] <ppchelko@deploy1001> Finished deploy [changeprop/deploy@5d00448]: Proper reconnect on topics change T199444 (duration: 01m 40s)

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:53:41Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@55dbb8b]: Proper reconnect on topics change T199444

Mentioned in SAL (#wikimedia-operations) [2018-10-04T15:54:36Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@55dbb8b]: Proper reconnect on topics change T199444 (duration: 00m 55s)

Pchelolo edited projects, added Services (done); removed Services (doing).

Both CP and CPJQ were deployed with the fix. This bug happens quite rarely, so I will close the ticket now. If it happens again we will reopen it.