
improve banner logger and kafkatee monitoring
Closed, Resolved · Public

Description

We have several possible failure modes for kafkatee:

  1. local filtering or log handling bug
  2. unanticipated kafka-jumbo cluster changes (T254257: adjust fundraising firewalls and kafkatee configuration to accommodate new kafka brokers kafka-jumbo100[789])
  3. packet loss due to local bottleneck (T239564: Monitor and investigate possible event dropping by Kafkatee)
  4. packet loss due to local firewall (locally logged)
  5. packet loss due to network issue
  6. incomplete topic data due to bug/issue in the kafka cluster (T73056: kafkatee not consuming for some partitions)
  7. incomplete data due to eventlogging issue

We already have some monitoring in place:

  • check_impression_logs - nagios check integrated with the log rotation script; it alerts if log rotation stops running or if it sees gaps in the data. This is very limited because there's no way to adjust alert thresholds to suit campaign activity
  • notification on packets dropped by the kernel
  • email alert when kafkatee's discovered broker list does not match its configuration; this is a function of our prometheus exporter, which exports statistics from kafkatee's status log
  • nagios/icinga monitoring of reported broker state up/down as well as timeliness of json stat reports
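
For illustration only, the broker-related checks in the list above (broker state, discovered vs configured brokers, timeliness of the JSON stat reports) could be sketched roughly as follows. This is not the actual exporter or nagios check. It assumes each status report is a single JSON object with a wall-clock `time` field (seconds since the epoch) and a `brokers` object whose entries carry a `state` string, which is the general shape of librdkafka statistics; the function name `check_kafkatee_stats`, the `max_age_s` threshold and the broker names used for comparison are hypothetical.

```python
import json
import time

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_kafkatee_stats(stats_line, configured, max_age_s=120, now=None):
    """Evaluate one JSON stats report and return (exit_code, message).

    stats_line: one JSON document (assumed shape: {"time": ..., "brokers": {...}})
    configured: iterable of broker names expected to appear in the report
    max_age_s:  hypothetical staleness threshold in seconds
    """
    now = time.time() if now is None else now
    stats = json.loads(stats_line)

    # Timeliness: a stale report suggests kafkatee stopped writing stats.
    age = now - stats["time"]
    if age > max_age_s:
        return CRITICAL, f"stats report is stale ({age:.0f}s old)"

    # Broker state: anything other than UP is treated as a problem.
    brokers = stats.get("brokers", {})
    down = sorted(name for name, b in brokers.items() if b.get("state") != "UP")
    if down:
        return CRITICAL, "brokers not UP: " + ", ".join(down)

    # Discovered vs configured broker list.
    discovered = set(brokers)
    expected = set(configured)
    missing = sorted(expected - discovered)
    unexpected = sorted(discovered - expected)
    if missing or unexpected:
        return WARNING, f"broker mismatch: missing={missing} unexpected={unexpected}"

    return OK, f"{len(brokers)} brokers UP, report {age:.0f}s old"
```

A wrapper script would read the most recent line of the status log, call the function with the configured broker list, print the message and exit with the returned code.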

We have some tasks defined for improvements:

Other ideas:

  • alert on changes to relevant production puppet classes or hiera configuration
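
One possible way to approach this idea, sketched under assumptions: keep a fingerprint of the relevant files from a local checkout of the puppet repository and alert when any digest changes. Nothing here reflects what was actually implemented for this task; the function names and the choice of SHA-256 file digests are hypothetical, and the list of files to watch would have to be chosen by hand.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each path to the SHA-256 hex digest of its contents (None if missing)."""
    result = {}
    for p in map(Path, paths):
        if p.is_file():
            result[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest()
        else:
            result[str(p)] = None
    return result


def changed_paths(baseline, current):
    """Return the sorted paths whose digest differs between two fingerprints."""
    keys = set(baseline) | set(current)
    return sorted(k for k in keys if baseline.get(k) != current.get(k))
```

A periodic job could store the last fingerprint, recompute it after each repository update, and send a notification listing `changed_paths(old, new)` whenever the result is non-empty.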

Event Timeline

Jgreen created this task. · Jun 3 2020, 5:28 PM
Restricted Application added a subscriber: Aklapper. · Jun 3 2020, 5:28 PM
Jgreen updated the task description. · Jun 3 2020, 5:31 PM
Jgreen updated the task description.
Jgreen updated the task description. · Jun 3 2020, 5:36 PM
Jgreen updated the task description. · Jun 3 2020, 5:44 PM
Jgreen updated the task description.
Jgreen updated the task description. · Jun 3 2020, 5:58 PM
Jgreen moved this task from Triage to Up Next on the fundraising-tech-ops board. · Jun 4 2020, 12:26 PM
Jgreen claimed this task. · Jun 11 2020, 9:25 PM
Jgreen triaged this task as Medium priority.
Jgreen updated the task description.
Jgreen updated the task description.
  • adjusted kernel log monitoring to make dropped packets more visible
  • modified prometheus kafkatee exporter to cronspam every 10 minutes when there's a mismatch between discovered vs configured brokers
Jgreen updated the task description. · Jun 16 2020, 7:53 PM
  • added check_kafkatee to frack nagios/icinga, monitoring broker state in kafkatee status logs
Jgreen closed this task as Resolved. · Jun 16 2020, 7:54 PM
Jgreen moved this task from In Progress to Done on the fundraising-tech-ops board.