If the throughput of refined events being dumped into Hadoop decreases sharply (for example, because of a failure in camus or a problem with offset consumption), we should be notified.
Maybe report throughput via counts to graphite and have a threshold alarm?
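A minimal sketch of what that could look like, not something that exists in refinery: one helper formats a count in Graphite's plaintext protocol (`path value timestamp`), and another flags a sharp drop by comparing the latest count with the mean of the earlier ones. The function names, the metric path in the usage note and the 0.5 ratio are all assumptions made up for illustration.

```python
from statistics import mean


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp'."""
    return f"{path} {value} {timestamp}\n"


def throughput_dropped(counts: list[int], drop_ratio: float = 0.5) -> bool:
    """Return True if the latest count is below drop_ratio times the mean of the earlier counts.

    drop_ratio is a hypothetical threshold; it would need tuning per topic.
    """
    if len(counts) < 2:
        return False  # not enough history to compare against
    *history, latest = counts
    baseline = mean(history)
    return baseline > 0 and latest < drop_ratio * baseline
```

For example, `throughput_dropped([100, 110, 90, 10])` reports a drop, while `[100, 110, 90, 95]` does not. A line such as `graphite_line("analytics.camus.imported", 95, 1534172392)` could then be written to the Graphite plaintext listener.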
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ottomata | T198906 EventLogging in Hive data loss due to Camus and Kafka timestamp.type=CreateTime change
Resolved | | odimitrijevic | T198986 Data Quality Alarms
Resolved | | Ottomata | T198908 Alarms on throughput on camus imported data
See: https://phabricator.wikimedia.org/T198906
Ideally we would also alarm on errors in any of the logs that camus produces.
Alright, I have an idea!
So, a long time ago, @JAllemandou wrote the CamusPartitionChecker that we use to mark webrequest partitions as imported. It looks at the camus history files to find the latest message timestamp of the last run, works out which hours that timestamp is already past, and marks those hours as imported. So, we could use this for all EventLogging data too... IF all EventLogging data had at least one message per hour. Because it doesn't, CamusPartitionChecker will print errors for missing partitions, and there's no way to know whether a partition is missing due to an error or just because there was no data in that hour.
So, I suggest we use NavigationTiming as our canary, as we do for EventLogging throughput. The easiest way for me to do this would be to run CamusPartitionChecker with --dry-run, have it print an ERROR message if NavigationTiming is missing any partitions, and then trigger an alert on that. This is a little hacky, but will work well. Thoughts?
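To make the canary idea concrete, here is a rough sketch of the check. It is not the actual CamusPartitionChecker logic (which reads camus history files); it assumes we already have, per topic, the set of hourly partitions that were imported. All function names and the plain `NavigationTiming` topic label are hypothetical.

```python
from datetime import datetime, timedelta


def expected_hours(start: datetime, end: datetime) -> list[datetime]:
    """Every whole hour from the hour containing start up to, but not including, end."""
    hours = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        hours.append(current)
        current += timedelta(hours=1)
    return hours


def check_canaries(
    imported_by_topic: dict[str, set[datetime]],
    canaries: list[str],
    start: datetime,
    end: datetime,
) -> list[str]:
    """Return one ERROR line per hourly partition that a canary topic is missing."""
    errors = []
    for topic in canaries:
        imported = imported_by_topic.get(topic, set())
        for hour in expected_hours(start, end):
            if hour not in imported:
                errors.append(f"ERROR {topic} is missing partition {hour:%Y-%m-%d %H}:00")
    return errors
```

Only topics expected to have data every hour belong in `canaries`; for any other topic a missing partition would be ambiguous, which is the problem described above. An alert would fire whenever the returned list is non-empty.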
Sounds good to me! Maybe we could check a bigger list of topics to increase the chance of catching errors, but apart from that it sounds good.
Change 450861 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Add email error reporting to CamusPartitionChecker
I suggest both NavigationTiming and VirtualPageview, which should have data at all times.
Change 450861 merged by Ottomata:
[analytics/refinery/source@master] Add email error reporting to CamusPartitionChecker
Change 451784 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] CamusPartitionChecker - only send emails if errors are encountered
Change 451869 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] camus wrapper - Use Spark 2 jars to get Scala and Hadoop dependencies
Change 451784 merged by Ottomata:
[analytics/refinery/source@master] CamusPartitionChecker - only send emails if errors are encountered
Change 451869 merged by Ottomata:
[analytics/refinery@master] camus wrapper - Use Spark 2 jars to get Scala and Hadoop dependencies
Mentioned in SAL (#wikimedia-analytics) [2018-08-13T14:59:52Z] <ottomata> deploying refinery-0.0.69 and refinery changes for T198908
Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:00:21Z] <otto@deploy1001> Started deploy [analytics/refinery@9006a4e]: refinery changes for T198908
Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:10:38Z] <otto@deploy1001> Finished deploy [analytics/refinery@9006a4e]: refinery changes for T198908 (duration: 10m 30s)
Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:33:31Z] <otto@deploy1001> Started deploy [analytics/refinery@a051125]: fix for T198908
Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:41:37Z] <otto@deploy1001> Finished deploy [analytics/refinery@a051125]: fix for T198908 (duration: 08m 06s)
Change 452439 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] camus - Add --check-java-opts and --check-emails-to option
Change 452439 merged by Ottomata:
[analytics/refinery@master] camus - Add --check-java-opts and --check-emails-to option
Change 452442 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add parameters for camus::job to pass to CamusPartitionChecker
Change 452442 merged by Ottomata:
[operations/puppet@production] Add parameters for camus::job to pass to CamusPartitionChecker
Mentioned in SAL (#wikimedia-operations) [2018-08-13T18:35:54Z] <otto@deploy1001> Started deploy [analytics/refinery@c7f68b7]: camus - Add --check-java-opts and --check-emails-to option - T198908
Change 452451 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Enable check email reporting for camus jobs
Mentioned in SAL (#wikimedia-operations) [2018-08-13T18:47:04Z] <otto@deploy1001> Finished deploy [analytics/refinery@c7f68b7]: camus - Add --check-java-opts and --check-emails-to option - T198908 (duration: 11m 10s)
Change 452451 merged by Ottomata:
[operations/puppet@production] Enable check email reporting for camus jobs