
Alarms on throughput on camus imported data
Closed, Resolved | Public | 8 Estimated Story Points

Description

If the throughput of refined events being written into Hadoop decreases sharply (for example, because of a failure in camus or a problem with offset consumption), we should be notified.

Maybe report throughput via counts to graphite and have a threshold alarm?
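A rough sketch of what the threshold idea could look like, assuming per-run event counts are already available. The metric name, the `min_ratio` default, and both function names are hypothetical; only the line format (`path value timestamp`) follows Graphite's plaintext protocol.

```python
def graphite_line(metric, value, timestamp):
    """Format one data point for Graphite's plaintext protocol."""
    return f"{metric} {value} {int(timestamp)}\n"


def throughput_dropped(recent_counts, latest_count, min_ratio=0.5):
    """Return True if latest_count falls below min_ratio times the recent mean.

    recent_counts: counts from earlier runs, used as the baseline.
    min_ratio: hypothetical threshold; 0.5 means "alarm on a 50% drop".
    """
    if not recent_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent_counts) / len(recent_counts)
    return latest_count < baseline * min_ratio


# Hypothetical metric name, for illustration only.
line = graphite_line("camus.imported.count", 1200, 1530819780)
alarm = throughput_dropped([1000, 1100, 900], 300)
```

In practice the threshold would more likely live in the alerting layer that reads from Graphite than in the job that reports the counts.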

Event Timeline

Nuria triaged this task as Unbreak Now! priority. Jul 5 2018, 7:43 PM

See: https://phabricator.wikimedia.org/T198906

Ideally we would also alarm on errors in any of the logs that camus produces.

Nuria renamed this task from Throughput alarms on refined data to Alarms on throughput on refined data. Jul 6 2018, 6:19 PM
fdans lowered the priority of this task from Unbreak Now! to High. Jul 9 2018, 4:13 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Alright, I have an idea!

So, a long time ago, @JAllemandou wrote the CamusPartitionChecker that we use to mark webrequest partitions as imported. It looks at the camus history files to find the latest message timestamp of the last run, checks whether that timestamp is beyond particular hours, and then marks those hours' partitions as imported. So, we could use this for all EventLogging data too... IF all EventLogging data had at least one message per hour. Because it doesn't, CamusPartitionChecker will print errors for missing partitions, and there's no way to know whether a partition is missing due to an error or just because there was no data in that hour.

So, I suggest we use NavigationTiming as our canary, as we do for EventLogging throughput. The easiest way for me to do this would be to run CamusPartitionChecker with --dry-run so that it prints an ERROR message if NavigationTiming is missing any partitions, and then trigger an alert on that message. This is a little hacky, but will work well. Thoughts?
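To make the canary idea concrete, here is a minimal sketch in Python. It is not how CamusPartitionChecker works internally (that reads the camus history files); it assumes the set of imported hours per topic is already known, and every function name here is hypothetical.

```python
from datetime import datetime, timedelta


def expected_hours(start, end):
    """Yield every whole hour from start (floored to the hour) up to, not including, end."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        yield current
        current += timedelta(hours=1)


def missing_canary_partitions(imported, canary_topics, start, end):
    """Return (topic, hour) pairs that a canary topic should have but lacks.

    imported: dict mapping topic name -> set of datetime hours already imported.
    Only canary topics are checked, because sparse topics can legitimately
    have empty hours.
    """
    missing = []
    for topic in canary_topics:
        have = imported.get(topic, set())
        for hour in expected_hours(start, end):
            if hour not in have:
                missing.append((topic, hour))
    return missing


def error_lines(missing):
    """Format one ERROR line per missing partition, for an alert to match on."""
    return [
        f"ERROR {topic} is missing partition {hour.strftime('%Y-%m-%dT%H:00')}"
        for topic, hour in missing
    ]


imported = {
    "NavigationTiming": {datetime(2018, 8, 7, 0), datetime(2018, 8, 7, 2)},
}
missing = missing_canary_partitions(
    imported, ["NavigationTiming"], datetime(2018, 8, 7, 0), datetime(2018, 8, 7, 3)
)
for line in error_lines(missing):
    print(line)
```

Restricting the check to topics that always have traffic is what makes a missing hour meaningful: for any other topic, an empty hour is indistinguishable from a failed import.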

Sounds good to me! Maybe we could check a bigger list of topics, to increase the probability of catching errors, but other than that it sounds good.

Change 450861 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Add email error reporting to CamusPartitionChecker

https://gerrit.wikimedia.org/r/450861

I suggest both NavigationTiming and VirtualPageview, which should have data at all times.

Ottomata renamed this task from Alarms on throughput on refined data to Alarms on throughput on camus imported data. Aug 7 2018, 8:26 PM
Ottomata claimed this task.
Ottomata added a project: Analytics-Kanban.
Ottomata set the point value for this task to 8.
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 450861 merged by Ottomata:
[analytics/refinery/source@master] Add email error reporting to CamusPartitionChecker

https://gerrit.wikimedia.org/r/450861

Change 451784 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] CamusPartitionChecker - only send emails if errors are encountered

https://gerrit.wikimedia.org/r/451784

Change 451869 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] camus wrapper - Use Spark 2 jars to get Scala and Hadoop dependencies

https://gerrit.wikimedia.org/r/451869

Change 451784 merged by Ottomata:
[analytics/refinery/source@master] CamusPartitionChecker - only send emails if errors are encountered

https://gerrit.wikimedia.org/r/451784

Change 451869 merged by Ottomata:
[analytics/refinery@master] camus wrapper - Use Spark 2 jars to get Scala and Hadoop dependencies

https://gerrit.wikimedia.org/r/451869

Mentioned in SAL (#wikimedia-analytics) [2018-08-13T14:59:52Z] <ottomata> deploying refinery-0.0.69 and refinery changes for T198908

Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:00:21Z] <otto@deploy1001> Started deploy [analytics/refinery@9006a4e]: refinery changes for T198908

Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:10:38Z] <otto@deploy1001> Finished deploy [analytics/refinery@9006a4e]: refinery changes for T198908 (duration: 10m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:33:31Z] <otto@deploy1001> Started deploy [analytics/refinery@a051125]: fix for T198908

Mentioned in SAL (#wikimedia-operations) [2018-08-13T15:41:37Z] <otto@deploy1001> Finished deploy [analytics/refinery@a051125]: fix for T198908 (duration: 08m 06s)

Change 452439 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] camus - Add --check-java-opts and --check-emails-to option

https://gerrit.wikimedia.org/r/452439

Change 452439 merged by Ottomata:
[analytics/refinery@master] camus - Add --check-java-opts and --check-emails-to option

https://gerrit.wikimedia.org/r/452439

Change 452442 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add parameters for camus::job to pass to CamusPartitionChecker

https://gerrit.wikimedia.org/r/452442

Change 452442 merged by Ottomata:
[operations/puppet@production] Add parameters for camus::job to pass to CamusPartitionChecker

https://gerrit.wikimedia.org/r/452442

Mentioned in SAL (#wikimedia-operations) [2018-08-13T18:35:54Z] <otto@deploy1001> Started deploy [analytics/refinery@c7f68b7]: camus - Add --check-java-opts and --check-emails-to option - T198908

Change 452451 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Enable check email reporting for camus jobs

https://gerrit.wikimedia.org/r/452451

Mentioned in SAL (#wikimedia-operations) [2018-08-13T18:47:04Z] <otto@deploy1001> Finished deploy [analytics/refinery@c7f68b7]: camus - Add --check-java-opts and --check-emails-to option - T198908 (duration: 11m 10s)

Change 452451 merged by Ottomata:
[operations/puppet@production] Enable check email reporting for camus jobs

https://gerrit.wikimedia.org/r/452451