
[Epic] overhaul fundraising cluster monitoring
Closed, Resolved
Public

Description

fundraising hosts use nsca + icinga, and ganglia for monitoring and reporting
nsca seems to be poorly maintained, currently broken for Trusty hosts
as an org we're talking about moving away from ganglia
this task is to fix and update monitoring to be reliable and better supported going forward

random notes:

Related Objects

Status    Subtype  Assigned    Task
Resolved           Jgreen
Resolved           Jgreen
Resolved           None
Resolved           cwdent
Resolved           cwdent
Declined           None
Resolved           cwdent
Resolved           ayounsi
Resolved           Jgreen
Resolved  Spike    Jgreen
Resolved           Jgreen
Declined           None
Resolved           cwdent
Resolved           cwdent
Resolved           Dwisehaupt
Declined           None
Resolved           cwdent
Declined           None
Resolved           Jgreen
Declined           None
Resolved           Jgreen
Declined           None
Declined           None
Resolved           Jgreen
Resolved           None
Resolved           Jgreen
Resolved           None
Declined           None

Event Timeline

Jgreen claimed this task.
Jgreen raised the priority of this task from to Low.
Jgreen updated the task description.
Jgreen subscribed.

It is time to promote Wikimedia-Hackathon-2015 activities in the program (training sessions and meetings) and main wiki page (hacking projects and other ongoing activities). Follow the instructions, please. If you have questions about this message, ask here.

Did anyone work on this task during Wikimedia-Hackathon-2015? If so, please post an update with the results. Otherwise remove the label.

Looked into why ganglia has stopped reporting for most of the fundraising cluster, determined pfw config is likely blocking multicast, opened T100150.

Spent some time researching monitoring options and spoke to Filippo about his research toward a strategy for production, specifically regarding Prometheus, which sounds promising.

Looked at possibilities for pulling data from ganglia aggregators into Prometheus. Ganglia's aggregator listens on a TCP socket and responds with an XML report of metric data for all hosts in the aggregated cluster. Studied that report format and wrote a simple Python XML parser, thinking toward reformatting that data into a format Prometheus can use.
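A rough sketch of that idea, not the parser mentioned above: it reads a gmond-style XML report and prints each numeric metric as a line in the Prometheus text exposition format. The function names, the `ganglia_` prefix, the label names, and the default port 8649 are assumptions made for illustration, and the sample report is invented.

```python
import re
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the whole XML report from an aggregator's TCP socket.

    Assumes the aggregator sends the report and then closes the connection;
    port 8649 is the usual gmond default, adjust to the local config.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def sanitize_name(name):
    """Map a ganglia metric name onto Prometheus' allowed character set."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    if name[:1].isdigit():
        name = "_" + name
    return name


def escape_label(value):
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def gmond_to_prometheus(xml_bytes, prefix="ganglia_"):
    """Convert CLUSTER/HOST/METRIC elements to exposition-format lines."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for cluster in root.iter("CLUSTER"):
        cluster_name = escape_label(cluster.get("NAME", ""))
        for host in cluster.iter("HOST"):
            host_name = escape_label(host.get("NAME", ""))
            for metric in host.iter("METRIC"):
                # String metrics (os_name, machine_type, ...) have no numeric value.
                if metric.get("TYPE") == "string":
                    continue
                try:
                    value = float(metric.get("VAL", ""))
                except ValueError:
                    continue
                name = prefix + sanitize_name(metric.get("NAME", ""))
                lines.append(
                    f'{name}{{cluster="{cluster_name}",host="{host_name}"}} {value}'
                )
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Invented sample report; a real one would come from fetch_gmond_xml().
    sample = (
        b'<GANGLIA_XML VERSION="3.6.0" SOURCE="gmond">'
        b'<CLUSTER NAME="example"><HOST NAME="host1">'
        b'<METRIC NAME="load_one" VAL="0.12" TYPE="float"/>'
        b'<METRIC NAME="os_name" VAL="Linux" TYPE="string"/>'
        b"</HOST></CLUSTER></GANGLIA_XML>"
    )
    print(gmond_to_prometheus(sample), end="")
```

Output like this could be served over HTTP for Prometheus to scrape or written to a file for a textfile-style collector; which approach fits the fundraising cluster is left open here.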

We are trying to help all open tasks listed under "Work continues after Lyon" at the Wikimedia Hackathon 2015 workboard find their best way forward.
* If you are participating in Wikimania, consider adding the #Wikimania-Hackathon-2015 project to get this task in that loop, which is about to start.
* If you think this project could welcome help from a dedicated Google Summer of Code or Outreachy intern, or from an Individual Engagement Grant, add the Possible-Tech-Projects project.
* If you would like to receive some other type of support (organizing a Tech Talk, establishing contacts with existing developer teams in Wikimedia or elsewhere, travel sponsorship for a related activity... you name it), please create a subtask explaining your request and associate it with #Engineering-Community (or you can start by commenting here if you prefer).
* Keeping the description, priority and assigned fields up to date always helps. :)
For some context about this message, see T101151: Evaluate which projects showcased at the Wikimedia Hackathon should be supported further. It is the last communication related to Wikimedia-Hackathon-2015 that we will post here.

Please confirm and promote this activity by assigning it to its owner, listing it or scheduling it at the Hackathon wiki page and by placing it in the right column at #Wikimania-Hackathon-2015. Thank you!

@Jgreen, are you planning to work on this task at Wikimania?

No, unfortunately I won't be attending this year.

Jgreen closed subtask Restricted Task as Resolved. Jul 10 2015, 5:49 PM
Jgreen added subtasks: Restricted Task, Restricted Task. Jul 16 2015, 5:19 PM
Jgreen closed subtask Restricted Task as Resolved. Sep 8 2016, 2:21 PM
Jgreen added a subtask: Restricted Task.
Jgreen removed a subtask: Restricted Task. Sep 8 2016, 2:24 PM
Jgreen closed subtask Restricted Task as Declined. May 15 2017, 7:33 PM
Jgreen renamed this task from overhaul fundraising cluster monitoring to [Epic] overhaul fundraising cluster monitoring. Jun 21 2018, 6:51 PM
Jgreen closed subtask Restricted Task as Resolved. Oct 22 2018, 7:22 PM
Jgreen closed subtask Restricted Task as Resolved. May 23 2019, 8:39 PM
Aklapper added a subscriber: cwdent.

(Resetting assignee as @cwdent has left WMF)

Jgreen claimed this task.
Jgreen removed a subtask: Restricted Task.
Jgreen moved this task from Backlog to Done on the fundraising-tech-ops board.