
[Epic] overhaul fundraising cluster monitoring
Open, Low, Public

Description

fundraising hosts use nsca + icinga, and ganglia for monitoring and reporting
nsca seems to be poorly maintained and is currently broken on Trusty hosts
as an org we're talking about moving away from ganglia
this task is to fix and update monitoring to be reliable and better supported going forward

random notes:

Related Objects

Status     Subtype   Assigned     Task
Open                 None
Resolved             Jgreen
Open                 None
Resolved             cwdent
Resolved             cwdent
Declined             None
Resolved             cwdent
Resolved             ayounsi
Resolved             Jgreen
Resolved   Spike     Jgreen
Resolved             Jgreen
Declined             None
Resolved             cwdent
Resolved             cwdent
Stalled              None
Declined             None
Resolved             cwdent
Declined             None
Open                 None
Declined             None
Resolved             Jgreen
Open                 None
Declined             None
Resolved             Jgreen
Resolved             None
Open                 Dwisehaupt
Resolved             Jgreen
Open                 None
Open                 None

Event Timeline

Jgreen created this task. Mar 4 2015, 3:09 PM
Jgreen claimed this task.
Jgreen raised the priority of this task from to Low.
Jgreen updated the task description.
Jgreen added a subscriber: Jgreen.
Restricted Application added a subscriber: Aklapper. Mar 4 2015, 3:09 PM
Jgreen added a subtask: Restricted Task. Mar 4 2015, 3:13 PM
Krenair added a subscriber: Krenair. Mar 4 2015, 5:53 PM
Jgreen set Security to None.
Qgil added a comment. May 18 2015, 11:12 AM

It is time to promote Wikimedia-Hackathon-2015 activities in the program (training sessions and meetings) and main wiki page (hacking projects and other ongoing activities). Follow the instructions, please. If you have questions about this message, ask here.

Jgreen added a subtask: Restricted Task. May 24 2015, 10:34 AM
Qgil added a comment. Jun 2 2015, 8:42 AM

Did anyone work on this task during Wikimedia-Hackathon-2015? If so, please post an update with the results. Otherwise remove the label.

Looked into why ganglia has stopped reporting for most of the fundraising cluster, determined that the pfw config is likely blocking multicast, and opened T100150.

Spent some time researching monitoring options and spoke to Filippo about his research toward a monitoring strategy for production, specifically regarding Prometheus, which sounds promising.

Looked at possibilities for pulling data from ganglia aggregators into Prometheus. Ganglia's aggregator listens on a TCP socket and responds with an XML report of metric data for all hosts in the aggregated cluster. Studied that report format and wrote a simple Python XML parser, thinking toward reformatting that data into a format Prometheus can use.

Qgil added a comment. Jun 4 2015, 10:43 AM

We are trying to help all open tasks listed under "Work continues after Lyon" at the Wikimedia Hackathon 2015 workboard find their best way forward.

* If you are participating in Wikimania, consider adding the #Wikimania-Hackathon-2015 project to get this task in that loop, which is about to start.
* If you think this project could welcome help from a dedicated Google Summer of Code or Outreachy intern, or from an Individual Engagement Grant, add the Possible-Tech-Projects project.
* If you would like to receive some other type of support (organizing a Tech Talk, establishing contacts with existing developer teams in Wikimedia or elsewhere, travel sponsorship for a related activity... you name it), please create a subtask explaining your request and associate it with Developer-Advocacy (or you can start by commenting here if you prefer).
* Keeping the description, priority and assigned fields up to date always helps. :)

For some context about this message, see T101151: Evaluate which projects showcased at the Wikimedia Hackathon should be supported further. It is the last communication related to Wikimedia-Hackathon-2015 that we will post here.

Qgil added a comment. Jul 3 2015, 10:25 AM

Please confirm and promote this activity by assigning it to its owner, listing it or scheduling it at the Hackathon wiki page and by placing it in the right column at #Wikimania-Hackathon-2015. Thank you!

Qgil added a comment. Jul 7 2015, 1:44 PM

@Jgreen, are you planning to work on this task at Wikimania?

Jgreen added a comment. Jul 7 2015, 1:47 PM

No, unfortunately I won't be attending this year.

Jgreen closed subtask Restricted Task as Resolved. Jul 10 2015, 5:49 PM
Jgreen added subtasks: Restricted Task, Restricted Task. Jul 16 2015, 5:19 PM
Jgreen closed subtask Restricted Task as Resolved. Sep 8 2016, 2:21 PM
Jgreen added a subtask: Restricted Task.
Jgreen removed a subtask: Restricted Task. Sep 8 2016, 2:24 PM
Qgil removed a subscriber: Qgil. Sep 9 2016, 11:50 AM
Jgreen closed subtask Restricted Task as Declined. May 15 2017, 7:33 PM
Jgreen removed Jgreen as the assignee of this task. Jul 11 2017, 8:53 PM
Jgreen changed the status of subtask T176924: Create alerts for rsyslog rate limiting from Open to Stalled. May 24 2018, 3:39 PM
Jgreen renamed this task from overhaul fundraising cluster monitoring to [Epic] overhaul fundraising cluster monitoring. Jun 21 2018, 6:51 PM
Jgreen closed subtask Restricted Task as Resolved. Oct 22 2018, 7:22 PM
Jgreen closed subtask Restricted Task as Resolved. May 23 2019, 8:39 PM
Aklapper removed cwdent as the assignee of this task. Jul 1 2019, 7:18 AM
Aklapper added a subscriber: cwdent.

(Resetting assignee as @cwdent has left WMF)

Jgreen updated the task description. Feb 11 2020, 7:20 PM
Dwisehaupt changed the status of subtask T234918: set up prometheus server snapshots for backups from Duplicate to Resolved. Mar 17 2020, 3:58 PM
fgiunchedi moved this task from Inbox to Radar on the observability board. Mon, Jul 20, 1:14 PM