
Retire Torrus
Closed, Resolved · Public

Description

Torrus was down for weeks and none of us noticed.
Torrus fails often enough that the fix is documented on Wikitech: https://wikitech.wikimedia.org/wiki/Torrus#Deadlock_problem
We have no monitoring of Torrus (T87817)

Torrus is configured to graph stats from:

  • switches
  • routers
  • PDUs
  • varnish caches
  • netapps

Most of the above isn't even working.

LibreNMS graphs the same stats for:

  • switches
  • routers
  • PDUs

Ganglia graphs similar stats for:

  • Varnish

Therefore, the only thing unique to Torrus is stats on our two Netapps, nas1001-{a,b}.eqiad.wmnet.

Once the blocking task to graph Netapp stats with LibreNMS is completed, let's get rid of Torrus.

Details

Related Gerrit Patches:
operations/dns : master · Remove torrus.wikimedia.org
operations/puppet : production · torrus: remove role from netmon1001

Event Timeline

Gage created this task. · Jan 29 2015, 2:27 AM
Gage raised the priority of this task from to Needs Triage.
Gage updated the task description.
Gage added a project: acl*sre-team.
Gage added a subscriber: Gage.
Restricted Application added a subscriber: Aklapper. · Jan 29 2015, 2:27 AM
Andrew triaged this task as Medium priority. · Feb 8 2015, 9:33 PM
Andrew set Security to None.

@Gage can torrus be killed now?

Not sure, as @faidon declined my ticket to monitor the Netapp stats with LibreNMS. So those stats are still only being collected by Torrus.

Faidon, what would you like to do?

faidon lowered the priority of this task from Medium to Low. · Mar 23 2015, 4:57 AM

I don't use Torrus so I wouldn't mind it if it was gone. I know @mark uses it/likes it, though, so he will probably be against killing it entirely.

Gage changed the task status from Open to Stalled. · May 6 2015, 6:00 PM
Dzahn added a subscriber: Dzahn. · Jun 17 2015, 9:20 PM

torrus has intermittent "500 Internal Server Error" but mostly works:

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=netmon1001

but probably not that interesting if we want to retire it anyways?

RobH added a subscriber: RobH. · Aug 13 2015, 9:53 PM

As others point out, it's always broken. I used to use it for power planning, but the same data can be pulled from LibreNMS.

Restricted Application added a subscriber: Matanya. · Aug 13 2015, 9:53 PM
RobH assigned this task to mark. · Aug 13 2015, 9:54 PM

I've assigned this to @mark for his confirmation we can kill the torrus service and remove it entirely.

RobH added a comment. · Aug 13 2015, 9:57 PM

Also, when it is broken, it doesn't retain data for the outage period, so historical graphs have large gaps.

Mark will need to confirm for his traffic stats though.

RobH raised the priority of this task from Low to Medium. · Aug 13 2015, 9:58 PM
RobH added a project: Technical-Debt.

It was broken again today, and I fixed it with the usual steps from the Wikitech docs.

+1 on killing it

mark added a comment. · Aug 14 2015, 9:13 AM

The main thing we still need it for is aggregated graphs, particularly for aggregated power usage in our data centers, which LibreNMS doesn't really do. When we have an alternative for that, I think we can retire Torrus.

mark removed mark as the assignee of this task. · Sep 23 2015, 12:59 PM

Change 245890 had a related patch set uploaded (by Dzahn):
torrus: remove role from netmon1001

https://gerrit.wikimedia.org/r/245890

Dzahn added a comment. · Oct 14 2015, 2:06 PM

As @akosiaris pointed out yesterday, the aggregated graphs don't seem to work currently. That would mean we don't have this feature anyway, and the others have been replaced by LibreNMS.

Change 245890 abandoned by Dzahn:
torrus: remove role from netmon1001

Reason:
We are still keeping it for aggregated graphs, per T87840#1538794.

https://gerrit.wikimedia.org/r/245890

Dzahn added a subscriber: Danny_B. · Jan 23 2016, 1:44 AM

@Danny_B This is not actually ready to be retired yet.

Danny_B removed a subscriber: Danny_B. · May 28 2016, 3:10 PM
ayounsi added a subscriber: ayounsi. · May 2 2017, 7:46 AM

Note that currently:

$ curl -v https://torrus.wikimedia.org/

returns a 301 redirect with an empty host: "Location: https:///"

Change 351265 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Remove torrus.wikimedia.org

https://gerrit.wikimedia.org/r/351265

Torrus functions for PDU-related aggregated metrics have been replaced by Prometheus in T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring. The remaining TODO is to restore the Torrus RRDs from Bacula and import them into Graphite for historical reference (Graphite because, sadly, Prometheus doesn't support backfilling values).

How are we going to look at trends across both historical and current data? Also, what are we going to do with data that we want to keep past the Prometheus retention period? The power usage data is useful to track over years, so perhaps we should just keep it in a single store suited to long-term retention (Graphite?).

Grafana supports mixed datasources, so in theory we can combine Graphite and Prometheus in a single graph/panel, though I haven't tried it. WRT retention: indeed, data would expire after the one-year retention of the global instance. I think it makes sense to also have some or all of the data in Graphite for long-term storage in this case.
As for how to achieve this, we could ask snmp_exporter for data, parse it with Prometheus' Python client, and push the result into Graphite.
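A minimal sketch of that last idea, under stated assumptions: it converts Prometheus text-format samples (as served by snmp_exporter) into Graphite plaintext-protocol lines ("path value timestamp"). To stay stdlib-only it uses a small hand-rolled regex parser rather than the Prometheus Python client mentioned above, and it only handles simple sample lines. The function names, the metric and label names in the usage note, the "power" prefix, and the Graphite hostname are all hypothetical, not taken from our setup.

```python
import math
import re
import socket

# One sample line of the Prometheus text format: name{label="value",...} 12.5
# (simplified: good enough for plain gauge/counter samples)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sanitize(part):
    """Replace characters that would break a dotted Graphite path."""
    return re.sub(r'[^A-Za-z0-9_-]', '_', part)


def to_graphite_lines(exposition_text, prefix, timestamp):
    """Turn Prometheus text-format samples into Graphite plaintext lines."""
    lines = []
    for raw in exposition_text.splitlines():
        raw = raw.strip()
        if not raw or raw.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(raw)
        if match is None:
            continue
        try:
            value = float(match.group('value'))
        except ValueError:
            continue
        if not math.isfinite(value):
            continue  # NaN/Inf samples are dropped in this sketch
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        # Path layout (an assumption): prefix, label values sorted by
        # label name, then the metric name.
        parts = [prefix]
        parts += [sanitize(labels[key]) for key in sorted(labels)]
        parts.append(sanitize(match.group('name')))
        lines.append('%s %s %d' % ('.'.join(parts), value, timestamp))
    return lines


def send_to_graphite(lines, host, port=2003):
    """Send lines to a Graphite plaintext listener (2003 is the default port)."""
    payload = ('\n'.join(lines) + '\n').encode('utf-8')
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload)
```

Usage would look roughly like `send_to_graphite(to_graphite_lines(text, 'power', int(time.time())), 'graphite.example.org')`, where `text` is the body fetched from the exporter. Putting label values ahead of the metric name in the path is only one possible layout; whatever is chosen should stay stable, since Graphite keys its long-term storage on the path.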

faidon moved this task from Inbox to In progress on the observability board.
faidon reassigned this task from akosiaris to fgiunchedi. · Jul 10 2017, 3:19 PM
faidon added a subscriber: akosiaris.
faidon closed this task as Resolved. · Jul 20 2017, 1:28 PM

So @godog mentioned today that we can't actually recover the Torrus data from Bacula, as it was lost forever :(

We're still lacking a good solution for monitoring our power loads, but this is primarily tracked in T148541, so I think we can resolve this task.

Change 351265 abandoned by Alexandros Kosiaris:
Remove torrus.wikimedia.org

Reason:
Already done

https://gerrit.wikimedia.org/r/351265

I've merged https://gerrit.wikimedia.org/r/351276 today (I should have merged it months ago); this finally kills the module and role.