set up a looking glass for WMF ASes
Closed, Declined · Public

Description

To help analyse network issues, such as those in T105984 and T62283, we need a looking glass that shows traceroutes and routes as seen from the WMF NOC.

It is generally a best practice for the owner of an AS to provide a looking glass, so that peers can resolve issues more easily.

Event Timeline

80686 raised the priority of this task from to Needs Triage.
80686 updated the task description.
80686 added a project: netops.
80686 added subscribers: 80686, brion, Nemo_bis and 2 others.
Restricted Application added a subscriber: Aklapper.
80686 set Security to None.
faidon triaged this task as Lowest priority. Jul 17 2015, 7:15 PM

That would be nice indeed, but needs a bit of work to do it properly:

  • Opening up a web interface to our routers directly like many others do is a bad idea IMO, as it opens up an attack vector.
  • We could set up a BGP instance on a separate server (e.g. using bird-lg) and peer with our routers. *However*, because we span two ASNs, one of which is a confederation with 2-3 sub-ASes, we'd need to set up multiple instances, which I'm guessing complicates things. It needs further research to see how easy it would be (help welcome :))

In the meantime, unlike most ASNs, we are generally responsive, reachable over IRC (both on our channels and other well-known networker channels) and we have a public bug tracker ;)

After looking at the various looking glasses, bird-lg does indeed seem to be the best option (doesn't need ssh access to the routers, open-source, user-friendly, supports multiple regions).
That's why I set up a POC at https://af-lg.wmflabs.org/. It only includes esams.
It consists of 4 main parts:

  • cr2-esams peering with bird (with next hop self)
  • A bird daemon receiving the full BGP view
  • lgproxy.py, which talks to a single bird daemon; each "region" needs its own lgproxy.py. It's also the script that runs the traceroutes.
  • lg.py, the web interface that relays queries to lgproxy.py
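
To illustrate the relay between the last two parts, here is a minimal sketch of how a front end could pick the lgproxy.py instance for a region and build the request URL. It is not taken from bird-lg's source: the `PROXIES` mapping, the hostname, the port, the function name and the `/bird` and `/traceroute` paths are all assumptions made for the example.

```python
from urllib.parse import urlencode

# Hypothetical mapping of region name -> lgproxy.py endpoint (host:port).
# The hostname and port are placeholders, not the real deployment.
PROXIES = {
    "esams": "lgproxy-esams.example.org:5000",
}


def build_proxy_url(region: str, kind: str, query: str) -> str:
    """Return the URL the web front end would fetch for one region.

    `kind` selects the (assumed) proxy endpoint: "bird" for route
    queries, "traceroute" for traceroutes run from the proxy host.
    """
    if kind not in ("bird", "traceroute"):
        raise ValueError(f"unsupported query kind: {kind}")
    try:
        host = PROXIES[region]
    except KeyError:
        raise ValueError(f"unknown region: {region}") from None
    # urlencode escapes spaces and slashes so the query survives the URL.
    return f"http://{host}/{kind}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(build_proxy_url("esams", "bird", "show route for 91.198.174.0/24"))
```

Because each region has its own proxy, adding a region in this sketch only means adding one entry to the mapping; the front end itself stays unchanged.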

A few current caveats:

  • As bird/bird-lg is running in eqiad, the traceroutes originate from eqiad. For an optimal deployment, we would need to have a bird/lgproxy.py instance in each region. Is there a host we could use for that?
  • Routes are shown with the mention "via 10.68.16.1 on eth0", because that is the route bird uses to reach the next hop ("BGP.next_hop: 91.198.174.244"). As this adds confusion, we could remove it in the code, like others do: http://lg.as5580.net/prefix_detail/all/ipv4?q=172.217.3.163
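
For the second caveat, a rough sketch of what removing the local gateway from the displayed routes could look like. This is only an illustration, not the actual bird-lg code: the function name is made up, and the sample route line (including the protocol name `cr2_esams`) is an assumed approximation of bird's output built from the values quoted above.

```python
import re

# Matches the "via <gateway> on <interface>" fragment that bird prints
# for the locally resolved route towards the BGP next hop.
VIA_RE = re.compile(r"\s+via\s+\S+\s+on\s+\S+")


def strip_local_gateway(output: str) -> str:
    """Drop the local "via ... on ..." fragment from each route line.

    Lines without that fragment (e.g. "BGP.next_hop: ...") are kept as-is.
    """
    return "\n".join(
        VIA_RE.sub("", line, count=1) for line in output.splitlines()
    )


if __name__ == "__main__":
    # Assumed sample output, for illustration only.
    sample = (
        "172.217.3.0/24 via 10.68.16.1 on eth0 [cr2_esams from 91.198.174.244]\n"
        "\tBGP.next_hop: 91.198.174.244"
    )
    print(strip_local_gateway(sample))
```

The BGP.next_hop attribute is left untouched, so the view still shows the next hop announced by the router, which is the part peers care about.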

The next steps, if we want to move forward, would be to write init scripts for lg.py and lgproxy.py, puppetize bird/bird-lg/apache, and have all routers peer with it.

Change 390330 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] [WIP] Bird-lg

https://gerrit.wikimedia.org/r/390330

Gerrit change 390330 is up for review. @faidon? Or anyone else?
It will then need to be deployed on netmon1002/2001.

Note that we peer with RIPE RIS collectors in our POPs, so people can use https://stat.ripe.net/widget/looking-glass as a looking glass.

Change 504233 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Add looking glass CNAMEs

https://gerrit.wikimedia.org/r/504233

Change 390330 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Bird-lg

https://gerrit.wikimedia.org/r/390330

Change 504248 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] acme_chief: Issue birdlg certificate

https://gerrit.wikimedia.org/r/504248

Change 390330 abandoned by Ayounsi:
Bird-lg

Reason:
Not worth pursuing.

https://gerrit.wikimedia.org/r/390330

Change 504248 abandoned by Ayounsi:
acme_chief: Issue birdlg certificate

Reason:
Not worth pursuing.

https://gerrit.wikimedia.org/r/504248

Change 504233 abandoned by Ayounsi:
Add looking glass CNAMEs

Reason:
Not worth pursuing.

https://gerrit.wikimedia.org/r/504233

So far, the benefits of having a (multi-DC) looking glass do not justify the amount of work required to properly deploy and maintain one.

  • Peering with the RIPE RIS provides a looking glass for 3 of our 5 DCs
  • As said by Faidon, we're very quick to reply to NOC email and IRC messages (usually less than 24h)
  • No routing issue would have been resolved faster with a looking glass (as far as I know)

Thanks for considering this and for sharing the analysis.

Mentioned in SAL (#wikimedia-operations) [2019-11-12T16:28:43Z] <XioNoX> setup bgp session from cr2-codfw to multihop RIS collector - T106056