
Investigate Capirca
Open, MediumPublic

Description

As we only operate a single network vendor, we currently have 2 types of network ACLs:

  1. SRX security policies, zone to zone stateful rules on the payment firewall as well as the management routers
  2. Firewall filters, stateless rules applied to L3 interfaces on all routers (core, payment, management)
  • Both types have a distinct syntax
  • Security policies are managed either by a homemade tool written by the fundraising-tech team, or manually for the management routers
  • Firewall filters are centrally managed (in Homer) but manually written
  • Any SRE (and fr-tech) can (and does) write new ACLs; pushes to the devices are done either by the SREs (after netops review) or by Netops directly

This raises the following problems:

  • Multiple tools and processes, leading to a higher learning curve and longer turnaround time on change requests
  • Syntax typos
    • Not always caught until being pushed to prod (and rejected by the device linter)
  • Entity typos
    • Wrong IP or wrong prefix length
  • Stale rules
  • Lack of consistency
    • IPs defined in prefix-lists or directly in the rules
    • Discrepancies between v4 and v6 rules

Now that Homer automates most network configuration, the natural next step is to provide a standardized and automated way of managing ACLs, leveraging Netbox as the source of truth.
Looking at the landscape of existing open-source tools, one stands out.

Capirca is an actively maintained open source tool made by Google.

It consists of different moving parts:

  • Definition files
    • Services: port/protocol, port-ranges/protocol, nestable sets of services
    • Network: IP, prefixes, nestable sets of IP/prefixes
  • Policy files
    • ACL rules using the previously defined services and networks in a custom format
  • Capirca library
    • Takes the above files as input and generates ACLs in a format compatible with a given platform (Junos, SRX, iptables, etc)

An example (high level) usage could be:

  1. Define the services (ports/protocols) once for the whole infrastructure
  2. Populate the network definition file partially from Netbox
    • At least device IPs, with groups created per host prefix (all bast, all cp, etc)
    • Potentially network prefixes as well
  3. Manually define networks for the remaining use cases
  4. Convert existing policies to their Capirca format
    • Cleaning them up in the process
    • Most likely manually
  5. In Homer, for each device (or device roles) define which policies to apply
  6. When running Homer, ACLs will be added to the pushed config and diffs

This would solve all the problems listed above, either directly or thanks to the audit required to convert them.

  • Multiple tools and processes
    • Consolidated to a single tool, potentially 2 processes if used by frack and required for PCI compliance
  • Syntax typos
    • Caught by Capirca locally during the execution
  • Entity typos
    • Less risk as defined only once, and potentially coming from Netbox
  • Stale rules
    • Will be removed automatically when a host/IP is removed
  • Lack of consistency
    • A “network” set can contain both v4 and v6 IPs, which keeps the two families in sync; generating the sets from Netbox would further ensure consistency
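As an illustration of how entity typos could be caught locally before anything reaches a device, here is a small sketch using only Python's standard `ipaddress` module. The `check_prefix` helper is hypothetical and is not part of Capirca or Homer; it only shows the kind of check that becomes possible once prefixes live in one definition file.

```python
import ipaddress


def check_prefix(text):
    """Return None if `text` is a valid network prefix, else an error message."""
    try:
        # strict=True rejects prefixes with host bits set, e.g. 192.0.2.1/24,
        # which usually indicates a wrong IP or a wrong prefix length.
        ipaddress.ip_network(text, strict=True)
    except ValueError as exc:
        return str(exc)
    return None


for candidate in ["192.0.2.0/24", "192.0.2.1/24", "192.0.2.300/32"]:
    error = check_prefix(candidate)
    print(candidate, "OK" if error is None else f"rejected: {error}")
```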

In addition it would bring the following advantages (not all currently needed):

  • Multi-platform support, which might make things easier if we move away from Junos
  • Shadow check (a policy rule making the following one useless)
  • Optimization (e.g. 2 contiguous /32s are merged into a /31)
  • Specific flow testing (“can IPx reach IPy through that ACL?”)
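To make the optimization and shadow-check ideas concrete, here is a rough sketch using only Python's standard library. It is not Capirca's implementation: the `optimize` and `find_shadowed` helpers are hypothetical, and the shadow check is deliberately simplified to a single destination prefix per term, whereas a real check would also consider sources, ports, protocols and actions.

```python
import ipaddress


def optimize(prefixes):
    """Merge adjacent or overlapping prefixes, one address family at a time."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    merged = []
    for version in (4, 6):
        # collapse_addresses() refuses mixed families, so handle each separately.
        family = [n for n in nets if n.version == version]
        merged.extend(str(n) for n in ipaddress.collapse_addresses(family))
    return merged


def find_shadowed(terms):
    """Return (term, shadowing_term) pairs for terms fully covered by an earlier one.

    `terms` is an ordered list of (name, destination_prefix) tuples.
    """
    seen = []
    shadowed = []
    for name, prefix in terms:
        net = ipaddress.ip_network(prefix)
        for earlier_name, earlier_net in seen:
            if net.version == earlier_net.version and net.subnet_of(earlier_net):
                shadowed.append((name, earlier_name))
                break
        seen.append((name, net))
    return shadowed


print(optimize(["192.0.2.10/32", "192.0.2.11/32"]))  # ['192.0.2.10/31']
print(find_shadowed([
    ("allow-net", "192.0.2.0/24"),
    ("allow-host", "192.0.2.5/32"),   # already covered by allow-net
    ("allow-other", "198.51.100.0/24"),
]))
```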

On the other hand, there are some possible limitations:

  • Network prefixes don’t have names in Netbox, so assigning a relevant variable name might be challenging
    • Existing fields could be leveraged (roles, sites, status, description, etc)
  • Fetching all host IPs from Netbox at run time might be too slow to be an option
    • Might be better to “pre-compile” the list on Netbox hosts using a plugin
  • Policies use a custom syntax, something like YAML would have been better
  • Capirca’s code is complex (at least to me, even though I sent PRs years ago)

Other concerns

  • It might not be worth the effort as our ACLs don’t change often
  • There are rumors that Capirca’s successor will be open-sourced by Google in the future

The scope of this task is to discuss network ACLs management in general, and evaluate Capirca for that role. Possibly other tools if any.

Event Timeline

ayounsi triaged this task as Medium priority. Feb 4 2021, 10:59 AM
ayounsi created this task.

Change 663535 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Capirca POC

https://gerrit.wikimedia.org/r/663535

Change 663536 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/homer@master] Capirca POC

https://gerrit.wikimedia.org/r/663536

Limitations identified:

Some ACLs currently have Jinja code in them, which is not possible through Capirca.
The easiest cases have been (or can be) mitigated by either:

  • removing the feature flag
  • moving the logic to Homer/Capirca. For example, the ACLs needed only in eqiad/codfw can be specified with a local override:
capirca:
  - cr   #<- normal core router ACLs
  - cr-cloud   # <- core site addon

But one is currently more tricky:

{% if ping_offload_redirect | d(false)  %}
term offload-ping4 {
    from {
        destination-address {
            {{ ping_offload_vip }};
        }
        protocol icmp;
        icmp-type echo-request;
    }
    then {
        next-ip {{ ping_offload_redirect }}/32;
    }
}
{% endif %}

This is because it embeds site-specific IPs.
One possible workaround is to keep the term in a dedicated filter (in the current firewall.conf file) and import it in a generic way:

term offload-ping4  {
    filter offload-ping4 ;
}

A side advantage is that it can also be used in filter transport-in4.

The 2nd limitation is that we use per-family prefix-lists (e.g. wikimedia4 and wikimedia6), while Capirca is not able to pick the relevant one depending on the IP version in use.
One workaround (used in this POC) is to not use prefix-lists in those cases, and instead have Capirca generate destination-address entries based on the definition files.
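The idea behind that workaround can be sketched as follows: a single definition holding both families is split by IP version, so each generated filter only receives the addresses it can match. This is an illustration using Python's standard `ipaddress` module, not Capirca's internal logic; `split_by_family` and the sample prefixes are hypothetical.

```python
import ipaddress


def split_by_family(prefixes):
    """Split a mixed list of prefixes into {4: [...], 6: [...]}, preserving order."""
    families = {4: [], 6: []}
    for text in prefixes:
        net = ipaddress.ip_network(text)
        families[net.version].append(str(net))
    return families


# One hypothetical mixed definition, standing in for a combined v4/v6 network set.
mixed = ["192.0.2.0/24", "2001:db8::/32", "198.51.100.0/24"]
by_family = split_by_family(mixed)
print("v4 filter destination-address:", by_family[4])
print("v6 filter destination-address:", by_family[6])
```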

On the positive side, the current state of the POC doesn't show significant blockers. ACLs can be ported to Capirca progressively (it doesn't need to be all or nothing). It also supports ACLs from homer-private, although this is untested.

The next thing to test is how to generate the network definitions from Netbox, most likely using a plugin, and fetch them from Netbox at run time. https://gerrit.wikimedia.org/r/666876

If deemed viable:

  • put the Capirca code in its own Homer class and improve error handling
  • test homer-private ACLs (and maybe network definitions) more extensively
  • figure out how to package Capirca (pypi version is old)
  • streamline directory structure
  • transition more ACLs
  • write doc/train people

Note that most of the above is out of scope for the POC.

Change 666876 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/netbox-extras@master] Add Capirca definitions exporter

https://gerrit.wikimedia.org/r/666876

Mentioned in SAL (#wikimedia-operations) [2021-03-22T07:51:24Z] <elukey> stop/start mariadb instances on dbstore1004 to reduce buffer pool memory settings - T273865

Change 666876 merged by Ayounsi:
[operations/software/netbox-extras@master] Add Capirca definitions exporter

https://gerrit.wikimedia.org/r/666876

Change 663536 merged by jenkins-bot:

[operations/software/homer@master] Add Capirca support to Homer

https://gerrit.wikimedia.org/r/663536