Show IPs matching a list of IP subnets in Webrequest data
Closed, Resolved · Public

Description

Context on RPKI: https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastructure

Working on RPKI Validation, I'd like to be able to know how many hits/webrequests we get from v4 and v6 IPs within a given set of IP subnets (https://nusenu.github.io/RPKI-Observatory/unreachable-networks.html).

For context, the end goal of RPKI validation would be to reject routes with invalid origins, which means the affected subnets would be unable to reach our network.
Having visibility into how many requests this represents would be useful for making a decision down the road.

Event Timeline

ayounsi created this task. Apr 10 2019, 5:46 PM
Restricted Application added a subscriber: Aklapper. Apr 10 2019, 5:46 PM

@ayounsi: Any tags that would display this task on workboards are welcome. :) netops? Analytics?

elukey removed the point value for this task.

I had a chat with Arzhel on IRC. The scope of the project would be to check, among webrequest data, if we have traffic coming from IPs registered in subnets like:

https://nusenu.github.io/RPKI-Observatory/unreachable_prefixes-v4.html

The tricky part, in my opinion, is filtering webrequest IPs by whether they fall within a given network subnet. This is not as straightforward as string matching; it involves some logic with bitwise operators. We could either write a new UDF or use an existing one, and run a simple Hive query with it against a few hours of webrequest text/upload data.
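To illustrate the bitwise logic mentioned above, here is a minimal sketch that uses only Python's standard `ipaddress` module. The helper name `ip_in_prefix` and the documentation-range addresses are assumptions made for the example; they are not part of the code used in this task.

```python
import ipaddress


def ip_in_prefix(addr_str, prefix_str):
    """Return True if addr_str falls inside prefix_str (CIDR notation)."""
    addr = ipaddress.ip_address(addr_str)
    net = ipaddress.ip_network(prefix_str)
    # IPv4 and IPv6 integers must never be compared with each other.
    if addr.version != net.version:
        return False
    # Keep only the network bits of the address and compare them
    # with the network address of the prefix.
    return (int(addr) & int(net.netmask)) == int(net.network_address)


print(ip_in_prefix("192.0.2.10", "192.0.2.0/24"))
# "192.0.20.1" starts with the string "192.0.2" but is outside the /24,
# which is why plain string matching is not enough.
print(ip_in_prefix("192.0.20.1", "192.0.2.0/24"))
print(ip_in_prefix("2001:db8::1", "2001:db8::/32"))
```

The version check matters because an IPv6 address converted to an integer could otherwise match an IPv4 mask by accident.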

elukey renamed this task from Show hits matching a list of IP subnets to Show IPs matching a list of IP subnets in Webrequest data. Apr 11 2019, 10:54 AM
fdans triaged this task as Normal priority. Apr 11 2019, 4:24 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans added a subscriber: JAllemandou.
elukey updated the task description. May 28 2019, 7:47 AM
elukey added a comment (edited). May 28 2019, 3:00 PM

This is the result of a quick hacking session that happened today (credits to Joseph for the code):

import ipaddress
from pyspark.sql.types import ArrayType, StringType

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# List of RPKI prefixes with INVALID status
rpki_invalid_prefixes = ['192.0.0.0/16']

# Parse each prefix into an IPv4Network or IPv6Network object.
network_infos = [ipaddress.ip_network(prefix) for prefix in rpki_invalid_prefixes]

# Integer form of each network address and netmask.
# The network address and the prefix length together uniquely define a network.
networks = [int(n.network_address) for n in network_infos]
masks = [int(n.netmask) for n in network_infos]

# IP version (4 or 6) of each prefix, so that an IPv6 address is never
# compared against an IPv4 mask (which could produce false matches).
versions = [n.version for n in network_infos]

# List of (network, mask, version, prefix) tuples shared among workers
networks_and_masks = list(zip(networks, masks, versions, rpki_invalid_prefixes))

# Define a UDF for Spark that checks whether an IP address
# belongs to any of the RPKI invalid prefixes.
# Returns the list of matching RPKI invalid prefixes (empty if none match).
def in_subnets(addr_str):
    try:
        addr = ipaddress.ip_address(addr_str)
    except ValueError:
        # Malformed or missing IP address
        return []
    a = int(addr)
    return [prefix
            for (network, mask, version, prefix) in networks_and_masks
            if version == addr.version and (a & mask) == network]

spark.udf.register("in_subnets", in_subnets, ArrayType(StringType()))

spark.sql('''
SELECT
  in_subnets(ip) as prefix, AS, count(*) as hits
FROM (
  SELECT DISTINCT
    ip, isp_data['autonomous_system_number'] as AS
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
      AND year = 2019
      AND month = 5
      AND day = 27
      AND hour = 0
) t
WHERE size(in_subnets(ip)) > 0
GROUP BY prefix, AS
ORDER BY hits DESC
''').show(100, False)

Executed with:

spark2-submit --master yarn \
  --executor-memory 8G \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.maxExecutors=32 \
  --driver-memory 4G \
  --conf spark.executor.memoryOverhead=2048 \
  --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar name-of-the-above-file.py

The result of the Spark script is a list of (ip, prefix) tuples (IP addresses matching an RPKI invalid prefix). @ayounsi, we'll need to review the code together, but we are basically ready to start crunching data; the only missing piece is the long list of RPKI invalid prefixes.

the only missing thing is the long list of RPKI invalid prefixes

Very quick and dirty script:

import urllib.request
import re
import pprint

urls = ['https://raw.githubusercontent.com/nusenu/RPKI-Observatory/master/pages/unreachable_autogen/unreachable_prefixes-v4.md',
        'https://raw.githubusercontent.com/nusenu/RPKI-Observatory/master/pages/unreachable_autogen/unreachable_prefixes-v6.md']

prefix_list = []

for url in urls:
    response = urllib.request.urlopen(url)
    for line in response.readlines():
        prefix_search = re.search(r"\|\s*\d+\s\|\s\[(\d.*\/\d+)\].*", line.decode('utf-8'))
        if prefix_search:
            prefix_list.append(prefix_search.group(1))

pprint.pprint(prefix_list)

Outputs:

['115.168.0.0/14',
 '219.70.0.0/15',
 '8.208.0.0/16',
 '219.68.0.0/16',
 '82.125.0.0/17',
 '90.52.0.0/17',
 '90.52.128.0/17',
 '219.69.0.0/17',
 '8.209.64.0/18',
 '122.254.64.0/18',
 '123.252.64.0/18',
 '190.184.128.0/18',
 '123.252.0.0/19',
 '190.96.160.0/19',
 '200.35.0.0/19',
 '112.206.96.0/20',
 '112.206.112.0/20',
 '122.179.192.0/20',
 '122.179.208.0/20',
 '122.179.224.0/20',
 '122.179.240.0/20',
 '122.202.64.0/20',
 '123.236.16.0/20',
 '123.237.144.0/20',
 '123.238.48.0/20',
 '123.238.64.0/20',
 '123.239.80.0/20',
 '123.252.48.0/20',
 '124.125.0.0/20',
 '179.3.48.0/20',
 '179.3.96.0/20',
 '179.3.112.0/20',
 '188.253.16.0/20',
 '190.96.208.0/20',
 '190.211.176.0/20',
 '200.25.224.0/20',
 '200.25.240.0/20',
 '85.255.176.0/21',
 '93.110.64.0/21',
 '109.122.240.0/21',
 '121.59.8.0/21',
 '123.237.184.0/21',
 '123.238.0.0/21',
 '170.33.16.0/21',
 '170.33.24.0/21',
 '170.33.32.0/21',
 '179.3.0.0/21',
 '179.3.8.0/21',
 '179.3.16.0/21',
 '179.3.64.0/21',
 '179.3.72.0/21',
 '179.3.80.0/21',
 '179.3.88.0/21',
 '181.225.80.0/21',
 '190.52.48.0/21',
 '190.112.64.0/21',
 '190.112.72.0/21',
 '190.112.80.0/21',
 '190.112.88.0/21',
 '202.57.96.0/21',
 '202.57.120.0/21',
 '31.155.64.0/22',
 '31.155.104.0/22',
 '31.184.204.0/22',
 '36.0.4.0/22',
 '43.224.124.0/22',
 '43.230.56.0/22',
 '43.242.0.0/22',
 '45.5.68.0/22',
 '45.160.12.0/22',
 '45.162.72.0/22',
 '45.239.120.0/22',
 '45.239.208.0/22',
 '46.143.204.0/22',
 '80.12.124.0/22',
 '86.57.96.0/22',
 '86.57.100.0/22',
 '86.57.104.0/22',
 '86.57.108.0/22',
 '86.57.116.0/22',
 '91.192.96.0/22',
 '91.200.212.0/22',
 '93.110.84.0/22',
 '93.110.88.0/22',
 '95.82.4.0/22',
 '95.82.8.0/22',
 '95.82.12.0/22',
 '95.82.16.0/22',
 '95.82.20.0/22',
 '95.82.24.0/22',
 '95.82.28.0/22',
 '95.82.32.0/22',
 '95.82.36.0/22',
 '95.82.40.0/22',
 '95.82.44.0/22',
 '95.82.48.0/22',
 '95.82.52.0/22',
 '95.82.56.0/22',
 '95.82.60.0/22',
 '103.11.32.0/22',
 '103.27.76.0/22',
 '103.49.84.0/22',
 '103.74.176.0/22',
 '103.76.40.0/22',
 '103.112.116.0/22',
 '103.126.148.0/22',
 '103.131.172.0/22',
 '121.59.4.0/22',
 '123.237.32.0/22',
 '123.237.40.0/22',
 '123.237.68.0/22',
 '123.238.152.0/22',
 '123.238.236.0/22',
 '125.5.28.0/22',
 '146.196.88.0/22',
 '150.107.48.0/22',
 '170.247.160.0/22',
 '179.61.108.0/22',
 '185.4.116.0/22',
 '185.31.44.0/22',
 '185.90.224.0/22',
 '185.101.188.0/22',
 '185.105.120.0/22',
 '185.107.248.0/22',
 '185.141.36.0/22',
 '185.170.236.0/22',
 '185.199.36.0/22',
 '185.225.180.0/22',
 '186.144.32.0/22',
 '188.3.80.0/22',
 '190.111.120.0/22',
 '192.231.116.0/22',
 '193.20.64.0/22',
 '202.57.116.0/22',
 '208.79.44.0/22',
 '212.33.192.0/22',
 '212.33.204.0/22',
 '220.224.244.0/22',
 '8.209.10.0/23',
 '27.54.116.0/23',
 '27.54.118.0/23',
 '31.155.68.0/23',
 '31.155.88.0/23',
 '31.155.92.0/23',
 '31.155.96.0/23',
 '31.155.100.0/23',
 '31.155.102.0/23',
 '31.155.108.0/23',
 '31.155.136.0/23',
 '31.155.138.0/23',
 '31.155.232.0/23',
 '31.155.234.0/23',
 '31.155.236.0/23',
 '31.155.238.0/23',
 '31.155.248.0/23',
 '31.155.250.0/23',
 '31.155.252.0/23',
 '31.155.254.0/23',
 '31.206.252.0/23',
 '31.206.254.0/23',
 '49.156.0.0/23',
 '49.156.2.0/23',
 '82.115.26.0/23',
 '87.247.168.0/23',
 '87.247.170.0/23',
 '89.42.148.0/23',
 '89.44.44.0/23',
 '91.108.194.0/23',
 '91.108.196.0/23',
 '91.195.34.0/23',
 '93.110.92.0/23',
 '93.110.94.0/23',
 '93.113.132.0/23',
 '93.119.16.0/23',
 '95.65.232.0/23',
 '103.113.202.0/23',
 '103.228.202.0/23',
 '115.146.128.0/23',
 '120.29.64.0/23',
 '120.29.108.0/23',
 '123.236.4.0/23',
 '123.237.36.0/23',
 '123.237.38.0/23',
 '123.238.160.0/23',
 '124.125.118.0/23',
 '125.5.24.0/23',
 '125.5.26.0/23',
 '138.121.240.0/23',
 '149.255.230.0/23',
 '150.116.54.0/23',
 '154.127.54.0/23',
 '176.32.48.0/23',
 '178.157.0.0/23',
 '181.13.56.0/23',
 '185.145.88.0/23',
 '185.198.16.0/23',
 '185.198.18.0/23',
 '188.3.8.0/23',
 '188.3.10.0/23',
 '188.3.12.0/23',
 '188.3.14.0/23',
 '188.3.124.0/23',
 '188.3.126.0/23',
 '188.3.192.0/23',
 '188.3.194.0/23',
 '188.3.196.0/23',
 '188.3.198.0/23',
 '188.253.124.0/23',
 '190.156.206.0/23',
 '193.33.32.0/23',
 '193.169.136.0/23',
 '193.239.140.0/23',
 '202.57.108.0/23',
 '202.57.114.0/23',
 '209.237.170.0/23',
 '212.33.202.0/23',
 '213.186.144.0/23',
 '213.186.146.0/23',
 '213.186.148.0/23',
 '213.186.150.0/23',
 '213.186.152.0/23',
 '213.186.154.0/23',
 '213.186.156.0/23',
 '213.186.158.0/23',
 '220.226.236.0/23',
 '220.226.252.0/23',
 '223.25.8.0/23',
 '223.25.10.0/23',
 '223.25.16.0/23',
 '223.25.18.0/23',
 '223.25.24.0/23',
 '223.25.26.0/23',
 '5.102.134.0/24',
 '5.159.48.0/24',
 '5.159.49.0/24',
 '5.159.50.0/24',
 '5.159.51.0/24',
 '5.159.52.0/24',
 '5.159.53.0/24',
 '5.159.54.0/24',
 '5.159.55.0/24',
 '5.253.244.0/24',
 '5.253.245.0/24',
 '5.253.246.0/24',
 '5.253.247.0/24',
 '5.254.49.0/24',
 '23.139.0.0/24',
 '31.132.36.0/24',
 '31.155.70.0/24',
 '31.155.71.0/24',
 '31.155.90.0/24',
 '31.155.91.0/24',
 '31.155.94.0/24',
 '31.155.95.0/24',
 '31.155.98.0/24',
 '31.155.99.0/24',
 '31.155.110.0/24',
 '31.155.111.0/24',
 '31.155.120.0/24',
 '31.206.241.0/24',
 '31.206.243.0/24',
 '31.223.76.0/24',
 '37.77.173.0/24',
 '43.245.92.0/24',
 '43.245.93.0/24',
 '43.245.94.0/24',
 '43.245.95.0/24',
 '45.80.172.0/24',
 '45.80.173.0/24',
 '45.80.174.0/24',
 '45.80.175.0/24',
 '45.118.70.0/24',
 '45.161.44.0/24',
 '45.161.45.0/24',
 '45.161.46.0/24',
 '45.161.47.0/24',
 '45.224.202.0/24',
 '45.225.152.0/24',
 '45.225.153.0/24',
 '45.227.254.0/24',
 '45.229.168.0/24',
 '45.229.170.0/24',
 '45.229.171.0/24',
 '45.229.248.0/24',
 '45.229.249.0/24',
 '45.229.250.0/24',
 '45.229.251.0/24',
 '46.2.78.0/24',
 '46.2.79.0/24',
 '46.2.124.0/24',
 '46.2.125.0/24',
 '46.28.242.0/24',
 '46.33.52.0/24',
 '46.102.106.0/24',
 '46.143.208.0/24',
 '46.143.209.0/24',
 '46.143.211.0/24',
 '46.143.244.0/24',
 '46.143.245.0/24',
 '49.128.160.0/24',
 '49.128.161.0/24',
 '49.128.162.0/24',
 '49.128.163.0/24',
 '49.128.164.0/24',
 '49.128.165.0/24',
 '49.128.166.0/24',
 '49.128.167.0/24',
 '49.128.168.0/24',
 '49.128.169.0/24',
 '49.128.170.0/24',
 '49.128.171.0/24',
 '49.128.172.0/24',
 '49.128.173.0/24',
 '49.128.174.0/24',
 '49.128.175.0/24',
 '62.100.211.0/24',
 '66.97.32.0/24',
 '66.97.33.0/24',
 '66.97.34.0/24',
 '66.97.35.0/24',
 '66.97.36.0/24',
 '66.97.37.0/24',
 '66.97.38.0/24',
 '66.97.39.0/24',
 '66.97.40.0/24',
 '68.64.234.0/24',
 '77.240.84.0/24',
 '77.240.85.0/24',
 '77.246.64.0/24',
 '77.246.65.0/24',
 '77.246.66.0/24',
 '77.246.67.0/24',
 '77.246.69.0/24',
 '79.170.23.0/24',
 '80.77.184.0/24',
 '80.77.190.0/24',
 '82.97.240.0/24',
 '82.97.241.0/24',
 '82.97.242.0/24',
 '82.97.243.0/24',
 '82.97.244.0/24',
 '82.97.245.0/24',
 '82.97.246.0/24',
 '82.97.247.0/24',
 '82.97.248.0/24',
 '82.97.249.0/24',
 '82.97.250.0/24',
 '82.97.251.0/24',
 '82.97.252.0/24',
 '82.97.253.0/24',
 '82.97.254.0/24',
 '82.97.255.0/24',
 '82.115.7.0/24',
 '82.115.17.0/24',
 '82.115.18.0/24',
 '82.115.19.0/24',
 '82.115.24.0/24',
 '82.115.25.0/24',
 '83.221.20.0/24',
 '83.221.21.0/24',
 '83.221.22.0/24',
 '85.209.76.0/24',
 '85.209.77.0/24',
 '85.209.78.0/24',
 '85.209.79.0/24',
 '86.62.6.0/24',
 '89.41.244.0/24',
 '89.41.245.0/24',
 '89.41.246.0/24',
 '89.41.247.0/24',
 '89.43.92.0/24',
 '89.43.93.0/24',
 '89.43.94.0/24',
 '89.43.95.0/24',
 '89.43.96.0/24',
 '89.43.97.0/24',
 '89.43.98.0/24',
 '89.43.99.0/24',
 '89.43.100.0/24',
 '89.43.101.0/24',
 '89.43.102.0/24',
 '89.43.103.0/24',
 '89.44.46.0/24',
 '89.46.216.0/24',
 '89.46.217.0/24',
 '89.46.218.0/24',
 '89.46.219.0/24',
 '91.108.135.0/24',
 '91.108.240.0/24',
 '91.198.46.0/24',
 '91.209.127.0/24',
 '91.209.255.0/24',
 '91.212.52.0/24',
 '91.213.86.0/24',
 '91.240.80.0/24',
 '91.240.82.0/24',
 '91.240.83.0/24',
 '93.110.76.0/24',
 '93.110.77.0/24',
 '93.110.78.0/24',
 '93.110.79.0/24',
 '93.110.96.0/24',
 '93.110.97.0/24',
 '93.110.98.0/24',
 '93.110.99.0/24',
 '93.110.100.0/24',
 '93.110.101.0/24',
 '93.110.102.0/24',
 '93.110.103.0/24',
 '93.110.104.0/24',
 '93.110.105.0/24',
 '93.110.106.0/24',
 '93.110.107.0/24',
 '93.110.108.0/24',
 '93.110.109.0/24',
 '93.110.110.0/24',
 '93.110.111.0/24',
 '93.110.112.0/24',
 '93.110.113.0/24',
 '93.110.114.0/24',
 '93.110.115.0/24',
 '93.110.116.0/24',
 '93.110.117.0/24',
 '93.110.118.0/24',
 '93.110.119.0/24',
 '93.110.120.0/24',
 '93.110.121.0/24',
 '93.110.122.0/24',
 '93.110.123.0/24',
 '93.110.124.0/24',
 '93.110.125.0/24',
 '93.110.126.0/24',
 '93.110.127.0/24',
 '93.152.148.0/24',
 '93.152.149.0/24',
 '93.152.150.0/24',
 '93.152.151.0/24',
 '93.175.147.0/24',
 '93.187.224.0/24',
 '93.187.226.0/24',
 '94.176.36.0/24',
 '94.176.37.0/24',
 '94.176.38.0/24',
 '94.176.39.0/24',
 '94.177.76.0/24',
 '94.177.77.0/24',
 '94.177.78.0/24',
 '94.177.79.0/24',
 '95.65.234.0/24',
 '95.65.235.0/24',
 '95.65.244.0/24',
 '95.65.245.0/24',
 '95.65.246.0/24',
 '95.65.247.0/24',
 '103.15.120.0/24',
 '103.15.121.0/24',
 '103.15.122.0/24',
 '103.15.123.0/24',
 '103.27.136.0/24',
 '103.27.137.0/24',
 '103.29.96.0/24',
 '103.29.97.0/24',
 '103.29.98.0/24',
 '103.29.99.0/24',
 '103.36.16.0/24',
 '103.36.17.0/24',
 '103.36.18.0/24',
 '103.36.19.0/24',
 '103.54.54.0/24',
 '103.69.234.0/24',
 '103.69.235.0/24',
 '103.71.154.0/24',
 '103.72.78.0/24',
 '103.77.188.0/24',
 '103.77.189.0/24',
 '103.77.190.0/24',
 '103.77.191.0/24',
 '103.81.204.0/24',
 '103.102.222.0/24',
 '103.102.223.0/24',
 '103.104.194.0/24',
 '103.105.88.0/24',
 '103.105.89.0/24',
 '103.105.90.0/24',
 '103.105.91.0/24',
 '103.106.152.0/24',
 '103.107.239.0/24',
 '103.113.92.0/24',
 '103.113.93.0/24',
 '103.113.94.0/24',
 '103.113.95.0/24',
 '103.116.194.0/24',
 '103.121.34.0/24',
 '103.121.48.0/24',
 '103.134.30.0/24',
 '103.135.16.0/24',
 '103.138.145.0/24',
 '103.196.22.0/24',
 '103.198.0.0/24',
 '103.229.132.0/24',
 '103.229.134.0/24',
 '103.229.135.0/24',
 '103.232.206.0/24',
 '103.252.223.0/24',
 '109.197.38.0/24',
 '115.146.132.0/24',
 '115.146.133.0/24',
 '115.146.135.0/24',
 '115.146.146.0/24',
 '118.179.13.0/24',
 '120.29.66.0/24',
 '120.29.67.0/24',
 '120.29.92.0/24',
 '120.29.93.0/24',
 '120.29.94.0/24',
 '120.29.95.0/24',
 '120.29.96.0/24',
 '120.29.97.0/24',
 '120.29.98.0/24',
 '120.29.99.0/24',
 '120.29.110.0/24',
 '120.29.111.0/24',
 '121.59.17.0/24',
 '123.236.0.0/24',
 '123.236.1.0/24',
 '123.236.3.0/24',
 '123.236.6.0/24',
 '123.236.9.0/24',
 '123.236.11.0/24',
 '123.236.13.0/24',
 '123.236.14.0/24',
 '123.236.175.0/24',
 '123.236.241.0/24',
 '123.236.242.0/24',
 '123.236.243.0/24',
 '123.236.244.0/24',
 '123.236.245.0/24',
 '123.236.246.0/24',
 '123.237.0.0/24',
 '123.237.2.0/24',
 '123.237.3.0/24',
 '123.237.5.0/24',
 '123.237.6.0/24',
 '123.237.7.0/24',
 '123.237.8.0/24',
 '123.237.9.0/24',
 '123.237.10.0/24',
 '123.237.11.0/24',
 '123.237.12.0/24',
 '123.237.15.0/24',
 '123.237.45.0/24',
 '123.237.46.0/24',
 '123.237.240.0/24',
 '123.237.242.0/24',
 '123.237.246.0/24',
 '123.237.248.0/24',
 '123.237.250.0/24',
 '123.237.252.0/24',
 '123.237.253.0/24',
 '123.237.254.0/24',
 '123.237.255.0/24',
 '123.238.10.0/24',
 '123.238.24.0/24',
 '123.238.25.0/24',
 '123.238.28.0/24',
 '123.238.29.0/24',
 '123.238.30.0/24',
 '123.238.33.0/24',
 '123.238.34.0/24',
 '123.238.35.0/24',
 '123.238.36.0/24',
 '123.238.37.0/24',
 '123.238.39.0/24',
 '123.238.40.0/24',
 '123.238.41.0/24',
 '123.238.42.0/24',
 '123.238.45.0/24',
 '123.238.46.0/24',
 '123.238.47.0/24',
 '123.238.102.0/24',
 '123.238.111.0/24',
 '123.238.128.0/24',
 '123.238.157.0/24',
 '123.238.162.0/24',
 '123.238.195.0/24',
 '123.238.240.0/24',
 '123.238.241.0/24',
 '123.238.245.0/24',
 '123.238.246.0/24',
 '123.238.248.0/24',
 '123.238.251.0/24',
 '123.238.252.0/24',
 '123.238.253.0/24',
 '123.238.254.0/24',
 '123.238.255.0/24',
 '124.125.125.0/24',
 '124.125.126.0/24',
 '124.125.128.0/24',
 '124.125.138.0/24',
 '124.125.147.0/24',
 '124.125.162.0/24',
 '124.125.163.0/24',
 '124.125.165.0/24',
 '124.125.167.0/24',
 '128.201.80.0/24',
 '128.201.81.0/24',
 '128.201.82.0/24',
 '128.201.83.0/24',
 '138.59.134.0/24',
 '138.59.176.0/24',
 '138.59.177.0/24',
 '138.59.178.0/24',
 '138.59.179.0/24',
 '138.94.244.0/24',
 '138.94.245.0/24',
 '138.94.246.0/24',
 '138.94.247.0/24',
 '138.118.144.0/24',
 '138.118.145.0/24',
 '138.118.146.0/24',
 '138.118.147.0/24',
 '138.121.242.0/24',
 '138.255.89.0/24',
 '151.216.17.0/24',
 '154.16.19.0/24',
 '154.16.21.0/24',
 '154.16.31.0/24',
 '154.16.59.0/24',
 '154.16.67.0/24',
 '154.16.73.0/24',
 '154.16.79.0/24',
 '154.16.81.0/24',
 '154.16.83.0/24',
 '154.16.95.0/24',
 '154.16.96.0/24',
 '154.16.99.0/24',
 '154.16.136.0/24',
 '154.16.137.0/24',
 '154.16.138.0/24',
 '154.16.150.0/24',
 '154.16.151.0/24',
 '154.16.156.0/24',
 '154.16.157.0/24',
 '154.16.159.0/24',
 '154.16.194.0/24',
 '154.16.200.0/24',
 '154.16.205.0/24',
 '154.16.229.0/24',
 '154.16.230.0/24',
 '154.16.233.0/24',
 '154.16.238.0/24',
 '154.16.239.0/24',
 '154.16.242.0/24',
 '154.16.244.0/24',
 '154.16.246.0/24',
 '154.16.254.0/24',
 '161.22.36.0/24',
 '162.208.108.0/24',
 '162.208.109.0/24',
 '162.208.110.0/24',
 '162.210.242.0/24',
 '167.250.50.0/24',
 '168.197.188.0/24',
 '168.197.240.0/24',
 '168.197.241.0/24',
 '168.197.242.0/24',
 '168.197.243.0/24',
 '170.33.0.0/24',
 '170.33.1.0/24',
 '170.33.2.0/24',
 '170.33.3.0/24',
 '170.33.40.0/24',
 '170.33.41.0/24',
 '170.33.42.0/24',
 '170.33.43.0/24',
 '170.33.44.0/24',
 '170.33.45.0/24',
 '170.33.46.0/24',
 '170.33.47.0/24',
 '170.33.64.0/24',
 '170.33.65.0/24',
 '170.33.66.0/24',
 '170.33.68.0/24',
 '170.33.69.0/24',
 '170.33.72.0/24',
 '170.33.75.0/24',
 '170.33.76.0/24',
 '170.33.77.0/24',
 '170.33.78.0/24',
 '170.33.79.0/24',
 '170.33.80.0/24',
 '170.33.81.0/24',
 '170.33.82.0/24',
 '170.33.83.0/24',
 '170.33.84.0/24',
 '170.33.85.0/24',
 '170.33.86.0/24',
 '170.33.87.0/24',
 '170.247.56.0/24',
 '170.247.57.0/24',
 '170.247.58.0/24',
 '170.247.59.0/24',
 '170.247.173.0/24',
 '179.61.97.0/24',
 '179.61.98.0/24',
 '179.61.99.0/24',
 '179.61.100.0/24',
 '179.61.101.0/24',
 '179.61.102.0/24',
 '179.61.107.0/24',
 '179.61.210.0/24',
 '179.61.217.0/24',
 '179.61.248.0/24',
 '179.62.113.0/24',
 '181.10.184.0/24',
 '181.10.185.0/24',
 '181.10.186.0/24',
 '181.13.50.0/24',
 '181.13.51.0/24',
 '181.114.192.0/24',
 '181.174.156.0/24',
 '181.174.158.0/24',
 '182.50.66.0/24',
 '182.50.67.0/24',
 '182.54.233.0/24',
 '185.2.32.0/24',
 '185.3.200.0/24',
 '185.3.201.0/24',
 '185.3.202.0/24',
 '185.3.203.0/24',
 '185.7.44.0/24',
 '185.7.45.0/24',
 '185.16.236.0/24',
 '185.16.237.0/24',
 '185.16.238.0/24',
 '185.16.239.0/24',
 '185.51.188.0/24',
 '185.57.136.0/24',
 '185.57.137.0/24',
 '185.57.138.0/24',
 '185.57.139.0/24',
 '185.64.176.0/24',
 '185.64.177.0/24',
 '185.70.84.0/24',
 '185.70.85.0/24',
 '185.70.86.0/24',
 '185.70.87.0/24',
 '185.73.112.0/24',
 '185.81.74.0/24',
 '185.81.75.0/24',
 '185.81.236.0/24',
 '185.81.237.0/24',
 '185.81.238.0/24',
 '185.81.239.0/24',
 '185.86.6.0/24',
 '185.86.183.0/24',
 '185.88.172.0/24',
 '185.88.173.0/24',
 '185.88.174.0/24',
 '185.88.175.0/24',
 '185.98.61.0/24',
 '185.111.235.0/24',
 '185.158.175.0/24',
 '185.171.88.0/24',
 '185.171.89.0/24',
 '185.171.90.0/24',
 '185.171.91.0/24',
 '185.185.232.0/24',
 '185.185.235.0/24',
 '185.186.65.0/24',
 '185.188.112.0/24',
 '185.188.113.0/24',
 '185.188.114.0/24',
 '185.188.115.0/24',
 '185.195.253.0/24',
 '185.198.89.0/24',
 '185.198.90.0/24',
 '185.198.91.0/24',
 '185.201.40.0/24',
 '185.201.42.0/24',
 '185.215.89.0/24',
 '185.215.90.0/24',
 '185.215.91.0/24',
 '185.216.193.0/24',
 '185.217.228.0/24',
 '185.217.229.0/24',
 '185.217.230.0/24',
 '185.217.231.0/24',
 '185.228.92.0/24',
 '185.228.93.0/24',
 '185.228.94.0/24',
 '185.228.95.0/24',
 '185.231.109.0/24',
 '185.231.112.0/24',
 '185.232.43.0/24',
 '185.233.247.0/24',
 '185.236.248.0/24',
 '185.236.249.0/24',
 '185.236.250.0/24',
 '185.236.251.0/24',
 '185.237.81.0/24',
 '185.242.169.0/24',
 '185.244.29.0/24',
 '185.255.31.0/24',
 '186.1.248.0/24',
 '186.1.249.0/24',
 '186.86.198.0/24',
 '186.86.255.0/24',
 '186.113.12.0/24',
 '188.3.24.0/24',
 '188.3.26.0/24',
 '188.3.27.0/24',
 '188.3.84.0/24',
 '188.3.85.0/24',
 '188.3.86.0/24',
 '188.3.87.0/24',
 '188.3.112.0/24',
 '188.3.113.0/24',
 '188.3.114.0/24',
 '188.3.115.0/24',
 '188.3.116.0/24',
 '188.3.117.0/24',
 '188.3.118.0/24',
 '188.3.119.0/24',
 '188.3.236.0/24',
 '188.3.237.0/24',
 '188.3.238.0/24',
 '188.3.239.0/24',
 '188.253.0.0/24',
 '188.253.2.0/24',
 '188.253.3.0/24',
 '188.253.126.0/24',
 '188.253.127.0/24',
 '190.99.117.0/24',
 '190.99.118.0/24',
 '190.105.194.0/24',
 '190.107.216.0/24',
 '190.107.217.0/24',
 '190.107.218.0/24',
 '190.107.219.0/24',
 '190.107.220.0/24',
 '190.107.221.0/24',
 '190.107.222.0/24',
 '190.107.223.0/24',
 '190.112.32.0/24',
 '190.112.33.0/24',
 '190.112.34.0/24',
 '190.112.35.0/24',
 '190.112.36.0/24',
 '190.112.37.0/24',
 '190.112.38.0/24',
 '190.112.39.0/24',
 '190.113.241.0/24',
 '190.113.242.0/24',
 '190.147.138.0/24',
 '190.147.139.0/24',
 '190.158.143.0/24',
 '190.182.189.0/24',
 '190.182.250.0/24',
 '190.182.251.0/24',
 '190.185.130.0/24',
 '190.185.204.0/24',
 '190.185.206.0/24',
 '190.185.209.0/24',
 '190.185.210.0/24',
 '190.228.177.0/24',
 '191.96.32.0/24',
 '191.96.98.0/24',
 '192.33.23.0/24',
 '192.113.68.0/24',
 '192.113.70.0/24',
 '192.113.71.0/24',
 '192.136.187.0/24',
 '192.209.63.0/24',
 '193.106.198.0/24',
 '193.168.144.0/24',
 '194.32.71.0/24',
 '194.36.173.0/24',
 '194.50.177.0/24',
 '194.53.176.0/24',
 '194.53.177.0/24',
 '194.53.178.0/24',
 '194.53.179.0/24',
 '194.55.225.0/24',
 '194.127.111.0/24',
 '194.145.136.0/24',
 '198.176.44.0/24',
 '198.176.46.0/24',
 '198.176.47.0/24',
 '200.23.206.0/24',
 '200.25.59.0/24',
 '200.33.26.0/24',
 '200.33.27.0/24',
 '200.33.125.0/24',
 '200.50.182.0/24',
 '200.50.183.0/24',
 '200.50.190.0/24',
 '200.59.196.0/24',
 '200.59.197.0/24',
 '200.126.44.0/24',
 '200.126.45.0/24',
 '201.219.254.0/24',
 '201.221.144.0/24',
 '201.221.146.0/24',
 '201.221.147.0/24',
 '201.221.149.0/24',
 '201.221.152.0/24',
 '201.221.154.0/24',
 '201.221.156.0/24',
 '202.13.72.0/24',
 '202.57.111.0/24',
 '202.143.125.0/24',
 '202.143.126.0/24',
 '202.144.202.0/24',
 '202.153.171.0/24',
 '202.179.14.0/24',
 '203.98.69.0/24',
 '203.98.80.0/24',
 '203.98.85.0/24',
 '203.98.92.0/24',
 '208.71.225.0/24',
 '208.71.226.0/24',
 '208.71.227.0/24',
 '209.24.0.0/24',
 '212.33.196.0/24',
 '212.33.197.0/24',
 '212.33.198.0/24',
 '212.33.199.0/24',
 '212.33.200.0/24',
 '212.33.201.0/24',
 '213.142.148.0/24',
 '213.159.199.0/24',
 '213.186.128.0/24',
 '213.186.129.0/24',
 '213.186.130.0/24',
 '213.186.131.0/24',
 '213.186.132.0/24',
 '213.186.133.0/24',
 '213.186.135.0/24',
 '213.186.136.0/24',
 '213.186.137.0/24',
 '213.186.138.0/24',
 '213.186.139.0/24',
 '213.186.141.0/24',
 '213.186.142.0/24',
 '213.186.143.0/24',
 '216.21.237.0/24',
 '219.69.248.0/24',
 '219.69.249.0/24',
 '219.69.250.0/24',
 '219.69.254.0/24',
 '220.224.192.0/24',
 '220.224.227.0/24',
 '220.226.161.0/24',
 '220.226.162.0/24',
 '220.226.163.0/24',
 '221.120.25.0/24',
 '2a07:abc0::/29',
 '2a04:280::/29',
 '2a02:7f80::/30',
 '2a02:7f84::/30',
 '2001:1400::/32',
 '2404:f801::/32',
 '2405:8180::/32',
 '2406:3001::/32',
 '2407:6180::/32',
 '2800:6b0::/32',
 '2803:4f00::/32',
 '2a00:5f00::/32',
 '2a07:c880::/32',
 '2a0c:9f40::/32',
 '2402:f740:2000::/36',
 '2402:f740::/36',
 '2402:f740:1000::/36',
 '2401:9cc0:300::/40',
 '2803:6700:3000::/40',
 '2406:3003:3100::/45',
 '2803:2a80:21::/48',
 '2803:7d00:130a::/48',
 '2803:7d00:131::/48',
 '2803:7d00:130::/48',
 '2803:7d00:80::/48',
 '2803:7d00::/48',
 '2803:2a80:8f0::/48',
 '2803:2a80:821::/48',
 '2803:2a80:800::/48',
 '2803:2a80:20::/48',
 '2803:2a80:2::/48',
 '2a0d:2902:cb00::/48',
 '2803:2a80::/48',
 '2803:7d00:b000::/48',
 '2803:7d00:b020::/48',
 '2803:db40::/48',
 '2803:ea80::/48',
 '2806:211::/48',
 '2806:211:10::/48',
 '2a00:5180::/48',
 '2a00:5180:1::/48',
 '2a00:5180:2::/48',
 '2a03:f85:4::/48',
 '2a0b:b87:ffd0::/48',
 '2a0c:b641:77::/48',
 '2a0c:b641:57f::/48',
 '2406:3003:100::/48',
 '2001:67c:1484::/48',
 '2001:7fb:fd03::/48',
 '2001:ac0:c800::/48',
 '2001:13c7:6000::/48',
 '2001:19f0:a06::/48',
 '2001:19f0:a07::/48',
 '2001:19f0:a08::/48',
 '2401:2c0:1::/48',
 '2402:cf80:1006::/48',
 '2402:e380:19::/48',
 '2402:ea80::/48',
 '2406:3002:20::/48',
 '2801:170::/48',
 '2604:a680:2::/48',
 '2604:a680:5::/48',
 '2604:ab80:2::/48',
 '2606:8e80::/48',
 '2606:8e80:1000::/48',
 '2606:8e80:2000::/48',
 '2606:8e80:3000::/48',
 '2606:8e80:4000::/48',
 '2606:8e80:5000::/48',
 '2620:4d:4002::/48',
 '2620:4d:4022::/48',
 '2801:12:7000::/48']

Note that the source is not updated regularly, but it seems good enough to have a rough estimate.

After a chat with Arzhel we decided that the best output would be prefix and AS number from Geo IP data, grouped and ordered by number of hits.
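As a rough illustration of that output shape, the grouping can be sketched in plain Python. The function name `top_prefixes`, the sample prefixes, and the AS numbers below are made up for the example; the real job does the same thing with GROUP BY and ORDER BY in Spark SQL.

```python
from collections import Counter


def top_prefixes(rows):
    """Count (prefix, asn) pairs and return them ordered by hits, descending.

    rows: iterable of (prefix, asn) tuples, one per matching record.
    """
    return Counter(rows).most_common()


# Hypothetical sample data using documentation prefixes and private-use ASNs.
sample = [
    ("192.0.2.0/24", 64500),
    ("198.51.100.0/24", 64501),
    ("192.0.2.0/24", 64500),
]

for (prefix, asn), hits in top_prefixes(sample):
    print(prefix, asn, hits)
```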

elukey closed this task as Resolved. May 31 2019, 9:34 AM
elukey claimed this task.

Data for one day of text/upload has been provided to Arzhel via separate phab paste, closing! (re-open if anything is still needed).

faidon reopened this task as Open. Jun 28 2019, 2:15 PM
faidon added a subscriber: faidon.

So, a few things:

  • There is a better source for this kind of data, that is updated hourly rather than monthly: https://as286.net/data/ana-invalids.txt
  • For RPKI specifically we would also like to differentiate between three states: no match, match but with no alternative prefix (unreachable), and match but with an alternative prefix (invalid-but-reachable)
  • I'd like us to be able to see the evolution of that data over time, basically to track the percentage of traffic that we would lose if we were to move forward with rejecting RPKI invalids. Ideally that would be a Grafana graph or something similar, but if we have no such capabilities, there is no reason to add them: this would most likely be temporary (i.e. grab data for something like a month).

I wrote the code below which:

  • Parses this new file format, and differentiates between those three states
  • Uses a Patricia trie library (pytricia), which should make these lookups orders of magnitude faster than the linear lookups that the existing code seems to be doing.
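For readers unfamiliar with the idea, here is a minimal pure-Python sketch of longest-prefix matching with a binary trie, combined with the three states described above. It is not pytricia and not a compressed Patricia trie; the class `PrefixTrie`, the function `classify`, and the sample prefixes are assumptions made for illustration only.

```python
import ipaddress


class PrefixTrie:
    """Minimal binary trie for longest-prefix matching (illustrative only)."""

    def __init__(self):
        # One root per IP version so IPv4 and IPv6 bits never mix.
        self.roots = {4: {}, 6: {}}

    @staticmethod
    def _bits(value, width):
        # Fixed-width binary string, most significant bit first.
        return format(value, "0{}b".format(width))

    def insert(self, prefix, has_altpfx):
        net = ipaddress.ip_network(prefix)
        bits = self._bits(int(net.network_address), net.max_prefixlen)
        node = self.roots[net.version]
        for bit in bits[:net.prefixlen]:
            node = node.setdefault(bit, {})
        node["value"] = has_altpfx

    def lookup(self, addr):
        """Return the value of the longest matching prefix, or None."""
        ip = ipaddress.ip_address(addr)
        node = self.roots[ip.version]
        best = node.get("value")
        for bit in self._bits(int(ip), ip.max_prefixlen):
            node = node.get(bit)
            if node is None:
                break
            if "value" in node:
                best = node["value"]
        return best


def classify(trie, addr):
    has_altpfx = trie.lookup(addr)
    if has_altpfx is None:
        return "valid-or-unverified"
    return "invalid-but-reachable" if has_altpfx else "unreachable"


# Hypothetical prefixes from the documentation ranges.
trie = PrefixTrie()
trie.insert("192.0.2.0/24", False)       # no alternative prefix
trie.insert("198.51.100.0/24", True)     # has an alternative prefix
trie.insert("198.51.100.128/25", False)  # more specific, no alternative
trie.insert("2001:db8::/32", False)

for ip in ("192.0.2.7", "198.51.100.5", "198.51.100.200", "203.0.113.1", "2001:db8::1"):
    print(ip, classify(trie, ip))
```

Each lookup walks at most 32 (IPv4) or 128 (IPv6) nodes regardless of how many prefixes are stored, whereas a linear scan grows with the length of the prefix list.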

I have no idea how to integrate all that with Spark :) Questions would be:

  • How do we run this with a venv so that we can include Pytricia?
  • How do we fetch the file from Hadoop worker nodes? Should we instead have a different job that fetches it every hour and stores it in HDFS, and then use Spark to read that file from HDFS instead? (The code is generic right now, shouldn't be too hard?)
  • How do we run this periodically (every hour, ideally after we refresh the file), and how do we store the output in a time-series manner to be able to track the trend over time?

Thanks, and pardon my ignorance :)

#!/usr/bin/env python3

import re
import pytricia
import urllib.request

import random
import timeit

anafile = "ana-invalids.txt"
anaurl = "https://as286.net/data/ana-invalids.txt"


def parse(iterable):
    pattern = "^(?P<prefix>[^;]+);srcAS=(?P<asn>[^;]+);altpfx=(?P<altpfx>[^;]+);.*"
    prefixes = pytricia.PyTricia(128)

    for line in iterable:
        match = re.search(pattern, line)
        if not match:
            continue

        prefix = match.group("prefix")
        # altpfx=NONE means no alternative (covering) prefix exists
        if match.group("altpfx") == "NONE":
            altpfx = False
        else:
            altpfx = True

        prefixes[prefix] = altpfx

    return prefixes


def read_and_parse_file(filename):
    with open(filename, "r") as f:
        return parse(f)


def read_and_parse_url(url):
    response = urllib.request.urlopen(url)
    content = response.read().decode("utf-8")
    return parse(content.splitlines())


prefixes = read_and_parse_file(anafile)
# prefixes = read_and_parse_url(anaurl)


def lookup(ip):
    try:
        altpfx = prefixes[ip]
        if altpfx:
            result = "invalid-but-reachable"
        else:
            result = "unreachable"
    except KeyError:
        result = "valid-or-unverified"

    return result


# Test some lookups; the expected result is noted next to each IP
def test_lookup():
    for ip in (
        "2.59.118.1",  # unreachable
        "2.176.52.1",  # invalid-but-reachable
        "1.1.1.1",  # valid-or-unverified
        "2a0d:5643::1",  # unreachable
        "2a0d:5084::1",  # invalid-but-reachable
        "2606:4700:4700::1111",  # valid-or-unverified
    ):
        try:
            print(ip, lookup(ip))
        except KeyError:
            # invalid prefix?
            pass


def random_ip(n):
    # Yield n random IPv4 addresses in dotted-quad notation
    for i in range(n):
        random_ip4 = ".".join([str(random.randint(0, 255)) for j in range(0, 4)])
        yield random_ip4


def test_random():
    for ip in random_ip(1000):
        lookup(ip)


if __name__ == "__main__":
    test_lookup()
    n = 10
    v = timeit.timeit(
        "test_random()", setup="from __main__ import test_random", number=n
    )
    print("Execution time per run: ", v / n)

elukey added a comment. Jul 1 2019, 2:29 PM

How do we run this with a venv so that we can include Pytricia?

Ideally if we had a deb package for this library we could deploy it on all the worker nodes and use it :)

If we don't care about plotting a graph in Grafana, we could simply have a recurring Spark job that adds data every hour to a Hive table (which will be partitioned by year/month/day/hour anyway). Later on we could import the data into Druid and display it via Turnilo if needed.
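As a sketch of what one hourly batch of rows might look like, the snippet below turns per-state hit counts into records carrying the year/month/day/hour partition fields plus a percentage of total traffic. The function name `hourly_rows`, the column names, and the counts are all hypothetical; no table schema was defined in this task.

```python
from datetime import datetime


def hourly_rows(ts, counts):
    """Build one row per RPKI state for the hour starting at ts.

    counts: dict mapping state name -> number of hits in that hour.
    """
    total = sum(counts.values())
    return [
        {
            "year": ts.year,
            "month": ts.month,
            "day": ts.day,
            "hour": ts.hour,
            "state": state,
            "hits": hits,
            # Share of all traffic in this hour, in percent.
            "pct": round(100.0 * hits / total, 4) if total else 0.0,
        }
        for state, hits in sorted(counts.items())
    ]


# Made-up counts for a single hour.
rows = hourly_rows(
    datetime(2019, 7, 1, 14),
    {"unreachable": 5, "invalid-but-reachable": 15, "valid-or-unverified": 980},
)
for row in rows:
    print(row)
```

Appending such rows every hour would give a time series from which the trend in the "unreachable" percentage can be read directly.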

  • For RPKI specifically we would also like to differentiate between three states: no match, match but with no alternative prefix (unreachable), and match but with an alternative prefix (invalid-but-reachable)

This bit is also not totally clear to me; could you please add more info? :)

https://as286.net/data/ana-invalids.txt is RPKI invalid data, crossed with the global routing table.

Take for example:

2.59.118.0/24;srcAS=60721;altpfx=NONE;iROAS=2.59.118.0/24-24(209737);

This means:
2.59.118.0/24 is an RPKI invalid prefix (someone is advertising that prefix on the internet, but the RPKI database says that AS is not allowed to).
altpfx=NONE means that the prefix is not covered by a different (valid/unknown) prefix.
iROAS=2.59.118.0/24-24(209737) shows the AS number (209737) allowed to advertise that prefix according to RPKI.

So from there:

no match

Means a webrequest IP doesn't match any RPKI invalid prefix in that file

match but with no alternative prefix

A match with a prefix that has altpfx=NONE; these are what we call RPKI unreachable prefixes.

match but with an alternative prefix

A match with a prefix that has, for example, altpfx=<2.176.0.0/16(12880). Even if we block the invalid prefix, traffic will be routed via the alternate valid one.
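The field layout described above can be sketched as a small parser. It reuses the regular expression from the script posted earlier in this task; the function name `parse_line` and the second sample line (built from documentation prefixes and private-use AS numbers) are assumptions for illustration, and the real file may contain variations not covered here.

```python
import re

# Same pattern as in the script posted earlier in this task.
LINE_RE = re.compile(
    r"^(?P<prefix>[^;]+);srcAS=(?P<asn>[^;]+);altpfx=(?P<altpfx>[^;]+);.*"
)


def parse_line(line):
    """Return a dict describing one ana-invalids.txt line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    altpfx = match.group("altpfx")
    return {
        "prefix": match.group("prefix"),
        "src_as": match.group("asn"),
        # altpfx=NONE: nothing else covers the prefix, so blocking it
        # makes the addresses unreachable.
        "state": "unreachable" if altpfx == "NONE" else "invalid-but-reachable",
    }


# The example line quoted above.
real = "2.59.118.0/24;srcAS=60721;altpfx=NONE;iROAS=2.59.118.0/24-24(209737);"
# A made-up line with an alternative prefix (hypothetical values).
fake = "192.0.2.0/24;srcAS=64500;altpfx=<192.0.0.0/16(64501);iROAS=192.0.2.0/24-24(64502);"

print(parse_line(real))
print(parse_line(fake))
```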

elukey added a comment. Jul 3 2019, 8:49 AM

Ack got it thanks!

Out of curiosity, I ran two Spark jobs on one hour of webrequest text data, using:

  1. The Spark python code listed in this task
  2. Another similar script with the following change (basically using ipaddress for the containment check rather than implementing the prefix matching by hand):

rpki_invalid_prefixes_ip_network = [ipaddress.ip_network(prefix) for prefix in rpki_invalid_prefixes]

# Define a UDF for Spark that checks whether an IP address
# belongs to any of the RPKI invalid prefixes.
# Returns the list of matching RPKI invalid prefixes (empty if none match).
def in_subnets(ip):
    try:
        ip_addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed or missing IP address
        return []
    return [str(prefix) for prefix in rpki_invalid_prefixes_ip_network if ip_addr in prefix]

spark.udf.register("in_subnets", in_subnets, ArrayType(StringType()))

The results are the following:

  1. execution time (with 32 Spark workers) ~7 minutes
  2. execution time (with 32 Spark workers) ~ 40 minutes

Results are the same from what I can see.

I'll try to dig a bit more into why; there are possibly some improvements for 2) that I am currently not seeing.
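The two approaches can be compared locally with a small micro-benchmark like the one below. This is only a sketch under assumptions: the prefix list is a handful of documentation ranges, the function names are made up, and timings on a laptop will not necessarily reflect what happens inside Spark executors. It does at least confirm that both variants return the same matches.

```python
import ipaddress
import timeit

# Hypothetical prefix list (documentation ranges).
prefixes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24", "2001:db8::/32"]
nets = [ipaddress.ip_network(p) for p in prefixes]
# Precomputed (version, network, mask, prefix) tuples for the integer variant.
ints = [(n.version, int(n.network_address), int(n.netmask), p)
        for n, p in zip(nets, prefixes)]


def match_int(addr_str):
    """Variant 1: integer masking against precomputed values."""
    ip = ipaddress.ip_address(addr_str)
    a = int(ip)
    return [p for (v, net, mask, p) in ints
            if v == ip.version and (a & mask) == net]


def match_contains(addr_str):
    """Variant 2: rely on the `in` operator of ipaddress network objects."""
    ip = ipaddress.ip_address(addr_str)
    return [str(n) for n in nets if ip in n]


samples = ["192.0.2.55", "203.0.113.9", "8.8.8.8", "2001:db8::42"]

# Both variants must agree before timing them.
for s in samples:
    assert match_int(s) == match_contains(s)

for fn in (match_int, match_contains):
    elapsed = timeit.timeit(lambda: [fn(s) for s in samples], number=2000)
    print(fn.__name__, round(elapsed, 4), "seconds")
```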

elukey moved this task from Backlog to Waiting for others on the User-Elukey board. Jul 5 2019, 7:01 AM
faidon lowered the priority of this task from Normal to Low. Jul 12 2019, 12:07 PM

How do we run this with a venv so that we can include Pytricia?

Ideally if we had a deb package for this library we could deploy it on all the worker nodes and use it :)

FYI, with rpkicounter.py I've switched to the similar "radix" library, which is in Debian (as python3-radix). It is relatively easy to test out; ping me on IRC for some test code if you'd like :) rpkicounter can process ~100k lookups/sec, so it should be fairly quick for Spark as well (maybe fast enough that we won't even need the extra DISTINCT step!).

More broadly, we're using rpkicounter in production now, via kafkatee with a 1:100 sample, and through Prometheus we get a Grafana dashboard out of it. So our needs with regards to this task are covered. It'd be still neat to understand better Analytics' offerings (including Spark?) for use cases like those instead of building custom pipelines! Would love to have an exploratory chat with y'all, I'll try to set up something soon :)

@ayounsi should we keep this task open?

ayounsi closed this task as Resolved. Wed, Sep 18, 4:01 PM

All good here. Thanks!