Per @faidon's comment: One thing I've proposed before that could be useful for the raw logs & Druid, but perhaps even for the data in HDFS, is incorporating data from the GeoIP2 ISP database (which I don't believe we're subscribed to, but is fairly cheap). Being able to aggregate by ISP/AS number could be useful for these kinds of investigations ("how many hits were by Yandex").
|Resolved||JAllemandou||T167907 Incorporate data from the GeoIP2 ISP database to webrequest|
|Duplicate||None||T160822 Filter local IPs before checking for geo info|
This came up again this week: I was looking into our network traffic in our various PoPs, to plan capacity and procure network links for eqsin (Singapore). There is traffic on our peering port in ulsfo, and there is no easy way to identify where it's coming from using our own tooling. Analyzing Netflows using a pmacct/Druid/Tranquility pipeline would be ideal, but we're very far from that being usable and useful, despite Analytics (very graciously!) helping us slowly get there (cf. T181036).
Having ISP names and their autonomous system numbers in Hadoop could be a useful alternative for us to extract the same statistics, and perhaps it'd be useful for other use cases as well? I can imagine e.g. performance drilling down in case of ISP issues, Zero giving an estimate to potential carriers, etc.
I realize you folks have a lot on your plate and this may not be high on your list, so until that happens, I wrote a script of my own to parse the 1:1000 logs on oxygen and import them into an SQLite database to run queries against. The ISP data is currently based on the outdated GeoIP v1 ASN databases, so having the GeoIP v2 ISP databases would already be a great step. If the $300/yr (or so) is an issue, we could cover it from TechOps' budget (and while I have access to our MaxMind account, I thought it'd be best to coordinate with you folks first :)). Upgrading oxygen to stretch would be helpful too; I'll sync with Luca/Otto next week about that and, if there are no gotchas or objections, tackle it myself.
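For illustration, here is a minimal sketch of the import-and-aggregate idea. It is not the actual script: the table layout, function names, and the `EXAMPLE_ASNS` lookup table are all hypothetical, and the addresses and AS numbers come from the ranges reserved for documentation. In practice the `lookup` callable would presumably wrap a GeoIP database reader (for example the `isp()` lookup of MaxMind's `geoip2` Python library, assuming the ISP database is available); here it is a plain dictionary so the example only needs the standard library.

```python
import sqlite3
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical lookup result: (AS number, ISP name), or None when the IP is unknown.
AsnInfo = Optional[Tuple[int, str]]

# Made-up sample data using documentation-only IPs and AS numbers.
EXAMPLE_ASNS = {
    "192.0.2.1": (64496, "Example ISP A"),
    "192.0.2.2": (64496, "Example ISP A"),
    "198.51.100.7": (64497, "Example ISP B"),
}


def import_hits(
    conn: sqlite3.Connection,
    client_ips: Iterable[str],
    lookup: Callable[[str], AsnInfo],
) -> None:
    """Store one row per sampled hit, annotated with ASN and ISP when known."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (ip TEXT NOT NULL, asn INTEGER, isp TEXT)"
    )
    rows = []
    for ip in client_ips:
        info = lookup(ip)
        asn, isp = info if info is not None else (None, None)
        rows.append((ip, asn, isp))
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)


def hits_by_asn(conn: sqlite3.Connection) -> List[Tuple[Optional[int], Optional[str], int]]:
    """Count hits per (ASN, ISP), busiest first; unknown IPs group under NULL."""
    return conn.execute(
        "SELECT asn, isp, COUNT(*) AS n FROM hits "
        "GROUP BY asn, isp ORDER BY n DESC, asn"
    ).fetchall()
```

Calling `import_hits(conn, ips, EXAMPLE_ASNS.get)` on an in-memory connection and then `hits_by_asn(conn)` gives per-AS hit counts. Since the logs are sampled 1:1000, the counts would need to be scaled accordingly to estimate real traffic.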