
Incorporate data from the GeoIP2 ISP database to webrequest
Closed, Resolved | Public | 8 Estimated Story Points

Description

Per @faidon's comment: One thing I've proposed before that could be useful for the raw logs & Druid, but perhaps even for the data in HDFS, is incorporating data from the GeoIP2 ISP database (which I don't believe we're subscribed to, but is fairly cheap). Being able to aggregate by ISP/AS number could be useful for these kinds of investigations ("how many hits were by Yandex").

Event Timeline

This came up again this week: I was looking into our network traffic in our various PoPs, to plan capacity and procure network links for eqsin (Singapore). There is traffic on our peering port in ulsfo, and there is no easy way to identify where it's coming from using our own tooling. Analyzing Netflows using a pmacct/Druid/Tranquility pipeline would be ideal, but we're very far from that being usable and useful, despite Analytics (very graciously!) helping us slowly get there (cf. T181036).

Having ISP names and their autonomous system numbers in Hadoop could be a useful alternative for us to extract the same statistics, and perhaps it'd be useful for other use cases as well? I can imagine e.g. performance drilling down in case of ISP issues, Zero giving an estimate to potential carriers, etc.

I realize you folks have a lot on your plate and this may not be high on your list, so until that happens, I wrote a script of my own to parse the 1:1000 logs on oxygen and import them into an SQLite database to run queries against. The ISP data is currently based on the outdated GeoIP v1 ASN databases, and having GeoIP v2 ISP databases would already be a great step. If the $300/yr (or so) is an issue, we could cover it from TechOps' budget (and while I have access to our MaxMind account, I thought it'd be best to coordinate with you folks first :)). Upgrading oxygen to stretch would be helpful too; I'll sync with Luca/Otto next week about that, and if there are no gotchas or objections, tackle it myself.
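For illustration only, here is a minimal sketch of that kind of workflow, not the actual script: sampled request records are tagged with an AS number and ISP name, loaded into SQLite, and aggregated per ISP. Everything named here is an assumption: `PREFIXES` is a hypothetical stand-in for a real ASN/ISP database (it uses documentation IP ranges and private-use AS numbers, not real data), and the `requests` table layout and the `(ip, bytes)` record shape are invented for the example.

```python
import ipaddress
import sqlite3

# Hypothetical prefix-to-AS table standing in for a real ASN/ISP database.
# Prefixes are documentation ranges and ASNs are private-use numbers.
PREFIXES = [
    (ipaddress.ip_network("192.0.2.0/24"), 64512, "Example ISP A"),
    (ipaddress.ip_network("198.51.100.0/24"), 64513, "Example ISP B"),
]

SAMPLING_FACTOR = 1000  # assumption: logs are sampled 1:1000


def lookup_asn(ip):
    """Return (asn, isp) for an IP, or (None, None) if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    for network, asn, isp in PREFIXES:
        if addr in network:
            return asn, isp
    return None, None


def load_requests(conn, records):
    """Insert (ip, bytes) records, annotated with ASN and ISP name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(ip TEXT, asn INTEGER, isp TEXT, bytes INTEGER)"
    )
    rows = [(ip, *lookup_asn(ip), size) for ip, size in records]
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def hits_by_isp(conn):
    """Estimate total hits per ISP by scaling sampled counts back up."""
    return conn.execute(
        "SELECT asn, isp, COUNT(*) * ? AS estimated_hits "
        "FROM requests GROUP BY asn, isp "
        "ORDER BY estimated_hits DESC",
        (SAMPLING_FACTOR,),
    ).fetchall()
```

With the records in an in-memory database (`sqlite3.connect(":memory:")`), `hits_by_isp` answers the "how many hits were by ISP X" style of question; requests whose IP matches no prefix are grouped under a NULL ASN rather than dropped.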

Let's start by buying a license for the GeoIP v2 ISP databases.

We should plan to add this to webrequest early in Q3.

JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 403916 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Refactor geo-coding function and add ISP

https://gerrit.wikimedia.org/r/403916

Change 405899 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add ISP data to webrequest table

https://gerrit.wikimedia.org/r/405899

Change 403916 merged by Ottomata:
[analytics/refinery/source@master] Refactor geo-coding function and add ISP

https://gerrit.wikimedia.org/r/403916

Change 405899 merged by Joal:
[analytics/refinery@master] Add ISP data to webrequest table

https://gerrit.wikimedia.org/r/405899