
Access and use the MaxMind database for IPInfo
Closed, Resolved (Public)

Description

The Anti-Harassment team needs to access the MaxMind database to provide extended information about IPs as part of our work on IP masking. We have licensing permission from MaxMind to use the data in this way.

Production

The MaxMind DB file is available on all Wikimedia application servers. We'll add a config variable to the IP Info extension that holds the filesystem path to the MaxMind DB file. We'll use MaxMind's geoip2/geoip2 library to read and query the DB file. What will the file path be in production?

Alternatively, we could create a microservice that IP Info could access over the local network.

Testing

The outstanding question is how we write code that integrates with that data in local environments or other testing areas. Should we mock the data structure with some fake data? Is the GeoIP Lite database in the same structure? Given its open license, we could more easily copy that data around. What other options are there for testing setups?
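One way to picture the "mock the data structure with some fake data" option is an in-memory stand-in for the database reader. The sketch below is Python rather than the extension's PHP, and everything in it is hypothetical: the class name `FakeGeoReader`, the `lookup` method, and the records themselves (which use documentation-only IP ranges and are only loosely shaped like GeoIP2 City records). It is meant as an illustration of the idea, not as the approach the team took.

```python
import ipaddress
from typing import Dict, Optional


class FakeGeoReader:
    """In-memory stand-in for a GeoIP2 reader, intended for tests only."""

    def __init__(self, records: Dict[str, dict]):
        # Parse each CIDR key once and keep (network, record) pairs.
        self._entries = [
            (ipaddress.ip_network(cidr), record)
            for cidr, record in records.items()
        ]

    def lookup(self, ip: str) -> Optional[dict]:
        """Return the record of the most specific matching network, or None."""
        addr = ipaddress.ip_address(ip)
        matches = [
            (net, rec)
            for net, rec in self._entries
            if net.version == addr.version and addr in net
        ]
        if not matches:
            return None
        # Longest prefix wins, as in a real IP database.
        return max(matches, key=lambda pair: pair[0].prefixlen)[1]


# Hypothetical fake data, keyed by documentation-only networks.
FAKE_RECORDS = {
    "192.0.2.0/24": {
        "country": {"iso_code": "XX"},
        "city": {"names": {"en": "Testville"}},
    },
    "192.0.2.128/25": {
        "country": {"iso_code": "XX"},
        "city": {"names": {"en": "Examplecity"}},
    },
    "2001:db8::/32": {
        "country": {"iso_code": "YY"},
        "city": {"names": {"en": "Sampletown"}},
    },
}
```

Code under test would take a reader object as a parameter, so a test can pass `FakeGeoReader(FAKE_RECORDS)` while production passes a reader backed by the real file.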

Other Concerns

Given that this is a nicely indexed flat file database, I'm assuming performance impact for something low use is negligible. (At least to begin with, the feature will only be available to checkusers. However, it may be more widely available in the future.) Is that assumption correct?

Event Timeline

@eprodromou @sdkim Can you check in with @aezell about this? They're hoping to have a quick turnaround if possible.

After some discussions with @eprodromou about an API, we've decided for ease and speed's sake to use the flatfile available on the application servers.

Now we need to find out where that file is located and whether the user running the PHP app can read it, or whether the file permissions prevent that.
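A quick way to answer the permissions half of that question is to run a small check as the same user that runs the app. The following is a rough Python sketch, not something from this task: the helper name `describe_access` is made up, and the path in the usage note is a placeholder because the real location was not yet known at this point in the discussion.

```python
import os
import stat
from pathlib import Path


def describe_access(path):
    """Report whether a file exists and is readable by the current user."""
    p = Path(path)
    if not p.is_file():
        return {"path": str(p), "exists": False, "readable": False, "mode": None}
    return {
        "path": str(p),
        "exists": True,
        # os.access checks against the real uid/gid of this process.
        "readable": os.access(p, os.R_OK),
        # Human-readable mode string such as "-rw-r--r--".
        "mode": stat.filemode(p.stat().st_mode),
    }
```

For example, `describe_access("/path/to/some.mmdb")` (placeholder path) returns a dict whose `readable` field reflects the user the script was run as, so it only answers the question if it is run as the app's user.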

@BBlack Do you know the filepath for the MaxMind DB on the app servers and what its permissions might be?

Tchanders renamed this task from Access and use the MaxMind database for CheckUser to Access and use the MaxMind database for IPInfo. Sep 30 2020, 5:24 PM
Tchanders updated the task description.

@Jdforrester-WMF I'm curious if you have any knowledge about how accessing the MaxMind database works. If so, could you share some guidance?

Hmm. Looking, it seems like the geoip2 composer library is installed in production on the Fundraising cluster but not the main one. Not sure how trivial it'd be to enable it in full Prod?

Note that https://gerrit.wikimedia.org/r/c/mediawiki/core/+/376451 was going to do some of this work in core, including adding this library, so it's not an unknown want. :-)

The MaxMind databases we have are available in production on each MediaWiki application server under /usr/share/GeoIP. The GeoIP2, GeoIP, and GeoLite databases are all available.

Also, there are freely distributable (CC-BY-SA 3.0 license) MaxMind demo databases at https://github.com/maxmind/MaxMind-DB/tree/master/test-data which ought to be helpful for unit/integration tests against the GeoIP2 data format.

Happy to help more but hopefully this unblocks you :)

BTW, there's also the mmdblookup utility installed on those machines, which will let you look up values by hand, give you the field structure & data types, etc.
https://maxmind.github.io/libmaxminddb/mmdblookup.html
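
As a rough sketch of how such a by-hand lookup could be scripted, the Python helper below builds the `mmdblookup` command line using its documented `--file` and `--ip` options, followed by an optional data path. The function names are hypothetical, and the file path in the usage note simply combines the directory mentioned above with one of the GeoIP2 file names, which is an assumption.

```python
import subprocess
from typing import List


def mmdblookup_argv(db_path: str, ip: str, *data_path: str) -> List[str]:
    """Build the argument list for one mmdblookup call.

    data_path optionally narrows the output to one field,
    e.g. ("country", "iso_code").
    """
    return ["mmdblookup", "--file", db_path, "--ip", ip, *data_path]


def run_mmdblookup(db_path: str, ip: str, *data_path: str) -> str:
    """Run mmdblookup and return its raw text output.

    Raises CalledProcessError if the lookup fails (for example, when the
    address is not in the database) and FileNotFoundError if the
    mmdblookup binary is not installed.
    """
    result = subprocess.run(
        mmdblookup_argv(db_path, ip, *data_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On an app server this might be called as `run_mmdblookup("/usr/share/GeoIP/GeoIP2-City.mmdb", "192.0.2.1", "country", "iso_code")`; the output is mmdblookup's own text format, which includes the value's data type.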

Marking as invalid, since working with the flat file meets AHT's needs. We may re-open if there are performance or other issues in the future.

Thanks @CDanis, this is really helpful. We're currently responding to performance and security reviews via our test environment, using the free databases. We may have more questions when we're ready to move over to production.

For now, do you know how often the databases are updated? I see that some haven't been updated since summer, but some have been updated since T264838#6534393 (those with Oct 4 date now have Nov 1, so I would guess monthly or weekly?)...

Thanks @CDanis, this is really helpful. We're currently responding to performance and security reviews via our test environment, using the free databases. We may have more questions when we're ready to move over to production.

Sounds good!

For now, do you know how often the databases are updated? I see that some haven't been updated since summer, but some have been updated since T264838#6534393 (those with Oct 4 date now have Nov 1, so I would guess monthly or weekly?)...

Yeah, sorry, there's a variety of legacy files also in that directory.

In production you should use the GeoIP2 databases.

As compared to some of the other, older formats: GeoIP2 is more accurate, supports IPv6, and is not due to be deprecated around 2022.

The GeoIP2 files are updated weekly, usually on Tuesdays.

Depending on which free database you've been using to do your testing, and which fields you care about, GeoIP2's structure might be somewhat different -- you should probably also do some tests against the GeoIP2 test files. Just be aware those test files are pretty sparse in terms of IP address coverage, so you'll need to hunt down some addresses defined in them -- the JSON source data in that repo can help.
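
To make the "hunt down some addresses" step concrete, here is a small Python sketch that pulls the defined networks out of one of those JSON source files. It assumes the file is a JSON array whose elements are objects keyed by a network in CIDR notation; that layout is an assumption to verify against the repo, and the sample data below is invented for illustration. The function name `sample_addresses` is hypothetical.

```python
import ipaddress
import json
from typing import List, Tuple


def sample_addresses(source_json: str) -> List[Tuple[str, str]]:
    """Return (network, one address inside it) for each network defined.

    Assumes the JSON is a list of objects whose keys are CIDR networks.
    """
    samples = []
    for entry in json.loads(source_json):
        for cidr in entry:
            # strict=False tolerates keys that have host bits set.
            net = ipaddress.ip_network(cidr, strict=False)
            samples.append((str(net), str(net.network_address)))
    return samples


# Invented sample in the assumed layout, using documentation-only ranges.
SAMPLE_SOURCE = json.dumps([
    {"192.0.2.0/24": {"country": {"iso_code": "XX"}}},
    {"2001:db8::/32": {"country": {"iso_code": "YY"}}},
])
```

Feeding the second element of each pair to a lookup against the matching test .mmdb file should return a record, which makes these addresses reasonable fixtures for integration tests.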

@CDanis Thanks for the help so far. These are the MaxMind files that we appear to have available on production (from T264838#6534393):

GeoIP2-City.mmdb
GeoIP2-Connection-Type.mmdb
GeoIP2-Country.mmdb
GeoIP2-ISP.mmdb

MaxMind does offer more databases, but we don't have them available: https://dev.maxmind.com/geoip/

We were planning on using some of this additional data if these files were available.

Are they left out by design, because of licensing, or some other reason? Is there a way to add them?

Maybe @BBlack might know.

Re-opening this until we know whether we can use the production .mmdb files on the Beta Cluster.

Prtksxna added subscribers: STran, Prtksxna.

@STran says it's OK to close: "Production files can only be used in production."