Page MenuHomePhabricator

IPInfo MediaWiki extension depends on presence of maxmind db in the container/host
Closed, ResolvedPublicBUG REPORT

Event Timeline

Aklapper renamed this task from IPInfo MediaWiki extension : https://www.mediawiki.org/wiki/Extension:IPInfo depends on presence of maxmind db in the container/host to IPInfo MediaWiki extension depends on presence of maxmind db in the container/host.Aug 7 2021, 9:05 AM
Aklapper updated the task description. (Show Details)
wkandek triaged this task as Medium priority.Aug 7 2021, 5:12 PM
wkandek added a project: serviceops.

First, I think the solution picked depends on whether we expect other services or extensions to want to use this GeoIP information in the future. Is this just a one-off for IPInfo or will other services/extensions want this information too?

I am not sure whether it makes sense to have all of MediaWiki have access to the GeoIP databases, we could have a microservice wrapper around them that IPInfo calls out to. I'm reminded of T59126, "...access to GeoIP look-ups has monetary, legal & performance implications" (performance isn't probably an issue here, that was mostly for client-side). I realize this is more technical work so I'm not strictly recommending that and plus the microservice would have the exact same problem of getting the database into a running container. So,

I think there are two main options around the problem of getting the database into a running k8s pod:

  1. Embed the database in the image. This is non-ideal because:
    • The databases are private and cannot be freely published. The image will need to be restricted/ (this is already taken care of for MediaWiki)
    • Any change to the database requires deploying the new image, which is currently a manual process, whereas currently GeoIP databases are updated automatically.
  2. Add the database to all k8s nodes and mount it into the pod. This is non-ideal because:
    • It introduces a dependency upon the host system.

Regarding 1 - a maxmind db change requires a new MediaWiki image and deployment. I believe the current frequency of MediaWiki deployments (multiple times a week) is good enough for the maxmind databases. The automated update of databases to the build system and the inclusion in the next image should be good enough.

I think that if we need to add the maxmind database (I'm not sure if we also need an extension), probably the best two options I can think of are:

  • Build the database in the restricted image This would likely mean we'd update the database every time there is a train release or we do an emergency release.
  • Provide the data as a ConfigMap in k8s This would mean we'd update the database with every deployment (so usually multiple times per day).

I do prefer the second option for multiple reasons:

  1. It will mean faster releases of updates to the database
  2. It will reduce the amount of stuff that we bake into the image and is thus harder to audit in production.
  3. Given the data is anyways injected by puppet it seems more natural to keep things in the right place.

As far as implementation goes, the best way to do this is probably to just deliver the geoip files on disk on deploy1002 and just read the files from the chart directly. The files will thus need to be readable to deployment users.

I would prefer the 2nd option as well (configmap), but alas, configmaps have a size limit of 1MB[1], due to etcd having that size limit (there is some nuance here, it's not strictly 1MB, but a little bit more, but this is irrelevant and it's safer to just say 1MB). And a quick du in /usr/share/GeoIP points out that most files are well over 1MB (only GeoLite.dat is smaller). So, unfortunately the ConfigMap solution would not work. Shipping it on the hosts via puppet and mounting it in the container is incredibly ugly and creates dependencies on the host level that we 've avoided up to now and I 'd prefer if we kept it like that. Which means we 'll need to bake it into the image unfortunately. The alternative of isolating that in a microservice is a nice idea but it suffers form the same problem as @Legoktm already said.

[1] https://github.com/kubernetes/kubernetes/issues/19781

Sigh somehow I forgot to check the size of the geoip files :/

I love the idea of the microservice because it would solve the problem of geoip lookups for all services we might have in the future. There are third-party services for this, for example https://github.com/ProtocolONE/geoip-service so the development cost would mostly have to do with adding support to it to the extension we're deploying.

Sigh somehow I forgot to check the size of the geoip files :/

I love the idea of the microservice because it would solve the problem of geoip lookups for all services we might have in the future. There are third-party services for this, for example https://github.com/ProtocolONE/geoip-service so the development cost would mostly have to do with adding support to it to the extension we're deploying.

Yes, but this is a larger project and since the solution seems to need to be the same (having the files in the image) we are probably not gaining something in the short term by following down this path.

I believe the current frequency of MediaWiki deployments (multiple times a week) is good enough for the maxmind databases. The automated update of databases to the build system and the inclusion in the next image should be good enough.

That seems right. There was a comment somewhere that the expected frequency of updates at upstream is about weekly, but we couldn't be exactly sure so we are just trying to pull from MaxMind once daily but don't expect a change every day.

a quick du in /usr/share/GeoIP points out that most files are well over 1MB (only GeoLite.dat is smaller).

So.. the "legacy" databases we have always had, before this new request for the IPInfo extension, they are /usr/share/GeoIP and yes, their total is 289M currently with the largest file 125M.

And the new databases we are now fetching using a separate license since the request from Anti-Harassement in T288844, they are /usr/share/GeoIPInfo. I kept them totally separate on purpose so far. There are just 2 files (product IDs) they had requested, GeoIP2-Anonymous-IP.mmdb (9.8MB) and GeoIP2-Enterprise.mmdb (361MB). So technically if we are talking only about the IPInfo extension case it is limited to that and just 2 files but total even larger.

I deployed the latter to mediawiki::canary_appserver, mediawiki::appserver, mediawiki::appserver::canary_api, mediawiki::appserver::api). It's controlled through fetch_ipinfo_dbs key in Hiera. I did not use the role::mediawiki::common, so they are not also on mwmaint and mwlog, unlike "legacy" databases but I don't think they have to/should be there.

Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volume?

Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volume?

Please no. That would instantly make kubernetes nodes dependent on a deployment (never mind the deployment tool) for them to be functional for (a subset using that resource) of pods. Race conditions would raise their ugly heads while deploying pods and their troubleshooting would be difficult as it would be dependent on a resource that is not deployed via the same tooling (helm/helmfile) that is used to deploy the actual pods that would make use of this resource. And this isn't limited to the GeoIP databases, it applies to any resource that would be used like that.

Abstracting this functionality from shipping files to a standalone internal service would be my preferred way to go here.

Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volume?

Please no. That would instantly make kubernetes nodes dependent on a deployment (never mind the deployment tool) for them to be functional for (a subset using that resource) of pods.

Understood. Thanks for the explanation. I was just exploring ideas, but I can see now why it would be problematic.

Abstracting this functionality from shipping files to a standalone internal service would be my preferred way to go here.

Yes, agreed that would definitely be preferable.. Feel free to let me know if I can help in any way with this, when the time comes.

Basically we set wgIPInfoGeoIP2EnterprisePath to /usr/share/GeoIPInfo/. That directory gets populated on the appservers (but NOT on the deployment servers!) via puppet.

My proposal would be to have that directory be distributed via puppet to our deployment server, then make-container-image in mediawiki-tools-releases should just copy over the file when rebuilding the image from scratch, not otherwise. This would give us a bi-weekly update which is all we need until further proof.

Change 868199 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] mediawiki: download geoip databases on deployment servers

https://gerrit.wikimedia.org/r/868199

.../usr/share/GeoIPInfo/. That directory gets populated on the appservers (but NOT on the deployment servers!) via puppet.

My proposal would be to have that directory be distributed via puppet to our deployment server..

Hi, hope I am not stepping on any toes but here is a patch to do that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868199
because I was looking at that in the past.

Originally I expected we have to add the geoip class to the deployment_server profile and such
but while trying to do that I realized all this really needs is a single line in Hiera.

So it's pretty trivial already, yay.

Change 868391 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[mediawiki/tools/release@master] build-mv-image: Copy GeoIP data on build from base

https://gerrit.wikimedia.org/r/868391

Change 868122 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/release@master] multiversion-base: Include /usr/share/{GeoIP,GeoIPInfo} from host if available

https://gerrit.wikimedia.org/r/868122

After a discussion on IRC:

  • We're not happy to add another 500 MB to the docker image
  • It would make deployment times longer, depend on deployments to update the geoip database, creating a big discrepancy with mediawiki
  • It's possible to ensure that geoip is present on the disk of the k8s nodes before the kubelet starts.

so we could go with a hostPath to mount both of the geoip directories that are used today inside the kubernetes pods.

Change 870565 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] P:kubernetes::mediawiki_runner: Copy GeoIP data

https://gerrit.wikimedia.org/r/870565

Change 868199 abandoned by Dzahn:

[operations/puppet@production] mediawiki: download geoip databases on deployment servers

Reason:

another solution is preferred, see ticket comments

https://gerrit.wikimedia.org/r/868199

Change 868122 abandoned by Ahmon Dancy:

[mediawiki/tools/release@master] multiversion-base: Include /usr/share/{GeoIP,GeoIPInfo} from host if available

Reason:

A different approach is being considered at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/870660

https://gerrit.wikimedia.org/r/868122

Change 870565 merged by Clément Goubert:

[operations/puppet@production] P:kubernetes::mediawiki_runner: Copy GeoIP data

https://gerrit.wikimedia.org/r/870565

GeoIP data copied to all mw-on-k8s kubernetes hosts.

Change 868391 abandoned by Clément Goubert:

[mediawiki/tools/release@master] build-mv-image: Copy GeoIP data on build from base

Reason:

Different approach was chosen - T288375

https://gerrit.wikimedia.org/r/868391

PSP needs to be updated before we can deploy.

Change 875812 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add GeoIP data to chart

https://gerrit.wikimedia.org/r/875812

Change 875812 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add GeoIP data to chart

https://gerrit.wikimedia.org/r/875812

Deployed on mw-debug and works.
Should it be enabled for all mw-on-k8s deployments, or are there some cases where it's not needed?

Deployed on mw-debug and works.
Should it be enabled for all mw-on-k8s deployments, or are there some cases where it's not needed?

It is probably not needed on jobrunners, but I wouldn't take that chance. Enabling it everywhere is the safe path.

Change 875891 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: enable geoip by default

https://gerrit.wikimedia.org/r/875891

Change 875891 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: enable geoip by default

https://gerrit.wikimedia.org/r/875891

geoip is now enabled by default in the mediawiki chart, and deployed to all current mw-on-k8s services. Resolving.