
IPInfo MediaWiki extension depends on presence of maxmind db in the container/host
Open, Medium, Public, BUG REPORT

Event Timeline

Aklapper renamed this task from "IPInfo MediaWiki extension : https://www.mediawiki.org/wiki/Extension:IPInfo depends on presence of maxmind db in the container/host" to "IPInfo MediaWiki extension depends on presence of maxmind db in the container/host". Aug 7 2021, 9:05 AM
Aklapper updated the task description.
wkandek triaged this task as Medium priority. Aug 7 2021, 5:12 PM
wkandek added a project: serviceops.

First, I think the solution picked depends on whether we expect other services or extensions to want to use this GeoIP information in the future. Is this just a one-off for IPInfo or will other services/extensions want this information too?

I am not sure whether it makes sense for all of MediaWiki to have access to the GeoIP databases; we could instead have a microservice wrapper around them that IPInfo calls out to. I'm reminded of T59126: "...access to GeoIP look-ups has monetary, legal & performance implications" (performance probably isn't an issue here; that concern was mostly about client-side look-ups). I realize this is more technical work, so I'm not strictly recommending it; besides, the microservice would have the exact same problem of getting the database into a running container. So:
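As a minimal sketch of that microservice idea (everything here is illustrative, not an actual design): a thin HTTP wrapper around a GeoIP lookup that IPInfo could call out to. The lookup table is a stub; a real service would open the MaxMind .mmdb files with a reader library instead.

```python
# Hypothetical sketch of a GeoIP lookup microservice. The stub dict stands
# in for the MaxMind database; endpoint and field names are made up.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the MaxMind database (illustrative data only).
STUB_DB = {
    "192.0.2.1": {"country": "US", "anonymous": False},
}

def lookup(ip):
    """Return GeoIP data for an IP, or None if unknown."""
    return STUB_DB.get(ip)

class GeoIPHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths like /lookup/192.0.2.1
        parts = self.path.strip("/").split("/")
        result = lookup(parts[1]) if len(parts) == 2 and parts[0] == "lookup" else None
        if result is None:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("127.0.0.1", 8080), GeoIPHandler).serve_forever()
```

Note that the service itself would still need the database files inside its own container, which is exactly the problem discussed below.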

I think there are two main options around the problem of getting the database into a running k8s pod:

  1. Embed the database in the image. This is non-ideal because:
    • The databases are private and cannot be freely published, so the image will need to be restricted (this is already taken care of for MediaWiki).
    • Any change to the database requires deploying a new image, which is currently a manual process, whereas GeoIP databases are updated automatically today.
  2. Add the database to all k8s nodes and mount it into the pod. This is non-ideal because:
    • It introduces a dependency upon the host system.
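For illustration, option 2 would look roughly like this in a pod spec (all names and paths here are hypothetical, and the host-level dependency it creates is visible in the hostPath stanza):

```yaml
# Hypothetical sketch of option 2: mounting a host directory into the pod.
apiVersion: v1
kind: Pod
metadata:
  name: mediawiki
spec:
  containers:
    - name: mediawiki
      image: mediawiki:restricted   # illustrative image name
      volumeMounts:
        - name: geoip
          mountPath: /usr/share/GeoIP
          readOnly: true
  volumes:
    - name: geoip
      hostPath:
        path: /usr/share/GeoIP   # must already exist on every k8s node
        type: Directory
```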

Regarding 1 - a maxmind db change requires a new MediaWiki image and deployment. I believe the current frequency of MediaWiki deployments (multiple times a week) is good enough for the maxmind databases. The automated update of databases to the build system and the inclusion in the next image should be good enough.

I think that if we need to add the maxmind database (I'm not sure if we also need an extension), probably the best two options I can think of are:

  • Build the database into the restricted image. This would likely mean we'd update the database every time there is a train release or we do an emergency release.
  • Provide the data as a ConfigMap in k8s. This would mean we'd update the database with every deployment (so usually multiple times per day).

I do prefer the second option for multiple reasons:

  1. It will mean faster releases of database updates.
  2. It will reduce the amount of material we bake into the image, which is harder to audit in production.
  3. Given the data is injected by Puppet anyway, it seems more natural to keep things in the right place.

As far as implementation goes, the best way to do this is probably to deliver the geoip files on disk on deploy1002 and read them from the chart directly. The files will thus need to be readable by the deployment users.
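As a rough sketch of what "read the files from the chart directly" could look like in a Helm template (file and key names are hypothetical), assuming the databases are copied into the chart directory on deploy1002:

```yaml
# Hypothetical Helm template: embed a database file shipped alongside the
# chart as binary data in a ConfigMap. Helm's .Files.Get can only read
# files inside the chart directory, hence copying them there on deploy.
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-geoip
binaryData:
  GeoIP2-Anonymous-IP.mmdb: {{ .Files.Get "geoip/GeoIP2-Anonymous-IP.mmdb" | b64enc }}
```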

I would prefer the 2nd option as well (ConfigMap), but alas, ConfigMaps have a size limit of 1MB[1], because etcd has that size limit (there is some nuance here: it's not strictly 1MB but a little more, but that is irrelevant and it's safer to just say 1MB). And a quick du in /usr/share/GeoIP points out that most files are well over 1MB (only GeoLite.dat is smaller). So, unfortunately, the ConfigMap solution would not work. Shipping the files to the hosts via puppet and mounting them into the container is incredibly ugly and creates dependencies at the host level that we've avoided up to now, and I'd prefer if we kept it that way. Which means we'll need to bake it into the image, unfortunately. The alternative of isolating this in a microservice is a nice idea, but it suffers from the same problem, as @Legoktm already said.

[1] https://github.com/kubernetes/kubernetes/issues/19781
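To make the size argument concrete, here is a back-of-the-envelope check (the ~4/3 factor is base64 inflation for binary data in a ConfigMap; the byte counts are rounded from the file sizes mentioned in this thread):

```python
# Rough sketch: would a database file fit in a ConfigMap? Binary data in
# a ConfigMap is base64-encoded, inflating it by ~4/3, and etcd caps
# objects at roughly 1 MiB.
import math

ETCD_LIMIT = 1 * 1024 * 1024  # the commonly quoted ~1MB ceiling

def fits_in_configmap(raw_size_bytes):
    """True if a file of this raw size would fit once base64-encoded."""
    encoded = math.ceil(raw_size_bytes / 3) * 4
    return encoded < ETCD_LIMIT

print(fits_in_configmap(9_800_000))    # GeoIP2-Anonymous-IP.mmdb -> False
print(fits_in_configmap(361_000_000))  # GeoIP2-Enterprise.mmdb -> False
```

Even the smallest database requested for IPInfo overshoots the limit by roughly an order of magnitude.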

Sigh, somehow I forgot to check the size of the geoip files :/

I love the idea of the microservice because it would solve the problem of geoip lookups for all services we might have in the future. There are third-party services for this, for example https://github.com/ProtocolONE/geoip-service, so the development cost would mostly be adding support for it to the extension we're deploying.

> Sigh, somehow I forgot to check the size of the geoip files :/
>
> I love the idea of the microservice because it would solve the problem of geoip lookups for all services we might have in the future. There are third-party services for this, for example https://github.com/ProtocolONE/geoip-service, so the development cost would mostly be adding support for it to the extension we're deploying.

Yes, but this is a larger project, and since the solution seems to need to be the same (having the files in the image), we are probably not gaining anything in the short term by going down this path.

> I believe the current frequency of MediaWiki deployments (multiple times a week) is good enough for the maxmind databases. The automated update of databases to the build system and the inclusion in the next image should be good enough.

That seems right. There was a comment somewhere that the expected update frequency upstream is about weekly, but we couldn't be exactly sure, so we pull from MaxMind once daily, though we don't expect a change every day.

> a quick du in /usr/share/GeoIP points out that most files are well over 1MB (only GeoLite.dat is smaller).

So, the "legacy" databases we have always had, predating this new request for the IPInfo extension, live in /usr/share/GeoIP; yes, their total is currently 289M, with the largest file at 125M.

And the new databases we are now fetching under a separate license, since the Anti-Harassment request in T288844, live in /usr/share/GeoIPInfo. I have kept them totally separate on purpose so far. There are just 2 files (product IDs) they requested: GeoIP2-Anonymous-IP.mmdb (9.8MB) and GeoIP2-Enterprise.mmdb (361MB). So, technically, if we are talking only about the IPInfo extension case, it is limited to just those 2 files, but their total is even larger.

I deployed the latter to mediawiki::canary_appserver, mediawiki::appserver, mediawiki::appserver::canary_api, and mediawiki::appserver::api. It's controlled through the fetch_ipinfo_dbs key in Hiera. I did not use role::mediawiki::common, so unlike the "legacy" databases they are not also on mwmaint and mwlog, but I don't think they have to/should be there.
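For reference, the Hiera toggle looks roughly like this (the key name comes from the comment above; the namespace is an assumption on my part):

```yaml
# Hypothetical Hiera sketch: enable fetching the IPInfo databases for a
# role. Only the fetch_ipinfo_dbs key itself is confirmed; the prefix is
# illustrative.
profile::mediawiki::fetch_ipinfo_dbs: true
```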

Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volume?

> Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volume?

Please no. That would instantly make kubernetes nodes dependent on a deployment (never mind the deployment tool) in order to be functional for (a subset of) pods using that resource. Race conditions would rear their ugly heads while deploying pods, and troubleshooting would be difficult, since the pods would depend on a resource that is not deployed via the same tooling (helm/helmfile) used to deploy the pods themselves. And this isn't limited to the GeoIP databases; it applies to any resource used like that.

Abstracting this functionality from shipping files to a standalone internal service would be my preferred way to go here.

> Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volume?
>
> Please no. That would instantly make kubernetes nodes dependent on a deployment (never mind the deployment tool) for them to be functional for (a subset of) pods using that resource.

Understood. Thanks for the explanation. I was just exploring ideas, but I can see now why it would be problematic.

> Abstracting this functionality from shipping files to a standalone internal service would be my preferred way to go here.

Yes, agreed, that would definitely be preferable. Feel free to let me know if I can help in any way with this, when the time comes.