
Fix Maxmind geoip database archive
Closed, Resolved, Public

Description

There is an issue on stat1007 for a systemd timer currently:

elukey@stat1007:~$ sudo systemctl -a | grep failed
● archive-maxmind-geoip-database.service                                                                         loaded    failed   failed    Archives Maxmind GeoIP files
elukey@stat1007:~$ sudo journalctl -u archive-maxmind-geoip-database.service
-- Logs begin at Mon 2020-09-28 10:00:42 UTC, end at Wed 2020-09-30 06:06:27 UTC. --
Sep 29 05:30:00 stat1007 systemd[1]: Started Archives Maxmind GeoIP files.
Sep 29 05:30:00 stat1007 kerberos-run-command[119440]: The user keytab that you are trying to use (/etc/security/keytabs/analytics/analytics.keytab) doesn't exist or ..
Sep 29 05:30:00 stat1007 systemd[1]: archive-maxmind-geoip-database.service: Main process exited, code=exited, status=1/FAILURE
Sep 29 05:30:00 stat1007 systemd[1]: archive-maxmind-geoip-database.service: Unit entered failed state.
Sep 29 05:30:00 stat1007 systemd[1]: archive-maxmind-geoip-database.service: Failed with result 'exit-code'.

A while ago we decided not to deploy the analytics user's keytab on stat100x hosts, but only on analytics-admins-only nodes (like an-launcher1002). I didn't remove the files manually, and neither did puppet, so the keytab remained available until Tobias reimaged stat1007 to Buster.
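A pre-flight check along these lines could make the missing keytab visible before the timer fires. This is only a sketch: `keytab_path`, `has_keytab` and the `KEYTAB_BASE` override are hypothetical names, and the `<base>/<user>/<user>.keytab` layout is an assumption generalized from the single path shown in the log above.

```shell
# Base directory taken from the path in the log above; override for testing.
KEYTAB_BASE="${KEYTAB_BASE:-/etc/security/keytabs}"

# Hypothetical helper: print the keytab path for a user, assuming the
# <base>/<user>/<user>.keytab layout seen in the log holds for other users.
keytab_path() {
    printf '%s/%s/%s.keytab\n' "$KEYTAB_BASE" "$1" "$1"
}

# Succeeds only if the keytab exists and is readable by the current user.
has_keytab() {
    [ -r "$(keytab_path "$1")" ]
}
```

For example, `has_keytab analytics || echo "no readable keytab for analytics"` would report the situation described above, provided it runs as a user allowed to read the keytab directory.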

There are some possibilities:

  1. We deploy the analytics user keytab on all stat boxes again. This could be handy for the team, since we wouldn't need to remember that the analytics user can run only on an-launcher/coord nodes, and this timer would start working again.
  2. We run the timer as a different user, like analytics-privatedata, that is also present on the stat100x hosts.
  3. We do something more radical and refactor the script that the timer runs so that it no longer keeps a huge backup (~80G) on the host where it runs, and only uploads snapshots of the MaxMind db to hdfs. Then we move the timer to a host that is meant to execute jobs, like an-launcher1002. (Timers on stat100x hosts are only present on 1007 due to old use cases; in theory there shouldn't be any, and removing them would reduce tech debt.)
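Option 3 could look roughly like the sketch below, which copies the local MaxMind directory straight into a dated HDFS directory without keeping a local archive. It is not the real script from the puppet repo: the function names, the `HDFS` override variable (useful for a dry run), and the one-directory-per-day layout are assumptions for illustration. Only `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` are standard HDFS shell commands.

```shell
# Hypothetical sketch of an HDFS-only archive step (not the real puppet script).
set -eu

# Command used to talk to HDFS; set HDFS="echo hdfs" for a dry run.
HDFS="${HDFS:-hdfs}"

# Build the dated target directory, e.g. <hdfs_base>/2020-10-01.
# The one-directory-per-day layout is an assumption.
snapshot_dir() {
    local hdfs_base="$1" day="$2"
    printf '%s/%s\n' "${hdfs_base%/}" "$day"
}

# Copy every file in the local source directory into the dated HDFS directory.
# Nothing is written to local disk, so no local archive accumulates.
archive_to_hdfs() {
    local src="$1" hdfs_base="$2" day="${3:-$(date -u +%F)}"
    local target
    target="$(snapshot_dir "$hdfs_base" "$day")"
    $HDFS dfs -mkdir -p "$target"
    $HDFS dfs -put -f "$src"/* "$target"/
}
```

With `HDFS="echo hdfs"` set, calling `archive_to_hdfs /usr/share/GeoIP /wmf/data/archive/geoip` only prints the commands it would run, which makes it easy to review the target paths before touching the cluster.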

@razzi @fdans is this something that you could work on this week?

Event Timeline

Thanks for filing @elukey. I like option 3 the best, but if I remember correctly, the reason for keeping the backup on the host is to make it easier to use when past snapshots are needed to reconstruct history in some way. I'm not sure if removing that backup would be an obstacle for that purpose. What do @Nuria and @Milimetric think?

I think it will be fine to archive to hdfs alone. We use those files, but sparingly, so I do not think it is an issue if they are available only in hdfs.

@razzi, see archive.sh in the puppet repo for context

Once we change the backups, let's update the docs at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Geolocation

Mentioned in SAL (#wikimedia-analytics) [2020-10-01T05:58:50Z] <elukey> execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata /wmf/data/archive/geoip" - T264152

Change 631330 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] geoip::data::archive: run the systemd timer as analytics-privatedata

https://gerrit.wikimedia.org/r/631330

Mentioned in SAL (#wikimedia-analytics) [2020-10-01T06:04:22Z] <elukey> execute "sudo chown -R analytics-privatedata:analytics-privatedata-users /srv/geoip/archive" on stat1007 - T264152

Change 631330 merged by Elukey:
[operations/puppet@production] geoip::data::archive: run the systemd timer as analytics-privatedata

https://gerrit.wikimedia.org/r/631330

As a temporary solution I have implemented 2), namely moving the timer to the analytics-privatedata user (we have the keytab on the host). I have also moved the file ownership to analytics-privatedata-users rather than wikidev, which seems more appropriate. The script currently takes a daily snapshot and doesn't reconstruct old ones if they are missing, so to avoid losing too many days/snapshots I proceeded with solution 2).

Long term I'd still love to have solution 3), but now this task can be scheduled with lower priority.

Thanks, and agreed on solution 3). I have assigned this to @razzi and we can take it up as part of regular development.

@razzi we can pair up on this if you have any questions or get stuck! It'll be a nice refresher for me too :)

Change 631896 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Archive Maxmind database files to hadoop only

https://gerrit.wikimedia.org/r/631896

Change 633032 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] geoip: move archive timer from stat1007 to an-launcher1002

https://gerrit.wikimedia.org/r/633032

Change 631896 merged by Razzi:
[operations/puppet@production] geoip: archive MaxMind database to hdfs only

https://gerrit.wikimedia.org/r/631896

Deployed the first part of this: stop backing up MaxMind files locally and put them directly on hdfs.
Today is Tuesday, so the weekly backup already ran earlier today. Tomorrow we can run this change manually and check that the files are backed up as we expect.

Tested the backup script and confirmed files are still backed up to hdfs.

Change 633032 merged by Razzi:
[operations/puppet@production] geoip: move archive timer from stat1007 to an-launcher1002

https://gerrit.wikimedia.org/r/633032

I deployed this to an-launcher1002 but got a permission error:

razzi@an-launcher1002:~$ sudo journalctl -u archive-maxmind-geoip-database.service | cat
-- Logs begin at Wed 2020-10-21 17:14:05 UTC, end at Wed 2020-10-21 20:49:13 UTC. --
Oct 21 20:45:59 an-launcher1002 systemd[1]: Started Archives Maxmind GeoIP files.
Oct 21 20:45:59 an-launcher1002 kerberos-run-command[9664]: User analytics executes as user analytics the command ['/usr/local/bin/geoip_archive.sh', '/usr/share/GeoIP', '/wmf/data/archive/geoip']
Oct 21 20:45:59 an-launcher1002 kerberos-run-command[9664]: Copying /usr/share/GeoIP into HDFS at /wmf/data/archive/geoip
Oct 21 20:46:01 an-launcher1002 kerberos-run-command[9664]: mkdir: Permission denied: user=analytics, access=WRITE, inode="/wmf/data/archive/geoip":analytics-privatedata:analytics:drwxr-xr-x
Oct 21 20:46:01 an-launcher1002 systemd[1]: archive-maxmind-geoip-database.service: Main process exited, code=exited, status=1/FAILURE
Oct 21 20:46:01 an-launcher1002 systemd[1]: archive-maxmind-geoip-database.service: Failed with result 'exit-code'.

So I reverted it with https://gerrit.wikimedia.org/r/c/operations/puppet/+/635590 for now.

Hm, I think you can run "hdfs dfs -chmod 775 /wmf/data/archive/geoip" and it should work.
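The error above lists the directory as analytics-privatedata:analytics with mode drwxr-xr-x (755), so the group can read and traverse it but not write to it. A small sketch of why 775 would help, assuming the analytics user reaches the directory through the group bits; `group_can_write` is a hypothetical helper, not part of any HDFS tooling:

```shell
# Hypothetical helper: succeeds when an octal mode grants write to the group.
group_can_write() {
    # Prefix with 0 so the shell reads the mode as octal, then test bit 020.
    [ $(( 0$1 & 020 )) -ne 0 ]
}

group_can_write 755 || echo "755: group cannot write"
group_can_write 775 && echo "775: group can write"
```

Mode 775 adds only the group write bit relative to 755, so the owner (analytics-privatedata) keeps full access and other users still cannot write.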

Change 636517 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] geoip: cleanup having moved archiving to launcher

https://gerrit.wikimedia.org/r/636517

Change 636517 merged by Razzi:
[operations/puppet@production] geoip: cleanup having moved archiving to launcher

https://gerrit.wikimedia.org/r/636517