
Puppetize job that saves old versions of Maxmind geoIP database
Closed, Resolved · Public · 13 Estimated Story Points

Description

Puppetize job that saves old versions of geoIP database

This is currently a cron job whose output is emailed to everyone subscribed to the root@ alias, so it would be good to puppetize the cron and have it send its e-mail to analytics-alerts instead.
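For illustration only, a minimal sketch of the crontab entry such a puppetized cron could end up rendering; the /usr/local/bin/geoip-archive.sh script path and the analytics-alerts@wikimedia.org address are assumptions, not taken from this task:

# MAILTO routes the cron's output to the Analytics list instead of root@.
MAILTO=analytics-alerts@wikimedia.org
# Weekly snapshot of the MaxMind databases (script path is a placeholder).
0 5 * * 0 /usr/local/bin/geoip-archive.sh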

On Sun, May 29, 2016 at 7:30 AM, Cron Daemon <root@stat1002.eqiad.wmnet> wrote:
[master 920b423] Adding database files from 2016-05-29
5 files changed, 0 insertions(+), 0 deletions(-)
rewrite GeoIP/GeoIP.dat (81%)
rewrite GeoIP/GeoIP2-Country.mmdb (72%)

Event Timeline

Milimetric triaged this task as Medium priority. Jun 6 2016, 4:38 PM
Milimetric moved this task from Incoming to Dashiki on the Analytics board.

@Milimetric: should we do this, or rather kill the cron and this task? We have not needed these backups so far (in two years).

In my opinion geowiki is a very fragile process and I like having this backup. Killing the cron would not make me feel safe. Fixing geowiki to something more reliable and monitored would be great, though. So if we could do that then I'd be happy to give up on this cron and task.

But wait, is this cron related to geowiki only? This is the GeoIP database backup. cc @Milimetric

@Nuria no, it's a general-purpose backup, but the only place we would actually need it with our current setup is if geowiki processes fail for a while and nobody notices (which is very likely since nobody's really watching it).

Nuria renamed this task from Puppetize job that saves old versions of geoIP database to Puppetize job that saves old versions of Maxmind geoIP database. Apr 4 2018, 4:15 PM

FYI, /home/milimetric/GeoIP-toolbox/MaxMind-database/GeoIP is a git repository with all the backups of the GeoIP databases that we have kept over the years.

Change 425247 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] [wip] Puppetize cron job archiving old MaxMind databases

https://gerrit.wikimedia.org/r/425247

As discussed post-standup, we're planning on incorporating this task into the MaxMind module in Puppet. This would mean:

  • Archiving isn't done by adding commits to a local git repository, but by appending a date to the file names or moving the files to a dated directory.
  • When files change, a new directory with the current date is generated, like puppet:///volatile/GeoIP/archive/20180417. This is then propagated to the /usr/share/GeoIP/ directory of the machines.
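A rough sketch of what that dated-directory step could look like; the paths and the change check are assumptions for illustration, not the actual implementation:

#!/bin/bash
# Copy the current MaxMind files into a dated directory, but only when
# they differ from the most recently archived copy.
SRC=/usr/share/GeoIP
DST=/srv/GeoIP/archive    # assumed location synced to puppet:///volatile/GeoIP/archive
TODAY=$(date +%Y%m%d)

LATEST=$(ls -1d "$DST"/*/ 2>/dev/null | sort | tail -n 1)
if [ -z "$LATEST" ] || ! diff -rq "$LATEST" "$SRC" >/dev/null; then
    mkdir -p "$DST/$TODAY"
    cp -a "$SRC"/. "$DST/$TODAY/"
fi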

The latest data snapshot is 310 MB.
The first snapshot we have recorded is from 2014 and is 98 MB.

fdans@stat1005:~/geoip/MaxMind-database$ du .git/ --max-depth=1 -h
76K	.git/logs
4.0K	.git/branches
40K	.git/hooks
16K	.git/refs
17G	.git/objects
12K	.git/info
17G	.git/

The total size of the archive as it is right now is 17 GB. @Ottomata @elukey is it sustainable to sync 17 GB+ weekly between volatile and /usr/share/GeoIP?

> The total size of the archive as it is right now is 17 GB. @Ottomata @elukey is it sustainable to sync 17 GB+ weekly between volatile and /usr/share/GeoIP?

So my understanding of the proposed plan is that we'd ditch git in favor of archiving files with dates every week, possibly re-creating the past "history" of weekly snapshots from git (or maybe not; I'm not sure if that is a requirement). As far as I can see, every week rsync would only need to move what changed, so the last snapshot plus the new files created (say something like 2 × 310 MB), which is totally feasible. Am I missing something?

Great! 17 GB is a little big for the puppetmasters as they are now, but we can ask ops whether we can expand the partition, or add another one. We'll talk about this with them today.

One nit: can we make the directory name ISO-8601 style? E.g. 2018-04-17.

@elukey yeah that's right. The piece that I'm missing is (and it's probably a trivial thing): how do we put the past archive in /usr/share/GeoIP so that it's accessible by any machine?

@Ottomata hell yeah about the date format

@fdans, puppet will do that.

The puppetmasters are the only ones that include geoip::data::maxmind. Every other puppet client node uses geoip::data::puppet, which syncs the data from the puppetmaster volatile location. We just need to add a $recurse = false parameter to geoip::data::puppet that we can override in places that want to sync the entire GeoIP directory.

I don't feel strongly about this, but I'm a bit skeptical about keeping this in puppet/volatile, given these files are fairly out of scope for Puppet (it wouldn't really ever use this data, as far as I understand). It'd be easy to forget, breakages wouldn't be immediately obvious, etc.

Given that the purpose of this archive is the ability to re-run past analyses on data stored in the Analytics infrastructure, wouldn't it make more sense to just archive these in a directory in HDFS instead?

We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical DBs). We would have just put this as-is in Gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.

Recursively managing a set of large files with Puppet is going to be slow. In particular, subsequent runs after the files have been populated will take a long time to complete.

What if, instead of a file resource with recurse => true, we set up rsync::get to sync these archives out?

OK, now thinking about options for the rsync source/server side... I tend to agree that volatile isn't very well suited for archive storage such as this. If HDFS is out, perhaps we could store the archives on analytics1001 and expose them to clients with an rsync::server config there?

I would strongly suggest that any system that wants to archive GeoIP data from MaxMind should create its own repository of data and NOT use Puppet for it in any way.

For instance, you could have a cronjob on some machine that reads the GeoIP database and stores it on Hadoop (this is just an example). Things not to do:

  • Fill puppet/volatile with data that is only used for archival purposes. Volatile is rsynced between puppetmasters, but it is maybe the least durable/reliable form of data storage we have.
  • Have any software other than Puppet access volatile, besides the rsync that specifically keeps the puppetmasters in sync with each other.
  • Manage any large directory, or worse an ever-growing list of files, via Puppet and recurse => true.

> We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical DBs). We would have just put this as-is in Gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.

Gerrit has "private" repository support, though I'm not sure how good/advisable it is.

> We could do that, but we wanted something centralized and reproducible (e.g. include a puppet class, get the historical DBs). We would have just put this as-is in Gerrit and auto-committed to it, but we can't host it anywhere publicly, since we pay for these files.

What would be the use case? To elaborate, the reason I proposed HDFS was that the only usefulness I can see for the old databases is in combination with data (logs) that can only be found in HDFS, so the capability to access HDFS (either with the native client, or fuse) seems like a given here. Maybe I'm misunderstanding how this would fit in the picture though :)

I don't have much context on how geowiki runs, but storing this in HDFS would be fine. We (I?) just thought it would be better to use some non-Analytics-based way of doing this, and since the GeoIP data already comes from Puppet, we just thought of expanding on that there.

But it sounds like there is strong enough opposition to doing this in Puppet, soooo, @fdans, back to another way! Thanks Faidon and Giuseppe :)

@fdans, let's puppetize a cron that we can install on an Analytics client node, that will create the timestamped directories under /usr/share/GeoIP/archive as you were already working on, and then upload them to a location in HDFS. We won't have the ability to re-sync on any node by just including a puppet class, but HDFS should be good enough archival storage for these. We'll also add a bacula backup instance for /usr/share/GeoIP/archive.
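A minimal sketch of what that cron's payload could look like; the archive root, HDFS destination, and error handling are assumptions for illustration, not the merged implementation:

#!/bin/bash
# Snapshot the MaxMind databases into a dated directory and push it to HDFS.
set -e
ARCHIVE_ROOT=/usr/share/GeoIP/archive
HDFS_DEST=/wmf/data/archive/geoip       # assumed HDFS location
TODAY=$(date +%Y-%m-%d)

mkdir -p "$ARCHIVE_ROOT/$TODAY"
for f in /usr/share/GeoIP/*.dat /usr/share/GeoIP/*.mmdb; do
    if [ -e "$f" ]; then
        cp -a "$f" "$ARCHIVE_ROOT/$TODAY/"
    fi
done

# Re-running on the same day would fail here because the dated directory
# already exists in HDFS; a real job would need to guard against that.
hdfs dfs -mkdir -p "$HDFS_DEST"
hdfs dfs -put "$ARCHIVE_ROOT/$TODAY" "$HDFS_DEST/"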

Got it, yeah uploading to HDFS seems pretty sensible. The only documented application for this archive is history reconstruction, so it makes sense to just have it there.

Change 425247 abandoned by Fdans:
Puppetize cron job archiving old MaxMind databases

Reason:
Abandoning in favor of a different solution agreed in phab task

https://gerrit.wikimedia.org/r/425247

OK, so to determine the periodicity of the cron job, I ran a city query over ~17,000 IP addresses with:

  • The most current GeoIP data
  • Data from a week ago
  • Data from a month ago
  • Data from three months ago

https://docs.google.com/spreadsheets/d/16wuKS1N6vQ4hqS8EY7fXWQTgJAuIPaTuYV9KTA_df9I/edit?usp=sharing
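As an aside, one way such a per-IP comparison could be scripted, as a sketch only; mmdblookup (from libmaxminddb), the database paths, and ips.txt are assumptions about tooling, not necessarily what was used to build the spreadsheet:

#!/bin/bash
# Count how many IPs resolve to the same city in two database snapshots.
CURRENT=/usr/share/GeoIP/GeoIP2-City.mmdb
OLD=/usr/share/GeoIP/archive/2018-03-17/GeoIP2-City.mmdb   # example older snapshot
same=0; total=0
while read -r ip; do
    a=$(mmdblookup --file "$CURRENT" --ip "$ip" city names en 2>/dev/null)
    b=$(mmdblookup --file "$OLD" --ip "$ip" city names en 2>/dev/null)
    [ "$a" = "$b" ] && same=$((same + 1))
    total=$((total + 1))
done < ips.txt
echo "scale=2; 100 * $same / $total" | bc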

The consistency of the older data compared with the current data is:

  • equal to one week before: 96.56%
  • equal to one month before: 86.81%
  • equal to three months before: 79.64%

As far as periodicity goes, note that MaxMind states that GeoIP2 Country and City are updated every Tuesday and the rest every 1-4 weeks, so a weekly cronjob every Wednesday sounds like it would do the trick.
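For reference, a weekly Wednesday schedule in cron syntax would look like the following (the time of day and script path are placeholders):

# minute hour day-of-month month day-of-week (3 = Wednesday)
0 5 * * 3 /usr/local/bin/geoip-archive.sh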

Thanks @faidon. We were just checking whether, if the accuracy of the old databases turned out to be really high, we could schedule the job less often, but yeah, wow, the data changes a lot even in a week. So I vote for keeping it weekly as well (that's the frequency we use now).

Change 430067 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] Small improvements to the geoip archive script

https://gerrit.wikimedia.org/r/430067

mforns set the point value for this task to 13. May 7 2018, 3:38 PM

Change 430067 merged by Ottomata:
[operations/puppet@production] Small improvements to the geoip archive script

https://gerrit.wikimedia.org/r/430067