Upgrade prod DNS daemons to gdnsd 2.2.0
Closed, Resolved (Public)

Description

We're currently running 2.1.0 and see no obvious problems with it in practice, but 2.1.1 contains a fix for sendmmsg() error handling that we should be running, and 2.2.0 incorporates that plus libmaxminddb (geoip2) support...

Event Timeline

BBlack raised the priority of this task from to Medium.
BBlack updated the task description. (Show Details)
BBlack added projects: acl*sre-team, Traffic.
BBlack subscribed.
faidon subscribed.

I need to do at least the packaging part for Debian anyway (and the libmaxminddb part has been done already, needs an upload). Hopefully RSN :)

Update: The target here is still 2.2.0 on jessie for all NS boxes, current status is:
rubidium: 2.1.0 on precise
baham: 2.1.0 on trusty
eeden: 2.1.2 on jessie
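
For tracking, the status above can be turned into a quick check of what is still pending per host. This is only a minimal sketch: the host data is hard-coded from this update rather than queried from the machines, and `needs_work` and `parse_version` are hypothetical helper names, not part of any existing tooling.

```python
# Target from this task: gdnsd 2.2.0 on jessie for all NS boxes.
TARGET = {"version": (2, 2, 0), "distro": "jessie"}

# Status as reported in the update above (hard-coded, not queried live).
STATUS = {
    "rubidium": ("2.1.0", "precise"),
    "baham": ("2.1.0", "trusty"),
    "eeden": ("2.1.2", "jessie"),
}


def parse_version(text):
    """Turn '2.1.2' into (2, 1, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def needs_work(status, target=TARGET):
    """Return {host: [reasons]} for every host not yet at the target."""
    wanted = ".".join(str(n) for n in target["version"])
    pending = {}
    for host, (version, distro) in status.items():
        reasons = []
        if distro != target["distro"]:
            reasons.append(f"reinstall: {distro} -> {target['distro']}")
        if parse_version(version) < target["version"]:
            reasons.append(f"upgrade gdnsd: {version} -> {wanted}")
        if reasons:
            pending[host] = reasons
    return pending


if __name__ == "__main__":
    for host, reasons in needs_work(STATUS).items():
        print(host, "; ".join(reasons))
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.10.0" would sort before "2.2.0".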

Update, on the Debian front:

  • gdnsd 2.1.2-1 is in Debian unstable/testing. 2.1.2-1~deb8u1 is in stable-proposed-updates and will be part of Debian 8.1 in a few days. It's also, temporarily, in jessie-wikimedia.
  • libmaxminddb was uploaded to Debian and is in NEW. I'll backport for our -wikimedia suites next.
  • I've done some work towards gdnsd 2.2.0 Debian packages — still incomplete. It will need libmaxminddb to pass through NEW to reach Debian, so it's very likely we'll have it first at jessie-wikimedia.

In the meantime, it will simplify things greatly to have the same distribution across all NSes, both for debugging purposes and for building packages once. On that:

  • eeden was renamed and reformatted with jessie a couple of weeks ago and runs 2.1.2.
  • baham was just reformatted with jessie and consequently runs 2.1.2. It's serving traffic normally.
  • rubidium will follow next. Since baham/rubidium are the fallback for each other, I'll give it another day to be extra safe.
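
The "give it another day" rule for the baham/rubidium fallback pair can be sketched as a simple guard. This is an illustration only: the one-day soak period comes from the comment above, but `safe_to_reinstall`, the `FALLBACK` mapping and the `last_changed` bookkeeping are hypothetical and do not correspond to any existing tooling.

```python
from datetime import datetime, timedelta

# "Another day", as stated in the comment above.
SOAK = timedelta(days=1)

# baham and rubidium are each other's fallback; eeden has no listed peer.
FALLBACK = {"baham": "rubidium", "rubidium": "baham"}


def safe_to_reinstall(host, last_changed, now, soak=SOAK):
    """True if the host's fallback peer has been stable for the soak period.

    last_changed maps host name -> datetime of its most recent reinstall.
    Hosts without a fallback peer, or whose peer was never touched,
    are not blocked by this check.
    """
    peer = FALLBACK.get(host)
    if peer is None or peer not in last_changed:
        return True
    return now - last_changed[peer] >= soak
```

The check only encodes the ordering constraint; whether the peer is actually serving traffic normally still has to be verified separately.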

Finally, there is also the matter of the Jenkins authdns-lint jobs — we should run tests on jessie, not Ubuntu, both to avoid building 2.2 for Ubuntu and more importantly, to have a testing environment close to production. We'll have to poke @hashar.

The operations-dns-lint job runs on Jenkins slaves in prod (gallium and lanthanum) and is one of the last jobs still running there. I tried earlier to migrate it to a labs instance (T98737) but it fails because there is no GeoLiteCityv6.dat on labs. We can probably figure out a solution.
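
One way to make that failure mode obvious is a pre-flight check that reports missing GeoIP data before the lint run starts. This is a rough sketch, not what the job does today: `missing_data_files` is a hypothetical helper, and the data directory is passed in by the caller because the actual location on the slaves is not stated here.

```python
from pathlib import Path

# The file named in the comment above; extend the tuple if more are needed.
REQUIRED_DATA = ("GeoLiteCityv6.dat",)


def missing_data_files(data_dir, required=REQUIRED_DATA):
    """Return the required data file names that are absent from data_dir."""
    base = Path(data_dir)
    return [name for name in required if not (base / name).is_file()]
```

A wrapper could fail early with the list of missing files instead of letting the lint step fail with a less direct error.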

Also, the Jessie labs slave has a bunch of puppet issues reported on T94836. That makes me uncomfortable about having jobs running on it, because we do not actively maintain it. Some tasks have been resolved already, but there are more pending and more to be discovered :-\

Thanks, responded on T98737.

Well, time to start actively maintaining the Jessie slave then :) We probably have more jessie hosts than precise nowadays, and testing our DNS config on a distribution that is 5 years older than production is too risky.

I created a single Jessie slave to report on package/puppet/upstart errors. Tracking is T94836. It is not a priority though :-\

I definitely agree the linting should be done on the same system as prod. Let's figure out a solution on T98737.

rubidium was just replaced by radon — radon runs jessie now. IOW, all 3 NSes run jessie/2.1.2.

Change 217467 had a related patch set uploaded (by Hashar):
contint: authdns::lint on light Jessie slave

https://gerrit.wikimedia.org/r/217467

Change 217467 merged by Faidon Liambotis:
contint: authdns::lint on light Jessie slave

https://gerrit.wikimedia.org/r/217467

gdnsd 2.2.0 packages were prepared and landed in Debian unstable. libmaxminddb & gdnsd 2.2.0 backports are now in jessie-wikimedia.

integration-lightslave-jessie-1002.eqiad.wmflabs has been upgraded and earlier this week baham was upgraded as well, with no ill effects so far. I plan on upgrading the other two next week (because Friday) and then proceed with the GeoIP->GeoIP2 configuration switch.

This is now done :)