
Periodic internal labs dns outages
Closed, Resolved (Public)

Event Timeline


Change 269152 had a related patch set uploaded (by Andrew Bogott):
Turn pdns loglevels WAY UP

https://gerrit.wikimedia.org/r/269152

  • Clients are not failing over to the backup dns server (holmium) at all.
  • There's a spike in pdns_concurrent-queries at the time of the outage.
  • The cache (as per the pdns_cache-entries metric) appears to empty, either before or during the outage. It looks to happen after we're a few minutes in.

"concurrent-queries shows the number of MThreads currently running", so that may be a symptom rather than a cause.

chasemp set Security to None.
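For what it's worth, the concurrent-queries metric mentioned above can also be watched live on the recursor host during an event. A minimal sketch assuming the stock rec_control tooling (the 5-second interval is arbitrary):

# number of MThreads currently in flight on the recursor
rec_control get concurrent-queries
# watch it during an outage; a value climbing toward max-mthreads would
# suggest the recursor is stuck waiting on upstream (auth) answers
watch -n 5 'rec_control get concurrent-queries'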

Change 269152 merged by Andrew Bogott:
Turn pdns loglevels WAY UP

https://gerrit.wikimedia.org/r/269152

Happened again today ~ 15:00Z.

Change 271797 had a related patch set uploaded (by Andrew Bogott):
Disable master/slave bits for labs pdns

https://gerrit.wikimedia.org/r/271797

The above is a partial fix, but as Mark notes, 'then there's still a problem because two pdns auth instances are writing to the same dns db concurrently'.

The current theory is that designate migrated from a "write directly to the pdns db" model to a "sync with mdns via AXFR" model and, due to largely nonexistent migration docs, we're stuck in the middle. I'm going to attempt to switch the backup dns server (on holmium) over to a pure axfr model and see how that plays.

From the openstack mailing list: "As I understand it, in Kilo and later mdns must be primary and send data to other backends via XFR."

> From the openstack mailing list: "As I understand it, in Kilo and later mdns must be primary and send data to other backends via XFR."

And now a designate developer (Kiall) has corrected that: it uses xfr for records but db writes for domains. Both at once!

Kiall has now vetted our config, and I'm about to submit a patch that implements his changes.

He suggested that a possible cause of the lock-ups is one of the pdns servers locking the db and then blocking while waiting for an axfr update. The axfr master specified in the pdns database is a non-loopback IP, so in theory both pdns servers can listen just fine, but I need to make sure there aren't firewall issues.

Ahah! The firewall blocks traffic between pdns on holmium and mdns on labservices1001. That's not specifically what Kiall predicted but it fits his theory. Firewall patch incoming...

Change 272615 had a related patch set uploaded (by Andrew Bogott):
designate: Open firewall to axfr traffic from pdns hosts.

https://gerrit.wikimedia.org/r/272615

Change 272615 merged by Andrew Bogott:
designate: Open firewall to axfr traffic from pdns hosts.

https://gerrit.wikimedia.org/r/272615

Just chiming in to offer some help if moving the pdns database around (which is really easy) is needed, and a reminder to give some love to T126251, which may be semi-related?

This just happened again today, /after/ the above firewall patch was applied. So, that wasn't it! Kiall is hopeful that the 'Master=false' element of the above patch will make a significant difference, so I'm going to work on getting that patch merged and applied today.

Change 271797 merged by Andrew Bogott:
Updates to designate/mdns/pdns setup for Labs internal dns

https://gerrit.wikimedia.org/r/271797

Change 272769 had a related patch set uploaded (by Andrew Bogott):
Change master=no for labs pdns

https://gerrit.wikimedia.org/r/272769

Change 272771 had a related patch set uploaded (by Andrew Bogott):
Updates to designate/mdns/pdns setup for Labs internal dns

https://gerrit.wikimedia.org/r/272771

Change 272769 merged by Andrew Bogott:
Change master=no for labs pdns

https://gerrit.wikimedia.org/r/272769

This just happened again. So I guess this is happening... more often, now that the config is fixed?

It would still be interesting to know what happens at 15:00Z that triggers these outages. The fixed time of day suggests it isn't simply the DNS server going belly up after x hours of running or y requests served. Grepping through operations/puppet for cron and hour *=>.*1[45] only gives:

modules/geowiki/manifests/job/limn.pp:    # cron job to do the actual fetching from the database, computation of
modules/geowiki/manifests/job/limn.pp-    # the limn files, and pushing the limn files to the data repositories
modules/geowiki/manifests/job/limn.pp:    cron { 'geowiki-process-db-to-limn':
modules/geowiki/manifests/job/limn.pp-        minute  => 0,
modules/geowiki/manifests/job/limn.pp-        hour    => 15,
modules/geowiki/manifests/job/limn.pp-        user    => $::geowiki::params::user,
modules/geowiki/manifests/job/limn.pp:        command => "${::geowiki::params::scripts_path}/scripts/make_and_push_limn_files.sh --cron-mode --basedir_public=${::geowiki::params::public_data_path} --basedir_private=${::geowiki::params::private_data_path} --source_sql_cnf=${::geowiki::mysql_conf::conf_file}",
modules/geowiki/manifests/job/limn.pp-        require => [
modules/geowiki/manifests/job/limn.pp-            Git::Clone['geowiki-scripts'],
modules/geowiki/manifests/job/limn.pp-            Git::Clone['geowiki-data-public'],
modules/geowiki/manifests/job/limn.pp-            Git::Clone['geowiki-data-private'],
modules/geowiki/manifests/job/limn.pp-            File[$::geowiki::mysql_conf::conf_file],
modules/geowiki/manifests/job/limn.pp-        ],
modules/geowiki/manifests/job/limn.pp-    }

and:

modules/mediawiki/manifests/maintenance/pagetriage.pp:    cron { 'pagetriage_cleanup_testwiki':
modules/mediawiki/manifests/maintenance/pagetriage.pp-        ensure   => $ensure,
modules/mediawiki/manifests/maintenance/pagetriage.pp-        user     => $::mediawiki::users::web,
modules/mediawiki/manifests/maintenance/pagetriage.pp-        minute   => 55,
modules/mediawiki/manifests/maintenance/pagetriage.pp-        hour     => 14,
modules/mediawiki/manifests/maintenance/pagetriage.pp-        monthday => '*/2',
modules/mediawiki/manifests/maintenance/pagetriage.pp:        command  => '/usr/local/bin/mwscript extensions/PageTriage/cron/updatePageTriageQueue.php testwiki > /var/log/mediawiki/updatePageTriageQueue.test.log',
modules/mediawiki/manifests/maintenance/pagetriage.pp-    }
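(For reference, the grep above was roughly the following; this is a reconstruction, not the exact command that was run:)

# find puppet cron resources that fire in the 14:00-15:00 UTC window
grep -r -n -B 4 -A 9 -E "hour *=> *'?1[45]'?" --include='*.pp' modules/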

What I find a bit odd about P2643 (after the outage):

[…]
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='labs-puppetmaster-codfw.wikimedia.org.multimedia.eqiad.wmflabs'
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='labs-puppetmaster-codfw.wikimedia.org.multimedia.eqiad.wmflabs'
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='wikimedia.org.multimedia.eqiad.wmflabs'
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='wikimedia.org.multimedia.eqiad.wmflabs'

Is that a configuration error in some program/instance, or does that mean that the recursor could not resolve wikimedia.org at that time?

Change 272771 merged by Andrew Bogott:
Updates to designate/mdns/pdns setup for Labs internal dns

https://gerrit.wikimedia.org/r/272771

@scfc: I don't think it's related to the outage, but those queries are very strange! I see lots like that whenever I turn on logging; I'll try to see where they are coming from.

Change 273314 had a related patch set uploaded (by Andrew Bogott):
Lower max-mthreads for pdns recursor.

https://gerrit.wikimedia.org/r/273314

Change 273314 merged by Andrew Bogott:
Lower max-mthreads for pdns recursor.

https://gerrit.wikimedia.org/r/273314

new datapoint: Both labs-ns2 (labservices1001) and labs-ns3 (holmium) caused alerts during the outage on the 24th.

To recap:

Designate writes domain information to the pdns mysql database directly. Records, on the other hand, are relayed to pdns via axfr. That means there are potentially three writers to the pdns database: designate, ns2, and ns3. ns2 and ns3 also need to communicate with mdns, presumably on port 53 for axfr.

Kiall suggested that there might be something happening with locking when writing to the db that's causing both nameservers to block on the write and stop doing anything else. I've addressed all of his specific concerns, but this still seems like a possibility.
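One hedged way to sanity-check that split from the pdns side is to look at the domains table directly. This assumes the stock pdns gmysql schema; the database name and (omitted) credentials here are placeholders:

# on the pdns DB host: which zones are slaved, and which master IP is configured?
mysql pdns -e "SELECT name, type, master, last_check, notified_serial FROM domains ORDER BY last_check DESC LIMIT 20;"

If designate is writing domains as expected, the master column should point at the non-loopback mdns address mentioned above, reachable from both ns2 and ns3.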

seems to have just happened again: similar symptoms and recovery with restarts but it did not coincide with the assumed time of day.

This happened again today, with a similar pattern and time of day.

2:47 UTC or so first reported and then it was flaky until 3:15 UTC or so. We restarted pdns in there at least once.

I've set up https://grafana.wikimedia.org/dashboard/db/labs-dns-dashboard to have some metrics about our DNS systems now - there are more detailed metrics in graphite (response type counts, timing metrics, etc.) if we need them.

Change 285402 had a related patch set uploaded (by Rush):
dnsrecursor update module and template

https://gerrit.wikimedia.org/r/285402

Change 285402 merged by Rush:
dnsrecursor update module and template

https://gerrit.wikimedia.org/r/285402

Change 285412 had a related patch set uploaded (by Rush):
dnsrecursor keep stats on last 1000 queries

https://gerrit.wikimedia.org/r/285412

Change 285412 merged by Rush:
dnsrecursor keep stats on last 1000 queries

https://gerrit.wikimedia.org/r/285412

Change 285418 had a related patch set uploaded (by Rush):
labs dnsrecurser update settings

https://gerrit.wikimedia.org/r/285418

Change 285418 merged by Rush:
labs dnsrecurser update settings

https://gerrit.wikimedia.org/r/285418

Change 285746 had a related patch set uploaded (by Rush):
icinga for labs Auth DNS convert to check_dig

https://gerrit.wikimedia.org/r/285746

Change 285746 merged by Rush:
icinga for labs Auth DNS convert to check_dig

https://gerrit.wikimedia.org/r/285746

Change 285755 had a related patch set uploaded (by Rush):
icinga labs auth dns check update description

https://gerrit.wikimedia.org/r/285755

Change 285755 merged by Rush:
icinga labs auth dns check update description

https://gerrit.wikimedia.org/r/285755

A few notes.

These failures tend to happen in waves lasting 20-30 minutes; they resolve on their own, and they are not always total outages. We have periods of "flakiness" where some percentage of resolution requests succeed, which has meant a few of these were soft outages that went unnoticed. Given the spotty nature of the unavailability and the self-resolution, this smells like a resource contention problem that lasts as long as some abusive client or scheduled job does.

https://doc.powerdns.com/3/authoritative/performance/
https://doc.powerdns.com/3/recursor/scripting/

labservices1001
labs-ns2.wikimedia.org
labs-ns0.wikimedia.org - 208.80.155.117
labs-recursor1.wikimedia.org - 208.80.155.118

holmium
labs-ns3.wikimedia.org
labs-ns1.wikimedia.org - 208.80.154.12
labs-recursor0.wikimedia.org - 208.80.154.20

The changeset previously used to up the verbosity of logging, https://gerrit.wikimedia.org/r/#/c/269152/2/modules/labs_dns/templates/pdns.conf.erb, did cause disk usage issues because log rotation wasn't frequent enough.

Considerations:

  • The PDNS checks have been updated in https://gerrit.wikimedia.org/r/#/c/285746/ with reasoning outlined in T133791. I am hopeful this will surface errors in a saner way. Up to this point the check_dns being used had little ability to interpret the output from the PDNS server. This resulted in false positives and could be obscuring alert situations early in the failure process.
  • The recursive server is now logging the last 1k queries to look for client spikes

rec_control top-remotes

  • We have updated the resolver stanza in nginx for the tools proxy to list multiple IPs. Previously it was only using one, so maintenance on or failure of a single resolver was causing tools home page outages. We have also converted the tools proxy to use IP addresses directly instead of hostnames; when logging behavior, the dynamic proxies were accounting for 50-60% of all queries from labs, so this is a substantial reduction in traffic. There are plans in the works to do the same for the general nova proxy. (A sketch of that resolver stanza follows this list.)
  • We are in the midst of moving from a single DB for both PDNS auth servers to independent DB's in T128737
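A hedged sketch of the resolver stanza mentioned above, using the two labs recursor IPs listed later in this task; the valid/timeout values are illustrative, not necessarily what is deployed:

# nginx (tools dynamic proxy): use both labs recursors so that maintenance on,
# or failure of, a single recursor doesn't take down the tools home page
resolver 208.80.154.20 208.80.155.118 valid=300s;
resolver_timeout 5s;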

Outage notifications I have logged:

2 2016-01-25
1 2016-02-08
2 2016-02-19
1 2016-02-24
1 2016-03-02
2 2016-04-11
4 2016-04-25

Breaks down as:

2016-01-25 - Monday
2016-02-08 - Monday
2016-02-19 - Friday
2016-02-24 - Wednesday
2016-03-02 - Wednesday
2016-04-11 - Monday
2016-04-25 - Monday

Times in CDT:

09:02
09:11
09:13
09:12
14:28
09:13
17:57
09:58
10:04
10:38
10:55
11:05
11:06

restricted.bastion.wmflabs.org
208.80.155.155

Things to confirm during an event (a consolidated shell sketch of the dig checks follows this list):

  • Can both authoritative servers resolve a Labs VM name when queried directly

dig tools-bastion-03.eqiad.wmflabs @208.80.155.117
dig tools-bastion-03.eqiad.wmflabs @208.80.154.12

  • Especially if yes, can the recursive servers?

dig tools-bastion-03.eqiad.wmflabs @208.80.154.20
dig tools-bastion-03.eqiad.wmflabs @208.80.155.118

  • What about baremetal servers for recursor? (should hit a lua script)

metal["promethium.eqiad.wmflabs."] = "10.68.16.2"
metal["promethium.wikitextexp.eqiad.wmflabs."] = "10.68.16.2"

  • What about recursive to 3rd party domains?

google.com, yahoo.com?

  • rec_control current-queries
  • Does pdns_control rping succeed?
  • pdns_control reload change behavior?
  • pdns_control purge change behavior?
  • What is the output of pdns_control ccounts over 3 tries at 10s intervals
  • Is designate pool_manager actively trying to update the database?
  • Depending on which daemon seems to be the focal point of the failure for eqiad.wmflabs domains, strace -f -p the PID to observe and log behavior.
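A rough shell sketch consolidating the dig checks above so they can be pasted quickly during an event; the hostname is just an example, and the IPs are the labs auth/recursor addresses listed earlier:

#!/bin/bash
# quick labs DNS triage during an outage
NAME=tools-bastion-03.eqiad.wmflabs
for ip in 208.80.155.117 208.80.154.12; do
    echo "== auth $ip =="
    dig +time=2 +tries=1 "$NAME" @"$ip" | grep -E 'status:|ANSWER:'
done
for ip in 208.80.154.20 208.80.155.118; do
    echo "== recursor $ip =="
    dig +time=2 +tries=1 "$NAME" @"$ip" | grep -E 'status:|ANSWER:'
    dig +time=2 +tries=1 google.com @"$ip" | grep -E 'status:'
done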

Questions:

What is this also_notifies line for? What is covered here that isn't in the pool_nameserver definition for the secondary pdns?

Does holmium still have ongoing connectivity issues?

On the active designate host I am seeing a lot of errors like:

2016-04-29 21:36:31.815 15976 WARNING designate.pool_manager.service [req-2739eca8-604b-4e32-9bad-7334b14a4359 - - - - -] For domain 68.10.in-addr.arpa. the error serial is 0.

I ran pool-manager in debug mode and collected some output that makes me think communication between designate and holmium has issues:

2016-04-29 20:02:44.856 15378 DEBUG designate.pool_manager.service [req-0699f747-d9fb-459d-8738-4218ce457706 - - - - -] Cache hit! Retrieved status ERROR and serial 1461960156 for domain 68.10.in-addr.arpa. on nameserver 208.80.154.12:53 with action UPDATE from the cache. _retrieve_statuses /usr/lib/python2.7/dist-packages/designate/pool_manager/service.py:666

2016-04-29 20:02:33.125 15378 INFO designate.central.rpcapi [req-e537a95f-0a61-4c03-bcc9-087094fb5826 - - - - -] update_status: Calling central's update_status for 114f1333-c2c1-44d3-beb4-ebed1a91742b : ERROR : 0

This is happening pretty constantly as pool-manager runs through its periodic sync timer. Is holmium having issues understanding AXFR updates and responding?

holmium seems to have a lot of this:

Apr 29 21:36:31 holmium pdns[1498]: Received serial number updates for 0 zones, had 1 timeouts
Apr 29 21:36:38 holmium pdns[1498]: Received serial number updates for 0 zones, had 2 timeouts
Apr 29 21:36:51 holmium pdns[1498]: Received serial number updates for 0 zones, had 2 timeouts
Apr 29 21:37:03 holmium pdns[1498]: Received serial number updates for 0 zones, had 2 timeouts

I don't have time to look at it further today, so I'm not sure whether the fw rule fixes are still missing something or some other issue is causing timeouts from holmium's perspective.
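One hedged way to narrow that down by hand from holmium: check whether it can reach designate's mdns for serial checks and zone transfers at all. The mdns port used here (5354) is designate's usual default and is an assumption, as is labservices1001's FQDN as the mdns host:

# from holmium: can we get the SOA serial from mdns, and does a transfer start?
dig +norecurse +time=3 SOA 68.10.in-addr.arpa @labservices1001.wikimedia.org -p 5354
dig +time=5 AXFR 68.10.in-addr.arpa @labservices1001.wikimedia.org -p 5354 | head -n 20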

Change 286897 had a related patch set uploaded (by Andrew Bogott):
Increase the cache size for the Labs dns recursor

https://gerrit.wikimedia.org/r/286897

Change 286905 had a related patch set uploaded (by Andrew Bogott):
Labs DNS: Change the cache ttls back to defaults.

https://gerrit.wikimedia.org/r/286905

Happened again today ~ 15:00Zish.

OK I caught the tail end of a brief storm this morning.

There is indeed a query storm hitting the pdns recursor, associated with an increase in servfail answers to clients within labs.

[Attached screenshot: Screen Shot 2016-05-04 at 4.01.01 PM.png]

[Attached screenshot: Screen Shot 2016-05-04 at 4.02.32 PM.png]

I don't have a lot of hard evidence yet, but non-labs domains do seem to keep resolving during these outages: when I have been able to catch an occurrence I have tried a few (google, yahoo, etc.) with success even while foo.eqiad.wmflabs was failing. The recursor process itself seems to be serving clients at least some of the time, and I haven't seen it fail when resolving a 3rd-party domain yet (though it could be). So my main suspicion has been that authoritative pdns is the source of the issue.

[Attached screenshot: Screen Shot 2016-05-04 at 4.10.30 PM.png]

This seems to associate powerdns recursor servfail increase with the powerdns authoritative servers timedout-packets increase.

i.e. the pdns auth server is dropping requests and as a result the recursor is forced to tell the labs clients it cannot resolve. pdns auth latency spikes with the increase as well. It seems like pdns auth is choking on an influx of queries basically.

Another view

[Attached screenshot: Screen Shot 2016-05-04 at 4.14.14 PM.png]

It's interesting that servfail packets from the pdns auth server do not show a similar spike. Auth dns is not serving the storm from recursive pdns, but it isn't answering it with errors either; it is essentially dropping the requests on the floor. No protocol response, just timed out / dropped.

My first thought was that this was a further sign of our possible DB woes, but unless the internal metrics for pdns are faulty, that seems not to be the case: rarely do we have a queue of requests waiting on the DB.

[Attached screenshot: Screen Shot 2016-05-04 at 4.16.55 PM.png]

...interestingly, even when we are failing to serve storms of requests, when I would expect at least something here, qsize-q maxes out at 2?

So the contention is seemingly somewhere between the request made by the recursor and the DB, and these "storms" are somewhat paltry (at least some of them) to be affecting pdns in this manner, which makes me wonder whether we have a tuning problem.
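For reference, the counters in those graphs can also be pulled straight off the auth servers during an event; a sketch using standard pdns auth metric names:

# on labservices1001 / holmium: spot-check the auth server's own view
pdns_control show timedout-packets   # queries dropped for taking too long
pdns_control show qsize-q            # queries queued waiting on the DB backend
pdns_control show latency            # average answer latency (microseconds)
pdns_control show udp-queries        # total UDP questions received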

Taking a look at https://doc.powerdns.com/3/authoritative/performance/ I have a few thoughts:

  • We should up our receiver-threads and, in conjunction, bump up distributor-threads to take advantage. At the moment we are essentially single threaded; it wouldn't shock me if that's resulting in dropped incoming traffic under load (in fact I suspect exactly this atm).
  • Our query-cache-ttl should match our record TTLs as configured (120).
  • Our packet-cache TTL is way too low and we could up it to get a better hit ratio. Right now, at 10, it is around a 17% hit rate for labservices1001. I temporarily bumped holmium up to 60 and am seeing a 75% hit rate with virtually no perf cost. (A hedged pdns.conf sketch of these first three items follows the iptables example below.)
  • It is probably a reasonable idea to put some rate limit on queries from specific labs instances to avoid the flood cases. I had been looking at configuration like this:
iptables -A INPUT -p udp \
 --dport 53 -d 208.80.154.20 \
 -m hashlimit \
 --hashlimit-name DNS-RECURSE \
 --hashlimit-above 10/second \
 --hashlimit-mode srcip \
 --hashlimit-burst 30 \
 --hashlimit-srcmask 32 \
 --hashlimit-htable-max 2048 \
 -j LOG --log-prefix "DNS-RECURSE-LOG-10-30: " \
 --log-level 1
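And a hedged sketch of what the pdns.conf side of the first three bullets could look like; the values are illustrative starting points, not reviewed settings, and would go in via the labs_dns puppet template rather than by hand:

# pdns.conf (authoritative) - illustrative values only
receiver-threads=4       # we are effectively single threaded today
distributor-threads=4    # backend threads, bumped in step with receiver-threads
query-cache-ttl=120      # match the configured record TTLs
cache-ttl=60             # packet cache TTL; 10s gave ~17% hits, 60s gave ~75% on holmium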

Reference:

https://doc.powerdns.com/3/recursor/settings/
https://doc.powerdns.com/3/authoritative/settings/#distributor-threads

cool doc on performance for popular dns servers

https://mice.cs.columbia.edu/getTechreport.php?techreportID=412&format=pdf&

note:

  • I did poke about in the DB a bit after reading a few cases like this one

Change 287004 had a related patch set uploaded (by Rush):
labs pdns adjustments for perf

https://gerrit.wikimedia.org/r/287004

Change 286905 abandoned by Andrew Bogott:
Labs DNS: Change the cache ttls back to defaults.

Reason:
Dropped in favor of https://gerrit.wikimedia.org/r/#/c/287004/

https://gerrit.wikimedia.org/r/286905

Change 287004 merged by Rush:
labs pdns adjustments for perf

https://gerrit.wikimedia.org/r/287004

@chasemp: Thanks. Does that mean that some Labs instance(s) are effectively causing the outages? Could you identify those with your debugging? While it is nice (and necessary) to armor the DNS servers so that they can cope with such situations better, if we could mitigate the indirect causes until then it would avoid the fallout that currently hits all project administrators in the form of dozens or hundreds of warning mails.

An important fact (which i knew, but forgot, and just now re-learned as part of fixing T134501):

labservices1001 is labs-ns0 but labs-recursor1
holmium is labs-ns1 but labs-recursor0

Getting the names properly switched over is a bit tricky, but should be easier once we get lvs in the mix. And in any case it's harmless, apart from making our troubleshooting confusing.

A few notes on 286897

max-cache-entries goes from 1M to 3M.

Currently on labservices1001:

cache-bytes 89051465
cache-entries 817477
cache-hits 713865
cache-misses 4724778

That's a significant miss rate (although it could remain true with a hugely disproportionate share of unique names), and at this miss rate we are hitting our entry limit within a day or so.
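Those numbers are quick to re-check on the recursor itself; a sketch using standard rec_control stats, with the hit-rate arithmetic just restating the figures above:

# recursor cache stats on labservices1001
rec_control get cache-entries
rec_control get cache-hits
rec_control get cache-misses
rec_control get cache-bytes
# hit rate = hits / (hits + misses)
#          = 713865 / (713865 + 4724778) ≈ 13%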

cache-entries on labservices1001

[Attached screenshot: Screen Shot 2016-05-06 at 8.53.31 AM.png]

> @chasemp: Thanks. Does that mean that some Labs instance(s) are effectively causing the outages? Could you identify those with your debugging? While it is nice (and necessary) to armor the DNS servers so that they can cope with such situations better, if we could mitigate the indirect causes until then it would avoid the fallout that currently hits all project administrators in the form of dozens or hundreds of warning mails.

hey @scfc you are right and we've been doing that for a few weeks and not documenting it well here.

When I started in on this, the dynamic proxy instances in Labs were far and away our biggest normal consumers. We have now converted the Labs dynamic proxy to use IPs directly, and have plans in the works to convert the normal nova proxy.

I have tracked down a few large consumers/clients who generated flood-like behavior. The etcd project had some issues where crashed instances were sometimes flooding DNS looking for their downed counterparts. This was addressed yesterday by the project owner; it was the cause of the significant up-tick for at least some of the April 25th incident. I think the ease with which this has happened historically means there isn't one source of the issue within labs; it's more like a lot of medium-sized ones.

Query counts that led me to that conclusion, from during that outage event:

 1519 10.68.16.34
 1538 10.68.21.239
 1543 10.68.18.221
 2705 10.68.23.89
 3270 10.68.17.22
 3710 10.68.21.49
 5048 10.68.16.27
14481 10.68.16.130
15990 10.68.21.68
27750 10.68.16.132

It also came to my attention yesterday that our resolv.conf configuration changed during the migration from the foo.wmflabs.org model to the foo.project.wmflabs.org model. This caused an increase in queries from searching tools.eqiad.wmflabs and eqiad.wmflabs, for example.
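For illustration, a resolv.conf along these lines shows why the search list multiplies query volume; the contents here are a sketch, not copied from a live instance:

# /etc/resolv.conf on a project instance (illustrative)
search <project>.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.154.20
nameserver 208.80.155.118
# glibc appends each search suffix to short names, and retries dotted names with
# the suffixes when the plain lookup fails, which is consistent with queries like
#   wikimedia.org.multimedia.eqiad.wmflabs
# showing up in the P2643 excerpt earlier in this task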

I enabled remotes-ringbuffer-entries for the recursor last week to get a better ongoing picture, but 1000 may not be a great value for either short-term or long-term tracking.

Here are the top users atm (and in particular 10.68.16.40 is always in the top 2 or 3 and yesterday was doing 20-40% of all traffic for most of my afternoon):

labservices1001

8.40%	10.68.16.157
6.90%	10.68.19.195
6.70%	10.68.16.40
5.70%	10.68.23.85
4.60%	10.68.16.187

holmium

41.20%	10.68.21.68
24.70%	10.68.18.93
5.00%	10.68.17.26
3.60%	10.68.17.252
3.10%	10.68.16.179

Change 286897 merged by Rush:
Increase the cache size for the Labs dns recursor

https://gerrit.wikimedia.org/r/286897

There is a lot of spiking in activity for the recursor especially this afternoon. That is me poking our cache settings and resolver response under load. I !log'd it.

This seems better!

Wonder if this was the older version of T224828