Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Andrew | T124680 Periodic internal labs dns outages
Resolved | | jcrespo | T128737 Move labs pdns database off of m5-master
Resolved | | Andrew | T134501 Why is the IP for labs-recursor1 assigned to two hosts?
Event Timeline
Change 269152 had a related patch set uploaded (by Andrew Bogott):
Turn pdns loglevels WAY UP
- clients are not failing over to the backup dns server (holmium) at all.
- There's a spike in pdns_concurrent queries at the time of the outage
- The cache (per the pdns_cache-entries metric) appears to empty either before or during the outage; it looks like this happens a few minutes in.
"concurrent-queries shows the number of MThreads currently running", so that may be a symptom rather than a cause.
Change 271797 had a related patch set uploaded (by Andrew Bogott):
Disable master/slave bits for labs pdns
The above is a partial fix but, as Mark notes, 'then there's still a problem because two pdns auth instances are writing to the same dns db concurrently'.
The current theory is that designate migrated from a "write directly to the pdns db" model to a "sync with mdns via AXFR" and due to largely nonexistent migration docs, we're stuck in the middle. I'm going to attempt to switch the backup dns server (on holmium) over to a pure axfr model and see how that plays.
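For reference, the replication knobs on the pdns authoritative side can be checked roughly like this (a sketch only; the Debian config path is an assumption, and the real values land via the puppet patches above):

grep -E '^(master|slave)=' /etc/powerdns/pdns.conf
# for a pure-AXFR secondary we would expect roughly:
#   master=no
#   slave=yes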
From the openstack mailing list: "As I understand it, in Kilo and later mdns must be primary and send data to other backends via XFR."
And now a designate developer (Kiall) has corrected that: it uses xfr for records but db writes for domains. Both at once!
Kiall has now vetted our config, and I'm about to submit a patch that implements his changes.
He suggested that a possible cause of the lock-ups is one of the pdns servers locking the db and then blocking while waiting for an axfr update. The axfr master specified in the pdns database is a non-loopback IP, so in theory both pdns servers can listen just fine, but I need to make sure there aren't firewall issues.
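A quick way to verify both halves of that (which master IP pdns has recorded for a zone, and whether an AXFR from that master actually succeeds from each pdns host) would be something like the following sketch; the DB host, zone, and master IP are placeholders, and the domains table layout is the stock pdns generic-MySQL schema:

# what does pdns think the master for a zone is?
mysql -h <pdns-db-host> pdns -e "SELECT name, master, type FROM domains WHERE name = '<zone>';"
# then, from each pdns host, try a transfer from that master:
dig AXFR <zone> @<master-ip>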
Ahah! The firewall blocks traffic between pdns on holmium and mdns on labservices1001. That's not specifically what Kiall predicted but it fits his theory. Firewall patch incoming...
Change 272615 had a related patch set uploaded (by Andrew Bogott):
designate: Open firewall to axfr traffic from pdns hosts.
Change 272615 merged by Andrew Bogott:
designate: Open firewall to axfr traffic from pdns hosts.
Just chiming in to offer some help if moving the pdns database around (which is really easy) is needed, and a reminder to give some love to T126251, which may be semi-related?
This just happened again today, /after/ the above firewall patch was applied. So, that wasn't it! Kiall is hopeful that the 'Master=false' element of the above patch will make a significant difference, so I'm going to work on getting that patch merged and applied today.
Change 271797 merged by Andrew Bogott:
Updates to designate/mdns/pdns setup for Labs internal dns
Change 272769 had a related patch set uploaded (by Andrew Bogott):
Change master=no for labs pdns
Change 272771 had a related patch set uploaded (by Andrew Bogott):
Updates to designate/mdns/pdns setup for Labs internal dns
This just happened again. So I guess this is happening... more often, now that the config is fixed?
It would still be interesting to know what happens at 15:00Z that triggers these outages. The fixed time of day suggests it's not simply the DNS server going belly-up after x hours of running or y requests served. Grepping through operations/puppet for cron entries with hour *=>.*1[45] only gives:
modules/geowiki/manifests/job/limn.pp:
    # cron job to do the actual fetching from the database, computation of
    # the limn files, and pushing the limn files to the data repositories
    cron { 'geowiki-process-db-to-limn':
        minute  => 0,
        hour    => 15,
        user    => $::geowiki::params::user,
        command => "${::geowiki::params::scripts_path}/scripts/make_and_push_limn_files.sh --cron-mode --basedir_public=${::geowiki::params::public_data_path} --basedir_private=${::geowiki::params::private_data_path} --source_sql_cnf=${::geowiki::mysql_conf::conf_file}",
        require => [
            Git::Clone['geowiki-scripts'],
            Git::Clone['geowiki-data-public'],
            Git::Clone['geowiki-data-private'],
            File[$::geowiki::mysql_conf::conf_file],
        ],
    }
and:
modules/mediawiki/manifests/maintenance/pagetriage.pp:
    cron { 'pagetriage_cleanup_testwiki':
        ensure   => $ensure,
        user     => $::mediawiki::users::web,
        minute   => 55,
        hour     => 14,
        monthday => '*/2',
        command  => '/usr/local/bin/mwscript extensions/PageTriage/cron/updatePageTriageQueue.php testwiki > /var/log/mediawiki/updatePageTriageQueue.test.log',
    }
What I find a bit odd about P2643 (after the outage):
[…]
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='labs-puppetmaster-codfw.wikimedia.org.multimedia.eqiad.wmflabs'
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='labs-puppetmaster-codfw.wikimedia.org.multimedia.eqiad.wmflabs'
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='wikimedia.org.multimedia.eqiad.wmflabs'
Feb 19 15:02:33 labservices1001 pdns[34420]: Query: select content,ttl,prio,type,domain_id,name from records where type='SOA' and name='wikimedia.org.multimedia.eqiad.wmflabs'
Is that a configuration error in some program/instance, or does that mean that the recursor could not resolve wikimedia.org at that time?
Change 272771 merged by Andrew Bogott:
Updates to designate/mdns/pdns setup for Labs internal dns
@scfc: I don't think it's related to the outage, but those queries are very strange! I see lots like that whenever I turn on logging; I'll try to see where they are coming from.
Change 273314 had a related patch set uploaded (by Andrew Bogott):
Lower max-mthreads for pdns recursor.
new datapoint: Both labs-ns2 (labservices1001) and labs-ns3 (holmium) caused alerts during the outage on the 24th.
To recap:
Designate writes domain information to the pdns mysql database directly. Records, on the other hand, are relayed to pdns via axfr. That means there are potentially three writers to the pdns database: designate, ns2, and ns3. ns2 and ns3 also need to communicate with mdns, presumably on port 53 for axfr.
Kiall suggested that there might be something happening with locking when writing to the db that causes both nameservers to block on the write and stop doing anything else. I've addressed all of his specific concerns, but this still seems like a possibility; a rough way to check from the DB side is sketched below.
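A sketch of that check, to run while an event is in progress (the DB host is a placeholder): look for pdns or designate connections stuck waiting on locks.

mysql -h <pdns-db-host> -e "SHOW FULL PROCESSLIST;" | grep -Ei 'pdns|designate|waiting'
# if anything shows up, the InnoDB lock detail is in:
mysql -h <pdns-db-host> -e "SHOW ENGINE INNODB STATUS\G"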
Seems to have just happened again: similar symptoms and recovery with restarts, but it did not coincide with the assumed time of day.
This happened again today with a similar pattern and time of day.
It was first reported around 2:47 UTC and was then flaky until 3:15 UTC or so. We restarted pdns at least once in there.
I've set up https://grafana.wikimedia.org/dashboard/db/labs-dns-dashboard to have some metrics about our DNS systems now - there are more detailed metrics in graphite (response type counts, timing metrics, etc.) if we need them.
Change 285402 had a related patch set uploaded (by Rush):
dnsrecursor update module and template
Change 285412 had a related patch set uploaded (by Rush):
dnsrecursor keep stats on last 1000 queries
Change 285418 had a related patch set uploaded (by Rush):
labs dnsrecurser update settings
Change 285746 had a related patch set uploaded (by Rush):
icinga for labs Auth DNS convert to check_dig
Change 285755 had a related patch set uploaded (by Rush):
icinga labs auth dns check update description
A few notes.
These failures tend to happen in waves lasting 20-30 minutes; they resolve on their own, and they are not always total outages. We have periods of "flakiness" where some percentage of resolution requests succeed, which means a few of these have been soft outages that were not noticed. Because of the spotty nature of the unavailability and the self-resolution, this smells like a resource contention problem that lasts as long as some client's abusive behavior or a scheduled job does.
https://doc.powerdns.com/3/authoritative/performance/
https://doc.powerdns.com/3/recursor/scripting/
labservices1001
- labs-ns2.wikimedia.org
- labs-ns0.wikimedia.org - 208.80.155.117
- labs-recursor1.wikimedia.org - 208.80.155.118
holmium
- labs-ns3.wikimedia.org
- labs-ns1.wikimedia.org - 208.80.154.12
- labs-recursor0.wikimedia.org - 208.80.154.20
The changeset previously used to increase logging verbosity (https://gerrit.wikimedia.org/r/#/c/269152/2/modules/labs_dns/templates/pdns.conf.erb) caused disk usage issues because rotation wasn't frequent enough.
Considerations:
- The PDNS checks have been updated in https://gerrit.wikimedia.org/r/#/c/285746/ with reasoning outlined in T133791. I am hopeful this will surface errors in a saner way. Up to this point the check_dns being used had little ability to interpret the output from the PDNS server. This resulted in false positives and could be obscuring alert situations early in the failure process.
- The recursive server is now logging the last 1k queries to look for client spikes
rec_control top-remotes
- We have updated the resolver stanza in nginx for the tools proxy to list multiple IPs. Previously it used only one, so maintenance on or failure of a single resolver caused tools home page outages. We have also converted the tools proxy to use IP addresses directly instead of hostnames; when logging behavior, the dynamic proxies accounted for 50-60% of all queries from labs, so this is a substantial reduction in traffic. There are plans in the works to do the same for the general nova proxy. (A quick check for the resolver stanza is sketched after this list.)
- We are in the midst of moving from a single DB for both PDNS auth servers to independent DBs in T128737
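The resolver change on the tools proxy can be sanity-checked with something like this (a sketch; the nginx config path is an assumption, and the IPs are labs-recursor0/labs-recursor1 from the host list above):

grep -R 'resolver' /etc/nginx/
# expected, roughly:
#   resolver 208.80.154.20 208.80.155.118;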
Outage notifications I have logged:
2 2016-01-25
1 2016-02-08
2 2016-02-19
1 2016-02-24
1 2016-03-02
2 2016-04-11
4 2016-04-25
Breaks down as:
2016-01-25 - Monday
2016-02-08 - Monday
2016-02-19 - Friday
2016-02-24 - Wednesday
2016-03-02 - Wednesday
2016-04-11 - Monday
2016-04-25 - Monday
Times in CDT:
09:02 09:11 09:13 09:12 14:28 09:13 17:57 09:58 10:04 10:38 10:55 11:05 11:06
restricted.bastion.wmflabs.org
208.80.155.155
Things to confirm during an event (a combined check sketch follows this list):
- Can both authoritative servers resolve a Labs VM name when queried directly
dig tools-bastion-03.eqiad.wmflabs @208.80.155.117
dig tools-bastion-03.eqiad.wmflabs @208.80.154.12
- Especially if yes, can the recursive servers?
dig tools-bastion-03.eqiad.wmflabs @208.80.154.20
dig tools-bastion-03.eqiad.wmflabs @208.80.155.118
- What about baremetal servers for recursor? (should hit a lua script)
metal["promethium.eqiad.wmflabs."] = "10.68.16.2"
metal["promethium.wikitextexp.eqiad.wmflabs."] = "10.68.16.2"
- What about recursive to 3rd party domains?
google.com, yahoo.com?
- rec_control current-queries
- Does pdns_control rping succeed?
- pdns_control reload change behavior?
- pdns_control purge change behavior?
- What is the output of pdns_control ccounts over 3 tries at 10s intervals
- Is designate pool_manager actively trying to update the database?
- Depending on which host seems to be the focal point of the failure for eqiad.wmflabs domains, strace -f -p the relevant PID to observe and log behavior.
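A rough wrapper for the dig checks above, runnable from a bastion or one of the dns hosts (a sketch only; the test names and IPs come from the list, everything else is an assumption):

#!/bin/bash
# Survey labs DNS during an event: both auth servers, both recursors, and a 3rd-party name.
NAME=tools-bastion-03.eqiad.wmflabs
for ns in 208.80.155.117 208.80.154.12; do
    echo "== auth $ns =="
    dig +time=2 +tries=1 "$NAME" @"$ns" | grep 'status:'
done
for rec in 208.80.154.20 208.80.155.118; do
    echo "== recursor $rec =="
    dig +time=2 +tries=1 "$NAME" @"$rec" | grep 'status:'
    dig +time=2 +tries=1 promethium.eqiad.wmflabs @"$rec" | grep 'status:'   # baremetal / lua path
    dig +time=2 +tries=1 google.com @"$rec" | grep 'status:'                 # 3rd-party recursion
done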
Questions:
What is this also_notifies line for? What is covered here that isn't in the pool_nameserver definition for the secondary pdns?
Does holmium still have ongoing connectivity issues?
On the active designate host I am seeing a lot of errors like:
2016-04-29 21:36:31.815 15976 WARNING designate.pool_manager.service [req-2739eca8-604b-4e32-9bad-7334b14a4359 - - - - -] For domain 68.10.in-addr.arpa. the error serial is 0.
I ran pool-manager in debug mode and collected some output that makes me think communication between designate and holmium has issues:
2016-04-29 20:02:44.856 15378 DEBUG designate.pool_manager.service [req-0699f747-d9fb-459d-8738-4218ce457706 - - - - -] Cache hit! Retrieved status ERROR and serial 1461960156 for domain 68.10.in-addr.arpa. on nameserver 208.80.154.12:53 with action UPDATE from the cache. _retrieve_statuses /usr/lib/python2.7/dist-packages/designate/pool_manager/service.py:666
2016-04-29 20:02:33.125 15378 INFO designate.central.rpcapi [req-e537a95f-0a61-4c03-bcc9-087094fb5826 - - - - -] update_status: Calling central's update_status for 114f1333-c2c1-44d3-beb4-ebed1a91742b : ERROR : 0
This is happening pretty constantly as pool-manager runs through its periodic sync timer. Is holmium having issues understanding AXFR updates and responding?
holmium seems to have a lot of this:
Apr 29 21:36:31 holmium pdns[1498]: Received serial number updates for 0 zones, had 1 timeouts
Apr 29 21:36:38 holmium pdns[1498]: Received serial number updates for 0 zones, had 2 timeouts
Apr 29 21:36:51 holmium pdns[1498]: Received serial number updates for 0 zones, had 2 timeouts
Apr 29 21:37:03 holmium pdns[1498]: Received serial number updates for 0 zones, had 2 timeouts
I don't have time to look at it further today, so I'm not sure whether the firewall rule fixes are still missing something or some other issue is causing timeouts from holmium's perspective.
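One quick check, since these timeouts are about serial refreshes, is to compare the SOA serial each auth server hands out for the zone from the pool-manager errors (IPs from the host list above); if holmium keeps timing out on its refresh checks, its serial should lag:

dig +short SOA 68.10.in-addr.arpa @208.80.155.117   # labservices1001 / labs-ns0
dig +short SOA 68.10.in-addr.arpa @208.80.154.12    # holmium / labs-ns1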
Change 286897 had a related patch set uploaded (by Andrew Bogott):
Increase the cache size for the Labs dns recursor
Change 286905 had a related patch set uploaded (by Andrew Bogott):
Labs DNS: Change the cache ttls back to defaults.
OK I caught the tail end of a brief storm this morning.
There is indeed a query storm that hits the pdns recursor associated with an increase in servfail answers to clients within labs.
I don't have a lot of hard evidence yet, but non-labs WMF domains do seem to keep resolving during these outages: when I have been able to catch an occurrence I have tried google, yahoo, etc. with success even while foo.eqiad.wmflabs was failing. The recursor process itself seems to be serving clients at least some of the time, and I haven't seen it fail to resolve a 3rd-party domain yet (though it could be happening). My main suspicion, then, is that authoritative pdns is the source of the issue.
This seems to correlate the powerdns recursor's servfail increase with the powerdns authoritative servers' timedout-packets increase.
i.e. the pdns auth server is dropping requests, and as a result the recursor is forced to tell the labs clients it cannot resolve. pdns auth latency spikes along with the increase as well. Basically, it seems like pdns auth is choking on an influx of queries.
Another view
It's interesting that servfail packets from the pdns auth server do not show a similar spike. Auth dns is not serving the storm from recursive pdns; it is essentially dropping the requests on the floor: no protocol response, just timed out / dropped.
My first thought was that this was a further sign of our possible DB woes, but unless pdns's internal metrics are faulty that seems not to be the case: rarely do we have a queue of requests waiting on the DB.
Interestingly, even when we are failing to serve storms of requests (when I would expect at least something), qsize-q maxes out at 2.
So contention is seemingly somewhere between the request arriving from the recursor and the DB, and these "storms" are somewhat paltry (at least some of them) to be affecting pdns in this manner, which makes me wonder if we have a tuning problem.
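The counters behind those graphs can also be read directly on the auth servers during a storm, which rules out a faulty metrics pipeline (a sketch; these are standard pdns 3.x auth variables):

for v in udp-queries udp-answers timedout-packets qsize-q latency; do
    echo -n "$v: "; pdns_control show "$v"
done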
Taking a look at https://doc.powerdns.com/3/authoritative/performance/ I have a few thoughts:
- We should up our receiver-threads and, in conjunction, bump up distributor-threads to take advantage. At the moment we are essentially single-threaded; it wouldn't shock me if that results in dropped incoming traffic under load (in fact I suspect exactly this atm).
- Our query-cache-ttl should match our record TTLs as configured (120).
- Our packet-cache TTL is way too low and we could raise it for a better hit ratio. Right now, at 10, it gives around a 17% hit rate on labservices1001; I temporarily bumped holmium up to 60 and am seeing a 75% hit rate with virtually no perf cost (a pdns.conf sketch for these settings follows the iptables example below).
- It is probably a reasonable idea to put some rate limit on queries from specific labs instances to avoid the flood cases. I had been looking at configuration like this:
iptables -A INPUT -p udp --dport 53 -d 208.80.154.20 \
    -m hashlimit \
    --hashlimit-name DNS-RECURSE \
    --hashlimit-above 10/second \
    --hashlimit-mode srcip \
    --hashlimit-burst 30 \
    --hashlimit-srcmask 32 \
    --hashlimit-htable-max 2048 \
    -j LOG --log-prefix "DNS-RECURSE-LOG-10-30: " \
    --log-level 1
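And the thread/cache tuning from the list above would look roughly like this in pdns.conf (illustrative values only, not the final merged settings; see https://gerrit.wikimedia.org/r/#/c/287004/ for what actually went in):

# more receiver/distributor threads so we are no longer effectively single-threaded
receiver-threads=4
distributor-threads=4
# query cache matched to the configured record TTLs
query-cache-ttl=120
# packet cache raised from 10s, as trialled on holmium
cache-ttl=60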
Reference:
https://doc.powerdns.com/3/recursor/settings/
https://doc.powerdns.com/3/authoritative/settings/#distributor-threads
Cool doc on performance for popular DNS servers:
https://mice.cs.columbia.edu/getTechreport.php?techreportID=412&format=pdf&
note:
- I did poke about in the DB a bit after reading a few cases like this one
Change 287004 had a related patch set uploaded (by Rush):
labs pdns adjustments for perf
Change 286905 abandoned by Andrew Bogott:
Labs DNS: Change the cache ttls back to defaults.
Reason:
Dropped in favor of https://gerrit.wikimedia.org/r/#/c/287004/
@chasemp: Thanks. Does that mean that some Labs instance(s) are effectively causing the outages? Could you identify those with your debugging? While it is nice (and necessary) to armor the DNS servers so that they can cope with such situations better, if we could mitigate the indirect causes until then it would avoid the fallout that currently hits all project administrators in the form of dozens or hundreds of warning mails.
An important fact (which i knew, but forgot, and just now re-learned as part of fixing T134501):
labservices1001 is labs-ns0 but labs-recursor1
holmium is labs-ns1 but labs-recursor0
Getting the names properly switched over is a bit tricky, but should be easier once we get LVS in the mix. In any case it's harmless, apart from making our troubleshooting confusing.
A few notes on 286897
max-cache-entries 1M to 3M
Currently on labservices1001:
cache-bytes 89051465
cache-entries 817477
cache-hits 713865
cache-misses 4724778
That's a significant miss rate (although that could remain true with a hugely disproportionate number of unique names), and at this miss rate we are hitting our entry limit within a day or so.
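Those counters can be re-checked at any time straight from the recursor (a sketch; the hit rate is just cache-hits / (cache-hits + cache-misses)):

rec_control get cache-entries cache-bytes cache-hits cache-misses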
Hey @scfc, you are right, and we've been doing that for a few weeks without documenting it well here.
When I started in on this, the instances of dynamic proxy in Labs were far and away our biggest normal consumers. We have now converted the Labs dynamic proxy to using IPs directly, and have plans in the works to convert the normal nova proxy.
I have tracked down a few large consumers/clients who generated flood-like behavior. The etcd project had some issues where crashed instances were sometimes flooding DNS looking for their downed counterparts; this was addressed yesterday by the project owner, and it was the cause of the significant uptick for at least some of the April 25th incident. I think the ease with which this has happened historically means there isn't one source of trouble within labs, but rather a lot of medium-sized ones.
query numbers that led me to that conclusion from during that outage event:
 1519 10.68.16.34
 1538 10.68.21.239
 1543 10.68.18.221
 2705 10.68.23.89
 3270 10.68.17.22
 3710 10.68.21.49
 5048 10.68.16.27
14481 10.68.16.130
15990 10.68.21.68
27750 10.68.16.132
It also came to my attention yesterday that our resolv.conf configuration changed during the migration from the foo.wmflabs.org model to the foo.project.wmflabs.org model. This caused an increase in searches against tools.eqiad.wmflabs and eqiad.wmflabs, for example.
I enabled remotes-ringbuffer-entries for the recursor last week to get a better ongoing picture, but 1000 maybe isn't a great long-term or short-term tracking value.
Here are the top users atm (and in particular 10.68.16.40 is always in the top 2 or 3 and yesterday was doing 20-40% of all traffic for most of my afternoon):
labservices1001
8.40% 10.68.16.157
6.90% 10.68.19.195
6.70% 10.68.16.40
5.70% 10.68.23.85
4.60% 10.68.16.187
holmium
41.20% 10.68.21.68
24.70% 10.68.18.93
5.00% 10.68.17.26
3.60% 10.68.17.252
3.10% 10.68.16.179
There is a lot of spiking in activity for the recursor especially this afternoon. That is me poking our cache settings and resolver response under load. I !log'd it.