During the T154758 incident, I noticed that lvs2006 was marking all of the backends down, not just the row A ones. While investigating, I found that the first nameserver in resolv.conf is acamar, which is in row A and was thus unreachable from the LVSes (which sit on all of the subnets). I guessed this might be causing issues, so I commented that entry out, and Pybal started to recover shortly afterwards.
This didn't cause any actual problems this time, but the situation would have been different had this been A2 instead (and so lvs2003): it would have resulted in multiple services failing at codfw.
So, our resolv.conf has timeout:1 attempts:3, but it seems that even that isn't enough for Pybal. I'd say this is something we should probably fix in Pybal (making it more resilient to DNS delays), but perhaps we can do something to alleviate it in the infrastructure as well. (Before anyone says "anycast": we do have an anycasted dns-rec-lb, but that goes via LVS, so the LVS servers are the only ones not using it because of the catch-22.)
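To put a rough number on the delay, here is a minimal Python sketch of how long a single lookup waits when nameservers earlier in the list are unreachable. It is a simplified model, not the real resolver: it assumes servers are tried in listed order, that each unanswered query costs exactly the configured timeout, and it ignores any backoff applied on later attempts. The names ns-b and ns-c are placeholders for the other resolv.conf entries, which are not listed here.

```python
def lookup_delay(nameservers, unreachable, timeout=1.0, attempts=3):
    """Estimate the seconds spent waiting before a lookup gets an answer.

    Simplified model (an assumption, not the real resolver logic):
    servers are tried in listed order, and every query to an
    unreachable server costs exactly `timeout` seconds.
    """
    waited = 0.0
    for _ in range(attempts):
        for ns in nameservers:
            if ns not in unreachable:
                return waited  # first reachable server answers
            waited += timeout
    return waited  # every server timed out on every attempt


# acamar listed first and unreachable; ns-b and ns-c are placeholder names.
print(lookup_delay(["acamar", "ns-b", "ns-c"], {"acamar"}))  # 1.0
# Same outage with a reachable server listed first.
print(lookup_delay(["ns-b", "acamar", "ns-c"], {"acamar"}))  # 0.0
```

Under this model every lookup stalls for a full second while the first entry is down, so anything that resolves names as part of a health check with a tight timeout would feel it on every check. Whether that is what actually tripped Pybal here is still a guess.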