Paste P5289

(An Untitled Masterwork)

Authored by chasemp on Apr 19 2017, 3:49 PM.
08:28 yuvipanda: [16:10] bd808: ldap.yaml, ldap.conf and novaobserver.yaml now present in all containers
08:28 yuvipanda: [16:10] \o/
08:28 yuvipanda: [16:20] lol, kube-proxy is failing and I've no idea why
08:28 yuvipanda: [16:21] however, it isn't failing because of anything I did
08:28 yuvipanda: [16:21] it's been failing for a while
08:28 yuvipanda: [16:21] madhuvishy: chasemp ^
08:28 chasemp: [16:21] yuvipanda: failing where?
08:28 yuvipanda: [16:21] chasemp: on tools-proxy-01
08:28 chasemp: [16:21] is that the active one?
08:28 yuvipanda: [16:22] yup
08:28 yuvipanda: [16:22] this causes breakage for tools that did a webservice restart since it was broken
08:28 yuvipanda: [16:22] which seems to be... reasonator
08:28 chasemp: [16:22] can this be right?
08:28 chasemp: [16:22] root 13594 1 0 21:18 ? 00:00:00 /usr/bin/kube-proxy --master=127.0.0.1:8080 --kubeconfig=/etc/kubernetes/kubeconfig --proxy-mode='iptables' --masquerade-all=true
08:28 chasemp: [16:22] master=127.0.0.1?
08:28 yuvipanda: [16:22] looking at proxy-02
08:28 yuvipanda: [16:22] chasemp: nope, it isn't right.
08:28 chasemp: [16:23] so a package upgrade or some upgrade caused it to lose the correct params
08:28 chasemp: [16:23] yuvipanda: were params set in the service unit files?
08:28 chasemp: [16:23] that we replaced?
08:28 yuvipanda: [16:23] chasemp: they're actually in kubeconfig file
08:28 yuvipanda: [16:23] cat /etc/default/kube-proxy
08:28 yuvipanda: [16:23] DAEMON_ARGS="--kubeconfig=/etc/kubernetes/kubeconfig --proxy-mode='iptables' --masquerade-all=true"
08:28 chasemp: [16:23] master should = server: https://k8s-master.tools.wmflabs.org:6443?
08:28 chasemp: [16:24] I mean, seems pretty suspect with new packages?
08:28 chasemp: [16:24] on both
08:28 chasemp: [16:24] /usr/bin/kube-proxy --master=127.0.0.1:8080
08:28 chasemp: [16:25] persists through a restart on -02
08:28 yuvipanda: [16:25] chasemp: nah, it's not the packages - it's probably the puppet patch I made that took away the unit files from puppet and put them in the package
08:28 yuvipanda: [16:26] chasemp: hand hacking that out worked. I'll make a patch
08:28 chasemp: [16:26] ok
08:28 chasemp: [16:26] yeah I wondered, I was asking that with 'yuvipanda: were params set in the service unit files?'
08:28 chasemp: [16:26] thanks
08:28 yuvipanda: [16:27] chasemp: np. I am just going to remove that 127 line, since it's a shitty default
08:28 yuvipanda: [16:28] chasemp: thanks for the quick catch
08:28 chasemp: [16:28] yuvipanda: yup
08:28 chasemp: [16:28] yuvipanda: so this will get set by puppet from now on?
08:28 yuvipanda: [16:28] chasemp: nope, I'm just fixing the deb and rolling out a new deb
08:28 chasemp: [16:29] ah
08:28 chasemp: [16:29] fixing the deb to point to?
08:28 yuvipanda: [16:29] chasemp: since that default will never actually be true in 99% of setups
08:28 chasemp: [16:29] I guess I'm wondering if the deb will then be tools specific
08:28 yuvipanda: [16:29] (I mentioned this in the original packaging patch but forgot to see if it was fixed)
08:28 yuvipanda: [16:29] chasemp: nope, it's just reading from /etc/kubernetes/kubeconfig
08:28 yuvipanda: [16:29] chasemp: but the 127.0.0.1 overrode it
08:28 chasemp: [16:29] right, ok I understand now
08:28 yuvipanda: [16:29] chasemp: and /etc/kubernetes/kubeconfig path is set in /etc/defaults/kubernetes, which is set in puppet
08:28 chasemp: [16:30] mods
08:28 chasemp: [16:30] heh nods
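
[Note: a minimal sketch of the moving pieces discussed above. The DAEMON_ARGS line is quoted verbatim from the log; the check commands and the expected kubeconfig server value are assumptions based on what chasemp and yuvipanda describe, not a confirmed procedure.]

    # /etc/default/kube-proxy (quoted above) -- no --master flag, so the server
    # from the kubeconfig should win once the packaged defaults stop injecting 127.0.0.1:
    #   DAEMON_ARGS="--kubeconfig=/etc/kubernetes/kubeconfig --proxy-mode='iptables' --masquerade-all=true"

    # Quick verification sketch on a proxy host:
    ps -ef | grep '[k]ube-proxy'               # should NOT show --master=127.0.0.1:8080
    grep 'server:' /etc/kubernetes/kubeconfig  # expected: https://k8s-master.tools.wmflabs.org:6443
    sudo service kube-proxy restart            # confirm the right args persist across a restart
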
08:28 yuvipanda: [16:30] I kicked off a build
08:28 chasemp: [16:30] yuvipanda: post new package, let's spin up a new tool, restart a few, etc., and do a bit of poking?
08:28 yuvipanda: [16:30] chasemp: at least I found it myself instead of leaving a ticking time bomb in my wake :)
08:28 chasemp: [16:30] yes
08:28 yuvipanda: [16:30] chasemp: yeah, that's how I found this one too (I was restarting reasonator)
08:28 yuvipanda: [16:31] I usually check that doing a webservice shell works & a restart works
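
[Note: a sketch of the smoke test yuvipanda mentions, assuming the standard Toolforge become/webservice commands; the tool name is only an example.]

    # From the Tools bastion, as a maintainer of an affected tool:
    become reasonator
    webservice restart   # new backend should register with the active proxy
    webservice shell     # interactive shell against the running webservice
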
08:28 yuvipanda: [16:35] and I've successfully zero'd out the disk of my old laptop
08:28 andrewbogott: [16:42] yuvipanda: if you haven't crossed the horizon yet… can you tell me about instance_info_dumper.pp ?
08:28 andrewbogott: [16:42] Is its output consumed by anything?
08:28 yuvipanda: [16:42] andrewbogott: nope, you can destroy it
08:28 yuvipanda: [16:42] well, it's useful in that the output can be used to find which instances have which roles
08:28 yuvipanda: [16:42] which I don't think we've anything else for
08:28 andrewbogott: [16:42] Ah, it includes roles? That does seem useful.
08:28 yuvipanda: [16:42] so ideally that code should be moved into watroles
08:28 andrewbogott: [16:42] ok, I'll nurse it along for now.
08:28 andrewbogott: [16:42] Thanks
08:28 yuvipanda: [16:43] andrewbogott: the current code isn't actually outputting the JSON file to anywhere useful
08:28 andrewbogott: [16:43] Yeah, I noticed :)
08:28 yuvipanda: [16:43] so it's one of those 'if we have a roles to instances mapping in a forest and nobody knows of it, do we really have a roles to instances mapping?' things
08:28 andrewbogott: [16:43] But now I know where it is, so it will be useful to me!
08:28 yuvipanda: [16:44] andrewbogott: :D
08:28 andrewbogott: [16:50] is out for now.
08:28 andrewbogott: [16:51] Catch you later, Yuvi!
08:28 yuvipanda: [16:51] andrewbogott: bye!
08:28 chasemp: [16:56] yuvipanda: what's the lead time on new packages?
08:28 chasemp: [17:04] madhuvishy: are you about?
08:28 chasemp: [17:11] I'm stepping away for a bit for some personal business, yuvipanda; madhuvishy is going to walk through things with you and hang out for a bit to verify the new packages resolve it
08:28 yuvipanda: [17:13] chasemp: have a good day and happy anniversary :)
08:28 yuvipanda: [17:13] chasemp: cool, got it.
08:28 yuvipanda: [17:14] madhuvishy: the debs are almost done, maybe 2 more mins?
08:28 madhuvishy: [17:15] yuvipanda: yup okay
08:28 yuvipanda: [17:22] madhuvishy: ok, build completed. am scping it to the aptly host now (tools-services-01)
08:28 Reedy: [17:30] yuvipanda: btw, no one would stop you hanging around in #mediawiki_security
08:28 Reedy: [17:30] But if you left just for some separation, that's fine :)
08:28 yuvipanda: [17:30] Reedy: :D Yeah, am just leaving for some more separation for a bit :)
08:28 bd808: [17:31] we will have to stalk him on slack ;)
08:28 yuvipanda: [17:31] hehe
08:28 yuvipanda: [17:32] bd808: Reedy although, tbh, leaving _security was the first 'oh shit this really is happening' moment :(
08:28 Reedy: [17:33] On and up
08:28 bd808: [17:33] new things are fun. you'll be too busy to miss us
08:28 yuvipanda: [17:36] bd808: I merged your novaobserver.yaml in containers patch btw
08:28 bd808: [17:37] you wrote it :) but thanks
08:28 bd808: [17:37] it will let me clean up some stuff in openstack-browser
08:28 bd808: [17:37] and maybe somebody else will figure out a cool thing to build with it
08:28 bd808: [17:37] will need to document how to use it
08:28 yuvipanda: [17:43] bd808: yeah!
08:28 yuvipanda: [17:43] I'm gonna leave this channel too :(
08:28 yuvipanda: [17:44] bd808: Krenair valhallasw madhuvishy chasemp andrewbogott it was great working with y'all :) I'll come back in May, so don't entirely forget me :) And do still page me for PAWS and Quarry. Thanks!
08:28 yuvipanda: [17:44] <3
08:28 bd808: [17:44] see you soon yuvipanda
08:28 yuvipanda: [17:44] :) good luck everyone!
08:28 ***: Playback Complete.
08:36 andrewbogott: chasemp: so, nothing actually wrong that you can see?
08:37 chasemp: andrewbogott: sge_qmaster on tools-grid-master kept spiking at 100% cpu and my guess is single threaded?
08:37 chasemp: I'm not sure why
08:37 chasemp: I can't remember ever seeing it high usage
08:38 andrewbogott: huh
08:38 andrewbogott: If someone has a script that's creating a million jobs, maybe?
08:38 chasemp: my first thought too, but I haven't dug anything up yet
08:38 chasemp: sucks
08:39 chasemp: andrewbogott: sorry for txting, I got nervous when gridmaster started throwing dns failure
08:39 andrewbogott: no worries, I was awake
08:41 andrewbogott: I'm looking at sge_qmaster in 'top' but of course I don't know what normal behavior looks like
08:41 chasemp: andrewbogott: nslcd and nscd on tools-grid-master were really pegged there for a while
08:41 chasemp: I restarted both and on 1420 too
08:42 chasemp: andrewbogott: that box, tools-grid-master, is single core
08:42 chasemp: so when it's pegging at 100% it's not kidding
08:42 chasemp: idk why...
08:43 chasemp: any ideas?
08:44 andrewbogott: I've never seen them be busy. I guess if there was a brief dns outage they might have been retrying frantically
08:44 andrewbogott: I'll look on the dns server and see if there's anything in the log
08:51 andrewbogott: …I don't see anything interesting
08:51 chasemp: it's not constant, but pretty often the cpu is still hiking up
08:54 chasemp: andrewbogott: I removed any nfs throttling from just the master
08:54 chasemp: I'm tempted to restart the master service
08:54 chasemp: I don't see or don't recognize anyone doing anything nuts?
08:54 chasemp: but it's very busy
08:54 andrewbogott: restarting it seems ok
08:55 chasemp: I wonder if there isn't something about fewer nodes, more loaded on each, that causes more work rather than more nodes that are under-resourced...
08:55 andrewbogott: We don't know for sure that this isn't normal, do we?
08:56 chasemp: andrewbogott: well, there's a definite iowait spike to associate with it
08:56 chasemp: https://graphite-labs.wikimedia.org/render/?width=718&height=506&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=tools.tools-grid-master.cpu.total.user&target=tools.tools-grid-master.cpu.total.iowait&from=-3d
08:56 chasemp: that's kinda nuts
08:56 andrewbogott: wow
08:57 chasemp: tools-exec-1420
08:57 chasemp: https://graphite-labs.wikimedia.org/render/?width=718&height=506&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=tools.tools-grid-master.cpu.total.user&target=tools.tools-exec-1420.cpu.total.iowait&from=-1d
08:57 chasemp: over 180d
08:58 chasemp: https://graphite-labs.wikimedia.org/render/?width=718&height=506&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=tools.tools-grid-master.cpu.total.user&target=tools.tools-exec-14*.cpu.total.iowait&from=-180d
08:59 andrewbogott: That could easily be just one greedy tool
08:59 andrewbogott: But a greedy tool shouldn't be able to ruin things on the grid master
09:00 chasemp: https://graphite-labs.wikimedia.org/render/?width=718&height=506&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=tools.tools-grid-master.cpu.total.user&target=tools.tools-exec-14*.cpu.total.iowait&from=-180d
09:00 chasemp: that's over all trusty execs over 180d each line being a different node
09:00 chasemp: early jan it definitely trends up consistently
09:01 andrewbogott: yeah, what changed in January? That's before the precise nodes were killed.
09:01 chasemp: but yeah I don't have an explanation for the iowait spike on master
09:01 chasemp: other than something causing cpu spike causing issues that could include iowait...
09:01 andrewbogott: Did dramatic things happen with the labstores in Jan?
09:01 chasemp: no and it is really fine
09:02 chasemp: andrewbogott: didn't we already reduce precise nodes by that time?
09:02 andrewbogott: checks the log
09:03 andrewbogott: five nodes deleted on 2017-01-03
09:04 andrewbogott: that doesn't quite fit with the graph
09:04 chasemp: I mean that or a kernel update causing io issues or something idk
09:04 chasemp: it's definitely not clear
09:05 chasemp: I would like to start running longer tests in labtest for kernel updates and some regular predictive workloads on vms there
09:05 chasemp: that's an aside
09:05 chasemp: I'm not sure what to do for this atm
09:05 chasemp: andrewbogott: should we build out some trusty nodes?
09:05 chasemp: (I have a lunch scheduled at 11 fyi)
09:05 andrewbogott: Yeah, that's the only thing I can think of to try.
09:05 andrewbogott: (I have lunch coming up too)
09:06 andrewbogott: But I'll try to get some new nodes together over the weekend and we'll see if the iowaits settle down.
09:06 chasemp: andrewbogott: I'm thinking of merging a slight bump in allowed read/write
09:06 chasemp: I also disabled puppet and removed the cap on the master node
09:06 chasemp: just to see
09:06 andrewbogott: You think just -exec nodes? Or web as well?
09:06 chasemp: idk
09:06 andrewbogott: Yeah, increasing read/write seems good.
09:06 chasemp: andrewbogott: maybe you could think about how we can bump up cpu on tools-grid-master?
09:07 andrewbogott: We could build a new, larger master. As I recall, though, failing over the master is super ugly
09:07 chasemp: I mean hitting the cap on a single core cpu isn't exactly shocking even though we don't normally do it historically
09:07 andrewbogott: And actually it's totally calm now, since your restarts.
09:08 chasemp: andrewbogott: I was thinking more shut down existing, make a backup, bump up flavor https://docs.openstack.org/user-guide/cli-change-the-size-of-your-server.html ?
09:08 andrewbogott: hm, nope, I spoke too soon, there it goes :)
09:08 andrewbogott: I've never had an instance survive a resize like that.
09:09 chasemp: yeah
09:09 andrewbogott: the whole feature is nonsense as far as I know
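
[Note: the resize flow they are referring to, per the linked OpenStack user guide, sketched with the nova CLI of that era; the target flavor name is hypothetical, and andrewbogott's point stands that this path was not trusted in practice.]

    nova stop tools-grid-master                    # (optional) quiesce first, as chasemp suggests
    nova resize tools-grid-master m1.large --poll  # rebuild onto the larger flavor (flavor name hypothetical)
    nova resize-confirm tools-grid-master          # keep the resize, or:
    nova resize-revert tools-grid-master           # roll back if it misbehaves
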
09:10 andrewbogott: man, there are really a LOT of exec nodes already! 24.
09:10 chasemp: yeah, are we short on webgrid or something?
09:10 chasemp: that seems like parity to me or I can't recall
09:11 chasemp: something wacky just happened to the master around that time
09:11 chasemp: https://graphite-labs.wikimedia.org/render/?width=718&height=506&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=tools.tools-grid-master.cpu.total.user&target=tools.tools-grid-master.cpu.total.iowait&from=-2h
09:11 chasemp: andrewbogott: reminiscent of the freezes that happened for a while
09:11 andrewbogott: the iowaits are on exec nodes, that I see
09:12 andrewbogott: So, I'll make some more anyway :)
09:12 chasemp: which I do honestly think was a client<=>server kernel mismatch
09:14 chasemp: https://graphite-labs.wikimedia.org/render/?width=1400&height=1000&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=cactiStyle(tools.tools-exec-14*.cpu.total.iowait)&from=-180d&hideLegend=false
09:14 chasemp: def after 1/03 or 1/09 things start to trend up
09:15 chasemp: which is not true andrewbogott of tools-worker* nodes
09:15 chasemp: https://graphite-labs.wikimedia.org/render/?width=1400&height=1000&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=cactiStyle(tools.tools-worker*.cpu.total.iowait)&from=-180d&hideLegend=false
09:15 chasemp: I don't think
09:16 andrewbogott: hm
09:17 andrewbogott: ok, I have five new exec nodes in the works. It takes hours to build these guys.
09:17 chasemp: tools webgrid
09:17 chasemp: https://graphite-labs.wikimedia.org/render/?width=1400&height=1000&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=cactiStyle(tools.tools-webgrid*.cpu.total.iowait)&from=-180d&hideLegend=false
09:18 andrewbogott: wow. So it def is the exec nodes
09:19 chasemp: well that chart is misleading in some ways due to those spikes
09:20 andrewbogott: Oh, the scale is messed up?
09:20 chasemp: 30d
09:20 chasemp: https://graphite-labs.wikimedia.org/render/?width=1400&height=1000&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=cactiStyle(tools.tools-webgrid*.cpu.total.iowait)&from=-30d&hideLegend=false
09:20 chasemp: vs
09:20 chasemp: https://graphite-labs.wikimedia.org/render/?width=1400&height=1000&_salt=1491054971.814&target=tools.tools-grid-master.cpu.total.system&target=cactiStyle(tools.tools-exec-14*.cpu.total.iowait)&from=-30d&hideLegend=false
09:20 chasemp: so still shows up
09:21 chasemp: andrewbogott: so what we are thinking definitely exists, I don't know if that's /the/ problem but
09:21 chasemp: there is more to fewer execs than meets the eye
09:22 andrewbogott: seems like
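
[Note: the graphite render URLs above return PNGs; for scripted comparisons the same targets can be pulled as data with format=json, a standard graphite-web parameter. A sketch:]

    curl -s 'https://graphite-labs.wikimedia.org/render/?target=cactiStyle(tools.tools-exec-14*.cpu.total.iowait)&from=-30d&format=json' \
      | python -m json.tool | head
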
09:23 chasemp: andrewbogott: https://gerrit.wikimedia.org/r/#/c/345975/
09:23 andrewbogott: that's a very modest increase :)
09:24 chasemp: 20% increase in write andrewbogott
09:24 chasemp: across all
09:27 chasemp: andrewbogott: so you're going to add 5 execs as you can, and I'm upping the threshold across the board, leaving the threshold off only on the master
09:27 andrewbogott: yep
09:27 chasemp: it seems pretty clear iowait on execs since early jan is up
09:27 chasemp: so that is a reasonable place to try to adjust, thanks for making those
09:28 chasemp: I'm scared of bumping up write too much as one tool could hammer hard
09:28 chasemp: we'll see
09:28 chasemp: madhuvishy: whenever you wake up, check out SAL for tools; we are trying to sort out an issue that seems like io/iowait with small adjustments while andrew builds a few execs
09:28 chasemp: to spread the load
09:32 chasemp: andrewbogott: it's rolling out now fyi
09:35 chasemp: andrewbogott: it looks like https://phabricator.wikimedia.org/P5182 when it rolls out
09:35 chasemp: doing puppet on a 5 node fanout from clush now for execs
09:35 chasemp: I have to step away andrewbogott for a second, I do have a brunch with my sisters (home on spring break) in a bit; I'll bring my laptop
09:36 chasemp: going to watch this rollout
09:36 andrewbogott: ok
09:42 andrewbogott: I'm stepping away for a bit — going to take forever for this initial puppet run to finish on the new nodes
09:45 chasemp: ok
09:47 madhuvishy: hello
09:48 chasemp: mornin
09:48 madhuvishy: chasemp: tools.iabot is running like 62 jobs across execs
09:48 madhuvishy: they are all in running
09:48 chasemp: madhuvishy: ah I didn't catch that
09:48 madhuvishy: not wait or anything
09:48 chasemp: what the heck is that?
09:49 madhuvishy: internet archiver bot
09:49 chasemp: maybe it's periodic too
09:49 chasemp: because it seems to come and go
09:49 madhuvishy: something involving cyberpower
09:49 chasemp: can you attempt to ping him in -labs and ask him to throttle that back?
09:54 madhuvishy: https://www.irccloud.com/pastebin/chcJs2Bk/
09:54 madhuvishy: chasemp: ^
09:55 chasemp: nods
09:55 chasemp: * * * * *
09:55 chasemp: what the crap
09:56 madhuvishy: chasemp: there's 20 jsubs, but they aren't finishing before the next job starts
09:56 chasemp: right
09:56 chasemp: ok
09:56 madhuvishy: so there's like 3 workerX jobs
09:56 madhuvishy: 3*20
09:56 chasemp: madhuvishy: shouldn't -once handle that?
09:57 madhuvishy: chasemp: hmmm i don't think jsub prevents you from starting 2 jobs with the same name
09:57 madhuvishy: jstart does i think
09:57 chasemp: right I was thinking of jstart
09:57 chasemp: madhuvishy: I'm trying to keep an 11 lunch, do you have a minute to make a task for this, assign it to cyberbot, and let's just stop these since it's clearly in violation? or?
09:57 chasemp: overrunning the grid w/ overlapping jobs every... minute
09:58 chasemp: is clearly a big problem
09:58 chasemp: why oh why is this not capped at something sane (concurrent jobs per tool)
09:58 madhuvishy: yeah!
09:58 chasemp: madhuvishy: for context we are hurting io wise /already/ so this is just icing on the cake
09:58 madhuvishy: without any decent intervals on the crons
09:59 chasemp: yeah
10:00 madhuvishy: chasemp: ya i'll make a task and then comment out the crons
10:00 madhuvishy: may be kill the existing jobs
10:00 chasemp: madhuvishy: I would
10:00 chasemp: or even can
10:00 chasemp: how about I do that and you make the task :)
10:00 madhuvishy: chasemp: okay :)
10:03 chasemp: madhuvishy: should I leave 3 workers going?
10:04 madhuvishy: chasemp: sure why not
10:04 madhuvishy: will save from some wrath :)
10:04 chasemp: fyi
10:04 chasemp: ####### Commented out by a Tool admin -- this is overwhelming the grid 2017-4-1
10:04 chasemp: #* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker4 -o $HOME/Workers/Worker4.ou
10:04 madhuvishy: will save us*
10:04 chasemp: 3257207 0.30049 lighttpd-i tools.iabot r 04/01/2017 10:06:57 webgrid-lighttpd@tools-webgrid 1
10:04 chasemp: 3267758 0.30000 worker3 tools.iabot r 04/01/2017 15:04:07 task@tools-exec-1420.tools.eqi 1
10:04 chasemp: 3267759 0.30000 worker2 tools.iabot r 04/01/2017 15:04:07 task@tools-exec-1421.tools.eqi 1
10:04 chasemp: 3267760 0.30000 worker1 tools.iabot r 04/01/2017 15:04:07 task@tools-exec-1422.tools.eqi 1
10:05 chasemp: let's keep an eye on those to see if they start overlapping and hanging on
10:06 madhuvishy: chasemp: yeah okay
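
[Note: a hedged sketch of the less grid-hostile shape for this kind of tool, based on the jstart behaviour madhuvishy describes (one job per name); this is an assumed remediation, not what was actually deployed, and <worker command> stands in for the part of the crontab line that is truncated above.]

    # Instead of cron firing a fresh jsub every minute (runs pile up whenever one
    # takes longer than a minute), keep a single continuous job per worker:
    jstart -quiet -mem 512m -cwd -N worker4 <worker command>
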
10:07 chasemp: next offenders are
10:07 chasemp: 5 tools.jimmy
10:07 chasemp: 6 tools.betaco
10:07 chasemp: 6 tools.eranbo
10:07 chasemp: 7 tools.phetoo
10:07 chasemp: 9 tools.yifeib
10:07 chasemp: 11 tools.anomie
10:07 chasemp: 11 tools.avicbo
10:08 chasemp: most of that seems ok to me atm
10:08 madhuvishy: yeah
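
[Note: the per-tool counts above look like the output of a sort|uniq pipeline over qstat; a sketch of one way to reproduce it, assuming gridengine's default qstat layout with the job owner in the fourth column.]

    # Count running jobs per tool account across the whole grid
    qstat -u '*' -s r | awk 'NR>2 {print $4}' | sort | uniq -c | sort -n | tail
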
10:09 chasemp: when you get that made w/ all the context can you make a note related to https://phabricator.wikimedia.org/T161950
10:09 chasemp: thanks madhuvishy
10:09 chasemp: and again, good morning!
10:10 madhuvishy: T161951
10:10 stashbot: T161951: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951
10:10 madhuvishy: chasemp: np! morning :)
10:30 andrewbogott: The new exec nodes are up and running now. Sounds like y'all found something else to blame though :)
10:30 Disconnected