du -h -d0 /srv/carbon/whisper/archived_metrics/
Thu, Jun 22
I've installed a cleanup cron on labmon1001. I'm going to give it a day and make sure that things are properly cleaned up, then this can be closed.
@Cmjohnson, @RobH, the cert for the existing puppetmaster is expiring on July 15th, so I'd like to move everything over to these new puppetmasters before that happens. Is it possible to bump these two boxes up in your queue so that they have network and OS by the end of this month?
We're close to setting up new puppetmaster hardware, as per T167905. So I'm going to let this slide in hopes of just moving everything to the new masters (with new certs and CAs) rather than trying to do a precarious in-place update.
The way to bypass this is
Wed, Jun 21
They forgot their nameservers for some reason. Fixed.
Apparently this is
Tue, Jun 20
Would that be acceptable?
Mon, Jun 19
Author: Paladox <firstname.lastname@example.org>
Date: Wed Jun 14 19:37:17 2017 +0000
Jun 19 23:41:04 einsteinium systemd: Started TCP socket to IRC bot: tcpircbot-logmsgbot.
Jun 19 23:41:04 einsteinium python: Traceback (most recent call last):
Jun 19 23:41:04 einsteinium python: File "tcpircbot.py", line 110, in <module>
Jun 19 23:41:04 einsteinium python: bot._connect()
Jun 19 23:41:04 einsteinium python: File "/usr/lib/python2.7/dist-packages/irc/bot.py", line 115, in _connect
Jun 19 23:41:04 einsteinium python: **self.__connect_params)
Jun 19 23:41:04 einsteinium python: File "tcpircbot.py", line 80, in connect
Jun 19 23:41:04 einsteinium python: ircbot.SingleServerIRCBot.connect(self, *args, **kwargs)
Jun 19 23:41:04 einsteinium python: File "/usr/lib/python2.7/dist-packages/irc/client.py", line 1191, in connect
Jun 19 23:41:04 einsteinium python: self.connection.connect(*args, **kwargs)
Jun 19 23:41:04 einsteinium python: File "/usr/lib/python2.7/dist-packages/irc/functools.py", line 35, in wrapper
Jun 19 23:41:04 einsteinium python: return method(self, *args, **kwargs)
Jun 19 23:41:04 einsteinium python: TypeError: connect() got an unexpected keyword argument 'ssl'
Jun 19 23:41:04 einsteinium systemd: tcpircbot-logmsgbot.service: main process exited, code=exited, status=1/FAILURE
Jun 19 23:41:04 einsteinium systemd: Unit tcpircbot-logmsgbot.service entered failed state.
For now, I am going to delete all metrics more than 2 years old:
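A minimal sketch of that kind of age-based cleanup, not the command that was actually run. It assumes Whisper files end in `.wsp` and that a file's mtime reflects its last update; the function names and the dry-run default are hypothetical.

```python
import os
import time

# Assumption: "2 years" is approximated as 730 days.
TWO_YEARS_SECONDS = 2 * 365 * 24 * 60 * 60


def find_stale_metrics(root, max_age_seconds=TWO_YEARS_SECONDS, now=None):
    """Return sorted paths of .wsp files under root not modified within max_age_seconds."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue  # leave anything that is not a Whisper file alone
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


def delete_stale_metrics(root, dry_run=True, **kwargs):
    """Delete stale files, or only report them when dry_run is True."""
    stale = find_stale_metrics(root, **kwargs)
    for path in stale:
        if dry_run:
            print("would delete", path)
        else:
            os.remove(path)
    return stale
```

Running it with `dry_run=True` first lists the candidates without touching anything, which is a cheap sanity check before the real deletion.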
I see this but can't figure out where it's coming from. We haven't had a box named virt1000 for ages... is the CA cert somehow still named that anyway?
Sun, Jun 18
I'm in the process of making a more comprehensive fix for this issue, but in the meantime I've tried a one-off fix... @mpopov, does it work now?
Fri, Jun 16
Thu, Jun 15
I am pretty sure that this is fine. I would like to be present and alert during the switchover, though, in case I'm forgetting about corner cases.
In general we're trying to make wikitech (and labtestwikitech) more like normal wikis... they're currently updated by the standard deployment train, and running normal up-to-date MediaWiki releases.
Tue, Jun 13
Right now I'm just checking periodically to see if there are new leaks.
ms-be03 is on labvirt1001. It's not the biggest CPU user on that host, but it /is/ the second biggest.
Unassigning as this is currently blocked on new Striker features.
The prime offender here is deployment-ms-be04.deployment-prep.eqiad.wmflabs, which is doing some kind of giant Swift operation. I don't know if this is on purpose or in error... hoping @fgiunchedi can chime in.
Mon, Jun 12
Not sure if h/w RAID is needed.
I (finally) wrote a script to hunt and kill leaked DNS records:
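The general idea of such a check can be sketched as a set difference between DNS records and live instances. This is a hypothetical illustration, not the script itself: the `find_leaked_records` name, the record-to-owner mapping, and the example hostnames are all assumptions.

```python
def find_leaked_records(dns_records, live_instance_ids):
    """Return sorted record names whose owning instance no longer exists.

    dns_records: mapping of record name -> owning instance id (assumed shape).
    live_instance_ids: iterable of ids for instances that still exist.
    """
    live = set(live_instance_ids)
    return sorted(name for name, owner in dns_records.items() if owner not in live)


# Hypothetical data: "id2" has been deleted but its record is still around.
records = {
    "a.example.eqiad.wmflabs.": "id1",
    "b.example.eqiad.wmflabs.": "id2",
    "c.example.eqiad.wmflabs.": "id3",
}
leaked = find_leaked_records(records, ["id1", "id3"])
```

Anything returned is only a candidate for deletion; the actual removal would go through whatever DNS service holds the records.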
Tue, Jun 6
I don't have a strong opinion about this. Running on tools = eating our own dogfood, which is often useful... but having a more canonical k8s cluster to bang on seems also useful.
Mon, Jun 5
Subbu is still using Promethium. We have half a plan to clean that up but in the meantime we'll need to keep some cruft around.
Leaked, and ssh fails. LDAP errors as below.
>>> banana Sorry! Could not fetch "banana" for you. No worries. There are lots of other pages to read. Pick a different title.
ok, let's leave things as they are for now -- I'll re-open if I can find a good example where it's needed.
Sat, Jun 3
@RobH, is there any reason we can't just register wikitech-static.wikimedia.org with markmonitor? Or does the fact that we control the rest of the wikimedia.org domain mean that -static would have to be in a different domain to outsource the DNS?
Looks good! Thanks!
Fri, Jun 2
All better now -- thank you!
Thu, Jun 1
Labtestvirt2003 is installed now, and properly attached to rabbitmq and the nova controller.
Great, thank you!
@dcausse nothing has changed on the actual wikitech, right? Because this is still pending the above patch? If so, I'm totally fine with merging the patch and seeing how things go. The updated test index seemed good to me.
wikitech-static-ord is now updating properly! There's some fancy automatic cert stuff on wikitech-static, so I'm hoping to refer the next steps to whoever set that up... @Dzahn was that you?
Wed, May 31
I moved one more away -- CPU usage is high now but not so high that I'm worried.
I moved two tools instances off of 1006. No obvious change in cpu metrics so far.