Tomorrow we're going to reinstall puppetmaster1001; puppet traffic is already pointed away from it. After the CA/private failover is completed (T189891: Failover puppet ca service from eqiad to codfw) these are the remaining steps:
Ported the dashboard to puppetdb 4; it could use some more work, but it's Good Enough (tm) for now.
Uploaded + deployed
Confirmed working! Thanks @ayounsi !
Mon, Mar 19
https://gerrit.wikimedia.org/r/c/420342/ for the namevirtualhost deprecation
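For context, in Apache 2.4 the NameVirtualHost directive no longer has any effect and only produces a startup warning, so the fix boils down to dropping it; a sketch with illustrative vhost details:

  # Apache 2.4: NameVirtualHost is obsolete and only triggers a
  # "NameVirtualHost has no effect" warning at startup, so drop it:
  # NameVirtualHost *:8140
  <VirtualHost *:8140>
      ServerName puppetmaster1001.example.org   # illustrative name
  </VirtualHost>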
puppetmaster2001 was reimaged with stretch and traffic moved back as planned; notes from the process:
Fri, Mar 16
cc hardware-requests as per process
On Monday 19th I'll reinstall puppetmaster2001 with stretch, using the following procedure:
Thu, Mar 15
puppetmaster2002 was repooled today and is working as intended. puppetdb on nihal had a spike in commands processed while compilations were happening on puppetmaster2002 and "recovered" after about half an hour.
Wed, Mar 14
Including something like this might help as well:

  LogLevel warn proxy:info proxy_http:info proxy_balancer:info
@Papaul racking plan looks good (i.e. one machine per row), thanks!
We're not whitelisting MBeans anymore; resolving.
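For reference, assuming this refers to the prometheus jmx_exporter config, dropping the whitelist means falling back to its default of querying all MBeans; a sketch:

  # before: only whitelisted MBeans were queried (pattern illustrative)
  whitelistObjectNames: ["java.lang:*"]
  # after: the key is removed entirely; jmx_exporter then defaults to all MBeans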
Tue, Mar 13
It would be helpful to have a cleanup policy for artifacts, so operators don't have to manually chase down and delete things to recover disk space.
We'd still need the OOM settings to help debug OOM cases like the ones we've seen on nitrogen, for example. Passing a directory instead of a file to -XX:HeapDumpPath will create dump files named after the pid, so we can get rid of the ExecStartPre setting too. Alternatively we can ship a systemd override file with the custom changes we need.
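A minimal sketch of the override approach (unit name, path and JAVA_ARGS variable are illustrative), relying on HotSpot writing java_pid<pid>.hprof files when -XX:HeapDumpPath points at a directory:

  # /etc/systemd/system/puppetdb.service.d/override.conf  (illustrative path)
  [Service]
  # dump heap on OOM; a directory target yields per-pid java_pid<pid>.hprof files,
  # so no ExecStartPre is needed to pre-create a unique file name
  Environment="JAVA_ARGS=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb"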
This should be resolved as all patches are merged and rhodium is running hiera 3 and compiling fine.
Not ideal to have secret data on the command line, but AFAICS there isn't a way to use the environment in this case.
Mon, Mar 12
All hosts in this task and its subtasks are ready for decom (running as spare systems now)
Archiving works for me -- I didn't know that's our MO! Is that for everything on gerrit, whether it's software we wrote or just imported into gerrit for convenience?
Tue, Mar 6
@akosiaris also suggested edac-tools, and that reminded me we're exporting EDAC metrics from node-exporter:
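As a sketch, assuming node-exporter's edac collector metric names, an alert on any uncorrectable memory errors could be expressed as:

  # fire if a DIMM reports uncorrectable errors in the last hour
  increase(node_edac_uncorrectable_errors_total[1h]) > 0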
Mon, Mar 5
Outcome from today's monitoring meeting: needs more investigation wrt whether we can get hardware error status from e.g. IPMI or Linux directly. Another option is looking at MCE logs, assuming the same type/quantity of errors is reported.
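A quick sketch of the two avenues (tool availability assumed; ipmi-sel is from freeipmi):

  # hardware errors from the BMC's system event log via IPMI
  sudo ipmi-sel
  # machine-check reports from the kernel side
  sudo journalctl -k | grep -i 'machine check'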
Fri, Mar 2
Sounds good to me; we'd also need to audit dashboards in case we're using it somewhere, and replace it with logstash metrics.
I mentioned this task and the problem to a friend working in SRE networking; we're now receiving about one tenth of the inbound ICMP traffic on the LVSes.
Thu, Mar 1
I've experienced the same with hiera 3 and a puppet master on stretch; it's likely related to "segmented keys" lookups.
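A minimal sketch of what a segmented-key lookup looks like (data and key names made up), using the dot notation that digs into a hash:

  # hieradata/common.yaml
  monitoring:
    contact: 'sre'

  # in a manifest; the dotted key digs into the 'monitoring' hash
  $contact = lookup('monitoring.contact')   # => 'sre'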
Wed, Feb 28
rhodium with the puppetdb-terminus from puppetdb 2.3 works as expected; the only initialization I had to do was to update /srv/private with actual contents instead of waiting for a commit on private.git.
Tue, Feb 27
Would be nice indeed; my preference would be for something based on latency and/or the error ratio: (number of errors) / (number of successes + number of errors).
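For the error-ratio part, a PromQL sketch (metric names are made up):

  # fraction of failed requests over the last 5 minutes
  rate(requests_errors_total[5m])
    / (rate(requests_success_total[5m]) + rate(requests_errors_total[5m]))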
Mon, Feb 26
I mocked some configuration values and installed mariadb on puppetmaster-filippo-stretch2 to test servermon.rb; as reported, I got a segfault (on the second run).
So role::puppetmaster::standalone with the patches proposed above works on stretch. For production, AFAIK it isn't trivial to run ::frontend / ::backend in labs; the simplest is probably to add (or reimage) capacity with stretch and iterate as needed.
After manually running puppet master --debug --no-daemonize --masterport 8142 and then interrupting it, Phusion is now apparently able to correctly spawn a puppet master as well.
I am trying to at least get role::puppetmaster::standalone going on stretch; not a whole lot of luck so far, namely the server returns 500s when contacted by its agents:
Sat, Feb 24
Fri, Feb 23
I noticed while working on puppetmaster on stretch that we didn't have a git repo to host the puppetdb source (packages), so I created operations/debs/puppetdb for this purpose; it's still to be populated.
I'll take a stab at this, first provisioning a stretch vm on wmcs and applying the relevant roles.
Thu, Feb 22
Wed, Feb 21
Today rsyslogd was "stuck" accepting new connections on lithium and wezen, at about the same time. This is a strace from check_ssl on einsteinium:
Tue, Feb 20
I bumped the minimum retention period to six months for all instances, and no adverse effects have been observed so far. I'm tentatively resolving this task, as the behaviour described hasn't reoccurred in recent Prometheus versions.
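For reference, assuming Prometheus 2.x, the retention bump boils down to its retention flag:

  # six months expressed in days (Prometheus 2.x flag)
  --storage.tsdb.retention=180d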