Description

<_joe_> Wmflib::Service has been added today
<_joe_> ok the original error was
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Failed to parse template varnish/upload-common.inc.vcl.erb:
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Filepath: /etc/puppet/modules/varnish/templates/upload-common.inc.vcl.erb
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Line: 42
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Detail: key not found: "upload_domain"
<_joe_> so Reedy I can't really wrap my head around why that wouldn't work
<_joe_> ok, found out I think
<_joe_> it's a bug in puppet 4.x
<Reedy> nice
<_joe_> yes confirmed
<_joe_> so I guess you need to open a task about upgrading puppet version in deployment-prep
Related Objects
Mentioned In
- T248041: puppetdb on deployment-puppetdb03 keeps getting OOMKilled
- T153163: Set up and use exported resources for Tool Labs's shared knowledge
- T244472: Stream a subset of mediawiki apache logs to logstash
- T244586: Restbase routing down on beta, 2020-02-07
- T244074: Host key verification failed x 2
- T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP
- T243355: puppet panel: Can't add new prefixes

Mentioned Here
- T223971: Old cloudvirt (with Intel Xeon) are half the speed of newer ones (Intel Sky Lake)
- T229441: CloudVPS: codfw1dev: missing bits
- rCLIPec5f9c1645d2: Horizon auto commit for user krenair
Event Timeline
FWIW:
maurelio@deployment-deploy01:~$ sudo puppet --version
4.8.2
To which version do we need to upgrade?
I'm guessing this is just us needing to get a new puppetmaster with buster instead of stretch.
>>! In T243226#5817998, @Krenair wrote:
> I'm guessing this is just us needing to get a new puppetmaster with buster instead of stretch.
I added the puppet-master version 5 packages to the puppet5 component, so you should be able to upgrade the puppetmaster with the instructions I have added to wikitech. However, the bigger problem is the puppetdb server. This will need to be rebuilt on buster, as it is not a simple task to backport the puppetdb packages. The procedure I would advise is:
- build new puppetdb server on buster
- update hiera config to include the following which will ensure that the new puppetdb server gets populated with the required metadata
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_current_puppetdb
  - $fqdn_new_puppetdb
profile::puppetmaster::common::command_broadcast: true
- puppet agent now needs to run on all clients so the new DB can build up its collection of exported resources; I would run it at least twice, just in case there are some complex dependencies
- either run it manually or wait an hour or two
- swap the order of profile::puppetmaster::common::puppetdb_hosts: so the new puppetdb is first (see the staged sketch after this list)
- validate everything is working
- remove the old puppetdb from profile::puppetmaster::common::puppetdb_hosts: and remove profile::puppetmaster::common::command_broadcast: true
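To make the sequence concrete, here is a sketch of the hiera value at each stage (same placeholder hostnames as in the steps above):

# stage 1 - both DBs, old one first, so the new DB gets populated
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_current_puppetdb
  - $fqdn_new_puppetdb
profile::puppetmaster::common::command_broadcast: true

# stage 2 - swap the order so the new DB is queried first
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_new_puppetdb
  - $fqdn_current_puppetdb
profile::puppetmaster::common::command_broadcast: true

# stage 3 - once validated, drop the old DB and the broadcast flag
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_new_puppetdb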
I'm not an admin in the project but happy for you to add me if you need any help. It's quite possible I have missed a step, as I'm largely going from memory.
Mentioned in SAL (#wikimedia-releng) [2020-01-21T22:10:51Z] <Krenair> deployment-prep added jbond as admin, T243226
- made deployment-puppetdb03, did the usual puppet client setup
- Turns out we use role::puppetmaster::standalone which it seems doesn't support command_broadcast
done anyway :)
Change 566380 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] role::puppetmaster::standalone: Add support for multiple PuppetDB hosts
Well, my new puppetdb instance seems to not be working very well yet:
root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: [503 Service Unavailable] PuppetDB is currently down. Try again later.
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)#
The change applied earlier on the puppetmaster was:
Notice: /Stage[main]/Puppetmaster::Puppetdb::Client/File[/etc/puppet/puppetdb.conf]/content:
--- /etc/puppet/puppetdb.conf	2018-05-27 03:14:06.246340384 +0000
+++ /tmp/puppet-file20200121-15085-n4buga	2020-01-21 23:11:14.919611883 +0000
@@ -1,3 +1,4 @@
 [main]
-server_urls = https://deployment-puppetdb02.deployment-prep.eqiad.wmflabs:443
+server_urls = https://deployment-puppetdb02.deployment-prep.eqiad.wmflabs:443,https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs:443
+command_broadcast = true
 soft_write_failure = false
Am going to remove the new host from the puppetdb_hosts list in hieradata on the puppetmaster and manually revert this change to get things working again.
rCLIPec5f9c1645d2eadc8db259755bf163c69e0409d6
Reloaded apache2 on deployment-puppetmaster03
Change 566500 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] role::puppetmaster::standalone: support multiple puppetdb servers
Hi Alex,
I noticed that the postgres database was missing the puppetdb user; however, a simple puppet run on the puppetmaster fixed the problem. I did restart the postgres server first, as I had seen issues in the past with it only binding to localhost, so this could have helped. Anyway, after the puppet run puppetdb started correctly, and applying a change similar to the one you highlighted above seems to work.
Puppet run output:

root@deployment-puppetdb03:~# sudo puppet agent -t
Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-puppetdb03.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '(2460e5f31d) root - role::puppetmaster::standalone: Add support for multiple PuppetDB hosts'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[create_user-puppetdb@localhost]/returns: executed successfully
Info: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[create_user-puppetdb@localhost]: Scheduling refresh of Exec[pass_set-puppetdb@localhost]
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[pass_set-puppetdb@localhost]/returns: executed successfully
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[pass_set-puppetdb@localhost]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[prometheus@localhost]/Exec[create_user-prometheus@localhost]/returns: executed successfully
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::Db[puppetdb]/Exec[create_postgres_db_puppetdb]/returns: executed successfully
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Exec[create_tgrm_extension]/returns: executed successfully
Notice: Applied catalog in 6.41 seconds
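For reference, a quick way to check the localhost-binding issue mentioned above (a sketch; 5432 is the default postgres port and the service name assumes Debian's packaging):

# show which addresses postgres is listening on
sudo ss -tlnp | grep 5432
# restart postgres, then let a puppet run recreate any missing roles
sudo systemctl restart postgresql
sudo puppet agent -t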
Change 566500 merged by Jbond:
[operations/puppet@production] role::puppetmaster::standalone: support multiple puppetdb servers
Hi Alex,
I have pushed the change to allow multiple puppetdbs with command_broadcast and updated the node metadata in Horizon (I removed some redundant keys as well). You can check how populated the new DB is with the following command:
- check how many nodes have checked in; once the output of this matches on both nodes, it should be safe to make the switch

curl -X POST http://localhost:8080/pdb/query/v4/nodes | jq .[].certname
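To compare totals rather than eyeballing the lists, counting the entries works too (a sketch against the same endpoint):

# number of certnames registered in this puppetdb
curl -s -X POST http://localhost:8080/pdb/query/v4/nodes | jq length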
Simply running puppet on the puppetmaster/puppetdb hosts was one of the things that was broken... I bet restarting postgres was what actually fixed things.
alex@alex-laptop:~$ ssh deployment-puppetdb02 curl -s -X POST http://localhost:8080/pdb/query/v4/nodes | jq -r .[].certname | sort -d > puppetdb02
alex@alex-laptop:~$ ssh deployment-puppetdb03 curl -s -X POST http://localhost:8080/pdb/query/v4/nodes | jq -r .[].certname | sort -d > puppetdb03
alex@alex-laptop:~$ diff puppetdb0{2,3}
26d25
< deployment-eventgate-1.deployment-prep.eqiad.wmflabs
43,45d41
< deployment-logstash03.deployment-prep.eqiad.wmflabs
< deployment-logstash2.deployment-prep.eqiad.wmflabs
< deployment-maps04.deployment-prep.eqiad.wmflabs
48d43
< deployment-mediawiki-07.deployment-prep.eqiad.wmflabs
50d44
< deployment-mediawiki-jhuneidi.deployment-prep.eqiad.wmflabs
64d57
< deployment-pdfrender02.deployment-prep.eqiad.wmflabs
69d61
< deployment-puppetmaster03.deployment-prep.eqiad.wmflabs
75d66
< deployment-schema-1.deployment-prep.eqiad.wmflabs
78,79d68
< deployment-sessionstore01.deployment-prep.eqiad.wmflabs
< deployment-sessionstore02.deployment-prep.eqiad.wmflabs
81d69
< deployment-snapshot01.deployment-prep.eqiad.wmflabs
deployment-eventgate-1.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-logstash03.deployment-prep.eqiad.wmflabs - exists but out of disk space so no puppet runs
deployment-logstash2.deployment-prep.eqiad.wmflabs - exists but out of disk space so no puppet runs (not planning to fix this instance anymore, I made 03 to replace it)
deployment-maps04.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-mediawiki-07.deployment-prep.eqiad.wmflabs - Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'effie'); The last Puppet run was at Mon Jan 20 10:56:33 UTC 2020 (3515 minutes ago). Puppet is disabled. effie @jijiki
deployment-mediawiki-jhuneidi.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-pdfrender02.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-puppetmaster03.deployment-prep.eqiad.wmflabs - Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'test multiple puppetdb backends - jbond');
deployment-schema-1.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-sessionstore01.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-sessionstore02.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-snapshot01.deployment-prep.eqiad.wmflabs - Resource type not found: Wmflib::Service at /etc/puppet/modules/wmflib/functions/service/fetch.pp:4:51
New diff after deactivating deceased things and enabling puppet on the puppetmaster:
alex@alex-laptop:~$ diff puppetdb0{2,3}
42,43d41
< deployment-logstash03.deployment-prep.eqiad.wmflabs
< deployment-logstash2.deployment-prep.eqiad.wmflabs
46d43
< deployment-mediawiki-07.deployment-prep.eqiad.wmflabs
74d70
< deployment-snapshot01.deployment-prep.eqiad.wmflabs
Change 566380 abandoned by Alex Monk:
role::puppetmaster::standalone: Add support for multiple PuppetDB hosts
Reason:
Redone in I3a2cb763, for some reason.
logstash03 seems to be making good use of the 20G root volume to store... logs :(
root@deployment-logstash03:~# du -hsx /var/log/* | grep -E '^([0-9]*M|[0-9\.]*G)'
5.8G	/var/log/daemon.log
230M	/var/log/daemon.log.1
211M	/var/log/elasticsearch
1.9G	/var/log/kafka
2.2G	/var/log/logstash
3.4G	/var/log/syslog
1.7G	/var/log/syslog.1
37M	/var/log/syslog.4.gz
18M	/var/log/syslog.5.gz
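A sketch of one way to claw back space on a host like this (destructive, in that the truncated logs are lost):

# truncate the largest live logs in place; daemons keep their open handles
sudo truncate -s 0 /var/log/daemon.log /var/log/syslog
# drop the already-rotated copies
sudo rm /var/log/daemon.log.1 /var/log/syslog.1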
As for snapshot01 - that one is erroring about Wmflib::Service. Is this something that'll get resolved by the puppet upgrade? Should we proceed without that one?
> As for snapshot01 - that one is erroring about Wmflib::Service. Is this something that'll get resolved by the puppet upgrade? Should we proceed without that one?
Yes, I believe this is a bug in puppet 4.8, so this can be ignored.
I'm not sure what did it but logstash2 has shown up. I made a little space on logstash03 and that's on the list now too. So that just leaves us with mediawiki-07 - need to find out why puppet is disabled and see if we can get it running again, if only temporarily.
@jijiki: Hi, deployment-mediawiki-07.deployment-prep.eqiad.wmflabs has had puppet disabled since approx Mon Jan 20 10:56:33 UTC 2020 (12417 minutes ago) with the comment 'effie' - assuming that's you, do you still need that? I would like to enable puppet so we can complete some puppet infrastructure changes in the project.
Swapped the order of the list in hieradata for deployment-puppetmaster03. Once I'm happy it's working I'll remove the old puppetdb from the list and disable command_broadcast.
On the subject of deployment-mediawiki-07: I plan to re-enable puppet on Wednesday 5th unless we can find out why it was disabled.
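For anyone following along, the disable reason can be inspected (and puppet re-enabled) roughly like this; the lock file path assumes Puppet's default statedir on these hosts:

# show who disabled the agent and why
cat /var/lib/puppet/state/agent_disabled.lock
# re-enable and kick off a test run
sudo puppet agent --enable
sudo puppet agent -tv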
@Krenair I have been experimenting with this host; feel free to enable puppet if needed, sorry for the inconvenience :)
No problem, just trying to avoid overwriting others' work. Am also aware (including having done it myself a few times) that sometimes people just forget to re-enable it when done. When I do it I'll post a copy of the diff to avoid losing anything.
@Krenair thank you, I already have a backup of the configuration, so there is no need to let this disrupt you any further
@jbond: Okay so I think our puppetdb stuff is migrated now. I just removed the old server from the list, removed command broadcasting, and puppet still seems to be working. Next step is to upgrade our puppetmaster?
Just done the puppetmaster upgrade and found puppet no longer runs on the puppetmaster:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: undefined method `key_attributes' for nil:NilClass
Is it possibly due to our lack of an updated puppetdb-termini package?
root@deployment-puppetmaster03:~# apt-cache policy puppetdb-termini
puppetdb-termini:
  Installed: 4.4.0-1~wmf2
  Candidate: 4.4.0-1~wmf2
  Version table:
 *** 4.4.0-1~wmf2 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/component/puppetdb4 amd64 Packages
        100 /var/lib/dpkg/status
     4.4.0-1~wmf1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
@Krenair I think the package needed is puppet-terminus-puppetdb, which is provided by the puppetdb-6.2.0 source package. I have looked at building this, but it seems to have a lot of Clojure-based dependencies, making it difficult to build for stretch. I think the simplest solution would be to build a new buster puppetmaster04 and migrate to that. I'm not too sure on that infrastructure so wouldn't want to make the changes myself, however I'm happy to help; feel free to ping me on IRC (jbond42).
Last night I tried moving cache-text05 to use the new puppetmaster. It doesn't seem to work just yet, as for some reason it (the puppetmaster?) is attempting to connect to puppet:8140 instead of our configured puppetdb host. I think in labs that name always points to the central puppetmaster, which we don't use in deployment-prep.
I have performed the following actions:
- copied the CA from deployment-puppetmaster03 to deployment-puppetmaster04
- on deployment-puppetmaster04, renamed deployment-puppetmaster03 to puppetca under /var/lib/puppet/server/ssl/
- regenerated the deployment-cache-text05 certificate
With this setup the puppetmasters can have their own client/agent certificate under /var/lib/puppet/ssl/ and use the CA certificates under /var/lib/puppet/server/ssl/certs/ for signing. The CA certificate will continue to have issuer=CN = Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs, however this is not a big problem as it will just be used for signing. This is similar to production, which still has issuer=CN = Puppet CA: palladium.eqiad.wmnet. With this configuration you should be able to just point hosts at the new puppetmaster, i.e. add the following to /etc/puppet/puppet.conf, and things should just work with no need to delete and re-sign SSL certificates:
[agent]
server = deployment-puppetmaster04.deployment-prep.eqiad.wmflabs
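A quick way to confirm the issuer situation described above (a sketch; the CA filename under server/ssl/certs/ is an assumption):

# issuer of the signing CA on the master
sudo openssl x509 -in /var/lib/puppet/server/ssl/certs/ca.pem -noout -issuer
# issuer of this host's own agent certificate
sudo openssl x509 -in /var/lib/puppet/ssl/certs/$(hostname -f).pem -noout -issuer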
I tested a puppet run on deployment-cache-text05; however, that is now failing with the following unrelated error:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find template 'varnish/misc-common.inc.vcl.erb' (file: /etc/puppet/modules/varnish/manifests/wikimedia_vcl.pp, line: 31, column: 28) (file: /etc/puppet/modules/varnish/manifests/instance.pp, line: 38) on node deployment-cache-text05.deployment-prep.eqiad.wmflabs
This seems to be caused by the deployment-puppetmasters having a different checkout of the puppet repo, missing a commit which removes varnish/misc-common.inc.vcl.erb.
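A sketch of how one might compare the checkouts (repo path as used earlier in this task; the template path follows the standard module layout):

cd /var/lib/git/operations/puppet
# current HEAD of this master's checkout
git log --oneline -1
# recent commits touching the missing template
git log --oneline -3 -- modules/varnish/templates/misc-common.inc.vcl.erb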
That is what I was hoping to avoid, but thanks for getting it working.
Nah it's got the commit, it's just another hieradata difference. I'm behind on the whack-a-mole.
Got puppet running on -cache-text05, whole beta cluster broke, fixed acme-chief and ATS, going to sleep.
puppetdb on deployment-puppetdb03 was killed by kernel OOM at Feb 7 09:50:29, per syslog. I just now ran systemctl start puppetdb on that host, to fix puppet issues in beta.
Not sure if the OOM is a known issue related to this task. As a follow-up, we might want to add a Restart=... line to the systemd unit file to make this self-healing next time this happens.
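One way to do that without touching the packaged unit is a systemd drop-in (a sketch; the exact Restart values are assumptions about what we'd want):

sudo systemctl edit puppetdb
# then add in the editor:
#   [Service]
#   Restart=on-failure
#   RestartSec=60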
Thanks dpifke. I've seen it do that before too. We probably need to tune it a bit - I recall puppetdb hosts in particular have a hiera setting relating to memory usage, though this one should be similar in size to puppetdb02. They're m1.small so probably not going to get the same resources ops would give the prod puppetdb VMs.
On a separate note, I have moved all the deployment-prep hosts I knew were failing due to this error over to the new puppetmaster, and will continue to move the rest.
I've moved the remaining instances over to using the new puppetmaster. Puppet does appear to be struggling on deployment-mwmaint01 and deployment-puppetmaster03 though - not sure why.
So before closing this task and removing puppetmaster03, someone should address:
- puppetdb03 memory usage
- puppetmaster04 disk usage
- puppet run time on mwmaint01/puppetmaster03
Turning off debug logging on puppetmaster04 (I set logdest = /dev/null in /etc/puppet/puppet.conf) has helped with the disk usage and sluggishness issues. But sadly puppet runs are currently failing with:
deployment-puppetmaster04:~# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for Resolv::DNS::Resource::IN::A (file: /etc/puppet/manifests/realm.pp, line: 64, column: 9) on node deployment-puppetmaster04.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
That last issue (the resolution failure) was a side-effect of work I was doing for T229441. That issue is resolved, but now the failure is:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template ssh/known_hosts.erb:
Filepath: /var/lib/git/operations/puppet/modules/puppetdbquery/lib/puppetdb/connection.rb
Line: 68
Detail: PuppetDB query error: [500] Server Error, query: ["and",["=","type","Sshkey"],["~","title",".*"],["=","exported",true]]
(file: /etc/puppet/modules/ssh/manifests/client.pp, line: 8, column: 24) on node deployment-puppetmaster04.deployment-prep.eqiad.wmflabs
No idea if that's related to my work or not.
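For what it's worth, the failing query can be replayed straight against puppetdb to separate it from the puppetmaster (a sketch; the query body is copied from the error above, the endpoint style from earlier in this task):

curl -s -X POST http://localhost:8080/pdb/query/v4/resources \
  -H 'Content-Type: application/json' \
  -d '{"query": ["and", ["=", "type", "Sshkey"], ["=", "exported", true]]}' | jq length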
So this error was the memory usage problem on puppetdb03 I mentioned above - puppetdb won't work without postgresql, which couldn't start because it wants more memory than the system has. I changed role::puppetmaster::puppetdb::shared_buffers from 768MB to 600MB.
After doing this we found everything was very slow when the puppetmaster was running on cloudvirt1013 but OK when moved to cloudvirt1017. Either the host was under particular load (it shouldn't have been; there wasn't a lot else running there) or this is another case of T223971.
Thanks to Andrew it seems to be running well now. I've copied across /var/lib/puppet/volatile to sort out a lot of swift/GeoIP failures.
Also copied /etc/conftool-state/mediawiki.yaml to sort out mediawiki::state for mwmaint01
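For the record, those two copies amount to something like this (a sketch; the rsync flags and pull direction are assumptions):

# run from deployment-puppetmaster04
sudo rsync -a deployment-puppetmaster03.deployment-prep.eqiad.wmflabs:/var/lib/puppet/volatile/ /var/lib/puppet/volatile/
sudo rsync -a deployment-puppetmaster03.deployment-prep.eqiad.wmflabs:/etc/conftool-state/mediawiki.yaml /etc/conftool-state/mediawiki.yaml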
I've also taken /root and /home and put them at deployment-puppetmaster04:/root/deployment-puppetmaster03-{homes,root}.tar.gz in case anyone needs anything
Mentioned in SAL (#wikimedia-releng) [2020-02-15T20:25:06Z] <Krenair> T243226 Shut off deployment-puppetdb02 and deployment-puppetmaster03 - will leave for a week before deletion