
Krenair (Alex Monk)
Wikimedia volunteer

User Details

User Since
Oct 3 2014, 2:34 PM (281 w, 2 d)
Availability
Available
IRC Nick
Krenair
LDAP User
Alex Monk
MediaWiki User
Krenair [ Global Accounts ]

I am a Wikimedia volunteer helping in various technical ways. These days it's usually Beta Cluster, Cloud VPS, or Operations related labs puppet migrations. Since 2012 I've spent significant amounts of time involved in MediaWiki development, software deployments to the Wikimedia cluster, OTRS (email response to e.g. info-en@wikimedia.org addresses), and various other things.

Some of my old VisualEditor and other work (2014-2016) can be found under @AlexMonk-WMF instead.

I have opinions on things, which do not necessarily represent those of any organisation I am, have previously been, or will in the future be affiliated with.

Recent Activity

Today

Krenair added a comment to T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses.

for the traffic cloud instances I bypassed this issue with this ugly hack:

authdns_servers:
  208.80.154.11: 208.80.154.11
  208.80.154.135: 208.80.154.135
Sun, Feb 23, 8:27 PM · Patch-For-Review, IPv6, cloud-services-team, Acme-chief
Krenair added a comment to T245606: CloudVPS: enable BGP in the neutron transport network.

I've been reading the linked proposal and noticed this:
"the internal flat network CIDR. This is 172.16.0.0/21 in eqiad1 and 172.16.128.0/24 in codfw1dev."
I see hieradata/codfw/profile/openstack/codfw1dev/neutron.yaml has that as a /24 but I see several other references in puppet to it being /21?
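
To make the size of that discrepancy concrete, here is a minimal sketch using Python's standard ipaddress module. The two prefixes are the ones quoted above for codfw1dev; which of them is actually correct is not something this snippet can tell you, it only shows how far apart they are.

```python
import ipaddress

# The two candidate definitions of the codfw1dev flat network mentioned above.
flat_24 = ipaddress.ip_network("172.16.128.0/24")
flat_21 = ipaddress.ip_network("172.16.128.0/21")


def describe(net):
    """Summarise a network as 'prefix: N addresses, first - last'."""
    return f"{net}: {net.num_addresses} addresses, {net[0]} - {net[-1]}"


print(describe(flat_24))
print(describe(flat_21))

# The /24 sits entirely inside the /21, so anything (routes, firewall rules)
# written for the /21 would also cover seven further /24 blocks.
print(flat_24.subnet_of(flat_21))
```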

Sun, Feb 23, 4:46 PM · netops, Operations, cloud-services-team (Kanban)
Krenair added a comment to T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses.

T176891: DNS resolution chosing IPv6 addrs on hosts with only link-local IPv6 addresses may be related.

Sun, Feb 23, 4:35 PM · Patch-For-Review, IPv6, cloud-services-team, Acme-chief
Krenair added a comment to T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses.

I'd be interested to know why this has not been a problem before by the way - those cloud-ns hosts have had AAAA records since creation AFAIK

Sun, Feb 23, 2:33 AM · Patch-For-Review, IPv6, cloud-services-team, Acme-chief
Krenair renamed T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses from tools-acme-chief-01 is attempting to validate DNS challenge against authdns IPv6 addresses to tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses.
Sun, Feb 23, 2:31 AM · Patch-For-Review, IPv6, cloud-services-team, Acme-chief
Krenair created T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses.
Sun, Feb 23, 2:30 AM · Patch-For-Review, IPv6, cloud-services-team, Acme-chief

Yesterday

Krenair added a comment to T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.

Also cleaned up some other random things I found laying around - [...] and the old broken root ssh userkeys directory thing also on -elastic-0[34].

Sat, Feb 22, 11:57 PM · cloud-services-team (Kanban), Toolforge
Krenair closed T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) as Resolved.

done

Sat, Feb 22, 10:47 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T245494: CloudVPS: figure out DNS zone ownership transfers and setup.

codfw1dev.wmcloud.org. is still broken as I can't seem to create records under it - now I get HTTP 500s :(
I've also noticed codfw1dev.wikimedia.cloud. is lacking NS/SOA records like this one - presumably another record ownership problem?
I have successfully created my puppetmaster.cloudinfra-codfw1dev.codfw1dev.wmcloud.org. record in designate though it doesn't show up to external DNS queries, possibly because of lack of NS records for cloudinfra-codfw1dev under codfw1dev.wmcloud.org?

Sat, Feb 22, 10:11 PM · cloud-services-team (Kanban)
Krenair added a comment to T245494: CloudVPS: figure out DNS zone ownership transfers and setup.

Please can someone sort out the codfw1dev.wmcloud.org. zone in cloudinfra-codfw1dev? It's broken in the same way wmcloud.org was and I don't have the kind of access needed to fix this anymore. I need it for T242607: Create in-cloud puppetmaster for codfw1dev (labs cloud-wide puppetmasters need a public IP to be exposed so horizon can talk to encapi, and also so designate can SSH in to clean up certs of deleted instances).
Also while we're here, shouldn't there be a cloudinfra-codfw1dev.codfw1dev.wmcloud.org. zone?

Sat, Feb 22, 1:52 AM · cloud-services-team (Kanban)

Fri, Feb 21

Krenair added a comment to T235218: Catch cloud-puppetmasters up with production puppetmaster software versions.

So stuff is now moved over. The old instances are shut down and I expect to delete them in a week or two (anyone have any preferences for how long to wait?)
Unfortunately due to https://github.com/puppetlabs/puppet/pull/5813 we've had to remove the horizon UI for adding roles.

Fri, Feb 21, 1:11 AM · Patch-For-Review, User-jbond, cloud-services-team (Kanban)

Thu, Feb 20

Bstorm awarded T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster a Yellow Medal token.
Thu, Feb 20, 1:17 AM · cloud-services-team (Kanban), Toolforge
Krenair added a comment to T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.

This happened uneventfully as far as I can tell - only thing I saw change (I supervised it on a few random hosts) I traced back to a difference in how stretch/buster's build of ruby serialises YAML - somewhere between ruby 2.1.5p273 (2014-11-13) [x86_64-linux-gnu] and ruby 2.3.3p222 (2016-11-21) [x86_64-linux-gnu] it started adding quotes around some strings.
Also cleaned up some other random things I found laying around - puppet being disabled on -prometheus-04, old role::puppet::self configs on -elastic-0[34], and the old broken root ssh userkeys directory thing also on -elastic-0[34].

Thu, Feb 20, 12:37 AM · cloud-services-team (Kanban), Toolforge

Wed, Feb 19

Krenair added a comment to T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.

I tried it on tools-package-builder-02 and nothing changed. How do we want to roll this out, just update project hieradata and run puppet a couple of times everywhere? Choose a few instance name prefixes and just move those over to start with?

Wed, Feb 19, 12:17 AM · cloud-services-team (Kanban), Toolforge

Tue, Feb 18

Krenair added a comment to T168677: Add new Cloud Services domains to public suffix list.

now it's missing from designate though:

Tue, Feb 18, 9:00 PM · Toolforge, Cloud-VPS, cloud-services-team (Kanban)
Krenair awarded T168677: Add new Cloud Services domains to public suffix list an Evil Spooky Haunted Tree token.
Tue, Feb 18, 12:25 AM · Toolforge, Cloud-VPS, cloud-services-team (Kanban)
Krenair added a comment to T168677: Add new Cloud Services domains to public suffix list.

It seems to me something is wrong with this particular domain in Designate. I remember doing some database updates by hand.

Tue, Feb 18, 12:22 AM · Toolforge, Cloud-VPS, cloud-services-team (Kanban)
Krenair added a comment to T245174: CloudVPS: automatically create per-project subdomain.

We already have <project>.wmflabs.org auto-creations and should probably continue this with the new domains

Tue, Feb 18, 12:15 AM · cloud-services-team (Kanban)

Mon, Feb 17

Krenair added a comment to T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.

Unmounted NFS stuff, copied /root (to /root/tools-puppetmaster-01-root.tgz), /home (to /root/tools-puppetmaster-01-homes.tgz), labs/private.git cherry-picks, and CA (and deleted the temporary local copy from my laptop). Turns out it has no operations/puppet.git cherry-picks (yay).
Anything else that needs to be done before picking a guinea pig instance to try moving to it?

Mon, Feb 17, 1:06 AM · cloud-services-team (Kanban), Toolforge

Sun, Feb 16

Krenair updated Krenair.
Sun, Feb 16, 9:19 PM
Krenair added a comment to T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.

Created tools-puppetmaster-02 and set up with role::puppetmaster::standalone - will need to copy files (CA, operations/puppet.git cherry-picks, labs/private.git cherry-picks, /root and /home) across before changing project hieradata to use the new instance.

Sun, Feb 16, 9:19 PM · cloud-services-team (Kanban), Toolforge
Krenair updated the task description for T236565: "tools" Cloud VPS project jessie deprecation.
Sun, Feb 16, 6:11 PM · cloud-services-team (Kanban), Toolforge, Cloud-VPS (Debian Jessie Deprecation)
Krenair added a subtask for T236565: "tools" Cloud VPS project jessie deprecation: T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.
Sun, Feb 16, 6:11 PM · cloud-services-team (Kanban), Toolforge, Cloud-VPS (Debian Jessie Deprecation)
Krenair added a parent task for T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster: T236565: "tools" Cloud VPS project jessie deprecation.
Sun, Feb 16, 6:10 PM · cloud-services-team (Kanban), Toolforge
Krenair created T245365: Replace tools-puppetmaster-01 (jessie) with a buster puppetmaster.
Sun, Feb 16, 6:10 PM · cloud-services-team (Kanban), Toolforge
Krenair added a comment to T153163: Set up and use exported resources for Tool Labs's shared knowledge.

Based on T243226#5843560 (can't use a stretch puppetmaster with a buster puppetdb) our jessie puppetmaster in tools will be useless, we'll need a new (buster) puppetmaster, and we'll need one soon anyway given the upcoming removal of jessie support in general, so I guess I'll make a task for that.

Sun, Feb 16, 6:09 PM · cloud-services-team (Kanban), Patch-For-Review, Toolforge
Krenair added a comment to T153163: Set up and use exported resources for Tool Labs's shared knowledge.

Sorted puppet out on toolsbeta-puppetdb-01 (puppetmaster::servers entry had no loadfactor).

Sun, Feb 16, 5:46 PM · cloud-services-team (Kanban), Patch-For-Review, Toolforge
Krenair added a comment to T235218: Catch cloud-puppetmasters up with production puppetmaster software versions.

@Krenair I have also run into the certificate issue while looking at labtestpuppetmaster. I have a patch

Sun, Feb 16, 2:29 AM · Patch-For-Review, User-jbond, cloud-services-team (Kanban)

Sat, Feb 15

Krenair added a comment to T242607: Create in-cloud puppetmaster for codfw1dev.

Actually I realised I could hack around the problem, it works now:

Sat, Feb 15, 10:15 PM · Epic, cloud-services-team (Kanban)
Krenair added a comment to T242607: Create in-cloud puppetmaster for codfw1dev.

I wanted to put this down in writing since it's not particularly obvious that this work is currently blocked on puppet functioning using the existing labtestpuppetmaster2001.wikimedia.org master. I've done what I can by setting hieradata in horizon, but I'm at the stage now where I can't continue without either control of the master or https://gerrit.wikimedia.org/r/572421

Sat, Feb 15, 10:07 PM · Epic, cloud-services-team (Kanban)
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Also copied /etc/conftool-state/mediawiki.yaml to sort out mediawiki::state for mwmaint01
I've also taken /root and /home and put them at deployment-puppetmaster04:/root/deployment-puppetmaster03-{homes,root}.tar.gz in case anyone needs anything

Sat, Feb 15, 8:23 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Thanks to Andrew it seems to be running well now. I've copied across /var/lib/puppet/volatile to sort a lot of swift/GeoIP failures.

Sat, Feb 15, 7:47 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

So this error was the memory usage problem on puppetdb03 I mentioned above - puppetdb won't work without postgresql, which can't start because it wants more memory than the system has. Changed role::puppetmaster::puppetdb::shared_buffers from 768MB to 600MB.

Sat, Feb 15, 12:32 AM · Operations, Beta-Cluster-Infrastructure
Krenair created P10417 WMCS canary 1m load averages by host CPU type.
Sat, Feb 15, 12:09 AM · Cloud-VPS

Fri, Feb 14

Krenair claimed T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).
Fri, Feb 14, 7:04 PM · Operations, Beta-Cluster-Infrastructure

Wed, Feb 12

Krenair added a comment to T234234: Redesign architecture of irc-recentchanges on top of Kafka.
  1. About the low usage of irc.wikimedia.org - yes I agree that few bots are using it (~300)
Wed, Feb 12, 9:09 AM · User-Elukey, Analytics
Krenair added a comment to T244719: Create a replacement for kraz.wikimedia.org.

The way this works now is that the entire MW fleet sends UDP packets to a specific IP (kraz) using the so-called "echo" protocol (= #channel<tab>message). We could theoretically switch this to a multicast address in order to get the ability of having multiple listeners (all connecting to separate IRC servers, each on each listener's localhost perhaps?), but no one has invested the time to do this and set up those multiple frontends.
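
As a rough illustration of the "#channel<tab>message" format described above, a sender could look something like the sketch below. The function names, and the host and port in the usage note, are hypothetical; the only thing taken from the comment is the payload layout, and details such as trailing newlines or encoding are assumptions.

```python
import socket


def format_echo_line(channel: str, message: str) -> bytes:
    """Build one datagram payload in the '#channel<TAB>message' layout."""
    if not channel.startswith("#"):
        raise ValueError("channel is expected to start with '#'")
    # UTF-8 is an assumption here; the comment above does not state an encoding.
    return f"{channel}\t{message}".encode("utf-8")


def send_echo_line(host: str, port: int, channel: str, message: str) -> None:
    """Send a single payload as one UDP datagram (fire and forget)."""
    payload = format_echo_line(channel, message)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Usage would be along the lines of send_echo_line("127.0.0.1", 9390, "#en.wikipedia", "some change line"), with the address and port chosen purely for illustration. Because UDP is connectionless, pointing the same call at a multicast group address is what would allow several listeners to receive each datagram.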

Wed, Feb 12, 12:39 AM · serviceops, Operations, vm-requests, User-Elukey, Analytics

Mon, Feb 10

Krenair added a comment to T244624: Beta puppet patch "prometheus: make ferm DNS record type configurable".

We still have some Jessie instances which are in need of migration. I think
production still has some jessie instances.

Mon, Feb 10, 1:29 PM · Beta-Cluster-Infrastructure, Operations, observability

Sat, Feb 8

Krenair placed T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) up for grabs.

So before closing this task and removing puppetmaster03, someone should address:

  • puppetdb03 memory usage
  • puppetmaster04 disk usage
  • puppet run time on mwmaint01/puppetmaster03
Sat, Feb 8, 8:52 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

I've moved the remaining instances over to using the new puppetmaster. Puppet does appear to be struggling on deployment-mwmaint01 and deployment-puppetmaster03 though - not sure why.

Sat, Feb 8, 8:45 PM · Operations, Beta-Cluster-Infrastructure
Krenair closed T221879: thumbor on deployment-imagescaler03 does not want to start with firejail private-dev rule as Resolved.

guess this got fixed at some point:

krenair@deployment-imagescaler03:~$ sudo service thumbor@8801 status
● thumbor@8801.service - Thumbor image manipulation service (instance 8801)
   Loaded: loaded (/lib/systemd/system/thumbor@.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-10-16 09:21:34 UTC; 3 months 23 days ago
 Main PID: 287 (firejail)
Sat, Feb 8, 3:44 PM · Thumbor, Beta-Cluster-Infrastructure
Krenair removed a project from T244642: Do something about deployment-imagescaler01: Cloud-VPS (Debian Jessie Deprecation).
Sat, Feb 8, 3:41 PM · Beta-Cluster-Infrastructure
Krenair created T244642: Do something about deployment-imagescaler01.
Sat, Feb 8, 3:41 PM · Beta-Cluster-Infrastructure
Krenair closed T217279: Ferm failing to start in a new way on deployment-(imagescaler02|mediawiki-09) as Resolved.

Looks like this got solved at some point, shrug

Sat, Feb 8, 3:35 PM · Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Thanks dpifke. I've seen it do that before too. We probably need to tune it a bit - I recall puppetdb hosts in particular have a hiera setting relating to memory usage, though this one should be similar in size to puppetdb02. They're m1.small so probably not going to get the same resources ops would give the prod puppetdb VMs.

Sat, Feb 8, 3:22 PM · Operations, Beta-Cluster-Infrastructure
Krenair closed T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP as Resolved.

should be solved now, if it breaks again (within the next few weeks only of course) you can reopen

Sat, Feb 8, 2:26 PM · Browser-Support-Firefox, Beta-Cluster-Infrastructure
Krenair added a comment to T244586: Restbase routing down on beta, 2020-02-07.

(I added some corrected hieradata to cache-text05 in horizon)

Sat, Feb 8, 12:02 PM · User-Ryasmeen, Operations, Traffic, Beta-Cluster-Infrastructure, RESTBase
Krenair closed T244586: Restbase routing down on beta, 2020-02-07 as Resolved.
Sat, Feb 8, 11:57 AM · User-Ryasmeen, Operations, Traffic, Beta-Cluster-Infrastructure, RESTBase
Krenair claimed T244586: Restbase routing down on beta, 2020-02-07.

looks like profile::trafficserver::backend::mapping_rules in hieradata/labs.yaml only has support for mediawiki and upload - it's missing the restbase section that appears in hieradata/common/profile/trafficserver/backend.yaml

Sat, Feb 8, 11:45 AM · User-Ryasmeen, Operations, Traffic, Beta-Cluster-Infrastructure, RESTBase

Fri, Feb 7

Krenair added a comment to T244586: Restbase routing down on beta, 2020-02-07.

Did fix puppet on cache-text05 yesterday, it did a lot of stuff to replace
some nginx/varnish stuff with ATS. May be related?

Fri, Feb 7, 11:20 PM · User-Ryasmeen, Operations, Traffic, Beta-Cluster-Infrastructure, RESTBase
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Got puppet running on -cache-text05, whole beta cluster broke, fixed acme-chief and ATS, going to sleep.

Fri, Feb 7, 1:28 AM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

This is similar to production which still has issuer=CN = Puppet CA: palladium.eqiad.wmnet.

That is what I was hoping to avoid, but thanks for getting it working.

Fri, Feb 7, 12:22 AM · Operations, Beta-Cluster-Infrastructure

Thu, Feb 6

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Last night I tried moving cache-text05 to use the new puppet master. It
doesn't seem to work just yet as for some reason it's (the puppetmaster?)
attempting to connect to puppet:8140 instead of our configured puppetdb
host. I think in labs that name always points to the central puppetmaster
which we don't use in deployment-prep.

Thu, Feb 6, 9:59 AM · Operations, Beta-Cluster-Infrastructure

Wed, Feb 5

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Have put puppetmaster03 back on the old version and created puppetmaster04

Wed, Feb 5, 11:31 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

ugh, ok

Wed, Feb 5, 11:08 PM · Operations, Beta-Cluster-Infrastructure
Krenair awarded T244307: Request creation of wikisource VPS project a Like token.
Wed, Feb 5, 1:14 AM · Community-Tech (Kanban-Q3-2019-20), Cloud-VPS (Project-requests)

Tue, Feb 4

Krenair added a comment to T244307: Request creation of wikisource VPS project.

can you point to the tasks around reliability issues with this on toolforge? is it due to anything inherent with the toolforge infrastructure?

Tue, Feb 4, 11:51 PM · Community-Tech (Kanban-Q3-2019-20), Cloud-VPS (Project-requests)
Krenair closed T244074: Host key verification failed x 2 as Resolved.

<Reedy> Krenair: Seemingly that unbroke the CI deployment jobs, yeah

Tue, Feb 4, 11:36 PM · Beta-Cluster-Infrastructure
Krenair merged T244306: Puppet broken on Beta Cluster app server into T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).
Tue, Feb 4, 11:31 PM · Operations, Beta-Cluster-Infrastructure
Krenair merged task T244306: Puppet broken on Beta Cluster app server into T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).
Tue, Feb 4, 11:31 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T240960: Beta Cluster points to maps-beta.wmflabs.org instead of maps.wikimedia.org, which CSP blocks.

not having beta ones eases detecting bugs.

Tue, Feb 4, 9:22 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Maps, ContentSecurityPolicy
Krenair added a comment to T240960: Beta Cluster points to maps-beta.wmflabs.org instead of maps.wikimedia.org, which CSP blocks.

I'm not sure I follow. Why would it need prod domains for that? Wouldn't it
be a case of whether the beta maps server is whitelisted?

Tue, Feb 4, 9:22 AM · Patch-For-Review, Beta-Cluster-Infrastructure, Maps, ContentSecurityPolicy

Mon, Feb 3

Krenair added a comment to T240960: Beta Cluster points to maps-beta.wmflabs.org instead of maps.wikimedia.org, which CSP blocks.

Yes, so I think my point stands that our CSP should not contain prod's domains - it should be our equivalent domains.

Mon, Feb 3, 10:24 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Maps, ContentSecurityPolicy
Krenair added a comment to T240960: Beta Cluster points to maps-beta.wmflabs.org instead of maps.wikimedia.org, which CSP blocks.

I wonder if the CSP should contain any prod domains at all. Maybe commons/upload...

Primary reason for CSP containing prod domains is for people to be able to load user-scripts cross-wiki, and for those user-scripts to be able to hit the api cross wiki. In principle extensions that need to access resources should be responsible for adjusting the CSP themselves, and not rely on global config for it.

Mon, Feb 3, 10:13 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Maps, ContentSecurityPolicy
Krenair added a comment to T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP.

Yep

Mon, Feb 3, 7:23 PM · Browser-Support-Firefox, Beta-Cluster-Infrastructure

Sun, Feb 2

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Is it possibly our lack of updated puppetdb-termini package?

root@deployment-puppetmaster03:~# apt-cache policy puppetdb-termini
puppetdb-termini:
  Installed: 4.4.0-1~wmf2
  Candidate: 4.4.0-1~wmf2
  Version table:
 *** 4.4.0-1~wmf2 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/component/puppetdb4 amd64 Packages
        100 /var/lib/dpkg/status
     4.4.0-1~wmf1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
Sun, Feb 2, 7:24 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Just done the puppetmaster upgrade and found puppet no longer runs on the puppetmaster: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: undefined method `key_attributes' for nil:NilClass

Sun, Feb 2, 6:38 PM · Operations, Beta-Cluster-Infrastructure
Krenair claimed T244074: Host key verification failed x 2.

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/4cecb2fbef2ea8a54e7e4134a655d7b37d900228%5E%21/ ->

krenair@deployment-dumps-puppetmaster02:~$ sudo  puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-dumps-puppetmaster02.deployment-prep.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '(c082145eac) root - Beta: maintenance: no openldap management'
Notice: The LDAP client stack for this host is: classic/sudoldap
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: classic/sudoldap'
Notice: /Stage[main]/Puppetmaster::Scripts/File[/usr/local/bin/puppet-facts-export]/target: target changed '/usr/local/bin/puppet-facts-export-nodb' to '/usr/local/bin/puppet-facts-export-puppetdb'
Notice: /Stage[main]/Role::Puppetmaster::Standalone/Apt::Repository[wikimedia-puppetdb4]/File[/etc/apt/sources.list.d/wikimedia-puppetdb4.list]/ensure: defined content as '{md5}c45d8816666f422cb3719030ffbdb116'
Info: /Stage[main]/Role::Puppetmaster::Standalone/Apt::Repository[wikimedia-puppetdb4]/File[/etc/apt/sources.list.d/wikimedia-puppetdb4.list]: Scheduling refresh of Exec[apt-get update]
Notice: /Stage[main]/Apt/Exec[apt-get update]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Puppetmaster::Puppetdb::Client/File[/etc/puppet/puppetdb.conf]/content: 
--- /etc/puppet/puppetdb.conf	2018-06-03 04:04:43.750096586 +0000
+++ /tmp/puppet-file20200202-25701-8lmmts	2020-02-02 12:42:25.697845884 +0000
@@ -1,3 +1,3 @@
 [main]
-server_urls = https://deployment-puppetdb02.deployment-prep.eqiad.wmflabs:443
+server_urls = https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs:443
 soft_write_failure = false
Sun, Feb 2, 12:46 PM · Beta-Cluster-Infrastructure
Krenair added a comment to T244074: Host key verification failed x 2.

This may be due to stuff going on in T243226. -mediawiki-07 should be better, I think snapshot01 needs more work to handle its special puppetmaster

Sun, Feb 2, 12:39 PM · Beta-Cluster-Infrastructure
Krenair added a comment to T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP.

worked around by deploying the new cert from acme-chief manually on -cache-text05 for now. upload may still be broken

Sun, Feb 2, 12:38 PM · Browser-Support-Firefox, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

@jbond: Okay so I think our puppetdb stuff is migrated now. I just removed the old server from the list, removed command broadcasting, and puppet still seems to be working. Next step is to upgrade our puppetmaster?

Sun, Feb 2, 12:28 PM · Operations, Beta-Cluster-Infrastructure

Sat, Feb 1

Krenair claimed T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP.

I'll handle this in the next few days if no one else does. A short term fix
might just be copying OCSP stapling files from an acme-chief instance to
the cache instances manually instead of waiting for puppet to be repaired

Sat, Feb 1, 5:00 PM · Browser-Support-Firefox, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

No problem, just trying to avoid overwriting others' work. Am also aware
(including having done it myself a few times) sometimes people just forget
to re-enable it when done. When I do it I'll post a copy of the diff to
avoid losing anything

Sat, Feb 1, 1:29 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Swapped the list in hieradata for deployment-puppetmaster03 around. Once I'm happy it's working I'll remove the old puppetdb from the list and disable command_broadcast.

Sat, Feb 1, 12:44 PM · Operations, Beta-Cluster-Infrastructure

Fri, Jan 31

Krenair added a comment to T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP.
alex@alex-laptop:~$ ssh deployment-cache-text05
Linux deployment-cache-text05 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
Debian GNU/Linux 9.5 (stretch)
deployment-cache-text05 is a text Varnish/ATS cache server (cache::text)
The last Puppet run was at Mon Jan  6 14:20:09 UTC 2020 (36202 minutes ago).

T243226?

Fri, Jan 31, 5:44 PM · Browser-Support-Firefox, Beta-Cluster-Infrastructure

Wed, Jan 29

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

@jijiki: Hi, deployment-mediawiki-07.deployment-prep.eqiad.wmflabs has puppet disabled since approx Mon Jan 20 10:56:33 UTC 2020 (12417 minutes ago) with the comment effie - assuming that's you, do you still need that? Would like to enable puppet so we can complete some puppet infrastructure changes in the project.

Wed, Jan 29, 1:55 AM · Operations, Beta-Cluster-Infrastructure

Mon, Jan 27

Krenair added a comment to T243700: +2 for Zoranzoki21 in mediawiki/*.

I'm unclear about what substantive differences there have been since T231758. Do Umherirrender or Jdforrester-WMF support this request?

Mon, Jan 27, 10:13 PM · MediaWiki-Gerrit-Group-Requests

Fri, Jan 24

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

I'm not sure what did it but logstash2 has shown up. I made a little space on logstash03 and that's on the list now too. So that just leaves us with mediawiki-07 - need to find out why puppet is disabled and see if we can get it running again, if only temporarily.

Fri, Jan 24, 10:55 PM · Operations, Beta-Cluster-Infrastructure

Jan 24 2020

Krenair added a comment to T243556: Fix internal TLD in use in codfw1dev.
  • evaluate zones with noauth and consider moving them to a proper project. Some options:
    • use admin project to hold all base domains. May not be read by novaobserver, so this option may be wrong.
    • use wmflabsdotorg, which makes me nervous because it holds more than just the wmflabs.org domain which eventually will go away anyway.
    • create a new infra project to hold base domain names. Naming things is hard, but perhaps something like the dns-infra project can hold all the new base domains <deployment>.wikimedia.cloud and .wmcloud.org.
Jan 24 2020, 7:35 PM · Cloud-VPS, cloud-services-team (Kanban)

Jan 23 2020

Krenair created T243556: Fix internal TLD in use in codfw1dev.
Jan 23 2020, 10:06 PM · Cloud-VPS, cloud-services-team (Kanban)
Krenair added a watcher for Python3-Porting: Krenair.
Jan 23 2020, 12:34 AM

Jan 22 2020

Krenair added a comment to T242607: Create in-cloud puppetmaster for codfw1dev.

I could just split modules/role/manifests/wmcs/instance.pp into two and conditionally include the right one from manifests/site.pp based on hostname, but it feels hacky. Better ideas welcome.

Jan 22 2020, 11:09 PM · Epic, cloud-services-team (Kanban)
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

logstash03 seems to be making good use of the 20G root volume to store... logs :(

root@deployment-logstash03:~# du -hsx /var/log/* | grep -E '^([0-9]*M|[0-9\.]*G)'
5.8G	/var/log/daemon.log
230M	/var/log/daemon.log.1
211M	/var/log/elasticsearch
1.9G	/var/log/kafka
2.2G	/var/log/logstash
3.4G	/var/log/syslog
1.7G	/var/log/syslog.1
37M	/var/log/syslog.4.gz
18M	/var/log/syslog.5.gz
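
For a sense of scale, the listed entries can be totalled with a short sketch like the one below. The sample text is copied from the du output above; treating K, M and G as powers of 1024 is an assumption about how du -h rounded the values, so the result is only approximate.

```python
import re

# Lines copied from the `du -hsx` output above: size, tab, path.
DU_OUTPUT = """\
5.8G\t/var/log/daemon.log
230M\t/var/log/daemon.log.1
211M\t/var/log/elasticsearch
1.9G\t/var/log/kafka
2.2G\t/var/log/logstash
3.4G\t/var/log/syslog
1.7G\t/var/log/syslog.1
37M\t/var/log/syslog.4.gz
18M\t/var/log/syslog.5.gz
"""

# Assumed unit multipliers (binary units).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> float:
    """Convert a du -h style size such as '5.8G' or '230M' to bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMG])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def total_gib(du_output: str) -> float:
    """Sum the size column of du output and return the total in GiB."""
    total = 0.0
    for line in du_output.splitlines():
        size, _path = line.split("\t", 1)
        total += parse_size(size)
    return total / 1024**3


print(f"{total_gib(DU_OUTPUT):.1f}G of logs listed")
```

The listed files alone come to about 15.5G, which goes a long way towards filling a 20G root volume.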
Jan 22 2020, 10:23 PM · Operations, Beta-Cluster-Infrastructure
Krenair updated subscribers of T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

deployment-eventgate-1.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-logstash03.deployment-prep.eqiad.wmflabs - exists but out of disk space so no puppet runs
deployment-logstash2.deployment-prep.eqiad.wmflabs - exists but out of disk space so no puppet runs (not planning to fix this instance anymore, I made 03 to replace it)
deployment-maps04.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-mediawiki-07.deployment-prep.eqiad.wmflabs - Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'effie'); The last Puppet run was at Mon Jan 20 10:56:33 UTC 2020 (3515 minutes ago). Puppet is disabled. effie @jijiki
deployment-mediawiki-jhuneidi.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-pdfrender02.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-puppetmaster03.deployment-prep.eqiad.wmflabs - Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'test multiple puppetdb backends - jbond');
deployment-schema-1.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-sessionstore01.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-sessionstore02.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-snapshot01.deployment-prep.eqiad.wmflabs - Resource type not found: Wmflib::Service at /etc/puppet/modules/wmflib/functions/service/fetch.pp:4:51

Jan 22 2020, 9:35 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Hi Alex,
I have pushed the change to allow multiple PuppetDBs with command_broadcast and updated the node metadata in Horizon (I removed some redundant keys as well). You can check how populated the new DB is with the following command:

  • check how many nodes have checked in; once the output of this matches on both nodes it should be safe to make the switch
curl -X POST http://localhost:8080/pdb/query/v4/nodes | jq '.[].certname'
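
The "output matches on both nodes" check could be scripted roughly as below. This is only a sketch, assuming the raw JSON body of the nodes query above has been saved from each PuppetDB host. The function names and the sample certnames are hypothetical and are not taken from the ticket.

```python
import json


def certnames(nodes_json):
    """Return the set of certnames found in a /pdb/query/v4/nodes response body."""
    return {node["certname"] for node in json.loads(nodes_json)}


def compare_puppetdbs(old_json, new_json):
    """Return (missing_from_new, only_in_new) as sorted lists of certnames."""
    old, new = certnames(old_json), certnames(new_json)
    return sorted(old - new), sorted(new - old)


# Hypothetical sample data standing in for the output saved from each host.
old_output = '[{"certname": "a.example"}, {"certname": "b.example"}]'
new_output = '[{"certname": "a.example"}]'

missing, extra = compare_puppetdbs(old_output, new_output)
print("missing from new puppetdb:", missing)
print("only in new puppetdb:", extra)
```

When both lists come back empty, the two databases know about the same set of nodes, which is the condition described above for making the switch.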
Jan 22 2020, 9:26 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

I noticed that the Postgres database was missing the puppetdb user; however, a simple puppet run on the puppet master fixed the problem. I did restart the Postgres server first, as I had seen issues with it only binding to localhost in the past, so this could have helped.

Jan 22 2020, 9:20 PM · Operations, Beta-Cluster-Infrastructure

Jan 21 2020

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

rCLIPec5f9c1645d2eadc8db259755bf163c69e0409d6
Reloaded apache2 on deployment-puppetmaster03

Jan 21 2020, 11:52 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Well, my new puppetdb instance seems to not be working very well yet:

root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: [503 Service Unavailable] PuppetDB is currently down. Try again later.
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)#
Jan 21 2020, 11:48 PM · Operations, Beta-Cluster-Infrastructure
Krenair claimed T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).
Jan 21 2020, 11:02 PM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

However, the bigger problem is the puppetdb server. This will need to be rebuilt on buster, as it is not a simple task to backport the puppetdb packages. The procedure I would advise is:

  • build new puppetdb server on buster
  • update hiera config to include the following which will ensure that the new puppetdb server gets populated with the required metadata
profile::puppetmaster::common::puppetdb_hosts:
 - $fqdn_current_puppetdb
 - $fqdn_new_puppetdb
profile::puppetmaster::common::command_broadcast: true
  • puppet agent now needs to run on all clients so the new db can get a collection of exported resources; I would run it at least twice just in case there are some complex dependencies
    • either run manually or wait an hour or two
  • swap the order of profile::puppetmaster::common::puppetdb_hosts: so the new puppetdb is first
  • validate everything is working
  • remove the old puppetdb from profile::puppetmaster::common::puppetdb_hosts: and remove profile::puppetmaster::common::command_broadcast: true
Jan 21 2020, 11:02 PM · Operations, Beta-Cluster-Infrastructure
Krenair created T243355: puppet panel: Can't add new prefixes.
Jan 21 2020, 10:54 PM · Horizon
Krenair added a comment to T242607: Create in-cloud puppetmaster for codfw1dev.

I suppose some of the next questions (@Andrew?) may include:

  • How do we make modules/role/manifests/wmcs/instance.pp include the right classes depending on region?
  • Do we even need observerenv and co. to be split based on region?
  • Should manifests/site.pp include a different version of role::wmcs::instance depending on region?
  • Instances don't seem to know what region they're in (even curl http://169.254.169.254/latest/meta-data/placement/availability-zone just says nova), does that mean we need to resort to saying .eqiad.wmflabs -> eqiad1, codfw1dev.cloud -> codfw1dev i.e. based on hostname?
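
The hostname-based fallback in the last point might look something like this. It is a sketch only: the suffix table is an assumption built from the two examples given above, and `region_for` is a hypothetical helper, not something that exists in the puppet repository.

```python
# Hypothetical suffix-to-region table, based only on the two examples above.
SUFFIX_TO_REGION = {
    ".eqiad.wmflabs": "eqiad1",
    ".codfw1dev.cloud": "codfw1dev",
}


def region_for(fqdn):
    """Guess the region from an instance FQDN, or return None if unknown."""
    for suffix, region in SUFFIX_TO_REGION.items():
        if fqdn.endswith(suffix):
            return region
    return None


print(region_for("puppetmaster-codfw1dev-01.cloudinfra-codfw1dev.codfw1dev.cloud"))
print(region_for("deployment-logstash03.deployment-prep.eqiad.wmflabs"))
```

Returning None for unrecognised suffixes makes the caller decide what to do with unknown hosts, rather than silently defaulting to one region.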
Jan 21 2020, 10:38 PM · Epic, cloud-services-team (Kanban)
Krenair removed a project from T242607: Create in-cloud puppetmaster for codfw1dev: Patch-For-Review.

Here's a fun one:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Class[Profile::Openstack::Base::Observerenv] is already declared in file /etc/puppet/modules/profile/manifests/openstack/eqiad1/observerenv.pp:7; cannot redeclare at /etc/puppet/modules/profile/manifests/openstack/codfw1dev/observerenv.pp:7 at /etc/puppet/modules/profile/manifests/openstack/codfw1dev/observerenv.pp:7:5 on node puppetmaster-codfw1dev-01.cloudinfra-codfw1dev.codfw1dev.cloud
What's that eqiad1 observerenv class doing there? My suspicion is this:
modules/role/manifests/wmcs/instance.pp: include ::profile::openstack::eqiad1::observerenv

Jan 21 2020, 10:30 PM · Epic, cloud-services-team (Kanban)
Krenair created P10239 refill-api fixWikipage POST exception.
Jan 21 2020, 8:44 PM · Tools
Krenair closed T242697: Fix LDAP config on codfw1dev instances, a subtask of T229441: CloudVPS: codfw1dev: missing bits, as Resolved.
Jan 21 2020, 8:23 PM · Epic, cloud-services-team (Kanban)
Krenair closed T242697: Fix LDAP config on codfw1dev instances as Resolved.

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/566351/
https://gerrit.wikimedia.org/r/566353 (not sure this is actually in use but anyway)
and the LDAP password change done by Andrew. thanks

Jan 21 2020, 8:23 PM · cloud-services-team (Kanban)

Jan 20 2020

Krenair added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

I'm guessing this is just us needing to get a new puppetmaster with buster instead of stretch

Jan 20 2020, 8:01 PM · Operations, Beta-Cluster-Infrastructure
Krenair closed T243161: Unblock port 22 on commons-corruption-checker-main.commons-corruption-checker as Resolved.

Appears to be open now.

Jan 20 2020, 12:24 AM · Cloud-VPS

Jan 17 2020

Krenair added a comment to T243048: python3.4 broken on deployment-logstash2.

(IIRC f-strings are python 3.6 so I'm not sure how this has shown up in a python3.4 directory, am guessing newer instances would not have this problem as they'd have newer python)
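
For illustration, f-strings were added in Python 3.6 (PEP 498), so an older interpreter rejects them as a syntax error at compile time. The sketch below checks whether the current interpreter can parse an f-string and shows an equivalent that also works on 3.4. The snippet and the names in it are hypothetical, not the code that actually broke on the instance.

```python
import sys

# Hypothetical source text containing an f-string.
snippet = 'name = "logstash"\nprint(f"hello {name}")\n'


def parses_here(source):
    """Return True if this interpreter can compile the given source text."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


print(sys.version_info[:2], "parses f-strings:", parses_here(snippet))

# Equivalent formatting that also works on Python 3.4.
name = "logstash"
message = "hello {}".format(name)
print(message)
```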

Jan 17 2020, 1:35 AM · Operations, Beta-Cluster-Infrastructure
Krenair added a comment to T243048: python3.4 broken on deployment-logstash2.

we should probably stop trying to fix problems with this instance and aim to shut it down, have people fix logstash03 instead?

Jan 17 2020, 1:35 AM · Operations, Beta-Cluster-Infrastructure