Krenair (Alex Monk)
Wikimedia volunteer

User Details

User Since
Oct 3 2014, 2:34 PM (335 w, 20 h)
Availability
Available
IRC Nick
Krenair
LDAP User
Alex Monk
MediaWiki User
Krenair [ Global Accounts ]

I am a Wikimedia volunteer helping in various technical ways. These days it's usually Beta Cluster, Cloud VPS, or Operations related labs puppet migrations. Since 2012 I've spent significant amounts of time involved in MediaWiki development, software deployments to the Wikimedia cluster, OTRS (email response to e.g. info-en@wikimedia.org addresses), and various other things.

Some of my old VisualEditor and other work (2014-2016) can be found under @AlexMonk-WMF instead.

I have opinions on things, which do not necessarily represent those of any organisation I am, have previously been, or will in the future be affiliated with.

Recent Activity

Tue, Feb 23

Krenair added a comment to T275453: addWiki.php warns Deprecated: Premature access to HookContainer, ObjectFactory and ServiceContainer.

This brings back memories. Think it's long past time we get some tests
added so this script stops getting broken multiple times a year.

Tue, Feb 23, 8:03 PM · Platform Team Workboards (Clinic Duty Team), Wiki-Setup, MediaWiki-extensions-WikimediaMaintenance

Mon, Feb 22

Krenair added a comment to T275294: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided.

My general suggestion is not to only look at FOSS, but to look at what would fit our needs best. That could be something that’s open source, but it might be something else, and I don’t think we should limit ourselves.

Just to be clear, this is a non-starter. We don't deploy non-free software. This is covered in https://foundation.wikimedia.org/wiki/Resolution:Wikimedia_Foundation_Guiding_Principles#Freedom_and_open_source and not really worth rehashing IMO.

But that doesn't preclude us from paying for/sponsoring feature development in free software so it fits our needs. Given how valuable OTRS is to the movement, it seems like a worthy investment.

There’s some wiggle room in that. If you make the argument that no free software can adequately meet our needs you can use non-free software.

Mon, Feb 22, 12:16 AM · SRE, Security, OTRS

Fri, Feb 19

Krenair awarded T274953: Access group for Gitlab contractors a Dislike token.
Fri, Feb 19, 8:20 PM · GitLab, User-brennen, SRE, SRE-Access-Requests

Sat, Feb 6

Krenair added a comment to T273956: acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP.

I think I've seen acme-chief not responding to SIGHUP as expected before in deployment-prep, I worry this could happen in prod too.

Sat, Feb 6, 10:03 AM · Acme-chief, cloud-services-team (Kanban)

Fri, Feb 5

hashar awarded T138672: Duplicate LDAP user for cn=smccandlish a Love token.
Fri, Feb 5, 5:54 PM · User-bd808, SRE, cloud-services-team (Kanban), LDAP-Access-Requests, Gerrit, LDAP, Phabricator

Jan 30 2021

Krenair added a comment to T258660: WebAuthn: signed in {some bogus number} times with this key.

Also ran into this: after my first login I saw over 100 logins listed on this page. This either needs removal or clarification.

Jan 30 2021, 8:45 PM · MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), MediaWiki-extensions-OATHAuth

Jan 17 2021

Krenair added a comment to T271778: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12.

Unlikely

I was asking because T267006#6624466 (deployment-cache-upload06 is upload.beta.wmflabs.org I think?) and the problems started I think around T267858. But could be just coincidence.

Jan 17 2021, 7:33 PM · Beta-Cluster-Infrastructure, Acme-chief

Jan 14 2021

Krenair added a comment to T271778: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12.

Unlikely

Jan 14 2021, 1:27 PM · Beta-Cluster-Infrastructure, Acme-chief
Krenair added a comment to T271808: The certificate for upload.beta.wmflabs.org expired on January 12, 2021..
root@deployment-cache-upload06:/etc/acmecerts/unified/live# openssl x509 -dates -noout -in rsa-2048.crt
notBefore=Jan 12 01:23:09 2021 GMT
notAfter=Apr 12 01:23:09 2021 GMT
root@deployment-cache-upload06:/etc/acmecerts/unified/live# touch /srv/trafficserver/tls/etc/ssl_multicert.config
root@deployment-cache-upload06:/etc/acmecerts/unified/live# systemctl reload trafficserver-tls.service

It should be up & running now. I'm not really familiar with the cloud puppetization, but this doesn't mimic production behaviour.
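
A minimal sketch of how the openssl x509 -dates output above could be turned into a "days until expiry" figure. The function names are hypothetical and not part of acme-chief or any Wikimedia tooling; only the sample text comes from the transcript above.

```python
import ssl
from datetime import datetime, timezone


def parse_openssl_dates(output: str) -> dict:
    """Parse `openssl x509 -dates -noout` output into UTC datetimes."""
    dates = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in ("notBefore", "notAfter"):
            # ssl.cert_time_to_seconds accepts e.g. "Jan 12 01:23:09 2021 GMT"
            seconds = ssl.cert_time_to_seconds(value.strip())
            dates[key] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dates


def days_until_expiry(output: str, now: datetime) -> float:
    """Return the number of days between `now` and the notAfter date."""
    not_after = parse_openssl_dates(output)["notAfter"]
    return (not_after - now).total_seconds() / 86400


# Sample copied from the transcript above.
sample = (
    "notBefore=Jan 12 01:23:09 2021 GMT\n"
    "notAfter=Apr 12 01:23:09 2021 GMT\n"
)
```

Calling days_until_expiry(sample, datetime.now(timezone.utc)) from a monitoring job would give a value that goes negative once the certificate has expired.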

Jan 14 2021, 12:31 AM · SRE, Traffic, HTTPS, Beta-Cluster-reproducible

Jan 12 2021

Dzahn awarded T252199: Stop using letsencrypt::cert::integrated a Like token.
Jan 12 2021, 9:32 PM · cloud-services-team (Kanban), Mail
Krenair added a comment to T271778: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12.

re acme-chief part: It looks like the same thing happened to the mx and wikibase certs too. Haven't checked whether those were updated on the machines that serve them.
Also spotted various prod ncredir certs in /etc/acme-chief/config.yaml that can't be doing any good.

Jan 12 2021, 2:37 AM · Beta-Cluster-Infrastructure, Acme-chief
Krenair updated the task description for T271778: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12.
Jan 12 2021, 2:34 AM · Beta-Cluster-Infrastructure, Acme-chief
Krenair created T271778: Issues with acme-chief cert rotation on deployment-prep, 2021-01-12.
Jan 12 2021, 2:34 AM · Beta-Cluster-Infrastructure, Acme-chief
Krenair edited projects for T271644: Fatal exception undeleting a file on Commons: rev_page field must not be 0!, added: MediaWiki-Page-deletion; removed MediaWiki-Revision-deletion.
Jan 12 2021, 2:13 AM · Patch-For-Review, Platform Team Workboards (Clinic Duty Team), MediaWiki-Page-deletion, Wikimedia-production-error, Commons
Krenair added a comment to T207372: Add simple script for account creation.

well, ideally it would've been a script applicable to all installs of the
package, not just in wikimedia puppet.git

Jan 12 2021, 1:58 AM · Patch-For-Review, Acme-chief

Dec 5 2020

Krenair awarded T260614: Phase out use of .wmflabs tld a Burninate token.
Dec 5 2020, 11:57 PM · Cloud-VPS, cloud-services-team (Kanban)

Dec 1 2020

Krenair reopened T268978: String vs Binary issues while running the puppet compiler as "Open".

reopening too to ensure this gets looked at

Dec 1 2020, 2:00 AM · SRE, puppet-compiler
Krenair set Security to security-bug on T268978: String vs Binary issues while running the puppet compiler.

Protecting as security issue due to presence of what appears to be a Jenkins API token in the task description, based on https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility)

Dec 1 2020, 2:00 AM · SRE, puppet-compiler

Nov 30 2020

Krenair added a comment to T268948: Add editprotected permission for interface-admin.

@Urbanecm: I can see it being interpreted either way - at the time this task was named for Wikipedias :) But I don't mind

Nov 30 2020, 3:48 AM · MediaWiki-General

Nov 29 2020

Krenair added a project to T268948: Add editprotected permission for interface-admin: Wikimedia-Site-requests.
Nov 29 2020, 6:10 PM · MediaWiki-General
Krenair added a comment to T268926: tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud is misbehaving.

I just went and checked on this again and found I can SSH in, tasks are running on it, SGE has cleared the alarm/unreachable flags, based on the prometheus data it came back at 02:12:25 (after having stopped at 21:36:25), and according to uptime it hasn't been restarted.

Nov 29 2020, 6:04 PM · Tools
Krenair added a comment to T268893: [tools-sgecron-01] The server is getting out of space, daemon.log is growing a lot.

sudo service webservicemonitor restart has shut it up. Broken connection to LDAP/SSSD or something? I notice sssd has only been running since Tue 2020-11-24 18:06:07 UTC; 4 days ago, and zgrep collector-runner /var/log/syslog.3.gz | grep Traceback -C3 | head -n 300 reveals these exceptions took off only 34 seconds later. That file also shows puppet had just applied a config change and restarted sssd. Maybe we're missing a subscribe/notify relationship in puppet to have it restart webservicemonitor as well, or if that's awkward (do we still have some old sssd alternative lurking somewhere that's conditional in puppet?) then maybe we can make it detect this through monitoring the existence of some always-existing LDAP user, and when that fails, crash to have systemd restart it.
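
A rough sketch of the "always-existing LDAP user" idea, assuming a hypothetical canary account name and assuming the systemd unit is configured to restart the service on a non-zero exit (for example Restart=on-failure). This is not the actual webservicemonitor code; it only reuses pwd.getpwnam, which the service's traceback shows it already calls.

```python
import pwd
import sys


def directory_lookup_works(canary_user: str, lookup=pwd.getpwnam) -> bool:
    """Return True if the canary user can be resolved through NSS/SSSD."""
    try:
        lookup(canary_user)
        return True
    except KeyError:
        # pwd.getpwnam raises KeyError when the name cannot be found,
        # which is also what happens when the LDAP backend is unreachable.
        return False


def check_or_exit(canary_user: str, lookup=pwd.getpwnam) -> None:
    """Exit non-zero so the service manager can restart the daemon."""
    if not directory_lookup_works(canary_user, lookup):
        sys.exit(f"cannot resolve canary user {canary_user!r}; exiting for restart")
```

The daemon would call check_or_exit once per collection loop, before treating missing users as invalid tools. The lookup parameter exists only so the check can be exercised without a real directory.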

Nov 29 2020, 5:58 PM · Toolforge, cloud-services-team (Kanban)
Krenair added a comment to T268943: crontab: crontabs/tmp.YdH9kW: No space left on device.

No, we just removed some stuff under the system's /var/log. It looks like /var/log/syslog, for example, had filled up with collector-runner exceptions; it had managed to generate over 9 million lines in 36 hours, like this:

Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: 2020-11-29 17:42:11,517 Exception trying to validate / load tool grantmetrics
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: Traceback (most recent call last):
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 39, in from_name
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:     user_info = pwd.getpwnam(username)
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: KeyError: 'getpwnam(): name not found: tools.grantmetrics'
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: During handling of the above exception, another exception occurred:
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: Traceback (most recent call last):
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 146, in collect
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:     tool = Tool.from_name(toolname)
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 42, in from_name
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:     raise Tool.InvalidToolException("No tool with name %s" % (name,))
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: tools.manifest.webservicemonitor.Tool.InvalidToolException: No tool with name grantmetrics

It's still generating more so this will happen again at some point. It has 3.7G to burn through first though.
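
A back-of-the-envelope estimate of how long that headroom might last. The line count, duration and free space come from the comment above; the average line length is an assumption eyeballed from the sample log lines, and "3.7G" is treated as GiB. Other growing files such as daemon.log are ignored, so the real figure would be shorter.

```python
LINES_LOGGED = 9_000_000          # from the comment: "over 9 million lines"
HOURS_ELAPSED = 36                # from the comment: "in 36 hours"
FREE_BYTES = 3.7 * 1024 ** 3      # "3.7G", assumed to mean GiB
AVG_LINE_BYTES = 120              # assumption, estimated from the sample lines

lines_per_hour = LINES_LOGGED / HOURS_ELAPSED


def hours_until_full(free_bytes: float, rate_lines_per_hour: float,
                     avg_line_bytes: float) -> float:
    """Hours until the free space is consumed at a constant logging rate."""
    return free_bytes / (rate_lines_per_hour * avg_line_bytes)


estimate = hours_until_full(FREE_BYTES, lines_per_hour, AVG_LINE_BYTES)
```

Under these assumptions the log grows by 250,000 lines per hour and the remaining space lasts a little over five days.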

Nov 29 2020, 5:45 PM · Toolforge
Krenair updated subscribers of T268943: crontab: crontabs/tmp.YdH9kW: No space left on device.

Me and @Andrew removed some stuff, please try again. Note this is writing files on tools-sgecron-01 rather than whatever bastion you are logged on to, so a simple df won't show anything.

Nov 29 2020, 5:23 PM · Toolforge
Krenair added a comment to T268926: tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud is misbehaving.

Some of the continuous jobs that were stopped (except anomie's) have issued root@ failure emails with errors like can't get password entry for user "tools.ket-bot" (I imagine that given how broken this instance is, LDAP connectivity is one of the issues), and there's another couple more 'problems with defaults entries' emails too. All around 01:12

Nov 29 2020, 1:41 AM · Tools
Krenair created T268926: tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud is misbehaving.
Nov 29 2020, 12:03 AM · Tools

Nov 28 2020

Krenair added a comment to T268904: can't start webservices kubernetes.

Hi @Krenair, sorry for bothering you, I just have a little question. Shortly before, I used --backend=kubernetes and it worked fine; now I try to use --backend=gridengine, and it gives me a message:

Could not find a public_html folder or a .lighttpd.conf file in your tool home.

Is that normal and I just need to set up a lighttpd.conf file, or is it a problem that needs to be fixed? Thanks again.

I don't know much about the Grid Engine, sorry.

@Krenair that's fine, thank you. If you can mention anybody who knows, I'd be grateful; if you can't, that's OK too.

Nov 28 2020, 6:05 AM · Toolforge, Kubernetes
Krenair added a comment to T268904: can't start webservices kubernetes.

Hi @Krenair, sorry for bothering you, I just have a little question. Shortly before, I used --backend=kubernetes and it worked fine; now I try to use --backend=gridengine, and it gives me a message:

Could not find a public_html folder or a .lighttpd.conf file in your tool home.

Is that normal and I just need to set up a lighttpd.conf file, or is it a problem that needs to be fixed? Thanks again.

Nov 28 2020, 5:51 AM · Toolforge, Kubernetes
Krenair closed T268904: can't start webservices kubernetes as Resolved.
Nov 28 2020, 5:14 AM · Toolforge, Kubernetes
Krenair added a comment to T248041: puppetdb on deployment-puppetdb03 keeps getting OOMKilled.
alex@alex-laptop:~$ ssh deployment-puppetdb03
Linux deployment-puppetdb03 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Debian GNU/Linux 10 (buster)
deployment-puppetdb03 is a PuppetDB server (puppetmaster::puppetdb (postgres master))
The last Puppet run was at Thu Nov 26 20:45:39 UTC 2020 (1890 minutes ago). 
Last puppet commit: 
Last login: Sun Jul 26 10:45:20 2020 from 172.16.1.136
krenair@deployment-puppetdb03:~$ sudo service puppetdb  status
● puppetdb.service - Puppet data warehouse server
   Loaded: loaded (/lib/systemd/system/puppetdb.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Thu 2020-11-26 21:15:48 UTC; 1 day 7h ago
     Docs: man:puppetdb(8)
           file:/usr/share/doc/puppetdb/index.markdown
 Main PID: 519 (code=killed, signal=KILL)
Nov 28 2020, 4:39 AM · Patch-For-Review, Developer Productivity, Puppet, Beta-Cluster-Infrastructure
Krenair claimed T268904: can't start webservices kubernetes.

@Mohnd_Kh: Try now

Nov 28 2020, 4:37 AM · Toolforge, Kubernetes

Nov 25 2020

Krenair changed the status of T190781: Secure deployment-prep sudo access to prevent privilege escalation by dns-manager credentials, a subtask of T182927: Get letsencrypt wildcard cert for *.beta.wmflabs.org domains, from Invalid to Declined.
Nov 25 2020, 1:26 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), Beta-Cluster-Infrastructure
Krenair changed the status of T190781: Secure deployment-prep sudo access to prevent privilege escalation by dns-manager credentials from Invalid to Declined.
Nov 25 2020, 1:26 AM · Beta-Cluster-Infrastructure
Krenair closed T190781: Secure deployment-prep sudo access to prevent privilege escalation by dns-manager credentials as Invalid.

Based on that task having been done, I think we can safely say the rest of this is fairly pointless. If you have membership on deployment-prep the most destructive stuff you could do is likely within the instances you're expected to have root on, DNS is likely the more easily recoverable part. If anything this permissions setup more closely relates to what a production root would be able to do.

Nov 25 2020, 1:25 AM · Beta-Cluster-Infrastructure
Krenair closed T190781: Secure deployment-prep sudo access to prevent privilege escalation by dns-manager credentials, a subtask of T182927: Get letsencrypt wildcard cert for *.beta.wmflabs.org domains, as Invalid.
Nov 25 2020, 1:25 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), Beta-Cluster-Infrastructure

Nov 22 2020

Krenair placed T219085: wmf_style check in puppet silently fails when it finds the addition of an error that was also already occurring in the same file up for grabs.
Nov 22 2020, 4:41 AM · Patch-For-Review, Puppet, SRE-tools, SRE
Krenair closed T257968: Certificate for *.beta.wmflabs.org has expired as Resolved.

@Vgutierrez: Can we make it dynamically reload its code somehow? We should probably have another task for this if so, I'm resolving this one

Nov 22 2020, 4:38 AM · Beta-Cluster-Infrastructure
Krenair placed T220268: Consider ways to make puppetmaster CA changes smoother on the puppet client end up for grabs.
Nov 22 2020, 4:37 AM · Puppet, cloud-services-team (Kanban), Cloud-Services
Krenair awarded T268393: UDP traffic throughput to instances in the "meet" Cloud VPS project not meeting expectations a Evil Spooky Haunted Tree token.
Nov 22 2020, 12:57 AM · cloud-services-team (Kanban), Cloud-VPS, Wikimedia Meet

Nov 16 2020

Krenair closed T267858: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. as Resolved.
Nov 16 2020, 7:55 PM · SRE, Traffic, HTTPS, Beta-Cluster-Infrastructure
Krenair added a comment to T267935: wikitech: INSERT command denied to user 'wikiuser'@'10.64.32.36' for table 'comment' (10.64.0.98).

It's coming from the job runner, given the values shown this is likely a
post distributed via MassMessage or something that expects to be able to
write cross-wiki, and a wikitech page was one of the expected destinations?
This probably should've been permitted

Nov 16 2020, 4:30 PM · wikitech.wikimedia.org, cloud-services-team (Kanban)
Krenair added a comment to T196248: TLS certificates renewal process.

I don't think we use certbot anywhere except maybe Gerrit.

Nov 16 2020, 12:50 AM · Documentation, Performance-Team (Radar), HTTPS, Traffic, SRE

Nov 14 2020

Krenair claimed T267858: The certificate for upload.beta.wmflabs.org expired on November 13, 2020..

@Vgutierrez FYI in case this could happen in prod too, I haven't been keeping track of changes lately. If we think it won't happen again or won't happen in prod (e.g. maybe it didn't restart because puppet is erroring somewhere in varnish code on this box?) then I guess we can close this

Nov 14 2020, 3:07 PM · SRE, Traffic, HTTPS, Beta-Cluster-Infrastructure
Krenair added a comment to T267858: The certificate for upload.beta.wmflabs.org expired on November 13, 2020..

For some reason I had to do a full restart of the trafficserver-tls service on the cache-upload06 VM but it has loaded the latest cert now:

root@deployment-cache-upload06:~# openssl s_client -connect upload.beta.wmflabs.org:443 2>/dev/null | openssl x509 -noout -text | grep After
            Not After : Jan 12 06:00:26 2021 GMT
Nov 14 2020, 3:01 PM · SRE, Traffic, HTTPS, Beta-Cluster-Infrastructure
Restricted Application added a project to T267858: The certificate for upload.beta.wmflabs.org expired on November 13, 2020.: SRE.

Cert was renewed:

root@deployment-acme-chief03:~# openssl x509 -in /var/lib/acme-chief/certs/unified/live/rsa-2048.crt -noout -text | grep After
            Not After : Jan 12 05:01:51 2021 GMT
Nov 14 2020, 2:56 PM · SRE, Traffic, HTTPS, Beta-Cluster-Infrastructure

Nov 13 2020

Krenair removed a watcher for Wikimedia-production-error: Krenair.
Nov 13 2020, 9:14 PM

Sep 19 2020

Krenair added a comment to T263328: Agents can view watched tickets outside of assigned queues.

This worked in the past on OTRS 5 with e.g. oversight queues. I assumed it
was deliberate - the most sensitive part of a ticket is almost always going
to be the first article and in this case the agent has already seen it.

Sep 19 2020, 5:19 PM · OTRS

Sep 14 2020

Krenair added a comment to T262816: The certificate for en.wikipedia.beta.wmflabs.org expired on 2020-09-14.

It's possible - if acme chief has got a new cert issued but the cache-text
box hasn't run puppet since, you'll see this. Check whether acme-chief has
a new one and if it does, fix puppet on cache-text. If not investigate why.
Am having lunch and then working again but I can look this evening if no
one has fixed it by then.
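
The triage described above can be sketched as a comparison of the two "Not After" values, one read on the acme-chief host and one from the certificate the cache host is using. The function name and the example timestamps are illustrative assumptions, not output from this incident.

```python
import ssl


def diagnose(acme_chief_not_after: str, cache_host_not_after: str) -> str:
    """Compare expiry dates in openssl's 'Mon DD HH:MM:SS YYYY GMT' format."""
    issued = ssl.cert_time_to_seconds(acme_chief_not_after)
    deployed = ssl.cert_time_to_seconds(cache_host_not_after)
    if issued > deployed:
        # acme-chief renewed, but the cache host never picked the cert up.
        return "acme-chief has a newer cert: fix puppet on the cache host"
    # Nothing newer was issued, so the problem is on the acme-chief side.
    return "acme-chief has no newer cert: investigate why it was not renewed"


# Illustrative values only.
result = diagnose("Nov 13 05:00:00 2020 GMT", "Sep 14 05:00:00 2020 GMT")
```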

Sep 14 2020, 12:12 PM · Beta-Cluster-Infrastructure

Sep 2 2020

Krenair awarded T261900: Request for floating IP / DNS for gitlab-test.wmcloud.org a Like token.
Sep 2 2020, 9:10 PM · User-brennen, Release-Engineering-Team, GitLab-Test, Cloud-VPS (Quota-requests)
Krenair added a comment to T261900: Request for floating IP / DNS for gitlab-test.wmcloud.org.

Yeah that would work and is the mechanism that allows people direct access to e.g. the bastions and the tools login machines. The other potential option is just to require people using the test setup to use some custom SSH config to proxy through the bastions to get there.

Sep 2 2020, 9:10 PM · User-brennen, Release-Engineering-Team, GitLab-Test, Cloud-VPS (Quota-requests)

Aug 31 2020

Krenair awarded T261656: Grant merge rights (+2) on MediaWiki Core to Martin Urbanec a Like token.
Aug 31 2020, 4:52 PM · MediaWiki-Gerrit-Group-Requests

Aug 30 2020

Krenair added a comment to T261551: https://meet.wmflabs.org creates a redirect loop.

Maybe you can look for an X-Forwarded-Proto: https header, which I think the proxy should be setting? If it's set, treat the request as you would on port 443; if it's not set, then issue a redirect.
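
A minimal, framework-free sketch of that check. It assumes the proxy really does set X-Forwarded-Proto, which the comment itself is unsure about; the function name is hypothetical and this is not the configuration used by the meet project.

```python
from typing import Mapping, Optional


def https_redirect_target(headers: Mapping[str, str], host: str,
                          path: str) -> Optional[str]:
    """Return the URL to redirect to, or None if no redirect is needed."""
    # Header names are case-insensitive, so compare them lowercased.
    proto = next(
        (value for name, value in headers.items()
         if name.lower() == "x-forwarded-proto"),
        "",
    )
    if proto.strip().lower() == "https":
        # The client already reached the proxy over TLS: serve normally.
        return None
    # Header missing or not https: send the client to the HTTPS URL.
    return f"https://{host}{path}"
```

Redirecting only when the header is absent or not https is what avoids the loop: the backend always sees plain HTTP from the proxy, so checking its own port would redirect every request.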

Aug 30 2020, 1:07 AM · User-Ladsgroup, Wikimedia Meet

Aug 28 2020

Krenair added a comment to T251414: Support TLSv1.3 in IABot.

This is not something I believe I have control over.

Aug 28 2020, 11:42 PM · Traffic, InternetArchiveBot, SRE

Aug 24 2020

Krenair added a comment to T261133: Ban IP edits on pt.wiki.

Note to those I see in the ptwiki comments proposing AbuseFilters: Abuse Filter has emergency checks that will disable a filter matching 5% or more of edits.

Aug 24 2020, 11:38 PM · Growth-Team, Anti-Harassment, Wikimedia-Site-requests
Krenair added a comment to T261133: Ban IP edits on pt.wiki.

Yeah this should probably be added to https://meta.wikimedia.org/wiki/Limits_to_configuration_changes

Aug 24 2020, 11:26 PM · Growth-Team, Anti-Harassment, Wikimedia-Site-requests

Aug 19 2020

Krenair created T260835: Stop using letsencrypt::cert::integrated on toolserver-legacy.
Aug 19 2020, 6:06 PM · User-bd808, cloud-services-team (Kanban)
Krenair updated the task description for T252199: Stop using letsencrypt::cert::integrated.
Aug 19 2020, 6:06 PM · cloud-services-team (Kanban), Mail
Krenair created T260834: Stop using letsencrypt::cert::integrated on mx-out*.cloudinfra.
Aug 19 2020, 6:05 PM · Patch-For-Review, cloud-services-team (Kanban), Mail

Aug 18 2020

Krenair added a comment to T260732: ORES icinga alerts.

modules/icinga/manifests/monitor/ores_labs_web_node.pp has check_command => "check_ores_workers!oresweb/node/${title}", which would be e.g. check_ores_workers!oresweb/node/ores-web-04. Also host ores.wmflabs.org.
modules/nagios_common/files/check_commands/check_ores_workers.cfg says this is $USER4$/check_ores_workers $HOSTADDRESS$ '$ARG1$'
So it becomes /usr/local/lib/nagios/plugins/check_ores_workers ores.wmflabs.org 'oresweb/node/ores-web-04'
./modules/nagios_common/files/check_commands/check_ores_workers turns that into /usr/local/lib/nagios/plugins/check_http -f follow -H "ores.wmflabs.org" -I "ores.wmflabs.org" -A "wmf-icinga/something (root@wikimedia.org)" -u "http://oresweb/node/ores-web-04/v3/scores/fakewiki/$(/bin/date +%s)/"

Aug 18 2020, 9:46 PM · Patch-For-Review, ORES, SRE, Machine-Learning-Team

Aug 17 2020

Krenair added a comment to T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.

I don't have OTRS access, sorry. Is this a new reported issue with Jio users?

Aug 17 2020, 6:06 PM · SRE, netops, Traffic

Aug 4 2020

Krenair awarded T88258: Convert WikibaseRepository, WikibaseClient, WikibaseLib and WikibaseView to use extension registration a Barnstar token.
Aug 4 2020, 5:50 PM · Wikidata-Campsite, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), MW-1.34-notes (1.34.0-wmf.23; 2019-09-17), Patch-For-Review, Wikidata-Trailblazing-Exploration, Story, Technical-Debt, wdwb-tech-focus, Wikidata-Turtles-Tech-Debt, Wikidata-Ministry-Of-Magic-Tech-Debt, Wikidata-Sprint-2017-12-20, Wikidata-Sprint-2015-08-11, Wikidata-Sprint-2015-06-30, Wikidata-Sprint-2015-06-16, Wikidata-Sprint-2015-06-02, MediaWiki-extensions-WikibaseRepository, Wikidata, MediaWiki-extensions-WikibaseClient

Aug 3 2020

Krenair claimed T248041: puppetdb on deployment-puppetdb03 keeps getting OOMKilled.

replacing with a medium instance, deployment-puppetdb04

Aug 3 2020, 11:57 PM · Patch-For-Review, Developer Productivity, Puppet, Beta-Cluster-Infrastructure
Krenair added a project to T259540: deployment-perfapt01 seems to be broken: Beta-Cluster-Infrastructure.
Aug 3 2020, 5:57 PM · Beta-Cluster-Infrastructure
Krenair created T259540: deployment-perfapt01 seems to be broken.
Aug 3 2020, 5:57 PM · Beta-Cluster-Infrastructure

Aug 2 2020

Krenair added a comment to T259444: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28.

As I recall with the meet project the project itself in OpenStack was named meet, therefore you automatically got a meet.wmflabs.org designate zone. Could get one for lists created too I guess (similar to the beta zone in deployment-prep). This way you could administer it without going through more tickets in future

Aug 2 2020, 10:30 PM · User-bd808, cloud-services-team (Kanban), VPS-Projects, SRE, User-Ladsgroup, Wikimedia-Mailing-lists
Krenair added a comment to T259444: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28.

This should probably just be a record under mailman.wmcloud.org ?

Aug 2 2020, 9:39 PM · User-bd808, cloud-services-team (Kanban), VPS-Projects, SRE, User-Ladsgroup, Wikimedia-Mailing-lists

Jul 30 2020

Krenair added a comment to T255249: acme-chief: support for generating a concatenated cert/key file.

I think the keys are generated first and the certs appear when acme-chief
has gone through the ACME API to get stuff signed by the CA

Jul 30 2020, 5:47 PM · Patch-For-Review, Acme-chief

Jul 17 2020

Krenair added a comment to T257968: Certificate for *.beta.wmflabs.org has expired.

I'm still getting the cert error on https://upload.beta.wmflabs.org . Other subdomains, e.g. https://en.wikisource.beta.wmflabs.org , are working fine now.

Jul 17 2020, 12:01 AM · Beta-Cluster-Infrastructure

Jul 15 2020

Krenair created P11917 fixes for `puppet` hostname serving on a new labs central puppetmaster in codfw1dev.
Jul 15 2020, 6:29 PM · Cloud-VPS

Jul 14 2020

Krenair updated subscribers of T257968: Certificate for *.beta.wmflabs.org has expired.

@Vgutierrez: I'm guessing puppet had failed to run the reload exec itself due to the errors connecting to acme-chief (Error 400 on SERVER: part must be in ['ec-prime256v1.crt', 'ec-prime256v1.chain.crt', 'ec-prime256v1.chained.crt', 'ec-prime256v1.key', 'ec-prime256v1.ocsp', 'rsa-2048.crt', 'rsa-2048.chain.crt', 'rsa-2048.chained.crt', 'rsa-2048.key', 'rsa-2048.ocsp'] from puppet and requests like /puppet/v3/file_content/acmedata/mx/bfcd4752e6b346289533bcb6934671a2/rsa-2048.crt.key?environment=production& showing up in the uwsgi-acme-chief logs) - it had new puppet classes and was making the new .crt.key CERTIFICATE_TYPE calls to acme-chief, and the acme-chief instance had v0.26 installed, but the uwsgi-acme-chief service on the acme-chief box had not been restarted. Wonder if we should automatically restart uwsgi-acme-chief on upgrading the acme-chief package somehow (puppet?)

Jul 14 2020, 9:23 PM · Beta-Cluster-Infrastructure
Krenair lowered the priority of T257968: Certificate for *.beta.wmflabs.org has expired from Unbreak Now! to High.

the immediate problem is solved by me manually doing the cert reload (something like touch /srv/trafficserver/tls/etc/ssl_multicert.config && /bin/systemctl reload trafficserver except there are two different ssl_multicert.config files on the system and two different trafficserver services)

Jul 14 2020, 8:50 PM · Beta-Cluster-Infrastructure
Krenair added a comment to T257968: Certificate for *.beta.wmflabs.org has expired.

it's UBN because beta is down and this task is the beta project, not due to perceived security risk (it's only beta)
initial glance: certs on the box look fine:

root@deployment-cache-text06:/etc/acmecerts/unified/live# openssl x509 -in /etc/acmecerts/unified/live/rsa-2048.chained.crt -noout -text | grep After
            Not After : Sep 14 05:29:00 2020 GMT
root@deployment-cache-text06:/etc/acmecerts/unified/live# openssl x509 -in /etc/acmecerts/unified/live/ec-prime256v1.chained.crt -noout -text | grep After
            Not After : Sep 14 05:29:37 2020 GMT
Jul 14 2020, 8:44 PM · Beta-Cluster-Infrastructure
Krenair claimed T257968: Certificate for *.beta.wmflabs.org has expired.

looking

Jul 14 2020, 8:33 PM · Beta-Cluster-Infrastructure

Jul 11 2020

Krenair awarded T255697: Offboard valhallasw as vps/toolforge admin a Dislike token.
Jul 11 2020, 12:10 AM · User-bd808, cloud-services-team (Kanban), Toolforge

Jun 30 2020

Krenair added a comment to T256806: Mailserver TLS is broken, root certificates are not present for sent intermediate certificates.

This is probably because this:
modules/profile/templates/toolforge/mail-relay.exim4.conf.erb:tls_certificate = /etc/acmecerts/<%= @cert_name %>/live/ec-prime256v1.crt
should use .chained.crt more like these:

Jun 30 2020, 8:15 PM · cloud-services-team (Kanban)

Jun 18 2020

Krenair awarded T255731: Create #acl*wmcs-team a Like token.
Jun 18 2020, 12:13 AM · cloud-services-team (Kanban), Project-Admins

Jun 13 2020

Krenair updated subscribers of T232521: Clicking on images takes you to a black screen due to JS error from MediaViewer using mw.Title internals which have changed.

The patches above have fixed some known uses, but I'm concerned that the ext property should've been deprecated first as we may not have caught every case.

Jun 13 2020, 12:25 PM · MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), MediaViewer, Multimedia

Jun 10 2020

Krenair added a comment to T252734: Consider moving tools away from acme-chief.

It sounds like we've decided to keep acme-chief and set it up in toolsbeta. Shall we close this?

Jun 10 2020, 9:15 PM · cloud-services-team (Kanban), Tools
Krenair added a comment to T252721: cloud-vps solution for Let's Encrypt.

One issue we will have is that we can't create another instance due to quota though.

Jun 10 2020, 9:13 PM · cloud-services-team (Kanban), Cloud-VPS
Krenair added a comment to T254801: Logstash-Beta cannot be accessed: 504 Gateway Time-out.

Can I suggest that efforts may be better placed on getting logstash03 into
operation rather than continuing to resurrect or keep logstash2 on life
support.

Jun 10 2020, 5:03 PM · Release-Engineering-Team, observability, Beta-Cluster-Infrastructure

Jun 8 2020

Krenair added a comment to T254801: Logstash-Beta cannot be accessed: 504 Gateway Time-out.

An image is used to create the VM in the first place, once that's done we just keep it updated. If we replaced instances because they were based on images that were deprecated we'd either have automated the whole thing, stopped running special VMs and gone for a container-on-ephemeral-VM model, or gathered a small army of people to spend all their time replacing instances (this may be an exaggeration but you get the gist).
Stretch is not banned yet, we're still trying to get rid of jessie. Production logstash hosts run stretch.

Jun 8 2020, 7:35 PM · Release-Engineering-Team, observability, Beta-Cluster-Infrastructure
Krenair added a comment to T254801: Logstash-Beta cannot be accessed: 504 Gateway Time-out.

It's supposed to be on 03 but I think it got moved back to 2 at some point.

Jun 8 2020, 6:50 PM · Release-Engineering-Team, observability, Beta-Cluster-Infrastructure

Jun 7 2020

Krenair added a comment to T52864: Upgrade GNU Mailman from 2.1 to Mailman3.

I went to have a look, but both the security groups and iptables on the box looked fine and exim was listening on that port. Then I realised it works for me anyway; I can connect to it:

alex@alex-laptop:~$ telnet lists.beta.wmflabs.org 25
Trying 185.15.56.7...
Connected to lists.beta.wmflabs.org.
Escape character is '^]'.
220 deployment-mailman01.deployment-prep.eqiad.wmflabs ESMTP Exim 4.92 Sun, 07 Jun 2020 02:38:38 +0000

Unless that's been fixed in the past 25 minutes, maybe your ISP is blocking you from connecting out to port 25?

Jun 7 2020, 2:39 AM · Security-Team, SRE, Wikimedia-Mailing-lists

Jun 3 2020

Krenair added a comment to T253584: Strikethrough in Reply tool adds <s> tags with href attribute.

Hmm. I haven't been able to reproduce this either. [1]

@Krenair, do you remember what steps you took to produce the diffs linked above?

...I suspect it'll be hard to recall at this point, but figured it's worth asking.

Jun 3 2020, 11:53 AM · Skipped QA, User-Ryasmeen, MW-1.35-notes (1.35.0-wmf.37; 2020-06-16), Editing-team (Q3 2019-2020 Kanban Board), OWC2020 (OWC2020 Replying 2.0), DiscussionTools

Jun 2 2020

Krenair added a comment to T245937: tools-acme-chief-01 is attempting to validate DNS challenge against cloud authdns IPv6 addresses.

@Krenair, can you summarize the results here? It looks resolved but it's not clear if or how :)

Jun 2 2020, 6:38 PM · cloud-services-team (Kanban), Patch-For-Review, IPv6, Acme-chief

May 29 2020

Krenair awarded Blog Post: Celebrating 600,000 commits for Wikimedia a Party Time token.
May 29 2020, 11:07 PM
Krenair created T254043: cdanis-etcd101.puppet.eqiad.wmflabs not permitting access from cloud-cumin-01.
May 29 2020, 9:57 PM · Cloud-VPS
Krenair created T254042: Investigate peek0[12].orch.eqiad.wmflabs not allowing SSH connections from cloud-cumin-01.
May 29 2020, 9:50 PM · Peek, Security-Team, Cloud-VPS
Krenair created T254041: monitoring and swift project instances not permitting access from cloud-cumin-01.
May 29 2020, 9:48 PM · Cloud-VPS
Krenair created T254040: rec-wiki-buster.recommendation-api.eqiad.wmflabs out of disk space.
May 29 2020, 9:43 PM · Recommendation-API

May 27 2020

Krenair reopened T252762: tools/toolsbeta: improve acme-chief integration as "Open".
May 27 2020, 8:54 PM · Acme-chief, cloud-services-team (Kanban)

May 26 2020

Krenair added a comment to T253584: Strikethrough in Reply tool adds <s> tags with href attribute.

I also ran into this in:

May 26 2020, 5:25 PM · Skipped QA, User-Ryasmeen, MW-1.35-notes (1.35.0-wmf.37; 2020-06-16), Editing-team (Q3 2019-2020 Kanban Board), OWC2020 (OWC2020 Replying 2.0), DiscussionTools

May 20 2020

Krenair placed T165874: Implement queue filter that allows people to specify range of score of edits they want to see up for grabs.

Sorry, missed this message and forgot about this task a long while ago :(

May 20 2020, 7:08 PM · WorkType-NewFunctionality, Huggle
Krenair added a comment to T252762: tools/toolsbeta: improve acme-chief integration.

we might still do this, we'll see :)

May 20 2020, 5:27 PM · Acme-chief, cloud-services-team (Kanban)

May 16 2020

Krenair added a comment to T252721: cloud-vps solution for Let's Encrypt.

WIP puppetisation of this on krenair-t252721-test.testlabs.eqiad.wmflabs, has successfully issued a cert

May 16 2020, 1:09 AM · cloud-services-team (Kanban), Cloud-VPS

May 15 2020

Krenair added a comment to T252721: cloud-vps solution for Let's Encrypt.

figured out roughly how this can work

May 15 2020, 11:23 PM · cloud-services-team (Kanban), Cloud-VPS
Krenair closed T252732: Create a service account to manage testlabs DNS, a subtask of T252721: cloud-vps solution for Let's Encrypt, as Resolved.
May 15 2020, 8:07 PM · cloud-services-team (Kanban), Cloud-VPS
Krenair closed T252732: Create a service account to manage testlabs DNS as Resolved.

now it can authenticate and read zones etc.

May 15 2020, 8:07 PM · cloud-services-team (Kanban), Cloud-VPS

May 14 2020

Krenair reopened T252732: Create a service account to manage testlabs DNS, a subtask of T252721: cloud-vps solution for Let's Encrypt, as Open.
May 14 2020, 11:59 PM · cloud-services-team (Kanban), Cloud-VPS
Krenair reopened T252732: Create a service account to manage testlabs DNS as "Open".

we missed a bit, have been wondering why this wasn't working

May 14 2020, 11:59 PM · cloud-services-team (Kanban), Cloud-VPS