wikitech-static cert renewal seems to stop apache2
Closed, ResolvedPublic

Description

"Certificate wikitech-static.wikimedia.org valid until 2019-01-29 23:00:34 +0000 (expires in 5 days)"

Event Timeline

Access to this host is just

ssh root@wikitech-static.wikimedia.org

with the password in pwstore.

Mentioned in SAL (#wikimedia-operations) [2019-01-24T22:08:20Z] <mutante> wikitech-static - certbot was already installed but it wasn't used to generate the existing certs so just running certbot renew did not work, attempted to use certbot to renew but apache plugin missing, installed python-certbot-apache (T214640)

So... certbot was already installed on this system but it had not been used to create the existing certificates. This meant simply running certbot renew would not work as certbot did not know about them and there was no auto-renewing.

I started out trying to manually specify the domain name and using the apache plugin to detect them.

certbot --dry-run --apache certonly -n -d wikitech-static.wikimedia.org

This failed with "The requested apache plugin does not appear to be installed."

So i ran apt-get install python-certbot-apache to get the plugin.

Then i tried to use that combined with the certonly option.

certbot --dry-run --apache certonly -n -d wikitech-static.wikimedia.org and this time it said

You should register before running non-interactively, or provide --agree-tos and --email <email_address> flags.

and also

Client with the currently selected authenticator does not support any combination of challenges that will satisfy the CA.

I removed the "-n" to be interactive.

certbot --apache certonly -d wikitech-static.wikimedia.org

Now i got asked to enter an email address, i entered noc@wikimedia.org and after that accepted the ToS of letsencrypt.

This created an account and credentials in /etc/letsencrypt and certbot said it's a good idea to make a backup so i did and made /root/letsencrypt-backup.tar.gz but have not moved it off the machine. ("If you lose your account credentials, you can recover through e-mails sent to noc@wikimedia.org").

But we were still stuck at "selected authenticator does not support any combination of challenges". This leads to https://github.com/certbot/certbot/issues/5405

" Let's Encrypt has stopped offering the mechanism that Certbot's Apache and Nginx plugins use to prove you control a domain due to a security issue" https://community.letsencrypt.org/t/2018-01-11-update-regarding-acme-tls-sni-and-shared-hosting-infrastructure/50188

This recommended using the --authenticator webroot method for the challenge. I tried that, got asked to enter the webroot paths manually, and it failed because both domain names would have to serve files out of their respective document roots. The challenge failed because the webserver responded with 404s.
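A minimal sketch of how the 404s could be checked per domain before involving certbot at all. The webroot authenticator serves its challenge files from .well-known/acme-challenge/ under each document root, so a probe file placed there should be reachable over plain HTTP. The function name make_probe and the example path and domain are hypothetical, not taken from this host.

```shell
# Create a probe file where the webroot authenticator would put its
# challenge files, and print the URL that should then return it.
# Arguments: document root, domain name (both supplied by the caller).
make_probe() {
    webroot=$1
    domain=$2
    token="probe-$$"
    mkdir -p "$webroot/.well-known/acme-challenge"
    echo "ok" > "$webroot/.well-known/acme-challenge/$token"
    echo "http://$domain/.well-known/acme-challenge/$token"
}

# Usage (hypothetical path and domain, not run here):
#   curl -fsS "$(make_probe /var/www/example example.org)"
```

If curl gets a 404 for one of the domains, the document root given to certbot for that domain is probably not the one the vhost actually serves from.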

Next option was the --authenticator standalone method combined with a pre and post hook which takes the webserver down, requests the cert and then starts it again. so:

certbot --authenticator standalone --installer apache --pre-hook "service apache2 stop" --post-hook "service apache2 start" and i was asked:

No names were found in your configuration files. Please enter in your domain

This generated a key but ultimately:

We were unable to find a vhost with a ServerName or Address of status.wikimedia.org.

It turned out certbot isn't able to detect the ServerNames if there is more than one in a single file and our setup had the http and https config combined in a single file.

So next i split up the files in sites-available and created new symlinks in sites-enabled so that there is one ServerName per file.
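To find which files still need splitting, something like the following sketch could be used. It only counts uncommented ServerName lines per file, which is a rough stand-in for what certbot's detection does, and the function names are made up for illustration. On Debian the directory would typically be /etc/apache2/sites-enabled.

```shell
# Count active (non-comment) ServerName directives in one Apache config file.
count_servernames() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -Eic '^[[:space:]]*ServerName[[:space:]]' "$1" || true
}

# Print every *.conf file in a directory that declares more than one ServerName.
list_multi_servername_files() {
    dir=$1
    for f in "$dir"/*.conf; do
        [ -e "$f" ] || continue
        n=$(count_servernames "$f")
        if [ "$n" -gt 1 ]; then
            printf '%s: %s ServerName directives\n' "$f" "$n"
        fi
    done
}

# Usage (not run here):
#   list_multi_servername_files /etc/apache2/sites-enabled
```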

Despite not being able to install the certificates certbot had created them.

- Unable to install the certificate
- Congratulations! Your certificate and chain have been saved at
  /etc/letsencrypt/live/wikitech-static.wikimedia.org/fullchain.pem.
  Your cert will expire on 2019-04-24.

Re-running the command and now certbot detected the Virtual Hosts:

Which names would you like to activate HTTPS for?
-------------------------------------------------------------------------------
1: status.wikimedia.org
2: wikitech-static.wikimedia.org
-------------------------------------------------------------------------------

and it also noticed i had an existing cert for the same domain names:

What would you like to do?
-------------------------------------------------------------------------------
1: Attempt to reinstall this existing certificate
2: Renew & replace the cert (limit ~5 per 7 days)

picked 1 and it also gave me a choice to enforce https (which we were already doing)

Please choose whether HTTPS access is required or optional.
-------------------------------------------------------------------------------
1: Easy - Allow both HTTP and HTTPS access to these sites
2: Secure - Make all requests redirect to secure HTTPS access

picked 2, certbot added redirects to the SSL vhosts and then:

Congratulations! You have successfully enabled https://status.wikimedia.org and
https://wikitech-static.wikimedia.org

and

You should test your configuration at:
https://www.ssllabs.com/ssltest/analyze.html?d=status.wikimedia.org
https://www.ssllabs.com/ssltest/analyze.html?d=wikitech-static.wikimedia.org

Which i clicked and they are both A+.

Now for the future this should all be much easier. You can run certbot certificates to see the current status, and certbot renew will renew the cert manually if needed.

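Finally there is /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf, which contains the commented-out line # renew_before_expiry = 30 days. That is the default, so certbot renew will replace the cert once it has less than 30 days left.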

So this means it should just never happen again and we are done here.

root@wikitech-static-ord:/# certbot certificates
Saving debug log to /var/log/letsencrypt/letsencrypt.log

-------------------------------------------------------------------------------
Found the following certs:
  Certificate Name: wikitech-static.wikimedia.org
    Domains: wikitech-static.wikimedia.org status.wikimedia.org
    Expiry Date: 2019-04-24 20:59:28+00:00 (VALID: 89 days)
    Certificate Path: /etc/letsencrypt/live/wikitech-static.wikimedia.org/fullchain.pem
    Private Key Path: /etc/letsencrypt/live/wikitech-static.wikimedia.org/privkey.pem
-------------------------------------------------------------------------------
grep renew_before /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf 
# renew_before_expiry = 30 days
17:26 <+icinga-wm> RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate wikitech-static.wikimedia.org valid until 2019-04-24 20:59:28 +0000 (expires in 89 days)

Thanks! It would be nice to add instructions about how to do this in the future to https://wikitech.wikimedia.org/wiki/Wikitech-static#How_do_we_maintain_it?

Oh, sorry I didn't read to the end -- looks like it automatically renews! So, nevermind :)

Yes, we should not have to do anything. And even if auto-renew fails for some reason, the most we should have to do is run certbot renew by hand :)

CDanis subscribed.

Looks like certbot renews the cert but doesn't restart apache correctly?

2019-03-26 00:04:23 <+icinga-wm> PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
2019-03-26 00:04:23 <+icinga-wm> PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/
2019-03-26 00:04:25 <+icinga-wm> PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
2019-03-26 00:05:17 <+icinga-wm> PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/

Logged in. Found that apache2 wasn't running; seemed to have failed?

● apache2.service - The Apache HTTP Server
   Loaded: loaded (/lib/systemd/system/apache2.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2019-03-26 00:01:38 UTC; 12min ago
  Process: 17265 ExecStop=/usr/sbin/apachectl stop (code=exited, status=0/SUCCESS)
  Process: 25813 ExecReload=/usr/sbin/apachectl graceful (code=exited, status=0/SUCCESS)
  Process: 17256 ExecStart=/usr/sbin/apachectl start (code=exited, status=0/SUCCESS)
 Main PID: 32449 (code=exited, status=0/SUCCESS)

Mar 26 00:01:38 wikitech-static.wikimedia.org systemd[1]: Starting The Apache HTTP Server...
Mar 26 00:01:38 wikitech-static.wikimedia.org apachectl[17256]: httpd (pid 17234) already running
Mar 26 00:01:38 wikitech-static.wikimedia.org systemd[1]: Started The Apache HTTP Server.

Restarted it by hand and Icinga was happy again / things were serving again.

The time that systemctl says it died does correlate with certbot executing: https://phabricator.wikimedia.org/P8268
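As a stopgap while the root cause is unclear, a small guard could be run after each renewal (for example from a --post-hook, which was already used earlier in this task) to start apache2 only if it is not active. This is a sketch, not what is deployed on the host. The IS_ACTIVE_CMD and START_CMD variables are hypothetical overrides that exist only so the logic can be tried without systemd.

```shell
# Start a service only if it is not already active.
# Defaults use systemctl; IS_ACTIVE_CMD / START_CMD can override them.
ensure_running() {
    svc=$1
    if ${IS_ACTIVE_CMD:-systemctl is-active --quiet} "$svc"; then
        echo "$svc already running"
    else
        echo "$svc not running, starting it"
        ${START_CMD:-systemctl start} "$svc"
    fi
}

# Usage (not run here):
#   ensure_running apache2
```

This would paper over the symptom rather than explain why the service ends up inactive, so it does not replace finding out what certbot does to apache2 during renewal.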

CDanis renamed this task from "wikitech-static cert about to expire" to "wikitech-static cert renewal seems to stop apache2". Mar 26 2019, 2:46 PM

:( Sad. Such a fight to get certbot to take over and not have to manually deal with renewals anymore and now this.

I can find some other users reporting it:

https://community.letsencrypt.org/t/certificate-renewal-kills-apache2/27503

https://theadminzone.com/threads/certbot-for-lets-encrypt-kills-apache-when-renewing-certificate.145642/

Our certbot crontab lines are:

@monthly /usr/local/sbin/acme-setup -i wikitech-static -s wikitech-static.wikimedia.org -m acme -w apache2
@monthly /usr/local/sbin/acme-setup -i status -s status.wikimedia.org -m acme -w apache2

Manually running them did NOT reproduce the issue.

Can't we have it use webroot-based authorisation with the existing web server?

Mentioned in SAL (#wikimedia-operations) [2019-03-28T15:15:42Z] <mutante> wikitech-static - removing acme-setup cron jobs from root's crontab. this was used before the switch to certbot, is unrelated and added to confusion and maybe the problem (T214640)

Soo.. the acme-setup crons were unrelated and are removed now. The actual cron job, which ships with the Debian package of certbot, is:

0 */12 * * * root test -x /usr/bin/certbot -a \! -d /run/systemd/system && perl -e 'sleep int(rand(3600))' && certbot -q renew

So if /run/systemd/system is a directory, the cron job will NOT run certbot, because in that case the systemd timer is used instead of cron.
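The guard at the start of that cron line can be tried on its own to see which mechanism applies on a given host. A minimal sketch, with a hypothetical function name and the two paths from the cron line as defaults:

```shell
# Same condition as the packaged cron line: succeed only when certbot is
# executable AND the systemd runtime directory does not exist.
cron_job_would_run() {
    certbot_bin=${1:-/usr/bin/certbot}
    systemd_dir=${2:-/run/systemd/system}
    [ -x "$certbot_bin" ] && [ ! -d "$systemd_dir" ]
}

# Usage (not run here):
#   if cron_job_would_run; then
#       echo "cron job does the renewal"
#   else
#       echo "cron job is a no-op on this host"
#   fi
```

On a systemd host with certbot installed, the second branch is the one expected, which matches the timer being responsible for renewals.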

/lib/systemd/system/certbot.timer

This will just start certbot and the command line of that is:

ExecStart=/usr/bin/certbot -q renew

Since --standalone is false by default and renew reuses the method from the last successful renewal, this uses Apache (httpd) and not the standalone webserver.
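Which method renew will actually use can be read from the renewal config rather than guessed. The sketch below assumes the file consists of simple "key = value" lines, as /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf does for authenticator and installer. The function name is made up for illustration.

```shell
# Print the value of KEY from a certbot renewal config file.
# Commented-out lines such as "# renew_before_expiry = 30 days" are ignored,
# so an unset option prints nothing.
get_renewal_param() {
    conf=$1
    key=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$conf" | head -n 1
}

# Usage (not run here):
#   get_renewal_param /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf authenticator
```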

Mentioned in SAL (#wikimedia-operations) [2019-03-28T16:36:27Z] <mutante> wikitech-static - changing [renewalparams] authenticator = to 'apache' from 'standalone' (installer = was already apache) (T214640)

This has since been set to standalone, and new certs were generated. See T204840#5243222 for the context. Should this task remain open?

Mentioned in SAL (#wikimedia-operations) [2019-07-16T22:26:44Z] <mutante> wikitech-static ran certbot with --dry-run renew to confirm cert renewal works and it was just fine .. 2 minutes later apache errors which were fixed by restarting apache2 (T214640)

Mentioned in SAL (#wikimedia-operations) [2019-07-16T23:23:21Z] <mutante> wikitech-static - testing cert renewal with dry-run option - getting some temp icinga alerts is now expected again because renewal method was changed back from 'apache' to 'standalone' (not by me -> T204840#5243222 i previously did the opposite change in T214640#4907685 to fix it) and that takes down apache during the renewal (T214640)

Mentioned in SAL (#wikimedia-operations) [2019-07-16T23:26:34Z] <mutante> wikitech-static - current status with method 'standalone' is that it's broken on cert renewal and gets fixed by restarting apache, which makes no sense since the previous fixes were the straight opposite and the ticket claims the fix was moving back from apache to standalone (T214640)

Mentioned in SAL (#wikimedia-operations) [2019-07-17T00:01:28Z] <mutante> wikitech-static changing certbot renewalparams: authenticator = webroot (changed from standalone), install = apache (unchanged) (T214640)

Mentioned in SAL (#wikimedia-operations) [2019-07-17T00:01:55Z] <mutante> wikitech-static certbot --dry-run renew (T214640)

Mentioned in SAL (#wikimedia-operations) [2019-07-17T00:12:13Z] <mutante> wikitech-static - adding (undocumented!) option webroot-map to certbot config to use webroot authenticator with different document roots per domain while using the config file and not cli params (T214640)

This is still not working right, but i will continue debugging tomorrow. The cert doesn't expire until September and i commented out the cron job for now.

Mentioned in SAL (#wikimedia-operations) [2019-07-27T00:33:27Z] <mutante> wikitech-static - fix /etc/letsencrypt/renewal/wikitech-static.wikimedia.org.conf - remove webroot_map and and line for status.wm.org that caused errors when doing a renewal dry-run. now dry run finishes succesfully and we are using "webroot" authenticator and not "apache" anymore. This should have resolved what this ticket was about. No more Apache kills/restarts on renewal. (T214640)

Change 526200 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/wikitech-static@master] add certbot renewal config for Letsencrypt

https://gerrit.wikimedia.org/r/526200

Change 526200 merged by Dzahn:
[operations/wikitech-static@master] add certbot renewal config for Letsencrypt

https://gerrit.wikimedia.org/r/526200