Page MenuHomePhabricator

Create NRPE check to alert when cergen certificates are due to expire
Closed, DeclinedPublic

Description

We currently have no automated way to alert us when certificates in the private repo generated using cergen are due to expire. We should create an nrpe check to alerts us well in advance of any expiry so we are able to generate new certificates before any impact is observed.

A change has already been drafted as part of T236277 however questions where raised which are best here, below is relevant parts of the conversation from the Gerrit task (volans quoted , me responding)

Also I don't see in the task a specific design of what we're trying to achieve here.

Something to alert us if certs are due to expire, i dont think we have anything in place for this

This would be a single check that will fire saying say 10 certs are
expiring and the day after might re-fire with 11 certs are expiring
(because the error message changed) and so on.

I thought icinga only fired alerts if the Status changed i don't think it fired again if the message changes. otherwise diskspace checks, load or the BGP Status[1] checks would constantly alert

I'm not sure of the usefulness vs noise of such a check instead that a per-certificate check.

I actually think the noise would be much worse with a per certificate checks as many certificates expire on the same day. e.g. we have 112 certs expiring in 343 days.

Also why not checking the live ones instead of those in the repo?

simplicity and ultimately the repo is the source of truth, or should be

We could have a cert in the repo that is not yet/anymore deployed/live but we might keep it around for a bit just in case.

i'm not sure i can think of a valid reason to keep an expired certificate around, especially considering it would still be in the git history

I'd like a bit more of problem statement -> possible solutions -> agreement on a plan in the task tbh.

[1]https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr3-ulsfo&service=BGP+status

Event Timeline

Change 552260 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] cergen: add Icinga check to validate the expiry date on certificates

https://gerrit.wikimedia.org/r/552260

It's hard to reply from the description, there is no quote task description button AFAIK.

Summarizing:

  • for the noise, sorry I misremembered our icinga config, we've obsess disabled hence it will fire only once. That means that if ack-ed when alerting for 10 certs it will not alert anymore if there are other 100 certs expiring the day after, risking to go unnoticed.
  • the case of multiple certs is for example for our global wildcard cert. When we renew it we issue the new one, add it to the repo but is not yet deployed, then deploy and swap it when is the right time. During this period the alert could fire for the old one although the new one is already prepared and someone actively working on it Not a common use case, but still we do it twice a year.

The main concern for me is security wise, we've a flat permission set for the private repo that is already not optimal because we add files as root and then the next puppet run "fixes" the permissions (maybe should be moved to the post-commit hook instead?). But with the current scheme NRPE could have access also to the private data. Yes, we could just allow NRPE to run sudo the specific script, better although still not optimal IMHO.
The other option could be to check those signed by the CA directly from /var/lib/puppet/server/ssl/ca/signed for example.

In general I'm wondering if we should invest much time on all this given we should move soon to a proper PKI that surely will have a better way to manage those use cases.

Summarizing:

  • the case of multiple certs is for example for our global wildcard cert. When we renew it we issue the new one, add it to the repo but is not yet deployed, then deploy and swap it when is the right time. During this period the alert could fire for the old one although the new one is already prepared and someone actively working on it Not a common use case, but still we do it twice a year.

I think you are refering to certificates provided by third parties i.e. globalsignm, digicert. I believe theses files are are stored directly in the puppet repo (with the private parte under secrets/ssl)

The main concern for me is security wise, we've a flat permission set for the private repo that is already not optimal because we add files as root and then the next puppet run "fixes" the permissions (maybe should be moved to the post-commit hook instead?). But with the current scheme NRPE could have access also to the private data. Yes, we could just allow NRPE to run sudo the specific script, better although still not optimal IMHO.

I agree neither of theses are optimal but cant think of any obvious ways a user could escalate there sudo privilege via the python script (although it wouldn't surprise me if there was some obscure trick)

The other option could be to check those signed by the CA directly from /var/lib/puppet/server/ssl/ca/signed for example.

I think this directory is more likely to have incorrect/odl data. A quick check indicates it definitely has different data

In general I'm wondering if we should invest much time on all this given we should move soon to a proper PKI that surely will have a better way to manage those use cases.

the other side to that. is the script is here and works, if we could live with the suboptimal sudo option no more time needs to be invested

Ottomata renamed this task from Create NRPE check to alert when certgen certificates are due to expire to Create NRPE check to alert when cergen certificates are due to expire.Nov 21 2019, 5:15 PM

Why does this script need sudo? Doesn't it just need to read the expiry out of the public certs?

@Ottomata you are correct the script just needs to read the public certificates however the directory with the public certificate also includes the private certificate. We could try to manage the permissions of the certs and exclude the keys with something like the following puppet code (not perfect as it doesn't set +x on the directory)

file {'/srv/private/secret/secrets/certificate':
    ensure  => directory,
    mode    => 'o+r',
    recurse => true,
    ignore  => ['*.jks', '*.key.private.pem', '*.p12'],
}

however this doesn't look pretty and would also cause a change every time something is committed to the private repo (were the certs are stored) . I guess we could also look at managing theses permissions via the git commit hook; however this has its own problems.

  • Either the git commit hook would have to make a commit to its self which is not advice do to the recursive nature
  • Or we leave the git repo in a dirty state after the commit hook runs

Just so that you aren't caught off guard

file {'/srv/private/secret/secrets/certificate':

a form of this kind of already exists in a different form, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/puppetmaster/manifests/gitclone.pp#98

Could we make the cergen script itself modify the permissions after it creates the files? It won't ensure things are right ongoing, but at least they'd be right from the start.

Could we make the cergen script itself modify the permissions after it creates the files? It won't ensure things are right ongoing, but at least they'd be right from the start.

+1 to this, also +1 to quickly modifying cergen to have an 'update certificate' mode that automates the steps at https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate so we don't have to make sure we're deleting the correct half-a-dozen files

Hm, would running cergen --generate with --force be enough?

-F --force                      If not provied --force, any existing files will not be
                                overwritten.  If want to overwrite files, provide --force.

Also, each group of certificate files is rendered to its own directory, so you could easily just remove the offending directory and regenerate.

I'll try to find some time soon to make cergen chmod after creating files.

I will also update that doc for --force who wrote that!? ò_ô

I'll try to find some time soon to make cergen chmod after creating files.

FYI With the current puppettization the permissions would be rewritten.

FYI With the current puppettization the permissions would be rewritten.

Really? where?

Just so that you aren't caught off guard
file {'/srv/private/secret/secrets/certificate':

I think this idea would work with the existing puppetization. Since puppet is managing this as well as '/srv/private', I think it will know to use the more specific rule in the right place. I could be wrong though....

Change 559443 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ca nrpe checking: exclude managing public certificates

https://gerrit.wikimedia.org/r/559443

Just so that you aren't caught off guard
file {'/srv/private/secret/secrets/certificate':

I think this idea would work with the existing puppetization. Since puppet is managing this as well as '/srv/private', I think it will know to use the more specific rule in the right place. I could be wrong though....

I double checked and puppet will recurse up to any files that are also managed with both File['/srv/private'] {recurse => true} and File['/srv/private/secret/secrets/certificate']. The fist statement would apply to everything under /srv/private excluding anything under /srv/private/secret/secrets/certificate. This would mean that the using the code above would no longer lock down the private keys. As such it might be better to just exclude the public certs from management and leave that to cergen.

Ok! As is now, cergen doesn't do anything to chmod after it creates the files, so they should be created with the default umask, which I'm guessing will give them 644. That should do ya?

@jbond could you confirm that this is still a valid request? If yes, do you think there is a better match for it than Infrastructure-Foundations? Or does it perhaps fit SRE-Unowned?

@LSobanski i have added observability alerting. Im not real sure if that's the best group but it is about monitoring so its seems the best we have. Im not sure how much we still rely on certgen @elukey has been migrating a lot of things to pki.discovery.wmnet they or @Ottomata may have a good idea on how much use we still have and if this is worth pursuing.

IMHO between the migration to pki and the fact that we monitor cert expiration directly from probes to the service I think this task could be resolved/declined, what do you think @Ottomata ?

Change 559443 abandoned by Jbond:

[operations/puppet@production] ca nrpe checking: exclude managing public certificates

Reason:

task was declined

https://gerrit.wikimedia.org/r/559443

Change 552260 abandoned by Jbond:

[operations/puppet@production] cergen: add Icinga check to validate the expiry date on certificates

Reason:

not desired

https://gerrit.wikimedia.org/r/552260