
ssl expiry tracking in icinga - we don't monitor that many domains
Closed, ResolvedPublic

Description

After a couple of SSL certificates expired unexpectedly, T112542 was created. Since then, we've listed all of the domains for which we purchased certificates.

Now we need to decide if we are going to put in icinga checks for all of them, or just some, and how to differentiate.

We only have that icinga check on the primary unified cert, which covers the production endpoints for:

  • wikipedia.org
  • mediawiki.org
  • wikibooks.org
  • wikidata.org
  • wikimediafoundation.org
  • wikimedia.org
  • wikinews.org
  • wikiquote.org
  • wikisource.org
  • wikiversity.org
  • wikivoyage.org
  • wiktionary.org

... and all of their mobile subdomains and whatnot. It's a pretty verbose check: it validates functional SSL for all of the SAN domains, checks the cert expiry, etc.

But we don't have any kind of checking in place for the various other misc certs we own that are deployed for smaller or one-off services, or deployed to third parties (or in some cases, rare today but important later - not deployed at all but still critical). Just looking at puppet's files/ssl/ today, that list is something like:

archiva.wikimedia.org.crt
blog.wikimedia.org.crt
dumps.wikimedia.org.crt
ecc-star.wmfusercontent.org.crt
eventdonations.wikimedia.org.crt
ganglia.wikimedia.org.crt
gerrit.wikimedia.org.crt
icinga.wikimedia.org.crt
labvirt-star.eqiad.wmnet.crt
ldap-codfw.wikimedia.org.crt
ldap-eqiad.wikimedia.org.crt
ldap-mirror.wikimedia.org.crt
librenms.wikimedia.org.crt
lists.wikimedia.org.crt
policy.wikimedia.org.crt
rt.wikimedia.org.crt
star.planet.wikimedia.org.crt
star.wmflabs.org.crt
star.wmfusercontent.org.crt
stream.wikimedia.org.crt
tendril.wikimedia.org.crt
ticket.wikimedia.org.crt
toolserver.org.crt
virt-star.eqiad.wmnet.crt
wikitech.wikimedia.org.crt

Of those, I can see in our icinga config direct expiry checks only for:

lists.wikimedia.org
ticket.wikimedia.org
ldap-codfw.wikimedia.org
ldap-eqiad.wikimedia.org

Additionally: https://docs.google.com/a/wikimedia.org/spreadsheets/d/1yT5rvoEEUHhNeJAQRVamr8ECqN3TLsMaO8N_At4Ki3I/edit?usp=sharing lists all the certificates and expiry info.

We need to determine which of these will get icinga checks.

Event Timeline

RobH created this task. Sep 28 2015, 11:22 PM
RobH set the priority of this task to Needs Triage.
RobH updated the task description.
RobH added a project: acl*sre-team.
RobH added subscribers: RobH, BBlack.
Restricted Application added subscribers: Matanya, Aklapper. Sep 28 2015, 11:22 PM
Dzahn added a subscriber: Dzahn. Edited Oct 8 2015, 10:47 PM

The following are covered already:

  • lists
  • otrs

They are covered because they use the check_ssl_http check_command, which is defined in modules/nagios_common/files/check_commands/check_ssl.cfg and set to WARN at 60 days and go CRIT at 30 days before expiry. It uses check_ssl, written by Faidon, which has a sub ssl_expiry_check.

i'm going to upload one or more changes to add checks for other services using the same method, in a monitoring::service in the role class.
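To illustrate the 60/30 day thresholds mentioned above, here is a minimal sketch of the classification logic. It is not the actual check_ssl code (which is Perl); the function name `classify_expiry` and the exact boundary handling (strictly less than the threshold) are assumptions.

```python
from datetime import datetime, timedelta, timezone

WARN_DAYS = 60  # WARN threshold quoted in the comment above
CRIT_DAYS = 30  # CRIT threshold quoted in the comment above


def classify_expiry(not_after, now, warn_days=WARN_DAYS, crit_days=CRIT_DAYS):
    """Map the time left until `not_after` to an Icinga-style state string.

    Assumption: a cert with less than `crit_days` left (or already expired)
    is CRITICAL, one with less than `warn_days` left is WARNING, else OK.
    """
    remaining = not_after - now
    if remaining < timedelta(days=crit_days):
        return "CRITICAL"
    if remaining < timedelta(days=warn_days):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (90, 45, 10, -1):
        print(days, classify_expiry(now + timedelta(days=days), now))
```

Passing `now` explicitly keeps the function easy to test; a real check would read the notAfter date from the certificate presented by the server.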

Change 244610 had a related patch set uploaded (by Dzahn):
wikitech: add SSL cert expiry monitoring

https://gerrit.wikimedia.org/r/244610

Change 244614 had a related patch set uploaded (by Dzahn):
icinga: add ssl cert expiry for icinga itself

https://gerrit.wikimedia.org/r/244614

Change 244617 had a related patch set uploaded (by Dzahn):
dumps: add cert expiry check

https://gerrit.wikimedia.org/r/244617

Change 244618 had a related patch set uploaded (by Dzahn):
gerrit: add cert expiry check

https://gerrit.wikimedia.org/r/244618

Dzahn triaged this task as High priority. Oct 19 2015, 11:22 PM

Change 244618 merged by Dzahn:
gerrit: add cert expiry check

https://gerrit.wikimedia.org/r/244618

Dzahn added a comment. Edited Oct 20 2015, 12:33 AM

check for gerrit cert added:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ytterbium&service=HTTPS

"SSL OK - Certificate gerrit.wikimedia.org valid until 2018-05-25 21:30:06 +0000 (expires in 948 days)"
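The status line quoted above has a regular shape, so it can be parsed if someone wants to collect expiry dates from the Icinga output. A rough sketch follows; the regular expression is inferred only from the lines quoted in this task, not from the check_ssl source, and `parse_status` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the status lines quoted in this task (an assumption).
STATUS_RE = re.compile(
    r"^SSL (?P<state>\w+) - Certificate (?P<name>\S+) valid until "
    r"(?P<until>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 "
    r"\(expires in (?P<days>\d+) days\)$"
)


def parse_status(line):
    """Parse one status line; return a dict, or None if it does not match."""
    match = STATUS_RE.match(line.strip().strip('"'))
    if match is None:
        return None
    until = datetime.strptime(match.group("until"), "%Y-%m-%d %H:%M:%S")
    return {
        "state": match.group("state"),
        "name": match.group("name"),
        "not_after": until.replace(tzinfo=timezone.utc),
        "days": int(match.group("days")),
    }


if __name__ == "__main__":
    print(parse_status(
        "SSL OK - Certificate gerrit.wikimedia.org valid until "
        "2018-05-25 21:30:06 +0000 (expires in 948 days)"
    ))
```

Lines that do not match, such as the "failed to connect" error output, simply return None.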

Dzahn claimed this task. Oct 20 2015, 12:34 AM

Change 244610 merged by Faidon Liambotis:
wikitech: add SSL cert expiry monitoring

https://gerrit.wikimedia.org/r/244610

Dzahn added a comment. Oct 20 2015, 2:42 PM

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=silver&service=HTTPS

SSL OK - Certificate wikitech.wikimedia.org valid until 2016-01-25 04:44:13 +0000 (expires in 96 days)

Change 244614 merged by Dzahn:
icinga: add cert expiry check for icinga itself

https://gerrit.wikimedia.org/r/244614

Dzahn added a comment. Oct 20 2015, 7:28 PM

and let's also have meta monitoring. icinga itself should have a working cert :)

added:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=neon&service=HTTPS

SSL OK - Certificate icinga.wikimedia.org valid until 2016-01-08 03:00:46 +0000 (expires in 79 days)

Change 244617 merged by Dzahn:
dumps: add cert expiry check

https://gerrit.wikimedia.org/r/244617

Change 247744 had a related patch set uploaded (by Dzahn):
planet: add ssl cert expiry check

https://gerrit.wikimedia.org/r/247744

Change 247744 merged by Dzahn:
planet: add ssl cert expiry check

https://gerrit.wikimedia.org/r/247744

Change 247905 had a related patch set uploaded (by Dzahn):
icinga: ssl cert monitoring for external services

https://gerrit.wikimedia.org/r/247905

Change 247905 merged by Dzahn:
icinga: ssl cert monitoring for external services

https://gerrit.wikimedia.org/r/247905

Dzahn added a comment. Edited Oct 21 2015, 11:23 PM

ok, all done, except these remnants.

can you help me here? what's the status?

ecc-star.wmfusercontent.org.crt
labvirt-star.eqiad.wmnet.crt
ldap-mirror.wikimedia.org.crt
star.wmfusercontent.org.crt
virt-star.eqiad.wmnet.crt

@Andrew @chasemp @yuvipanda can you tell me if these certs above are actually used and on which FQDN they are expected?

re: LDAP-mirror: nothing to monitor here, because Icinga said:

"SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused "

and puppet code says:

port => '389', # Yes, explicitly not supporting LDAPS (port 636)

(but we still have the cert for it? oh well)
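For reference, the "Connection refused" case can be told apart from a handshake or certificate problem with a small probe. This is only a sketch using Python's standard `ssl` and `socket` modules, not what check_ssl does internally; `probe_tls_expiry` and its return values are made up for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def probe_tls_expiry(host, port, timeout=5.0):
    """Try a TLS handshake and report the certificate's notAfter date.

    Returns ("ok", datetime), ("refused", None) when nothing listens on the
    port, or ("error", message) for handshake, verification or other failures.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            context = ssl.create_default_context()
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ConnectionRefusedError:
        return ("refused", None)
    except OSError as exc:  # also covers ssl.SSLError and timeouts
        return ("error", str(exc))
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return ("ok", datetime.fromtimestamp(seconds, tz=timezone.utc))


if __name__ == "__main__":
    # 636 is the LDAPS port mentioned in the puppet comment above.
    print(probe_tls_expiry("ldap-mirror.wikimedia.org", 636))
```

Because the default context verifies the chain and hostname, a self-signed cert would come back as "error" rather than "ok". The probe also only works for services that speak TLS from the first byte, not for STARTTLS on port 389.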

Dzahn added a comment. Edited Oct 22 2015, 1:17 AM

@Andrew which service on which port uses the virt-star cert? i see the virt100x compute nodes had it and exist in site.pp but _not in DNS_, and labvirt1001 should have it now, but none of the services on these high ports (59xx?) seem to respond to openssl.

Andrew added a comment. Edited Oct 22 2015, 1:57 PM

virt-star is used by the nova-compute services to talk to each other, for example when migrating instances from one place to another. It's almost certainly self-signed. I'll try to figure out what port you can test it on.

(edit: labvirt-star is the interesting one, virt-star isn't used in production.)

We agreed to split this special case into a non-blocking subtask -> T116332

Dzahn closed this task as Resolved. Oct 22 2015, 10:08 PM

All the certs in the original list in the ticket are covered now. The only exception is the special case for labvirt-star.

Restricted Application added a project: Traffic. Mar 23 2016, 4:30 PM