
Failing HTTP check on WDQS servers after latest deployment
Closed, Resolved (Public)

Description

The Icinga "WDQS SPARQL" check has been failing since the last WDQS deployment.

This check is defined in puppet as:

check_http!query.wikidata.org!/bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&format=json!"xsd:dateTime"

The intent seems to be to check that the full traffic stack is configured correctly by running a SPARQL query via the public SPARQL endpoint, going through all the layers (caching, LVS, local nginx, blazegraph, etc...). This check is defined for each WDQS host, but is executed against the common public endpoint, which does not make sense. Instead, we want to locally check just the local path (nginx -> blazegraph) and have a single common check for the common endpoint.
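To make it easier to see what this check actually sends, the definition can be taken apart. The sketch below assumes the usual Nagios/Icinga convention that `!` separates the command name from its arguments; the function name `parse_check` and the keys it returns are illustrative only and are not part of the puppet code.

```python
from urllib.parse import parse_qs, urlsplit

# The check definition quoted in the task description.
CHECK = 'check_http!query.wikidata.org!/bigdata/namespace/wdq/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&format=json!"xsd:dateTime"'


def parse_check(definition):
    """Split a `command!arg1!arg2!...` definition and decode the SPARQL query."""
    command, host, url, expected = definition.split("!")
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # percent-decodes the values
    return {
        "command": command,
        "host": host,
        "path": parts.path,
        "sparql": params["query"][0],
        "format": params["format"][0],
        "expected_string": expected.strip('"'),
    }


if __name__ == "__main__":
    for key, value in parse_check(CHECK).items():
        print(f"{key}: {value}")
```

The decoded query asks for `schema:dateModified` of `<http://www.wikidata.org>`, and the check passes only if the response contains the string `xsd:dateTime`.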

I have a hard time navigating the Icinga configuration. The check definition (above) seems to match the check_http command definition in /etc/icinga/commands.cfg. It seems that the intent was to use the check_http_url_for_string command instead.

define command {
    command_name    check_http
    command_line    $USER1$/check_http -H $HOSTADDRESS$
    }

But that command takes no arguments. I'm probably missing something.

Note: another check ("WDQS HTTP") was also failing. That check has already been removed: it tested the WDQS UI, which is now deployed as a micro-site and so should no longer be checked on the WDQS servers.
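The argument mismatch described above can be illustrated with a simplified model of Nagios-style macro expansion. This is a sketch only: it is not how Icinga itself is implemented, the `$USER1$` value and the host address are made-up placeholders, and `HYPOTHETICAL_WITH_ARGS` is an invented command line for contrast, not the real definition of check_http_url_for_string.

```python
import re

# Command line quoted from /etc/icinga/commands.cfg in the description.
CHECK_HTTP = "$USER1$/check_http -H $HOSTADDRESS$"
# Hypothetical command line that does consume its arguments (for contrast only).
HYPOTHETICAL_WITH_ARGS = "$USER1$/check_http -H $ARG1$ -u $ARG2$ -s $ARG3$"


def expand(command_line, macros):
    """Replace each $NAME$ macro with its value (empty string if unknown)."""
    return re.sub(r"\$(\w+)\$", lambda m: macros.get(m.group(1), ""), command_line)


def unused_args(command_line, args):
    """Return the arguments whose $ARGn$ macro never appears in command_line."""
    used = set(re.findall(r"\$(ARG\d+)\$", command_line))
    return [arg for i, arg in enumerate(args, start=1) if f"ARG{i}" not in used]


if __name__ == "__main__":
    args = ["query.wikidata.org", "/bigdata/namespace/wdq/sparql?...", "xsd:dateTime"]
    macros = {"USER1": "/usr/lib/nagios/plugins", "HOSTADDRESS": "192.0.2.10"}
    macros.update({f"ARG{i}": a for i, a in enumerate(args, start=1)})
    print(expand(CHECK_HTTP, macros))
    print("silently dropped:", unused_args(CHECK_HTTP, args))
```

Under this model, every argument passed after `check_http!` is dropped, because the command line references no `$ARGn$` macro.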

Event Timeline

Change 657848 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] query_service: fix failing WDQS SPARQL icinga check.

https://gerrit.wikimedia.org/r/657848

Change 657848 merged by Gehel:
[operations/puppet@production] query_service: fix failing WDQS SPARQL icinga check.

https://gerrit.wikimedia.org/r/657848

Change 657861 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] query_service: use a SPARQL query that is agnostic of the updater.

https://gerrit.wikimedia.org/r/657861

Change 657861 merged by Gehel:
[operations/puppet@production] query_service: use a SPARQL query that is agnostic of the updater.

https://gerrit.wikimedia.org/r/657861

The check is now working on the public WDQS cluster. For some reason, we don't deploy envoy on the internal WDQS cluster, and thus the HTTPS check is failing.

We should either deploy envoy (there is no good reason for not encrypting internal traffic) or make the check use HTTPS only on the public server and HTTP on the private one.

I've downtimed the WDQS SPARQL alerts until next week.

I think deploying envoy makes sense. I have a stub patch open here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/657913 but first I think we need a global_cert certificate provisioned, unless we can use wdqs.discovery.wmnet (the one the public cluster is currently using) for both the public and internal environments.

Change 658548 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] wdqs: add cert for wdqs-internal

https://gerrit.wikimedia.org/r/658548

Change 658550 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[labs/private@master] wdqs: add dummy key for new wdqs-internal cert

https://gerrit.wikimedia.org/r/658550

Finished generating new cert. Here's a (password-redacted) log of the changes made:

(0) Creation of YAML file
wdqs-internal.discovery.wmnet:
  authority: puppet_ca
  expiry: null
  alt_names: ["wdqs-internal.discovery.wmnet","wdqs-internal.svc.eqiad.wmnet","wdqs-internal.svc.codfw.wmnet"]
  key:
    password: REDACTED
    algorithm: ec
(1) Generation of cert
ryankemper@puppetmaster1001:/srv/private$  sudo cergen -c 'wdqs-internal.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
2021-01-26 07:54:05,834 INFO     cergen                                   Generating certificates ['wdqs-internal.discovery.wmnet'] with force=False
2021-01-26 07:54:05,834 INFO     Certificate(wdqs-internal.discovery.wmnet) Generating all files, force=False...
2021-01-26 07:54:05,836 INFO     Certificate(wdqs-internal.discovery.wmnet) Generating certificate file
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
2021-01-26 07:54:07,363 INFO     Certificate(wdqs-internal.discovery.wmnet) Generating CA certificate file
2021-01-26 07:54:07,363 INFO     Certificate(wdqs-internal.discovery.wmnet) Generating PKCS12 keystore file
2021-01-26 07:54:07,663 INFO     Certificate(wdqs-internal.discovery.wmnet) Generating Java keystore file
2021-01-26 07:54:08,718 INFO     Certificate(wdqs-internal.discovery.wmnet) Importing PuppetCA(puppetmaster1001.eqiad.wmnet_8140) cert into Java keystore
2021-01-26 07:54:09,724 INFO     Certificate(wdqs-internal.discovery.wmnet) Generating Java truststore file with CA certificate PuppetCA(puppetmaster1001.eqiad.wmnet_8140)

Status of certificates ['wdqs-internal.discovery.wmnet']

Certificate(wdqs-internal.discovery.wmnet, authorities=[PuppetCA(puppetmaster1001.eqiad.wmnet_8140)]):
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.key.private.pem: PRESENT (mtime: 2021-01-26T07:54:05.832437)
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.key.public.pem: PRESENT (mtime: 2021-01-26T07:54:05.832437)
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.crt.pem: PRESENT (mtime: 2021-01-26T07:54:07.360435)
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/ca.crt.pem: PRESENT (mtime: 2021-01-26T07:54:07.360435)
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.keystore.p12: PRESENT (mtime: 2021-01-26T07:54:07.376435)
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.keystore.jks: PRESENT (mtime: 2021-01-26T07:54:09.132432)
        /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/truststore.jks: PRESENT (mtime: 2021-01-26T07:54:10.080431)

(2) Use secret password for wdqs-internal (REDACTED) to decrypt, and place file where it needs to go
sudo openssl ec -in /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.key.private.pem -out /srv/private/modules/secret/secrets/ssl/wqds-internal.discovery.wmnet.key
ryankemper@puppetmaster1001:/srv/private$ sudo openssl ec -in /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.key.private.pem -out /srv/private/modules/secret/secrets/ssl/wqds-internal.discovery.wmnet.key
read EC key
Enter PEM pass phrase:
writing EC key

(3) chown the newly created files; not sure if necessary
## My own hack - not sure if this is necessary but it's good to be safe:
sudo chown -R gitpuppet:gitpuppet /srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet
sudo chown -R gitpuppet:gitpuppet /srv/private/modules/secret/secrets/ssl/wqds-internal.discovery.wmnet.key

(4) Copy over the pubkey and stick it in the puppet repo
# copy `/srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.crt.pem` to `files/ssl/wdqs-internal.discovery.wmnet.crt` under `operations/puppet`
scp 'ryankemper@puppetmaster1001.eqiad.wmnet:/srv/private/modules/secret/secrets/certificates/wdqs-internal.discovery.wmnet/wdqs-internal.discovery.wmnet.crt.pem' "$HOME/wmf/puppet/files/ssl/wdqs-internal.discovery.wmnet.crt"

(5) Commit a dummy key to the "public private" repo: `modules/secret/secrets/ssl/SERVICENAME.discovery.wmnet.key`

(6) Commit to the actual private repo (make sure everything looks right first)
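The file locations used in steps (1) to (4) all follow from the certificate name, so they can be derived from a single variable instead of being typed out per command. The sketch below is only an illustration of that idea: the directory layout is copied from the log above, while `cert_paths` and its dictionary keys are hypothetical names, not part of cergen or puppet.

```python
from pathlib import PurePosixPath

# Base directory as it appears in the log above.
SECRETS = PurePosixPath("/srv/private/modules/secret/secrets")


def cert_paths(name):
    """Derive the paths touched in steps (1)-(4) from one certificate name."""
    cert_dir = SECRETS / "certificates" / name
    return {
        # Written by cergen in step (1).
        "encrypted_key": cert_dir / f"{name}.key.private.pem",
        "certificate": cert_dir / f"{name}.crt.pem",
        # Output of the `openssl ec` decryption in step (2).
        "decrypted_key": SECRETS / "ssl" / f"{name}.key",
        # Destination inside operations/puppet in step (4).
        "puppet_public_cert": PurePosixPath("files/ssl") / f"{name}.crt",
    }


if __name__ == "__main__":
    for label, path in cert_paths("wdqs-internal.discovery.wmnet").items():
        print(f"{label}: {path}")
```

Typing the name once means the source and destination file names cannot drift apart between commands.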

Since resolving this monitoring issue is one of our highest priorities, here's a handoff for Tues Jan 26 so that Europe can make headway:

Current state
After generating a new cert and committing the new files to the private puppet repo (which is already done), two things need to be done:

(1) dummy key added to the "public private" repo labs/private
(2) pubkey added to puppet repo

(1) and (2) correspond to the following two patches respectively, so they need to be merged. The third and final patch is the one that actually enables envoy.

One small uncertainty
When generating the wdqs-internal cert I used the following for alt_names (note that alt_names becomes the SAN, i.e. the certificate's Subject Alternative Name list):

alt_names: ["wdqs-internal.discovery.wmnet","wdqs-internal.svc.eqiad.wmnet","wdqs-internal.svc.codfw.wmnet"]

Compare to the config for wdqs (the public cluster):
alt_names: ["wdqs.discovery.wmnet","wdqs.svc.eqiad.wmnet","wdqs.svc.codfw.wmnet","wdqs.wikimedia.org","wdqs1005.eqiad.wmnet","query.wikidata.org"]

If we end up needing a different list of alt_names for wdqs-internal than the one chosen above, then we'll just need to amend the wdqs-internal yaml config file (/srv/private/modules/secret/secrets/certificates/certificate.manifests.d/wdqs-internal.certs.yaml on puppetmaster1001) and regenerate the cert per https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate

The proposed alt_names should be good. The public cluster is slightly more complex as it is exposed publicly (query.wikidata.org) and wdqs1005 is used directly as the LDF endpoint (without going through service discovery / LVS).
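One way to sanity-check a proposed alt_names list is to compare it against the host names clients are expected to connect to. The sketch below assumes exact, case-insensitive matching only (none of the lists in this task use wildcards); `uncovered` is a hypothetical helper, and the list of client-facing names is an assumption made for illustration.

```python
# alt_names chosen for the wdqs-internal cert, copied from the comment above.
INTERNAL_ALT_NAMES = [
    "wdqs-internal.discovery.wmnet",
    "wdqs-internal.svc.eqiad.wmnet",
    "wdqs-internal.svc.codfw.wmnet",
]


def uncovered(hosts, alt_names):
    """Return the hosts that do not appear in alt_names (exact match, no wildcards)."""
    allowed = {name.lower() for name in alt_names}
    return [host for host in hosts if host.lower() not in allowed]


if __name__ == "__main__":
    # Assumed set of names that clients use to reach the internal cluster.
    expected_hosts = [
        "wdqs-internal.discovery.wmnet",
        "wdqs-internal.svc.eqiad.wmnet",
        "wdqs-internal.svc.codfw.wmnet",
    ]
    print("not covered:", uncovered(expected_hosts, INTERNAL_ALT_NAMES))
    # A public name is, as intended, not covered by the internal cert.
    print("not covered:", uncovered(["query.wikidata.org"], INTERNAL_ALT_NAMES))
```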

Change 658548 merged by Dzahn:
[operations/puppet@production] wdqs: add cert for wdqs-internal

https://gerrit.wikimedia.org/r/658548

Change 658550 merged by Dzahn:
[labs/private@master] wdqs: add dummy key for new wdqs-internal cert

https://gerrit.wikimedia.org/r/658550

Change 657913 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] wdqs: use envoy for wdqs-internal

https://gerrit.wikimedia.org/r/657913

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs1003.eqiad.wmnet

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs1008.eqiad.wmnet

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs1011.eqiad.wmnet

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs2004.codfw.wmnet

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs2005.codfw.wmnet

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs2006.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-01-27T00:15:26Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] Downtimed all wdqs-internal hosts on icinga

Change 657913 merged by Ryan Kemper:
[operations/puppet@production] wdqs: use envoy for wdqs-internal

https://gerrit.wikimedia.org/r/657913

Icinga downtime set by ryankemper@cumin1001 for 2:00:00 1 host(s) and their services with reason: Enabling envoy for wdqs-internal

wdqs2008.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-01-27T00:20:00Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] Disabled puppet on all wdqs-internal hosts; merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/657913

Mentioned in SAL (#wikimedia-operations) [2021-01-27T00:44:57Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] ...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-01-27T00:45:30Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] Discovered source of the above failure; the secret key in the puppetmaster /srv/private repo has a typo in its name (my error): it had wqds instead of wdqs. Opening up a patch now

Mentioned in SAL (#wikimedia-operations) [2021-01-27T00:51:39Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] Fixed typo in private key in commit ea152df802b55e939d34494a4965ed83a80a24f2. Puppet run on wdqs1003 was successful as a result. Monitoring...
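A missing secret caused by a near-miss file name, as in the wqds/wdqs mix-up above, can be spotted by comparing the expected name with what is actually present. This is a minimal sketch using Python's standard `difflib`; `diagnose_secret` is a hypothetical helper, and the 0.8 similarity cutoff is an arbitrary assumption.

```python
import difflib


def diagnose_secret(expected, present):
    """Return (found, suggestion): whether `expected` exists, else the closest name."""
    if expected in present:
        return True, None
    close = difflib.get_close_matches(expected, present, n=1, cutoff=0.8)
    return False, (close[0] if close else None)


if __name__ == "__main__":
    # File name that puppet's secret() lookup asked for, per the error above.
    expected = "wdqs-internal.discovery.wmnet.key"
    # File name that was actually created (note the transposed letters).
    present = ["wqds-internal.discovery.wmnet.key"]
    found, suggestion = diagnose_secret(expected, present)
    if not found:
        print(f"{expected} not found; closest existing file: {suggestion}")
```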

Mentioned in SAL (#wikimedia-operations) [2021-01-27T01:21:23Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] Test queries to wdqs1003.eqiad.wmnet passed, and metrics in Grafana (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs-internal&from=1611706751381&to=1611710190405) look good. Rolling out to rest of fleet

Mentioned in SAL (#wikimedia-operations) [2021-01-27T01:23:59Z] <ryankemper> T272713 [Deploy envoy for wdqs-internal] Roll-out complete. Will monitor wdqs-internal for any issues. All the remaining WDQS SPARQL alerts should clear shortly

Barring any further issues cropping up, this is done.