Page MenuHomePhabricator

Cirrus search is broken on beta (April 2023, second occurence)
Closed, ResolvedPublic

Description

It seems that cirrus search is broken again on beta (at least on betawikidata).

The following API request:
https://wikidata.beta.wmflabs.org/w/api.php?action=wbsearchentities&search=string&format=json&errorformat=plaintext&language=en&uselang=en&type=property

gives me this error:

{
   "errors":[
      {
         "code":"cirrussearch-backend-error",
         "module":"wbsearchentities",
         "*":"We could not complete your search due to a temporary problem. Please try again later."
      }
   ],
   "servedby":"deployment-mediawiki11",
   "*":"See https://wikidata.beta.wmflabs.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
}

Note that there was a similar error earlier this month, not sure if that is related: T333952: cirrussearch-backend-error on beta cluster

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I think the cert expired again (already), like in T333952#8759072. To check this, I SSHed into deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud, ran sudo lsof -iTCP -sTCP:LISTEN -n -P to find some port numbers the server is listening on, and ran openssl s_client -connect localhost:9643; it reports “notAfter=Apr 19 10:33:00 2023 GMT” for the certificate, which is about two days ago.

Hm, but the certificate in the nginx config (compare T333952#8760541) looks like it’s still valid:

lucaswerkmeister-wmde@deployment-elastic09:~$ grep -rF ssl_certificate /etc/nginx/
/etc/nginx/snippets/snakeoil.conf:ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
/etc/nginx/snippets/snakeoil.conf:ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
/etc/nginx/sites-available/beta-search:    ssl_certificate /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud.chained.pem;
/etc/nginx/sites-available/beta-search:    ssl_certificate_key /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud-key.pem;
/etc/nginx/sites-available/beta-search-omega:    ssl_certificate /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud.chained.pem;
/etc/nginx/sites-available/beta-search-omega:    ssl_certificate_key /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud-key.pem;
/etc/nginx/sites-available/beta-search-psi:    ssl_certificate /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud.chained.pem;
/etc/nginx/sites-available/beta-search-psi:    ssl_certificate_key /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud-key.pem;
lucaswerkmeister-wmde@deployment-elastic09:~$ sudo cat /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud.chained.pem | openssl x509 -noout -text | grep -A2 Validity
        Validity
            Not Before: Apr  8 10:03:00 2023 GMT
            Not After : May  6 10:03:00 2023 GMT

Mentioned in SAL (#wikimedia-releng) [2023-04-21T10:04:50Z] <Lucas_WMDE> sudo systemctl reload nginx on deployment-elastic09, deployment-elastic10, deployment-elastic11 # T335181

Looks like exactly the same solution worked again 🤷 but it’s hardly sustainable if the servers need to be manually reloaded every few weeks…

I think the cert expired again (already), like in T333952#8759072. To check this, I SSHed into deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud, ran sudo lsof -iTCP -sTCP:LISTEN -n -P to find some port numbers the server is listening on, and ran openssl s_client -connect localhost:9643; it reports “notAfter=Apr 19 10:33:00 2023 GMT” for the certificate, which is about two days ago.

(An easier way to check would have been something like curl https://deployment-elastic10.deployment-prep.eqiad1.wikimedia.cloud:9643 in T333952#8760541, which can also be run on other Beta cluster servers, e.g. deployment-mwmaint02 – the main information I was missing was the port number.)

Change 910758 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] tlsproxy: Fix Nginx reload when cfssl certs get renewed

https://gerrit.wikimedia.org/r/910758

Ah, I was about to create a separate task for tracking the repeating failure, but nevermind :)

Change 910758 merged by Jbond:

[operations/puppet@production] tlsproxy: Fix Nginx reload when cfssl certs get renewed

https://gerrit.wikimedia.org/r/910758

taavi claimed this task.