Page MenuHomePhabricator

cirrussearch-backend-error on beta cluster
Closed, ResolvedPublicBUG REPORT

Description

What happens?:
Pywikibot CI fails with cirrussearch-backend-error on beta cluster:

======================================================================
ERROR: test_search (tests.site_generators_tests.SearchTestCase)
Test the site.search() method.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/pywikibot/pywikibot/tests/site_generators_tests.py", line 1080, in test_search
    se = list(mysite.search('wiki', total=100, namespaces=0))
  File "/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/_collections_abc.py", line 317, in __next__
    return self.send(None)
  File "/home/runner/work/pywikibot/pywikibot/pywikibot/tools/collections.py", line 275, in send
    return next(self._started_gen)
  File "/home/runner/work/pywikibot/pywikibot/pywikibot/data/api/_generators.py", line 610, in generator
    self.data = self.request.submit()
  File "/home/runner/work/pywikibot/pywikibot/pywikibot/data/api/_requests.py", line 1094, in submit
    raise pywikibot.exceptions.APIError(**error)
pywikibot.exceptions.APIError: cirrussearch-backend-error: We could not complete your search due to a temporary problem. Please try again later.
[servedby: deployment-mediawiki11;
 help: See https://en.wikisource.beta.wmflabs.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.]

Software version:
MediaWiki version: 1.41.0-alpha
Pywikibot 8.1.0.dev3

Event Timeline

LucasWerkmeister subscribed.

This is also breaking the daily AC/DC browser tests running against Beta Commons.

{"error":{"code":"cirrussearch-backend-error","info":"We could not complete your search due to a temporary problem. Please try again later.","docref":"See https://commons.wikimedia.beta.wmflabs.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."},"servedby":"deployment-mediawiki11"}

Seems to have broken at some point between 2023-04-01T20:52 and 2023-04-02T20:52 (both UTC) if I’m not mistaken.

dcausse added subscribers: bking, dcausse.

Certificates of the elastic machines seem to have expired:

* Server certificate:
*  subject: CN=deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud
*  start date: Mar  5 11:03:00 2023 GMT
*  expire date: Apr  2 11:03:00 2023 GMT
*  issuer: C=US; L=San Francisco; O=Wikimedia Foundation, Inc; OU=Cloud Services; CN=discovery
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.

I'm not sure how to fix this. @bking do you have an idea on how to fix these for the 3 hosts: deployment-elastic(09|10|11).deployment-prep.eqiad1.wikimedia.cloud ?

Last time I tried this, I was advised to generate certificates with the Puppet-ECDSA script, which did not produce trusted certificates. I was able to work around that with some trial and error . I'll try this again now.

Looks like we also need to set up some alerts for certificate expiry, so I'll look in to that as well.

bking claimed this task.
bking moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Upon further review, it looks like beta cluster changed how it handles TLS certificates. Nginx TLS config on deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud:

ssl_certificate /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud.chained.pem;
ssl_certificate_key /etc/cfssl/ssl/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud/discovery__deployment-elastic09_deployment-prep_eqiad1_wikimedia_cloud-key.pem;

Contrast with production elastic host:

ssl_certificate /etc/ssl/localcerts/search.discovery.wmnet.chained.crt;
ssl_certificate_key /etc/ssl/private/search.discovery.wmnet.key;

It looks like whenever this changed on the beta hosts, nginx was not reloaded. I reloaded nginx on all 3 instances and it appears that nginx is serving a validate certificate now.

before reload:

bking@deployment-mwmaint02:~$ curl https://deployment-elastic10.deployment-prep.eqiad1.wikimedia.cloud:9643
curl: (60) SSL certificate problem: certificate has expired

after reload:

bking@deployment-mwmaint02:~$ curl https://deployment-elastic10.deployment-prep.eqiad1.wikimedia.cloud:9643
{
  "name" : "deployment-elastic10-beta-search-psi",
  "cluster_name" : "beta-search-psi",
  "cluster_uuid" : "3qgHSCOrSZeGVeurqL6fGA",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "oss",
    "build_type" : "deb",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Thus, I believe the issue is resolved. Please feel free to reopen this ticket if you do not agree.

Yeah, now it’s working again for me. Thanks!