beta cluster down
Closed, ResolvedPublic

Description

The beta cluster has been down since T329535. For now there are two issues:

  • deployment-db10 has disk issues, it needs replacement: T329577
  • haproxy refuses to start on deployment-cache-text07

Event Timeline

-- Logs begin at Tue 2023-02-14 02:48:28 UTC, end at Tue 2023-02-14 15:42:43 UTC. --
Feb 14 15:38:30 deployment-cache-text07 systemd[1]: Starting HAProxy Load Balancer...
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: Traceback (most recent call last):
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 291, in <module>
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:     main()
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 284, in main
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:     certs_fetch_ocsp(out_tempfile, args)
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 209, in certs_fetch_ocsp
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:     (ocsp_text, ocsp_err) = check_output_errtext(cmd)
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 102, in check_output_errtext
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]:     (" ".join(args), p.returncode, p_err))
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: Exception: Command openssl ocsp -resp_text -respout /var/cache/ocsp/update-ocsp-_omnngyi.tmp/digicert-2021-ecdsa-unified.ocsp -issuer /etc/ssl/certs/ebc232bc.0 -verify_other /etc/ssl/certs/ebc232bc.0 -url http://ocsp.digicert.com -header Host=ocsp.digicert.com -cert /etc/ssl/localcerts/digicert-2021-ecdsa-unified.crt failed with exit code 1, stderr:
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: OCSP update failed for /etc/update-ocsp.d/digicert-2021-ecdsa-unified.conf
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: Traceback (most recent call last):
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 291, in <module>
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:     main()
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 284, in main
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:     certs_fetch_ocsp(out_tempfile, args)
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 209, in certs_fetch_ocsp
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:     (ocsp_text, ocsp_err) = check_output_errtext(cmd)
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:   File "/usr/local/sbin/update-ocsp", line 102, in check_output_errtext
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]:     (" ".join(args), p.returncode, p_err))
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: Exception: Command openssl ocsp -resp_text -respout /var/cache/ocsp/update-ocsp-xst48s9v.tmp/digicert-2021-rsa-unified.ocsp -issuer /etc/ssl/certs/e83d98dd.0 -verify_other /etc/ssl/certs/e83d98dd.0 -url http://ocsp.digicert.com -header Host=ocsp.digicert.com -cert /etc/ssl/localcerts/digicert-2021-rsa-unified.crt failed with exit code 1, stderr:
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: OCSP update failed for /etc/update-ocsp.d/digicert-2021-rsa-unified.conf
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: haproxy.service: Control process exited, code=exited, status=1/FAILURE
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: haproxy.service: Failed with result 'exit-code'.
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: Failed to start HAProxy Load Balancer.
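The journal above is noisy because each failed certificate produces a full traceback. A minimal sketch for pulling out just the configs that failed, assuming the "OCSP update failed for ..." summary line keeps the format seen in this log (the function name is hypothetical, not part of update-ocsp):

```python
import re

# Matches the per-config summary line that update-ocsp-all logs in the paste above.
FAILED_RE = re.compile(r"OCSP update failed for (\S+)")


def failed_ocsp_configs(journal_text):
    """Return the config paths reported as failed, in log order, without duplicates."""
    seen = []
    for line in journal_text.splitlines():
        match = FAILED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Three lines copied from the journal output in this task.
sample = """\
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: OCSP update failed for /etc/update-ocsp.d/digicert-2021-ecdsa-unified.conf
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: OCSP update failed for /etc/update-ocsp.d/digicert-2021-rsa-unified.conf
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: Failed to start HAProxy Load Balancer.
"""

print(failed_ocsp_configs(sample))
```

Both digicert-2021 configs show up, which matches the two tracebacks in the log.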

hrm, so what's changed about deployment-cache-text07?

The haproxy systemd unit file definition contains an ExecStartPre that's failing:

ExecStartPre=/usr/local/sbin/update-ocsp-all

And that file was edited on Feb 9th:

thcipriani@deployment-cache-text07:/etc/systemd/system$ ls -lhA | grep haproxy
-rw-r--r-- 1 root root  710 Feb  9 16:53 haproxy.service

But the configuration for the OCSP check has been on deployment-cache-text07 since Oct:

thcipriani@deployment-cache-text07:~$ ls -lhA /etc/update-ocsp.d
total 12K
-r--r--r-- 1 root root  138 Oct 17 11:40 digicert-2021-ecdsa-unified.conf
-r--r--r-- 1 root root  134 Oct 17 11:40 digicert-2021-rsa-unified.conf
dr-xr-xr-x 2 root root 4.0K Oct 17 11:41 hooks

We could set profile::cache::haproxy::do_ocsp: false, but I have no idea if that's the right thing; I'm a bit out of my depth troubleshooting haproxy. It looks like the role giving us trouble was updated in Oct of 2021 by @Vgutierrez. Maybe it's obvious to @Vgutierrez what's wrong, or he could point us in the right direction?
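For context on why a failing ExecStartPre keeps haproxy from starting at all: systemd treats a non-zero exit from an ExecStartPre command as fatal unless the command is prefixed with "-". A rough sketch that lists the ExecStartPre commands in a unit file and flags which ones are fatal. The parser is deliberately simplistic, the function name is hypothetical, and the second ExecStartPre line in the sample is made up for illustration; only the update-ocsp-all line comes from this host.

```python
def exec_start_pre(unit_text):
    """Return a list of (command, fatal) tuples for the ExecStartPre= lines.

    fatal is False when the command carries systemd's "-" prefix, which
    tells systemd to ignore a non-zero exit status for that command.
    """
    key = "ExecStartPre="
    prefixes = "@-:+!"  # special executable prefixes systemd accepts
    results = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith(key):
            continue
        value = line[len(key):]
        if value == "":
            # An empty assignment resets the list of commands in systemd.
            results.clear()
            continue
        command = value.lstrip(prefixes)
        prefix = value[: len(value) - len(command)]
        results.append((command, "-" not in prefix))
    return results


sample_unit = """\
[Service]
ExecStartPre=/usr/local/sbin/update-ocsp-all
ExecStartPre=-/bin/true
"""

for command, fatal in exec_start_pre(sample_unit):
    print(command, "fatal" if fatal else "failure ignored")
```

With the unit as deployed, update-ocsp-all has no "-" prefix, so its exit status 1 is what produces the "Control process exited" line in the journal.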

hrm, so what's changed about deployment-cache-text07?

Was this part of the T293585 amelioration (aka T320930)? I think that (Oct 2022) was the last time we did anything major to that box.

The culprit was actually the following hiera config for the deployment-cache puppet prefix:

profile::cache::haproxy::unified_certs:
- digicert-2021-ecdsa-unified
- digicert-2021-rsa-unified

I got rid of it and manually cleaned up /etc/update-ocsp.d. deployment-prep instances use acme-chief managed TLS material and can perform OCSP stapling as acme-chief provides the .ocsp files:

root@deployment-cache-text07:~# ls -alh /etc/acmecerts/unified/live/*.ocsp
lrwxrwxrwx 1 root haproxy  74 Feb  9 16:53 /etc/acmecerts/unified/live/ec-prime256v1.alt.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/ec-prime256v1.ocsp
lrwxrwxrwx 1 root haproxy  74 Feb  9 16:53 /etc/acmecerts/unified/live/ec-prime256v1.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/ec-prime256v1.ocsp
-rw-r----- 1 root haproxy 503 Feb 13 13:08 /etc/acmecerts/unified/live/ec-prime256v1.ocsp
lrwxrwxrwx 1 root haproxy  69 Feb  9 16:53 /etc/acmecerts/unified/live/rsa-2048.alt.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/rsa-2048.ocsp
lrwxrwxrwx 1 root haproxy  69 Feb  9 16:53 /etc/acmecerts/unified/live/rsa-2048.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/rsa-2048.ocsp
-rw-r----- 1 root haproxy 503 Feb 11 11:09 /etc/acmecerts/unified/live/rsa-2048.ocsp

The same was happening on deployment-cache-upload07; both instances are now happy.
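If someone wants to sanity-check the acme-chief provided .ocsp files listed above, something along these lines could work. It is only a sketch: the function name and the 7-day freshness threshold are assumptions of mine, not anything acme-chief or haproxy defines, and it only looks at the files on disk without validating the OCSP responses themselves.

```python
import os
import tempfile
import time


def check_ocsp_files(directory, max_age_days=7):
    """Return {filename: status} for every *.ocsp entry in directory.

    Status is "dangling" (symlink target missing), "empty", "stale"
    (older than max_age_days, an arbitrary threshold) or "ok".
    """
    report = {}
    now = time.time()
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".ocsp"):
            continue
        path = os.path.join(directory, name)
        if not os.path.exists(path):  # follows symlinks, so broken links land here
            report[name] = "dangling"
        elif os.path.getsize(path) == 0:
            report[name] = "empty"
        elif now - os.path.getmtime(path) > max_age_days * 86400:
            report[name] = "stale"
        else:
            report[name] = "ok"
    return report


# Self-contained demo in a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "rsa-2048.ocsp")
    with open(real, "wb") as fh:
        fh.write(b"placeholder")
    os.symlink(real, os.path.join(tmp, "rsa-2048.chained.crt.key.ocsp"))
    # Symlink whose target does not exist, to show the "dangling" case.
    os.symlink(os.path.join(tmp, "missing"), os.path.join(tmp, "ec-prime256v1.ocsp"))
    demo_report = check_ocsp_files(tmp)

print(demo_report)
```

On the real host the directory to pass would be /etc/acmecerts/unified/live, run as a user that can read the files (they are root:haproxy, mode 0640).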

Vgutierrez updated the task description.

From the CI on restbase that is using beta cluster I am getting this error for mathoid:

{
  "type": "internal_http_error",
  "detail": "connect ECONNREFUSED 185.15.56.41:443",
  "internalStack": "Error: connect ECONNREFUSED 185.15.56.41:443\n    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1144:16)\n    at TCPConnectWrap.callbackTrampoline (internal/async_hooks.js:126:14)",
  "internalURI": "https://mathoid.beta.math.wmflabs.org/texvcinfo",
  "internalQuery": "{}",
  "internalErr": "connect ECONNREFUSED 185.15.56.41:443",
  "internalMethod": "post"
}
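ECONNREFUSED means the TCP connection was actively rejected, so the address is reachable but nothing is accepting connections on port 443 (as opposed to a timeout, which would hint at a firewall or routing problem). A small sketch to tell the two apart from any host; the function name and the 3 second timeout are arbitrary choices for illustration:

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: no listener on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks and similar.
        return f"error: {exc}"
```

For the failure above, the call would be probe("mathoid.beta.math.wmflabs.org", 443); a "refused" result would be consistent with the ECONNREFUSED that restbase reports.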

According to the DNS records associated with 185.15.56.41, mathoid.beta.math.wmflabs.org seems to be handled by math19.math.wmflabs.org. I don't have access to that WMCS project; could somebody else take a look?

I am still getting this error both on tests and when I try to log in:

[Y@z-FeLrRRvlf4qIMcOt4AAAAFc] /w/index.php?returnto=Main+Page&title=Special:UserLogin FileBackendError: Iterator page I/O error.

Backtrace:

from /srv/mediawiki/php-master/includes/libs/filebackend/SwiftFileBackend.php(952)
#0 /srv/mediawiki/php-master/includes/libs/filebackend/fileiteration/SwiftFileBackendDirList.php(39): SwiftFileBackend->getDirListPageInternal(string, string, NULL, integer, array)
#1 /srv/mediawiki/php-master/includes/libs/filebackend/fileiteration/SwiftFileBackendList.php(112): SwiftFileBackendDirList->pageFromList(string, string, NULL, integer, array)
#2 /srv/mediawiki/php-master/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php(233): SwiftFileBackendList->rewind()
#3 /srv/mediawiki/php-master/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php(203): MediaWiki\Extension\ConfirmEdit\FancyCaptcha\FancyCaptcha->pickImageDir(string, integer, integer)
#4 /srv/mediawiki/php-master/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php(466): MediaWiki\Extension\ConfirmEdit\FancyCaptcha\FancyCaptcha->pickImage()
#5 /srv/mediawiki/php-master/extensions/ConfirmEdit/SimpleCaptcha/SimpleCaptcha.php(1225): MediaWiki\Extension\ConfirmEdit\FancyCaptcha\FancyCaptcha->getCaptcha()
#6 /srv/mediawiki/php-master/extensions/ConfirmEdit/includes/Auth/CaptchaPreAuthenticationProvider.php(76): MediaWiki\Extension\ConfirmEdit\SimpleCaptcha\SimpleCaptcha->createAuthenticationRequest()
#7 /srv/mediawiki/php-master/includes/auth/AuthManager.php(2272): MediaWiki\Extension\ConfirmEdit\Auth\CaptchaPreAuthenticationProvider->getAuthenticationRequests(string, array)
#8 /srv/mediawiki/php-master/includes/auth/AuthManager.php(2250): MediaWiki\Auth\AuthManager->getAuthenticationRequestsInternal(string, array, array, User)
#9 /srv/mediawiki/php-master/includes/specialpage/AuthManagerSpecialPage.php(277): MediaWiki\Auth\AuthManager->getAuthenticationRequests(string, User)
#10 /srv/mediawiki/php-master/includes/specialpage/LoginSignupSpecialPage.php(145): AuthManagerSpecialPage->loadAuth(NULL)
#11 /srv/mediawiki/php-master/includes/specialpage/LoginSignupSpecialPage.php(236): LoginSignupSpecialPage->load(NULL)
#12 /srv/mediawiki/php-master/includes/specialpage/SpecialPage.php(700): LoginSignupSpecialPage->execute(NULL)
#13 /srv/mediawiki/php-master/includes/specialpage/SpecialPageFactory.php(1460): SpecialPage->run(NULL)
#14 /srv/mediawiki/php-master/includes/MediaWiki.php(324): MediaWiki\SpecialPage\SpecialPageFactory->executePath(string, RequestContext)
#15 /srv/mediawiki/php-master/includes/MediaWiki.php(917): MediaWiki->performRequest()
#16 /srv/mediawiki/php-master/includes/MediaWiki.php(573): MediaWiki->main()
#17 /srv/mediawiki/php-master/index.php(50): MediaWiki->run()
#18 /srv/mediawiki/php-master/index.php(46): wfIndexMain()
#19 /srv/mediawiki/w/index.php(3): require(string)
#20 {main}

I am still getting this error both on tests and when I try to log in:

[Y@z-FeLrRRvlf4qIMcOt4AAAAFc] /w/index.php?returnto=Main+Page&title=Special:UserLogin FileBackendError: Iterator page I/O error.

I just logged out and logged back in without issue; perhaps it was a temporary gremlin?

I am still getting this error both on tests and when I try to log in:

[Y@z-FeLrRRvlf4qIMcOt4AAAAFc] /w/index.php?returnto=Main+Page&title=Special:UserLogin FileBackendError: Iterator page I/O error.

Given that this stack trace ends at a problem with the Swift backend, I wonder if it was related to T329787: An unknown error occurred in storage backend "global-swift-eqiad" on Beta Cluster?

I just re-ran the tests and it looks like the relevant test is passing now. Thanks!