Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T329592 beta cluster down
Resolved | | Zabe | T329577 deployment-db10 databases are broken
Resolved | BUG REPORT | dcaro | T329535 Cloud Ceph outage 2023-02-13
Resolved | | dcaro | T329709 [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner
Resolved | | dcaro | T329711 [ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity
Open | | dcaro | T329778 [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network
Resolved | Request | Papaul | T330754 hw troubleshooting: Link hard down (probably cable) for cloudcephosd2002-dev.codfw.wmnet
Resolved | | cmooney | T329799 Add network-layer protections to avoid inadvertently lowering IRB MTU
Resolved | | Physikerwelt | T329747 Mathoid on beta cluster is down
Event Timeline
-- Logs begin at Tue 2023-02-14 02:48:28 UTC, end at Tue 2023-02-14 15:42:43 UTC. --
Feb 14 15:38:30 deployment-cache-text07 systemd[1]: Starting HAProxy Load Balancer...
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: Traceback (most recent call last):
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 291, in <module>
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: main()
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 284, in main
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: certs_fetch_ocsp(out_tempfile, args)
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 209, in certs_fetch_ocsp
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: (ocsp_text, ocsp_err) = check_output_errtext(cmd)
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 102, in check_output_errtext
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: (" ".join(args), p.returncode, p_err))
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: Exception: Command openssl ocsp -resp_text -respout /var/cache/ocsp/update-ocsp-_omnngyi.tmp/digicert-2021-ecdsa-unified.ocsp -issuer /etc/ssl/certs/ebc232bc.0 -verify_other /etc/ssl/certs/ebc232bc.0 -url http://ocsp.digicert.com -header Host=ocsp.digicert.com -cert /etc/ssl/localcerts/digicert-2021-ecdsa-unified.crt failed with exit code 1, stderr:
Feb 14 15:38:30 deployment-cache-text07 update-ocsp-all[15462]: OCSP update failed for /etc/update-ocsp.d/digicert-2021-ecdsa-unified.conf
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: Traceback (most recent call last):
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 291, in <module>
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: main()
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 284, in main
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: certs_fetch_ocsp(out_tempfile, args)
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 209, in certs_fetch_ocsp
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: (ocsp_text, ocsp_err) = check_output_errtext(cmd)
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: File "/usr/local/sbin/update-ocsp", line 102, in check_output_errtext
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: (" ".join(args), p.returncode, p_err))
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: Exception: Command openssl ocsp -resp_text -respout /var/cache/ocsp/update-ocsp-xst48s9v.tmp/digicert-2021-rsa-unified.ocsp -issuer /etc/ssl/certs/e83d98dd.0 -verify_other /etc/ssl/certs/e83d98dd.0 -url http://ocsp.digicert.com -header Host=ocsp.digicert.com -cert /etc/ssl/localcerts/digicert-2021-rsa-unified.crt failed with exit code 1, stderr:
Feb 14 15:38:31 deployment-cache-text07 update-ocsp-all[15462]: OCSP update failed for /etc/update-ocsp.d/digicert-2021-rsa-unified.conf
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: haproxy.service: Control process exited, code=exited, status=1/FAILURE
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: haproxy.service: Failed with result 'exit-code'.
Feb 14 15:38:31 deployment-cache-text07 systemd[1]: Failed to start HAProxy Load Balancer.
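To pull just the failing OCSP configs out of a journal dump like the one above, something along these lines may help. It is only a sketch: the function name `failed_ocsp_configs` is made up for illustration, and it assumes one journal entry per line with the "OCSP update failed for" wording seen in this log.

```shell
# Hypothetical helper: read journal text on stdin and print each config
# path that appears in an "OCSP update failed for <path>" line, once.
failed_ocsp_configs() {
    sed -n 's/.*OCSP update failed for \(.*\)$/\1/p' | sort -u
}
```

It could be fed with something like `journalctl -u haproxy.service | failed_ocsp_configs`, assuming the messages from the pre-start script are attributed to the haproxy unit on the host in question.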
hrm, so what's changed about deployment-cache-text07?
The haproxy systemd unit file definition contains an ExecStartPre that's failing:
ExecStartPre=/usr/local/sbin/update-ocsp-all
And that file was edited on Feb 9th:
thcipriani@deployment-cache-text07:/etc/systemd/system$ ls -lhA | grep haproxy
-rw-r--r-- 1 root root 710 Feb 9 16:53 haproxy.service
But the configuration for the OCSP check has been on deployment-cache-text07 since October:
thcipriani@deployment-cache-text07:~$ ls -lhA /etc/update-ocsp.d
total 12K
-r--r--r-- 1 root root 138 Oct 17 11:40 digicert-2021-ecdsa-unified.conf
-r--r--r-- 1 root root 134 Oct 17 11:40 digicert-2021-rsa-unified.conf
dr-xr-xr-x 2 root root 4.0K Oct 17 11:41 hooks
We could set profile::cache::haproxy::do_ocsp: false, but I have no idea if that's the right thing—I'm a bit out of my depth troubleshooting haproxy. It looks like the role giving us trouble was updated in Oct of 2021 by @Vgutierrez, maybe it's obvious to @Vgutierrez what's wrong or he could point us in the right direction?
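For anyone retracing this diagnosis, a minimal sketch for listing the pre-start commands a unit file declares, so each can be run by hand to see which one fails. The function name `list_exec_start_pre` is hypothetical, and it only handles plain `ExecStartPre=` lines in a single file (no drop-ins or line continuations).

```shell
# Hypothetical helper: print the command from every ExecStartPre= line
# in the unit file given as $1.
list_exec_start_pre() {
    sed -n 's/^ExecStartPre=//p' "$1"
}
```

For the unit discussed here that would be `list_exec_start_pre /etc/systemd/system/haproxy.service`, which should print the `/usr/local/sbin/update-ocsp-all` command quoted above.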
The culprit was actually the following hiera config for the deployment-cache puppet prefix:
profile::cache::haproxy::unified_certs:
  - digicert-2021-ecdsa-unified
  - digicert-2021-rsa-unified
I got rid of it and manually cleaned up /etc/update-ocsp.d. deployment-prep instances use acme-chief managed TLS material and can perform OCSP stapling as acme-chief provides the .ocsp files:
root@deployment-cache-text07:~# ls -alh /etc/acmecerts/unified/live/*.ocsp
lrwxrwxrwx 1 root haproxy 74 Feb 9 16:53 /etc/acmecerts/unified/live/ec-prime256v1.alt.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/ec-prime256v1.ocsp
lrwxrwxrwx 1 root haproxy 74 Feb 9 16:53 /etc/acmecerts/unified/live/ec-prime256v1.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/ec-prime256v1.ocsp
-rw-r----- 1 root haproxy 503 Feb 13 13:08 /etc/acmecerts/unified/live/ec-prime256v1.ocsp
lrwxrwxrwx 1 root haproxy 69 Feb 9 16:53 /etc/acmecerts/unified/live/rsa-2048.alt.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/rsa-2048.ocsp
lrwxrwxrwx 1 root haproxy 69 Feb 9 16:53 /etc/acmecerts/unified/live/rsa-2048.chained.crt.key.ocsp -> /etc/acmecerts/unified/4dad0611440d4460a49cebb9f6b92c6d/rsa-2048.ocsp
-rw-r----- 1 root haproxy 503 Feb 11 11:09 /etc/acmecerts/unified/live/rsa-2048.ocsp
The same was happening with deployment-cache-upload07; both instances are now happy.
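A quick sanity check that the stapling material is actually in place could look like the sketch below. The function name `check_ocsp_files` is made up for illustration; it only verifies that every `*.ocsp` entry in a directory resolves to a non-empty file, not that the OCSP response itself is valid or fresh.

```shell
# Hypothetical helper: report *.ocsp entries in directory $1 that are
# dangling symlinks, missing, or empty. Returns non-zero if any are found.
check_ocsp_files() {
    dir=$1
    status=0
    for f in "$dir"/*.ocsp; do
        # -e follows symlinks, so a dangling link fails this test.
        [ -e "$f" ] || { echo "missing or dangling: $f"; status=1; continue; }
        # -s is true only for files with a size greater than zero.
        [ -s "$f" ] || { echo "empty: $f"; status=1; }
    done
    return $status
}
```

On the hosts above this would be run as root, for example `check_ocsp_files /etc/acmecerts/unified/live`, since the files are only readable by root and the haproxy group.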
From the CI on restbase that is using beta cluster I am getting this error for mathoid:
{ "type": "internal_http_error", "detail": "connect ECONNREFUSED 185.15.56.41:443", "internalStack": "Error: connect ECONNREFUSED 185.15.56.41:443\n at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1144:16)\n at TCPConnectWrap.callbackTrampoline (internal/async_hooks.js:126:14)", "internalURI": "https://mathoid.beta.math.wmflabs.org/texvcinfo", "internalQuery": "{}", "internalErr": "connect ECONNREFUSED 185.15.56.41:443", "internalMethod": "post" }
According to the DNS records associated with 185.15.56.41, mathoid.beta.math.wmflabs.org seems to be handled by math19.math.wmflabs.org. I don't have access to that WMCS project; could somebody else take a look?
I am still getting this error both in tests and when I try to log in:
[Y@z-FeLrRRvlf4qIMcOt4AAAAFc] /w/index.php?returnto=Main+Page&title=Special:UserLogin FileBackendError: Iterator page I/O error.
Backtrace:
from /srv/mediawiki/php-master/includes/libs/filebackend/SwiftFileBackend.php(952)
#0 /srv/mediawiki/php-master/includes/libs/filebackend/fileiteration/SwiftFileBackendDirList.php(39): SwiftFileBackend->getDirListPageInternal(string, string, NULL, integer, array)
#1 /srv/mediawiki/php-master/includes/libs/filebackend/fileiteration/SwiftFileBackendList.php(112): SwiftFileBackendDirList->pageFromList(string, string, NULL, integer, array)
#2 /srv/mediawiki/php-master/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php(233): SwiftFileBackendList->rewind()
#3 /srv/mediawiki/php-master/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php(203): MediaWiki\Extension\ConfirmEdit\FancyCaptcha\FancyCaptcha->pickImageDir(string, integer, integer)
#4 /srv/mediawiki/php-master/extensions/ConfirmEdit/FancyCaptcha/includes/FancyCaptcha.php(466): MediaWiki\Extension\ConfirmEdit\FancyCaptcha\FancyCaptcha->pickImage()
#5 /srv/mediawiki/php-master/extensions/ConfirmEdit/SimpleCaptcha/SimpleCaptcha.php(1225): MediaWiki\Extension\ConfirmEdit\FancyCaptcha\FancyCaptcha->getCaptcha()
#6 /srv/mediawiki/php-master/extensions/ConfirmEdit/includes/Auth/CaptchaPreAuthenticationProvider.php(76): MediaWiki\Extension\ConfirmEdit\SimpleCaptcha\SimpleCaptcha->createAuthenticationRequest()
#7 /srv/mediawiki/php-master/includes/auth/AuthManager.php(2272): MediaWiki\Extension\ConfirmEdit\Auth\CaptchaPreAuthenticationProvider->getAuthenticationRequests(string, array)
#8 /srv/mediawiki/php-master/includes/auth/AuthManager.php(2250): MediaWiki\Auth\AuthManager->getAuthenticationRequestsInternal(string, array, array, User)
#9 /srv/mediawiki/php-master/includes/specialpage/AuthManagerSpecialPage.php(277): MediaWiki\Auth\AuthManager->getAuthenticationRequests(string, User)
#10 /srv/mediawiki/php-master/includes/specialpage/LoginSignupSpecialPage.php(145): AuthManagerSpecialPage->loadAuth(NULL)
#11 /srv/mediawiki/php-master/includes/specialpage/LoginSignupSpecialPage.php(236): LoginSignupSpecialPage->load(NULL)
#12 /srv/mediawiki/php-master/includes/specialpage/SpecialPage.php(700): LoginSignupSpecialPage->execute(NULL)
#13 /srv/mediawiki/php-master/includes/specialpage/SpecialPageFactory.php(1460): SpecialPage->run(NULL)
#14 /srv/mediawiki/php-master/includes/MediaWiki.php(324): MediaWiki\SpecialPage\SpecialPageFactory->executePath(string, RequestContext)
#15 /srv/mediawiki/php-master/includes/MediaWiki.php(917): MediaWiki->performRequest()
#16 /srv/mediawiki/php-master/includes/MediaWiki.php(573): MediaWiki->main()
#17 /srv/mediawiki/php-master/index.php(50): MediaWiki->run()
#18 /srv/mediawiki/php-master/index.php(46): wfIndexMain()
#19 /srv/mediawiki/w/index.php(3): require(string)
#20 {main}
I just logged out and logged back in without issue; perhaps it was a temporary gremlin?
Given that this stack trace ends at a problem with the Swift backend, I wonder if it was related to T329787: An unknown error occurred in storage backend "global-swift-eqiad" on Beta Cluster?