
Beta SWIFT seems to be broken
Closed, Resolved (Public)

Event Timeline

Beta PrivateSettings refers to a nonexistent server deployment-ms-fe02:

$wmfSwiftConfig = array( 'eqiad' => array(
        'authUrl'            => 'http://deployment-ms-fe02.deployment-prep.eqiad.wmflabs/auth',
        'cirrusAuthUrl'      => 'http://deployment-ms-fe02.deployment-prep.eqiad.wmflabs/auth/v1.0',

I changed it to point to deployment-ms-fe03. I'm not sure if it helped, but FileOperation.log now has HTTP 404s instead of 503s.

An attempt to upload a file fails with Could not create directory "mwstore://local-multiwrite/local-public/c/c7".

Tested manually, the current Swift config in PrivateSettings seems to work:

tgr@deployment-ms-fe03:~$  curl -v http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/auth/v1.0 -H x-auth-user:mw:media -H x-auth-key:...
*   Trying 172.16.5.163...
* TCP_NODELAY set
* Connected to deployment-ms-fe03.deployment-prep.eqiad.wmflabs (172.16.5.163) port 80 (#0)
> GET /auth/v1.0 HTTP/1.1
> Host: deployment-ms-fe03.deployment-prep.eqiad.wmflabs
> User-Agent: curl/7.52.1
> Accept: */*
> x-auth-user:mw:media
> x-auth-key:...
> 
< HTTP/1.1 200 OK
< X-Storage-Url: http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/v1/AUTH_mw
< X-Auth-Token-Expires: 28639
< X-Auth-Token: AUTH_...
< Content-Type: text/html; charset=UTF-8
< X-Storage-Token: AUTH_...
< X-Trans-Id: ...
< Content-Length: 0
< Date: Fri, 05 Mar 2021 12:58:28 GMT
< 
* Curl_http_done: called premature == 0
* Connection #0 to host deployment-ms-fe03.deployment-prep.eqiad.wmflabs left intact
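
For anyone who wants to repeat the check above without curl, here is a minimal Python sketch of the same v1.0 auth exchange. It only relies on what the transcript shows (the `x-auth-user` / `x-auth-key` request headers and the `X-Storage-Url` / `X-Auth-Token` response headers); the function names are made up for illustration and the key is something you would have to supply yourself.

```python
import urllib.request


def swift_auth_headers(user, key):
    """Request headers for a Swift v1.0-style auth call, as in the curl test."""
    return {"X-Auth-User": user, "X-Auth-Key": key}


def parse_auth_response(headers):
    """Return (storage_url, token) from the auth response headers.

    `headers` can be a plain dict or an http.client.HTTPMessage; header
    names are matched case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return lowered["x-storage-url"], lowered["x-auth-token"]
    except KeyError as exc:
        raise ValueError(f"auth response is missing header {exc}") from None


def authenticate(auth_url, user, key, timeout=10):
    """Perform the auth request; raises urllib.error.HTTPError on 4xx/5xx."""
    request = urllib.request.Request(auth_url, headers=swift_auth_headers(user, key))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_auth_response(response.headers)
```

Calling `authenticate("http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/auth/v1.0", "mw:media", key)` from inside the project should return the storage URL and a token if the auth endpoint is healthy. Note that a working auth endpoint says nothing about whether the storage backends behind it are healthy.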

> I changed it to point to deployment-ms-fe03. I'm not sure if it helped, but FileOperation.log now has HTTP 404s instead of 503s.

It's all 503s again.

$wmfSwiftConfig['eqiad']['thumborSecret'] does not match swift::params::account_keys::mw_thumbor (or any other hiera value I could find); I'm not sure whether it should, though, and changing it didn't yield any improvement.

Trying to view https://upload.beta.wmflabs.org/wikipedia/en/f/fb/Green_Park_tube_station.jpeg using https://hidester.com/proxy/ results in a form:

401
Authorization Required
The site https://upload.beta.wmflabs.org is requesting a username and password to access the realm "AUTH_mw".

> Beta PrivateSettings refers to a nonexistent server deployment-ms-fe02:
>
> $wmfSwiftConfig = array( 'eqiad' => array(
>         'authUrl'            => 'http://deployment-ms-fe02.deployment-prep.eqiad.wmflabs/auth',
>         'cirrusAuthUrl'      => 'http://deployment-ms-fe02.deployment-prep.eqiad.wmflabs/auth/v1.0',
>
> I changed it to point to deployment-ms-fe03. I'm not sure if it helped, but FileOperation.log now has HTTP 404s instead of 503s.

How was it ever able to work? Did the server exist in the past? Or maybe it's actually using a different configuration file?

Note: the 301 redirect to commons.wikimedia.org at https://upload.beta.wmflabs.org/ still works.

> It's all 503s again.

deployment-mwlog01:/srv/mw-log/FileOperation.log at least has 404s when trying to access URLs under http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/v1/AUTH_mw/.

> $wmfSwiftConfig['eqiad']['thumborSecret'] does not match swift::params::account_keys::mw_thumbor (or any other hiera value I could find); I'm not sure whether it should, though, and changing it didn't yield any improvement.

I'm not sure either whether they should match, but the keys seem to be spread all over operations/puppet, Horizon, and labs/private (both the public repository and private commits on deployment-puppetmaster04).

> How was it ever able to work? Did the server exist in the past? Or maybe it's actually using a different configuration file?

If the current one is named 03 I'm guessing 02 did exist at some point. No clue when it was deleted.

Notably deployment-ms-fe03:/var/log/swift/server.log has errors like these:

Mar 17 17:53:18 deployment-ms-fe03 proxy-server: ERROR Insufficient Storage 172.16.7.114:6002/lv-a1 (txn: [not sure what this is, redacted just in case])
Mar 17 17:53:18 deployment-ms-fe03 proxy-server: Account HEAD returning 503 for [507, 507] (txn: [not sure what this is, redacted just in case])
Mar 17 17:53:18 deployment-ms-fe03 proxy-server: Account HEAD returning 503 for [] (txn: [not sure what this is, redacted just in case])

172.16.7.114 is deployment-ms-be05.
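
To see which backends the proxy is complaining about without reading the log by eye, something like the following Python sketch could tally the "Insufficient Storage" lines per backend and device. The regular expression is written against the three log lines quoted above only; other Swift versions may format these messages differently, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "ERROR Insufficient Storage 172.16.7.114:6002/lv-a1 (txn: ...)"
INSUFFICIENT_STORAGE = re.compile(
    r"ERROR Insufficient Storage "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)/(?P<device>\S+)"
)


def failing_devices(lines):
    """Count Insufficient Storage errors per (ip, port, device)."""
    counts = Counter()
    for line in lines:
        match = INSUFFICIENT_STORAGE.search(line)
        if match:
            key = (match["ip"], int(match["port"]), match["device"])
            counts[key] += 1
    return counts
```

Usage would be along the lines of `failing_devices(open("/var/log/swift/server.log"))`; the backend IPs can then be mapped to hostnames as done above for 172.16.7.114.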

Need to go now, I'll try to investigate more when I have time for that.

On deployment-ms-be05:

taavi@deployment-ms-be05:~$ sudo systemctl
[...]
● srv-swift\x2dstorage-lv\x2da1.mount                                       loaded failed failed    /srv/swift-storage/lv-a1
[...]
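
In case the odd unit name is confusing: systemd derives mount unit names from the mount point by turning `/` into `-` and hex-escaping most other special characters, including literal dashes (hence `\x2d`). A rough Python sketch of that rule is below. It is a simplified approximation of what `systemd-escape --path` does, not a reimplementation, and it does not resolve `.` or `..` path components.

```python
def systemd_escape_path(path):
    """Approximate systemd's path escaping for unit names (simplified)."""
    # Drop empty components so leading, trailing and doubled slashes vanish.
    trimmed = "/".join(part for part in path.split("/") if part)
    if not trimmed:
        return "-"  # the root directory is special-cased

    escaped = []
    for index, char in enumerate(trimmed):
        is_plain = char.isascii() and (char.isalnum() or char in ":_")
        if char == "/":
            escaped.append("-")
        elif is_plain or (char == "." and index > 0):
            escaped.append(char)
        else:
            # Everything else, including "-", becomes \xXX per UTF-8 byte.
            escaped.extend(f"\\x{byte:02x}" for byte in char.encode("utf-8"))
    return "".join(escaped)


def mount_unit_name(mount_point):
    """Name of the .mount unit systemd generates for a mount point."""
    return systemd_escape_path(mount_point) + ".mount"
```

With this, `mount_unit_name("/srv/swift-storage/lv-a1")` gives the unit name shown in the `systemctl` output above. On a shell command line the backslashes themselves need escaping or quoting.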

I tried to mount it, no luck:

taavi@deployment-ms-be05:~$ sudo systemctl status srv-swift\\x2dstorage-lv\\x2da1.mount
● srv-swift\x2dstorage-lv\x2da1.mount - /srv/swift-storage/lv-a1
   Loaded: loaded (/etc/fstab; generated; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2021-03-18 06:19:41 UTC; 7s ago
    Where: /srv/swift-storage/lv-a1
     What: /dev/disk/by-label/swift-lv-a1
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 19963 ExecMount=/bin/mount /dev/disk/by-label/swift-lv-a1 /srv/swift-storage/lv-a1 -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 (code=exited, status=32)

Mar 18 06:19:41 deployment-ms-be05 systemd[1]: Mounting /srv/swift-storage/lv-a1...
Mar 18 06:19:41 deployment-ms-be05 systemd[1]: srv-swift\x2dstorage-lv\x2da1.mount: Mount process exited, code=exited status=32
Mar 18 06:19:41 deployment-ms-be05 systemd[1]: Failed to mount /srv/swift-storage/lv-a1.
Mar 18 06:19:41 deployment-ms-be05 systemd[1]: srv-swift\x2dstorage-lv\x2da1.mount: Unit entered failed state.

and found this in syslog:

Mar 18 06:19:41 deployment-ms-be05 kernel: XFS (dm-0): unknown mount option [nobarrier].

nobarrier was removed in kernel 4.19. deployment-ms-be05 is running Stretch with a 4.19 kernel, but the production Puppet manifest assumes that Buster is on 4.19 and Stretch is on an older kernel. Hmm.

After speaking with WMCS staff on IRC, I believe most (all?) Cloud VPS machines have had their kernel updated to 4.19. The options are to modify the Puppet manifest to check the kernel version instead of the Debian release, or to downgrade the kernel; I'm not sure which one is better. Thoughts?
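
To illustrate the first option, here is a small Python sketch of choosing the XFS mount options from the running kernel version rather than from the distribution release. It is not the Puppet code and makes no claim about how the manifest is actually structured; the option list is copied from the failing `ExecMount` line, the 4.19 cutoff comes from the observation above, and the function names and example release strings are illustrative.

```python
import re

# First kernel series that rejects the XFS "nobarrier" mount option.
NOBARRIER_REMOVED_IN = (4, 19)


def kernel_version(release):
    """Parse (major, minor) from a release string such as '4.19.0-14-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def xfs_mount_options(release):
    """Mount options for the Swift storage volume on the given kernel."""
    options = ["noatime", "nodiratime"]
    # Compare numerically as tuples: as strings, "4.9" would sort after "4.19".
    if kernel_version(release) < NOBARRIER_REMOVED_IN:
        options.append("nobarrier")
    options.append("logbufs=8")
    return ",".join(options)
```

On a live host the input would be `platform.release()`. A string such as `4.9.0-13-amd64` keeps `nobarrier`, while `4.19.0-14-amd64` drops it, regardless of which Debian release is installed.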

It seems a bit silly to downgrade just to avoid a version check in Puppet (although I have no idea how hard that is to implement).

> After speaking with WMCS staff on IRC, I believe most (all?) Cloud VPS machines have had their kernel updated to 4.19. The options are to modify the Puppet manifest to check the kernel version instead of the Debian release, or to downgrade the kernel; I'm not sure which one is better. Thoughts?

If you are inferring the kernel version from the Debian version, well that's not ideal and you should check the kernel version instead. But also, why not just remove the nobarrier option while/until you figure it out? I have exploits to test, and I imagine that testing the Commons app (which regularly happens on betacommons) hasn't been a breeze lately.

And if performance isn't affected too badly, just leave out the nobarrier option permanently. No need to do whatever version check.

Change 673250 had a related patch set uploaded (by Majavah; owner: Majavah):
[operations/puppet@production] swift: compare kernel version directly

https://gerrit.wikimedia.org/r/673250

The above patch should fix this issue. I scheduled it for the Puppet request window later today instead of just cherry-picking it, because I'm not familiar with Swift or its file storage and want someone who actually knows what it does to look at it.

Change 673250 merged by Jbond:
[operations/puppet@production] swift: compare kernel version directly

https://gerrit.wikimedia.org/r/673250

Swift seems to be working now, please re-open or ping me on IRC if you notice problems with it.

Mentioned in SAL (#wikimedia-releng) [2021-03-18T13:20:33Z] <Majavah> manually systemctl daemon-reload && systemctl start srv-swift\\x2dstorage-lv\\x2da1.mount on deployment-ms-be* nodes for T276179