
Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself
Closed, Resolved · Public · BUG REPORT

Description

Users on Commons are reporting an increased number of upload errors, particularly with chunked upload. https://commons.wikimedia.org/wiki/Commons:Village_pump#Problems_with_uploading_files_above_100_MiB

On Logstash, there is a large spike in HTTP 502 (Bad Gateway) errors from 'SwiftFileBackend::doGetLocalCopyMulti' (and other Swift functions), about 10,000 per hour, starting around July 12 16:00 UTC. https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=h@3f6e3a2&_a=h@4a68d83

All this suggests something is partially broken with swift.

Event Timeline

tstarling triaged this task as Unbreak Now! priority. Jul 15 2022, 12:06 AM

Logstash search for SwiftFileBackend

SwiftFileBackend.png (298×544 px, 20 KB)

You can see it in the nginx log sizes on ms-fe1010:

-rw-r----- 1 www-data www-data   6904870 Jul 15 00:24 unified.error.log
-rw-r----- 1 www-data www-data 667552422 Jul 15 00:00 unified.error.log.1
-rw-r----- 1 www-data www-data  89145055 Jul 14 00:00 unified.error.log.2.gz
-rw-r----- 1 www-data www-data  25893050 Jul 13 00:00 unified.error.log.3.gz
-rw-r----- 1 www-data www-data       741 Jul 11 21:43 unified.error.log.4.gz
-rw-r----- 1 www-data www-data      1741 Jul 10 21:29 unified.error.log.5.gz

Typical log entry:

2022/07/15 00:23:11 [error] 738#738: *1027446107 connect() failed (110: Connection timed out) while connecting to upstream, client: 10.64.48.104, server: ms-fe.svc.eqiad.wmnet, request: "GET /wikipedia/commons/thumb/8/8f/Reese_Air_Force_Base_-_Welcome_with_T-33_Shooting_Star_Jet_Trainer.jpg/240px-Reese_Air_Force_Base_-_Welcome_with_T-33_Shooting_Star_Jet_Trainer.jpg HTTP/1.1", upstream: "http://10.64.16.92:80/wikipedia/commons/thumb/8/8f/Reese_Air_Force_Base_-_Welcome_with_T-33_Shooting_Star_Jet_Trainer.jpg/240px-Reese_Air_Force_Base_-_Welcome_with_T-33_Shooting_Star_Jet_Trainer.jpg", host: "upload.wikimedia.org", referrer: "https://commons.wikimedia.org/"

I restarted swift-proxy on ms-fe1010, which I think has fixed it. Here's how I realised it was a problem specific to ms-fe1010:

ms-fe-cpu.png (537×743 px, 88 KB)

tstarling lowered the priority of this task from Unbreak Now! to Medium. Jul 15 2022, 12:43 AM

Logstash, CPU usage and nginx logs all show recovery. I will leave it open at reduced priority until the relevant SRE folks see it, for post mortem analysis and followup.

ms-fe1010 was flapping yesterday, with its Swift frontend & backend checks reporting critical.

Example:

01:19:08 <icinga-wm> PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.094 second response time https://wikitech.wikimedia.org/wiki/Swift
00:53:52 <icinga-wm> PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.297 second response time https://wikitech.wikimedia.org/wiki/Swift

This also stopped as soon as Tim restarted swift-proxy.

@tstarling thanks for fixing.

What is surprising to me, at least, is that the Grafana Swift dashboards don't reflect this: you can see a brief spike in read errors around the failure on 12 July, but then it goes back to normal (and no write errors are reported at all). @fgiunchedi is that all as expected?

Also, pybal should stop sending requests to broken servers, so I'm not sure why that didn't happen here.

The combination is particularly perplexing: our Swift dashboard shows what I'd expect to see (a brief spike in errors as the node starts misbehaving, then back to normal, since pybal should have stopped sending it traffic), but that's seemingly not what happened :-/

The Swift dashboard only shows errors reported by Swift, but it was not reporting any errors. It was not aware of its own brokenness. Nginx was seeing connection timeouts, and MediaWiki was seeing the 502s generated by nginx, and both were logging those errors but those error rates were not shown on the Swift dashboard.

Pybal was flapping; you can see that in its logs. A typical cycle:

Jul 14 07:20:15 lvs1019 pybal[14003]: [swift-https_443] INFO: Server ms-fe1010.eqiad.wmnet (enabled/partially up/not pooled) is up
Jul 14 07:20:50 lvs1019 pybal[14003]: [swift-https_443 ProxyFetch] WARN: ms-fe1010.eqiad.wmnet (enabled/up/pooled): Fetch failed (https://localhost/monitoring/frontend), 5.001 s
Jul 14 07:20:50 lvs1019 pybal[14003]: [swift-https_443] ERROR: Monitoring instance ProxyFetch reports server ms-fe1010.eqiad.wmnet (enabled/up/pooled) down: Getting https://localhost/monitoring/frontend took longer than 5 seconds.
Jul 14 07:21:00 lvs1019 pybal[14003]: [swift-https_443] INFO: Server ms-fe1010.eqiad.wmnet (enabled/partially up/not pooled) is up
Jul 14 07:22:05 lvs1019 pybal[14003]: [swift-https_443 ProxyFetch] WARN: ms-fe1010.eqiad.wmnet (enabled/up/pooled): Fetch failed (https://localhost/monitoring/frontend), 5.001 s
Jul 14 07:22:05 lvs1019 pybal[14003]: [swift-https_443] ERROR: Monitoring instance ProxyFetch reports server ms-fe1010.eqiad.wmnet (enabled/up/pooled) down: Getting https://localhost/monitoring/frontend took longer than 5 seconds.
Jul 14 07:22:15 lvs1019 pybal[14003]: [swift-https_443] INFO: Server ms-fe1010.eqiad.wmnet (enabled/partially up/not pooled) is up

So it would be down for 10 seconds, then up for 45 or 65 seconds, then repeat. swift-proxy wasn't down; it was pathologically slow, so it would apparently recover each time it was depooled.
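
To make that cycle concrete, here is a toy sketch of the feedback loop, not pybal's actual implementation: it assumes a ProxyFetch-style check with the 5-second timeout from the log above, run at a fixed interval, against a backend that is slow while pooled (taking full traffic) and quick once depooled.

# Toy model of the flap described above -- not pybal code. Assumed values:
# a 5 s fetch timeout (from "took longer than 5 seconds" in the log) and a
# fixed check interval; the backend's response time depends only on whether
# it is currently pooled, i.e. receiving production traffic.

CHECK_TIMEOUT = 5.0     # seconds, matching the ProxyFetch timeout above
CHECK_INTERVAL = 10     # seconds between checks (illustrative)


def response_time(pooled: bool) -> float:
    """A wedged-but-alive proxy: pathologically slow under load, fine when idle."""
    return 7.0 if pooled else 0.5


def simulate(checks: int = 8) -> None:
    pooled = True
    for i in range(checks):
        healthy = response_time(pooled) <= CHECK_TIMEOUT
        if pooled and not healthy:
            pooled = False          # fetch timed out -> depool
        elif not pooled and healthy:
            pooled = True           # fetch succeeded while idle -> repool
        print(f"t={i * CHECK_INTERVAL:3d}s  healthy={healthy}  pooled={pooled}")


if __name__ == "__main__":
    simulate()

In this toy the server flaps on every check; in the real incident the intervals were longer, but the mechanism is the same: depooling drains the load that makes the check fail, so a later check passes and the server is pooled again.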

Maybe I should have done a root cause analysis instead of just restarting it. Timeline:

  • 15:44:20: Spike of "Inactive DC request HTTP error: 503 Service Unavailable" begins in ms-fe1010:server.log; 9111 such entries between 15:44:20 and 16:33:14
  • 16:33:08 - 16:33:50: 20 log entries like: Inactive DC request HTTP error: 429 Too Many Requests http://thumbor.svc.codfw.wmnet:8800/wikipedia/commons/thumb/8/87/Carte_des_duch%C3%A9z_de_Lorraine_et_Bar_-_%28Tassin%29_-_btv1b8592615x.jpg/1920px-Carte_des_duch%C3%A9z_de_Lorraine_et_Bar_-_%28Tassin%29_-_btv1b8592615x.jpg
  • 16:34:27: well there's your problem
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: Traceback (most recent call last):
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3/dist-packages/eventlet/hubs/poll.py", line 111, in wait#012    listener.cb(fileno)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3/dist-packages/eventlet/greenthread.py", line 221, in main#012    result = function(*args, **kwargs)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/local/lib/python3.9/dist-packages/wmf/rewrite.py", line 60, in inactivedc_request#012    opener.open(url)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 517, in open#012    response = self._open(req, data)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 534, in _open#012    result = self._call_chain(self.handle_open, protocol, protocol +
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: Traceback (most recent call last):
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3/dist-packages/eventlet/hubs/poll.py", line 111, in wait#012    listener.cb(fileno)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3/dist-packages/eventlet/greenthread.py", line 221, in main#012    result = function(*args, **kwargs)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/local/lib/python3.9/dist-packages/wmf/rewrite.py", line 60, in inactivedc_request#012    opener.open(url)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 517, in open#012    response = self._open(req, data)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 534, in _open#012    result = self._call_chain(self.handle_open, protocol, protocol +
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 494, in _call_chain#012    result = func(*args)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 1375, in http_open#012    return self.do_open(http.client.HTTPConnection, req)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/urllib/request.py", line 1350, in do_open#012    r = h.getresponse()
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse#012    response.begin()
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/http/client.py", line 307, in begin#012    version, status, reason = self._read_status()
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/http/client.py", line 268, in _read_status#012    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3.9/socket.py", line 704, in readinto#012    return self._sock.recv_into(b)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 371, in recv_into#012    return self._recv_loop(self.fd.recv_into, 0, buffer, nbytes, flags)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 349, in _recv_loop#012    return recv_meth(*args)
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: TimeoutError: [Errno 110] Connection timed out
Jul 12 16:34:27 ms-fe1010 proxy-server: STDERR: Removing descriptor: 75
  • 16:34: CPU usage started declining
  • 16:35:28: First MediaWiki timeout (Excimer request timeout with Swift in the backtrace)
  • 16:37:27: First nginx timeout logged

There's no sign of the 503 error spike in the Grafana dashboard.
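
For what it's worth, below is a sketch of how that fire-and-forget fetch could be made unable to leak exceptions back into the proxy. This is illustrative only: the helper name, logger, and timeout value are hypothetical, it is not the wmf/rewrite.py code, and the fix actually deployed later in this task simply removes the replication instead.

# Hypothetical, defensive variant of the call seen at wmf/rewrite.py:60 in the
# traceback above (opener.open(url)); not the actual middleware code.
import logging
import urllib.error
import urllib.request

log = logging.getLogger("rewrite")  # hypothetical logger name


def inactivedc_request_guarded(opener: urllib.request.OpenerDirector,
                               url: str, timeout: float = 10.0) -> None:
    """Fire-and-forget fetch that never lets an exception escape its greenthread."""
    try:
        # Bound the wait explicitly rather than relying on the kernel TCP
        # timeout (the traceback shows errno 110 surfacing from recv_into).
        opener.open(url, timeout=timeout)
    except urllib.error.HTTPError as exc:
        # Corresponds to the "Inactive DC request HTTP error: ..." server.log lines.
        log.warning("Inactive DC request HTTP error: %s %s %s", exc.code, exc.reason, url)
    except Exception:
        # Anything else (e.g. TimeoutError) is logged and swallowed, so it
        # cannot propagate into the eventlet hub and wedge the proxy.
        log.exception("Inactive DC request failed: %s", url)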

tstarling renamed this task from Spike in Swift errors to Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself. Jul 16 2022, 6:30 AM
tstarling raised the priority of this task from Medium to High. Jul 18 2022, 8:14 AM

Increasing priority to high since it's an accident waiting to happen.

The WMF rewrite middleware always makes me sad, and here it is doing so again...

AIUI how this is meant to work is that if we've asked thumbor to make us a new thumbnail (inside handle404), we then call eventlet.spawn to make a new eventlet.greenthread.GreenThread which attempts to copy the newly-minted thumbnail to the other DC. This is meant to be lightweight and throwaway, as the code notes:

# We successfully generated a thumbnail on the active DC, send the same request
# blindly to the inactive DC to populate Swift there, not waiting for the response
inactivedc_encodedurl = self.thumborify_url(reqorig, self.inactivedc_thumborhost)
eventlet.spawn(self.inactivedc_request, thumbor_opener, inactivedc_encodedurl)

(we also use eventlet.spawn elsewhere to update the expiry header).

These are meant to be lightweight threads that shouldn't get in the way of serving requests...
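
A standalone illustration of that fire-and-forget pattern (not the middleware itself): the spawned greenthread below dies with a TimeoutError, standing in for opener.open() timing out. The caller is never interrupted, and because nothing ever wait()s on the GreenThread, its traceback ends up being reported by the eventlet hub on stderr, much like the proxy-server STDERR lines quoted above.

# Minimal eventlet fire-and-forget sketch -- not wmf/rewrite.py.
import eventlet


def flaky_side_request():
    # Stand-in for inactivedc_request hitting an unresponsive upstream.
    raise TimeoutError(110, "Connection timed out")


eventlet.spawn(flaky_side_request)   # returns immediately; nobody wait()s on it
print("request handler carries on")
eventlet.sleep(0)                    # yield to the hub, which runs (and buries) the child
print("request handler is still running")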

@fgiunchedi the obvious starter-for-ten approach to stopping this happening again would be to just ditch the cross-DC thumbnail copying, since it seems to be sometimes causing problems. How would you feel about that as an approach?

> @fgiunchedi the obvious starter-for-ten approach to stopping this happening again would be to just ditch the cross-DC thumbnail copying, since it seems to be sometimes causing problems. How would you feel about that as an approach?

For context, IIRC we implemented the "shadow requests to inactive site" for a couple of reasons:

  • have the inactive site thumbnails "primed" already, so there's nothing to do at switchover time
  • have Thumbor in the inactive site receive realistic traffic to e.g. test upgrades (more) safely

Neither is essential to keep Thumbor working, of course, though the former begs the question of what to do at switchover time (maybe doing nothing is fine and Thumbor can just take the incoming traffic? not sure)

tl;dr yes it'd work in the immediate term, with (at least) the caveats I've outlined above

Are there other teams you think we should talk to before turning this off, then?

> Are there other teams you think we should talk to before turning this off, then?

Indeed. I know @hnowlan is working on Thumbor on k8s; on the performance side, perhaps @Krinkle would have some insights re: switchover.

I'm working on T279664. Active/active multi-DC mode for MediaWiki is coming very soon. About a month ago I did a quick review of multi-DC support in the image stack and concluded that it was done already, since ATS already forwards to Swift in the nearest DC. In the Swift configuration on ms-fe2012 I see

thumborhost = thumbor.svc.codfw.wmnet:8800
inactivedc_thumborhost = thumbor.svc.eqiad.wmnet:8800

ms-fe2012 is serving real traffic and forwards to the local Thumbor. This configuration calls eqiad the inactive DC, but really it is just the remote DC. The Grafana dashboards show that the Swift and Thumbor clusters in both DCs are busy.

T201858 contains a generous dose of clue. Gilles said "I suspect that making thumbnail traffic active/active might actually require less effort than the request replication, but we can of course do both as well." Doing both was the chosen option. Request replication was deployed on August 30, 2018, and active/active mode was deployed on September 4.

None of this is related to MediaWiki's multi-DC project. The image stack had its own multi-DC project which was completed in 2018.

@tstarling thanks for finding that history. That is pushing me harder towards "just disable the request replication", particularly given it looks like it is sometimes causing us problems.

Thank you indeed for the context, @tstarling; +1 to disabling the request replication, @MatthewVernon.

Change 816726 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: stop flinging thumbnails at other DC in rewrite.py

https://gerrit.wikimedia.org/r/816726

Change 816726 merged by MVernon:

[operations/puppet@production] swift: stop flinging thumbnails at other DC in rewrite.py

https://gerrit.wikimedia.org/r/816726

Mentioned in SAL (#wikimedia-operations) [2022-07-28T09:24:55Z] <Emperor> rolling restart of swift proxies to apply wmf/rewrite update T313102

MatthewVernon claimed this task.

I've deployed this fix now, so closing this issue.

Change 879827 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Alerts: stop alerting on thumb number mismatch

https://gerrit.wikimedia.org/r/879827