
Thanos bucket operations sporadic errors
Closed, Resolved · Public

Description

I noticed the "thanos compact errors" alert firing (and recovering shortly after):

PROBLEM - Thanos compact has high percentage of failures on alert1001 is CRITICAL: job=thanos-compact

And indeed there have been sporadic errors over the last couple of days:

https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact?viewPanel=12&orgId=1&from=1624729924435&to=1625040554813

The timing nicely aligns with sdf failing on thanos-be1003 (T285664 + T285662)

The error on the thanos-compact side has to do with multipart uploads, e.g.

Jun 29 20:55:36 thanos-fe2001 thanos-compact[5024]: level=error ts=2021-06-29T20:55:36.083160442Z caller=compact.go:386 msg="retriable error" err="compaction: group 0@10531109435386935375: upload of 01F9CRN1CWYQJVEV8Q86EHRX4V failed: upload chunks: upload file /srv/thanos-compact/compact/0@10531109435386935375/01F9CRN1CWYQJVEV8Q86EHRX4V/chunks/000001 as 01F9CRN1CWYQJVEV8Q86EHRX4V/chunks/000001: upload s3 object: One or more of the specified parts could not be found. The part might not have been uploaded, or the specified entity tag might not have matched the part's entity tag.

And I found this bug (fixed in swift 2.21, we're running 2.19 on buster) which sounds promising: https://bugs.launchpad.net/swift/+bug/1636663

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-06-30T08:31:17Z] <godog> remove sdf1 from thanos-be1003 in swift - T285835

I've removed the faulty disk from the swift ring, however the errors didn't stop and multipart uploads of big blocks are still hitting them. The timing of when this behavior started is still suspicious to me: it is pretty close to the sdf failure, which perhaps triggered some other condition (?). Uploads do still eventually succeed though; I'll keep digging for other mitigations as well.
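For reference, taking a failed device out of the ring with the plain swift-ring-builder workflow looks roughly like this (a sketch; the builder file path and the backend's IP are placeholders, the device name is the one from this task):

  # find and remove thanos-be1003's failed sdf1 from the object ring, then rebalance
  swift-ring-builder object.builder search <thanos-be1003-ip>/sdf1
  swift-ring-builder object.builder remove <thanos-be1003-ip>/sdf1
  swift-ring-builder object.builder rebalance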

The compactor eventually does make progress and is able to upload a block; I'm assuming it has to do with large blocks (i.e. I've seen codfw/eqiad compactions fail but not other sites).

herron triaged this task as High priority. Jul 1 2021, 5:17 PM

Status update: the error isn't new (as in, it didn't start appearing on Jun 27th) and thanos-sidecar also sometimes experiences it. We have sporadic errors dating back to April (see also P16757 for the full logs), although only recently have the errors shown up persistently in thanos-compact. Also note that both for sidecar and compact the problems show up only in codfw/eqiad for the "ops" instance, which leads me to believe the root cause has something to do with the data size.

Also WRT the upstream bug https://bugs.launchpad.net/swift/+bug/1636663, it could be related but I couldn't get positive confirmation: there are connection timeouts (and related async pendings) in the logs although only for wdqs_updater operations, not for thanos multipart uploads. A possible strategy to test the new swift version (minimally) would be:

  1. finish/verify that swift on bullseye in pontoon works as expected
  2. depool thanos-fe2001 (see the sketch after this list)
  3. reimage thanos-fe2001 as bullseye
  4. run thanos-compact on thanos-fe2001 and see if the errors persist
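A sketch of the depool step, assuming the usual conftool workflow (the host's exact conftool object tags are not shown here):

  # on a conftool-enabled host: mark thanos-fe2001 as depooled, then verify
  sudo confctl select 'name=thanos-fe2001.codfw.wmnet' set/pooled=no
  sudo confctl select 'name=thanos-fe2001.codfw.wmnet' get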

Mentioned in SAL (#wikimedia-operations) [2021-07-02T09:24:15Z] <godog> test thanos 0.21.1 locally on thanos-fe2001 and depool the host - T285835

This is working as expected: the compactor is compacting again and was able to upload the block it couldn't before. I believe the root cause was a data race in minio-go's multipart uploads (fixed in the newer client bundled with 0.21.1), which would sort of explain the problem with large blocks. I'll let thanos-compact do its thing and schedule a thanos upgrade fleet-wide.

The error "One or more of the specified parts could not be found. The part might not have been uploaded, or the specified entity tag might not have matched the part's entity tag" has appeared sporadically, but the compactor is much more resilient to it now and is able to make progress again.

There are still upload errors from the compactor as noted above, especially when uploading large blocks.

Not sure if it's related yet, but I noticed that swift-recon --time consistently reports a very small time drift on thanos-fe2001 and thanos-fe2002 but not on any other host:

# cumin -s4 -b1 'thanos-fe*' 'swift-recon --time | grep -v "Checking time-sync"'
6 hosts will be targeted:
thanos-fe[2001-2003].codfw.wmnet,thanos-fe[1001-1003].eqiad.wmnet
Ok to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit y
Ok to proceed on 6 hosts? Enter the number of affected hosts to confirm or "q" to quit 6
===== NODE GROUP =====                                                                                                                                                                                              
(4) thanos-fe2003.codfw.wmnet,thanos-fe[1001-1003].eqiad.wmnet                                                                                                                                                      
----- OUTPUT of 'swift-recon --ti...cking time-sync"' -----                                                                                                                                                         
===============================================================================                                                                                                                                     
--> Starting reconnaissance on 111 hosts (object)                                                                                                                                                                   
===============================================================================
111/111 hosts matched, 0 error[s] while checking hosts.
===============================================================================
===== NODE GROUP =====                                                                                                                                                                                              
(1) thanos-fe2002.codfw.wmnet                                                                                                                                                                                       
----- OUTPUT of 'swift-recon --ti...cking time-sync"' -----                                                                                                                                                         
===============================================================================                                                                                                                                     
--> Starting reconnaissance on 111 hosts (object)                                                                                                                                                                   
===============================================================================
!! http://10.192.48.27:6011/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0000 sec
!! http://10.192.48.27:6022/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0002 sec
!! http://10.192.48.27:6019/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0003 sec
!! http://10.192.48.27:6013/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0001 sec
!! http://10.192.32.107:6015/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0002 sec
!! http://10.192.48.27:6010/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0003 sec
!! http://10.192.48.27:6017/recon/time current time is 2021-07-12 08:16:48, but remote is 2021-07-12 08:16:48, differs by 0.0002 sec
104/111 hosts matched, 0 error[s] while checking hosts.
===============================================================================
===== NODE GROUP =====                                                                                                                                                                                              
(1) thanos-fe2001.codfw.wmnet                                                                                                                                                                                       
----- OUTPUT of 'swift-recon --ti...cking time-sync"' -----                                                                                                                                                         
===============================================================================                                                                                                                                     
--> Starting reconnaissance on 111 hosts (object)                                                                                                                                                                   
===============================================================================
!! http://10.192.48.27:6011/recon/time current time is 2021-07-12 08:16:42, but remote is 2021-07-12 08:16:42, differs by 0.0004 sec
!! http://10.192.32.107:6011/recon/time current time is 2021-07-12 08:16:42, but remote is 2021-07-12 08:16:42, differs by 0.0001 sec
!! http://10.192.48.27:6019/recon/time current time is 2021-07-12 08:16:42, but remote is 2021-07-12 08:16:42, differs by 0.0000 sec
!! http://10.192.32.107:6018/recon/time current time is 2021-07-12 08:16:42, but remote is 2021-07-12 08:16:42, differs by 0.0001 sec
!! http://10.192.48.27:6013/recon/time current time is 2021-07-12 08:16:42, but remote is 2021-07-12 08:16:42, differs by 0.0002 sec
!! http://10.192.48.27:6020/recon/time current time is 2021-07-12 08:16:42, but remote is 2021-07-12 08:16:42, differs by 0.0000 sec
105/111 hosts matched, 0 error[s] while checking hosts.
===============================================================================
================

Mentioned in SAL (#wikimedia-operations) [2021-07-12T08:38:11Z] <godog> test a single frontend for thanos-swift / thanos-query to test "bad host" theory - T285835

> This is working as expected: the compactor is compacting again and was able to upload the block it couldn't before. I believe the root cause was a data race in minio-go's multipart uploads (fixed in the newer client bundled with 0.21.1), which would sort of explain the problem with large blocks. I'll let thanos-compact do its thing and schedule a thanos upgrade fleet-wide.

The thanos upgrade on the prometheus hosts does require a prometheus upgrade too, cf. T222113.

Mentioned in SAL (#wikimedia-operations) [2021-07-12T10:01:34Z] <godog> test thanos-compact upload with smaller part size - T285835

I think I was able to mitigate the problem by using a part_size: 32mb setting for multipart uploads from the compactor. This will create more objects in swift (the default part size is 64mb), although it seems that essentially all uploads from the compactor have succeeded since yesterday. After the latest experiments I think this is a (data?) race of some kind within the thanos/minio s3 client and/or a timeout.
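Concretely the mitigation is a part_size entry in the S3 objstore configuration used by thanos-compact; a minimal sketch (the bucket/endpoint match this cluster, the rest of the config is omitted, and Thanos expects the value in bytes):

  type: S3
  config:
    bucket: thanos
    endpoint: thanos-swift.discovery.wmnet
    part_size: 33554432  # 32MB parts instead of the 64MB default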

> I think I was able to mitigate the problem by using a part_size: 32mb setting for multipart uploads from the compactor. This will create more objects in swift (the default part size is 64mb), although it seems that essentially all uploads from the compactor have succeeded since yesterday. After the latest experiments I think this is a (data?) race of some kind within the thanos/minio s3 client and/or a timeout.

Yesterday I enabled minio's HTTP trace and these are the results. This is the POST to finalize the multipart upload, after all parts have been uploaded:

---------START-HTTP---------
POST /thanos/01FAG8610APZBY5MPR3923AR5N/chunks/000006?uploadId=MGNiOGVlZDktNmEyOC00M2NlLWI0ODEtMDA3NWM3NjRkMDUx HTTP/1.1
Host: thanos-swift.discovery.wmnet
User-Agent: MinIO (linux; amd64) minio-go/v7.0.10 thanos-compact/0.21.1 (go1.15.9)
Content-Length: 1155
Authorization: AWS **REDACTED**:**REDACTED**
Date: Tue, 13 Jul 2021 15:53:33 GMT
Accept-Encoding: gzip

HTTP/1.1 400 Bad Request
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 13 Jul 2021 15:53:33 GMT
Server: envoy
X-Amz-Id-2: tx87b5174180a04eb8a3f14-0060edb6fd
X-Amz-Request-Id: tx87b5174180a04eb8a3f14-0060edb6fd
X-Envoy-Upstream-Service-Time: 57
X-Openstack-Request-Id: tx87b5174180a04eb8a3f14-0060edb6fd
X-Trans-Id: tx87b5174180a04eb8a3f14-0060edb6fd

1a0
<?xml version='1.0' encoding='UTF-8'?>
<Error><Code>InvalidPart</Code><Message>One or more of the specified parts could not be found. The part might not have been uploaded, or the specified entity tag might not have matched the part's entity tag.</Message><RequestId>tx87b5174180a04eb8a3f14-0060edb6fd</RequestId><PartNumber>5</PartNumber><UploadId>MGNiOGVlZDktNmEyOC00M2NlLWI0ODEtMDA3NWM3NjRkMDUx</UploadId></Error>
0
---------END-HTTP---------

And part 5 for that upload had just been uploaded successfully:

---------START-HTTP---------
PUT /thanos/01FAG8610APZBY5MPR3923AR5N/chunks/000006?partNumber=5&uploadId=MGNiOGVlZDktNmEyOC00M2NlLWI0ODEtMDA3NWM3NjRkMDUx HTTP/1.1
Host: thanos-swift.discovery.wmnet
User-Agent: MinIO (linux; amd64) minio-go/v7.0.10 thanos-compact/0.21.1 (go1.15.9)
Content-Length: 67108864
Authorization: AWS **REDACTED**:**REDACTED**
Date: Tue, 13 Jul 2021 15:53:31 GMT
Accept-Encoding: gzip

HTTP/1.1 200 OK
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Date: Tue, 13 Jul 2021 15:53:33 GMT
Etag: "01047d2171455d5db11577caf4011a02"
Last-Modified: Tue, 13 Jul 2021 15:53:32 GMT
Server: envoy
X-Amz-Id-2: txe5dc003d1d634fd493c6e-0060edb6fb
X-Amz-Request-Id: txe5dc003d1d634fd493c6e-0060edb6fb
X-Envoy-Upstream-Service-Time: 797
X-Openstack-Request-Id: txe5dc003d1d634fd493c6e-0060edb6fb
X-Trans-Id: txe5dc003d1d634fd493c6e-0060edb6fb
---------END-HTTP---------

So it seems to me that things are working as expected; however, from time to time the containers are not yet updated with the new part and thus the upload fails. This makes me think that the fix for https://bugs.launchpad.net/swift/+bug/1636663 would indeed address this problem.
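One way to double-check that theory would be to list the parts swift believes belong to the in-progress upload right after a part PUT succeeds; a sketch with awscli (assuming credentials for the thanos account are configured, values taken from the trace above):

  aws s3api list-parts \
    --endpoint-url https://thanos-swift.discovery.wmnet \
    --bucket thanos \
    --key 01FAG8610APZBY5MPR3923AR5N/chunks/000006 \
    --upload-id MGNiOGVlZDktNmEyOC00M2NlLWI0ODEtMDA3NWM3NjRkMDUx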

To get a swift with that fix we need a Thanos frontend on Bullseye (which we have to do anyway at some point). Note that for this issue's purposes we can limit Bullseye to one host (the compactor) and even leave the host depooled from traffic, and upgrade the rest at some later point in time.

I'll be doing the following steps to achieve the above:

  • Set up a Bullseye thanos::frontend instance in swift project
  • Iterate/change the role until applies cleanly
  • Verify the changes don't affect (or affect minimally) the Buster (thanos-fe/thanos-be) and Stretch (ms-fe/ms-be) hosts.
  • Merge changes and reimage thanos-fe2001 with Bullseye
  • Verify the issue is fixed (i.e. thanos compact is happy and no uploads fail)

Change 704513 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: test moving compactor to thanos-fe2003

https://gerrit.wikimedia.org/r/704513

Change 704514 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: undeploy statsite from Thanos

https://gerrit.wikimedia.org/r/704514

Change 704513 abandoned by Filippo Giunchedi:

[operations/puppet@production] thanos: test moving compactor to thanos-fe2003

Reason:

Not needed

https://gerrit.wikimedia.org/r/704513

Change 704514 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: undeploy statsite from Thanos

https://gerrit.wikimedia.org/r/704514

Change 704530 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Remove statsite from swift/thanos, replaced by statsd-exporter

https://gerrit.wikimedia.org/r/704530

Change 704531 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: use python3 packages as needed

https://gerrit.wikimedia.org/r/704531

Change 704532 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: move stats scripts to python3

https://gerrit.wikimedia.org/r/704532

Change 704533 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: split wmf rewrite into python3

https://gerrit.wikimedia.org/r/704533

Change 704530 merged by Filippo Giunchedi:

[operations/puppet@production] Remove statsite from swift/thanos, replaced by statsd-exporter

https://gerrit.wikimedia.org/r/704530

Change 704531 merged by Filippo Giunchedi:

[operations/puppet@production] swift: use python3 packages as needed

https://gerrit.wikimedia.org/r/704531

Change 704532 merged by Filippo Giunchedi:

[operations/puppet@production] swift: move stats scripts to python3

https://gerrit.wikimedia.org/r/704532

Change 704533 merged by Filippo Giunchedi:

[operations/puppet@production] swift: split wmf rewrite into python3

https://gerrit.wikimedia.org/r/704533

Change 704551 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: handle bytes vs str for subprocess' output

https://gerrit.wikimedia.org/r/704551

Change 704551 merged by Filippo Giunchedi:

[operations/puppet@production] swift: handle bytes vs str for subprocess' output

https://gerrit.wikimedia.org/r/704551

Change 704747 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use Bullseye on thanos-fe2001

https://gerrit.wikimedia.org/r/704747

Change 704747 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use Bullseye on thanos-fe2001

https://gerrit.wikimedia.org/r/704747

Mentioned in SAL (#wikimedia-operations) [2021-07-15T07:31:27Z] <godog> reimage thanos-fe2001 with bullseye - T285835

Ok, so the Bullseye d-i can't detect link up on the Broadcom 10G NIC, which in the past has meant we have to upgrade the NIC's firmware; I'll do that next.

Mentioned in SAL (#wikimedia-operations) [2021-07-15T10:56:23Z] <volans> commented out cron-spam entries on thanos-fe2001, puppet is disabled, thanos-store.service fails to start - T285835

@fgiunchedi I've commented out the cron-spam entries in /var/spool/cron/crontabs/root because they were sending spam every minute.
AFAICT thanos-store.service fails to start because it can't connect to https://thanos-swift.discovery.wmnet/thanos/?location=, which resolves to 10.2.1.54; that address is allocated on lo and envoy is listening there on 443.
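For anyone reproducing this, the situation can be seen with something along these lines (a sketch):

  dig +short thanos-swift.discovery.wmnet   # resolves to 10.2.1.54
  ip -br addr show dev lo                   # 10.2.1.54 is bound on loopback
  ss -tlnp | grep ':443'                    # envoy is the listener on 443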

[edited as I hit submit too soon by mistake]

[edited previous message as I hit submit too soon by mistake]

Icinga downtime set by volans@cumin2002 for 3:00:00 1 host(s) and their services with reason: Extending downtime post-reimage

thanos-fe2001.codfw.wmnet

I've added 3 more downtime hours to the host as the original one from the reimage is about to expire.
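For reference, extending a host downtime from a cumin host looks roughly like this (a sketch assuming the sre.hosts.downtime cookbook; flags from memory):

  sudo cookbook sre.hosts.downtime --hours 3 \
    -r "Extending downtime post-reimage" 'thanos-fe2001.codfw.wmnet'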

> Ok, so the Bullseye d-i can't detect link up on the Broadcom 10G NIC, which in the past has meant we have to upgrade the NIC's firmware; I'll do that next.

This worked as expected; after the NIC firmware upgrade, Bullseye installed with no problem.

Change 704777 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: use addresses for memcached

https://gerrit.wikimedia.org/r/704777

Change 704780 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: move thanos swift stats reports

https://gerrit.wikimedia.org/r/704780

Change 704780 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: move thanos swift stats reports

https://gerrit.wikimedia.org/r/704780

> I've added 3 more downtime hours to the host as the original one from the reimage is about to expire.

Thank you! Appreciate this and the cronspam fix

The host remains depooled but thanos-compact is running fine. s3api doesn't seem to work out of the box in a mixed cluster (I'm talking with upstream about it), so I've temporarily patched swift on thanos-fe2001 and now things seem to work as expected. I'll let thanos-compact run and see if we run into problems.
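For context, the swift-side pieces involved are the s3api and listing_formats middlewares in proxy-server.conf; roughly (pipeline heavily abbreviated, not our actual production config):

  [pipeline:main]
  pipeline = catch_errors ... listing_formats ... s3api ... proxy-server

  [filter:listing_formats]
  use = egg:swift#listing_formats

  [filter:s3api]
  use = egg:swift#s3api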

Change 704920 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: enable listing_formats on bullseye

https://gerrit.wikimedia.org/r/704920

There are a couple of puppet patches pending but otherwise things seem to work fine on thanos-fe2001!

Change 704777 merged by Filippo Giunchedi:

[operations/puppet@production] swift: use addresses for memcached

https://gerrit.wikimedia.org/r/704777

Change 704920 merged by Filippo Giunchedi:

[operations/puppet@production] swift: enable listing_formats on bullseye

https://gerrit.wikimedia.org/r/704920

fgiunchedi claimed this task.
fgiunchedi updated the task description.

This is complete and the errors are gone. As a nice side effect we have the first Swift + Bullseye host running in production (thanos-fe2001).