thanos compact crash during downsampling and restart on invalid checksum for large block
Closed, Resolved (Public)

Description

There were alerts about thanos compact restarting. It looks like Thanos can't fully download a large block to process for downsampling, and then fails on the partial block:

Aug 05 14:52:22 thanos-fe2001 thanos-compact[217300]: level=info ts=2021-08-05T14:52:22.348331551Z caller=downsample.go:257 msg="downloaded block" id=01FCAJMWC7R34X9SAW9BBG3KP4 duration=27m27.437833254s
Aug 05 14:52:32 thanos-fe2001 thanos-compact[217300]: level=warn ts=2021-08-05T14:52:32.719729769Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="error executing compaction: firs>
Aug 05 14:52:32 thanos-fe2001 thanos-compact[217300]: level=info ts=2021-08-05T14:52:32.720338438Z caller=http.go:65 service=http/server component=compact msg="internal server is shutting down" err="error executi>
Aug 05 14:52:33 thanos-fe2001 thanos-compact[217300]: level=info ts=2021-08-05T14:52:33.221321273Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err="error e>
Aug 05 14:52:33 thanos-fe2001 thanos-compact[217300]: level=info ts=2021-08-05T14:52:33.221462386Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="error executing compaction: fi>
Aug 05 14:52:33 thanos-fe2001 thanos-compact[217300]: level=error ts=2021-08-05T14:52:33.222682204Z caller=main.go:197 err="read TOC: read TOC: invalid checksum\nopen index file\ngithub.com/thanos-io/thanos/pkg

A few of the blocks that failed:

root@thanos-fe2001:~# grep -e 01FCAZKHWX3M2PZSKPAGXHT1F6 -e 01FCAJMWC7R34X9SAW9BBG3KP4 -e 01FCAVKB28XZVJPD42YSK70T9X  thanos-bucket 
| 01FCAJMWC7R34X9SAW9BBG3KP4 | 22-07-2021 00:00:00 | 05-08-2021 00:00:00 | 336h0m0s       | -296h0m0s       | 6,195,310 | 95,136,077,478  | 811,225,626   | 4          | false       | prometheus=ops,replica=a,site=codfw         | 0s         | compactor |
| 01FCAVKB28XZVJPD42YSK70T9X | 22-07-2021 00:00:00 | 05-08-2021 00:00:00 | 336h0m0s       | -296h0m0s       | 6,187,613 | 95,153,384,053  | 811,187,071   | 4          | false       | prometheus=ops,replica=b,site=codfw         | 0s         | compactor |
| 01FCAZKHWX3M2PZSKPAGXHT1F6 | 22-07-2021 00:00:00 | 05-08-2021 00:00:00 | 336h0m0s       | -296h0m0s       | 7,262,333 | 119,362,059,700 | 1,010,579,012 | 4          | false       | prometheus=ops,replica=b,site=eqiad         | 0s         | compactor |

Notice it took 27m to download a ~200G block, which is too long; from swift's perspective, the client (thanos compact) gave up and disconnected.
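
As a rough check on that figure, the effective download rate can be estimated from the logged duration. This is only a sketch: the block size is the approximate ~200G quoted above, taken here as 200 × 10^9 bytes (an assumption), and `parse_go_duration` is a hypothetical helper written for this note, not part of Thanos. It only understands the h/m/s units that appear in the log line.

```python
import re

def parse_go_duration(text):
    """Convert a Go-style duration such as '27m27.437833254s' to seconds.

    Only the units h, m and s are handled (enough for the log line above).
    """
    units = {"h": 3600.0, "m": 60.0, "s": 1.0}
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)([hms])", text):
        total += float(value) * units[unit]
    return total

# Duration from the "downloaded block" log line.
duration_s = parse_go_duration("27m27.437833254s")

# Assumption: "~200G" read as 200 * 10^9 bytes.
approx_block_bytes = 200e9

rate_mb_s = approx_block_bytes / duration_s / 1e6
print(f"{duration_s:.1f} s, roughly {rate_mb_s:.0f} MB/s")
```

Under those assumptions the download averaged on the order of 120 MB/s.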

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-08-06T07:58:24Z] <godog> test thanos 0.21 on thanos-fe2001 - T288326

One last failure, e.g. for block 01FCAQD458HC0SHPRS3DF1S3TD:

Aug 06 10:53:08 thanos-fe2001 thanos-compact[2505018]: level=warn ts=2021-08-06T10:53:08.649705285Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01FCAQD458HC0SHPRS3DF1S3TD: copy object to file: unexpected EOF"
Aug 06 10:53:08 thanos-fe2001 thanos-compact[2505018]: level=info ts=2021-08-06T10:53:08.649881649Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01FCAQD458HC0SHPRS3DF1S3TD: copy object to file: unexpected EOF"
Aug 06 10:53:09 thanos-fe2001 thanos-compact[2505018]: level=info ts=2021-08-06T10:53:09.150238153Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01FCAQD458HC0SHPRS3DF1S3TD: copy object to file: unexpected EOF"
Aug 06 10:53:09 thanos-fe2001 thanos-compact[2505018]: level=info ts=2021-08-06T10:53:09.150455964Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: download block 01FCAQD458HC0SHPRS3DF1S3TD: copy object to file: unexpected EOF"
Aug 06 10:53:09 thanos-fe2001 thanos-compact[2505018]: level=error ts=2021-08-06T10:53:09.151441712Z caller=main.go:156 err="unexpected EOF\ncopy object to file\ngithub.com/thanos-io/tha

Looking at the swift logs, the failure happens during the index download: according to swift, the client (thanos-compact) disconnects (499 status code) on chunk 67 of the multipart upload, even though the whole object has 71 chunks.

Aug  6 10:52:58 thanos-fe2001 proxy-server: - - 06/Aug/2021/10/52/58 GET /v1/AUTH_thanos/thanos%252Bsegments/01FCAQD458HC0SHPRS3DF1S3TD/index/MDg0NWZiNTItOTY5NS00YTg5LTgyZGEtMjE4MGIxZjA3OGE3/64%3Fmultipart-manifest%3Dget HTTP/1.0 200 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29%20%20SLO%20MultipartGET - - 134217728 - txca0da5a1f2b840658f91f-00610d1451 - 0.6507 SLO - 1628247177.883436441 1628247178.534160376 0
Aug  6 10:53:05 thanos-fe2001 proxy-server: - - 06/Aug/2021/10/53/05 GET /v1/AUTH_thanos/thanos%252Bsegments/01FCAQD458HC0SHPRS3DF1S3TD/index/MDg0NWZiNTItOTY5NS00YTg5LTgyZGEtMjE4MGIxZjA3OGE3/65%3Fmultipart-manifest%3Dget HTTP/1.0 200 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29%20%20SLO%20MultipartGET - - 134217728 - txca0da5a1f2b840658f91f-00610d1451 - 6.4785 SLO - 1628247178.535385132 1628247185.013934135 0
Aug  6 10:53:05 thanos-fe2001 proxy-server: - - 06/Aug/2021/10/53/05 GET /v1/AUTH_thanos/thanos%252Bsegments/01FCAQD458HC0SHPRS3DF1S3TD/index/MDg0NWZiNTItOTY5NS00YTg5LTgyZGEtMjE4MGIxZjA3OGE3/66%3Fmultipart-manifest%3Dget HTTP/1.0 200 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29%20%20SLO%20MultipartGET - - 134217728 - txca0da5a1f2b840658f91f-00610d1451 - 0.8108 SLO - 1628247185.015271902 1628247185.826087236 0
Aug  6 10:53:06 thanos-fe2001 proxy-server: - - 06/Aug/2021/10/53/06 GET /v1/AUTH_thanos/thanos%252Bsegments/01FCAQD458HC0SHPRS3DF1S3TD/index/MDg0NWZiNTItOTY5NS00YTg5LTgyZGEtMjE4MGIxZjA3OGE3/67%3Fmultipart-manifest%3Dget HTTP/1.0 499 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29%20%20SLO%20MultipartGET - - 97058816 - txca0da5a1f2b840658f91f-00610d1451 - 0.5190 SLO - 1628247185.827379704 1628247186.346405745 0
Aug  6 10:53:13 thanos-fe2001 proxy-server: 10.2.1.54 10.192.0.192 06/Aug/2021/10/53/13 HEAD /v1/AUTH_thanos/thanos/01FCAQD458HC0SHPRS3DF1S3TD/meta.json HTTP/1.0 200 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29 - - - - txde2d0b834aa740b796af6-00610d1499 - 0.0115 S3 - 1628247193.955774784 1628247193.967234612 0
Aug  6 10:53:13 thanos-fe2001 proxy-server: 10.2.1.54 10.192.0.192 06/Aug/2021/10/53/13 GET /v1/AUTH_thanos/thanos/01FCAQD458HC0SHPRS3DF1S3TD/meta.json HTTP/1.0 200 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29 - - 6771 - tx711639594d1d4633af8af-00610d1499 - 0.0208 S3 - 1628247193.977103710 1628247193.997915030 0
Aug  6 10:53:17 thanos-fe2001 proxy-server: 10.2.1.54 10.192.0.192 06/Aug/2021/10/53/17 HEAD /v1/AUTH_thanos/thanos/01FCAQD458HC0SHPRS3DF1S3TD/meta.json HTTP/1.0 200 - MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1%20%28go1.15.9%29 - - - - txa3601141c56a42baac5fc-00610d149d - 0.0161 S3 - 1628247197.139356375 1628247197.155418158 0
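
To spot the failing request in a longer log, the proxy-server lines can be filtered on status code. The following is a minimal sketch, not a swift tool: `parse_line` is a hypothetical helper, and the field positions (status, then four fields, then bytes sent) are inferred from the lines quoted above rather than taken from a format specification. The two sample lines are the requests for chunks 66 and 67, copied from the log.

```python
import re

# Method, path, protocol, status, then referer, user agent, auth token,
# bytes received and bytes sent, as they appear in the lines above.
LINE_RE = re.compile(
    r"(?P<method>GET|HEAD|PUT|POST|DELETE) (?P<path>\S+) HTTP/\d\.\d "
    r"(?P<status>\d{3}) \S+ \S+ \S+ \S+ (?P<bytes_sent>\S+)"
)
# The chunk number is the last path component before the encoded query string.
SEGMENT_RE = re.compile(r"/(\d+)%3Fmultipart-manifest")

def parse_line(line):
    """Return status, chunk number and bytes sent for one log line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    seg = SEGMENT_RE.search(m["path"])
    sent = m["bytes_sent"]
    return {
        "status": int(m["status"]),
        "segment": int(seg.group(1)) if seg else None,
        "bytes_sent": int(sent) if sent.isdigit() else None,
    }

PREFIX = (
    "GET /v1/AUTH_thanos/thanos%252Bsegments/01FCAQD458HC0SHPRS3DF1S3TD/index/"
    "MDg0NWZiNTItOTY5NS00YTg5LTgyZGEtMjE4MGIxZjA3OGE3/"
)
AGENT = (
    "MinIO%20%28linux%3B%20amd64%29%20minio-go/v7.0.10%20thanos-compact/0.21.1"
    "%20%28go1.15.9%29%20%20SLO%20MultipartGET"
)
SAMPLE = [
    "Aug  6 10:53:05 thanos-fe2001 proxy-server: - - 06/Aug/2021/10/53/05 "
    + PREFIX + "66%3Fmultipart-manifest%3Dget HTTP/1.0 200 - " + AGENT
    + " - - 134217728 - txca0da5a1f2b840658f91f-00610d1451 - 0.8108 SLO - "
    "1628247185.015271902 1628247185.826087236 0",
    "Aug  6 10:53:06 thanos-fe2001 proxy-server: - - 06/Aug/2021/10/53/06 "
    + PREFIX + "67%3Fmultipart-manifest%3Dget HTTP/1.0 499 - " + AGENT
    + " - - 97058816 - txca0da5a1f2b840658f91f-00610d1451 - 0.5190 SLO - "
    "1628247185.827379704 1628247186.346405745 0",
]

for line in SAMPLE:
    rec = parse_line(line)
    if rec and rec["status"] != 200:
        print(f"chunk {rec['segment']}: status {rec['status']} "
              f"after {rec['bytes_sent']} bytes")
```

On the sample this flags only chunk 67, which stopped after 97058816 of the 134217728 bytes that the complete chunks transferred.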

I can download the whole block with s3cmd just fine, both the index and the thanos chunks (as opposed to the swift chunks used for multipart uploads):

...
download: 's3://thanos/01FCAQD458HC0SHPRS3DF1S3TD/index' -> './01FCAQD458HC0SHPRS3DF1S3TD/index'  [313 of 314]
 9466906386 of 9466906386   100% in   44s   204.18 MB/s  done
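
The index size reported by s3cmd can be cross-checked against the chunk count seen in the swift logs. This is a quick sketch, assuming every swift chunk except the last is 134217728 bytes, which is the size logged for the complete chunks:

```python
index_bytes = 9466906386   # index size from the s3cmd output
chunk_bytes = 134217728    # bytes sent per complete chunk in the swift logs

# Number of full chunks, plus one partial chunk if anything is left over.
full_chunks, remainder = divmod(index_bytes, chunk_bytes)
total_chunks = full_chunks + (1 if remainder else 0)

print(f"{total_chunks} chunks, last one {remainder} bytes")
```

Under that assumption the index splits into 71 chunks, which matches the count seen in the swift logs.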

Mentioned in SAL (#wikimedia-operations) [2021-08-06T12:56:11Z] <godog> test thanos 0.22 on thanos-fe2001 - T288326

So 0.22 didn't work (I opened an upstream issue for it: https://github.com/thanos-io/thanos/issues/4531), but 0.21.1 does seem to work as expected, without errors or crashes.

Mentioned in SAL (#wikimedia-operations) [2021-08-10T08:06:46Z] <godog> upload thanos 0.21.1-1 and upgrade prometheus1004 / thanos-fe2001 to it - T288326

fgiunchedi claimed this task.
fgiunchedi closed subtask T288604: Upgrade Thanos to 0.21.1 as Resolved.

Boldly resolving, upgrade completed