Page MenuHomePhabricator

several purgeds badly backlogged (> 10 days)
Closed, ResolvedPublic

Description

Looking across the fleet, the purged_event_lag metric looks really bad on four hosts: cp2033, cp1087, cp2037, cp2039.

These servers are now depooled, so there's no present user impact.

I stumbled across this after being forwarded some user reports on Twitter: https://twitter.com/john_overholt/status/1276276247602044933
The reporter was on the US East Coast (actually just a few miles from me), so I looked at eqiad:

✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙🍺 sudo cumin 'A:cp-text and A:eqiad' 'curl -s --resolve en.wikipedia.org:443:127.0.0.1 https://en.wikipedia.org/wiki/Samuel_Osborne | grep Legacy||echo FAIL'
8 hosts will be targeted:                                               
cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet              
Confirm to continue [y/n]? y                                        
===== NODE GROUP =====
(7) cp[1075,1077,1079,1081,1083,1085,1089].eqiad.wmnet
----- OUTPUT of 'curl -s --resolv...egacy||echo FAIL' -----
<h3><span class="mw-headline" id="Legacy">Legacy</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Samuel_Osborne&amp;action=edit&amp;section=1" title="
Edit section: Legacy">edit</a><span class="mw-editsection-bracket">]</span></span></h3>
===== NODE GROUP =====                                               
(1) cp1087.eqiad.wmnet                                              
----- OUTPUT of 'curl -s --resolv...egacy||echo FAIL' -----
FAIL                  
================

https://w.wiki/VDv

Event Timeline

CDanis created this task.Fri, Jun 26, 2:36 AM
Restricted Application added a project: Operations. · View Herald TranscriptFri, Jun 26, 2:36 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
CDanis updated the task description. (Show Details)Fri, Jun 26, 3:01 AM
CDanis updated the task description. (Show Details)Fri, Jun 26, 5:13 AM

Mentioned in SAL (#wikimedia-operations) [2020-06-26T09:46:13Z] <ema> cp2033: restart purged T256444

ema triaged this task as High priority.Fri, Jun 26, 9:47 AM
ema moved this task from Triage to Caching on the Traffic board.Fri, Jun 26, 9:49 AM

Mentioned in SAL (#wikimedia-operations) [2020-06-26T09:55:46Z] <ema> cp1087: restart purged T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-26T09:57:57Z] <ema> cp2037: restart purged T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-26T10:03:53Z] <ema> cp2039: restart purged T256444

ema added a comment.Fri, Jun 26, 10:21 AM

I have identified the misbehaving purged instances with rate(purged_events_received_total{cluster="cache_text", topic="eqiad.resource-purge"}[5m]) == 0 and restarted them.

The processes were maxing out one CPU core and logs had this to say:

Jun 13 06:16:49 cp1087 purged[39052]: fatal error: concurrent map read and map write
Jun 13 06:16:49 cp1087 purged[39052]: goroutine 1 [running]:
Jun 13 06:16:49 cp1087 purged[39052]: runtime.throw(0x94c8c6, 0x21)
Jun 13 06:16:49 cp1087 purged[39052]:         /usr/lib/go-1.14/src/runtime/panic.go:1112 +0x72 fp=0xc0002219b8 sp=0xc000221988 pc=0x43b0a2
Jun 13 06:16:49 cp1087 purged[39052]: runtime.mapaccess2_faststr(0x8af200, 0xc00011a630, 0x7ffe4eccbeef, 0x14, 0xc0000d4580, 0x9f86e0)
Jun 13 06:16:49 cp1087 purged[39052]:         /usr/lib/go-1.14/src/runtime/map_faststr.go:116 +0x47c fp=0xc000221a28 sp=0xc0002219b8 pc=0x419adc
Jun 13 06:16:49 cp1087 purged[39052]: main.(*KafkaReader).GetLag(0xc0000de050, 0x7ffe4eccbeef, 0x14, 0xc0000d4580)
Jun 13 06:16:49 cp1087 purged[39052]:         /build/purged-0.15/kafka.go:132 +0x53 fp=0xc000221a88 sp=0xc000221a28 pc=0x839393
Jun 13 06:16:49 cp1087 purged[39052]: main.main()
Jun 13 06:16:49 cp1087 purged[39052]:         /build/purged-0.15/purged.go:346 +0x68e fp=0xc000221f88 sp=0xc000221a88 pc=0x83d01e
Jun 13 06:16:49 cp1087 purged[39052]: runtime.main()
Jun 13 06:16:49 cp1087 purged[39052]:         /usr/lib/go-1.14/src/runtime/proc.go:203 +0x212 fp=0xc000221fe0 sp=0xc000221f88 pc=0x43d6f2
Jun 13 06:16:49 cp1087 purged[39052]: runtime.goexit()
Jun 13 06:16:49 cp1087 purged[39052]:         /usr/lib/go-1.14/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc000221fe8 sp=0xc000221fe0 pc=0x46d631

Followed by the stacktraces of all other goroutines. The process then crashed, as expected, and was successfully restarted by systemd:

Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Failed with result 'exit-code'.
Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Consumed 12h 57min 22.356s CPU time.
Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Service RestartSec=100ms expired, scheduling restart.
Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Scheduled restart job, restart counter is at 12.
Jun 13 06:16:49 cp1087 systemd[1]: Stopped Purger for ATS and Varnish.
Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Consumed 12h 57min 22.356s CPU time.
Jun 13 06:16:49 cp1087 systemd[1]: Started Purger for ATS and Varnish.
Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Listening for topics eqiad.resource-purge,codfw.resource-purge
Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Start consuming topics [eqiad.resource-purge codfw.resource-purge] from kafka
Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Reading from 239.128.0.112,239.128.0.115 with maximum datagram size 4096
Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Process purged started with 48 backend and 4 frontend workers. Metrics at :2112/metrics

The crash might be unrelated considering that (1) the process is restarted properly after the crash and (2) other purged instances show the crashes too but are processing kafka messages just fine.

Change 608019 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] purged: alert in case of high event lag

https://gerrit.wikimedia.org/r/608019

Krinkle added a subscriber: Krinkle.EditedSat, Jun 27, 2:38 AM

I believe this is resolved now, right? Or Is there still user-facing impact in the form of served cache objects eg more than a few minutes past their purge date?

ema added a comment.Sat, Jun 27, 9:54 AM

I believe this is resolved now, right? Or Is there still user-facing impact in the form of served cache objects eg more than a few minutes past their purge date?

The affected hosts are depooled, so no stale cached object should be served now. However the bug should be kept open because the issue has not been solved nor identified, I've just restarted the affected daemons.

ema added a comment.Mon, Jun 29, 8:22 AM

The issue happened again on cp4025 and a few other nodes. It looks like a deadlock in librdkafka to me, the process is spinning on pthread_cond_wait:

$ sudo timeout 10 perf record -p 42663
$ sudo perf report --stdio
# Overhead  Command       Shared Object       Symbol                                        
# ........  ............  ..................  ..............................................
#
    39.98%  purged        libpthread-2.28.so  [.] pthread_cond_timedwait@@GLIBC_2.3.2
    10.08%  purged        librdkafka.so.1     [.] pthread_cond_timedwait@plt
     8.32%  purged        librdkafka.so.1     [.] 0x0000000000077153
     8.15%  purged        librdkafka.so.1     [.] 0x0000000000040e24
     8.06%  purged        librdkafka.so.1     [.] 0x0000000000076d53
     6.90%  purged        librdkafka.so.1     [.] 0x0000000000040d78
     6.69%  purged        librdkafka.so.1     [.] 0x0000000000076d40
     5.77%  purged        librdkafka.so.1     [.] 0x0000000000076d49
     2.66%  purged        librdkafka.so.1     [.] 0x0000000000040e19
     1.91%  purged        librdkafka.so.1     [.] 0x0000000000040d7f

Also 60 seconds of capture with pprof show that the process is stuck polling for kafka messages. Profile data gathered with:

$ curl http://localhost:2112/debug/pprof/profile?seconds=60 > cpuprof
$ go tool pprof -top cpuprof 
File: purged
Build ID: a42d237979c8ff269b5e0c4f71a44b14f5c3ec8c
Type: cpu
Time: Jun 29, 2020 at 10:02am (CEST)
Duration: 1mins, Total samples = 1mins (99.75%)
Showing nodes accounting for 1mins, 100% of 1mins total
Dropped 4 nodes (cum <= 0.01mins)
      flat  flat%   sum%        cum   cum%
     1mins   100%   100%      1mins   100%  runtime.cgocall
         0     0%   100%      1mins   100%  github.com/confluentinc/confluent-kafka-go/kafka.(*handle).eventPoll
         0     0%   100%      1mins   100%  github.com/confluentinc/confluent-kafka-go/kafka.(*handle).eventPoll.func1
         0     0%   100%      1mins   100%  github.com/confluentinc/confluent-kafka-go/kafka._Cfunc__rk_queue_poll
         0     0%   100%      1mins   100%  github.com/confluentinc/confluent-kafka-go/kafka.consumerReader

The callgraph is indicative too:

Change 608019 merged by Ema:
[operations/puppet@production] purged: alert in case of high event lag

https://gerrit.wikimedia.org/r/c/operations/puppet/ /608019

Mentioned in SAL (#wikimedia-operations) [2020-06-29T08:33:29Z] <ema> cp1087, cp2033, cp2037, cp2039: repool after spending (way) more than 24h depooled T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-29T08:36:38Z] <ema> cp4025: restart purged T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-29T08:40:25Z] <ema> cp2034: restart purged T256444

ema added a comment.EditedMon, Jun 29, 10:18 AM

This confluent-kafka-go issue is similar to what we're seeing: https://github.com/confluentinc/confluent-kafka-go/issues/251

Following the various related librdkafka issues, the following two patches seem particularly promising:

In general, lots and lots of changes have happened between librdkafka 0.11.6 (buster) and 1.4.2 (testing). The question now is whether we want to cherry-pick individual patches against 0.11.6 or if we should backport 1.4.2.

elukey added a subscriber: elukey.Mon, Jun 29, 4:57 PM

There may be another solution, namely creating a new apt component to hold 1.4.x and deploy it selectively where needed (as opposed to roll it out everywhere). In theory on cp nodes it should be fine, varnishkafka should be compatible with 1.x.x API (some time ago I opened T210944 to track this and found only minor issues/nits) and if not we could come up with a quick patch to make it work. Then the rest of the nodes could use either what Debian provides, or the more up to date 1.4.x component. Finally, a we could use this as "test" for 1.x.x librdkafka on a specific use case before using it everywhere.

ema added a comment.Tue, Jun 30, 8:22 AM

There may be another solution, namely creating a new apt component to hold 1.4.x and deploy it selectively where needed (as opposed to roll it out everywhere). In theory on cp nodes it should be fine, varnishkafka should be compatible with 1.x.x API (some time ago I opened T210944 to track this and found only minor issues/nits) and if not we could come up with a quick patch to make it work. Then the rest of the nodes could use either what Debian provides, or the more up to date 1.4.x component. Finally, a we could use this as "test" for 1.x.x librdkafka on a specific use case before using it everywhere.

Well, upstream claims that the new versions are API and ABI compatible, but a function that we're using in varnishkafka (rd_kafka_conf_set_stats_cb) got removed according to T210944. This means that we can't follow the route you propose unless we patch, build, and deploy new varnishkafka packages. Other than the work required to do that, we do know that there are a lot of changes between 0.11.6 and 1.4.2, but we don't know if any of those break things for our rdkafka clients, and I'd like not to find that out while addressing a critical issue like this one. :)

I have backported the two patches that seem relevant for this problem, see P11703. More research needed to see if they might fix T224425 as well.

Well, upstream claims that the new versions are API and ABI compatible, but a function that we're using in varnishkafka (rd_kafka_conf_set_stats_cb) got removed according to T210944.

The function got deprecated but I think it is still usable in 1.x.x..

This means that we can't follow the route you propose unless we patch, build, and deploy new varnishkafka packages. Other than the work required to do that, we do know that there are a lot of changes between 0.11.6 and 1.4.2, but we don't know if any of those break things for our rdkafka clients, and I'd like not to find that out while addressing a critical issue like this one. :)

Sure, a lot of work I agree, I was only suggesting another possible road :)

I have backported the two patches that seem relevant for this problem, see P11703. More research needed to see if they might fix T224425 as well.

Nice!

Mentioned in SAL (#wikimedia-operations) [2020-06-30T10:38:35Z] <ema> cp2040: upgrade librdkafka1 to 0.11.6-1.1wmf1 https://phabricator.wikimedia.org/P11703 T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-30T10:41:19Z] <ema> cp2040: restart purged and varnishkafka to use updated librdkafka1 T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-30T10:59:18Z] <ema> upload librdkafka 0.11.6-1.1wmf1 to buster-wikimedia https://phabricator.wikimedia.org/P11703 T256444

Mentioned in SAL (#wikimedia-operations) [2020-06-30T16:48:06Z] <cdanis> T256444 ✔️ cdanis@cp2030.codfw.wmnet ~ 🕐☕ sudo depool

Mentioned in SAL (#wikimedia-operations) [2020-06-30T16:57:30Z] <cdanis> T256444 restarted purged on cp2030 and repooling

Mentioned in SAL (#wikimedia-operations) [2020-07-01T07:39:06Z] <ema> cp2041: restart purged, varnishkafka after librdkafka1 upgrade to 0.11.6-1.1wmf1 T256444

ema added a comment.Wed, Jul 1, 7:43 AM

Mentioned in SAL (#wikimedia-operations) [2020-06-30T10:41:19Z] <ema> cp2040: restart purged and varnishkafka to use updated librdkafka1 T256444

No issues on cp2040 (upload) since yesterday's restart with 0.11.6-1.1wmf1. I have now upgraded librdkafka1 on cp2041 (text) too, if all goes well in the next few hours I'm gonna carry on with the rest of the cache nodes.

Mentioned in SAL (#wikimedia-operations) [2020-07-01T11:47:56Z] <ema> A:cp upgrade librdkafka1 to 0.11.6-1.1wmf1 and restart purged, varnishkafka T256444

ema added a comment.Thu, Jul 2, 8:52 AM

Mentioned in SAL (#wikimedia-operations) [2020-06-30T16:57:30Z] <cdanis> T256444 restarted purged on cp2030 and repooling

This was the last occurrence of the issue, and no other host has been affected since the librdkafka upgrade yesterday at 2020-07-01T11:50. Let's observe things till tomorrow and then I think we can close this.

ema added a comment.Fri, Jul 3, 8:44 AM

This was the last occurrence of the issue, and no other host has been affected since the librdkafka upgrade yesterday at 2020-07-01T11:50. Let's observe things till tomorrow and then I think we can close this.

Nothing in the last 48 hours, closing.

ema closed this task as Resolved.Fri, Jul 3, 8:44 AM
ema claimed this task.