I have seen the following crash in the journal of various purged instances:
Jun 13 06:16:49 cp1087 purged[39052]: fatal error: concurrent map read and map write Jun 13 06:16:49 cp1087 purged[39052]: goroutine 1 [running]: Jun 13 06:16:49 cp1087 purged[39052]: runtime.throw(0x94c8c6, 0x21) Jun 13 06:16:49 cp1087 purged[39052]: /usr/lib/go-1.14/src/runtime/panic.go:1112 +0x72 fp=0xc0002219b8 sp=0xc000221988 pc=0x43b0a2 Jun 13 06:16:49 cp1087 purged[39052]: runtime.mapaccess2_faststr(0x8af200, 0xc00011a630, 0x7ffe4eccbeef, 0x14, 0xc0000d4580, 0x9f86e0) Jun 13 06:16:49 cp1087 purged[39052]: /usr/lib/go-1.14/src/runtime/map_faststr.go:116 +0x47c fp=0xc000221a28 sp=0xc0002219b8 pc=0x419adc Jun 13 06:16:49 cp1087 purged[39052]: main.(*KafkaReader).GetLag(0xc0000de050, 0x7ffe4eccbeef, 0x14, 0xc0000d4580) Jun 13 06:16:49 cp1087 purged[39052]: /build/purged-0.15/kafka.go:132 +0x53 fp=0xc000221a88 sp=0xc000221a28 pc=0x839393 Jun 13 06:16:49 cp1087 purged[39052]: main.main() Jun 13 06:16:49 cp1087 purged[39052]: /build/purged-0.15/purged.go:346 +0x68e fp=0xc000221f88 sp=0xc000221a88 pc=0x83d01e Jun 13 06:16:49 cp1087 purged[39052]: runtime.main() Jun 13 06:16:49 cp1087 purged[39052]: /usr/lib/go-1.14/src/runtime/proc.go:203 +0x212 fp=0xc000221fe0 sp=0xc000221f88 pc=0x43d6f2 Jun 13 06:16:49 cp1087 purged[39052]: runtime.goexit() Jun 13 06:16:49 cp1087 purged[39052]: /usr/lib/go-1.14/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc000221fe8 sp=0xc000221fe0 pc=0x46d631
Followed by the stacktraces of all other goroutines. The process then crashes, as expected, and is successfully restarted by systemd:
Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Main process exited, code=exited, status=2/INVALIDARGUMENT Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Failed with result 'exit-code'. Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Consumed 12h 57min 22.356s CPU time. Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Service RestartSec=100ms expired, scheduling restart. Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Scheduled restart job, restart counter is at 12. Jun 13 06:16:49 cp1087 systemd[1]: Stopped Purger for ATS and Varnish. Jun 13 06:16:49 cp1087 systemd[1]: purged.service: Consumed 12h 57min 22.356s CPU time. Jun 13 06:16:49 cp1087 systemd[1]: Started Purger for ATS and Varnish. Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Listening for topics eqiad.resource-purge,codfw.resource-purge Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Start consuming topics [eqiad.resource-purge codfw.resource-purge] from kafka Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Reading from 239.128.0.112,239.128.0.115 with maximum datagram size 4096 Jun 13 06:16:49 cp1087 purged[43392]: 2020/06/13 06:16:49 Process purged started with 48 backend and 4 frontend workers. Metrics at :2112/metrics
It looks like we have a race condition accessing maxts.