
glogger crashes regularly in mw-on-k8s containers
Closed, Resolved, Public

Description

I happened to be looking at the live logs for some mw-on-k8s pods recently and I noticed several messages like this:

panic: runtime error: index out of range [1300] with length 1300
goroutine 1 [running]:
main.removeControlChars({0xc0002da000, 0x514, 0x2000})
    /go/glogger/main.go:124 +0x258
main.(*Glogger).Run(0xc00011af58)
    /go/glogger/main.go:157 +0x152
main.main()
    /go/glogger/main.go:182 +0xf3
AH00106: piped log program '/usr/bin/glogger -d -S 16384 -n 127.0.0.1 -P 10200' failed unexpectedly

(This was from pod mw-web.codfw.main-789949d94b-p4jvc container mediawiki-main-httpd).

Logstash report: https://logstash.wikimedia.org/goto/fe6f5e097453f003eec24979eadb3a3f (Thanks @Clement_Goubert)

Event Timeline

I've tried to debug this, without success, and just realized that serviceops wasn't tagged. Can someone on the team with better Go skills than me have a look?

The problem seems to arise because we allocate a byte slice of size len(line), but somehow we try to copy over bytes past that point.

This is caused by this code, which sometimes makes the decoded sequence longer than the original sequence.

I think there is a relatively easy fix, but I'll first try to write a test case that fails with the current code.
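For readers following along, here is a minimal sketch of the failure pattern described above. It is not the actual glogger code: the function names `sanitizeFixed` and `sanitizeAppend` are hypothetical, and the trigger shown (an invalid UTF-8 byte decoding to U+FFFD, which takes 3 bytes when re-encoded) is only one assumed way a decoded sequence can grow past the input length. In the trace above, the slice argument `{0xc0002da000, 0x514, 0x2000}` has length 0x514 = 1300, which matches the out-of-range index.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// sanitizeFixed (hypothetical) mirrors the buggy pattern: the output
// buffer is sized to len(line), but re-encoded runes may need more room.
// The panic is recovered and returned as an error for demonstration only.
func sanitizeFixed(line []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	buf := make([]byte, len(line))
	var tmp [utf8.UTFMax]byte
	n := 0
	for i := 0; i < len(line); {
		r, size := utf8.DecodeRune(line[i:])
		i += size
		if unicode.IsControl(r) {
			continue // drop control characters
		}
		w := utf8.EncodeRune(tmp[:], r)
		for _, b := range tmp[:w] {
			buf[n] = b // panics once n reaches len(buf)
			n++
		}
	}
	return buf[:n], nil
}

// sanitizeAppend (hypothetical) avoids the problem by letting append
// grow the output as needed instead of indexing into a fixed-size buffer.
func sanitizeAppend(line []byte) []byte {
	out := make([]byte, 0, len(line))
	var tmp [utf8.UTFMax]byte
	for i := 0; i < len(line); {
		r, size := utf8.DecodeRune(line[i:])
		i += size
		if unicode.IsControl(r) {
			continue
		}
		w := utf8.EncodeRune(tmp[:], r)
		out = append(out, tmp[:w]...)
	}
	return out
}

func main() {
	// Two invalid bytes: each decodes to U+FFFD, which needs 3 bytes.
	bad := []byte{0xff, 0xff}
	_, err := sanitizeFixed(bad)
	fmt.Println("fixed buffer:", err)
	fmt.Printf("append: %q\n", sanitizeAppend(bad))
}
```

A failing test case for the current code could take the same shape: feed a line containing bytes that expand on re-encoding and assert that the function returns instead of panicking.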

Joe changed the task status from Open to In Progress. Jun 24 2024, 12:20 PM
Joe claimed this task.
Joe triaged this task as High priority.

Change #1051243 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/docker-images/production-images@master] Rebuild images to pick up a new version of glogger

https://gerrit.wikimedia.org/r/1051243

Change #1051243 merged by Giuseppe Lavagetto:

[operations/docker-images/production-images@master] Rebuild images to pick up a new version of glogger

https://gerrit.wikimedia.org/r/1051243

Mentioned in SAL (#wikimedia-operations) [2024-07-02T06:21:16Z] <_joe_> rebuilding httpd-fcgi, mediawiki-httpd images T363342 T368640

This should be solved with this morning's release.