Description

Replace the custom Deployment and related tooling described at https://wikitech.wikimedia.org/wiki/Tool:Bridgebot#Technical_details with modern Toolforge alternatives by creating a container hosting matterbridge, plus some custom software that will generate the needed configuration file(s) by interpolating envvar secrets into git-hosted templates.
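For illustration, a minimal sketch of what such a generator could look like, assuming a ${VAR}-style template; the file paths here are hypothetical, not the tool's actual layout:

package main

import "os"

func main() {
	// Read the git-hosted template (hypothetical path).
	tmpl, err := os.ReadFile("etc/matterbridge.toml.tmpl")
	if err != nil {
		panic(err)
	}
	// os.ExpandEnv replaces ${MATTERBRIDGE_IRC_LIBERA_BRIDGEBOT_PASSWORD}
	// and similar references with values from the process environment.
	rendered := os.ExpandEnv(string(tmpl))
	// Write the rendered config somewhere only the tool can read it.
	if err := os.WriteFile("etc/matterbridge.toml", []byte(rendered), 0o600); err != nil {
		panic(err)
	}
}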
Status | Subtype | Assigned | Task
---|---|---|---
Stalled | BUG REPORT | bd808 | T305487 Bridgebot freaks out and sends double messages from IRC to Telegram
Resolved | Feature | bd808 | T363028 Replace custom deployment with build service and job service
Resolved | Feature | bd808 | T353559 Figure out how to deploy ZNC using buildpacks
Event Timeline
I recently learned that matterbridge uses the https://github.com/spf13/viper library and its AutomaticEnv feature when processing the config file. This allows envvars like MATTERBRIDGE_IRC_LIBERA_BRIDGEBOT_PASSWORD to set secret values at runtime. That should make the conversion to a custom container image a bit simpler by removing the need for a custom interpolation system for the config file.
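For reference, a minimal sketch of how viper's AutomaticEnv mapping works; the prefix, key replacer, file name, and key path below are illustrative assumptions rather than a quote of matterbridge's actual setup:

package main

import (
	"fmt"
	"strings"

	"github.com/spf13/viper"
)

func main() {
	v := viper.New()
	v.SetConfigFile("matterbridge.toml") // hypothetical config file
	// With this prefix and replacer, reading the key
	// irc.libera.bridgebot.password consults the envvar
	// MATTERBRIDGE_IRC_LIBERA_BRIDGEBOT_PASSWORD before the file.
	v.SetEnvPrefix("matterbridge")
	v.SetEnvKeyReplacer(strings.NewReplacer(".", "_"))
	v.AutomaticEnv()
	if err := v.ReadInConfig(); err != nil {
		panic(err)
	}
	fmt.Println(v.GetString("irc.libera.bridgebot.password"))
}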
$ ssh login.toolforge.org
$ become bridgebot
$ webservice buildservice shell --mount all -m 2G -c 1
$ /layers/heroku_go/go_target/bin/bridgebot -conf /app/etc/testing.toml
[0000] INFO router: (/layers/heroku_go/go_deps/cache/gitlab.wikimedia.org/toolforge-repos/bridgebot-matterbridge@v0.0.0-20240424042617-38c64944bf1d/gateway/router.go:66: github.com/42wim/matterbridge/gateway.(*Router).Start) Parsing gateway testing-irc-telegram
[0000] INFO router: (/layers/heroku_go/go_deps/cache/gitlab.wikimedia.org/toolforge-repos/bridgebot-matterbridge@v0.0.0-20240424042617-38c64944bf1d/gateway/router.go:75: github.com/42wim/matterbridge/gateway.(*Router).Start) Starting bridge: irc.testing
...
[21:07] < wm-bb> Does it work now?
[21:07] < bd808> omg, it did work!
The only thing that didn't seem to work was loading the remotenickformat.tengo script, which I assumed was searched for relative to the config file. It looks like it is loaded relative to the current working directory instead, so I will need to update a bit of config.
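One low-risk fix would be to reference the script by absolute path so the working directory stops mattering; a sketch assuming matterbridge's [tengo] RemoteNickFormat setting and a hypothetical location under the tool's home:

[tengo]
# Hypothetical absolute path; resolves the same regardless of cwd.
RemoteNickFormat = "/data/project/bridgebot/etc/remotenickformat.tengo"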
There is also a bit of trouble with T363417: [builds-builder] golang based images get infinite nested loops for procfile entries, but we should be able to work around that too.
bd808 opened https://gitlab.wikimedia.org/toolforge-repos/bridgebot/-/merge_requests/1
Fix issues found during live testing of initial implementation
bd808 merged https://gitlab.wikimedia.org/toolforge-repos/bridgebot/-/merge_requests/1
Fix issues found during live testing of initial implementation
I have seen one crash on startup in testing, but it was not repeatable. It looks like it was triggered by something the IRC client saw in scrollback when attaching:
[0005] DEBUG irc: (/layers/heroku_go/go_deps/cache/gitlab.wikimedia.org/toolforge-repos/bridgebot-matterbridge@v0.0.0-20240424042617-38c64944bf1d/bridge/irc/handlers.go:117: github.com/42wim/matterbridge/bridge/irc.(*Birc).handleJoinPart) handle girc.Event{Source:(*girc.Source)(0xc0001f5fb0), Tags:girc.Tags{"time":"2024-04-24T23:19:17.667Z"}, Timestamp:time.Date(2024, time.April, 24, 23, 19, 17, 667000000, time.Local), Command:"JOIN", Params:[]string{"#wikimedia-cloud"}, Sensitive:false, Echo:false}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb36c66]

goroutine 38 [running]:
github.com/lrstanley/girc.(*Client).readLoop(0xc000283880, {0x208ddf8, 0xc000148340})
	/layers/heroku_go/go_deps/cache/github.com/lrstanley/girc@v0.0.0-20230729130341-dd5853a5f1a6/conn.go:440 +0x266
github.com/lrstanley/girc/internal/ctxgroup.(*Group).Go.func1()
	/layers/heroku_go/go_deps/cache/github.com/lrstanley/girc@v0.0.0-20230729130341-dd5853a5f1a6/internal/ctxgroup/ctxgroup.go:58 +0x6e
created by github.com/lrstanley/girc/internal/ctxgroup.(*Group).Go
	/layers/heroku_go/go_deps/cache/github.com/lrstanley/girc@v0.0.0-20230729130341-dd5853a5f1a6/internal/ctxgroup/ctxgroup.go:55 +0x8d
The code and config are ready to try switching everything over. I don't want to do this during my evening, however, given the possibility of exciting new failure modes cropping up after running for a little while.
Here is the deployment plan:
$ bb.sh stop
$ kubectl delete deployment bridgebot.bnc
$ kubectl delete service bnc
$ kubectl apply --validate=true -f bnc-service.yaml
$ toolforge jobs load jobs.yaml --job bnc
$ toolforge jobs load jobs.yaml --job bridgebot
$ dologmsg 'Switched from legacy system to buildservice and jobs configuration (T363028)'
Here is the rollback plan:
$ toolforge jobs delete bridgebot
$ toolforge jobs delete bnc
$ kubectl delete service bnc
$ kubectl apply --validate=true -f etc/bnc-deployment.yaml
$ bb.sh start
$ dologmsg 'Switched back to legacy system (T363028)'
The jobs.yaml and bnc-service.yaml files are not in version control yet.
# https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
---
# ZNC bouncer to sit between matterbridge and libera.chat
# https://gitlab.wikimedia.org/toolforge-repos/wikibugs2-znc
- name: bnc
  command: bouncer
  image: tool-bridgebot/znc:latest
  cpu: 250m
  mem: 256Mi
  continuous: true
  emails: none
  mount: none
  no-filelog: true
# Matterbridge in our custom container
- name: bridgebot
  command: bot
  image: tool-bridgebot/tool-bridgebot:latest
  cpu: 1
  mem: 1G
  continuous: true
  emails: onfailure
  # Mount is needed for storing media
  mount: all
  no-filelog: true
# Delete old media files once per day
# https://github.com/42wim/matterbridge/wiki/Mediaserver-setup-(advanced)#sidenote
- name: static-cleaner
  # >- means replace newlines with spaces (folded), no newline at end (strip)
  command: >-
    /usr/bin/find /data/project/bridgebot/www/static -mindepth 1 -mtime +30
    -delete >/dev/null 2>>/data/project/bridgebot/logs/static-cleaner.err ;
    true
  # ^ find exits nonzero due to not deleting nonempty subdirs;
  # `true` hides this
  image: bookworm
  schedule: "@daily"
  emails: none
  mount: all
  no-filelog: true
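Once jobs.yaml is committed, individual jobs can be (re)loaded by name with the same command used in the deployment plan above, and the result checked with the jobs list:

$ toolforge jobs load jobs.yaml --job static-cleaner
$ toolforge jobs list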
kind: Service
apiVersion: v1
metadata:
  name: bnc
spec:
  selector:
    app.kubernetes.io/name: bnc
  ports:
    - protocol: TCP
      port: 6667
      targetPort: 6667
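This Service gives the bouncer a stable in-cluster name, so matterbridge can reach ZNC at bnc:6667 instead of chasing pod IPs. Assuming the jobs framework labels the bnc job's pods with app.kubernetes.io/name: bnc, the selector match can be verified after applying:

$ kubectl apply --validate=true -f bnc-service.yaml
$ kubectl get endpoints bnc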
Mentioned in SAL (#wikimedia-cloud) [2024-04-25T21:05:37Z] <wmbot~bd808@tools-bastion-12> Switched from legacy system to buildservice and jobs configuration (T363028)
Mentioned in SAL (#wikimedia-cloud) [2024-04-25T21:31:05Z] <wmbot~bd808@tools-bastion-12> built new image from f4022bd9 (T363028)
Mentioned in SAL (#wikimedia-cloud) [2024-04-25T21:32:26Z] <wmbot~bd808@tools-bastion-12> Started bridgebot job (T363028)
The main problem during the deploy was a cut-and-paste error in the bridgebot.toml config file. Once that was spotted and fixed, things came up as hoped.
The docs at https://wikitech.wikimedia.org/wiki/Tool:Bridgebot need to be updated to describe how to work with the new deployment. I will also file a new task about finding a linter for the toml files that could be added to CI.
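One lightweight option (an assumption, not a decision recorded here) would be a CI job that just parses every TOML file with Python's stdlib tomllib, assuming the configs live under etc/:

# Hypothetical .gitlab-ci.yml job; the tool choice is illustrative only.
lint-toml:
  image: python:3.12
  script:
    - python -c 'import sys, tomllib; [tomllib.load(open(f, "rb")) for f in sys.argv[1:]]' etc/*.toml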
Such doc updates! Much wow! https://wikitech.wikimedia.org/w/index.php?title=Tool%3ABridgebot&diff=2172216&oldid=2169061