
Replace custom deployment with build service and job service
Closed, Resolved · Public · Feature

Description

Replace the custom Deployment and related tooling described at https://wikitech.wikimedia.org/wiki/Tool:Bridgebot#Technical_details with modern Toolforge alternatives by creating a container hosting matterbridge with some custom software that will generate the needed configuration file(s) by interpolating envvar secrets into git hosted templates.

Event Timeline

I recently learned that matterbridge uses the https://github.com/spf13/viper library and its AutomaticEnv feature when processing the config file. This allows envvars like MATTERBRIDGE_IRC_LIBERA_BRIDGEBOT_PASSWORD to be used to set secret values at runtime. I hope this will make conversion to a custom container image a bit simpler by removing the need for a custom interpolation system for the config file.
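The naming convention can be sketched in a few lines of shell. This is an illustration, not matterbridge code: it assumes viper is configured with a MATTERBRIDGE env prefix and a dot-to-underscore key replacer, and the TOML key path below is an invented example chosen to match the envvar mentioned above.

```shell
# Hedged sketch of viper's AutomaticEnv name derivation: uppercase the
# config key path, replace dots with underscores, prepend the prefix.
# The key path here is an assumed example, not taken from a real config.
key="irc.libera_bridgebot.password"
envvar="MATTERBRIDGE_$(printf '%s' "$key" | tr 'a-z.' 'A-Z_')"
echo "$envvar"
```

Setting that variable in the job's environment would then override whatever placeholder value the git-hosted config file carries.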

bd808 changed the task status from Open to In Progress. Apr 24 2024, 4:40 PM
bd808 claimed this task.
bd808 triaged this task as Medium priority.
bd808 moved this task from To Do to In Dev/Progress on the Tool-bridgebot board.
$ ssh login.toolforge.org
$ become bridgebot
$ webservice buildservice shell --mount all -m 2G -c 1
$ /layers/heroku_go/go_target/bin/bridgebot -conf /app/etc/testing.toml
[0000]  INFO router:       (/layers/heroku_go/go_deps/cache/gitlab.wikimedia.org/toolforge-repos/bridgebot-matterbridge@v0.0.0-20240424042617-38c64944bf1d/gateway/router.go:66: github.com/42wim/matterbridge/gateway.(*Router).Start) Parsing gateway testing-irc-telegram
[0000]  INFO router:       (/layers/heroku_go/go_deps/cache/gitlab.wikimedia.org/toolforge-repos/bridgebot-matterbridge@v0.0.0-20240424042617-38c64944bf1d/gateway/router.go:75: github.com/42wim/matterbridge/gateway.(*Router).Start) Starting bridge: irc.testing
...
[21:07]  <    wm-bb> Does it work now?
[21:07]  <    bd808> omg, it did work!

Screenshot 2024-04-24 at 15.18.48.png (208×1 px, 45 KB)

The only thing that didn't seem to work was loading the remotenickformat.tengo script, which I assumed was searched for relative to the config file. It looks like it is loaded relative to the current working directory instead, so I will need to update a bit of config.
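One likely shape of that config change, sketched as a TOML fragment. The absolute path below is an assumption based on the /app/etc layout shown in the shell transcript above, not the actual committed change:

```toml
[tengo]
# Use an absolute path so the script resolves regardless of the
# process working directory (path assumed from the /app/etc layout).
RemoteNickFormat="/app/etc/remotenickformat.tengo"
```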

There is also a bit of trouble with T363417: [builds-builder] golang based images get infinite nested loops for procfile entries but we should be able to work around that too.

I have seen one crash on startup in testing, but it was not repeatable. It looks like it was triggered by something the IRC client saw in scrollback when attaching:

[0005] DEBUG irc:          (/layers/heroku_go/go_deps/cache/gitlab.wikimedia.org/toolforge-repos/bridgebot-matterbridge@v0.0.0-20240424042617-38c64944bf1d/bridge/irc/handlers.go:117: github.com/42wim/matterbridge/bridge/irc.(*Birc).handleJoinPart) handle girc.Event{Source:(*girc.Source)(0xc0001f5fb0), Tags:girc.Tags{"time":"2024-04-24T23:19:17.667Z"}, Timestamp:time.Date(2024, time.April, 24, 23, 19, 17, 667000000, time.Local), Command:"JOIN", Params:[]string{"#wikimedia-cloud"}, Sensitive:false, Echo:false}
panic: runtime error: invalid memory address or nil pointer dereference         
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb36c66]
                                                                                
goroutine 38 [running]:                                                         
github.com/lrstanley/girc.(*Client).readLoop(0xc000283880, {0x208ddf8, 0xc000148340})
        /layers/heroku_go/go_deps/cache/github.com/lrstanley/girc@v0.0.0-20230729130341-dd5853a5f1a6/conn.go:440 +0x266
github.com/lrstanley/girc/internal/ctxgroup.(*Group).Go.func1()
        /layers/heroku_go/go_deps/cache/github.com/lrstanley/girc@v0.0.0-20230729130341-dd5853a5f1a6/internal/ctxgroup/ctxgroup.go:58 +0x6e
created by github.com/lrstanley/girc/internal/ctxgroup.(*Group).Go
        /layers/heroku_go/go_deps/cache/github.com/lrstanley/girc@v0.0.0-20230729130341-dd5853a5f1a6/internal/ctxgroup/ctxgroup.go:55 +0x8d

The code and config are ready for switching everything over. However, I don't want to do this during my evening, due to the possibility of exciting new failure modes cropping up after it has been running for a little while.

Here is the deployment plan:

switch to new containers
$ bb.sh stop
$ kubectl delete deployment bridgebot.bnc
$ kubectl delete service bnc

$ kubectl apply --validate=true -f bnc-service.yaml
$ toolforge jobs load jobs.yaml --job bnc
$ toolforge jobs load jobs.yaml --job bridgebot

$ dologmsg 'Switched from legacy system to buildservice and jobs configuration (T363028)'

Here is the rollback plan:

fallback to the old ways
$ toolforge jobs delete bridgebot
$ toolforge jobs delete bnc
$ kubectl delete service bnc

$ kubectl apply --validate=true -f etc/bnc-deployment.yaml
$ bb.sh start

$ dologmsg 'Switched back to legacy system (T363028)'

The jobs.yaml and bnc-service.yaml files are not in version control yet.

jobs.yaml
# https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
---
# ZNC bouncer to sit between matterbridge and libera.chat
# https://gitlab.wikimedia.org/toolforge-repos/wikibugs2-znc
- name: bnc
  command: bouncer
  image: tool-bridgebot/znc:latest
  cpu: 250m
  mem: 256Mi
  continuous: true
  emails: none
  mount: none
  no-filelog: true

# Matterbridge in our custom container
- name: bridgebot
  command: bot
  image: tool-bridgebot/tool-bridgebot:latest
  cpu: 1
  mem: 1G
  continuous: true
  emails: onfailure
  # Mount is needed for storing media
  mount: all
  no-filelog: true

# Delete old media files once per day
# https://github.com/42wim/matterbridge/wiki/Mediaserver-setup-(advanced)#sidenote
- name: static-cleaner
  # >- means replace newlines with spaces (folded), no newline at end (strip)
  command: >-
    /usr/bin/find
    /data/project/bridgebot/www/static
    -mindepth 1
    -mtime +30
    -delete
    >/dev/null
    2>>/data/project/bridgebot/logs/static-cleaner.err
    ;
    true
  # ^ find exits nonzero due to not deleting nonempty subdirs;
  # `true` hides this
  image: bookworm
  schedule: "@daily"
  emails: none
  mount: all
  no-filelog: true
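The `; true` trick in the static-cleaner command can be reproduced in miniature (the temp-directory layout below is invented for illustration): find's -delete action fails on a directory it cannot empty and exits nonzero, and the trailing true masks that status so the scheduler does not count the run as failed.

```shell
# Miniature reproduction of the comment above. A file too new to match
# the age filter keeps its parent directory nonempty, so `find -delete`
# cannot rmdir it and exits nonzero; `; true` hides that exit status.
d=$(mktemp -d)
mkdir -p "$d/old"
touch "$d/old/recent-file"                       # keeps old/ nonempty
find "$d" -mindepth 1 -name old -delete 2>/dev/null
raw=$?                                           # nonzero: rmdir failed
find "$d" -mindepth 1 -name old -delete 2>/dev/null; true
masked=$?                                        # 0: `true` ran last
echo "raw=$raw masked=$masked"
rm -rf "$d"
```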
bnc-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: bnc
spec:
  selector:
    app.kubernetes.io/name: bnc
  ports:
    - protocol: TCP
      port: 6667
      targetPort: 6667

Mentioned in SAL (#wikimedia-cloud) [2024-04-25T21:05:37Z] <wmbot~bd808@tools-bastion-12> Switched from legacy system to buildservice and jobs configuration (T363028)

Mentioned in SAL (#wikimedia-cloud) [2024-04-25T21:31:05Z] <wmbot~bd808@tools-bastion-12> built new image from f4022bd9 (T363028)

Mentioned in SAL (#wikimedia-cloud) [2024-04-25T21:32:26Z] <wmbot~bd808@tools-bastion-12> Started bridgebot job (T363028)

The main problem during the deploy was a cut-and-paste error in the bridgebot.toml config file. Once that was spotted and fixed, things came up as hoped.

The docs at https://wikitech.wikimedia.org/wiki/Tool:Bridgebot need to be updated to describe how to work with the new deployment. I will also make a new task about finding a TOML linter that we can add to CI to validate the config files.

bd808 moved this task from In Dev/Progress to Done on the Tool-bridgebot board.

> I will also make a new task about trying to find a linter to validate the toml files to add to CI.

T363529: Add toml linter for config files