Convert Striker to a container-based deployment
Closed, Resolved, Public

Description

Striker currently uses scap3 and a relatively elaborate process for preparing the deployment repository to ship code to production. This system was largely copied from deployment tooling for ORES. It is functional, but also complicated and occasionally fragile.

Toolhub is using a newer deployment process based on PipelineLib. This process creates and publishes container images as a post-merge CI step. These images can then be used in the production Kubernetes cluster and elsewhere to deploy the application.

Discussion with @Andrew has revealed that he is willing to allow some experimentation with running Striker as a container on the labweb* production hosts. We are not seeking to deploy into the production Kubernetes cluster at this point.

  • Add pipelinelib config to Striker
  • Switch tests to pipelinelib
  • Rebuild demo environment with Docker
  • Provision Docker container on {cloud,lab}web* hosts
  • Ensure error pages, favicon, robots.txt are served from the container
  • Open firewall for port 8080 on {cloud,lab}web* hosts (this turned out to be unnecessary and should be reverted)
  • Point CDN at port 8080 container (at local envoy layer)
  • Ensure that log events are getting from Docker all the way to the ELK cluster
  • Remove uwsgi version of Striker from {cloud,lab}web* hosts
  • Remove striker from scap3 config

Details

Repo               Branch      Lines +/-
operations/puppet  production  +0 -2
operations/puppet  production  +2 -1
operations/puppet  production  +1 -0
operations/puppet  production  +13 -452
operations/puppet  production  +0 -2
operations/puppet  production  +0 -14
labs/private       master      +0 -20
operations/puppet  production  +6 -0
operations/puppet  production  +0 -5
operations/puppet  production  +11 -2
operations/puppet  production  +13 -8
operations/puppet  production  +9 -2
operations/puppet  production  +16 -2
operations/puppet  production  +5 -0
operations/puppet  production  +2 -2
labs/striker       master      +11 -4
labs/striker       master      +12 -0
operations/puppet  production  +18 -8
labs/striker       master      +1 -1
operations/puppet  production  +114 -6
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
labs/private       master      +11 -0

Event Timeline

We now have a Docker-based local development environment, PipelineLib integration to create and publish a container after each git merge, and a demo environment using published containers. The next big step is to make a new Puppet profile using service::docker to provision a runtime based on published containers as well.

Change 790012 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: Add profile to provision docker container

https://gerrit.wikimedia.org/r/790012

Change 790013 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[labs/private@master] striker: add fake hiera secrets

https://gerrit.wikimedia.org/r/790013

Change 790013 merged by Andrew Bogott:

[labs/private@master] striker: add fake hiera secrets

https://gerrit.wikimedia.org/r/790013

Change 790012 merged by Andrew Bogott:

[operations/puppet@production] striker: Add profile to provision docker container

https://gerrit.wikimedia.org/r/790012

Change 809285 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudweb: include ::profile::docker::ferm so docker can mess with iptables

https://gerrit.wikimedia.org/r/809285

Change 809285 merged by Andrew Bogott:

[operations/puppet@production] cloudweb: include ::profile::docker::ferm so docker can mess with iptables

https://gerrit.wikimedia.org/r/809285

Change 809306 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: require ::profile::docker::ferm in ::profile::wmcs::striker::docker

https://gerrit.wikimedia.org/r/809306

Change 809306 merged by Andrew Bogott:

[operations/puppet@production] striker: require ::profile::docker::ferm in ::profile::wmcs::striker::docker

https://gerrit.wikimedia.org/r/809306

Change 809333 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[labs/striker@master] dev: run production variant on port 8080

https://gerrit.wikimedia.org/r/809333

Change 809333 merged by jenkins-bot:

[labs/striker@master] dev: run production variant on port 8080

https://gerrit.wikimedia.org/r/809333

Change 809714 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: connect docker container directly to host network

https://gerrit.wikimedia.org/r/809714

Change 809714 merged by Andrew Bogott:

[operations/puppet@production] striker: connect docker container directly to host network

https://gerrit.wikimedia.org/r/809714

A Docker container hosting Striker is now running on the {cloud,lab}web* hosts. The container is bound to the host network rather than the default Docker bridge network so that Striker can talk to nutcracker on localhost. Striker is exposed on port 8080 of each host. This port is not currently open on the local firewall, so the running Striker is only reachable via localhost for now.
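
A minimal shell sketch of a host-network container launch, assuming the standard Docker CLI. The image name `registry.example/striker` and the container name `striker` are hypothetical placeholders, not the values used by the Puppet profile. The function only prints the command so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Print (do not run) a docker command that attaches the container to the
# host network, so services bound to the host's localhost (such as
# nutcracker) are reachable from inside the container.
# NOTE: the image and container names are hypothetical placeholders.
striker_run_cmd() {
    tag="$1"
    image="registry.example/striker:${tag}"
    # --network host: share the host's network namespace instead of the
    # default bridge, so the app listens directly on host port 8080.
    printf 'docker run --detach --name striker --network host %s\n' "$image"
}

# From the host itself, the app could then be probed with for example:
#   curl -sI http://localhost:8080/
```

With host networking there is no port mapping to declare; whatever port the application binds inside the container is bound on the host directly.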

Change 810406 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[labs/striker@master] Add 'whitenoise' for static file serving

https://gerrit.wikimedia.org/r/810406

Change 810407 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[labs/striker@master] Handle /robots.txt and /favicon.ico routes

https://gerrit.wikimedia.org/r/810407

Change 810406 merged by jenkins-bot:

[labs/striker@master] Add 'whitenoise' for static file serving

https://gerrit.wikimedia.org/r/810406

Change 810407 merged by jenkins-bot:

[labs/striker@master] Handle /robots.txt and /favicon.ico routes

https://gerrit.wikimedia.org/r/810407

Change 810413 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: Open firewall for Docker-managed service

https://gerrit.wikimedia.org/r/810413

Change 810414 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: Bump container version to 2022-07-01-210101-production

https://gerrit.wikimedia.org/r/810414

Change 810413 merged by Andrew Bogott:

[operations/puppet@production] striker: Open firewall for Docker-managed service

https://gerrit.wikimedia.org/r/810413

Change 810414 merged by Andrew Bogott:

[operations/puppet@production] striker: Bump container version to 2022-07-01-210101-production

https://gerrit.wikimedia.org/r/810414

Change 811332 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: cloudweb-dev: route striker to the docker port

https://gerrit.wikimedia.org/r/811332

Change 811274 had a related patch set uploaded (by BryanDavis; author: BryanDavis):

[operations/puppet@production] Revert "striker: Open firewall for Docker-managed service"

https://gerrit.wikimedia.org/r/811274

Change 811337 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] labweb: point tlsproxy envoy at port 8080 for striker

https://gerrit.wikimedia.org/r/811337

Change 811337 merged by Andrew Bogott:

[operations/puppet@production] labweb: point tlsproxy envoy at port 8080 for striker

https://gerrit.wikimedia.org/r/811337

Change 811381 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] labweb: point tlsproxy envoy at port 8080 for striker

https://gerrit.wikimedia.org/r/811381

Change 811337 (labweb: point tlsproxy envoy at port 8080 for striker) was reverted with https://gerrit.wikimedia.org/r/c/operations/puppet/+/811277.

Change 811381 merged by Andrew Bogott:

[operations/puppet@production] labweb: point tlsproxy envoy at port 8080 for striker

https://gerrit.wikimedia.org/r/811381

Change 811381 (labweb: point tlsproxy envoy at port 8080 for striker) was reverted with https://gerrit.wikimedia.org/r/c/operations/puppet/+/811965/.

The problem this time was that Apache has an explicit vhost for 127.0.0.1:80, which captured the envoy traffic we intended to go to the *:80 vhosts for horizon and wikitech.
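
The capture described above follows from how address-based vhost selection generally works: an exact IP:port match is preferred, and wildcard vhosts are only considered when no exact match exists. The following is a simplified model of that rule, not Apache's actual implementation; the function name and the example address 10.0.0.5 are hypothetical.

```shell
#!/bin/sh
# Simplified model of address-based virtual host selection: an exact
# IP:port match wins, and a wildcard (*:port) vhost is used only when
# no exact match exists.
# Usage: pick_vhost DEST_IP DEST_PORT VHOST_ADDR...
pick_vhost() {
    dest_ip="$1"
    dest_port="$2"
    shift 2
    wildcard=""
    for addr in "$@"; do
        case "$addr" in
            "${dest_ip}:${dest_port}")
                # Exact address match: stop looking.
                echo "$addr"
                return 0
                ;;
            "*:${dest_port}")
                # Remember the first wildcard as a fallback.
                [ -n "$wildcard" ] || wildcard="$addr"
                ;;
        esac
    done
    if [ -n "$wildcard" ]; then
        echo "$wildcard"
        return 0
    fi
    return 1
}
```

In this model, `pick_vhost 127.0.0.1 80 '127.0.0.1:80' '*:80'` selects the explicit `127.0.0.1:80` vhost, while a request arriving on a non-loopback address such as `pick_vhost 10.0.0.5 80 '127.0.0.1:80' '*:80'` falls through to `*:80`.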

Change 812096 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] labweb: move striker, wikitech, horizon behind envoy

https://gerrit.wikimedia.org/r/812096

Change 812096 merged by Andrew Bogott:

[operations/puppet@production] labweb: move striker, wikitech, horizon behind envoy

https://gerrit.wikimedia.org/r/812096

Change 812096 (labweb: move striker, wikitech, horizon behind envoy) was reverted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/812106/.

Envoy again served unexpected content for wikitech and horizon requests.

Change 812381 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] labweb: point tlsproxy envoy at %{facts.ipaddress}:8080 for striker

https://gerrit.wikimedia.org/r/812381

Change 812381 merged by Andrew Bogott:

[operations/puppet@production] labweb: point tlsproxy envoy at %{facts.ipaddress}:8080 for striker

https://gerrit.wikimedia.org/r/812381

https://toolsadmin.wikimedia.org is now being served from Docker containers running on labweb100[12]. Initial manual testing of everything looks good. I'm going to wait a few days before starting on the cleanup side of this task, which will remove the now-legacy uwsgi deployment.

Change 811274 merged by Andrew Bogott:

[operations/puppet@production] Revert "striker: Open firewall for Docker-managed service"

https://gerrit.wikimedia.org/r/811274

Change 811332 merged by Andrew Bogott:

[operations/puppet@production] hieradata: cloudweb-dev: route striker to the docker port

https://gerrit.wikimedia.org/r/811332

Change 819116 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[labs/private@master] striker: remove legacy settings

https://gerrit.wikimedia.org/r/819116

Change 819121 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: remove legacy deployment

https://gerrit.wikimedia.org/r/819121

Change 819121 merged by Andrew Bogott:

[operations/puppet@production] striker: remove legacy deployment

https://gerrit.wikimedia.org/r/819121

Change 819116 merged by Andrew Bogott:

[labs/private@master] striker: remove legacy settings

https://gerrit.wikimedia.org/r/819116

Remove uwsgi version of Striker from {cloud,lab}web* hosts

The labweb hosts are in the process of being decommed, so we only need to clean the cloudweb nodes. These commands were used:

$ sudo service uwsgi-striker stop
$ sudo rm /lib/systemd/system/uwsgi-striker.service
$ sudo /bin/systemctl daemon-reload

$ sudo rm /etc/striker/striker.ini
$ sudo rm /etc/apache2/sites-available/50-striker.conf
$ sudo rm /etc/uwsgi/apps-enabled/striker.ini
$ sudo rm /etc/uwsgi/apps-available/striker.ini

$ sudo rm /etc/nagios/nrpe.d/check_uwsgi-striker.cfg

$ sudo rm -r /srv/deployment/striker

$ sudo -i puppet agent -tv

  • cloudweb2002-dev.wikimedia.org
  • cloudweb1004.wikimedia.org
  • cloudweb1003.wikimedia.org
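
The file removals above can be sketched as a repeatable function that skips paths already gone, which helps when the same cleanup is run on several hosts. This is an illustration, not the procedure that was actually used: the function name and the optional root prefix argument are hypothetical, and stopping the service, `systemctl daemon-reload`, and the Puppet run are left as separate manual steps.

```shell
#!/bin/sh
# Remove the legacy uwsgi Striker files, skipping any that are already
# absent. The optional first argument is a hypothetical root prefix
# (default: empty, meaning the real filesystem) so the function can be
# rehearsed against a scratch directory tree first.
remove_legacy_striker_files() {
    root="${1:-}"
    for path in \
        /lib/systemd/system/uwsgi-striker.service \
        /etc/striker/striker.ini \
        /etc/apache2/sites-available/50-striker.conf \
        /etc/uwsgi/apps-enabled/striker.ini \
        /etc/uwsgi/apps-available/striker.ini \
        /etc/nagios/nrpe.d/check_uwsgi-striker.cfg \
        /srv/deployment/striker
    do
        target="${root}${path}"
        # -L also catches dangling symlinks, which -e reports as missing.
        if [ -e "$target" ] || [ -L "$target" ]; then
            rm -r -- "$target"
            echo "removed ${path}"
        else
            echo "absent  ${path}"
        fi
    done
}
```

Calling it with a scratch directory, for example `remove_legacy_striker_files /tmp/rehearsal`, only touches paths under that prefix.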

Change 820206 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: Remove uwsgi deployment logging config

https://gerrit.wikimedia.org/r/820206

Change 820207 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: remove from scap::sources

https://gerrit.wikimedia.org/r/820207

Change 820206 merged by Andrew Bogott:

[operations/puppet@production] striker: Remove uwsgi deployment logging config

https://gerrit.wikimedia.org/r/820206

Change 820207 merged by Andrew Bogott:

[operations/puppet@production] striker: remove from scap::sources

https://gerrit.wikimedia.org/r/820207

Change 820237 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] service::docker: Add SyslogIdentifier to systemd unit

https://gerrit.wikimedia.org/r/820237

Change 820238 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] striker: route syslog output to ELK cluster via kafka

https://gerrit.wikimedia.org/r/820238

Change 820237 merged by Cwhite:

[operations/puppet@production] service::docker: Add SyslogIdentifier to systemd unit

https://gerrit.wikimedia.org/r/820237

Change 820238 merged by Cwhite:

[operations/puppet@production] striker: route syslog output to ELK cluster via kafka

https://gerrit.wikimedia.org/r/820238

bd808 updated the task description.

Manually masked the wmf_auto_restart units for the striker uwsgi service (which was removed), and reset the failed state:

root@cloudweb1003:~# systemctl mask wmf_auto_restart_uwsgi-striker.timer
Created symlink /etc/systemd/system/wmf_auto_restart_uwsgi-striker.timer → /dev/null.
root@cloudweb1003:~# systemctl mask wmf_auto_restart_uwsgi-striker.service
Created symlink /etc/systemd/system/wmf_auto_restart_uwsgi-striker.service → /dev/null.

root@cloudweb1003:~# systemctl reset-failed

It was triggering some alerts (https://alerts.wikimedia.org/?q=team%3Dwmcs):

critical
Check systemd state
summary: CRITICAL - degraded: The following units failed: wmf_auto_restart_uwsgi-striker.service
3 days ago
instance: cloudweb1004
source: icinga
team: wmcs
@receiver: irc-spam
runbook
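
A small sketch for pulling unit names out of a summary line like the one in the alert above, for example to review them before masking. The function name is hypothetical, and the handling of multiple units (separated by commas or spaces) is an assumption, since the alert shown lists only one.

```shell
#!/bin/sh
# Extract unit names from a "Check systemd state" summary line.
# Assumes the summary format shown in the alert above; when several
# units are listed they are assumed to be comma- or space-separated.
failed_units_from_summary() {
    printf '%s\n' "$1" \
        | sed -n 's/^.*The following units failed: *//p' \
        | tr ', ' '\n\n' \
        | sed '/^$/d'
}
```

Each printed name can then be inspected with `systemctl status` on the affected host before deciding whether to mask or delete it.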

The unit was removed with https://github.com/wikimedia/puppet/commit/c7be95a4448f8d7cd6b030b8d3047d543996c2de#diff-ebf0da109f631a42e558175237b41ea3477ccedcb01c2a82b980b8d87b0870b2L144, so it was likely missed in the cleanup in T306469#8129139 and can be deleted completely.

Thank you @dcaro

Yes, this is just more cruft from the manifests I deleted because I was too lazy to figure out how to make them "ensure => absent". It would have been more ideal for me to remove things before the role was applied to the new cloudweb100[34] boxes, but I can't find the keys to the time machine.

Cleaned up on cloudweb1003, cloudweb1004, & cloudweb2002-dev:

$ sudo rm /lib/systemd/system/wmf_auto_restart_uwsgi-striker.{service,timer}
$ sudo rm /etc/systemd/system/wmf_auto_restart_uwsgi-striker.{service,timer}
$ sudo /bin/systemctl daemon-reload

Change 821733 had a related patch set uploaded (by RhinosF1; author: RhinosF1):

[operations/puppet@production] dsh: remove old labweb hosts

https://gerrit.wikimedia.org/r/821733

Change 821733 merged by Andrew Bogott:

[operations/puppet@production] dsh: remove old labweb hosts

https://gerrit.wikimedia.org/r/821733