Discourse instance brings 502 Bad Gateway after half-way upgrade
Closed, Resolved | Public

Description

https://discourse.wmflabs.org/ is down: 502 Bad Gateway

I was in the middle of an upgrade, after asking for permission to try it. After searching the Discourse Meta forums a bit, it seems the most likely cause is running out of disk space or memory/swap, after which a simple restart should bring the site back.
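A minimal sketch for checking the disk-space suspicion from a shell on the host. The function names and the 90% threshold are illustrative assumptions, not anything Discourse prescribes; memory and swap could be inspected separately (for example with `free -m` on Linux).

```shell
#!/bin/sh
# Extract the "Capacity" column (as a bare number) from `df -P` output on stdin.
parse_df_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the usage percentage of the filesystem that holds the given path.
disk_usage_percent() {
    df -P "$1" | parse_df_percent
}

# Succeed when usage ($1) is at or above the limit ($2).
usage_exceeds() {
    [ "$1" -ge "$2" ]
}

# Example (the 90 is an arbitrary, hypothetical threshold):
# usage_exceeds "$(disk_usage_percent /)" 90 && echo "disk nearly full"
```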

This is the log I was keeping while upgrading, fwiw:

This is what we had:

Installed version: v1.5.0.beta13b
Latest version: 1.9.0.beta14
A critical update is available. Please upgrade!

First I had to update Docker Manager itself:

docker_manager (549b1d8)
New Version Available!
Remote Version:
Last Updated: 11 days ago
38 new commits

After clicking, a progress bar and a command line window appeared. Process started.

Then, after five minutes and a lot of sensible output:

Upgrade completed successfully!
Note: The web server restarts in the background. It's a good idea to wait 30 seconds or so before refreshing your browser to see the latest version of the application.
docker_manager is at the newest version (6cc836e).

Then, when trying to continue with the upgrade...

502 Bad Gateway
nginx

I have been waiting just in case. Sometimes some parts of the admin interface were still available by visiting URLs in my recent browser history, but now everything seems to be down.

Event Timeline

docker_manager usually doesn’t affect the Rails server at all. But in case of failure, use ./launcher rebuild app. If there are still errors, that should give meaningful output.

Any news on this? Can the webservice be restarted?

Who are the admins of that instance? I don't know who they are, nor do I know how to check.

Does this just need the webservice to be restarted? Has anyone tried that?

(It's sort of frustrating with all this talk about Discourse at the mo' that the functioning installation that we've had for nearly a couple of years is not actually online.)

That sounds like you are volunteering :)

Added you as a projectadmin.

So it seemed to be failing on the upgrade because there were local modifications to the Git repo:

discourse@discourse1002:/srv/discourse$ git diff
diff --git a/templates/web.template.yml b/templates/web.template.yml
index b4fdd92..19ead4b 100644
--- a/templates/web.template.yml
+++ b/templates/web.template.yml
@@ -125,6 +125,39 @@ run:
       from: /client_max_body_size.+$/
       to: client_max_body_size $upload_size ;
 
+  # from https://wiki.mozilla.org/Community_Ops/Discourse/Setup#Edit_web.template.yml_.28Only_for_SSL_Sites.29
+  - replace:
+      filename: "/etc/nginx/conf.d/discourse.conf"
+      from: "server {"
+      to: |+
+        server {
+        #Messy hack to force SSL on only the hostname, not IPs so ELB and Icinga work.
+          set $use_https NO;
+          if ($host ~* 'discourse.wmflabs.org') {
+            set $use_https A;
+          }
+          if ($http_x_forwarded_proto != 'https') {
+            set $use_https "${use_https}B";
+          }
+          if ($use_https = AB) {
+            rewrite ^ https://$host$request_uri? permanent;
+          }
+
+  # make exports publicly visible
+  - replace:
+      filename: "/etc/nginx/conf.d/discourse.conf"
+      from: "    location = /srv/status {"
+      to: |+
+            location = /exports/ {
+              autoindex on;
+            }
+
+            location ~ /exports/.*\.json {
+              add_header Content-type "application/json; charset=utf-8";
+            }
+
+            location = /srv/status {
+
   - exec:
       cmd: echo "done configuring web"
       hook: web_config
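
Local modifications like the ones above can be spotted before starting a rebuild. A small sketch, assuming git is on the PATH; the function name is hypothetical, and note that it also counts untracked files, which is stricter than what the upgrade tripped over here:

```shell
#!/bin/sh
# Succeed when the Git checkout at $1 has uncommitted or untracked changes.
has_local_changes() {
    [ -n "$(git -C "$1" status --porcelain)" ]
}

# Example, using the checkout path from the prompt above:
# if has_local_changes /srv/discourse; then
#     git -C /srv/discourse diff
# fi
```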

Removing these (temporarily), and running ./launcher rebuild app results in:

Your Docker installation is not using a supported storage driver. If we were to proceed you may have a broken install.
aufs is the recommended storage driver, although zfs/btrfs/overlay and overlay2 may work as well.
Other storage drivers are known to be problematic.
You can tell what filesystem you are using by running "docker info" and looking at the 'Storage Driver' line.

If you wish to continue anyway using your existing unsupported storage driver,
read the source code of launcher and figure out how to bypass this check.

So I switched to the overlay storage driver by adding this to /etc/docker/daemon.json:

{
    "storage-driver": "overlay"
}
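
To confirm which driver is active after a change like this, something along these lines could help. It is only a sketch: the function name is made up, the driver list is copied from the launcher message above, and `docker info --format` assumes a Docker release recent enough to support that flag (otherwise read the 'Storage Driver' line of plain `docker info`, as the message suggests).

```shell
#!/bin/sh
# Succeed when the named storage driver is one the launcher message lists.
is_listed_driver() {
    case "$1" in
        aufs|zfs|btrfs|overlay|overlay2) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# driver=$(docker info --format '{{.Driver}}')
# is_listed_driver "$driver" || echo "launcher will likely reject: $driver"
```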

And restarted docker without error. But now:

discourse@discourse1002:/srv/discourse$ ./launcher 
ERROR: Docker version 1.12.6 not supported, please upgrade to at least 17.03.1, or recommended 17.06.2

So upgrade docker...
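
The version check the launcher complains about can be approximated by hand. A sketch assuming GNU `sort` with `-V` (version sort) is available; `version_ge` is a hypothetical helper, not part of the launcher:

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
# Works by version-sorting the two strings and checking that $2 sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example with the minimum version from the error message above:
# version_ge 1.12.6 17.03.1 || echo "Docker too old for the launcher"
```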

Thanks @Paladox, those were helpful. No need to upgrade Debian though, just uninstall docker and reinstall (hm yes, sounds weird to me too, but it worked!):

samwilson@discourse1002:~$ docker -v
Docker version 17.10.0-ce, build f4ffd25

So now rebuilding...

Okay, we're back up and running now.

We're running on postgres 9.3 (in containers/app.yml I had to change templates/postgres.template.yml to templates/postgres.9.3.template.yml), so that still needs to be fixed up:

Run: ./launcher enter app
Run: cd /shared/postgres_backup && sudo -u postgres pg_dump discourse > backup.db

Undo the postgres template in your container config
Run: ./launcher stop app
Run: sudo mv /var/discourse/shared/standalone/postgres_data /var/discourse/shared/standalone/postgres_data_old
Run: ./launcher rebuild app

Run: ./launcher enter app
Run: cd /shared/postgres_backup
Run: sv stop unicorn
Run: sudo -iu postgres dropdb discourse
Run: sudo -iu postgres createdb discourse
Run: sudo -iu postgres psql discourse < backup.db
Run: exit
Run: ./launcher rebuild app
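
Since the steps above drop the database, it may be worth sanity-checking the dump before the dropdb step. A rough sketch: the function name is hypothetical, and it only checks that the file is non-empty and starts with the comment header that plain-text pg_dump output carries, which is no guarantee the dump is complete.

```shell
#!/bin/sh
# Succeed when $1 is a non-empty file whose first lines carry the
# "PostgreSQL database dump" header written by plain-text pg_dump.
looks_like_pg_dump() {
    [ -s "$1" ] && head -n 5 "$1" | grep -q 'PostgreSQL database dump'
}

# Example, run inside the container before dropping the database:
# looks_like_pg_dump /shared/postgres_backup/backup.db || echo "do not drop yet"
```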

But I've got some other stuff to do right now. :-)

No it is! :) Well, it was, then I restarted it again because I was putting the https and /exports changes back into the config. So it was down again. But purposefully. :-)

It's a slow beast to rebuild.

Okay, it's up and running again now, and redirecting to https. The custom config is now in templates/web.toolforge.yml so we're no longer modifying core files.
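
One way to check the https redirect from outside is to look at the response headers. A sketch, assuming curl is available and that the nginx rewrite answers with a 301 (which `permanent` produces); `is_https_redirect` is a made-up helper that reads headers on stdin:

```shell
#!/bin/sh
# Succeed when the HTTP response headers on stdin show a 301 status
# together with a Location header pointing at an https:// URL.
is_https_redirect() {
    awk '
        NR == 1 && $2 == "301" { moved = 1 }
        tolower($1) == "location:" && $2 ~ /^https:\/\// { secure = 1 }
        END { exit !(moved && secure) }
    '
}

# Example:
# curl -sI http://discourse.wmflabs.org/ | is_https_redirect && echo "redirects to https"
```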

Things left to do:

You can find custom commands in containers/app.yml. If you have an external PostgreSQL service, Discourse can use that.