Logstash-Beta cannot be accessed: 504 Gateway Time-out
Closed, ResolvedPublic

Description

I cannot connect to https://logstash-beta.wmflabs.org, nor can I ssh to deployment-logstash2.deployment-prep.eqiad.wmflabs (I can ssh to logstash03). Rebooting deployment-logstash2 and deployment-logstash03 did not help either.

504 Gateway Time-out
nginx/1.13.6

Possible cause: T254801#6206911

Also, maybe worth another Task but:

  • on which instance is logstash-beta actually running?
  • both logstash2 and logstash03 run on outdated software; shall we ask for a reimage, or just create logstash04 on buster and serve things from there?
  • do we need another instance for data backup?

Event Timeline

MarcoAurelio renamed this task from Logstash-Beta issues to Logstash-Beta cannot be accessed: 504 Gateway Time-out. (Jun 8 2020, 6:48 PM)
MarcoAurelio triaged this task as High priority.

It's supposed to be on 03 but I think it got moved back to 2 at some point.

What's outdated about 03? It was set up to replace 2 because that was running Jessie.

I'm not sure the logs stored in here have enough value to be backed up?

Looked outdated to me:

  • logstash2 is running on debian-8.0-jessie (deprecated 2015-06-13)
  • logstash03 is running on debian-9.8-stretch (deprecated 2019-07-25)

Granted, I am not sure if this was intended.

An image is used to create the VM in the first place; once that's done we just keep it updated. If we replaced instances because they were based on images that have since been deprecated, we'd either have automated the whole thing, stopped running special VMs and gone for a container-on-ephemeral-VM model, or gathered a small army of people to spend all their time replacing instances (this may be an exaggeration, but you get the gist).
Stretch is not banned yet, we're still trying to get rid of jessie. Production logstash hosts run stretch.

https://openstack-browser.toolforge.org/project/deployment-prep says that logstash-beta.wmflabs.org is pointed at http://172.16.1.184:80 which is deployment-logstash03.deployment-prep.eqiad1.wikimedia.cloud.
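
(For anyone re-tracing this: a quick probe of that backend from another instance inside the project would look something like the line below; the IP is taken from the proxy entry above.

$ curl -sv --max-time 5 -o /dev/null http://172.16.1.184:80/

A connect timeout there, while the process is confirmed listening locally, points at a firewall rather than the service itself.)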

There is a process running on port 80 of deployment-logstash03.deployment-prep.eqiad1.wikimedia.cloud, but there is also a ferm-managed firewall on the host, and it has no ingress rule allowing the outside world to talk to the process on port 80. Seems like this is either something missing in the hiera config for the instance or something missing in the roles/profiles applied to it.
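
The missing rule can be confirmed on the host itself; a sketch, assuming ferm has loaded its ruleset into iptables:

root@deployment-logstash03:~# ss -tlnp | grep ':80 '              # something is listening on port 80
root@deployment-logstash03:~# iptables -nL INPUT | grep 'dpt:80'  # no output = no ACCEPT rule for port 80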

Can we delete logstash2 then? Is there anything running there?

Puppet is broken on deployment-logstash2.deployment-prep.eqiad.wmflabs: "The last Puppet run was at Tue Mar 3 23:28:34 UTC 2020 (140730 minutes ago)." The breakage is caused by the root partition being full: "Error: Could not run Puppet configuration client: No space left on device @ fptr_finalize - /var/lib/puppet/state/agent_catalog_run.lock". The disk is full because of log files in /var/log. The big log consumers are kafka, logstash, daemon.log, and syslog. They all seem to be yelling about the disk being full. Nice circular problem.
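
The triage for this kind of disk-full/Puppet-locked loop is roughly the following (generic commands, not an exact transcript from this host):

root@deployment-logstash2:~# df -h /                                            # confirm the root partition is full
root@deployment-logstash2:~# du -xsh /var/log/* 2>/dev/null | sort -rh | head   # find the biggest log consumers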

I nuked some old logs and made enough space to get a Puppet run to start. I also stopped the logstash and kafka services to free up some CPU so that Puppet could run. It may or may not complete before the disk fills up again.
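
For the record, the cleanup was along these lines (a sketch; exact file names and service unit names may differ):

root@deployment-logstash2:~# rm -f /var/log/*.gz /var/log/*.1                    # drop rotated logs first
root@deployment-logstash2:~# truncate -s 0 /var/log/daemon.log /var/log/syslog   # shrink live logs without breaking open file handles
root@deployment-logstash2:~# systemctl stop logstash kafka                       # assumed unit names; frees CPU for the Puppet run
root@deployment-logstash2:~# puppet agent -tv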

Puppet config is broken, likely due to upstream refactoring.

root@deployment-logstash2:/var/log# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find class role::kibana for deployment-logstash2.deployment-prep.eqiad.wmflabs on node deployment-logstash2.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Looks like the kibana role was removed on Mar 4 in commit 3224079cf0f53711bea07163a666531f2475897a.
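
To see what the refactor replaced the class with, one can inspect that commit in any checkout of the puppet repo (file path assumed per the usual role::kibana autoload layout):

$ git show --stat 3224079cf0f53711bea07163a666531f2475897a
$ git show 3224079cf0f53711bea07163a666531f2475897a -- modules/role/manifests/kibana.pp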

Attempted to update the instance to use profile::kibana, as it looks like that commit updated all the hiera config to support it; however, I had trouble sshing into deployment-logstash2.

I managed to log in via keyholder on the cumin instance and saw:

/var/log/auth.log
error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1

This is possibly, even likely, due to Puppet being broken here. I managed to run Puppet on this instance; I can now ssh into it at least.
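
The AuthorizedKeysCommand can also be run by hand to separate LDAP lookup problems from general Puppet breakage; a sketch, with the username as a placeholder:

root@deployment-logstash2:~# /usr/sbin/ssh-key-ldap-lookup someuser; echo "exit: $?"   # exit 1 with no keys printed reproduces the auth.log error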

There's still breakage. It's screaming about the mtail package. I got ferm running; however, there is still no iptables rule for port 80 on that machine.

I tried manually adding the rule to iptables, but it didn't change the nginx gateway timeout at https://logstash-beta.wmflabs.org/ (I've since removed that rule again).
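
(The temporary rule was presumably something along these lines; noting it since it's a handy test even though it didn't help here:

root@deployment-logstash2:~# iptables -I INPUT -p tcp --dport 80 -j ACCEPT   # temporary, not ferm-managed
root@deployment-logstash2:~# iptables -D INPUT -p tcp --dport 80 -j ACCEPT   # remove again after testing

Anything added this way is lost on the next ferm restart anyway, which is why the real fix belongs in hiera/puppet.)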

Can I suggest that efforts may be better placed on getting logstash03 into operation rather than continuing to resurrect logstash2 or keep it on life support?

I agree, we should probably decommission/delete/remove logstash2.

herron claimed this task.
herron subscribed.

After applying profile::kibana::httpd_proxy to deployment-logstash03, and playing whack-a-mole with missing hiera keys (added to the instance hiera), https://logstash-beta.wmflabs.org is accessible once again. Resolving!
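
A quick external check that the proxy answers again, for anyone verifying later:

$ curl -I https://logstash-beta.wmflabs.org/

Expecting a 2xx/3xx response instead of the nginx 504.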