
WCQS gives "502 Bad Gateway Error"
Closed, ResolvedPublic

Description

Wikimedia Commons Query Service seems to be down, as https://wcqs-beta.wmflabs.org/ is giving me a "502 Bad Gateway Error". I have not used WCQS in a while, so maybe something changed, like the URL(?).

Event Timeline

This is still the correct URL. I don't have exact logs, but it looks like around 2021-12-09T17:44Z the instance stopped running; upon inspection it reports a status of error and a power state of paused. Around 20 minutes after I started poking it, the power state changed to No State. I have projectadmin rights for the project, but attempting to start the instance reports that I don't have the appropriate rights. This will likely need a WMCS admin to poke it.
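
(For reference, a rough sketch of how that state could be checked from the OpenStack CLI, assuming appropriate credentials and the instance name wcqs-beta-01; the columns are the standard nova extended-status fields:)

# report status / vm_state / power_state for the instance
openstack server show wcqs-beta-01 -c status -c OS-EXT-STS:vm_state -c OS-EXT-STS:power_state
# starting it is the step that reports a rights error for projectadmin here
openstack server start wcqs-beta-01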

Is this monitored by any of the status tools? Does it just need to be restarted?
(And is it normal for projectadmins to not have the ability to restart, or is that a feature of this being a complex service?)

dcaro changed the task status from Open to In Progress.Dec 13 2021, 5:47 PM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

The VM is in an error state:

This is what I get from the openstack API:

{'code': 500, 'created': '2021-12-10T19:19:35Z', 'message': 'OSError', 'details':
'Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 205, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3743, in reboot_instance
    do_reboot_instance(context, instance, block_device_info, reboot_type)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3742, in do_reboot_instance
    reboot_type)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3835, in _reboot_instance
    self._set_instance_obj_error_state(instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3810, in _reboot_instance
    bad_volumes_callback=bad_volumes_callback)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3250, in reboot
    block_device_info, accel_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3343, in _hard_reboot
    mdevs=mdevs, accel_info=accel_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6501, in _get_guest_xml
    context, mdevs, accel_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6146, in _get_guest_config
    flavor, guest.os_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4817, in _get_guest_storage_config
    inst_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4724, in _get_guest_disk_config
    conf = disk.libvirt_info(disk_info, self.disk_cachemode,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 556, in disk_cachemode
    if not nova.privsep.utils.supports_direct_io(CONF.instances_path):
  File "/usr/lib/python3/dist-packages/nova/privsep/utils.py", line 74, in supports_direct_io
    {'path': dirpath, 'ex': e})
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/privsep/utils.py", line 57, in supports_direct_io
    fd = os.open(testfile, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
  File "/usr/lib/python3/dist-packages/eventlet/green/os.py", line 118, in open
    fd = __original_open__(file, flags, mode)
OSError: [Errno 28] No space left on device: '/var/lib/nova/instances/.directio.test.5199023175537772324'
'}

I don't know what happened; this is the first time I've seen this error.
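
(The failing call in that traceback is nova's direct-I/O probe, which creates a test file under /var/lib/nova/instances with O_DIRECT. A roughly equivalent check from the shell on the hypervisor, sketched here only to illustrate why a full filesystem makes even a reboot fail, would be:)

# analogous to nova's supports_direct_io() probe; on a full instances
# filesystem this fails with "No space left on device"
dd if=/dev/zero of=/var/lib/nova/instances/.directio.test bs=4096 count=1 oflag=direct
rm -f /var/lib/nova/instances/.directio.test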

In T297454#7567004, @Sj wrote:

Is this monitored by any of the status tools? Does it just need to be restarted?

As a beta service running outside of production, this does not have access to much of the tooling we generally use to ensure uptime. Downtime of the beta service is considered routine and expected.

(And is it normal for projectadmins to not have the ability to restart, or is that a feature of this being a complex service?)

This likely has more to do with the beta service running on non-standard cloudvirts (the underlying hardware is different from the rest of WMCS; this service requires more resources than we typically allow in WMCS) that are special-cased for this one service. In general, projectadmins can start/stop/create/delete instances at will.

The hypervisor box it's running on has apparently filled its disk:

taavi@cloudvirt-wdqs1001 ~ $ df -h /var/lib/nova/instances/
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  3.3T  3.3T   20K 100% /var/lib/nova/instances
dcaro changed the task status from In Progress to Open.Dec 13 2021, 6:07 PM
dcaro moved this task from Doing to Today on the User-dcaro board.

The underlying filesystem is full; the disk size was set too big (3.4T). There's a cloudcanary instance taking 20G that can be removed to try to start the VM and clean things up, but we'll need to resize the disk later once the space is freed. Not sure if that's possible, but I will check.
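
(A sketch of how per-instance usage on the hypervisor could be listed to spot things like the 20G canary instance; the paths follow the standard nova layout and are assumptions here:)

# largest consumers under the instances filesystem, one entry per instance directory
du -xsh /var/lib/nova/instances/* 2>/dev/null | sort -h | tail -n 10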

dcaro moved this task from Today to Done on the User-dcaro board.

I've moved the other VM around, so there's a little bit of space free now. That should be enough to start the VM, but you'll have to make sure to clean up right when it comes back up; if it's not possible to get it to less than half its size, we will not be able to shrink the image.

Let me know how it goes.

Mentioned in SAL (#wikimedia-cloud) [2021-12-14T10:26:14Z] <dcaro> Moved the nova cache (/var/lib/nova/instances/_base) and the canary image local data (/var/lib/nova/instance/<canary_image_id>) to the root disk on cloudvirt-wdqs1001 to temporary free some space (T297454)
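
(A minimal sketch of that kind of temporary relocation, assuming a plain move plus symlink and an illustrative target path on the root disk; not necessarily the exact steps taken:)

# free space on /var/lib/nova/instances by relocating the image cache temporarily
mv /var/lib/nova/instances/_base /srv/nova-base-tmp
ln -s /srv/nova-base-tmp /var/lib/nova/instances/_base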


Oddly, it's still giving the initial error when starting the instance from Horizon; potentially that is a second issue, unrelated to the instance becoming unavailable? I re-verified that I am assigned projectadmin in wikidata-query.

Error: You are not allowed to start instance: wcqs-beta-01

I had to stop (shut off) and start the VM again to clear the error state. The VM is up and running, and what's using all the space is:

root@wcqs-beta-01:/srv/wdqs-data# du -hs *
4.0K    aliases.map
4.0K    dumps
28G     latest-mediainfo.ttl.gz
28G     munged
72K     sdoc.jnl
4.0K    target
3.1T    wcqs.jnl

I'll turn off the VM again so you can turn it on and clean up before it fills up again. Let me know how it goes.
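
(For reference, a sketch of the stop/start cycle described above using the OpenStack CLI, assuming admin rights on the project; the explicit state reset is only needed if the instance is stuck in ERROR:)

openstack server stop wcqs-beta-01
openstack server show wcqs-beta-01 -c status        # wait for SHUTOFF
openstack server start wcqs-beta-01
# if it stays in ERROR, the state can be reset first:
# openstack server set --state active wcqs-beta-01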

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:09Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:14Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:23Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:27Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-cloud) [2021-12-15T12:44:25Z] <dcaro> Downtiming cloudvirt-wdqs1001 as it has no VMs running until disk space is fixed (T297454)


Thanks! I was able to start the instance this time around. The wcqs.jnl can't be shrunk; it's a known problem with the beta installation. We've deleted the existing journal and started a fresh data load. It will take a few days before fully coming back online. The data load will fill 1T or so.
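
(Roughly, such a reload typically involves the standard WDQS munge.sh / loadData.sh tooling against the mediainfo dump already visible in the du listing above; the script locations, flags, and namespace below are assumptions, not the exact commands used on this instance:)

cd /srv/wdqs-data
# split/munge the RDF dump into load-sized chunks
/path/to/munge.sh -f latest-mediainfo.ttl.gz -d munged
# load the munged chunks into a fresh Blazegraph journal (takes days at this size)
/path/to/loadData.sh -n wdq -d /srv/wdqs-data/munged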

(reading up: Thanks EB)
I get a 500 error now... will this be back up before the general release on Feb 1?

Assuming it is released on time, yes.

The search team aren't going to put something unstable in production.

The production service will use its own dedicated hardware, separate from the beta service, though.

The beta service looks to have unintentionally picked up some of the configuration of the production cluster. I've put the configuration back and disabled puppet on the beta instance so it stops updating itself, which should keep things in the current state as we make changes to roll out the production service.

This brings the wcqs-beta.wmflabs.org site back to working status.
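
(A minimal sketch of that kind of pinning, assuming it was done by disabling the puppet agent on the instance; the reason string is illustrative:)

sudo puppet agent --disable "keep wcqs-beta config pinned during the production WCQS rollout - T297454"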

Mentioned in SAL (#wikimedia-cloud) [2022-01-10T13:56:24Z] <dcaro> Replace too big flavor t206636 for a one with a smaller disk, t206636v2 (T297454)

If you create any new VMs, or re-image the existing ones, use that new flavor; its slightly smaller disk will avoid running the host out of space.
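
(A sketch of checking the new flavor and using it for a new VM; the image, network, and instance names below are placeholders, not values from this task:)

openstack flavor show t206636v2 -c vcpus -c ram -c disk
openstack server create --flavor t206636v2 --image <debian-image> --network <project-network> <new-instance-name>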

I reopened this, as WCQS is again giving a 502 Bad Gateway error, though with slightly different behavior for me: wcqs-beta.wmflabs.org resolves, but returns the error as the response when clicking the "Run" button.

Agreed, here's the specific error coming back:

<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.13.9</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
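
(A quick way to reproduce this from the command line rather than the UI, assuming the query endpoint is /sparql on the same host and ignoring authentication:)

curl -s -o /dev/null -w '%{http_code}\n' \
  'https://wcqs-beta.wmflabs.org/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%201'
# returns 502 while the backend behind nginx is down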

dcausse subscribed.

Thanks for the report. Blazegraph died; I restarted it, so it should be available again now.
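
(For completeness, a sketch of checking and restarting the backend on the beta instance, assuming it uses the standard wdqs-blazegraph systemd unit; the unit name on this host may differ:)

sudo systemctl status wdqs-blazegraph --no-pager
sudo journalctl -u wdqs-blazegraph --since "1 hour ago" --no-pager   # look for the crash
sudo systemctl restart wdqs-blazegraph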