
WCQS gives "502 Bad Gateway Error"
Closed, ResolvedPublic

Description

Wikimedia Commons Query Service seems to be down, as https://wcqs-beta.wmflabs.org/ is giving me a "502 Bad Gateway Error". I have not used WCQS in a while, so maybe something changed, like the URL(?).

Event Timeline

This is still the correct URL. I don't have exact logs, but it looks like around 2021-12-09T17:44Z the instance stopped running; upon inspection it reports a status of error and a power state of paused. Around 20 minutes after I started poking it, the power state changed to No State. I have projectadmin rights for the project, but attempting to start the instance reports that I don't have the appropriate rights. This will likely need a WMCS admin to poke it.
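
(For reference, a rough sketch of how that state could be checked from the OpenStack CLI, assuming appropriate credentials and the instance name wcqs-beta-01; the columns are the standard nova extended-status fields:)

# report status / vm_state / power_state for the instance
openstack server show wcqs-beta-01 -c status -c OS-EXT-STS:vm_state -c OS-EXT-STS:power_state
# starting it is the step that reports a rights error for projectadmin here
openstack server start wcqs-beta-01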

Is this monitored by any of the status tools? Does it just need to be restarted?
(And is it normal for projectadmins to not have the ability to restart, or is that a feature of this being a complex service?)

dcaro changed the task status from Open to In Progress.Dec 13 2021, 5:47 PM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

The VM is in an error state:

This is what I get from the openstack API:

{'code': 500, 'created': '2021-12-10T19:19:35Z', 'message': 'OSError', 'details':
'Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 205, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3743, in reboot_instance
    do_reboot_instance(context, instance, block_device_info, reboot_type)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3742, in do_reboot_instance
    reboot_type)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3835, in _reboot_instance
    self._set_instance_obj_error_state(instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3810, in _reboot_instance
    bad_volumes_callback=bad_volumes_callback)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3250, in reboot
    block_device_info, accel_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3343, in _hard_reboot
    mdevs=mdevs, accel_info=accel_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6501, in _get_guest_xml
    context, mdevs, accel_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6146, in _get_guest_config
    flavor, guest.os_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4817, in _get_guest_storage_config
    inst_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4724, in _get_guest_disk_config
    conf = disk.libvirt_info(disk_info, self.disk_cachemode,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 556, in disk_cachemode
    if not nova.privsep.utils.supports_direct_io(CONF.instances_path):
  File "/usr/lib/python3/dist-packages/nova/privsep/utils.py", line 74, in supports_direct_io
    {'path': dirpath, 'ex': e})
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/privsep/utils.py", line 57, in supports_direct_io
    fd = os.open(testfile, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
  File "/usr/lib/python3/dist-packages/eventlet/green/os.py", line 118, in open
    fd = __original_open__(file, flags, mode)
OSError: [Errno 28] No space left on device: '/var/lib/nova/instances/.directio.test.5199023175537772324'
'}

I don't know what happened; this is the first time I've seen this error.
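
(The failing call in that traceback is nova's direct-I/O probe, which creates a test file under /var/lib/nova/instances with O_DIRECT. A roughly equivalent check from the shell on the hypervisor, sketched here only to illustrate why a full filesystem makes even a reboot fail, would be:)

# analogous to nova's supports_direct_io() probe; on a full instances
# filesystem this fails with "No space left on device"
dd if=/dev/zero of=/var/lib/nova/instances/.directio.test bs=4096 count=1 oflag=direct
rm -f /var/lib/nova/instances/.directio.test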

In T297454#7567004, @Sj wrote:

Is this monitored by any of the status tools? Does it just need to be restarted?

As a beta service running outside of production, this does not have access to much of the tooling we generally use to ensure uptime. Downtime of the beta service is considered routine and expected.

(And is it normal for projectadmins to not have the ability to restart, or is that a feature of this being a complex service?)

This likely has more to do with the beta service running on non-standard cloudvirts (the underlying hardware is different from the rest of WMCS; this service requires more resources than we typically allow in WMCS) that are special-cased for this one service. In general, projectadmins can start/stop/create/delete instances at will.

The hypervisor box it's running on has apparently filled its disk:

taavi@cloudvirt-wdqs1001 ~ $ df -h /var/lib/nova/instances/
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/tank-data  3.3T  3.3T   20K 100% /var/lib/nova/instances
dcaro changed the task status from In Progress to Open.Dec 13 2021, 6:07 PM
dcaro moved this task from Doing to Today on the User-dcaro board.

The underlying filesystem is full; the disk size was set too big (3.4T). There's a cloudcanary instance taking 20G that can be removed to try to start the VM and clean things up, but we'll need to resize the disk later once the space is freed. Not sure if that's possible, but I will check.
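
(A sketch of how per-instance usage on the hypervisor could be listed to spot things like the 20G canary instance; the paths follow the standard nova layout and are assumptions here:)

# largest consumers under the instances filesystem, one entry per instance directory
du -xsh /var/lib/nova/instances/* 2>/dev/null | sort -h | tail -n 10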

dcaro moved this task from Today to Done on the User-dcaro board.

I've moved the other VM around, so there's a little bit of space free now. That should be enough to start the VM, but you'll have to make sure to clean up right when it comes back up; if it's not possible to get it to less than half its size, we will not be able to shrink the image.

Let me know how it goes.

Mentioned in SAL (#wikimedia-cloud) [2021-12-14T10:26:14Z] <dcaro> Moved the nova cache (/var/lib/nova/instances/_base) and the canary image local data (/var/lib/nova/instance/<canary_image_id>) to the root disk on cloudvirt-wdqs1001 to temporary free some space (T297454)
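
(A minimal sketch of that kind of temporary relocation, assuming a plain move plus symlink and an illustrative target path on the root disk; not necessarily the exact steps taken:)

# free space on /var/lib/nova/instances by relocating the image cache temporarily
mv /var/lib/nova/instances/_base /srv/nova-base-tmp
ln -s /srv/nova-base-tmp /var/lib/nova/instances/_base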


Oddly, it's still giving the initial error when starting the instance from Horizon; potentially that is a second issue, unrelated to the instance becoming unavailable? I re-verified that I am assigned projectadmin in wikidata-query.

Error: You are not allowed to start instance: wcqs-beta-01

I had to stop (shut off) and start the VM again to clear the error state. The VM is up and running, and what's using all the space is:

root@wcqs-beta-01:/srv/wdqs-data# du -hs *
4.0K    aliases.map
4.0K    dumps
28G     latest-mediainfo.ttl.gz
28G     munged
72K     sdoc.jnl
4.0K    target
3.1T    wcqs.jnl

I'll turn off the VM again so you can turn it on and clean up before it fills up again. Let me know how it goes.
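
(For reference, a sketch of the stop/start cycle described above using the OpenStack CLI, assuming admin rights on the project; the explicit state reset is only needed if the instance is stuck in ERROR:)

openstack server stop wcqs-beta-01
openstack server show wcqs-beta-01 -c status        # wait for SHUTOFF
openstack server start wcqs-beta-01
# if it stays in ERROR, the state can be reset first:
# openstack server set --state active wcqs-beta-01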

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:09Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:14Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:23Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:43:27Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454

Mentioned in SAL (#wikimedia-cloud) [2021-12-15T12:44:25Z] <dcaro> Downtiming cloudvirt-wdqs1001 as it has no VMs running until disk space is fixed (T297454)


Thanks! I was able to start the instance this time around. The wcqs.jnl can't be shrunk; it's a known problem with the beta installation. We've deleted the existing journal and started a fresh data load. It will take a few days before fully coming back online. The data load will fill 1T or so.
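
(Roughly, such a reload typically involves the standard WDQS munge.sh / loadData.sh tooling against the mediainfo dump already visible in the du listing above; the script locations, flags, and namespace below are assumptions, not the exact commands used on this instance:)

cd /srv/wdqs-data
# split/munge the RDF dump into load-sized chunks
/path/to/munge.sh -f latest-mediainfo.ttl.gz -d munged
# load the munged chunks into a fresh Blazegraph journal (takes days at this size)
/path/to/loadData.sh -n wdq -d /srv/wdqs-data/munged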

(reading up: Thanks EB)
I get a 500 error now... will this be back up before the general release on Feb 1?

Assuming it is released on time, yes.

The search team aren't going to put something unstable in production.

The production service will use its own dedicated hardware, separate from the beta service, though.

The beta service looks to have unintentionally picked up some of the configuration of the production cluster. I've put the configuration back and disabled puppet on the beta instance so it stops updating itself, which should keep things in the current state as we make changes to roll out the production service.

This brings the wcqs-beta.wmflabs.org site back to working status.
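
(A minimal sketch of that kind of pinning, assuming it was done by disabling the puppet agent on the instance; the reason string is illustrative:)

sudo puppet agent --disable "keep wcqs-beta config pinned during the production WCQS rollout - T297454"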

Mentioned in SAL (#wikimedia-cloud) [2022-01-10T13:56:24Z] <dcaro> Replace too big flavor t206636 for a one with a smaller disk, t206636v2 (T297454)

If you create any new VMs, or re-image the existing ones, use that new flavor; its slightly smaller disk will avoid running the host out of space.
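
(A sketch of checking the new flavor and using it for a new VM; the image, network, and instance names below are placeholders, not values from this task:)

openstack flavor show t206636v2 -c vcpus -c ram -c disk
openstack server create --flavor t206636v2 --image <debian-image> --network <project-network> <new-instance-name>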

I reopened this, as WCQS is again giving a 502 Bad Gateway error, though with slightly different behavior for me: wcqs-beta.wmflabs.org resolves, but returns the error as the response when clicking the "Run" button.

Agreed, here's the specific error coming back:

<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.13.9</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
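
(A quick way to reproduce this from the command line rather than the UI, assuming the query endpoint is /sparql on the same host and ignoring authentication:)

curl -s -o /dev/null -w '%{http_code}\n' \
  'https://wcqs-beta.wmflabs.org/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%201'
# returns 502 while the backend behind nginx is down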

dcausse subscribed.

Thanks for the report. Blazegraph died; I restarted it, so it should be available again now.
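
(For completeness, a sketch of checking and restarting the backend on the beta instance, assuming it uses the standard wdqs-blazegraph systemd unit; the unit name on this host may differ:)

sudo systemctl status wdqs-blazegraph --no-pager
sudo journalctl -u wdqs-blazegraph --since "1 hour ago" --no-pager   # look for the crash
sudo systemctl restart wdqs-blazegraph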