
Fix for Backstop reference file not found errors
Closed, ResolvedPublic

Description

There was a broken symlink on the server, which I think was responsible for this.

Edit: it turned out the symlink was indeed an issue, but there was an additional problem as well - see comments.

Event Timeline

Mhurd triaged this task as Unbreak Now! priority.Jun 25 2024, 11:36 PM

Fixed and running all groups currently.

This symlink from Pixel's report to the data volume was broken:

/home/pixel/pixel/report -> /mnt/pixel-data/reports

So I think what must have happened is this: since "report" wasn't in Pixel's root .gitignore, our symlink showed up as an untracked change after we created it on the server. Server changes were then probably stashed at some point (with untracked files included), erasing the symlink.
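For what it's worth, a plain `git stash` leaves untracked files alone; it takes `git stash -u`/`--include-untracked` (or a cleanup like `git clean -fd`) to wipe them. A hedged reproduction in a throwaway repo (all paths here are made up):

```shell
# Reproduce how a stash can erase an untracked symlink.
set -e
repo=$(mktemp -d); data=$(mktemp -d)
cd "$repo"
git -c init.defaultBranch=main init -q .
git -c user.email=x@x -c user.name=x commit -q --allow-empty -m init
ln -s "$data" report            # untracked symlink, like /home/pixel/pixel/report
echo x > tracked.txt            # a tracked change so there is something to stash
git add tracked.txt
git -c user.email=x@x -c user.name=x stash -q --include-untracked
[ ! -L report ] && echo "symlink gone"
```

So if whoever stashed used `-u` (or ran a clean afterwards), that would account for the symlink disappearing.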

For the fix I've ensured Pixel creates the report dir if it's not present, and I added the report dir to Pixel's root .gitignore, so our symlinked report dir doesn't get nuked if we try something on the server and stash it.

https://github.com/wikimedia/pixel/pull/348
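The two-part fix sketches out roughly like this (hedged - run here in a temp dir with assumed names; the real change is in the PR above):

```shell
# Sketch of the fix: create the report dir only when missing, and ignore it so
# it can never show up as an untracked change that a stash could wipe out.
cd "$(mktemp -d)"
REPORT_DIR="report"
mkdir -p "$REPORT_DIR"          # no-op if it already exists (or is a symlink to a dir)
grep -qx "$REPORT_DIR" .gitignore 2>/dev/null || echo "$REPORT_DIR" >> .gitignore
```

The `grep -qx` guard keeps the .gitignore entry from being appended twice if the setup runs repeatedly.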

Update - the symlink issue had broken Pixel, but it was not actually responsible for the reference-files-not-found errors.

Still working on it

I think the steps below fixed the extremely slow performance on the server, which likely accounted for the file-related errors.


First I ran iostat -xz 1 10000 and watched "%util" while rebuild.sh ran - it was pegged at or near 100%.
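That eyeballing step can also be scripted; a hedged one-liner, assuming %util is the last field on iostat -x device lines (the sample numbers below are fabricated):

```shell
# Print devices whose %util exceeds 90 (assumes %util is the last column,
# as on `iostat -x` device lines; the sample input is made up):
printf 'sdb 210.0 99.8\nsda 4.0 3.2\n' | awk '$NF+0 > 90 { print $1 }'
# prints: sdb
```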

Then, after stopping Docker, I ran this, which showed one process in the uninterruptible sleep state - i.e. with D in the STAT column:

root@production:/mnt/docker# ps aux | awk '$8 ~ /D/ { print $0 }'
root     1644197  0.0  0.0      0     0 ?        D    17:35   0:04 [kworker/u8:4+events_unbound]

Next I ran this, which caused that process to go away (a process stuck in D state ignores signals, so it couldn't have simply been killed directly):

sudo umount -l /mnt/docker
sudo mount /dev/sdb1 /mnt/docker

After this, the iostat -xz 1 10000 output no longer showed "%util" pegged at or near 100%.


ChatGPT said this:

The process kworker/u8:4+events_unbound in the D (uninterruptible sleep) state, which was causing high disk utilization, suggests that it was likely stuck in an I/O operation that could not be interrupted. Based on the stack trace, it was stuck in a discard operation on the ext4 filesystem, indicating an issue related to the trimming of unused blocks.
Possible Causes:

- Filesystem TRIM operations: automatic TRIM operations (discard) might have caused the kernel worker process to get stuck. These operations are intended to inform the storage device about unused blocks so it can manage the storage more efficiently. However, issues can arise if the underlying storage device or its firmware doesn't handle these operations well.
- Filesystem corruption or inconsistency: corruption or inconsistencies in the filesystem might have led to issues with discard operations, causing the kernel worker process to get stuck.
- Hardware issues: problems with the storage hardware, such as a failing disk or controller, could cause I/O operations to hang.
- High load or resource contention: high system load or contention for I/O resources could exacerbate the issue, leading to processes getting stuck in uninterruptible sleep states.
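Of those, the first cause is at least checkable: continuous TRIM shows up as a "discard" entry in the mount options (the fourth field of /proc/mounts). A small helper for that check - the /mnt/docker path in the comment is this server's, everything else is a sketch:

```shell
# Return yes/no depending on whether a mount-option string contains "discard"
# (i.e. continuous TRIM is enabled for that mount).
has_discard() {
  case ",$1," in
    *,discard,*) echo yes ;;
    *)           echo no  ;;
  esac
}
has_discard "rw,relatime,discard"   # yes
has_discard "rw,relatime"           # no
# Live check: awk '$2 == "/mnt/docker" { print $4 }' /proc/mounts
```

If continuous discard did turn out to be the culprit, the usual remedy is dropping the mount option and relying on periodic trimming instead (e.g. systemd's fstrim.timer).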

The fix above seems to have worked. Run times are back in the expected range, and I'm no longer seeing the file-not-found errors.

Seemed to run OK over the weekend. Marking this as resolved.

Had the same issue again...

Ran these commands to fix it:

sudo su
systemctl stop docker
umount -l /mnt/docker
mount /dev/sdb1 /mnt/docker
systemctl start docker

That only sped things up for a couple of minutes.

So I ran this after sudo su:
root@production:/home/mhurd# cat /sys/block/sdb/queue/scheduler

Which showed there was no scheduler enabled, but mq-deadline was available:
[none] mq-deadline

So I enabled it:
root@production:/home/mhurd# echo mq-deadline > /sys/block/sdb/queue/scheduler

And confirmed the change took effect:

cat /sys/block/sdb/queue/scheduler
[mq-deadline] none

Then to make the scheduler change persist across reboots I added this file:
/etc/udev/rules.d/60-scheduler.rules

...containing this line:
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/scheduler}="mq-deadline"
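Assuming standard udev tooling, the rule should also be applicable without waiting for a reboot, by reloading the rules and replaying the change event for sdb (flags per udevadm(8) - worth verifying on this distro; not run here since it needs root on the actual server):

```shell
# Apply the new udev rule immediately (run as root); a sketch, not verified.
udevadm control --reload-rules
udevadm trigger --action=change --name-match=sdb
cat /sys/block/sdb/queue/scheduler   # expect: [mq-deadline] none
```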

So far it seems lightning fast. Going to reboot to confirm the scheduler setting persists...