
Fix for Backstop reference file not found errors
Open, Unbreak Now!, Public

Description

There was a broken symlink on the server, which I think was responsible for this.

Event Timeline

Mhurd triaged this task as Unbreak Now! priority. Tue, Jun 25, 11:36 PM

Fixed and running all groups currently.

This symlink from Pixel's report directory to the data volume was broken:

/home/pixel/pixel/report -> /mnt/pixel-data/reports
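
Recreating it is presumably just a matter of re-pointing the link at the data volume (paths as above):

sudo ln -sfn /mnt/pixel-data/reports /home/pixel/pixel/report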

So I think what must have happened is: since "report" wasn't in Pixel's root .gitignore, the symlink we created on the server showed up there as an untracked change. Server changes were then probably stashed at some point, erasing that untracked symlink.

For the fix I've ensured Pixel creates the report dir only if it's not already present, and I added the report dir to Pixel's root .gitignore, so our symlinked report dir doesn't get nuked if we try something on the server and stash it.

https://github.com/wikimedia/pixel/pull/348
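
Roughly, the change boils down to something like this (a sketch; the actual implementation is in the PR above):

echo 'report' >> .gitignore   # ignore the report dir/symlink so a stash can't wipe it
mkdir -p report               # create the report dir only if it doesn't already exist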

Update - the symlink issue had broken Pixel, but it was not actually responsible for the "reference file not found" errors.

Still working on it.

I think this fixed the super slow performance on the server, which likely accounted for the file-related errors.


First I ran iostat -xz 1 10000 and watched "%util" while rebuild.sh ran - it was pegged at or near 100%.

Then, after stopping Docker, I ran this, which showed one process in the uninterruptible sleep state - i.e. with D in the STAT column:

root@production:/mnt/docker# ps aux | awk '$8 ~ /D/ { print $0 }'
root     1644197  0.0  0.0      0     0 ?        D    17:35   0:04 [kworker/u8:4+events_unbound]
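
For reference, the stack trace mentioned below was presumably captured by reading the stuck task's kernel stack, something like:

sudo cat /proc/1644197/stack                   # kernel stack of the stuck kworker
sudo dmesg | grep -i 'blocked for more than'   # hung-task warnings, if any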

Next I ran this, which caused that process to go away (a kworker stuck in D state generally can't be killed directly, so the remount was probably the right call anyway):

sudo umount -l /mnt/docker
sudo mount /dev/sdb1 /mnt/docker
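
For what it's worth, the remount can be sanity-checked with something like:

findmnt /mnt/docker   # should show /dev/sdb1 mounted at /mnt/docker again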

After this, iostat -xz 1 10000 output no longer showed "%util" pegged at or near 100%.


ChatGPT said this:

The process kworker/u8:4+events_unbound in the D (uninterruptible sleep) state, which was causing high disk utilization, suggests that it was likely stuck in an I/O operation that could not be interrupted. Based on the stack trace, it was stuck in a discard operation on the ext4 filesystem, indicating an issue related to the trimming of unused blocks.
Possible Causes:
Filesystem TRIM Operations:
Automatic TRIM operations (discard) might have caused the kernel worker process to get stuck. These operations are intended to inform the storage device about unused blocks so it can manage the storage more efficiently. However, issues can arise if the underlying storage device or its firmware doesn't handle these operations well.
Filesystem Corruption or Inconsistency:
Corruption or inconsistencies in the filesystem might have led to issues with discard operations, causing the kernel worker process to get stuck.
Hardware Issues:
Problems with the storage hardware, such as a failing disk or controller, could cause I/O operations to hang.
High Load or Resource Contention:
High system load or contention for I/O resources could exacerbate the issue, leading to processes getting stuck in uninterruptible sleep states.
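
If the discard mount option turns out to be in play here, a common follow-up (not done yet, just noting it) would be to check for it and switch to periodic trimming instead of inline discard:

findmnt -o TARGET,OPTIONS /mnt/docker      # check whether the fs is mounted with 'discard'
sudo systemctl enable --now fstrim.timer   # periodic fstrim (on systemd distros) instead of inline discard
# then remove 'discard' from the /mnt/docker entry in /etc/fstab, if present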