
Toolforge NFS tracing misses some dumps events
Open, Medium, Public

Description

I accessed a dumps file over NFS from a tool shell pod:

[tools.majavah-test@shell-1769005612 ~] $ cat /public/dumps/public/enwiki/latest/enwiki-latest-md5sums.txt | head
10754ee2ff42845e44c5a50ee2aa6372  enwiki-20260101-site_stats.sql.gz
06be099469550601557bd02a8c1ce8bf  enwiki-20260101-image.sql.gz
b20cf5bf318cc78e52e9aa3cfc87ef3f  enwiki-20260101-pagelinks.sql.gz
122f7eef64feb443f3fe4ca9fedc9e2b  enwiki-20260101-categorylinks.sql.gz
0f6135c33a1aa8d51b0e7e475d9d388f  enwiki-20260101-imagelinks.sql.gz
729da6aef4fa179b386c1ea250fc17dc  enwiki-20260101-templatelinks.sql.gz
e246fd0593bba08ea03688297c0266b4  enwiki-20260101-linktarget.sql.gz
1aa2b96a8786e2e897723347b8650673  enwiki-20260101-externallinks.sql.gz
929e2c4b731e7321bcb7399d3c8faa2e  enwiki-20260101-langlinks.sql.gz
b22753be8c3bab46ba6664e3b24aac2d  enwiki-20260101-user_groups.sql.gz

This is not visible in the tracing dashboard like I would expect it to be:

image.png (659×1 px, 86 KB)

strace shows the openat() call not resolving the symlink beforehand, which I presume is the issue:

openat(AT_FDCWD, "/public/dumps/public/enwiki/latest/enwiki-latest-md5sums.txt", O_RDONLY) = 3
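To check which component of such a path is a symlink, something like the following Python sketch could be run from the same shell. The helper name `symlink_components` is hypothetical and not part of any Toolforge tooling; it only walks the path as typed and reports every component that is a symlink, together with its target.

```python
import os
from typing import List, Tuple


def symlink_components(path: str) -> List[Tuple[str, str]]:
    """Return (component_path, link_target) for every symlink along path.

    The path is walked lexically, one component at a time, so the result
    shows where the string passed to openat() diverges from the real location.
    """
    found = []
    current = os.sep if os.path.isabs(path) else ""
    for part in path.split(os.sep):
        if not part:
            continue  # skip empty pieces produced by leading or doubled separators
        current = os.path.join(current, part)
        # islink() uses lstat(), so the final component itself is not followed
        if os.path.islink(current):
            found.append((current, os.readlink(current)))
    return found


if __name__ == "__main__":
    # Path taken from the report; the output depends on the host's layout.
    path = "/public/dumps/public/enwiki/latest/enwiki-latest-md5sums.txt"
    for component, target in symlink_components(path):
        print(f"{component} -> {target}")
```

An empty result would mean the path contains no symlinks, in which case the presumption above would need revisiting.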

Event Timeline

fnegri triaged this task as Medium priority. Wed, Jan 21, 4:10 PM

Thanks for the report; this is why I was asking whether we knew of tools using, for example, dumps. It looks like the pre-existing code had an optimization that excludes anything that doesn't start with /mnt/nfs, and I naively assumed that the path at that point had already been resolved, but it has not.
It looks like we could move the hook from the open/openat syscall to LSM_PROBE, which should be invoked after the kernel has resolved the path. (I've quickly checked and we seem to have support for it, but I'll check more extensively across the whole fleet of workers to be sure.)
I'm working on a patch to test the above assumptions.
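The effect of the prefix optimization can be illustrated in user space with a minimal Python sketch. This is not the tracing code itself. The directory layout is an assumption built in a temporary directory: a `public/dumps` symlink pointing below a stand-in `mnt/nfs` directory. The names `matches_prefix` and `demo` are hypothetical.

```python
import os
import tempfile
from typing import Tuple


def matches_prefix(path: str, prefix: str) -> bool:
    """Return True if path is the prefix itself or lies below it."""
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


def demo() -> Tuple[bool, bool]:
    """Compare a prefix check on the raw path with one on the resolved path."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.realpath(tmp)
        # Assumed layout: stand-in for the real NFS mount point
        prefix = os.path.join(base, "mnt", "nfs")
        real_dir = os.path.join(prefix, "dumps")
        os.makedirs(real_dir)
        with open(os.path.join(real_dir, "md5sums.txt"), "w") as fh:
            fh.write("example\n")

        # Stand-in for the user-facing path, which is a symlink
        public = os.path.join(base, "public")
        os.makedirs(public)
        os.symlink(real_dir, os.path.join(public, "dumps"))

        # This is the string a syscall-level hook sees
        requested = os.path.join(public, "dumps", "md5sums.txt")
        raw_match = matches_prefix(requested, prefix)
        resolved_match = matches_prefix(os.path.realpath(requested), prefix)
        return raw_match, resolved_match


if __name__ == "__main__":
    raw, resolved = demo()
    print(f"raw path matches prefix: {raw}")
    print(f"resolved path matches prefix: {resolved}")
```

Under these assumptions the raw path fails the prefix check while the resolved path passes it, which matches the missing events in the report.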

Change #1231034 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] wmcs: fix infra-tracing-nfs

https://gerrit.wikimedia.org/r/1231034

After quite a bit of testing, the LSM_PROBE route turned out not to be feasible: it depends on a kernel feature that is missing from the standard Debian kernel and that, AFAICT, can't just be enabled but would require recompiling the kernel. I've tried various hooks, including file_open and vfs_open, without success. I've come up with a solution that keeps the current approach and sent the above patch for it.

Change #1231034 merged by Volans:

[operations/puppet@production] wmcs: fix infra-tracing-nfs

https://gerrit.wikimedia.org/r/1231034

Change #1237251 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] wmcs: infra-tracing-nfs bail out earlier if root

https://gerrit.wikimedia.org/r/1237251

Change #1237251 merged by Volans:

[operations/puppet@production] wmcs: infra-tracing-nfs bail out earlier if root

https://gerrit.wikimedia.org/r/1237251