`become` command not working properly on login-buster.toolforge.org
Closed, Resolved (Public)

Description

After logging into login-buster.toolforge.org via SSH, I attempted to switch to my tool account using the following command:

become tedbot

However, the terminal becomes unresponsive and hangs indefinitely with no output.
I was able to forcibly enter the environment using this workaround:

sudo -u tools.tedbot sh -c 'exec sh'

Even then, running basic commands such as the following also causes the shell to hang with no response:

cd ~
ls

For reference, I ran the following to check the status of the home directory, and it appears to exist and be accessible at the metadata level:

stat ~

The output confirms the directory exists with proper ownership and permissions:

  File: /data/project/tedbot
  Size: 4096      	Blocks: 8          IO Block: 32768  directory
Device: 34h/52d	Inode: 6424921     Links: 32
Access: (2775/drwxrwsr-x)  Uid: (52765/tools.tedbot)   Gid: (52765/tools.tedbot)
Access: 2025-04-09 15:52:12.680684710 +0000
Modify: 2025-04-09 15:52:11.604675503 +0000
Change: 2025-04-09 15:52:11.604675503 +0000
 Birth: -

It seems that reading or listing the directory contents (e.g., ls ~) may be where the system is blocking.
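
The difference between `stat ~` returning and `ls ~` hanging can be checked without losing the shell by putting a time limit on the listing. The following is a minimal sketch, not a Toolforge tool: `probe_dir` is a hypothetical helper name, the 5-second default is an arbitrary assumption, and it relies on `timeout` from GNU coreutils. If the mount is stuck badly enough that the process cannot be killed, the probe itself may still block.

```shell
#!/bin/sh
# probe_dir DIR [SECONDS]: hypothetical helper, not part of Toolforge.
# Prints "ok" if listing DIR finishes within the time limit,
# "hang" if the listing had to be terminated, "error" for any other failure.
probe_dir() {
    dir=$1
    limit=${2:-5}   # assumed default of 5 seconds
    # timeout(1) exits with 124 when the time limit is reached.
    # -k 1 follows up with SIGKILL one second later if SIGTERM is not enough,
    # in which case the exit status is 137 (128 + 9).
    timeout -k 1 "$limit" ls -A -- "$dir" >/dev/null 2>&1
    status=$?
    case $status in
        0)       echo ok ;;
        124|137) echo hang ;;
        *)       echo error ;;
    esac
}
```

For the case in this report, it could be run as `probe_dir /data/project/tedbot` from the bastion; a result of `hang` would point at the directory read rather than at `become` itself.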

Event Timeline

Restricted Application added a subscriber: Aklapper.
bd808 renamed this task from "become command not working properly on Toolforge" to "`become` command not working properly on login-buster.toolforge.org". Apr 9 2025, 11:18 PM

Load average on tools-sgebastion-10 (login-buster.toolforge.org) is 24. My guess is that the NFS connection for the tool home directories is messed up.
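
On Linux, the load average counts tasks in uninterruptible sleep (state `D`) as well as runnable ones, so processes blocked on an unresponsive NFS mount can push the load up with little CPU use. A rough way to check that guess is sketched below. The helper names `load1` and `count_dstate` are hypothetical, and the sketch assumes a Linux host with `/proc/loadavg` and a `ps` that accepts `-o state=`.

```shell
#!/bin/sh
# load1 [FILE]: print the 1-minute load average, which is the first
# space-separated field of a /proc/loadavg style file.
load1() {
    cut -d ' ' -f 1 "${1:-/proc/loadavg}"
}

# count_dstate: read "STATE PID COMMAND" lines on stdin and print how many
# processes are in state D (uninterruptible sleep, typically waiting on I/O).
count_dstate() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}
```

A possible invocation is `load1` followed by `ps -e -o state= -o pid= -o comm= | count_dstate`. A high load together with many `D` processes and low CPU usage would be consistent with blocked I/O, although it does not by itself prove that NFS is the cause.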

Mentioned in SAL (#wikimedia-cloud) [2025-04-09T23:23:05Z] <bd808> Rebooting tools-sgebastion-10 (login-buster.toolforge.org) for high load/unresponsive NFS mounts (T391538)

A `shutdown -r now` reboot was unresponsive, so I did a hard reboot via Horizon. The instance's console log also had:

[185770.523700] Memory cgroup out of memory: Kill process 3847 (wiki.pl --diff) score 903 or sacrifice child
[185770.525873] Killed process 3847 (wiki.pl --diff) total-vm:950992kB, anon-rss:938280kB, file-rss:7380kB, shmem-rss:0kB
[443413.891654] Memory cgroup out of memory: Kill process 2170 (wiki.pl --diff) score 894 or sacrifice child
[443413.893630] Killed process 2170 (wiki.pl --diff) total-vm:941180kB, anon-rss:928448kB, file-rss:7948kB, shmem-rss:0kB

Thanks so much, it's working perfectly now! Really appreciate your help and quick support.

bd808 claimed this task.

Hi, unfortunately I'm encountering the same issue again. The become command is still not working properly on login-buster.toolforge.org, just like before.

Previously, this was resolved by performing a hard reboot via Horizon, as shutdown -r now was unresponsive. Would you mind applying the same fix again? Thank you!

dcaro subscribed.

@Ykhwong this was caused by a wider outage in Toolforge and should be working again. Please reopen if you still face issues.

Thanks for the update.
However, I'm still experiencing the issue. When I run the become command on login-buster.toolforge.org, it hangs and does not proceed. This seems to persist even after the outage has been resolved.
Could you please take another look when you get a chance? Happy to provide any additional details if needed. Thanks again!

Yep, it seems it's still hanging (note that it does not happen with all tools: wm-lol worked, but tedbot does not). I'll reboot 👍

Thanks for the reboot — the issue seems to be resolved now. become is working properly again on login-buster.toolforge.org.
Appreciate the help!

@Ykhwong awesome :), may I ask why you are using the old buster bastion and not the newer one? (So we can provide whatever is missing for you to move, as buster has been EOL for a while.)

Oh, I didn't realize I was still using the old buster bastion. Thanks for letting me know. I'll check out the migration guide and start transitioning to the newer one.