[fourohfour,infra] Got a lot of alerts this weekend
Closed, Resolved
Public

Assigned To
Authored By
dcaro
Jul 9 2024, 8:50 AM
Referenced Files
F56307389: image.png
Jul 9 2024, 8:56 AM
F56307358: image.png
Jul 9 2024, 8:54 AM
F56307356: image.png
Jul 9 2024, 8:54 AM
Description

The fourohfour tool has been a bit unstable lately. This task tracks and documents the changes made in this round to improve that (for the previous round, see T335680: ProbeDown tools http_this_tool_does_not_exist_toolforge_org_ip4: tool fourohfour flapping).

Partial fix:
https://gitlab.wikimedia.org/toolforge-repos/fourohfour/-/merge_requests/2

Event Timeline

dcaro triaged this task as High priority. Jul 9 2024, 8:50 AM

Found that the logs were growing too much (>300GB in tools), so added a logrotate job:

tools.fourohfour@tools-bastion-13:~$ toolforge jobs list -o long
+-----------+---------------------------------------------------------------------------------------+------------------+---------+-------+-----------+----------------------------------------+----------------------------------------+---------+------------+---------+--------+---------------+----------------------------+
| Job name: |                                       Command:                                        |    Job type:     | Image:  | Port: | File log: |              Output log:               |               Error log:               | Emails: | Resources: | Mounts: | Retry: | Health check: |          Status:           |
+-----------+---------------------------------------------------------------------------------------+------------------+---------+-------+-----------+----------------------------------------+----------------------------------------+---------+------------+---------+--------+---------------+----------------------------+
| logrotate | logrotate -v $TOOL_DATA_DIR/logrotate-all.conf --state $TOOL_DATA_DIR/logrotate.state | schedule: @daily | mariadb | none  |    yes    | /data/project/fourohfour/logrotate.out | /data/project/fourohfour/logrotate.err |  none   |  default   |   all   |   no   |     none      | Waiting for scheduled time |
+-----------+---------------------------------------------------------------------------------------+------------------+---------+-------+-----------+----------------------------------------+----------------------------------------+---------+------------+---------+--------+---------------+----------------------------+

They seem to grow by several hundred MB per day, so the logrotate might have to be made a bit more aggressive.
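For reference, a minimal sketch of what a stanza in such a logrotate config could look like. This is not the actual contents of logrotate-all.conf: the helper name `logrotate_stanza`, the retention count, and the log path passed to it are assumptions for illustration. The directives themselves (`daily`, `rotate`, `compress`, `missingok`, `notifempty`, `copytruncate`) are standard logrotate options.

```shell
#!/usr/bin/env bash
# Sketch only: print a logrotate stanza for a single log file.
# The retention count and the path used below are assumptions,
# not what logrotate-all.conf actually contains.
logrotate_stanza() {
    local logfile=$1 keep=${2:-7}
    cat <<EOF
$logfile {
    daily
    rotate $keep
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Hypothetical usage: keep 7 compressed daily rotations of the uwsgi log.
logrotate_stanza /data/project/fourohfour/uwsgi.log 7
```

`copytruncate` is chosen here on the assumption that the process keeps the log file open and is not signalled to reopen it. Making the rotation "more aggressive" would then mostly mean a lower `rotate` count or running the job more often than @daily.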

Also deployed the custom branch that caches LDAP and disk access to reduce I/O. So far it seems to be working a tad better; CPU usage went down:

image.png (360×873 px, 14 KB)

And memory usage:

image.png (313×926 px, 18 KB)

Though this might be an artifact of just restarting the job, so I will wait a bit and see if it actually helped.

The traffic in the last two days does not seem suspicious, though one worker (nfs-37) had NFS issues, which might have affected the tool too:

image.png (242×1 px, 64 KB)

Looking closer at the logs, there are a lot of entries like:

[WARNING] unable to add HTTP_ACCEPT_ENCODING=gzip, deflate, br to uwsgi packet, consider increasing buffer size

Will try increasing the buffer size to see if that helps (we have memory to spare).
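A minimal sketch of the kind of change meant here, assuming the tool's uwsgi options live in an ini file. The warning above is what uwsgi logs when the request headers do not fit in its request buffer, which is controlled by the `buffer-size` option (documented default: 4096 bytes). The helper name `uwsgi_buffer_fragment` and the value 8192 are illustrative assumptions, not the value deployed in this task.

```shell
#!/usr/bin/env bash
# Sketch only: print a uwsgi ini fragment that raises the request buffer size.
# 8192 is an illustrative default, not the value actually deployed.
uwsgi_buffer_fragment() {
    local bytes=${1:-8192}
    printf '[uwsgi]\nbuffer-size = %d\n' "$bytes"
}

# Hypothetical usage: double the documented 4096-byte default.
uwsgi_buffer_fragment 8192
```

The buffer is allocated per request, so a larger value costs some memory for every worker, which is why having memory available matters.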

Changed the uwsgi config, let's see the log growth in an hour (the log grew to about 10MB in the 2h since the previous restart at 09/Jul/2024:07:22:20):

tools.fourohfour@tools-bastion-13:~$ date; ls -lrth uwsgi.log
Tue Jul  9 09:29:08 UTC 2024
-rw-r----- 1 tools.fourohfour tools.fourohfour 9.9M Jul  9 09:28 uwsgi.log

Yep, the logs seem way less busy (9.9M to 11M in just under 3h):

tools.fourohfour@tools-bastion-13:~$ date; ls -lrth uwsgi.log
Tue Jul  9 12:21:20 UTC 2024
-rw-r----- 1 tools.fourohfour tools.fourohfour 11M Jul  9 12:21 uwsgi.log
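To put rough numbers on the two snapshots above, here is a small sketch that turns a pair of (size, timestamp) readings into MB per hour. The function name `growth_mb_per_hour` is hypothetical, it relies on GNU `date -d`, and the sizes are the rounded values printed by `ls -h`, so the result is only approximate. The "before" figure also assumes the log was empty at the 07:22:20 restart.

```shell
#!/usr/bin/env bash
# Approximate log growth rate between two readings, in MB per hour.
# Arguments: size_before_mb time_before size_after_mb time_after
# Requires GNU date (for -d); timestamps are interpreted as UTC.
growth_mb_per_hour() {
    local size_before_mb=$1 time_before=$2 size_after_mb=$3 time_after=$4
    local secs_before secs_after
    secs_before=$(date -u -d "$time_before" +%s)
    secs_after=$(date -u -d "$time_after" +%s)
    LC_ALL=C awk -v a="$size_before_mb" -v b="$size_after_mb" \
        -v s=$((secs_after - secs_before)) \
        'BEGIN { printf "%.2f\n", (b - a) * 3600 / s }'
}

# Before the config change: 0 -> 9.9M between the restart and the first ls.
growth_mb_per_hour 0 "2024-07-09 07:22:20" 9.9 "2024-07-09 09:29:08"

# After the config change: 9.9M -> 11M between the two ls calls.
growth_mb_per_hour 9.9 "2024-07-09 09:29:08" 11 "2024-07-09 12:21:20"
```

Under those assumptions this comes out at roughly 4.7 MB/h before the change and 0.4 MB/h after it.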

Seems to be going smoother now; will open a new task if it happens again.

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 12) board.