Toolforge grid automation: consider creating a cookbook to heal the grid from D state procs
Closed, Declined (Public)

Assigned To

None

Authored By

	aborrero
	May 5 2023, 9:27 AM

Description

@dcaro found https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html, which explains why a D state process results in increased loadavg on Linux servers.

If an otherwise harmless NFS hiccup results in D state processes on the exec nodes, the load average goes up as a result, and the grid schedules jobs based on load average (just a theory at this point), then the failure mode is clear:

Any NFS hiccup (otherwise harmless) can result in the Grid becoming unavailable and/or unreliable.

We may consider creating a cookbook that scans the grid for D state procs and reboots affected nodes as an automated healing mechanism.
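
To make the detection half of that idea concrete, here is a minimal sketch of a per-node check that lists processes currently in D state by reading `/proc/<pid>/stat`. It is not the cookbook from the related patch; the function names `parse_stat` and `find_d_state_procs` are hypothetical, and it assumes it runs locally on a Linux exec node. Rebooting is deliberately left out.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def find_d_state_procs(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in find_d_state_procs():
        print(f"{pid}\t{comm}")
```

Processes enter D state briefly during normal I/O, so a single snapshot like this would probably produce false positives; a real check would likely need to sample more than once and only flag processes that stay in D state across samples (an assumption, not something established in this task).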

Details

	Subject	Repo	Branch	Lines +/-
	cloud vps: add cookbook to check nodes for proccess in D state	cloud/wmcs-cookbooks	main	+107 -0


Related Objects

Status	Assigned	Task
		Restricted Task
		Restricted Task
Resolved	• Bstorm	T169289 Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues
Resolved	MoritzMuehlenhoff	T169290 New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS
Resolved	• Bstorm	T203254 labstore1004 and labstore1005 high load issues following upgrades
Declined	None	T257945 NFS v4.1/2 as possible fix for elevated load and lock contention on our NFS servers
Resolved	dcaro	T335009 Toolforge grid seems overloaded
Declined	None	T336034 Toolforge grid automation: consider creating a cookbook to heal the grid from D state procs

Event Timeline

aborrero created this task. May 5 2023, 9:27 AM

aborrero mentioned this in T336681: Agree how to track/find all WMCS tasks that have a common topic, but belong to different projects. May 15 2023, 2:52 PM

aborrero added a parent task: T257945: NFS v4.1/2 as possible fix for elevated load and lock contention on our NFS servers. May 15 2023, 4:08 PM

Change 919868 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] toolforge: add cookbook to check nodes for proccess in D state

https://gerrit.wikimedia.org/r/919868

gerritbot added a project: Patch-For-Review. May 15 2023, 4:58 PM

aborrero edited projects, added Cloud-Services; removed Toolforge. Jun 19 2023, 3:31 PM

aborrero moved this task from Triage to Automation on the Cloud-Services board.

aborrero edited projects, added Toolforge, User-aborrero; removed Cloud-Services. Jun 20 2023, 10:32 AM

aborrero moved this task from Backlog to Automation on the User-aborrero board.

No more work is going to be done on the grid :), we are retiring it.
