User Details
- User Since
- Dec 15 2021, 9:19 PM (150 w, 1 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
BKing (WMF)
Today
Forgive the drive-by comment, but I'm wondering if we have evaluated any other NICs besides Broadcom? We've lost countless hours to their firmware bugs (at least ~100 of my team's hosts have been affected in the ~3 years I've worked here). That's a pretty significant cost if you think about our salaries, opportunity costs, etc.
OK, this is now fixed...closing again.
Yesterday
@Seppl2013 I recommend sysstat (whose main reporting tool is sar) for tracking memory and load. By default it takes a sample every 10 minutes, and you can see the memory stats with sar -r. Let us know if you have any other questions.
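In case it helps, here's roughly what that looks like on a Debian host (paths and flags below assume the stock sysstat defaults; adjust if your setup changes them):

```
# Memory utilization from today's samples (collected every 10 minutes by the sysstat timer)
sar -r

# Run queue and load averages over the same intervals
sar -q

# Replay an older day's data file (Debian's default location; "21" = day of the month)
sar -r -f /var/log/sysstat/sa21
```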
Reopening, as the monitors are using the default port 443 and we need them to use the correct port per cluster.
Contrary to my prior statement, I no longer believe that disabling NUMA is necessary (see this comment for more details).
On the DPE side, I've confirmed that the host is back up and part of the cluster using these instructions (which I just added myself). Moving to "done" on our workboard...
Tue, Oct 29
CR for new hosts merged per @RobH's instructions above. Unassigning...
DC Ops, this host is hard down; feel free to replace the RAM or take any other action needed to restore it to working condition at your convenience (this is not an emergency).
Mon, Oct 28
I think migrating the test instance is a good AC for this task; we can create a new task or tasks for migrating the production instances. Closing...
Based on /etc/wikimedia/contacts.yaml, these hosts are owned by Data Persistence.
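For reference, a quick way to spot-check ownership on a host (the role key below is a placeholder; the exact key names depend on how contacts.yaml is organized):

```
# Look up the owning team for a given Puppet role (role name is hypothetical)
grep -A2 'role::example_role' /etc/wikimedia/contacts.yaml
```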
Thanks to @BTullis for pointing out this Puppet code. I now believe that this code, not NUMA, was causing the hosts to seize up at 50% RAM utilization. The large gap between MemoryHigh (where the system starts to aggressively reclaim memory) and MemoryMax (where it actually kills the process) left the system in a state it could not recover from. Turning off NUMA helped, but did not fix the root cause.
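For anyone following along, here's how to see those two thresholds on a host (user.slice is just an example target; the actual limits come from the Puppet code linked above):

```
# systemd's view of the memory limits on a slice
systemctl show -p MemoryHigh -p MemoryMax user.slice

# The same thresholds plus current usage, as exposed by the cgroup v2 filesystem
cat /sys/fs/cgroup/user.slice/memory.high \
    /sys/fs/cgroup/user.slice/memory.max \
    /sys/fs/cgroup/user.slice/memory.current
```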
Fri, Oct 25
This has been implemented per the above patch. Closing...
Thu, Oct 24
Today is the one-year anniversary of this ticket! As Ben pointed out, it's pretty vague. Since I haven't followed up and our upstream Helm chart policy has matured, it's probably past time to close it. We can always follow up with more specific goals as time permits.
Closing as duplicate of T375404...
Wed, Oct 23
As shown by this dashboard, I've run the cleanup script and usage has fallen back below the alert threshold.
Closing as duplicate of T375109.
Forgive the drive-by comment, but at the 6-month anniversary of this ticket, it might be worth checking how our upstream production applications (such as GitLab, Netbox, etc.) are handling this change, if at all. For example, I noticed that netbox-docker is now using Valkey.
Tue, Oct 22
@aborrero sorry for the confusion. I believe we are talking about a single server, as opposed to a project-wide quota.
Do you know if the latest Wikibase dumps are available via NFS?
Mon, Oct 21
Hello @Physikerwelt! I am an SRE on the Search Platform team, and my responsibilities include the current WDQS infrastructure. While I can't estimate the exact resource needs of the WDQS graph under qlever, I can give you some info on its current resource usage under Blazegraph.
I also updated the stat hosts dashboard with a panel that shows memory usage per slice.
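For a quick shell-level equivalent of that panel (the dashboard presumably reads from our metrics pipeline; the commands below just read the cgroup v2 filesystem directly):

```
# Live per-cgroup memory usage, heaviest first
systemd-cgtop --order=memory

# One-off reading of each user slice's current memory usage
for s in /sys/fs/cgroup/user.slice/user-*.slice; do
    printf '%s\t%s\n' "$s" "$(cat "$s/memory.current")"
done
```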
This alert is no longer firing, so I'm going to go ahead and close this one out for now.
Reopening, as enabling node interleaving did improve stability on stat1011. We should apply this to the other stat hosts.
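A quick way to confirm whether interleaving is actually in effect on a given host (assumes the numactl package is installed):

```
# With node interleaving enabled in the BIOS, the OS should report a single NUMA node
numactl --hardware

# Per-node allocation and miss counters, useful for comparing hosts before/after the change
numastat
```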
Fri, Oct 18
Update: @fkaelin helped us get a reproducer:
Wed, Oct 16
Adding some observations from our Slack thread.
Thu, Oct 10
Per the above patch, we've enabled zram, which should give the hosts a bit of protection under extreme memory pressure. I had planned on exploring more I/O-related optimizations...but as mentioned in T376653, it's likely these hosts will use Ceph mounts for their homedirs instead of local disks. As such, I don't think it's worth investing much more time in this issue. We can always revisit if need be. Closing...
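For the record, checking that zram is actually in play on a host looks roughly like this (exact device names and sizes depend on how the patch configures it):

```
# List zram devices, their compression algorithm, and compressed/uncompressed sizes
zramctl

# Confirm the zram device is active as swap (and see its priority)
swapon --show
```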
Per the above PR, we have activated memory and I/O cgroups on all stat hosts. I've crossed out the rest of the AC as it's entirely possible that we'll be using Ceph homedirs instead of the current disks fairly soon (ref: this design doc). We can always take a closer look at the disks if necessary, but I'm going to close this one out for now.
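For verification purposes, this is roughly how to confirm the controllers and limits took effect (user.slice as an example target; the real limits come from the Puppet change above):

```
# "memory" and "io" should appear in both lists on a cgroup v2 host
cat /sys/fs/cgroup/cgroup.controllers
cat /sys/fs/cgroup/cgroup.subtree_control

# Spot-check the resource-control properties systemd applied
systemctl show -p MemoryHigh -p MemoryMax -p IOWeight user.slice
```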
Tue, Oct 8
I provisioned wdqs-categories1001 in T376079. After provisioning, I one-offed the host and loaded categories via /usr/local/bin/reloadCategories.sh wdqs. As demonstrated by this graph, the reload took ~2h. Post-reload, memory usage has been stable at ~10 GB. I think this is enough evidence that we can run categories in the Ganeti infrastructure if necessary. At this point, I'm ready to decom/destroy this VM and work on a migration in a future task.
The VM wdqs-categories1001 has been provisioned successfully, so I'm closing out this task.
Mon, Oct 7
This writeup from Facebook provides an excellent real-world example of using cgroups v2 to protect workloads.
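As I read it, a central technique in that approach is reserving memory for the workloads you care about so reclaim pressure falls on everything else first. A minimal sketch of that idea with systemd (service name and size are placeholders, not anything we run):

```
# Guarantee roughly 4G to a critical service; the kernel reclaims from other cgroups first
sudo systemctl set-property example.service MemoryLow=4G

# Verify the setting via systemd and the cgroup v2 filesystem
systemctl show -p MemoryLow example.service
cat /sys/fs/cgroup/system.slice/example.service/memory.low
```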