
bking (Brian King)
Senior Site Reliability Engineer, Search Platform Team

User Details

User Since
Dec 15 2021, 9:19 PM (150 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
BKing (WMF)

Recent Activity

Today

bking updated the task description for T378757: Jupyter and Analytics Client Enhancements Phase 1: Ensure acceptable storage performance with CephFS.
Thu, Oct 31, 8:23 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated the task description for T378757: Jupyter and Analytics Client Enhancements Phase 1: Ensure acceptable storage performance with CephFS.
Thu, Oct 31, 8:10 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking renamed T378757: Jupyter and Analytics Client Enhancements Phase 1: Ensure acceptable storage performance with CephFS from Ensure acceptable storage performance with CephFS to Jupyter and Analytics Client Enhancements Phase 1: Ensure acceptable storage performance with CephFS.
Thu, Oct 31, 8:08 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking renamed T378738: Jupyter and Analytics Client Enhancements Phase 1: Estimate storage needs/provision CephFS storage for user directories from Estimate storage needs/provision CephFS storage for user directories to Jupyter and Analytics Client Enhancements Phase 1: Estimate storage needs/provision CephFS storage for user directories.
Thu, Oct 31, 8:07 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking renamed T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task from Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers to Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task.
Thu, Oct 31, 8:07 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated the task description for T378757: Jupyter and Analytics Client Enhancements Phase 1: Ensure acceptable storage performance with CephFS.
Thu, Oct 31, 8:04 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created T378757: Jupyter and Analytics Client Enhancements Phase 1: Ensure acceptable storage performance with CephFS.
Thu, Oct 31, 7:27 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created T378738: Jupyter and Analytics Client Enhancements Phase 1: Estimate storage needs/provision CephFS storage for user directories.
Thu, Oct 31, 4:00 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created P70771 stat hosts disk usage 2024-10-31 🎃.
Thu, Oct 31, 3:59 PM · Data-Platform-SRE
bking changed the status of T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task from Open to In Progress.
Thu, Oct 31, 3:55 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task.
Thu, Oct 31, 3:54 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T373519: Allow UEFI DHCP configs.

Forgive the drive-by comment, but I'm wondering if we have evaluated any other NICs besides Broadcom? We've lost countless hours to their firmware bugs (at least ~100 of my team's hosts have been affected in the ~3 years I've worked here). That's a pretty significant cost if you think about our salaries, opportunity costs, etc.

Thu, Oct 31, 2:57 PM · Patch-For-Review, Infrastructure-Foundations
bking closed T357146: Monitor Elastic S3 repository status as Resolved.

OK, this is now fixed...closing again.

Thu, Oct 31, 1:02 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Yesterday

bking added a comment to T377655: Request creation of wikiqlever VPS project.

@Seppl2013 I recommend sysstat (also known as sar) for tracking memory and load. sysstat takes 10-minute samples by default, and you can see the memory stats with sar -r. Let us know if you have any other questions.
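
For reference, the %memused column of sar -r output can be pulled out with a few lines of Python. This is only a minimal sketch: parse_memused and SAMPLE are hypothetical names, the sample rows are made up for illustration, and the exact column set differs between sysstat versions, so the parser locates %memused from the header line instead of assuming a fixed position.

```python
def parse_memused(sar_output: str) -> list[tuple[str, float]]:
    """Return (timestamp, %memused) pairs from the text output of `sar -r`."""
    samples = []
    column = None  # position of %memused, counted from the end of the line
    for line in sar_output.splitlines():
        fields = line.split()
        if "%memused" in fields:
            # Header line (it may repeat); remember where the column sits.
            column = fields.index("%memused") - len(fields)
            continue
        if column is None or not fields or fields[0] == "Average:":
            continue
        if len(fields) < -column:
            continue  # banner or restart line, not a sample row
        try:
            samples.append((fields[0], float(fields[column])))
        except ValueError:
            continue  # skip anything that is not numeric
    return samples


# Made-up sample in the general shape of `sar -r` output (illustration only).
SAMPLE = """\
12:00:01 AM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached
12:10:01 AM   1000000   2000000   3000000     75.00    100000    500000
12:20:01 AM    900000   1900000   3100000     77.50    100000    500000
Average:       950000   1950000   3050000     76.25    100000    500000
"""

for timestamp, memused in parse_memused(SAMPLE):
    print(timestamp, memused)
```

On a real host the input would be the captured text of sar -r rather than the sample string.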

Wed, Oct 30, 8:39 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service, Cloud-VPS (Project-requests)
bking reopened T357146: Monitor Elastic S3 repository status as "In Progress".

Reopening, as the monitors are using the default port 443 and we need them to use the correct port per cluster.

Wed, Oct 30, 6:56 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T376813: Implement non-cgroups-related performance optimizations on stat hosts as Resolved.
Wed, Oct 30, 6:50 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated subscribers of T376813: Implement non-cgroups-related performance optimizations on stat hosts.

Contrary to my prior statement, I no longer believe that disabling NUMA is necessary (see this comment for more details).

Wed, Oct 30, 6:50 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T376813: Implement non-cgroups-related performance optimizations on stat hosts, a subtask of T376426: Improve developer experience on stat hosts part 2, as Resolved.
Wed, Oct 30, 6:49 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking moved T378454: an-worker1165: Broken RAM from Blocked/Waiting to Done on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.

On the DPE side, I've confirmed that the host is back up and part of the cluster using these instructions (which I just added myself). Moving to "done" on our workboard...

Wed, Oct 30, 6:28 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), DC-Ops, ops-eqiad, SRE

Tue, Oct 29

bking placed T378368: Q2:rack/setup/install cloudelastic101[12] up for grabs.

CR for new hosts merged per @RobH's instructions above. Unassigning...

Tue, Oct 29, 10:04 PM · SRE, Data-Platform-SRE (2024.10.19 - 2024.11.08), ops-eqiad, Discovery-Search, DC-Ops
bking placed T378031: Q2:rack/setup/install wdqs202[67] up for grabs.

CR for new hosts merged per @RobH's instructions above. Unassigning...

Tue, Oct 29, 10:02 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), SRE, ops-codfw, Discovery-Search, DC-Ops
bking assigned T378454: an-worker1165: Broken RAM to VRiley-WMF.
Tue, Oct 29, 6:55 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), DC-Ops, ops-eqiad, SRE
bking moved T378454: an-worker1165: Broken RAM from Backlog - operations to Blocked/Waiting on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.
Tue, Oct 29, 6:54 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), DC-Ops, ops-eqiad, SRE
bking updated subscribers of T378454: an-worker1165: Broken RAM.
Tue, Oct 29, 6:53 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), DC-Ops, ops-eqiad, SRE
bking added a comment to T378454: an-worker1165: Broken RAM.

DC Ops, this host is hard down. Feel free to replace the RAM or take any other actions to restore it to working condition at your convenience (this is not an emergency).

Tue, Oct 29, 6:49 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), DC-Ops, ops-eqiad, SRE

Mon, Oct 28

bking closed T374948: Migrate airflow webservers to Kubernetes as Resolved.

I think migrating the test instance is a good AC for this task; we can create a new task or tasks for migrating the production instances. Closing...

Mon, Oct 28, 7:00 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated Other Assignee for T374948: Migrate airflow webservers to Kubernetes, added: brouberol.
Mon, Oct 28, 7:00 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T374948: Migrate airflow webservers to Kubernetes, a subtask of T375729: Create LDAP groups to use for OIDC permission mapping with corresponding airflow DAG Authors groups, as Resolved.
Mon, Oct 28, 7:00 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Infrastructure-Foundations
bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Mon, Oct 28, 6:59 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking placed T373490: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18 up for grabs.
Mon, Oct 28, 4:25 PM · Data-Persistence, serviceops, SRE
bking added a comment to T373490: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18.

Based on /etc/wikimedia/contacts.yaml, these hosts are owned by Data Persistence.

Mon, Oct 28, 4:22 PM · Data-Persistence, serviceops, SRE
bking claimed T373490: DegradedArray email alerts for aqs1013 and aqs1014 are firing since April 18.
Mon, Oct 28, 4:04 PM · Data-Persistence, serviceops, SRE
bking reassigned T378227: Investigate failed Cirrus index build services on mwmaint2002 from bking to dcausse.
Mon, Oct 28, 2:56 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated subscribers of T376426: Improve developer experience on stat hosts part 2.

Thanks to @BTullis for pointing out this Puppet code. I now believe that this code, not NUMA, was causing the hosts to seize up at 50% RAM utilization. The large gap between MemoryHigh (when the system starts to aggressively reclaim memory) and MemoryMax (when it actually kills the process) led to a state where the system was unable to recover. Turning off NUMA helped, but did not fix the root cause.
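
The relationship between the two limits can be sketched in Python. This is an illustration under stated assumptions, not the Puppet code in question: parse_limit and throttle_gap are hypothetical helpers, and the byte values are invented rather than taken from the stat hosts. It relies only on the cgroup v2 convention that memory.high and memory.max (the files behind systemd's MemoryHigh= and MemoryMax=) hold either a byte count or the literal string "max".

```python
from typing import Optional

GIB = 1024 ** 3


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup v2 memory limit: a byte count, or 'max' for unlimited."""
    value = raw.strip()
    return None if value == "max" else int(value)


def throttle_gap(memory_high: Optional[int], memory_max: Optional[int]) -> Optional[int]:
    """Bytes between the reclaim threshold and the hard limit, if both are set.

    Inside this window the kernel throttles the cgroup and reclaims
    aggressively, but the OOM killer has not yet been triggered.
    """
    if memory_high is None or memory_max is None:
        return None
    return memory_max - memory_high


# Hypothetical values for illustration only.
high = parse_limit(str(64 * GIB))
limit = parse_limit(str(120 * GIB))
gap = throttle_gap(high, limit)
print(f"throttle window: {gap / GIB:.0f} GiB ({gap / limit:.0%} of MemoryMax)")
```

A wide window means a workload can spend a long time being throttled before it is killed, which matches the unrecoverable state described above.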

Mon, Oct 28, 1:47 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Fri, Oct 25

bking created T378227: Investigate failed Cirrus index build services on mwmaint2002.
Fri, Oct 25, 8:18 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated the task description for T371061: Update CirrusSearch dashboards to use new metrics/refresh dashboards.
Fri, Oct 25, 1:26 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Discovery-Search, CirrusSearch
bking updated the task description for T371061: Update CirrusSearch dashboards to use new metrics/refresh dashboards.
Fri, Oct 25, 1:25 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Discovery-Search, CirrusSearch
bking closed T357146: Monitor Elastic S3 repository status as Resolved.

This has been implemented per the above patch. Closing...

Fri, Oct 25, 1:22 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Thu, Oct 24

bking added a member for Ceph: bking.
Thu, Oct 24, 9:42 PM
bking moved T349666: [EPIC] Improve helm chart development experience from Backlog - project to Done on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.
Thu, Oct 24, 1:48 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Kubernetes, serviceops-radar
bking edited projects for T349666: [EPIC] Improve helm chart development experience, added: Data-Platform-SRE (2024.10.19 - 2024.11.08); removed Discovery-Search, Epic, Data-Platform-SRE.
Thu, Oct 24, 1:47 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Kubernetes, serviceops-radar
bking closed T349666: [EPIC] Improve helm chart development experience as Declined.

Today is the one-year anniversary of this ticket! As Ben pointed out, this is pretty vague. As I haven't followed up, and our upstream helm chart policy has matured, it's probably past time to close this ticket. We can always follow up with more specific goals as time permits.

Thu, Oct 24, 1:47 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Kubernetes, serviceops-radar
bking closed T377158: RdfStreamingUpdaterSpaceUsageTooHigh as Invalid.

Closing as duplicate of T375404...

Thu, Oct 24, 1:28 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T377655: Request creation of wikiqlever VPS project.

@aborrero sorry for the confusion. I believe we are talking about a single server, as opposed to a project-wide quota.

Yeah, the only quotas that exist in Cloud VPS are project-wide.

So, you all can do the math on how much RAM, CPU, and disk, and how many instances, you need in total for the project, even if only a single VM will be used, and that would be the project quota that we would need to evaluate/approve/set.

Thu, Oct 24, 1:23 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service, Cloud-VPS (Project-requests)

Wed, Oct 23

bking closed T375109: RdfStreamingUpdaterSpaceUsageTooHigh as Resolved.

As shown by this dashboard, I've run the cleanup script and usage has fallen back below the alert threshold.

Wed, Oct 23, 7:02 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T375404: RdfStreamingUpdaterSpaceUsageTooHigh as Invalid.

Closing as duplicate of T375109.

Wed, Oct 23, 5:58 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T360596: Figure out a plan to move forward with regarding Redis License changes.

Forgive the drive-by comment, but at the 6-month anniversary of this ticket, it might be worth checking how our upstream production applications (such as gitlab, netbox, etc.) are handling this change, if at all. For example, I noticed that netbox-docker is now using valkey.

Wed, Oct 23, 1:34 PM · cloud-services-team, GitLab (Infrastructure), Patch-For-Review, User-aborrero, serviceops, MediaWiki-Platform-Team (Radar), collaboration-services, Release-Engineering-Team (Radar), Quarry, Toolforge, Software-Licensing, Infrastructure-Foundations, netbox, Core Platform Team Initiatives (API Gateway), ChangeProp, MediaWiki-File-management, SRE

Tue, Oct 22

bking moved T377158: RdfStreamingUpdaterSpaceUsageTooHigh from Backlog - operations to In Progress on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.
Tue, Oct 22, 6:42 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking claimed T377772: RdfStreamingUpdaterSpaceUsageTooHigh.
Tue, Oct 22, 6:42 PM · Discovery-Search (Current work)
bking added a comment to T377655: Request creation of wikiqlever VPS project.

@aborrero sorry for the confusion. I believe we are talking about a single server, as opposed to a project-wide quota.

Tue, Oct 22, 2:39 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service, Cloud-VPS (Project-requests)
bking added a comment to T377655: Request creation of wikiqlever VPS project.

Do you know if the latest Wikibase dumps are available via NFS?

Tue, Oct 22, 1:34 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service, Cloud-VPS (Project-requests)

Mon, Oct 21

bking updated the task description for T377734: Refactor cgroups implementation/improve process observability for stat hosts.
Mon, Oct 21, 10:42 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T377655: Request creation of wikiqlever VPS project.

Hello @Physikerwelt! I am an SRE on the Search Platform team, and my responsibilities include the current WDQS infrastructure. While I can't estimate the exact resource needs of the WDQS graph under qlever, I can give you some info on its current resource usage under Blazegraph.

Mon, Oct 21, 8:26 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service, Cloud-VPS (Project-requests)
bking added projects to T377655: Request creation of wikiqlever VPS project: Wikidata-Query-Service, Data-Platform-SRE.
Mon, Oct 21, 8:00 PM · Wikidata, Data-Platform-SRE, Wikidata-Query-Service, Cloud-VPS (Project-requests)
bking updated the task description for T377734: Refactor cgroups implementation/improve process observability for stat hosts.
Mon, Oct 21, 5:07 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T377734: Refactor cgroups implementation/improve process observability for stat hosts.

I also updated the stat hosts dashboard with a panel that shows memory usage per slice.

Mon, Oct 21, 5:02 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated the task description for T377734: Refactor cgroups implementation/improve process observability for stat hosts.
Mon, Oct 21, 4:11 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T377445: ProbeDown as Resolved.

This alert is no longer firing, so I'm going to go ahead and close this one out for now.

Mon, Oct 21, 4:03 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking edited projects for T377445: ProbeDown, added: Data-Platform-SRE (2024.10.19 - 2024.11.08); removed Discovery-Search (Current work).
Mon, Oct 21, 3:41 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking edited projects for T377158: RdfStreamingUpdaterSpaceUsageTooHigh, added: Data-Platform-SRE (2024.10.19 - 2024.11.08); removed Discovery-Search (Current work).
Mon, Oct 21, 3:40 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created T377734: Refactor cgroups implementation/improve process observability for stat hosts.
Mon, Oct 21, 2:33 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T376813: Implement non-cgroups-related performance optimizations on stat hosts.

Reopening, as enabling node interleaving did improve stability on stat1011. We should apply this to the other stat hosts.

Mon, Oct 21, 1:15 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking moved T376813: Implement non-cgroups-related performance optimizations on stat hosts from Backlog - project to In Progress on the Data-Platform-SRE (2024.10.19 - 2024.11.08) board.
Mon, Oct 21, 1:15 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Fri, Oct 18

bking reopened T376813: Implement non-cgroups-related performance optimizations on stat hosts as "In Progress".
Fri, Oct 18, 7:39 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking reopened T376813: Implement non-cgroups-related performance optimizations on stat hosts, a subtask of T376426: Improve developer experience on stat hosts part 2, as In Progress.
Fri, Oct 18, 7:39 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated subscribers of T376426: Improve developer experience on stat hosts part 2.

Update: @fkaelin helped us get a reproducer:

Fri, Oct 18, 7:12 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Thu, Oct 17

bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Thu, Oct 17, 8:45 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Thu, Oct 17, 7:07 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created P70262 admin_ng namespace removal.
Thu, Oct 17, 1:45 PM · Data-Platform-SRE

Wed, Oct 16

bking updated subscribers of T376426: Improve developer experience on stat hosts part 2.

Adding some observations from our Slack thread.

Wed, Oct 16, 5:29 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Thu, Oct 10

bking closed T376813: Implement non-cgroups-related performance optimizations on stat hosts as Resolved.

Per the above patch, we've enabled zRAM, which should give the hosts a bit of protection under extreme memory pressure. I had planned on exploring more I/O-related optimizations...but as mentioned in T376653, it's likely these hosts will use Ceph mounts for their homedirs instead of local disks. As such, I don't think it's worth the effort to invest much more time on this issue. We can always revisit if need be. Closing...

Thu, Oct 10, 9:20 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T376813: Implement non-cgroups-related performance optimizations on stat hosts, a subtask of T376426: Improve developer experience on stat hosts part 2, as Resolved.
Thu, Oct 10, 9:20 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking closed T376653: Investigate I/O and implement cgroups on stat hosts as Resolved.
Thu, Oct 10, 9:17 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking closed T376653: Investigate I/O and implement cgroups on stat hosts, a subtask of T376426: Improve developer experience on stat hosts part 2, as Resolved.
Thu, Oct 10, 9:16 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking added a comment to T376653: Investigate I/O and implement cgroups on stat hosts.

Per the above PR, we have activated memory and I/O cgroups on all stat hosts. I've crossed out the rest of the AC as it's entirely possible that we'll be using Ceph homedirs instead of the current disks fairly soon (ref: this design doc). We can always take a closer look at the disks if necessary, but I'm going to close this one out for now.

Thu, Oct 10, 9:16 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking updated the task description for T376653: Investigate I/O and implement cgroups on stat hosts.
Thu, Oct 10, 8:56 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Thu, Oct 10, 8:01 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking moved T376813: Implement non-cgroups-related performance optimizations on stat hosts from Backlog - project to In Progress on the Data-Platform-SRE (2024.09.28 - 2024.10.18) board.
Thu, Oct 10, 7:34 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking updated the task description for T374948: Migrate airflow webservers to Kubernetes.
Thu, Oct 10, 7:18 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created T376919: Create a software update cadence for Search Platform-owned applications.
Thu, Oct 10, 3:52 PM · Data-Platform-SRE

Wed, Oct 9

bking changed the status of T376813: Implement non-cgroups-related performance optimizations on stat hosts from Open to In Progress.
Wed, Oct 9, 3:15 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking changed the status of T376813: Implement non-cgroups-related performance optimizations on stat hosts, a subtask of T376426: Improve developer experience on stat hosts part 2, from Open to In Progress.
Wed, Oct 9, 3:15 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking created T376813: Implement non-cgroups-related performance optimizations on stat hosts.
Wed, Oct 9, 3:14 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking renamed T376653: Investigate I/O and implement cgroups on stat hosts from Investigate I/O and implement cgroups on stat1011 to Investigate I/O and implement cgroups on stat hosts.
Wed, Oct 9, 3:10 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking updated the task description for T376426: Improve developer experience on stat hosts part 2.
Wed, Oct 9, 3:07 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Tue, Oct 8

bking closed T375687: Test categories performance under Ganeti as Resolved.

I provisioned wdqs-categories1001 in T376079. After provisioning, I one-offed the host and loaded categories via /usr/local/bin/reloadCategories.sh wdqs. As demonstrated by this graph, the reload took ~2h. Post-reload, memory usage has been stable at ~10 GB. I think this is enough evidence that we can run categories in the Ganeti infrastructure if necessary. At this point, I'm ready to decom/destroy this VM and work on a migration in a future task.

Tue, Oct 8, 6:48 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.09.28 - 2024.10.18), Wikidata
bking closed T375687: Test categories performance under Ganeti, a subtask of T375520: EPIC: WDQS categories migration, as Resolved.
Tue, Oct 8, 6:48 PM · Data-Platform-SRE, Wikidata, Data-Platform, Epic
bking updated the task description for T375687: Test categories performance under Ganeti.
Tue, Oct 8, 4:41 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.09.28 - 2024.10.18), Wikidata
bking closed T376079: eqiad: request 1 VM for wdqs-categories as Resolved.

The VM wdqs-categories1001 has been provisioned successfully, so I'm closing out this task.

Tue, Oct 8, 4:40 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), vm-requests, Infrastructure-Foundations, SRE
bking closed T376079: eqiad: request 1 VM for wdqs-categories, a subtask of T375520: EPIC: WDQS categories migration, as Resolved.
Tue, Oct 8, 4:39 PM · Data-Platform-SRE, Wikidata, Data-Platform, Epic
bking closed T376079: eqiad: request 1 VM for wdqs-categories, a subtask of T375687: Test categories performance under Ganeti, as Resolved.
Tue, Oct 8, 4:39 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.09.28 - 2024.10.18), Wikidata

Mon, Oct 7

bking added a comment to T376653: Investigate I/O and implement cgroups on stat hosts.

This writeup from Facebook provides an excellent real-world example of using cgroups v2 to protect workloads.

Mon, Oct 7, 10:19 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking updated the task description for T376653: Investigate I/O and implement cgroups on stat hosts.
Mon, Oct 7, 8:22 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking changed the status of T376653: Investigate I/O and implement cgroups on stat hosts, a subtask of T376426: Improve developer experience on stat hosts part 2, from Open to In Progress.
Mon, Oct 7, 8:22 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)
bking changed the status of T376653: Investigate I/O and implement cgroups on stat hosts from Open to In Progress.
Mon, Oct 7, 8:22 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)
bking created T376653: Investigate I/O and implement cgroups on stat hosts.
Mon, Oct 7, 8:07 PM · Patch-For-Review, Data-Platform-SRE (2024.09.28 - 2024.10.18)

Fri, Oct 4

bking added a subtask for T337013: [Epic] Splitting the graph in WDQS: T374967: wdqs-categories migration: decide where to migrate.
Fri, Oct 4, 1:21 PM · Discovery-Search (Current work), Epic, Wikidata-Query-Service, Wikidata
bking added a parent task for T374967: wdqs-categories migration: decide where to migrate: T337013: [Epic] Splitting the graph in WDQS.
Fri, Oct 4, 1:21 PM · Wikidata, Wikidata-Query-Service, Data-Platform-SRE

Thu, Oct 3

bking created T376426: Improve developer experience on stat hosts part 2.
Thu, Oct 3, 9:06 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08)

Wed, Oct 2

bking added a subtask for T375520: EPIC: WDQS categories migration: T374009: Investigate EQIAD WDQS graph split host alerts/separate out categories in the puppet code.
Wed, Oct 2, 5:36 PM · Data-Platform-SRE, Wikidata, Data-Platform, Epic