bking (Brian King)
Senior Site Reliability Engineer, Search Platform Team

User Details

User Since
Dec 15 2021, 9:19 PM (50 w, 4 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
BKing (WMF) [ Global Accounts ]

Recent Activity

Fri, Dec 2

bking added a comment to T324346: High error rate in WDQS since 15:00 UTC.

Adding a reminder to myself to capture dashboard links, log-diving commands, and other troubleshooting info from today's IRC log and add them to the WDQS runbook.

Fri, Dec 2, 9:37 PM · Wikimedia-Incident, Wikidata, Wikidata-Query-Service
Dusty292 awarded T323620: 2022-11-22 WDQS high load incident a Yellow Medal token.
Fri, Dec 2, 6:31 PM · Discovery-Search (Current work)
bking added a comment to T324346: High error rate in WDQS since 15:00 UTC.

Follow-up email sent to Wikidata Users list:

Fri, Dec 2, 6:24 PM · Wikimedia-Incident, Wikidata, Wikidata-Query-Service
bking updated subscribers of T324346: High error rate in WDQS since 15:00 UTC.
Fri, Dec 2, 5:14 PM · Wikimedia-Incident, Wikidata, Wikidata-Query-Service
bking added a comment to T324346: High error rate in WDQS since 15:00 UTC.

Email sent to Wikidata Users list:

Fri, Dec 2, 5:07 PM · Wikimedia-Incident, Wikidata, Wikidata-Query-Service
bking created T324341: Add stack tracing to the wdqs restart cookbook.
Fri, Dec 2, 3:33 PM · Wikidata, Discovery-Search, Wikidata-Query-Service

Thu, Dec 1

bking added a comment to T323646: Observe results from JVM options/heap memory changes.

We changed the Java GC options above to reduce old GC alerts, but we had another one today for cloudelastic:

Thu, Dec 1, 6:30 PM · Discovery-Search (Current work)
bking created T324209: Quota increase request for wikidata-query project.
Thu, Dec 1, 2:27 PM · Cloud-VPS (Quota-requests), Discovery-Search

Wed, Nov 30

bking created T324147: Investigate and document cloudvirt-wdqs servers.
Wed, Nov 30, 8:27 PM · Discovery-Search

Mon, Nov 28

bking updated subscribers of T321491: Evaluate Flink Operator on DSE Kubernetes Cluster.

Per last week's conversation with @dcausse:
The Helm chart for the operator needs to be modified for the DSE environment. @BTullis's Spark Helm chart PR is probably a good template. Also note that @Ottomata is working on the Flink Docker images.

Mon, Nov 28, 2:45 PM · serviceops-radar, Discovery-Search (Current work)
bking added a comment to T321587: Build Flink minikube playground on WMCS Search project.

The Flink jobs are deployed. I also mirrored the Flink Helm Chart so we can see what's happening.

Mon, Nov 28, 2:39 PM · Discovery-Search (Current work)
bking claimed T321587: Build Flink minikube playground on WMCS Search project.
Mon, Nov 28, 2:28 PM · Discovery-Search (Current work)

Tue, Nov 22

bking edited projects for T323620: 2022-11-22 WDQS high load incident, added: Discovery-Search; removed Discovery-Search (Current work).
Tue, Nov 22, 9:40 PM · Discovery-Search (Current work)
bking renamed T323620: 2022-11-22 WDQS high load incident from 2022-11-22 WDQS high load to 2022-11-22 WDQS high load incident.
Tue, Nov 22, 9:38 PM · Discovery-Search (Current work)
bking created T323646: Observe results from JVM options/heap memory changes.
Tue, Nov 22, 8:33 PM · Discovery-Search (Current work)
bking moved T319020: Reset to upstream java GC options and remove redundant JVM options from In Progress to Blocked/Waiting on the Discovery-Search (Current work) board.

Will create a new task for

  • we don't hit an OOME within 2 weeks
  • gather some data
Tue, Nov 22, 7:40 PM · Patch-For-Review, Discovery-Search (Current work)
bking added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

IMHO the encryption part can't just be removed. The cookbook is also used to transfer cross-DC and for that encryption AFAIK is a requirement and anyway strongly encouraged also for same-DC transfers.

Agreed, this is not a best practice and we wouldn't make that a permanent change in the cookbook. We did look at the rsync Puppet class and saw there was an option to disable encryption, so we figured it would be OK as a one-off (also considering we were transferring publicly-available data).

Tue, Nov 22, 7:24 PM · Discovery-Search (Current work)
bking added a comment to T321874: Consider alternative configuration management tooling.

I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.

Agreed. In retrospect, convincing someone of the value of a tool on a message board is NOT a winning strategy. But I would caution against reading too much into people's responses (or lack thereof). I have had people tell me privately, "Thanks for bringing this up, I wouldn't have the energy," and "this tends to be a religious war, you should only suggest Ansible for net-new environments."

I think describing this discussion as a religious war is quite unfair. Attributing the comment to someone else isn't going to make it less disrespectful of the people who dedicated time and effort to discussing this topic with you.

I've updated my post to the exact quote. I don't think this has offended anyone who did engage, but if it did, please let me know and I will remove it from this page entirely. I don't know how to read "attributing the comment to someone else"; can you clarify? It sounds like you are accusing me of sockpuppeting. If you don't believe me, I can ask the person if they are OK with me revealing their personal details.

Tue, Nov 22, 5:16 PM · Puppet, Infrastructure-Foundations
bking created T323620: 2022-11-22 WDQS high load incident.
Tue, Nov 22, 3:48 PM · Discovery-Search (Current work)
bking added a comment to T319020: Reset to upstream java GC options and remove redundant JVM options.

Rough way of validating that the extraneous JVM options have been removed:

Tue, Nov 22, 3:01 PM · Patch-For-Review, Discovery-Search (Current work)
bking added a comment to T323612: Increase small cluster heap memory.

Creating this ticket retroactively to link the following code changes:

Tue, Nov 22, 2:52 PM · Discovery-Search (Current work)
bking created T323612: Increase small cluster heap memory.
Tue, Nov 22, 2:50 PM · Discovery-Search (Current work)

Fri, Nov 18

bking closed T316236: Reload WCQS from dumps as Resolved.
Fri, Nov 18, 3:43 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking added a comment to T316236: Reload WCQS from dumps.

Opened T323380 for the disk errors, closing this one for now.

Fri, Nov 18, 3:43 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking closed T316236: Reload WCQS from dumps, a subtask of T314703: Structured data for deleted files on Commons still visible in SPARQL engine after deletion, as Resolved.
Fri, Nov 18, 3:42 PM · Discovery-Search (Current work), Privacy Engineering, Wikidata, Wikidata-Query-Service, MediaWiki-Page-deletion, Privacy, Commons
bking added a comment to T320812: [SPIKE] Deploy event driven stateless Flink service to DSE cluster.

Just a heads-up as I'm re-engaging. I plan to use the stream-enrichment-poc namespace within the next couple of weeks. Ping me here or on the Slack thread if this is going to be a problem and/or if you have any advice on this.

Fri, Nov 18, 3:09 PM · Event-Platform Value Stream, Shared-Data-Infrastructure, Data-Engineering-Planning
bking added a project to T323380: Investigate disk errors on wcqs1003.eqiad.wmnet: DC-Ops.
Fri, Nov 18, 3:01 PM · Discovery-Search (Current work), DC-Ops
bking added a comment to T323380: Investigate disk errors on wcqs1003.eqiad.wmnet.

Hello DC Ops,

Fri, Nov 18, 3:01 PM · Discovery-Search (Current work), DC-Ops
bking created T323380: Investigate disk errors on wcqs1003.eqiad.wmnet.
Fri, Nov 18, 2:59 PM · Discovery-Search (Current work), DC-Ops
bking added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

We looked at the rsync class in Puppet, and it didn't seem like a great fit for our use case. Instead, we simplified the cookbook (removing openssl and pigz) and were able to complete the data transfer.

Fri, Nov 18, 2:44 PM · Discovery-Search (Current work)
bking added a comment to T316236: Reload WCQS from dumps.

The reload is complete. However, we had to reboot wcqs1003.eqiad.wmnet several times before it would actually load the OS, and the BIOS displayed disk errors every time:

Fri, Nov 18, 2:42 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking added a comment to T321874: Consider alternative configuration management tooling.

I can definitely relate to the long (and stressful!) cycles of Puppet patches you mention, @bking, and that was one of my main motivations for starting Pontoon almost three years ago now.

Needless to say, I'm a fan of the peace of mind that having a sandbox similar to production has provided me. Hope that helps!

Fri, Nov 18, 2:18 PM · Puppet, Infrastructure-Foundations
bking added a comment to T321874: Consider alternative configuration management tooling.

I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in the discussion. I would suggest we close it as declined.

Fri, Nov 18, 2:13 PM · Puppet, Infrastructure-Foundations

Thu, Nov 17

bking updated bking.
Thu, Nov 17, 8:17 PM

Wed, Nov 16

bking added a comment to T321587: Build Flink minikube playground on WMCS Search project.

Host deployment-flink0.deployment-prep.eqiad1.wikimedia.cloud is up and running the default Flink job as described on Flink's quickstart page. Next steps are to test a Flink Application Cluster as a k8s Deployment and as a k8s Job resource, as described here.

Wed, Nov 16, 8:25 PM · Discovery-Search (Current work)
bking added a comment to T321874: Consider alternative configuration management tooling.

How would this be different under Ansible?

  • I could render the template live on the server before committing changes, so I wouldn't make the mistake in the first place.

I have found this extremely helpful as well. At my last job we regularly ran Puppet noops in production, which allowed us to render a template with full production data. I have made some effort to make that possible here at the foundation with bolt, but it has significant gaps due to our use of PuppetDB and having private data that is only available to the puppet masters.

Wed, Nov 16, 7:47 PM · Puppet, Infrastructure-Foundations
bking added a comment to T321874: Consider alternative configuration management tooling.

The problems of deployment-prep are a matter of resourcing, (lack of) team ownership, processes and prioritization, not the tooling.

Wed, Nov 16, 3:53 PM · Puppet, Infrastructure-Foundations

Tue, Nov 15

bking added a comment to T319020: Reset to upstream java GC options and remove redundant JVM options.

In addition to the patches listed above, we wrote a new test as well.

Tue, Nov 15, 9:29 PM · Patch-For-Review, Discovery-Search (Current work)

Mon, Nov 14

bking moved T323071: Removed decommissioned NFS shares from /etc/fstab on wdqs1009 and wcqs2001 from Incoming to Needs review on the Discovery-Search (Current work) board.
Mon, Nov 14, 9:40 PM · Discovery-Search (Current work)
bking edited projects for T323071: Removed decommissioned NFS shares from /etc/fstab on wdqs1009 and wcqs2001, added: Discovery-Search (Current work); removed Discovery-Search.
Mon, Nov 14, 9:39 PM · Discovery-Search (Current work)
bking closed T323071: Removed decommissioned NFS shares from /etc/fstab on wdqs1009 and wcqs2001 as Resolved.

I reviewed the commit above, but I don't see any place where the actual server FQDNs are hardcoded. So I don't think a Puppet patch will be necessary, just some manual steps on the servers.

Mon, Nov 14, 9:30 PM · Discovery-Search (Current work)
bking added a comment to T323071: Removed decommissioned NFS shares from /etc/fstab on wdqs1009 and wcqs2001.

See also this commit.

Mon, Nov 14, 9:09 PM · Discovery-Search (Current work)
bking created T323071: Removed decommissioned NFS shares from /etc/fstab on wdqs1009 and wcqs2001.
Mon, Nov 14, 9:07 PM · Discovery-Search (Current work)
bking added a comment to T316236: Reload WCQS from dumps.

wcqs1003 is the only host left that needs a reload:

Mon, Nov 14, 3:01 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking closed T313842: Decommission elastic2049.codfw.wmnet as Resolved.
Mon, Nov 14, 2:12 PM · DC-Ops, Discovery-Search (Current work), decommission-hardware

Wed, Nov 9

bking claimed T313842: Decommission elastic2049.codfw.wmnet.
Wed, Nov 9, 2:00 PM · DC-Ops, Discovery-Search (Current work), decommission-hardware

Mon, Nov 7

bking removed a project from T322377: Use DNS name instead of IP in PyBal alerts: Discovery-Search.
Mon, Nov 7, 4:31 PM · SRE, Observability-Alerting, Traffic
bking updated the task description for T321605: Make WCQS/WDQS data transfer cookbook more reliable.
Mon, Nov 7, 4:05 PM · Discovery-Search (Current work)
bking added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

  • Adding a timeout to the nc command did not help.
Mon, Nov 7, 3:52 PM · Discovery-Search (Current work)
bking added a comment to T316103: Migrate WDQS to Java 11.

Looking at Blazegraph Database's GitHub, there are a few issues related to the Java version.

Mon, Nov 7, 2:57 PM · Packaging, Infrastructure-Foundations, wdwb-tech, Wikidata, Wikidata-Query-Service
bking added a comment to T321874: Consider alternative configuration management tooling.

Thanks again for your response.

Mon, Nov 7, 2:13 AM · Puppet, Infrastructure-Foundations

Nov 4 2022

bking placed T303011: Automate elastic plugin pkg build process up for grabs.

Unassigning as I'm no longer actively working on this. I do intend to finish it eventually.

Nov 4 2022, 1:52 PM · Discovery-Search
bking awarded T322168: Update Zuul status page to WMUI (remove last bit of Bootstrap) a 100 token.
Nov 4 2022, 1:00 PM · Patch-For-Review, Continuous-Integration-Infrastructure

Nov 3 2022

bking added a comment to T321587: Build Flink minikube playground on WMCS Search project.

Started the Terraform repo

Nov 3 2022, 9:52 PM · Discovery-Search (Current work)
bking updated subscribers of T322377: Use DNS name instead of IP in PyBal alerts.
Nov 3 2022, 9:31 PM · SRE, Observability-Alerting, Traffic
bking updated the task description for T322377: Use DNS name instead of IP in PyBal alerts.
Nov 3 2022, 9:31 PM · SRE, Observability-Alerting, Traffic
bking created T322377: Use DNS name instead of IP in PyBal alerts.
Nov 3 2022, 9:30 PM · SRE, Observability-Alerting, Traffic
bking created P38095 mktun.sh.
Nov 3 2022, 8:30 PM
bking created T322358: Address Puppet change errors in Relforge.
Nov 3 2022, 6:11 PM · Discovery-Search (Current work)

Nov 2 2022

bking added a comment to T321874: Consider alternative configuration management tooling.

Thanks jbond, these are all legitimate points and must be addressed before we start to consider Ansible. Here's what I have so far:

Nov 2 2022, 3:45 PM · Puppet, Infrastructure-Foundations

Oct 31 2022

bking updated the task description for T322045: Investigate "Processing latency of WDQS_Streaming_Updater" noisy alerts.
Oct 31 2022, 4:46 PM · Wikidata, Wikidata-Query-Service
bking added a subtask for T316236: Reload WCQS from dumps: T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.
Oct 31 2022, 4:45 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking added a parent task for T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service: T316236: Reload WCQS from dumps.
Oct 31 2022, 4:45 PM · Discovery-Search (Current work)
bking updated the task description for T321414: Restore elasticsearch-oss package to bullseye-wikimedia repo.
Oct 31 2022, 4:39 PM · Discovery-Search
bking closed T321414: Restore elasticsearch-oss package to bullseye-wikimedia repo as Resolved.

Closing in favor of the original ticket

Oct 31 2022, 4:35 PM · Discovery-Search
bking created T322045: Investigate "Processing latency of WDQS_Streaming_Updater" noisy alerts.
Oct 31 2022, 4:34 PM · Wikidata, Wikidata-Query-Service
bking closed T302736: Translate ES ansible playbook to cumin cookbook as Invalid.

I wrote this before I realized that there is no cumin/spicerack in the deployment-prep env. I still need to get more familiar with cookbooks, but we'll close this one for now and get another ticket with some better requirements started.

Oct 31 2022, 4:33 PM · Beta-Cluster-Infrastructure, Discovery-Search
bking claimed T321491: Evaluate Flink Operator on DSE Kubernetes Cluster.
Oct 31 2022, 4:25 PM · serviceops-radar, Discovery-Search (Current work)
bking moved T313999: Record ES info with a script from In Progress to Needs Reporting on the Discovery-Search (Current work) board.
Oct 31 2022, 4:12 PM · Discovery-Search (Current work)
bking closed T313999: Record ES info with a script as Resolved.

Closing this out, as the original script fulfilled its purpose and we need to open a new ticket with more specific requirements before we continue.

Oct 31 2022, 4:11 PM · Discovery-Search (Current work)
bking renamed T321605: Make WCQS/WDQS data transfer cookbook more reliable from Sanity-check disk size in query services data transfer cookbook to Make WCQS/WDQS data transfer cookbook more reliable.
Oct 31 2022, 3:37 PM · Discovery-Search (Current work)
bking created T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.
Oct 31 2022, 3:23 PM · Discovery-Search (Current work)
bking added a comment to T321491: Evaluate Flink Operator on DSE Kubernetes Cluster.

Adding T321587 as a prereq.

Oct 31 2022, 1:28 PM · serviceops-radar, Discovery-Search (Current work)
bking added a parent task for T321587: Build Flink minikube playground on WMCS Search project: T321491: Evaluate Flink Operator on DSE Kubernetes Cluster.
Oct 31 2022, 1:27 PM · Discovery-Search (Current work)
bking added a subtask for T321491: Evaluate Flink Operator on DSE Kubernetes Cluster: T321587: Build Flink minikube playground on WMCS Search project.
Oct 31 2022, 1:27 PM · serviceops-radar, Discovery-Search (Current work)
bking renamed T321491: Evaluate Flink Operator on DSE Kubernetes Cluster from Evaluate Flink Operator on Staging Kubernetes Cluster to Evaluate Flink Operator on DSE Kubernetes Cluster.
Oct 31 2022, 1:24 PM · serviceops-radar, Discovery-Search (Current work)

Oct 27 2022

bking moved T316031: Clean up the rdf-streaming-updater-codfw container from thanos-swift. from In Progress to Needs Reporting on the Discovery-Search (Current work) board.
Oct 27 2022, 5:49 PM · Discovery-Search (Current work), Data-Engineering-Planning, wdwb-tech, Wikidata, SRE-swift-storage, SRE, Wikidata-Query-Service
bking closed T316031: Clean up the rdf-streaming-updater-codfw container from thanos-swift., a subtask of T314835: wdqs space usage on thanos-swift, as Resolved.
Oct 27 2022, 5:48 PM · Discovery-Search (Current work), Patch-For-Review, Data-Engineering-Planning, wdwb-tech, Wikidata, SRE-swift-storage, SRE, Wikidata-Query-Service
bking closed T316031: Clean up the rdf-streaming-updater-codfw container from thanos-swift. as Resolved.
Oct 27 2022, 5:48 PM · Discovery-Search (Current work), Data-Engineering-Planning, wdwb-tech, Wikidata, SRE-swift-storage, SRE, Wikidata-Query-Service
bking closed T303621: Stand up test ES environment for Search team as Declined.
Oct 27 2022, 5:47 PM · Discovery-Search
bking added a comment to T303621: Stand up test ES environment for Search team.

Closing, will continue similar work in https://phabricator.wikimedia.org/T321587

Oct 27 2022, 5:47 PM · Discovery-Search
bking awarded T315428: [SPIKE] Assess what is required for the enrichment pipeline to run on k8s a Love token.
Oct 27 2022, 4:45 PM · Data-Engineering-Planning, Event-Platform Value Stream (Sprint 01), Spike

Oct 26 2022

bking added a comment to T316236: Reload WCQS from dumps.

Still not sure exactly why they're failing. wcqs2002 also has the new data.
wcqs2003 and wcqs1003 still need it.

Oct 26 2022, 9:19 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

The transfer cookbook continues to fail with the same errors noted above.

Oct 26 2022, 3:40 PM · Discovery-Search (Current work)

Oct 25 2022

bking added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

We had a few transfers fail very quickly today. From #wikimedia-operations:

Oct 25 2022, 10:07 PM · Discovery-Search (Current work)
bking added a comment to T316236: Reload WCQS from dumps.

As of this writing, the following hosts have the updated data:

Oct 25 2022, 9:49 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking updated the task description for T321605: Make WCQS/WDQS data transfer cookbook more reliable.
Oct 25 2022, 6:41 PM · Discovery-Search (Current work)
bking created T321605: Make WCQS/WDQS data transfer cookbook more reliable.
Oct 25 2022, 6:41 PM · Discovery-Search (Current work)
bking updated the task description for T321587: Build Flink minikube playground on WMCS Search project.
Oct 25 2022, 3:57 PM · Discovery-Search (Current work)
bking created T321587: Build Flink minikube playground on WMCS Search project.
Oct 25 2022, 3:55 PM · Discovery-Search (Current work)

Oct 24 2022

bking added a comment to T316236: Reload WCQS from dumps.

The data reload is complete, but there were errors.

Oct 24 2022, 9:04 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
bking created P36109 Results of WCQS data reload T316236.
Oct 24 2022, 8:35 PM · Discovery-Search
bking created T321491: Evaluate Flink Operator on DSE Kubernetes Cluster.
Oct 24 2022, 2:21 PM · serviceops-radar, Discovery-Search (Current work)

Oct 21 2022

bking added a comment to T321414: Restore elasticsearch-oss package to bullseye-wikimedia repo.

When I copy repository_elasticsearch-oss.list to /etc/apt/sources.list.d on the test server and run `apt-cache update`, I get the following error:

Oct 21 2022, 9:12 PM · Discovery-Search
bking added a comment to T321414: Restore elasticsearch-oss package to bullseye-wikimedia repo.

Created server es-oss on the WMCS Search project to help troubleshoot

Oct 21 2022, 9:08 PM · Discovery-Search
bking created T321414: Restore elasticsearch-oss package to bullseye-wikimedia repo.
Oct 21 2022, 9:07 PM · Discovery-Search
bking removed a project from T318820: Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo: SRE.
Oct 21 2022, 8:41 PM · Discovery-Search (Current work)
bking added a comment to T318820: Provide compatible elasticsearch-oss (7.x) and wmf-elasticsearch-search-plugins for buster on WMF APT repo.

Looks like we (as in Search Platform SREs) need to cut a new package for wmf-elasticsearch-search-plugins. The general SRE teams are not needed for this work, so I'm removing the tag.

Oct 21 2022, 8:40 PM · Discovery-Search (Current work)

Oct 20 2022

bking created P35738 reroute for relforge reboot!.
Oct 20 2022, 8:19 PM · Discovery-Search

Oct 19 2022

bking created T321243: decommission elastic20[25-36].codfw.wmnet.
Oct 19 2022, 9:23 PM · SRE, ops-codfw, Discovery-Search (Current work), decommission-hardware