
{Engineering} Free Up KSQLDB Resources After Client Push Query Completes
Closed, Resolved · Public · 3 Estimated Story Points

Description

User Story:

As a developer, I want to free up KSQLDB resources as soon as a client's push query completes or the HTTP connection disconnects so that realtime queries are not affected by previous push queries.

Acceptance Criteria:

Code Changes:

  • Add a QueryClose call to the Push callback function in Realtime/articles.go.
  • Add unit tests to verify that the close call is executed when expected.

Integration Testing:

  • Run the realtime-stress-test using the test-by-timestamps command.
  • Verify that the KSQLDB Custom Dashboard no longer shows long-running active queries. See chart: image.png (271×810 px, 40 KB).
  • Specifically, ensure that after the test run (e.g., running ./sequential.sh 10 "./test-by-timestamps" in the K8s branch of the realtime-stress-test repo), the red lines indicating active queries do not persist for an extended period (20 minutes) after the test completes.
ToDo:
Implement Code Changes:

Modify Realtime/articles.go to include a QueryClose call within the push callback function.
Write and Add Unit Tests:

Create unit tests to confirm that the QueryClose call is executed when a push query completes or the connection disconnects.

Perform Integration Testing:

Execute the realtime-stress-test using the ./sequential.sh 10 "./test-by-timestamps" command in the K8s branch.
Monitor the KSQLDB Custom Dashboard to ensure that no long-running active queries remain after test completion.

Test Strategy:

Testing Type:

  • Unit Testing: Verify that the new QueryClose functionality is called correctly.
  • Integration Testing: Use the realtime-stress-test scenario to ensure that the issue with lingering queries is resolved.
  • Manual Verification: Monitor the KSQLDB dashboard manually during the integration test to confirm that active query indicators (red lines) do not persist unnecessarily.
Potential Impact:

Broken Areas: Changes may affect how realtime queries handle resource cleanup. Improper closing of connections could lead to resource leaks or affect query performance.
Monitoring: Any failure to free resources properly might impact system performance and should be caught by monitoring tools.

Checklist for Testing:
  • Confirm that the QueryClose function is being executed in the push callback.
  • Verify through unit tests that the connection closes correctly when expected.
  • Run the integration test using the provided command and ensure that the KSQLDB dashboard no longer shows long-running active queries.
  • Monitor system performance during and after the test to ensure no residual resource locks or performance degradation.
Code Suggestions
  1. In the ksqldb submodule, update the PushPuller interface to include the CloseQuery method. The method already exists in the submodule's client.go; add:

```go
// Closer is an interface that wraps the ksqldb CloseQuery method for unit testing.
type Closer interface {
	CloseQuery(ctx context.Context, queryId string) error
}
```

  2. Add the Closer to PushPuller.
  3. In the realtime repo, in the cbk callback function in articles.go (around line 333), add a line similar to:

```go
defer p.KSQLDB.CloseQuery(gcx.Request.Context(), hr.QueryID)
```

  4. Check whether we need a Close() call inside the error logic on line 383 where the Push call is made.
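A minimal sketch of the unit-test idea from the acceptance criteria, using hypothetical names (mockCloser, runPush) rather than the real cbk callback in Realtime/articles.go; the actual test would exercise the callback with a mocked PushPuller:

```go
package main

import (
	"context"
	"fmt"
)

// Closer mirrors the interface proposed in the code suggestions.
type Closer interface {
	CloseQuery(ctx context.Context, queryID string) error
}

// mockCloser records whether CloseQuery was called and with which query ID.
type mockCloser struct {
	called  bool
	queryID string
}

func (m *mockCloser) CloseQuery(_ context.Context, queryID string) error {
	m.called = true
	m.queryID = queryID
	return nil
}

// runPush stands in for the push callback: the deferred CloseQuery fires when
// the callback returns, whether the query completed or the client disconnected.
func runPush(c Closer, queryID string) {
	defer c.CloseQuery(context.Background(), queryID)
	// ... stream rows to the client until completion or disconnect ...
}

func main() {
	m := &mockCloser{}
	runPush(m, "query-123")
	fmt.Println(m.called, m.queryID) // true query-123
}
```

Because the mock satisfies the same interface the callback depends on, the test can assert CloseQuery was invoked without a live KSQLDB server.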

Event Timeline

REsquito-WMF renamed this task from [Tech Debt] Free Up KSQLDB Resources After Client Push Query Completes to {Engineering} Free Up KSQLDB Resources After Client Push Query Completes.Feb 20 2025, 9:17 AM
REsquito-WMF triaged this task as High priority.

Changes happen inside the ksqldb module... no need to change the callback.

This way we avoid the module's internals leaking outside... and making usage of the submodule error-prone.

@REsquito-WMF is it engineering for product support? continuous improvement? or any other category? Tech debt, infrastructure, architecture, etc? Also, please score it. thanks!

Tech debt and bug, system improvements

Set 3 points because of testing.

Picking it up, now that most MRs are in prd

Looks like resources are freed up better than before; see how the first chart goes back to baseline after queries finish. Previously the query count stayed high for prolonged periods of time.

image.png (303 KB)

Still getting these "unexpected end of JSON input" errors; they must be from an Unmarshal() call. We should migrate to our WME logger and add unique prefixes so we know where and why these "unexpected end of JSON" errors are happening.
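One way the unique-prefix idea could look, sketched against the standard library (the WME logger would slot in where fmt.Errorf is used; unmarshalWith and the "ksqldb.Push" prefix are illustrative names, not existing code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unmarshalWith wraps json.Unmarshal so every failure carries a unique,
// greppable prefix identifying the call site, plus the payload length.
func unmarshalWith(prefix string, data []byte, v any) error {
	if err := json.Unmarshal(data, v); err != nil {
		return fmt.Errorf("%s: unmarshal failed (payload len=%d): %w", prefix, len(data), err)
	}
	return nil
}

func main() {
	var row map[string]any
	// An empty payload reproduces the "unexpected end of JSON input" error.
	err := unmarshalWith("ksqldb.Push", []byte(""), &row)
	fmt.Println(err)
}
```

With a distinct prefix per call site, the dashboard logs would immediately show which unmarshal is failing and whether the payload was truncated or empty.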

Started a discussion about how we can do consistent logging across all packages/submodules. Choices: Use go.mod replace command with submodules, or publish log submodule as a GitHub public OSS package.

Our HTTP connections to KSQLDB use connection reuse. When a connection is closed to free up resources, any reused connections are also terminated, leading to attempts to write data after the stream has been closed. We're using Envoy in K8s to manage connection pools; Envoy can close idle TCP connections, which could be contributing to the closed connections in our Push code.

Three options:

  1. We abandon the KSQLDB resource clean-up ticket
  2. We rewrite ksqldb.NewClient to not reuse HTTP connections, migrate newScanner to newReader for more control over token reading from the response byte stream, and improve logging and clean-up in the KSQLDB client Push method. Potentially, tweak the KSQLDB Envoy configuration to set idle timeouts and settings around HTTP connection reuse.
  3. Migrate our Push and KSQLDB calls to use a 3rd party package: https://github.com/thmeitz/ksqldb-go, which is the community package mentioned in https://developer.confluent.io/faq/apache-kafka/ksqldb/

Close Current Ticket and Merge with Real-Time Optimization Work
Decision: Current ticket will be closed.
It will be merged into a broader real-time query optimization initiative https://phabricator.wikimedia.org/T389815
Ruairi will add his findings to the new ticket.