
Investigate causes of Catalyst slowness
Open, Needs Triage | Public | 5 Estimated Story Points

Description

Symptoms

  • CI jobs timing out after waiting 10 minutes for a response from Catalyst
  • Demo creation taking longer
  • Demo deletion taking longer

Possible problems

  • Catalyst API logs mention slow SQL queries
  • K8s API bottleneck

Needs: Go code profiling, SQL query evaluation, checking traffic levels/crawlers, digging through logs during slowness, ... did we do ApacheBench tests?

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
batch Server-Sent Events when streaming logs | repos/test-platform/catalyst/catalyst-api!166 | jnuche | T417689 | main
use more relaxed SLOW SQL threshold when dropping env's DB schema | repos/test-platform/catalyst/catalyst-api!165 | jnuche | T417689 | main
tweak configuration values for DB connections | repos/test-platform/catalyst/catalyst-api!163 | jnuche | T417689 | main
reduce critical section in `environment.WaitForDeploymentToSucceed` | repos/test-platform/catalyst/catalyst-api!161 | jnuche | T417689 | main
remove `latestStatus` query param from Catalyst requests | repos/test-platform/catalyst/patchdemo!269 | jnuche | T417689 | main
set Deployment's revision history limit to 2 | repos/test-platform/catalyst/ci-charts!119 | jnuche | T417689 | main
add User Agent when calling Gerrit | repos/test-platform/catalyst/patchdemo!268 | jnuche | T417689 | main
add explicit history max limit to all Helm deployments | repos/test-platform/catalyst/patchdemo!266 | jnuche | T417689 | main
add explicit history max limit to all Helm deployments | repos/test-platform/catalyst/catalyst-api!159 | jnuche | T417689 | main

Event Timeline

thcipriani set the point value for this task to 5. (Thu, Feb 19, 5:20 PM)

@jnuche mentions helm has a bunch of old secrets: one path he's running down.

Some Helm commands had slowed to a crawl; it turned out we had 912 secrets in the cat-env namespace, most of them created by Helm history revisions. Many of those revisions could be safely removed, and after removing them Helm commands became noticeably more responsive. Env creation times have also dropped back to the levels we had a few months ago:

image.png (80×1 px, 16 KB)

Helm's CLI caps history at 10 revisions per release by default, but the Go library doesn't set a value, so revisions keep accumulating unless we set a limit ourselves. One of the envs had 246 live revisions:

sh.helm.release.v1.wiki-6c16706061-3362.v1     helm.sh/release.v1   1      56d
[...]
sh.helm.release.v1.wiki-6c16706061-3362.v246   helm.sh/release.v1   1      41h

We don't really rely on the revision history for our workflows, and the performance cost of each stored revision seems to add up pretty quickly, so I'm setting a limit of 3 for our components across the board.
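To illustrate what a limit of 3 means for a release like the one above, here is a minimal sketch that groups Helm release secrets by release and lists the revisions falling outside the newest 3. It only parses secret names of the form shown above (`sh.helm.release.v1.<release>.v<revision>`); the function names `parseHelmSecret` and `prunableRevisions` are hypothetical and not part of Catalyst or Helm, and the actual cleanup may well have been done differently.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// helmSecretPrefix is the prefix seen on the Helm release secrets listed above.
const helmSecretPrefix = "sh.helm.release.v1."

// parseHelmSecret (hypothetical helper) splits a secret name such as
// "sh.helm.release.v1.wiki-6c16706061-3362.v246" into release name and revision.
func parseHelmSecret(name string) (release string, revision int, ok bool) {
	if !strings.HasPrefix(name, helmSecretPrefix) {
		return "", 0, false
	}
	rest := strings.TrimPrefix(name, helmSecretPrefix)
	i := strings.LastIndex(rest, ".v")
	if i <= 0 {
		return "", 0, false
	}
	rev, err := strconv.Atoi(rest[i+2:])
	if err != nil {
		return "", 0, false
	}
	return rest[:i], rev, true
}

// prunableRevisions (hypothetical helper) returns, per release, the revisions
// that are older than the newest `keep` revisions. Releases with `keep` or
// fewer revisions are omitted from the result.
func prunableRevisions(secretNames []string, keep int) map[string][]int {
	if keep < 0 {
		keep = 0
	}
	byRelease := map[string][]int{}
	for _, n := range secretNames {
		if rel, rev, ok := parseHelmSecret(n); ok {
			byRelease[rel] = append(byRelease[rel], rev)
		}
	}
	out := map[string][]int{}
	for rel, revs := range byRelease {
		sort.Ints(revs)
		if len(revs) > keep {
			out[rel] = revs[:len(revs)-keep]
		}
	}
	return out
}

func main() {
	// Illustrative input only; real names would come from listing secrets in cat-env.
	names := []string{
		"sh.helm.release.v1.wiki-a.v1",
		"sh.helm.release.v1.wiki-a.v2",
		"sh.helm.release.v1.wiki-a.v3",
		"sh.helm.release.v1.wiki-a.v4",
		"sh.helm.release.v1.wiki-a.v5",
		"some-unrelated-secret",
	}
	for rel, revs := range prunableRevisions(names, 3) {
		fmt.Printf("%s: %d revision(s) beyond the limit: %v\n", rel, len(revs), revs)
	}
}
```

With a limit in place Helm should do this pruning itself on upgrade; the sketch is only meant to show how many secrets a given limit would leave behind.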

As with the secrets, recreated envs have been leaving a significant number of old replica sets behind:

kubectl -n cat-env get rs | grep 750c4d946d
wiki-750c4d946d-3895-mediawiki-66bccd4d77   0         0         0       15d
wiki-750c4d946d-3895-mediawiki-8669b898df   0         0         0       15d
wiki-750c4d946d-3895-mediawiki-668466c6b8   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-c556c45c5    0         0         0       14d
wiki-750c4d946d-3895-mediawiki-76c5fb6d87   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-6968d99c4d   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-6b5b86ff6f   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-68cd949987   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-75749d974b   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-5d46bb4dc5   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-68c5994db9   1         1         1       11d

I'm going to remove many of those old replica sets and add configuration to prevent this for future envs. Note that the configuration will not be updated for existing envs, so those won't get the fix.
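As a rough sketch of how the leftover replica sets could be picked out of a listing like the one above, the snippet below parses `kubectl get rs` output and returns the names whose DESIRED, CURRENT and READY columns are all 0. The function name `staleReplicaSets` is hypothetical, the column layout is assumed to match the output shown above, and this is not necessarily how the cleanup was actually done. The preventive configuration for future envs is presumably the Deployment's `spec.revisionHistoryLimit` field, going by the related change "set Deployment's revision history limit to 2".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// staleReplicaSets (hypothetical helper) scans `kubectl get rs` output and
// returns the names of replica sets whose DESIRED, CURRENT and READY counts
// are all 0. Header lines and malformed lines are skipped, because their
// columns do not parse as integers.
func staleReplicaSets(kubectlOutput string) []string {
	var stale []string
	for _, line := range strings.Split(kubectlOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}
		allZero := true
		for _, f := range fields[1:4] {
			n, err := strconv.Atoi(f)
			if err != nil || n != 0 {
				allZero = false
				break
			}
		}
		if allZero {
			stale = append(stale, fields[0])
		}
	}
	return stale
}

func main() {
	// Three lines taken from the listing above, plus a header line.
	output := `NAME                                        DESIRED   CURRENT   READY   AGE
wiki-750c4d946d-3895-mediawiki-66bccd4d77   0         0         0       15d
wiki-750c4d946d-3895-mediawiki-5d46bb4dc5   0         0         0       14d
wiki-750c4d946d-3895-mediawiki-68c5994db9   1         1         1       11d`
	for _, name := range staleReplicaSets(output) {
		fmt.Println(name)
	}
}
```

The replica set that is still serving the env (the last line in the listing) is left out because its counts are non-zero.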