Instances of Kask need to be provisioned (in both data-centers) for storage of Echo timestamps (alert and notification last-seen times). These instances will connect to the RESTBase Cassandra cluster, and so will require keys created from that cluster's authority.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T234286 Multi-DC Echo Notification Storage |
| Resolved | | None | T234289 Migrate Wikimedia Echo notification timestamps from MainStash to Kask |
| Resolved | | None | T234402 Wikimedia infrastructure is configured for multi-DC echo notification storage |
| Resolved | | None | T234376 Provision Kask for Echo timestamp storage in k8s |
| Resolved | | None | T235558 Dashboards for monitoring of echostore |
Event Timeline
Change 543212 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] [WIP] echostore: create staging deployment
Change 543463 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] echostore: Add namespace creation stanzas
Change 543463 merged by jenkins-bot:
[operations/deployment-charts@master] echostore: Add namespace creation stanzas
Mentioned in SAL (#wikimedia-operations) [2019-10-16T14:24:04Z] <_joe_> creating namespaces and policies for echostore in codfw, T234376
Change 543212 merged by Eevans:
[operations/deployment-charts@master] echostore: create new staging deployment
Change 543699 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] echostore: create production deployments
Change 543699 merged by Eevans:
[operations/deployment-charts@master] echostore: create production deployments
I'm unable to deploy to codfw; I'm seeing the following:
$ kubectl get events
LAST SEEN   TYPE      REASON              KIND         MESSAGE
46s         Warning   FailedScheduling    Pod          0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
46s         Warning   FailedScheduling    Pod          0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
46s         Warning   FailedScheduling    Pod          0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
46s         Warning   FailedScheduling    Pod          0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
9m6s        Normal    SuccessfulCreate    ReplicaSet   Created pod: kask-production-dfd5f9666-6zxfd
9m6s        Normal    SuccessfulCreate    ReplicaSet   Created pod: kask-production-dfd5f9666-c4mnd
9m6s        Normal    SuccessfulCreate    ReplicaSet   Created pod: kask-production-dfd5f9666-jx6sg
9m6s        Normal    SuccessfulCreate    ReplicaSet   Created pod: kask-production-dfd5f9666-2xztd
9m6s        Normal    ScalingReplicaSet   Deployment   Scaled up replica set kask-production-dfd5f9666 to 4
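The FailedScheduling events above name two separate causes: two nodes lacked CPU, and four did not match the pod's node selector. A minimal sketch of how one might narrow this down, assuming kubectl access to the affected namespace. The function names and the `KUBECTL` override are hypothetical, and the deployment name `kask-production` is inferred from the replica set name in the events rather than stated in the task.

```shell
#!/bin/sh
# Hypothetical helpers; KUBECTL can be overridden (e.g. to add --namespace).
KUBECTL="${KUBECTL:-kubectl}"

# Print each distinct FailedScheduling message with the number of times it occurred.
failed_scheduling_summary() {
    $KUBECTL get events --field-selector reason=FailedScheduling \
        -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | sort | uniq -c
}

# Print the affinity stanza of a deployment's pod template, if it has one.
# $1: deployment name (assumed here to be kask-production)
show_pod_affinity() {
    $KUBECTL get deployment "$1" -o jsonpath='{.spec.template.spec.affinity}'
}
```

Running `failed_scheduling_summary`, then `show_pod_affinity kask-production`, and comparing the result against `kubectl get nodes --show-labels` should show whether any node can satisfy the selector at all.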
Change 543711 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] echostore: remove affinity (copypasta from sessionstore)
Change 543711 merged by Eevans:
[operations/deployment-charts@master] echostore: remove affinity (copypasta from sessionstore)
From a conversation w/ @Joe on IRC, it seems the nodeAffinity section (copypasta from the sessionstore deployment) was likely causing the problem. I issued a helmfile delete, and updated the config (removing that section), but am now getting:
$ helmfile diff
Adding repo stable https://releases.wikimedia.org/charts/
"stable" has been added to your repositories
Updating repo
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈
helmfile.yaml: basePath=.
Comparing production stable/kask
"production" has no deployed releases
in ./helmfile.yaml: failed processing release production: helm exited with status 1:
Error: "production" has no deployed releases
Error: plugin "diff" exited with erro
Perhaps there is some step required after helmfile delete?
Change 543731 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] echostore: fixup Cassandra contact list
Change 543731 merged by Eevans:
[operations/deployment-charts@master] echostore: fixup Cassandra contact list
Hat tip to @CDanis, who pointed me at https://github.com/helm/helm/issues/3208#issuecomment-348154521; a helm delete production --purge did the trick.
Heh yes sorry, I forgot to tell you yesterday - you need to use helmfile destroy in newer versions of helmfile.
I'm pretty sure I tried that (it seemed like the Right Thing™ based on the description in the help synopsis), and got an error of a different kind. If that's supposed to be equivalent, I'll see if I can't suss out the exact error from my scroll buffer.
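For anyone hitting the same "has no deployed releases" error, the recovery described in this thread can be sketched roughly as follows. This is only an illustration under assumptions: the `run` and `purge_and_redeploy` helpers and the `DRY_RUN` switch are hypothetical, the purge uses the Helm 2 syntax quoted above, and the final `helmfile apply` step is an assumption about how the redeploy is done (the thread does not show it). Whether `helmfile destroy` is a drop-in replacement for the purge is left open above.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Echo a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Purge the leftover release record, then diff and redeploy.
# $1: release name (e.g. production); run from the directory holding helmfile.yaml.
purge_and_redeploy() {
    release="$1"
    # Helm 2 syntax, as used in the thread.
    run helm delete "$release" --purge || return 1
    # The diff should now succeed, since no stale release record remains.
    run helmfile diff || return 1
    # Assumed redeploy step; not shown in the thread.
    run helmfile apply
}
```

Calling `purge_and_redeploy production` with the default `DRY_RUN=1` prints the three commands without touching the cluster, which makes it easy to review them first.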