Update mobileapps k8s deployment chart for Cassandra credentials
Open, High, Public

Description

The mobileapps deployment chart will need to be updated to configure the service for connecting to Cassandra, including key material for client encryption, as well as sourcing the password from the private repository.

  • Create cassandra user for mobileapps, add credentials in puppet
  • Add support to mobileapps chart for cassandra config

Event Timeline

Am I right in thinking that mobileapps itself has had no code changes yet that allow it to communicate with Cassandra?

Not yet, we are building a small npm package for that but haven't merged anything on PCS yet. I hope I have something this week.

How can we set things up to be able to use Cassandra (on staging for now)? I can send a patch with the config changes on deployment-charts for mobileapps, but I'm not sure how to reference credentials.

I suggest we standardize on the configuration that we've used for the golang applications using cassandra.

For instance, cassandra-http-gateway does as follows:

{{- define "config.app" }}
...
cassandra:
  port: 9042
  consistency: {{ .Values.main_app.consistency }}
  hosts:
{{- range $cassandra_host := .Values.main_app.cassandra_hosts }}
    - {{ $cassandra_host }}
{{- end }}
  local_dc: {{ .Values.main_app.datacentre }}
  authentication:
    username: {{ .Values.main_app.cassandra_user }}
    password: {{ .Values.config.private.cassandra_pass }}
  tls:
    ca: /etc/ssl/certs/wmf-ca-certificates.crt
{{- end }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cassandra-http-gateway-base-config
  labels:
    app: {{ template "base.name.chart" . }}
    chart: {{ template "base.name.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
data:
  config.yaml: |- {{ include "config.app" . | indent 4 }}

In your case, you probably want to add a similar cassandra section to mobileapps' configuration. I'll try to find the time to create a deployment-charts module to avoid repetition.

Sounds good, I will adapt the config to something that's compatible with the snippet.

The snippet from the cassandra-http-gateway helm chart is not using keyspace/tables (because, from what I understand of the project, they are submitted by the user at runtime). I updated my patch with this expected config structure:

cassandra:
  hosts: ["127.0.0.1"]
  port: 9042
  local_dc: "datacenter1"
  authentication:
    username: "cassandra"
    password: "cassandra"
caching:
  enabled: false
  cassandra:
    keyspace: "tests"
    storageTable: "storage"

I think it makes sense to separate the connection configuration from keyspace/table configurations as you did.
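
A minimal sketch of how that split might be consumed in a Node.js service: the `cassandra` block is translated into client options, while the `caching.cassandra` block is only used to name the table. The option names `contactPoints`, `localDataCenter`, `protocolOptions.port`, and `credentials` are those of the DataStax `cassandra-driver` package (older driver versions use an `authProvider` instead of `credentials`). The helper names `toClientOptions` and `storageTableName` are hypothetical and are not taken from the mobileapps patch.

```javascript
'use strict';

// Example config in the structure proposed above (dev values from the thread).
const config = {
  cassandra: {
    hosts: ['127.0.0.1'],
    port: 9042,
    local_dc: 'datacenter1',
    authentication: { username: 'cassandra', password: 'cassandra' },
  },
  caching: {
    enabled: false,
    cassandra: { keyspace: 'tests', storageTable: 'storage' },
  },
};

// Hypothetical helper: connection settings only, shaped like
// cassandra-driver client options.
function toClientOptions(cassandra) {
  return {
    contactPoints: cassandra.hosts,
    localDataCenter: cassandra.local_dc,
    protocolOptions: { port: cassandra.port },
    credentials: {
      username: cassandra.authentication.username,
      password: cassandra.authentication.password,
    },
  };
}

// Hypothetical helper: storage settings only, returns "keyspace.table".
function storageTableName(caching) {
  const { keyspace, storageTable } = caching.cassandra;
  return `${keyspace}.${storageTable}`;
}
```

Keeping the two helpers apart means the connection block can be shared with other charts, while keyspace and table names stay specific to the service.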

Sounds good, I will leave it as it is in the patch.

Change 985166 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Introduce server side caching

https://gerrit.wikimedia.org/r/985166

@Joe and @Eevans is there anything we should do on our side apart from merging the patch for caching handling in PCS?

MSantos triaged this task as High priority.Jan 15 2024, 4:45 PM

Is there a way of specifying a ca cert that the connection to Cassandra will obey? I don't see it in the Cassandra config but I would imagine that the service already has support elsewhere.

Change 985166 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Introduce server side caching

https://gerrit.wikimedia.org/r/985166

The example config doesn't use TLS because Cassandra in our dev environment doesn't have TLS enabled. To pass the path to the CA file, the config expects this:
https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/990729
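
As a rough illustration of what passing a CA path involves on the Node.js side: `cassandra-driver` hands its `sslOptions` to Node's `tls.connect()`, whose `ca` option takes certificate contents rather than a file path, so the file has to be read first. The sketch below assumes the config carries the path under a `tls.ca` key, as in the cassandra-http-gateway snippet; the actual key used by the linked patch may differ, and `toSslOptions` is a hypothetical helper name.

```javascript
'use strict';
const fs = require('fs');

// Hypothetical helper: turns an optional `tls` config block into the
// `sslOptions` object that cassandra-driver passes to tls.connect().
// Assumes `tls.ca` is a filesystem path.
function toSslOptions(tlsConfig) {
  if (!tlsConfig || !tlsConfig.ca) {
    return undefined; // no TLS configured, e.g. a local dev Cassandra
  }
  // tls.connect() expects certificate contents, not a path, so read the file.
  // readFileSync throws ENOENT if the CA bundle is missing from the image.
  return {
    ca: [fs.readFileSync(tlsConfig.ca)],
    rejectUnauthorized: true,
  };
}
```

Returning `undefined` when no `tls` block is present keeps the dev configuration working unchanged.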

Change 991027 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] modules: add cassandra client module

https://gerrit.wikimedia.org/r/991027

Change 991032 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mobileapps: add Cassandra config support

https://gerrit.wikimedia.org/r/991032

  • Create cassandra user for mobileapps, add credentials in puppet

This is done (TTBMK). The user is mediawiki_services_mobileapps, it has read/write access to pregenerated_cache.{media_list,mobile_html,page_summary}. The password is defined in private.git (profile::cassandra::user_credentials).

The same user exists on the production cluster, and on cassandra-dev as well (for staging purposes).

Change 991579 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Adapt config to helm charts

https://gerrit.wikimedia.org/r/991579

@Eevans Since things are moving forward, can devs have cqlsh access (read-only should be good enough) to verify things on staging when PCS is connected to Cassandra?

Change 991579 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Adapt config to helm charts

https://gerrit.wikimedia.org/r/991579

We aren't currently set up to provide that access, but I recognize that it is needed. I created T355730 and will follow up there.

Change 991027 merged by jenkins-bot:

[operations/deployment-charts@master] modules: add cassandra client module

https://gerrit.wikimedia.org/r/991027

Change 993154 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mobileapps: add cassandra config in staging

https://gerrit.wikimedia.org/r/993154

Now that the RESTBase/parsoid storage deprecation is almost done, we would like to pick this up as the next step towards deprecating PCS/summary too.
What are the next steps? On the PCS side, we are ready to test Cassandra storage on staging.

Change #991032 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: add Cassandra config support

https://gerrit.wikimedia.org/r/991032

Change #993154 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: add cassandra config in staging

https://gerrit.wikimedia.org/r/993154

Change #1016722 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Caching config for pregenerated content

https://gerrit.wikimedia.org/r/1016722

Change #1016722 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Caching config for pregenerated content

https://gerrit.wikimedia.org/r/1016722

Change #1016744 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] pipeline: Add certs required for cassandra connections

https://gerrit.wikimedia.org/r/1016744

I am testing things on staging and I am getting this error (and a CrashLoopBackOff from the pod):

ENOENT: no such file or directory, open '/etc/ssl/certs/wmf-ca-certificates.crt'

Change #1016744 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] pipeline: Add certs required for cassandra connections

https://gerrit.wikimedia.org/r/1016744

From staging:

{
  "status": 500,
  "type": "internal_error",
  "title": "NoHostAvailableError",
  "detail": "All host(s) tried for query failed. First host tried, 10.192.16.15:9042: OperationTimedOutError: The host 10.192.16.15:9042 did not reply before timeout 12000 ms\n    at Timeout.requestTimedOut [as _onTimeout] (/srv/service/node_modules/cassandra-driver/lib/operation-state.js:107:28)\n    at listOnTimeout (internal/timers.js:554:17)\n    at processTimers (internal/timers.js:497:7) {\n  info: 'Represents a client-side error that is raised when the client did not hear back from the server within socketOptions.readTimeout',\n  host: '10.192.16.15:9042'\n}. See innerErrors.",
  "method": "GET",
  "uri": "/en.wikipedia.org/v1/page/mobile-html/Dog"
}

It looks like the network setup between staging and cassandra-dev is as we would expect. Pods are allowed to connect, and the Cassandra firewall allows the staging IP range. While nsentered into the mobileapps staging pod:

root@kubestage1003:/home/hnowlan# time nc -z 10.192.16.15 9042  && echo ok

real    0m0.032s
user    0m0.001s
sys     0m0.000s
ok
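
The `nc -z` check above only proves that a TCP connection can be opened; it does not exercise the TLS or CQL handshake. A rough Node.js equivalent, useful for running the same check from inside the service's own runtime, could look like the sketch below. `canConnect` is a hypothetical helper name, and the 2000 ms default timeout is an arbitrary assumption.

```javascript
'use strict';
const net = require('net');

// Resolves true if a plain TCP connection to host:port opens within
// timeoutMs, false on error or timeout. No TLS or CQL traffic is sent.
function canConnect(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once('connect', () => finish(true));
    socket.once('error', () => finish(false));
  });
}
```

For example, `canConnect('10.192.16.15', 9042).then(console.log)` would report whether the port is reachable. A `true` result combined with a driver timeout points at the layers above TCP.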

I think the problem is in the Node.js Cassandra client's TLS initialization, and more specifically in how we pass the config options.

From staging:

{
  "status": 500,
  "type": "internal_error",
  "title": "ArgumentError",
  "detail": "Datacenter eqiad was not found. Available DCs are: [codfw]",
  "method": "GET",
  "uri": "/en.wikipedia.org/v1/page/mobile-html/Dog"
}
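
The error above indicates that the configured local data center must match one that the cluster actually reports. A small pre-flight check along these lines could surface the mismatch at startup rather than on the first request. This is only a sketch: `assertLocalDc` is a hypothetical helper, and the list of available data centers is assumed to be obtained elsewhere (for example from the driver's host metadata after connecting).

```javascript
'use strict';

// Hypothetical pre-flight check: fail early with a clear message when the
// configured local DC is not among the data centers the cluster reports.
// `availableDcs` is assumed to be collected elsewhere.
function assertLocalDc(localDc, availableDcs) {
  if (!availableDcs.includes(localDc)) {
    throw new Error(
      `Datacenter ${localDc} was not found. Available DCs are: [${availableDcs.join(', ')}]`
    );
  }
  return localDc;
}
```

With the values from the staging response, `assertLocalDc('eqiad', ['codfw'])` throws, while `assertLocalDc('codfw', ['codfw'])` returns `'codfw'`.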

Change #1017035 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Use codfw as cassandra local DC

https://gerrit.wikimedia.org/r/1017035

Change #1017035 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Use codfw as cassandra local DC on staging

https://gerrit.wikimedia.org/r/1017035

Things look better on staging.

MSantos moved this task from Backlog to Tracking on the Content-Transform-Team board.

Great! From our side this should be in tracking now. Please let us know if I'm missing anything, and feel free to go ahead and close it once it's completely done.

Do we have a plan for when and how we'd like to move this to production?