Page MenuHomePhabricator

Test push-notifications service in the staging environment
Closed, ResolvedPublic

Description

Background information

The push-notifications service is hosted in the staging environment at k8s cluster, we should perform minimal tests it before going full production.

How

# check its status from deploy1001
curl -k https://staging.svc.eqiad.wmnet:4104
4104 is the application port

Acceptance criteria

  • Service is properly running and ready for production load.

Event Timeline

@jijiki I'm having a problem while testing the service in production, but I don't know how to see the logs in order to have a better understanding of the root cause.

Here is what I'm trying

mbsantos@deploy1001:~$ curl -k --location --request POST 'https://staging.svc.eqiad.wmnet:4104/v1/message/fcm' --header 'Content-Type: 
application/x-www-form-urlencoded' --data-urlencode 'dryRun=true' --data-urlencode 'messageType=checkEchoV1' --data-urlencode 'deviceTo
kens=["lalala"]'
{"status":500,"type":"fcm_send_failed","title":"Failed to send message to FCM","detail":{},"method":"POST","uri":"/v1/message/fcm"}

The service was supposed to return 200, even though the token is fake. This usually occurs when failing to authenticate with FCM.

I am not sure if there is some work still in flight but here are a couple of API calls i tried from deploy1001:

GETing the service swagger doc works:

jgiannelos@deploy1001:~$ curl -k -I "https://staging.svc.eqiad.wmnet:4104/?doc" 
HTTP/1.1 200 OK
`

POSTing to the FCM endpoint hangs:

curl -k -X POST "https://staging.svc.eqiad.wmnet:4104/v1/message/fcm"  \
-H  "accept: application/json" 
-H  "Content-Type: application/json"
-d "{\"deviceTokens\":[\"lala\"],\"messageType\":\"checkEchoV1\",\"topic\":\"string\",\"dryRun\":false}"

POSTing to the APNS endpoint returns an error:

curl -k -X POST "https://staging.svc.eqiad.wmnet:4104/v1/message/apns"
-H  "accept: application/json"
-H  "Content-Type: application/json"
-d "{\"deviceTokens\":[\"lala\"],\"messageType\":\"checkEchoV1\",\"topic\":\"string\",\"dryRun\":false}"

error:

<pre>Cannot POST /v1/message/apns</pre>

@Jgiannelos the deployed image to production is outdated, my patch last patch on deployment-charts fixes it (not the WIP one)

Can we please get an openapi spec added to this service?

I was quite confused when trying to understand what endpoints it serves:

$ curl -k https://kubestage1002.eqiad.wmnet:4104/?spec -v | jq .
{
  "openapi": "3.0.0",
  "info": {
    "version": "0.0.1",
    "title": "push-notifications",
    "description": "A service for sending push notification messages"
  },
  "paths": {}
}

@jijiki I'm having a problem while testing the service in production, but I don't know how to see the logs in order to have a better understanding of the root cause.

You should be able to grab your logs from kibaba via a kubernetes.namespace_name filter like: https://logstash-next.wikimedia.org/goto/b9464dd5baca7d18772e541dc9527205

Change 625832 had a related patch set uploaded (by Jgiannelos; owner: Jgiannelos):
[operations/deployment-charts@master] Enable OpenAPI spec on push-notifications service

https://gerrit.wikimedia.org/r/625832

This comment was removed by MSantos.

@MSantos

@jijiki I'm having a problem while testing the service in production, but I don't know how to see the logs in order to have a better understanding of the root cause.

Here is what I'm trying

mbsantos@deploy1001:~$ curl -k --location --request POST 'https://staging.svc.eqiad.wmnet:4104/v1/message/fcm' --header 'Content-Type: 
application/x-www-form-urlencoded' --data-urlencode 'dryRun=true' --data-urlencode 'messageType=checkEchoV1' --data-urlencode 'deviceTo
kens=["lalala"]'
{"status":500,"type":"fcm_send_failed","title":"Failed to send message to FCM","detail":{},"method":"POST","uri":"/v1/message/fcm"}

I merged a patch to add the relevant proxy so push-notifications will be able to talk to external endpoints. I run the test but sadly I got a 500 error again.

{"name":"push-notifications","hostname":"push-notifications-main-6bcff65987-v8jxv","pid":29,"level":"ERROR","message":"500: fcm_send_failed","stack":"HTTPError: 500: fcm_send_failed\n    at fcm_1.sendMessage.then.catch (/srv/service/dist/routes/v1.js:42:15)\n    at process._tickCallback (internal/process/next_tick.js:68:7)","status":500,"type":"fcm_send_failed","detail":{},"request_id":"da0a4a52-b71c-4958-b379-f305e76cbfa0","request":{"url":"/v1/message/fcm","headers":{"user-agent":"curl/7.52.1","content-type":"application/x-www-form-urlencoded","content-length":"67","x-request-id":"da0a4a52-b71c-4958-b379-f305e76cbfa0"},"method":"POST","params":{"0":"/v1/message/fcm"},"query":{},"remoteAddress":"127.0.0.1","remotePort":34038},"levelPath":"error/500","msg":"500: fcm_send_failed","time":"2020-09-09T11:21:13.887Z","v":0}

Let me know how I can help

Here is the latest error response I got from the service: https://phabricator.wikimedia.org/P12543

Could that be related to a proxy connectivity issue?

Error while making request: connect ETIMEDOUT 172.217.164.141:443.\"."

Here is the latest error response I got from the service: https://phabricator.wikimedia.org/P12543

Could that be related to a proxy connectivity issue?

Error while making request: connect ETIMEDOUT 172.217.164.141:443.\"."

I see the app trying to connect to google IPs directly (e.g. not using the configured proxy server). Could you please verify that you use the proxy (code wise)?

I verified that it works in our local env when in the service config yaml there is an entry (under services -> conf) like:

proxy:
   host: <host>
   port: <port>
   protocol: <protocol>

I looked at the helm charts and it might be the case that the proxy config is not rendered as the service is expecting, I can't be sure though since I don't have access on the staging helm values.

I verified that it works in our local env when in the service config yaml there is an entry (under services -> conf) like:

proxy:
   host: <host>
   port: <port>
   protocol: <protocol>

I looked at the helm charts and it might be the case that the proxy config is not rendered as the service is expecting, I can't be sure though since I don't have access on the staging helm values.

I am afraid you may have to change this, so it will be similar to how other applications are configured to use our proxies.

Staging helm values can be accessed by running helmfile -e staging template:

deploy1001:/srv/deployment-charts/helmfile.d/services/push-notifications$ helmfile -e staging template

Fetching wmf-stable/push-notifications
Building dependency release=main, chart=/tmp/726107518/push-notifications/main/wmf-stable/push-notifications/latest/push-notifications
<snip>

Proxies are defined per enviroment here: https://github.com/wikimedia/operations-deployment-charts/tree/master/helmfile.d/services/push-notifications

Change 625832 abandoned by Jgiannelos:
[operations/deployment-charts@master] Enable OpenAPI spec on push-notifications service

Reason:
https://phabricator.wikimedia.org/T262528

https://gerrit.wikimedia.org/r/625832

After trying to release the proxy fixes from T262444 on staging, it errors with:

Failed to generate token: error:0D0680A8:asn1 encoding routines:asn1_check_tlen:wrong tag

From logstash:

VError: Failed to generate token: error:0D0680A8:asn1 encoding routines:asn1_check_tlen:wrong tag
    at prepareToken (/srv/service/node_modules/apn/lib/credentials/token/prepare.js:45:13)
    at config (/srv/service/node_modules/apn/lib/config.js:43:31)
    at new Client (/srv/service/node_modules/apn/lib/client.js:11:19)
    at new Provider (/srv/service/node_modules/apn/lib/provider.js:12:19)
    at Object.init (/srv/service/dist/outgoing/apns/apns.js:99:19)
    at Object.<anonymous> (/srv/service/dist/loaders/index.js:24:16)
    at Generator.next (<anonymous>)
    at /srv/service/dist/loaders/index.js:8:71
    at new Promise (<anonymous>)
    at __awaiter (/srv/service/dist/loaders/index.js:4:12)
    at Object.init (/srv/service/dist/loaders/index.js:20:12)
    at initApp (/srv/service/dist/app.js:44:20)
    at module.exports (/srv/service/dist/app.js:79:12)
    at _requireModule.then (/srv/service/node_modules/service-runner/lib/worker.js:195:56)
    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)
Caused by: Error: error:0D0680A8:asn1 encoding routines:asn1_check_tlen:wrong tag
    at Sign.sign (internal/crypto/sig.js:80:26)
    at sign (/srv/service/node_modules/jwa/index.js:152:45)
    at Object.sign (/srv/service/node_modules/jwa/index.js:200:27)
    at Object.jwsSign [as sign] (/srv/service/node_modules/jws/lib/sign-stream.js:32:24)
    at module.exports (/srv/service/node_modules/jsonwebtoken/sign.js:204:16)
    at prepareToken (/srv/service/node_modules/apn/lib/credentials/token/prepare.js:27:18)
    at config (/srv/service/node_modules/apn/lib/config.js:43:31)
    at new Client (/srv/service/node_modules/apn/lib/client.js:11:19)
    at new Provider (/srv/service/node_modules/apn/lib/provider.js:12:19)
    at Object.init (/srv/service/dist/outgoing/apns/apns.js:99:19)
    at Object.<anonymous> (/srv/service/dist/loaders/index.js:24:16)
    at Generator.next (<anonymous>)
    at /srv/service/dist/loaders/index.js:8:71
    at new Promise (<anonymous>)
    at __awaiter (/srv/service/dist/loaders/index.js:4:12)
    at Object.init (/srv/service/dist/loaders/index.js:20:12)

Also from helmfile -e staging diff the image from the previous state looks very old:

      spec:
        containers:
          # The main application container
          - name: push-notifications-main
-           image: "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-push-notifications:2020-06-08-183341-publish"
+           image: "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-push-notifications:2020-09-11-113122-publish"
            imagePullPolicy: IfNotPresent

helmfile status is

s@deploy1001:/srv/deployment-charts/helmfile.d/services/push-notifications$ helmfile -e staging status
Getting status main
LAST DEPLOYED: Fri Sep 11 15:12:34 2020
NAMESPACE: push-notifications
STATUS: PENDING_UPGRADE

RESOURCES:
==> v1/ConfigMap
NAME                                         DATA  AGE
config-main                                  3     17d
push-notifications-main-envoy-config-volume  1     17d
push-notifications-main-tls-proxy-certs      3     17d

==> v1/Deployment
NAME                     READY  UP-TO-DATE  AVAILABLE  AGE
push-notifications-main  1/1    1           1          17d

==> v1/NetworkPolicy
NAME                     POD-SELECTOR                         AGE
push-notifications-main  app=push-notifications,release=main  17d

==> v1/Pod(related)
NAME                                      READY  STATUS             RESTARTS  AGE
push-notifications-main-584d787fb6-2cgvk  0/2    ContainerCreating  0         47s
push-notifications-main-6bcff65987-v8jxv  2/2    Running            0         2d3h

==> v1/Secret
NAME                                   TYPE    DATA  AGE
push-notifications-main-secret-config  Opaque  0     17d

==> v1/Service
NAME                                 TYPE      CLUSTER-IP   EXTERNAL-IP  PORT(S)        AGE
push-notifications-main-tls-service  NodePort  10.64.76.42  <none>       4104:4104/TCP  17d


NOTES:
Thank you for installing push-notifications.

Your release is named push-notifications-main.

To learn more about the release, try:

  $ helm status push-notifications-main
  $ helm get push-notifications-main

@jijiki Can we somehow verify what's the APNS key that ends up in the staging pods? The error indicates that the key is misformatted. If thats not the case we might want to revoke the key we use and try a new one.

@Jgiannelos Sorry I didn't have time to look into it today. At a first glance the APNS key is in place, I will have a closer look tomorrow

Thanks @jijiki! If there is anything we can do to continue helping to resolve this, please feel free to ping us tomorrow. Our target date is 9/21 and so we are certainly in high priority mode :)

All sensitive data are stored in a different repo, and we had made a mistake in the yaml file where those are stored. Sorry about that!

We deployed again, and it appears to be up and running:)

Awesome, thanks @jijiki, both APNS/FCM endpoint requests return 200. I will keep an eye on the logs but I think we are good.

@Jgiannelos Great!

Btw, I came across this:

{"name":"push-notifications","hostname":"push-notifications-main-6ff659f745-qpw6g","pid":17,"level":50,"levelPath":"error/metrics","msg":"No such metrics client: 'undefined'","time":"2020-09-15T11:00:13.243Z","v":0}
{"name":"push-notifications","hostname":"push-notifications-main-6ff659f745-qpw6g","pid":29,"level":50,"levelPath":"error/metrics","msg":"No such metrics client: 'undefined'","time":"2020-09-15T11:00:17.233Z","v":0}

The goal is to get staging metrics at Push Notifications staging

Can you please check if everything is in order on your end regarding metrics? Thank you!

I think this is raised because metrics is not defined in the service config since .Values.monitoring.enabled is false:
https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/push-notifications/values.yaml#L8
https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/push-notifications/values-staging.yaml

Can we connect prometheus to our service in staging and enable monitoring on helm so we can also verify that metrics are pulled?

I have enabled it, but I still don't see any metrics, so something else is missing. I will take a look tomorrow

I have enabled it, but I still don't see any metrics, so something else is missing. I will take a look tomorrow

I fixed the Grafana dashboard, and now all the metrics can be visualized https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&refresh=1m&from=now-3h&to=now&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=push-notifications

@jijiki this was our last sanity check before giving you green light to make the prod deploy, the test in the staging environment should now be complete. cc/ @Jgiannelos @Mholloway @sdkim

@MSantos, great! Can you please add the -production tag to the final version and update helmfile.d/services/push-notifications/values.yaml accordingly? @JMeybohm and I will take care of the rest. I will ping you if there is anything I am forgetting. Since tomorrow is Friday, we will possibly schedule the release for Monday morning.

MSantos claimed this task.

@MSantos, great! Can you please add the -production tag to the final version and update helmfile.d/services/push-notifications/values.yaml accordingly? @JMeybohm and I will take care of the rest. I will ping you if there is anything I am forgetting. Since tomorrow is Friday, we will possibly schedule the release for Monday morning.

Done! Thanks @jijiki, this task should now be considered resolved.

Change 628340 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/deployment-charts@master] push-notifications: change version tag to -production

https://gerrit.wikimedia.org/r/628340