Migrate push-notifications to mw-api-int
Closed, Resolved (Public)

Description

Switch the backend MediaWiki API calls from the bare-metal api_appserver cluster to the Kubernetes-hosted mw-api-int.

Event Timeline

Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.

Change 905942 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s

https://gerrit.wikimedia.org/r/905942

@Tgr @MSantos If you could provide me with a way to test that push-notifications works correctly after changing api backend, it'd be much appreciated :)

I'm not involved with the library. The end-to-end test would be to install the Android or iOS app, connect it to a user account, make sure app notifications are enabled at <wiki_domain>/wiki/Special:Preferences#mw-prefsection-echo, then generate a notification (e.g. by registering another user and mentioning the original user). There are probably simpler ways though.

@Jgiannelos @MSantos As you are the two contacts listed on https://www.mediawiki.org/wiki/Wikimedia_Product_Infrastructure_team/Push_Notifications_Infrastructure could you please provide me with a way to test the service after migrating the api backend, or can we schedule a sync time to do the migration?

Is this about outgoing traffic from push-notifications node service to MW api?

Yes, we are changing it from mwapi-async (which redirects to api-rw.discovery.wmnet) to mw-api-int-async (which redirects to mw-api-int.discovery.wmnet, hosted on kubernetes).

I took a look, and I don't see any outgoing requests to the MW API in the codebase. Most of the related code that exists is boilerplate from https://github.com/wikimedia/service-runner and is not actively used.

We can try the following:

  • Deploy the change only on staging k8s

I've changed the CR to modify the listener only in staging

  • Enable debug level logs only on staging

According to https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/push-notifications/templates/_config.yaml#L25 logs are already in debug for all releases. Do you think it is worth adding a way to change that level so we can put it at trace for staging?

Changed to 10/30 seconds in the staging CR.

  • Make a request to push notifications to enqueue a new message (eg. /v1/message/fcm)
    • Use dummy device tokens so the service fails and queries the MW API
  • Check the logs for errors from the MW API

In normal operation, with a dummy token, does the mw api log an error?

We haven't worked for quite some time on push notifications so I don't remember off the top of my head the cURL requests.
It should be something like:

curl -k -X POST https://staging.svc.eqiad:4104/v1/message/fcm -H "Content-Type: application/json" -d '{
  "deviceTokens": [
    "testing-mw-api-dummy-token"
  ],
  "messageType": "checkEchoV1",
  "dryRun": false
}'

Thanks for the test query. I've added you to the CR so you're kept up to date with the changes we may do.

In normal operation, with a dummy token, does the mw api log an error?

Yes

On the other hand, I see a lot of connection failures for the MW API requests:
https://logstash.wikimedia.org/goto/eba59420e81c8509587a488662d8fb97

Maybe something is broken already there.

curl -H 'Host: en.wikipedia.org' http://localhost:6500/w/api.php works from the pod namespace, so at least it's not an egress issue.

Let's merge the staging config and try the same GET request.
From the config, it looks like it is using localhost:

mwapi_req:
  method: post
  uri: http://localhost:6500/w/api.php
  headers:
    host: meta.wikimedia.org
    user-agent: '{{user-agent}}'
    x-forwarded-proto: https
  body: '{{ default(request.query, {}) }}'
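To make the template above concrete, here is a minimal Python sketch of the HTTP request it appears to describe: a POST to the local proxy listener on port 6500, with the target wiki selected by the Host header. The helper name, the example query, the user agent string, and the form-encoded body are assumptions for illustration only; the service builds this request through its own templating.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Values copied from the mwapi_req template in the staging config.
MWAPI_URI = "http://localhost:6500/w/api.php"
MWAPI_HOST = "meta.wikimedia.org"


def build_mwapi_request(query, user_agent="push-notifications-test"):
    """Build (but do not send) a request shaped like the mwapi_req template.

    Hypothetical helper: the form-encoded body is an assumption about how
    the query object is serialized.
    """
    body = urlencode(query or {}).encode("utf-8")
    return Request(
        MWAPI_URI,
        data=body,
        method="POST",
        headers={
            "Host": MWAPI_HOST,  # selects the wiki behind the local listener
            "User-Agent": user_agent,
            "X-Forwarded-Proto": "https",
        },
    )


if __name__ == "__main__":
    req = build_mwapi_request({"action": "query", "meta": "siteinfo", "format": "json"})
    print(req.get_method(), req.full_url, req.get_header("Host"))
```

Because the URI always points at localhost, switching the backend only changes which upstream the local listener forwards to; the request itself stays the same.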

Request used for testing:

cgoubert@deploy1002:~$ curl -i -X POST 'https://staging.svc.eqiad.wmnet:4104/v1/message/fcm' -H "Content-Type: application/json" --data-binary "@./push-notification.json"
cgoubert@deploy1002:~$ cat push-notification.json
{
  "deviceTokens": [
    "testing-mw-api-dummy-token"
  ],
  "messageType": "checkEchoV1",
  "dryRun": false
}
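If the payload file needs to be regenerated with other dummy tokens, a small script along these lines could produce it. This is only a sketch: the field names are taken from the test request above, while the function names and defaults are hypothetical.

```python
import json


def build_fcm_test_payload(device_tokens, message_type="checkEchoV1", dry_run=False):
    """Return the JSON body for POST /v1/message/fcm, as in the test request."""
    payload = {
        "deviceTokens": list(device_tokens),
        "messageType": message_type,
        "dryRun": dry_run,
    }
    return json.dumps(payload, indent=2)


def write_payload_file(path, device_tokens):
    """Write the payload to a file suitable for curl --data-binary @<path>."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_fcm_test_payload(device_tokens) + "\n")


if __name__ == "__main__":
    write_payload_file("push-notification.json", ["testing-mw-api-dummy-token"])
```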

Staging logs

Change 905942 merged by jenkins-bot:

[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s

https://gerrit.wikimedia.org/r/905942

Mentioned in SAL (#wikimedia-operations) [2023-04-24T12:28:44Z] <claime> Deploying push-notifications staging for switch to mw-api-int - T334061

In kubectl logs I can see my message in queue:

Queue (push): Added item:  SingleDeviceMessage {                                                                                      
  provider: 'fcm',                                                                                                                    
  type: 'checkEchoV1',                                                                                                                
  dryRun: false,                                                                                                                      
  meta: {},                                                                                                                           
  deviceToken: 'testing-mw-api-dummy-token',                                                                                          
  enqueueTimestamp: 1682339370641                                                                                                     
}

The response from mw-api-int, an HTTP 200 carrying an api_error:

REQUEST response end http://localhost:6500/w/api.php 200 {                                                                            
  date: 'Mon, 24 Apr 2023 12:29:56 GMT',                                                                                              
  server: 'envoy',                                                                                                                    
  'x-powered-by': 'PHP/7.4.33',                                                                                                       
  'x-content-type-options': 'nosniff',                                                                                                
  'mediawiki-api-error': 'badtoken',                                                                                                  
  'x-frame-options': 'DENY',                                                                                                          
  'content-disposition': 'inline; filename=api-result.json',                                                                          
  'cache-control': 'private, must-revalidate, max-age=0',                                                                             
  vary: 'Accept-Encoding',                                                                                                            
  'x-request-id': 'd2763daa-950f-4050-a84c-5772faf7f475',                                                                             
  'set-cookie': [                                                                                                                     
    'ss0-metawikiSession=hit9fsno6obeendic8c26j2vrmp9q201; path=/; secure; HttpOnly',                                                 
    'metawikiSession=hit9fsno6obeendic8c26j2vrmp9q201; path=/; secure; HttpOnly; SameSite=None'                                       
  ],                                                                                                                                  
  'content-length': '370',                                                                                                            
  'backend-timing': 'D=45110 t=1682339396245510',                                                                                     
  'content-type': 'application/json; charset=utf-8',                                                                                  
  'x-envoy-upstream-service-time': '79'                                                                                               
}                                                                                                                                     
REQUEST end event http://localhost:6500/w/api.php                                                                                     
REQUEST has body http://localhost:6500/w/api.php 370                                                                                  
REQUEST emitting complete http://localhost:6500/w/api.php                                                                             
{"name":"push-notifications","hostname":"push-notifications-main-8c6598c6c-4zkk5","pid":17,"level":50,"err":{"message":"200: api_error","name":"push-notifications","stack":"HTTPError: 200: api_error\n    at /srv/service/dist/lib/api-util.js:71:19\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n    at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:729:18)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n    at Immediate.Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n    at processImmediate (internal/timers.js:461:21)","status":200,"type":"api_error","title":"badtoken","detail":"Invalid CSRF token.","levelPath":"error/login"},"msg":"200: api_error","time":"2023-04-24T12:29:56.311Z","v":0}
{"name":"push-notifications","hostname":"push-notifications-main-8c6598c6c-4zkk5","pid":17,"level":20,"err":{"message":"200: api_error","name":"push-notifications","stack":"HTTPError: 200: api_error\n    at /srv/service/dist/lib/api-util.js:71:19\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n    at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:729:18)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n    at Immediate.Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n    at processImmediate (internal/timers.js:461:21)","status":200,"type":"api_error","title":"badtoken","detail":"Invalid CSRF token.","levelPath":"debug"},"msg":"200: api_error","time":"2023-04-24T12:29:56.311Z","v":0}
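Two things in the output above are easy to miss: the MW API answered HTTP 200 and signalled the failure through the mediawiki-api-error header, and the JSON log records carry numeric levels that look like bunyan's (50 is error, 20 is debug). The sketch below shows one way to pull those fields out when comparing staging and production logs. The function names are hypothetical, and it assumes each JSON record is available as a single unwrapped line.

```python
import json

# Standard bunyan numeric levels.
BUNYAN_LEVELS = {10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal"}


def mw_api_error_code(headers):
    """Return the value of the mediawiki-api-error header, or None if absent.

    Header names are matched case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("mediawiki-api-error")


def summarize_log_line(line):
    """Extract level name, error type, title and detail from one JSON log line."""
    record = json.loads(line)
    err = record.get("err", {})
    return {
        "level": BUNYAN_LEVELS.get(record.get("level"), "unknown"),
        "type": err.get("type"),
        "title": err.get("title"),
        "detail": err.get("detail"),
    }


if __name__ == "__main__":
    # Trimmed-down record with the same fields as the staging log lines.
    sample = json.dumps({
        "name": "push-notifications",
        "level": 50,
        "err": {"status": 200, "type": "api_error", "title": "badtoken",
                "detail": "Invalid CSRF token."},
        "msg": "200: api_error",
    })
    print(summarize_log_line(sample))
    print(mw_api_error_code({"mediawiki-api-error": "badtoken"}))
```

With a dummy device token, a badtoken api_error at both levels is the expected result, so seeing the same summary before and after the switch suggests the new backend behaves like the old one.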

And the hit in mw-api-int logstash

Change 911288 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s

https://gerrit.wikimedia.org/r/911288

Change 911288 merged by jenkins-bot:

[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s

https://gerrit.wikimedia.org/r/911288

Mentioned in SAL (#wikimedia-operations) [2023-04-24T13:13:34Z] <claime> Deploying push-notifications production for switch to mw-api-int - T334061

Mentioned in SAL (#wikimedia-operations) [2023-04-24T13:32:31Z] <claime> Deployed push-notifications production for switch to mw-api-int - T334061

End-to-end testing delivered the push notification, and the test with an invalid token produced the same log as in staging.
Considering this resolved; feel free to reopen if any issue comes up.