Switch backend mw-api calls from baremetal api_appserver cluster to kubernetes hosted mw-api-int
Description
Details
Status | Assigned | Task
---|---|---
Stalled | None | T255792 Quibble runs core:unit tests twice!
Open | None | T328919 Upgrade to PHPUnit 10
Open | None | T338103 Micro-optimize ApiResult::isMetadataKey with str_starts_with once we support PHP8+
Open | None | T328921 Drop PHP 7.4 support from MediaWiki
Stalled | None | T334726 Use return type `never` in Wikibase
Open | None | T328922 Drop PHP 8.0 support from MediaWiki
Stalled | None | T319055 Upgrade to psr/container 2.x
Stalled | Krinkle | T319432 Migrate WMF production from PHP 7.4 to PHP 8.1
Open | None | T291916 Tracking task for Bullseye migrations in production
Stalled | None | T356293 Migrate MW appservers' base images to bullseye
Open | None | T290536 Serve production traffic via Kubernetes
In Progress | Clement_Goubert | T333120 Migrate internal traffic to k8s
Resolved | Clement_Goubert | T334061 Migrate push-notifications to mw-api-int
Event Timeline
Change 905942 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s
I'm not involved with the library. The end-to-end test would be to install the Android or iOS app, connect it to a user account, make sure app notifications are enabled at <wiki_domain>/wiki/Special:Preferences#mw-prefsection-echo, then generate a notification (e.g. by registering another user and mentioning the original user). There are probably simpler ways though.
@Jgiannelos @MSantos As you are the two contacts listed on https://www.mediawiki.org/wiki/Wikimedia_Product_Infrastructure_team/Push_Notifications_Infrastructure could you please provide me with a way to test the service after migrating the api backend, or can we schedule a sync time to do the migration?
Yes, we are changing it from mwapi-async (which redirects to api-rw.discovery.wmnet) to mw-api-int-async (which redirects to mw-api-int.discovery.wmnet, hosted on kubernetes).
I took a look, and I don't see any outgoing requests to the MW API in the codebase. Most of the related code that exists is boilerplate from https://github.com/wikimedia/service-runner and is not actively used.
We can try the following:
- Deploy the change only on staging k8s
- Enable debug level logs only on staging
- Change the time to flush to something reasonable so we don't wait forever: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/push-notifications/templates/_config.yaml#L75
- Make a request to push notifications to enqueue a new message (eg. /v1/message/fcm)
- Use dummy device tokens so the service fails and queries the MW API
- Check the logs for errors from mw api
We haven't worked for quite some time on push notifications so I don't remember off the top of my head the cURL requests.
It should be something like:
curl -k -X POST https://staging.svc.eqiad:4104/v1/message/fcm -H "Content-Type: application/json"
{ "deviceTokens": [ "testing-mw-api-dummy-token" ], "messageType": "checkEchoV1", "dryRun": false }
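For anyone who prefers a script over cURL, the same test request could be assembled roughly as follows. This is only a sketch: the URL and payload are copied from the comment above, the helper name `build_test_request` is made up for illustration, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# URL and payload are taken from the comment above; adjust them to the
# actual staging endpoint before sending anything.
STAGING_URL = "https://staging.svc.eqiad:4104/v1/message/fcm"


def build_test_request(url=STAGING_URL, token="testing-mw-api-dummy-token"):
    """Hypothetical helper: build (but do not send) the FCM test message."""
    payload = {
        "deviceTokens": [token],
        "messageType": "checkEchoV1",
        "dryRun": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_test_request()
    # Print what would be sent; call urllib.request.urlopen(req) to send it.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping the build step separate from the send step makes it easy to check the payload before it reaches the staging service.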
I've changed the CR to modify the listener only in staging
- Enable debug level logs only on staging
According to https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/push-notifications/templates/_config.yaml#L25 logs are already in debug for all releases. Do you think it is worth adding a way to change that level so we can put it at trace for staging?
- Change the time to flush to something reasonable so we don't wait forever: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/push-notifications/templates/_config.yaml#L75
Changed to 10/30 seconds in the staging CR.
- Make a request to push notifications to enqueue a new message (eg. /v1/message/fcm)
- Use dummy device tokens so the service fails and queries the MW API
- Check the logs for errors from mw api
In normal operation, with a dummy token, does the mw api log an error?
We haven't worked for quite some time on push notifications so I don't remember off the top of my head the cURL requests.
It should be something like:
curl -k -X POST https://staging.svc.eqiad:4104/v1/message/fcm -H "Content-Type: application/json"
{ "deviceTokens": [ "testing-mw-api-dummy-token" ], "messageType": "checkEchoV1", "dryRun": false }
Thanks for the test query. I've added you to the CR so you're kept up to date with any changes we make.
On the other hand, I see a lot of connection failures for the MW API requests:
https://logstash.wikimedia.org/goto/eba59420e81c8509587a488662d8fb97
Maybe something is broken already there.
curl -H 'Host: en.wikipedia.org' http://localhost:6500/w/api.php works from the pod namespace, so at least it's not an egress issue.
Let's merge the staging config and try the same GET request.
From the config it looks like it is using localhost:
mwapi_req:
  method: post
  uri: http://localhost:6500/w/api.php
  headers:
    host: meta.wikimedia.org
    user-agent: '{{user-agent}}'
    x-forwarded-proto: https
  body: '{{ default(request.query, {}) }}'
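To illustrate what this template amounts to, here is a rough sketch of the equivalent request: a POST to the local listener on port 6500, with the target wiki selected by the Host header. The helper name, the default user agent string and the form-encoded body are assumptions made for the example, not taken from the service code.

```python
import urllib.parse
import urllib.request

# Local listener URI copied from the mwapi_req template above.
LISTENER_URI = "http://localhost:6500/w/api.php"


def build_mwapi_request(query, wiki_host="meta.wikimedia.org",
                        user_agent="push-notifications-sketch"):
    """Hypothetical helper mirroring the mwapi_req template.

    The connection goes to the local listener; the Host header tells the
    backend which wiki the request is for. Form encoding of the body is an
    assumption for this sketch.
    """
    body = urllib.parse.urlencode(query).encode("ascii")
    return urllib.request.Request(
        LISTENER_URI,
        data=body,
        headers={
            "Host": wiki_host,
            "User-Agent": user_agent,
            "X-Forwarded-Proto": "https",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_mwapi_request({"action": "query", "format": "json"})
    print(req.get_method(), req.full_url, req.header_items())
```

Because the wiki is chosen only by the Host header, pointing the listener at a different upstream should not require any change to the request itself.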
Request used for testing:
cgoubert@deploy1002:~$ curl -i -X POST 'https://staging.svc.eqiad.wmnet:4104/v1/message/fcm' -H "Content-Type: application/json" --data-binary "@./push-notification.json"^C
cgoubert@deploy1002:~$ cat push-notification.json
{ "deviceTokens": [ "testing-mw-api-dummy-token" ], "messageType": "checkEchoV1", "dryRun": false }
Change 905942 merged by jenkins-bot:
[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s
Mentioned in SAL (#wikimedia-operations) [2023-04-24T12:28:44Z] <claime> Deploying push-notifications staging for switch to mw-api-int - T334061
In kubectl logs I can see my message in queue:
Queue (push): Added item: SingleDeviceMessage { provider: 'fcm', type: 'checkEchoV1', dryRun: false, meta: {}, deviceToken: 'testing-mw-api-dummy-token', enqueueTimestamp: 1682339370641 }
The 200 with api_error response from mw-api-int
REQUEST response end http://localhost:6500/w/api.php 200 { date: 'Mon, 24 Apr 2023 12:29:56 GMT', server: 'envoy', 'x-powered-by': 'PHP/7.4.33', 'x-content-type-options': 'nosniff', 'mediawiki-api-error': 'badtoken', 'x-frame-options': 'DENY', 'content-disposition': 'inline; filename=api-result.json', 'cache-control': 'private, must-revalidate, max-age=0', vary: 'Accept-Encoding', 'x-request-id': 'd2763daa-950f-4050-a84c-5772faf7f475', 'set-cookie': [ 'ss0-metawikiSession=hit9fsno6obeendic8c26j2vrmp9q201; path=/; secure; HttpOnly', 'metawikiSession=hit9fsno6obeendic8c26j2vrmp9q201; path=/; secure; HttpOnly; SameSite=None' ], 'content-length': '370', 'backend-timing': 'D=45110 t=1682339396245510', 'content-type': 'application/json; charset=utf-8', 'x-envoy-upstream-service-time': '79' }
REQUEST end event http://localhost:6500/w/api.php
REQUEST has body http://localhost:6500/w/api.php 370
REQUEST emitting complete http://localhost:6500/w/api.php
{"name":"push-notifications","hostname":"push-notifications-main-8c6598c6c-4zkk5","pid":17,"level":50,"err":{"message":"200: api_error","name":"push-notifications","stack":"HTTPError: 200: api_error\n at /srv/service/dist/lib/api-util.js:71:19\n at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:729:18)\n at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n at Immediate.Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n at processImmediate (internal/timers.js:461:21)","status":200,"type":"api_error","title":"badtoken","detail":"Invalid CSRF token.","levelPath":"error/login"},"msg":"200: api_error","time":"2023-04-24T12:29:56.311Z","v":0}
{"name":"push-notifications","hostname":"push-notifications-main-8c6598c6c-4zkk5","pid":17,"level":20,"err":{"message":"200: api_error","name":"push-notifications","stack":"HTTPError: 200: api_error\n at /srv/service/dist/lib/api-util.js:71:19\n at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:729:18)\n at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n at Immediate.Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n at processImmediate (internal/timers.js:461:21)","status":200,"type":"api_error","title":"badtoken","detail":"Invalid CSRF token.","levelPath":"debug"},"msg":"200: api_error","time":"2023-04-24T12:29:56.311Z","v":0}
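When repeating this check (for example after the production deploy), the expected entries could be picked out of the kubectl output with a small filter along these lines. It is a sketch only: the function name is hypothetical, and it assumes the JSON log lines keep the `err.status`, `err.type` and `err.title` fields seen above.

```python
import json


def is_expected_badtoken(log_line):
    """Hypothetical check: True if a JSON log line is the 200/api_error/badtoken
    entry shown above, which indicates that the MW API was reached and answered."""
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        # Non-JSON lines such as "REQUEST end event ..." are ignored.
        return False
    if not isinstance(record, dict):
        return False
    err = record.get("err")
    if not isinstance(err, dict):
        return False
    return (
        err.get("status") == 200
        and err.get("type") == "api_error"
        and err.get("title") == "badtoken"
    )


if __name__ == "__main__":
    import sys

    # Usage sketch: pipe the pod logs into this script on stdin.
    hits = [line for line in sys.stdin if is_expected_badtoken(line)]
    print(f"{len(hits)} badtoken api_error entries found")
```

A connection failure would not produce such an entry, so its presence is a reasonable signal that the new backend is answering.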
And the hit in mw-api-int logstash
Change 911288 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s
Change 911288 merged by jenkins-bot:
[operations/deployment-charts@master] push-notifications: Switch to mw-api-int-async on k8s
Mentioned in SAL (#wikimedia-operations) [2023-04-24T13:13:34Z] <claime> Deploying push-notifications production for switch to mw-api-int - T334061
Mentioned in SAL (#wikimedia-operations) [2023-04-24T13:32:31Z] <claime> Deployed push-notifications production for switch to mw-api-int - T334061
End-to-end testing delivered the push notification, and the invalid-token test produced the same log as in staging.
Considering this resolved; feel free to reopen if any issue comes up.