2020-06-17 recommendation-api production deployment failed
Closed, Resolved (Public)

Description

mholloway-shell@deploy1001:/srv/deployment/recommendation-api/deploy$ scap deploy "`git log --pretty=format:'%s' -n 1`"
14:43:17 Started deploy [recommendation-api/deploy@c39d567]
14:43:17 Deploying Rev: HEAD = c39d56753f4e695fd0d6c8ed1785f78385cfbce7
14:43:17 Started deploy [recommendation-api/deploy@c39d567]: Update recommendation-api to db97742
14:43:17 
== CANARY ==
:* scb2001.codfw.wmnet
recommendation-api/deploy: fetch stage(s): 100% (ok: 1; fail: 0; left: 0)       
recommendation-api/deploy: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0)
14:43:46 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'recommendation-api/deploy', '-g', 'canary', 'promote', '--refresh-config'] on scb2001.codfw.wmnet returned [1]: Linking config files at: /srv/deployment/recommendation-api/deploy-cache/revs/c39d56753f4e695fd0d6c8ed1785f78385cfbce7/.git/config-files
Executing check 'depool'
Check 'depool' completed, output: 2020-06-17 14:43:29,368 [INFO] Depooling currently pooled services
2020-06-17 14:43:29,470 [WARNING] LB lvs2009:9090 reports pool recommendation-api_9632 as enabled/up/pooled, should be disabled/*/not pooled

Restarting service 'recommendation_api'
Port 9632 not up. Waiting 3.00s
Port 9632 up in 3.00s
Executing check 'endpoints'
Check 'endpoints' failed: /{domain}/v1/description/addition/{target} (Caption addition suggestions beta cluster) is CRITICAL: Test Caption addition suggestions beta cluster returned the unexpected status 504 (expecting: 200)

Executing check 'repool'
Check 'repool' completed, output: 2020-06-17 14:43:40,951 [INFO] Pooling currently depooled services
2020-06-17 14:43:41,043 [WARNING] LB lvs2009:9090 reports pool recommendation-api_9632 as disabled/up/not pooled, should be enabled/up/pooled


recommendation-api/deploy: promote and restart_service stage(s): 100% (ok: 0; fail: 1; left: 0)
14:43:46 1 targets had deploy errors
14:43:46 1 targets failed
14:43:46 1 of 1 canary targets failed, exceeding limit
Rollback all deployed groups? [Y/n]: Y
14:44:25 
== CANARY ==
:* scb2001.codfw.wmnet
recommendation-api/deploy: rollback stage(s): 100% (ok: 1; fail: 0; left: 0)    
14:44:33 Finished deploy [recommendation-api/deploy@c39d567]: Update recommendation-api to db97742 (duration: 01m 16s)
14:44:33 Finished deploy [recommendation-api/deploy@c39d567] (duration: 01m 16s)

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-06-17T14:49:18Z] <mdholloway> rolled back recommendation-api deployment due to canary endpoint check failure (T255683)

The underlying issue appears to be a TLS error (the certificate's altnames do not cover the Beta Cluster hostname):

mholloway-shell@mwmaint1002:~$ curl -s -X GET --header 'Accept: application/json; charset=utf-8' 'http://recommendation-api.discovery.wmnet:9632/wikidata.beta.wmflabs.org/v1/description/addition/en' | jq .
{
  "status": 504,
  "type": "internal_http_error",
  "detail": "Hostname/IP doesn't match certificate's altnames: \"Host: en.wikipedia.beta.wmflabs.org. is not in the cert's altnames: DNS:*.wikipedia.org, DNS:*.m.mediawiki.org, DNS:*.m.wikibooks.org, DNS:*.m.wikidata.org, DNS:*.m.wikimedia.org, DNS:*.m.wikimediafoundation.org, DNS:*.m.wikinews.org, DNS:*.m.wikipedia.org, DNS:*.m.wikiquote.org, DNS:*.m.wikisource.org, DNS:*.m.wikiversity.org, DNS:*.m.wikivoyage.org, DNS:*.m.wiktionary.org, DNS:*.mediawiki.org, DNS:*.planet.wikimedia.org, DNS:*.wikibooks.org, DNS:*.wikidata.org, DNS:*.wikimedia.org, DNS:*.wikimediafoundation.org, DNS:*.wikinews.org, DNS:*.wikiquote.org, DNS:*.wikisource.org, DNS:*.wikiversity.org, DNS:*.wikivoyage.org, DNS:*.wiktionary.org, DNS:*.wmfusercontent.org, DNS:*.zero.wikipedia.org, DNS:mediawiki.org, DNS:w.wiki, DNS:wikibooks.org, DNS:wikidata.org, DNS:wikimedia.org, DNS:wikimediafoundation.org, DNS:wikinews.org, DNS:wikiquote.org, DNS:wikisource.org, DNS:wikiversity.org, DNS:wikivoyage.org, DNS:wiktionary.org, DNS:wmfusercontent.org, DNS:wikipedia.org, DNS:api-ro.discovery.wmnet, DNS:api-rw.discovery.wmnet, DNS:api.svc.eqiad.wmnet\"",
  "method": "GET",
  "uri": "/wikidata.beta.wmflabs.org/v1/description/addition/en"
}
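
For illustration, the mismatch reported in "detail" can be reproduced with a simplified wildcard comparison. This is only a sketch, assuming the usual rule that "*" stands for exactly one left-most DNS label; it is not the verification code the service actually runs. The function names and the shortened ALTNAMES list (a few entries copied from the error above) are hypothetical.

```python
def san_matches(hostname: str, pattern: str) -> bool:
    """Simplified altname match: '*' may only replace the whole left-most label."""
    host_labels = hostname.rstrip(".").lower().split(".")
    pattern_labels = pattern.lower().split(".")
    # A wildcard covers exactly one label, so the label counts must agree.
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


def hostname_in_altnames(hostname: str, altnames: list[str]) -> bool:
    """Return True if any altname pattern covers the hostname."""
    return any(san_matches(hostname, pattern) for pattern in altnames)


# Hypothetical subset of the altnames listed in the error message above.
ALTNAMES = [
    "*.wikipedia.org",
    "*.m.wikipedia.org",
    "*.wikidata.org",
    "wikipedia.org",
]

# A production hostname is covered by '*.wikipedia.org'.
print(hostname_in_altnames("en.wikipedia.org", ALTNAMES))
# The Beta Cluster hostname has five labels and matches no pattern.
print(hostname_in_altnames("en.wikipedia.beta.wmflabs.org", ALTNAMES))
```

Under these assumptions the first call prints True and the second prints False, which is consistent with the 504: the certificate presented is the production one, and none of its altnames cover a *.beta.wmflabs.org host.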

I think the easiest fix here is to remove the new endpoint check that makes the Beta Cluster request.

Change 606210 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/recommendation-api@master] Remove new Beta Cluster endpoint check

https://gerrit.wikimedia.org/r/606210

I see the recommendation-api is in deployment-charts. Are you sure we should still deploy this using scap?

@Mholloway Good to know. Thanks!

Change 606481 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/recommendation-api@master] SE endpoints: Refactor for easier testing and add tests

https://gerrit.wikimedia.org/r/606481

Change 606210 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] Remove new Beta Cluster endpoint check

https://gerrit.wikimedia.org/r/606210

Change 606481 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] SE endpoints: Refactor for easier testing and add tests

https://gerrit.wikimedia.org/r/606481

Mholloway claimed this task.

This morning's deployment succeeded. Beta and production are now back in sync.