
New Service Request: developer-portal
Closed, Resolved, Public

Description

Description: static site implementing https://www.mediawiki.org/wiki/Developer_Advocacy/Developer_Portal. Demo instance at https://developer-portal.wmcloud.org/
Timeline: FY 21/22 Q4 (April-June 2022)
Diagram: browser -> Internet -> developer-portal
Technologies: a single container running nginx that serves static assets (HTML, JS, CSS, images). The static content is produced by a Python-based static site generator which executes in PipelineLib.
Point person: @bd808

This could also be deployed as a more traditional static site, but the build process for generating the static content is based on PipelineLib and produces a Docker container, so deployment on a Kubernetes cluster seems "simple" once ingress is taken care of.

Checklist

  • Review helm charts
  • developer-portal namespaces in k8s.
  • developer-portal puppet private tokens.
  • Generate TLS certificates
  • Review helmfile.d files
  • LVS setup
  • DNS for LVS records
  • Discovery DNS
  • Monitoring dashboard
  • Integration and Acceptance tests

Event Timeline

Restricted Application added a subscriber: Aklapper.
bd808 updated the task description.
bd808 added a project: Goal.

@akosiaris just a ping to let you know that this should be ready to move forward from the point of view of the site being ready for publication in early April. We would very much like to be live by the Hackathon in mid-May if possible.

What hostname will this be hosted on?

In T297140#7800215, @Majavah wrote:

What hostname will this be hosted on?

developer.wikimedia.org is the preferred hostname (T287748: Set up location / host URL - developer.wikimedia.org)

Change 773267 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/deployment-charts@master] admin: add developer-portal namespace

https://gerrit.wikimedia.org/r/773267

Change 773268 had a related patch set uploaded (by Majavah; author: Majavah):

[labs/private@master] Add dummy tokens for developer-portal

https://gerrit.wikimedia.org/r/773268

Change 773270 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Add developer-portal k8s accounts

https://gerrit.wikimedia.org/r/773270

Change 773994 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/deployment-charts@master] Add developer-portal chart

https://gerrit.wikimedia.org/r/773994

Change 773995 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/deployment-charts@master] helmfile.d: add developer-portal

https://gerrit.wikimedia.org/r/773995

Change 773268 merged by Alexandros Kosiaris:

[labs/private@master] Add dummy tokens for developer-portal

https://gerrit.wikimedia.org/r/773268

Change 773270 merged by Alexandros Kosiaris:

[operations/puppet@production] Add developer-portal k8s accounts

https://gerrit.wikimedia.org/r/773270

Change 773267 merged by jenkins-bot:

[operations/deployment-charts@master] admin: add developer-portal namespace

https://gerrit.wikimedia.org/r/773267

Change 783849 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] add developer.wikimedia.org alias

https://gerrit.wikimedia.org/r/783849

Change 783849 merged by Alexandros Kosiaris:

[operations/dns@master] add developer.wikimedia.org alias

https://gerrit.wikimedia.org/r/783849

Hi @akosiaris @JMeybohm

Today I merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/775964 to add another namespace, image-suggestion, and when looking at the diff on staging-codfw I noticed that it contained not only my expected change but also developer-portal.

So it seems like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/773267/ was not deployed yet.

Having to make a choice at the y/n prompt, I decided to deploy both, but only on staging-codfw. I have not continued with the other clusters so far.

I am following the docs around step 5/6/7 on https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service

It says to test if everything is ok, when adding a new namespace, with:

kube_env $YOUR-SERVICE-NAME staging-codfw
kubectl get ns

When literally doing that I run into:

Error from server (Forbidden): namespaces is forbidden: User "image-suggestion" cannot list resource "namespaces" in API group "" at the cluster scope

I _know_ this happened to me back when I added miscweb and it was not actually an issue, but can you remind me what the correct fix to the docs is?

And could you please check if everything looks ok to you and optionally deploy both namespace additions to other (staging) cluster(s)?

Mentioned in SAL (#wikimedia-operations) [2022-04-27T22:34:36Z] <mutante> kubernetes - Upgrading release=namespaces/namspace-certificates which added developer-portal and image-suggestion namespaces - but only on staging-codfw - (T304891, T305155, T297140)

It says to test if everything is ok, when adding a new namespace, with:

kube_env $YOUR-SERVICE-NAME staging-codfw
kubectl get ns

When literally doing that I run into:

Error from server (Forbidden): namespaces is forbidden: User "image-suggestion" cannot list resource "namespaces" in API group "" at the cluster scope

I _know_ this happened to me back when I added miscweb and it was not actually an issue, but can you remind me what the correct fix to the docs is?

This is bad docs. The deployer Kubernetes accounts don't have (and don't need) permission to "get ns", that's why you get this error.
I've changed the docs to:

kube_env admin staging-codfw
kubectl describe ns $YOUR-SERVICE-NAME

And could you please check if everything looks ok to you and optionally deploy both namespace additions to other (staging) cluster(s)?

Everything looks fine on staging-codfw. I've deployed to all other clusters as well.

Thank you both! I can confirm I see both new namespaces in all 4 envs/clusters.

@deploy1002:~# kube_env admin eqiad


@deploy1002:~# kubectl describe ns developer-portal
Name:         developer-portal
...
              net.beta.kubernetes.io/network-policy: {"ingress":{"isolation":"DefaultDeny"}}

Status:       Active

Resource Quotas
 Name:            quota-compute-resources
 Resource         Used  Hard
 --------         ---   ---
 limits.cpu       0     90
 limits.memory    0     100Gi
 requests.cpu     0     90
 requests.memory  0     100Gi

Resource Limits
 Type       Resource  Min    Max  Default Request  Default Limit  Max Limit/Request Ratio
 ----       --------  ---    ---  ---------------  -------------  -----------------------
 Container  memory    100Mi  3Gi  100Mi            100Mi          -
 Container  cpu       100m   8    100m             100m           -
 Pod        cpu       100m   9    -                -              -
 Pod        memory    100Mi  5Gi  -
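
The corrected check can be repeated across clusters. The following is a minimal sketch rather than the documented procedure: it assumes `kube_env` and `kubectl` behave as in the commands quoted above, the function name `check_namespace` is hypothetical, and the cluster names are whatever the caller passes in.

```shell
# Sketch only: verify that a namespace exists on each listed cluster.
# Assumes `kube_env` (the deploy-host helper used above) and `kubectl`
# are available in the current shell. `check_namespace` is a
# hypothetical name, not part of any existing tooling.
check_namespace() {
    ns="$1"
    shift
    missing=0
    for cluster in "$@"; do
        # Use the admin credentials: deployer accounts cannot read
        # namespace objects at the cluster scope.
        kube_env admin "$cluster"
        if kubectl describe ns "$ns" >/dev/null 2>&1; then
            echo "OK      $cluster: namespace $ns present"
        else
            echo "MISSING $cluster: namespace $ns not found"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every cluster had the namespace.
    [ "$missing" -eq 0 ]
}
```

For example, `check_namespace developer-portal staging-codfw eqiad` would print one line per cluster and return a non-zero status if any lookup fails.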

@bd808, just for greater visibility, as I said in https://gerrit.wikimedia.org/r/c/773994, you can proceed and self-merge https://gerrit.wikimedia.org/r/c/773994 and https://gerrit.wikimedia.org/r/c/773995 and do the first deploy of it. We are unfortunately short-handed right now, so we can't properly review the chart. The rest of the changes have been merged, so we shouldn't be a blocker for getting developer-portal deployed. However, don't hesitate to reach out if you hit any roadblocks or need any help!

@bd808, just for greater visibility, as I said in https://gerrit.wikimedia.org/r/c/773994, you can proceed and self-merge https://gerrit.wikimedia.org/r/c/773994 and https://gerrit.wikimedia.org/r/c/773995 and do the first deploy of it.

Thanks for the follow-up @akosiaris. I realized today that I have known for several weeks that the team behind this project has adjusted their target timelines, but I had not communicated that on this task. We are still hoping for a public launch before the end of June 2022, but we are no longer hoping to be deployed at the production URL before this week's hackathon. I will be trying to push forward in the weeks after the hackathon (May 23 to June 3, 2022) to achieve an initial deployment of the app.

Ah, cool! Thanks for the update!

Change 773994 merged by jenkins-bot:

[operations/deployment-charts@master] Add developer-portal chart

https://gerrit.wikimedia.org/r/773994

Change 773995 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d: add developer-portal

https://gerrit.wikimedia.org/r/773995

Deployment into the "staging" environment worked:

$ curl -I https://developer-portal.k8s-staging.discovery.wmnet:30443
HTTP/2 200
server: istio-envoy
date: Wed, 25 May 2022 20:33:45 GMT
content-type: text/html
content-length: 27263
last-modified: Tue, 24 May 2022 19:14:04 GMT
etag: "628d2e7c-6a7f"
permissions-policy: interest-cohort=()
content-security-policy: default-src 'self'; connect-src 'self' https://commons.wikimedia.org; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https://upload.wikimedia.org; frame-src 'none'; sandbox allow-forms allow-same-origin allow-scripts allow-top-navigation;
accept-ranges: bytes
x-envoy-upstream-service-time: 2
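
A response like the one above could also be checked mechanically, which is one possible starting point for the acceptance tests listed in the checklist. This is a sketch under assumptions: the function name `check_headers` is hypothetical, and the choice to require a 200 status and a content-security-policy header is only an illustration of what such a test might assert.

```shell
# Sketch only: read `curl -sI` output on stdin and check that the
# status line reports 200 and that a content-security-policy header
# is present. Header names are matched case-insensitively.
check_headers() {
    # Strip carriage returns so line-anchored patterns match.
    headers=$(tr -d '\r')
    printf '%s\n' "$headers" | head -n 1 |
        grep -Eq '^HTTP/[0-9.]+ 200( |$)' || {
        echo "unexpected status line" >&2
        return 1
    }
    printf '%s\n' "$headers" |
        grep -iq '^content-security-policy:' || {
        echo "missing content-security-policy header" >&2
        return 1
    }
}
```

Usage would look like `curl -sI https://developer-portal.k8s-staging.discovery.wmnet:30443 | check_headers`, with the exit status indicating success or failure.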

Deployment into the "codfw" cluster is failing with the pod in CrashLoopBackOff. The error is:

$ kubectl logs developer-portal-main-6df6f7c77-zjs26 developer-portal-main-tls-proxy
...
[2022-05-25 20:33:33.749][1][critical][main] [source/server/server.cc:113] error initializing configuration '/etc/envoy/envoy.yaml': Failed to load certificate chain from /etc/envoy/ssl/service.crt
[2022-05-25 20:33:33.749][1][info][main] [source/server/server.cc:815] exiting
Failed to load certificate chain from /etc/envoy/ssl/service.crt

The certificate there is a "snakeoil" placeholder:

$ kubectl describe configmap developer-portal-main-tls-proxy-certs
Name:         developer-portal-main-tls-proxy-certs
Namespace:    developer-portal
Labels:       app=developer-portal
              app.kubernetes.io/managed-by=Helm
              chart=developer-portal-0.0.1
              heritage=Helm
              release=main
Annotations:  meta.helm.sh/release-name: main
              meta.helm.sh/release-namespace: developer-portal

Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIFXzCCA0egAwIBAgIUQPBPwrOR622kzKx6kBuEsU5OxV0wDQYJKoZIhvcNAQEL
BQAwKzEpMCcGA1UEAwwgUHVwcGV0IENBOiBwYWxsYWRpdW0uZXFpYWQud21uZXQw
HhcNMTkxMTA0MTIwOTM4WhcNMjkxMTAxMTIwOTM4WjArMSkwJwYDVQQDDCBQdXBw
ZXQgQ0E6IHBhbGxhZGl1bS5lcWlhZC53bW5ldDCCAiIwDQYJKoZIhvcNAQEBBQAD
ggIPADCCAgoCggIBAMMi6NogAUaQaMhR7iQfPX1hQAzHUqnIeHxaPp/JVyPxTEBy
sjfmQsO3dkr/eRPaS+6VIljA7lc9lvbwVkUb3DUc/lmqrz0xipeWj0HvAG3Lt+Vw
rwgQkQrtwEntIyWhVC3sVrBaPMqz42ybh+QIWWdDBR1H3LpURi1Aidd+P6zCjPS/
9/wiujXNwrqe006JCv3M0kZqkz/0YqBxMQIn1mzw+xaZJ0pn7LJL3nq2iidS+zgO
zuXU/Sf7yZVG6xKn/AXBmgu1NEkYJyyBjyHIXI71AW49jOSTds9NZ6kBnJOboTZC
52Wg/1QjREjcDdXsApXDXCEtQZRCIEMK3BvRgeEcARargOgcqcPKy0QYk5Ch2BkZ
PkKDI6DO/7mmkFA0Xs7D7KVU+wyzdMxHya+l4vEblE66imSLhA4cSJea/AmaYRBe
SXVR0duNwvs8rq8kW3bB7lrqgd6D6pF9/OHwI3gPNDtc5Eq1tD/R1FK4VUIq1m24
8ib0abtirnROtMuS7GCdjDrLFJb3eO+fgkpAsW3Ga7taTBJ4AqbgoxB2SXuej0bz
Wfe2hk1tzJjY6qSlo/nbtRf+eHq75tqTK1ybcL5YUlLV2dCnUgho9porL+ms5+ay
b4T0DGRYg+xxAUali4eLmA2PETuZILPktJCQumhP7yCAerWfNjKgZXreAsldAgMB
AAGjezB5MA8GA1UdEwEB/wQFMAMBAf8wNwYJYIZIAYb4QgENBCoWKFB1cHBldCBS
dWJ5L09wZW5TU0wgSW50ZXJuYWwgQ2VydGlmaWNhdGUwDgYDVR0PAQH/BAQDAgEG
MB0GA1UdDgQWBBRZ5IYwfgKvDQCmdO2a9g4XrmyuujANBgkqhkiG9w0BAQsFAAOC
AgEAhx2QGcCOlGIRKWmnG0zbdpOVoy1L9Bjb3EuCkGWOue1cod2BINU+65PDmMMl
MTvoExKJI/fbs8ADGaVDAeyt2LHiOLbp8sRn6ThFmhnQN2uU61zvAwneVnCApDFO
0+gEok/mNtD4FLKP/4OhHfcSgmw/3M3I04Nrm3ssu37jCss7ZnZ5LrVZBzT41ulc
UZ1Y1JPSLFvdd8kA053oR3GDmchOIqWXkPBo6XjvE/dVGdoUSeWdNIAVmFvZTc1I
/KGhkw0ll3bNIHmWRWPjRR5QmHTmJTgoxIXWZcr2vRLh3Mjyq1mLw4YEjvYPLtIR
tBGswBpc7eY8exDDkA1tJhxKS3DA0JkGm2wbAfQU2vim54VQ09J/8wCiTsUxNT5U
E2UwAW+fbLghjItFULr7B09usEXo6Qoiq3QGsJal1ksfjIxA8l0GY7v8l4io2Hsa
nT6EssrHNxEEZQxY4tBp1c+qS8IG7ILyAAiwtLFRtjcp2rQRvZLDSZ7FJivrqOjY
h4us+rUVI/KJfaKHrh70Q5ufj+dOZFBmpLgupzxP1aWNRtFHNiJqYIVcAjvba3dv
SaEqoNHJ2+KytzdcT9HzY/ywvd0tUFBJCCtuGpwVtimHYXkInwFfP4zmFZmsETld
Jl3aYuLUirKWSp+dQm8ikFCJ2gGaB8WHQWzIswFEw08vpAM=
-----END CERTIFICATE-----
service.crt:
----
snakeoil
service.key:
----
snakeoil
Events:  <none>
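
A placeholder like this can be spotted before deploying. The following is a minimal sketch with a hypothetical function name, `looks_like_pem_cert`; it only checks for PEM framing and does not validate the certificate itself.

```shell
# Sketch only: succeed if the file contains PEM certificate framing,
# fail for placeholders such as the literal string "snakeoil".
# This does not parse or validate the certificate.
looks_like_pem_cert() {
    grep -qF -e '-----BEGIN CERTIFICATE-----' "$1" &&
        grep -qF -e '-----END CERTIFICATE-----' "$1"
}

# Hypothetical usage, assuming the ConfigMap value was first saved
# to a local file named service.crt:
#   looks_like_pem_cert service.crt || echo "service.crt is a placeholder"
```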

I believe this means that a production root needs to do the puppet/private.git steps from https://wikitech.wikimedia.org/wiki/Kubernetes/Enabling_TLS#Create_and_place_certificates to create a valid X509 cert for the alt_names:

  • developer-portal.discovery.wmnet
  • developer-portal.svc.codfw.wmnet
  • developer-portal.svc.eqiad.wmnet
  • developer.wikimedia.org
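
Once a certificate exists, the alt_names listed above could be confirmed with a small check. This is a sketch under assumptions: `has_alt_names` is a hypothetical helper, the file name `service.crt` is illustrative, and the input is expected to be the text printed by `openssl x509 -noout -ext subjectAltName` (available in OpenSSL 1.1.1 and later).

```shell
# Sketch only: read the text printed by
#   openssl x509 -in service.crt -noout -ext subjectAltName
# on stdin and confirm that every name given as an argument appears
# as a DNS entry. `has_alt_names` is a hypothetical helper.
has_alt_names() {
    # Split the comma-separated list and keep only the DNS names.
    sans=$(tr ',' '\n' | sed -n 's/^ *DNS://p')
    for name in "$@"; do
        printf '%s\n' "$sans" | grep -qxF -e "$name" || {
            echo "missing alt name: $name" >&2
            return 1
        }
    done
}
```

Usage would look like `openssl x509 -in service.crt -noout -ext subjectAltName | has_alt_names developer-portal.discovery.wmnet developer.wikimedia.org`.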

Followed the steps in https://wikitech.wikimedia.org/wiki/Kubernetes/Enabling_TLS#Create_and_place_certificates with @bking, @bd808, @BCornwall

(See the new commits in /srv/private, fa0a72296fb1391a2381c081077b09339a620d3d...1ab1d6a0833eb6fb785d489dc80cf0a673136767)

New files are showing up:

ryankemper@deploy1002:/etc/helmfile-defaults$ ls -lah  /etc/helmfile-defaults/private/main_services/developer-portal/
total 20K
drwxr-x---  2 mwdeploy deployment 4.0K May 25 22:43 .
drwxr-x--- 43 root     deployment 4.0K Apr 21 00:10 ..
-rw-r-----  1 mwdeploy deployment 2.0K May 25 22:43 codfw.yaml
-rw-r-----  1 mwdeploy deployment 2.0K May 25 22:43 eqiad.yaml
-rw-r-----  1 mwdeploy deployment 1.9K Apr 18 14:15 staging.yaml

Change 799427 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/dns@master] developer-portal: add service discovery records

https://gerrit.wikimedia.org/r/799427

Change 799429 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] developer-portal: add to service catalog

https://gerrit.wikimedia.org/r/799429

Change 799427 merged by Cathal Mooney:

[operations/dns@master] developer-portal: add service discovery records

https://gerrit.wikimedia.org/r/799427

Change 800181 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] developer-portal: add developer.wikimedia.org to CDN config

https://gerrit.wikimedia.org/r/800181

Change 799429 merged by Alexandros Kosiaris:

[operations/puppet@production] developer-portal: add to service catalog

https://gerrit.wikimedia.org/r/799429

Change 800181 merged by Vgutierrez:

[operations/puppet@production] developer-portal: add developer.wikimedia.org to CDN config

https://gerrit.wikimedia.org/r/800181

This has been deployed for some time, so I moved it to the Done column, but I see 2 remaining unchecked items in the Checklist section of the task:

  • Monitoring dashboard
  • Integration and Acceptance tests

@bd808, any news on those?

I'll take a look at what is possible on the monitoring side next week (I am at an offsite this week). There's not likely to be much telemetry to track yet. I will probably need to figure out some kind of Prometheus sidecar to go with the nginx container to tell us more. Tests are not likely at this time.

Change 891502 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] developer-portal: Switch state to production

https://gerrit.wikimedia.org/r/891502

Change 891502 merged by Alexandros Kosiaris:

[operations/puppet@production] developer-portal: Switch state to production

https://gerrit.wikimedia.org/r/891502

akosiaris claimed this task.

I am going to resolve this; apparently there isn't much telemetry to track yet, and tests aren't likely. Feel free to reopen.