
Host static sites on kubernetes
Open, Needs Triage, Public

Description

Over the last weeks I have been having discussions with the discovery team and the service ops team about the future deployment of some static sites around the wikidata query service world.

Sites of interest:

Status Quo:

  • The fact that the query service UI is deployed on wdqsxxxx hosts makes it harder than it could be to perform deployments. The deployment process is primarily controlled by the discovery team.
  • The new query builder needs somewhere to be deployed
  • The discovery team would rather not deploy more things to the wdqsxxxx hosts

Discussions

Requirements

  • Teams that manage / own the sites should be able to update the content of the site
  • The hosting location can be pointed to from sub paths of query.wikidata.org (and similar flexible locations). For WDQS this could be done in the WDQS nginx server config
  • Does CDN cache purging need to be considered at all? (TODO: how is this currently done with the existing microsites infra?)
  • Support for structured logging to stdout to allow debugging issues via our ELK stack should be a requirement.
  • Support for exporting metrics via prometheus or statsd should be a requirement. This should allow debugging issues, establishing SLIs and SLOs, and coming up with a level of support and ownership for the powered services. Failing that, it will be impossible to come up with a level of support, and those static sites will be best effort.
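
As a rough illustration of the last two requirements, the sketch below writes one JSON object per request to stdout and renders a request counter in the Prometheus text exposition format. It is only a sketch under assumptions: the metric name `static_site_requests_total`, the field names and the helper functions are hypothetical and not taken from any existing WMF service, and label values are not escaped.

```python
import json
import sys
import time
from collections import Counter

# Hypothetical in-process request counter, keyed by (site, status).
REQUESTS = Counter()


def log_request(site, path, status, stream=sys.stdout):
    """Write one JSON object per line to the stream and count the request."""
    REQUESTS[(site, status)] += 1
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "site": site,
        "path": path,
        "status": status,
    }
    line = json.dumps(record, sort_keys=True)
    stream.write(line + "\n")
    return line


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP static_site_requests_total Requests served per site and status.",
        "# TYPE static_site_requests_total counter",
    ]
    for (site, status), count in sorted(REQUESTS.items()):
        # Assumption: site names contain no characters that need escaping.
        lines.append(
            'static_site_requests_total{site="%s",status="%d"} %d'
            % (site, status, count)
        )
    return "\n".join(lines) + "\n"
```

A service would call `log_request` once per response and return `render_metrics()` from a `/metrics` endpoint. A real deployment would more likely rely on an existing logging and metrics library than on hand-rolled formatting.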

Ideas

  1. One "service" per static site
  2. One "service" to rule them all (hosting all static sites)

Static sites could be hosted using a node service and service-runner to enable integration with logging and metrics.

1) One "service" per static site

A base image would be created with nginx and the things needed for structured logging and exporting metrics to prometheus.
On the whole this may use more compute resources, but it would be less complex for the humans involved.
This would also result in more services running in general on the k8s infra.

Pros:

  • Full control over the service for the team that owns the site

Cons:

  • More compute resources used (but probably fine in relation to the reduced human time needed)

Reading:

2) One "service" to rule them all (swift)

A single image and service would be developed to host "all" static sites.
This would likely serve content from "swift".
Content in swift would need to be updated whenever a site is updated. (This could be done by the CI/CD pipeline; TO BE CONFIRMED.)
The service would either serve directly from swift, or perhaps have a short-lived local on-disk cache (THOUGHTS?).
service-runner could be used to provide structured logging and other metrics (as this service may have some extra complexity).
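
To make the "short-lived local on-disk cache" idea more concrete, here is a minimal sketch. It assumes a hypothetical `fetch` callable that returns the bytes for a key (for example a wrapper around whatever swift client ends up being used); the class name, the TTL default and the file naming scheme are illustrative choices, not an agreed design.

```python
import hashlib
import os
import tempfile
import time


class DiskCache:
    """Short-lived on-disk cache in front of a slower object store."""

    def __init__(self, fetch, cache_dir, ttl_seconds=60.0):
        self.fetch = fetch  # hypothetical callable: key -> bytes
        self.cache_dir = cache_dir
        self.ttl_seconds = ttl_seconds

    def _path(self, key):
        # Hash the key so that any object name maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def get(self, key):
        path = self._path(key)
        try:
            age = time.time() - os.path.getmtime(path)
            if age < self.ttl_seconds:
                with open(path, "rb") as handle:
                    return handle.read()
        except FileNotFoundError:
            pass  # not cached yet, fall through to the backend
        data = self.fetch(key)
        # Write to a temporary file and rename it into place, so that
        # concurrent readers never observe a partially written file.
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp, path)
        return data
```

With a TTL of around a minute, a content update in the backing store would become visible within that window without any explicit purge, at the cost of one backend fetch per object per TTL period and per pod.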

Pros:

  • Fewer services and fewer compute resources used

Cons:

  • Single point of failure for all static sites (if one site gets heavy load, the rest may fail?)
  • More complex service with more moving parts

3) One "service" to rule them all (built)

A single image and service would be developed to host "all" static sites.
Every static site would be in its own git repo OR be built using blubber into its own image.
When a new tag is merged for the git repo or a new image is built, a new build of the main static site service would also be triggered, using the latest build of the individual site.
This static site service would then be updated to use the latest image.
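
The assembly step of such a build could look roughly like the sketch below, which copies each site's build output into its own subdirectory of a shared document root. The function name and the directory layout are assumptions made for illustration; the real pipeline might equally copy files between image layers instead.

```python
import shutil
from pathlib import Path


def assemble_docroot(site_builds, docroot):
    """Copy each site's build output into docroot/<site name>/.

    site_builds maps a site name to the directory holding its built files.
    Returns the sorted names of all site directories present afterwards.
    """
    docroot = Path(docroot)
    docroot.mkdir(parents=True, exist_ok=True)
    for name, build_dir in sorted(site_builds.items()):
        target = docroot / name
        # Replace any previous copy so that deleted files do not linger.
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(build_dir, target)
    return sorted(p.name for p in docroot.iterdir() if p.is_dir())
```

Rebuilding only the site that changed keeps the other sites' directories untouched, which matches the idea of triggering the combined build from each individual site's pipeline.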

As Joe said, we are probably just some pipeline / jenkins triggers and bash scripts away from having this image built.

Pros:

  • Fewer services and fewer compute resources used
  • Individual sites could still create their own blubber files and own docker images (even though they are deployed together in wmf k8s infra)

Cons:

  • Single point of failure for all static sites (if one site gets heavy load, the rest may fail?)
  • The image will keep growing in size on disk and in number of layers, and build times will get longer

Event Timeline

Addshore created this task. Tue, Oct 6, 8:21 AM
Restricted Application added a subscriber: Aklapper. Tue, Oct 6, 8:22 AM
Addshore updated the task description. Tue, Oct 6, 8:27 AM
Tarrow added a subscriber: Tarrow. Tue, Oct 6, 8:44 AM

One of the benefits of hosting a site (e.g. the Wikidata query service query builder) on k8s might be the CI build step that could be provided by pipelinelib/blubber.

In this case a single generic service (all using the same container images) would mean we miss out on this benefit.

A couple of requirements from my side, regardless of where those sites are deployed and the technology used:

  • Support for structured logging to stdout to allow debugging issues via our ELK stack should be a requirement.
  • Support for exporting metrics via prometheus or statsd should be a requirement. This should allow debugging issues, establishing SLIs and SLOs, and coming up with a level of support and ownership for the powered services. Failing that, it will be impossible to come up with a level of support, and those static sites will be best effort.
Dzahn added a subscriber: Dzahn. Tue, Oct 13, 4:10 PM
CDanis added a subscriber: CDanis. Wed, Oct 14, 3:27 PM

While we are generally interested in moving all static sites to kubernetes at some point in the future, we are not there yet, primarily because we don't have an ingress yet.

Until that has changed, we would like to offer the following fix in the meantime:

The wikidata-query-service UI / query builder move to the current setup for microsites. They would be hosted on miscweb* servers alongside other existing microsites.

We would add the puppet code needed for that, and puppet would pull content from deployment repos on Gerrit, as we do with other microsites.

Once set up, some people would get +2 on the deploy repo. This would also satisfy the requirement that you can deploy content yourself without having to wait for other teams, while needing a lot less engineering effort.

Regarding CDN cache purging, the same applies as for varnish/ATS on wiki sites: see https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Forcing_a_cache_miss_(similar_to_ban) if really needed, though it is almost never needed for microsites.

It is also possible to configure a specific site as just "passthrough" without actual caching.

I would be happy to set up the puppet part for you. In the future we would then revisit this to move all static sites at once.

For what it's worth, the idea that Daniel explains above would solve the issue for now without the need to move to kubernetes, satisfying several of the requirements without significant effort.

The following from the task description are satisfied:

  • Teams that manage / own the sites should be able to update the content of the site
  • The hosting location can be pointed to from sub paths of query.wikidata.org (and similar flexible locations). For WDQS this could be done in the WDQS nginx server config
  • Does CDN cache purging need to be considered at all? (Setting the correct Cache-Control HTTP header in the apache config would solve this.)
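
To show what relying on the Cache-Control header means in practice, the sketch below serves a directory of static files and attaches the header to every response. It is an illustration only: in the proposal above the header would be set in the apache config, and the `max-age=300` value, the class name and the `make_server` helper are assumptions.

```python
import functools
import http.server


class CachingHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that adds a Cache-Control header to every response."""

    # Assumed value: allow caches to keep responses for five minutes.
    cache_control = "public, max-age=300"

    def end_headers(self):
        self.send_header("Cache-Control", self.cache_control)
        super().end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(directory, port=0):
    """Build a server for `directory`; port 0 asks the OS for a free port."""
    handler = functools.partial(CachingHandler, directory=directory)
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

With a short max-age, the CDN refetches content on its own after a deployment, which is why explicit purging would rarely be needed for sites like these.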

The following aren't, but were marked by yours truly as SHOULD, not MUST to begin with. In the interest of moving forward and providing a solution I think it's ok.

  • Support for structured logging to stdout to allow debugging issues via our ELK stack should be a requirement.
  • Support for exporting metrics via prometheus or statsd should be a requirement.