Change Details

Over the last few weeks I have been discussing the future deployment of some static sites around the Wikidata query service with the discovery team and the service ops team.

Sites of interest:
- **Wikidata query service UI** (currently deployed on wdqsxxxx hosts)
- **Wikidata query service query builder** (currently under development)
- Potentially everything else currently covered by https://wikitech.wikimedia.org/wiki/Microsites

Status quo:
- The query service UI is deployed on wdqsxxxx hosts, which makes deployments harder than they need to be. The deployment process is primarily controlled by the discovery team.
- The new query builder needs somewhere to be deployed.
- The discovery team would rather not deploy more things to the wdqsxxxx hosts.

### Discussions

- Initial discussion about the idea of static sites on k8s: https://wm-bot.wmflabs.org/browser/index.php?start=09%2F24%2F2020&end=09%2F25%2F2020&display=%23wikimedia-serviceops
- Discussion a short while after ticket creation: https://wm-bot.wmflabs.org/browser/index.php?start=10%2F14%2F2020&end=10%2F14%2F2020&display=%23wikimedia-serviceops

### Requirements

* Teams that manage / own the sites should be able to update the content of the site.
* The hosting location can be pointed to from sub paths of query.wikidata.org (and similar flexible locations). For WDQS this could be done in the WDQS nginx server config.
* Does CDN cache purging need to be considered at all? (TODO: how is this currently done with the existing microsites infra?)

>>! In T264710#6532140, @akosiaris wrote:
> * **Support for structured logging to stdout** to allow debugging issues via our ELK stack should be a requirement.
> * **Support for exporting metrics via prometheus or statsd should be a requirement**. This should allow debugging issues, establishing SLIs and SLOs and allowing to come up with a level of support and ownership of powered services. Failing that, it will be impossible to come up with a level of support and will make those static sites a best effort.

### Ideas

# One "service" per static site
# One "service" to rule them all (hosting all static sites)

Static sites could be hosted using a node service and service-runner to enable integration with logging and metrics.

#### 1) One "service" per static site

A base image would be created with nginx and the things needed for structured logging and exporting metrics to prometheus. On the whole this may use more compute resources, but it would be less complex for the humans involved. It would also result in more services running on the k8s infra in general.

Pros:
- Ultimate control over the service for the team that owns the site

Cons:
- More compute resources used (but probably fine in relation to the reduced human time needed)

Reading:
- JSON logging from nginx: https://medium.com/bolt-labs/using-json-for-nginx-log-format-793743064fc4
- Very basic prometheus exporter (only what is on the status page): https://github.com/nginxinc/nginx-prometheus-exporter
- More advanced metric exporter from logs (but requires writing logs?): https://blog.ruanbekker.com/blog/2020/04/25/nginx-metrics-on-prometheus-with-the-nginx-log-exporter/
- Probably the best solution for metrics? https://github.com/knyar/nginx-lua-prometheus

#### 2) One "service" to rule them all (swift)

A single image and service would be developed to host "all" static sites. It would likely serve content from Swift. Content in Swift would need to be updated whenever a site is updated (this could be done by the CI/CD pipeline, TO BE CONFIRMED). The service would either serve directly from Swift, or perhaps keep a short-lived local on-disk cache (THOUGHTS?). service-runner could be used to provide structured logging and other metrics (as this service may have some extra complexity).

Pros:
- Fewer services and fewer compute resources used

Cons:
- Single point of failure for all static sites (if one gets heavy load, the rest may fail?)
- More complex service with more moving parts

#### 3) One "service" to rule them all (built)

A single image and service would be developed to host "all" static sites. Every static site would live in its own git repo. When a new tag is pushed to one of those repos, a new build of the main static site service would be triggered, using the latest tag from each repo that the service is serving. The static site service would then be updated to use the latest image.

Pros:
- Fewer services and fewer compute resources used

Cons:
- Single point of failure for all static sites (if one gets heavy load, the rest may fail?)
- Image size will keep growing (on disk and in layers), and build times will get longer
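
To make the two quoted requirements (structured logging to stdout, metrics export for prometheus) a bit more concrete, here is a minimal sketch of what any of the options above would need to emit per request. This is not the nginx or service-runner setup discussed in this task; it is plain stdlib Python used only to show the shape of the output. The metric name `static_site_requests_total`, the log field names and the `handle` helper are all assumptions made up for illustration.

```python
import json
import time
from collections import Counter


def access_log_line(site, method, path, status, duration_s, now=None):
    """Return one access-log record as a single JSON line."""
    record = {
        # now=None means "current time"; tests can pass a fixed epoch value.
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now)),
        "site": site,
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": round(duration_s * 1000, 3),
    }
    return json.dumps(record, sort_keys=True)


class RequestMetrics:
    """Counts requests per (site, status) and renders Prometheus text format."""

    def __init__(self):
        self._requests = Counter()

    def observe(self, site, status):
        self._requests[(site, str(status))] += 1

    def render(self):
        # Hypothetical metric name; a real service would follow local naming rules.
        lines = [
            "# HELP static_site_requests_total Requests served, by site and status.",
            "# TYPE static_site_requests_total counter",
        ]
        for (site, status), count in sorted(self._requests.items()):
            lines.append(
                f'static_site_requests_total{{site="{site}",status="{status}"}} {count}'
            )
        return "\n".join(lines) + "\n"


def handle(metrics, site, method, path, status, duration_s):
    """Record one served request: print a JSON line to stdout, bump the counter."""
    line = access_log_line(site, method, path, status, duration_s)
    print(line)
    metrics.observe(site, status)
    return line
```

With one service per site (option 1) the `site` label would be constant per deployment; with a shared service (options 2 and 3) it is what would let per-site SLIs be separated.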
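
For the "short-lived local on-disk cache" question in option 2, a rough sketch of the idea follows. It assumes a `fetch` callable that returns the bytes for an object path; in practice that would be a GET against Swift, but no Swift client is used here and the class name, the TTL default and the file naming are all hypothetical.

```python
import hashlib
import os
import tempfile
import time


class DiskCache:
    """Serve objects from local disk for ttl_s seconds before refetching."""

    def __init__(self, fetch, cache_dir=None, ttl_s=60.0, clock=time.time):
        self._fetch = fetch  # hypothetical: callable(path) -> bytes, e.g. a Swift GET
        self._dir = cache_dir or tempfile.mkdtemp(prefix="static-site-cache-")
        self._ttl = ttl_s
        self._clock = clock
        self._fetched_at = {}  # object path -> time of last successful fetch

    def _local_path(self, path):
        # Hash the object path so any path maps to a safe, flat file name.
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        return os.path.join(self._dir, digest)

    def get(self, path):
        now = self._clock()
        local = self._local_path(path)
        fetched = self._fetched_at.get(path)
        fresh = fetched is not None and now - fetched < self._ttl
        if fresh and os.path.exists(local):
            with open(local, "rb") as fh:
                return fh.read()
        # Missing or expired: go back to the backing store and rewrite the file.
        body = self._fetch(path)
        with open(local, "wb") as fh:
            fh.write(body)
        self._fetched_at[path] = now
        return body
```

A short TTL keeps the window between a content update in Swift and it being served small, at the cost of more backend requests. Whether that trade-off is acceptable is exactly the open THOUGHTS? question above.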
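
The "latest tag from each repo" step in option 3 could look roughly like the sketch below. It assumes tags look like `v1.2.3` (the task does not say what the tagging scheme would be) and that something else has already listed the tags per repo; `latest_tag` and `build_manifest` are hypothetical helper names.

```python
import re

# Assumption: release tags are vMAJOR.MINOR.PATCH (the leading "v" is optional).
_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def latest_tag(tags):
    """Return the highest release tag, or None if no tag matches the scheme."""
    versioned = []
    for tag in tags:
        match = _TAG.match(tag)
        if match:
            # Compare numerically so that v1.10.0 sorts after v1.9.3.
            versioned.append((tuple(int(part) for part in match.groups()), tag))
    return max(versioned)[1] if versioned else None


def build_manifest(repo_tags):
    """Map each site to the tag the combined image build would check out."""
    manifest = {}
    for site, tags in sorted(repo_tags.items()):
        tag = latest_tag(tags)
        if tag is None:
            raise ValueError(f"no release tag found for {site}")
        manifest[site] = tag
    return manifest
```

Note that a new tag in any one repo changes the manifest and so rebuilds the image for every site, which is where the growing image size and build time listed under Cons come from.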