
Evaluate miscweb bugzilla architecture
Closed, ResolvedPublicSecurity

Description

bugzilla is one of the miscweb services serving https://static-bugzilla.wikimedia.org.

The static data of our old bugzilla instance is stored in gzip archives and extracted on-the-fly by Apache when served. This was done to decrease the image size and reduce problems with long image pulls.
It seems the current architecture needs a lot of computing resources, and increased traffic can disrupt the service and potentially surrounding services on the same node (restbase issues?). A page revealed performance issues when ~20 rps hit the service in codfw: miscweb bugzilla became unavailable and was throttled due to excessive CPU load.

We should:

  • re-evaluate the usage of gzipped content served by Apache
  • do load-tests (a minimal sketch follows below)
  • remove paging from bugzilla
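A minimal load-test sketch, assuming ApacheBench (apache2-utils) is available; the page path below is illustrative, not a known bug URL:

# ~20 concurrent clients, roughly the traffic level that triggered the page;
# repeated requests for a single page may be absorbed by the edge cache, so
# this only approximates load on the origin
ab -n 1000 -c 20 -H 'Accept-Encoding: gzip, deflate, br' \
  https://static-bugzilla.wikimedia.org/bug12345.html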

https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=codfw%20prometheus%2Fk8s&var-namespace=All&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=1691596283746&to=1691600877841

https://grafana.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?forceLogin&from=1691596802613&orgId=1&to=1691600039056&var-container=All&var-dc=thanos&var-ignore_container_regex=&var-prometheus=k8s&var-service=miscweb&var-site=codfw&var-sum_by=container&var-sum_by=pod

https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=codfw%20prometheus%2Fops&var-cluster=text&var-origin=miscweb.discovery.wmnet&from=1691596548210&to=1691599756862

Event Timeline

Jelto triaged this task as High priority.Aug 9 2023, 5:12 PM
Jelto added a project: collaboration-services.
Jelto updated the task description.
Jelto added subscribers: elukey, jcrespo, LSobanski.

Why are we compressing it and decompressing-on-the-fly? It's just static HTML, surely we have enough disk space for that...

and potentially surrounding services on the same node (restbase issues?)

Please note that, at least at the time, the restbase issues were an unfortunate coincidence: a migration happening at the same time caused a spike in errors, but in the end they were unrelated.

https://logstash.wikimedia.org/goto/80b55aef15feb34bf12a2b00228bea0e vs https://logstash.wikimedia.org/goto/c2cfdf3274451444ace863ce6741583f

Why are we compressing it and decompressing-on-the-fly? It's just static HTML, surely we have enough disk space for that...

The compression was originally introduced to reduce the size of the docker image. In the past the Kubernetes nodes in wikikube staging had HDD disks and there were issues with long image pull times.

I'll investigate whether it's possible to disable the decompression or somehow limit it for clients that don't support gzip content.

Currently all bugzilla pages are stored in around 150k separate gz archives. They have a total size of ~1.2GB on disk and ~1.5GB in the docker image. As a test I extracted all archives; uncompressed, they need 3.6GB.
3.6GB should be a reasonable size if we decide to serve all content uncompressed.
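For reference, a rough sketch of how that size comparison can be reproduced, assuming the static dump sits in ./static (the path is illustrative):

du -sh ./static                   # ~1.2GB of .gz archives on disk
cp -a ./static /tmp/static-plain  # work on a copy
gunzip -r /tmp/static-plain       # decompress everything recursively, in place
du -sh /tmp/static-plain          # ~3.6GB uncompressed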

I'll do some more research to see if we can tweak the apache settings to make sure we use gzip content all the time.

Note: the increased number of replicas was reverted with the last deployment of miscweb in codfw. I left it at the default (two replicas) for now.

To re-apply the increased replicas, run:

kubectl scale --replicas=8 deployment/miscweb-bugzilla

As mentioned in T300171#9117180, I refactored the future bugzilla docker image, which is built on GitLab (see https://gitlab.wikimedia.org/repos/sre/miscweb/bugzilla).

I've done some benchmarks locally and there is a significant performance difference between serving gzipped content versus uncompressed content with the existing image (6% CPU usage for uncompressed content and 90% CPU usage for compressed content). So I uncompressed all HTML files and reconfigured Apache to serve uncompressed content only. The total image size went up from 1.5GB to 3.6GB. That should be fine as all Kubernetes nodes are using SSDs now.
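For context, a hypothetical reconstruction of such a local benchmark; the image tags, port and page path are made up for illustration:

for tag in gzip-archives plain-html; do
  docker run --rm -d -p 8080:80 --name bugzilla-bench "bugzilla-static:${tag}"
  sleep 2                                    # give httpd a moment to start
  ab -n 2000 -c 10 http://127.0.0.1:8080/bug12345.html &  # generate load
  docker stats --no-stream bugzilla-bench    # CPU snapshot while ab is running
  wait                                       # let ab finish
  docker stop bugzilla-bench
done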

If future tests are looking good I'll switch the old bugzilla image from Gerrit to the new one built on GitLab (https://docker-registry.wikimedia.org/repos/sre/miscweb/bugzilla/tags/).

I deployed the new miscweb bugzilla image from GitLab, which includes the refactored storage and serving of uncompressed HTML files. It seems the problem just changed: using curl causes little CPU usage, while visiting bugzilla in a browser causes significant throttling. This is mostly the opposite of the initially suspected problem.

Even when using curl with headers like Accept-Encoding: gzip, deflate, br (which causes Content-Encoding: gzip in the response) I'm not able to generate the same load as by browsing a few tasks manually in Firefox. Firefox also sends Accept-Encoding: gzip, deflate, br. For the curl experiments I used roughly 1 rps and unique, uncached pages.
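Roughly, the curl experiments looked like this (a reconstruction; the bug IDs and URL pattern are illustrative, not the exact ones used):

# ~1 request per second against unique, uncached pages, advertising gzip support
for id in $(shuf -i 1-150000 -n 60); do
  curl -s -o /dev/null \
    -H 'Accept-Encoding: gzip, deflate, br' \
    -w '%{http_code} %{size_download} %{time_total}\n' \
    "https://static-bugzilla.wikimedia.org/bug${id}.html"
  sleep 1
done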

I'll create a new image where the HTML files are compressed again (like before). I'll try to confirm whether this is a problem related to compression or something related to my browser and the Apache config. There are quite a lot of 404s for missing JavaScript and CSS files; maybe it's also an issue that there are 150k files in the same folder. I've also done some tests with gzip disabled in Firefox and still saw quite a lot of throttling.

Sorry to chime in: one of the first options I thought of is to always ignore requests for uncompressed content and send the compressed files as-is (there is an Apache directive for that). This would be an unreasonable option for our general Apache servers, but maybe it is an option for bugzilla? Let me know if that sounds like an interesting option.
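For reference, a minimal sketch of that idea, assuming the pages are stored as pre-compressed .html.gz files next to their logical paths; the docroot and config path are illustrative, not the real vhost:

cat <<'EOF' >> /etc/apache2/sites-available/static-bugzilla.conf
<Directory /srv/static-bugzilla>
    # serve bugNNN.html.gz for a request to /bugNNN.html and label it as
    # gzip-encoded, without looking at the client's Accept-Encoding header
    AddEncoding gzip .gz
    RewriteEngine On
    RewriteCond %{REQUEST_FILENAME}.gz -f
    RewriteRule ^(.*)$ $1.gz [L]
</Directory>
EOF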

Some observations from using gzipped content again:

Behavior is similar. Throttling occurs mostly when a browser is used; curling against the uncached Kubernetes ingress or the text-lb service shows little impact on load, independent of gzip headers. However, with gzip enabled again, base latency went up from 2ms to 90ms: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=miscweb&var-destination=local_service&from=1693195424677&to=1693226259647&viewPanel=6.

So I suspect something related to fetching the additional resources (like JavaScript and CSS), which curl does not do but the browser does. There are 150k files in the directory, so maybe it's related to that. There are also multiple 404s because these files are not included in the static dump. In the Grafana dashboard above you can also see latency peaks of 3s and more when I visit bugzilla in the browser. Here you can also see that elevated 404s (when using a browser) correlate with high envoy latency:
https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=miscweb&var-destination=local_service&from=1693195424677&to=1693226259647&viewPanel=6

During the last incident you can also see elevated 404s before the service was throttled and started returning 504s:
https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=miscweb&var-destination=local_service&from=1691595576155&to=1691601518213

Sorry to chime in: one of the first options I thought of is to always ignore requests for uncompressed content and send the compressed files as-is (there is an Apache directive for that). This would be an unreasonable option for our general Apache servers, but maybe it is an option for bugzilla? Let me know if that sounds like an interesting option.

Thanks! I also thought about that, and it's good to know it's possible. But given my latest troubleshooting, I'm not sure the problem is related to gzipped content.

I think I'll make one last attempt and remove references to the missing JavaScript and CSS files to reduce the number of 404s. If that doesn't work, I'm not really sure how to proceed here. I only picked this task up because we wanted to migrate the repo to GitLab and it paged once in the past. It really doesn't matter whether a request to bugzilla takes 2ms, 90ms or 3s, as this is just a static mirror meant to keep old links and conversations from breaking. So we could also just increase the timeout and leave it running with some throttling.

Jelto closed this task as Resolved.EditedAug 29 2023, 10:45 AM

In https://gitlab.wikimedia.org/repos/sre/miscweb/bugzilla/-/commit/1d0b575e5119c293305c62185615331ca7848e38 I removed all references to unavailable resources like CSS, JavaScript and images. I used the following sed commands:

for f in *.html ; do sed -i '/yahoo-dom-event/d' "$f"; done
for f in *.html ; do sed -i '/js\/yui/d' "$f"; done
for f in *.html ; do sed -i '/js\/util.js/d' "$f"; done
for f in *.html ; do sed -i '/images\/favicon.ico/d' "$f"; done
for f in *.html ; do sed -i '/js\/comments.js/d' "$f"; done
for f in *.html ; do sed -i '/extensions\/Voting\/web\/style.css/d' "$f"; done

I also added a favicon. With this config there are no more 404s when visiting bugzilla.
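A quick sanity check, run from the directory holding the extracted HTML files (path assumed), that no references to the removed resources remain:

grep -rlE 'yahoo-dom-event|js/yui|js/util\.js|images/favicon\.ico|js/comments\.js|extensions/Voting' . | wc -l   # should print 0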

I deployed the new version to wikikube and all throttling is gone. Latency (envoy metrics) is back to ~3ms, and the peaks of 3s+ are gone as well. Subjectively the service also feels snappier, as there are no more 3s delays when loading some pages.

So I think this was more of a filesystem issue, with Apache trying to look up missing files in a directory containing 150k files. Removing all of those references fixed it, so I'm closing the task as resolved.

@sbassett feel free to remove the security ACL from this task, as the issue is fixed and it can be public now.

sbassett changed Author Affiliation from N/A to WMF Technology Dept.Aug 30 2023, 2:32 PM
sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett changed the edit policy from "Custom Policy" to "All Users".
sbassett changed Risk Rating from N/A to Low.

Why are we compressing it and decompressing-on-the-fly? It's just static HTML, surely we have enough disk space for that...

The docker image became too large to be acceptable for the registry / to be on k8s at the time.

all throttling is gone. Latency (envoy metrics) is back to ~3ms, and the peaks of 3s+ are gone as well.

Amazing! Thanks for all your work on this, Jelto!