The plan for deploying Sentry has two very different stages: first deploy it in a form that's sufficient for collecting UploadWizard errors (to unblock T91652); then eventually scale it up to collect all errors. There are many uncertainties about the second stage and it is only described here to give an overview and start a discussion; the actual hardware request is for the initial setup.
The initial setup is Sentry running on a low-end machine, behind an nginx firewall, with its own domain name, IP address and HTTPS support. The machine needs to run Sentry itself (a Django web app with its own Gunicorn web server and Celery workers), a PostgreSQL backend, and Redis. See the Dependencies and Hardware sections of the Sentry docs for details.
As far as CPU, memory and disk are concerned, pretty much anything goes. UploadWizard currently gets about 4K requests a day, and the number of error reports is probably an order of magnitude lower than that. According to the Sentry docs, 4K daily log events with three months of log retention would take only ~1GB of space; the current Labs instance (running on an m1.small VM) uses around 700MB of memory and less than 0.5% CPU. Really, the only reason for not doing this on Labs is that it involves private data.
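As a rough cross-check of the disk estimate, the getsentry.com figures quoted further down (about 4 million events a day, mostly kept for 90 days, around 1.5TB) imply roughly 4KB per stored event. The sketch below is only back-of-envelope arithmetic: it assumes decimal units and that our error reports are about the same size as getsentry.com's events, neither of which has been measured. The function names are made up for illustration.

```python
# Back-of-envelope disk estimate. All inputs are rough published figures,
# not measurements from our own setup.
GETSENTRY_EVENTS_PER_DAY = 4_000_000
GETSENTRY_RETENTION_DAYS = 90
GETSENTRY_DISK_BYTES = 1.5e12  # ~1.5TB, assuming decimal units


def bytes_per_event():
    """Average stored size of one event implied by the getsentry.com figures."""
    stored_events = GETSENTRY_EVENTS_PER_DAY * GETSENTRY_RETENTION_DAYS
    return GETSENTRY_DISK_BYTES / stored_events


def estimated_disk_bytes(events_per_day, retention_days=90):
    """Disk needed if our events are the same size (an assumption)."""
    return events_per_day * retention_days * bytes_per_event()


# Upper bound: every UploadWizard request produces an error report.
upper_gb = estimated_disk_bytes(4_000) / 1e9
# More likely case: an order of magnitude fewer reports than requests.
likely_gb = estimated_disk_bytes(400) / 1e9
print(f"upper bound: {upper_gb:.2f} GB, likely: {likely_gb:.2f} GB")
```

With these inputs the upper bound comes out at about 1.5GB, which is in the same range as the ~1GB figure above.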
So, as far as performance is concerned, Sentry could easily be installed on an existing machine, but it has a big security footprint (lots of Python packages that either have no Debian package or do not have the right version packaged), plus we might need low-level access for debugging, so a separate box is probably better from a permission management perspective.
Per the squid reports, we have about 1B views a day, but on normal days only a fraction of those should result in an error report. For abnormal days, the plan is to send the error reports through some gateway (possibly varnishkafka) that can be used for sampling/throttling, so that the Sentry machine is not hit with thousands of reports per second. From the Sentry docs:
At a point, getsentry.com was processing approximately 4 million events a day. A majority of this data is stored for 90 days, which accounted for around 1.5TB of SSDs. Web and worker nodes were commodity (8GB-12GB RAM, cheap SATA drives, 8 cores), the only two additional nodes were a dedicated RabbitMQ and Postgres instance (both on SSDs, 12GB-24GB of memory). In theory, given a single high-memory machine, with 16+ cores, and SSDs, you could handle the entirety of the given data set.
I'll count errors in production in the next few weeks to get a better idea of scale, but this seems like a good initial ballpark estimate.
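To make the throttling idea for abnormal days a bit more concrete, here is a minimal token-bucket sketch of what the gateway could do before forwarding reports to Sentry. This is purely illustrative: it is not how varnishkafka works, the class and parameter names are hypothetical, and the rate and burst values are placeholders until we have real error counts.

```python
import time


class TokenBucket:
    """Allow at most `rate_per_sec` reports per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock          # injectable so the logic can be tested
        self.last = clock()

    def allow(self):
        """Return True if a report may be forwarded, False if it should be dropped."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Placeholder limits: forward at most 50 reports/second, bursts of up to 100.
bucket = TokenBucket(rate_per_sec=50, burst=100)
forwarded = sum(1 for _ in range(1000) if bucket.allow())
```

Dropped reports could still be counted on the gateway, so that the overall error rate remains visible even when Sentry only sees a sample.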