
Procure hardware for Sentry
Closed, DeclinedPublic

Description

The server will be used to run Sentry, which will collect JavaScript (and possibly PHP and other kinds of) errors from production. This is intended to be a permanent service.

The plan for deploying Sentry has two very different stages: first deploy it in a form that's sufficient for collecting UploadWizard errors (to unblock T91652); then eventually scale it up to collect all errors. There are many uncertainties about the second stage and it is only described here to give an overview and start a discussion; the actual hardware request is for the initial setup.

initial setup

Sentry running on a low-end machine, behind an nginx reverse proxy, with its own domain name/IP/HTTPS support. The machine needs to run Sentry itself (a Django web app with its own Gunicorn web server and Celery worker threads), a PostgreSQL backend, and Redis. See the Dependencies and Hardware sections of the Sentry docs for details.
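To make the proxy layer concrete, here is a minimal sketch of what the nginx TLS-terminating reverse proxy in front of Sentry's Gunicorn process could look like. The hostname, certificate paths, and port are assumptions (9000 is Sentry's default web port), not the actual production config:

```nginx
# Hypothetical sketch only; server_name, cert paths and port are assumptions.
server {
    listen 443 ssl;
    server_name sentry.example.org;

    ssl_certificate     /etc/ssl/certs/sentry.example.org.pem;
    ssl_certificate_key /etc/ssl/private/sentry.example.org.key;

    location / {
        # Sentry's Gunicorn web process listens locally; nginx terminates TLS.
        proxy_pass http://127.0.0.1:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```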

If you want to get an idea of the setup, there is a test server in labs (not puppetized yet) and a MediaWiki-Vagrant puppet role.

As far as CPU/memory/disk go, pretty much anything works. UploadWizard currently gets about 4K requests/day, and the number of error reports is probably an order of magnitude lower than that. According to the Sentry docs, 4K daily log events with three months of log retention would only take ~1GB of space; the current Labs instance (running on an m1.small VM) uses around 700MB of memory and less than 0.5% CPU. Really the only reason for not doing this on Labs is that it involves private data.
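As a back-of-envelope check of the numbers above: the ~1GB / three-month figure from the Sentry docs implies a per-event footprint of roughly 3 KiB (an inference from the quoted totals, not a measured value):

```python
# Back-of-envelope check of the storage estimate quoted from the Sentry docs.
EVENTS_PER_DAY = 4_000        # upper bound: UploadWizard's daily request rate
RETENTION_DAYS = 90           # three months of log retention
TOTAL_BYTES = 1 * 1024**3     # ~1 GB, per the Sentry docs

events_retained = EVENTS_PER_DAY * RETENTION_DAYS
bytes_per_event = TOTAL_BYTES / events_retained

print(f"{events_retained:,} events retained")          # 360,000 events
print(f"~{bytes_per_event / 1024:.1f} KiB per event")  # a few KiB each
```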

So Sentry could be easily installed on an existing machine as far as performance is concerned, but it has a big security footprint (lots of Python packages that don't have a Debian package / don't have the right version packaged), plus we might need low-level access for debugging, so a separate box is probably better from a permission management perspective.

long-term setup

Per the squid reports, we have about 1B views a day, but on normal days only a fraction of those should result in an error report. For abnormal days, the plan is to send the error reports through some gateway (possibly varnishkafka) that can be used for sampling/throttling, so the Sentry machine is not hit with thousands of reports per second. From the Sentry docs:
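The sampling/throttling behaviour described above can be sketched as a simple token bucket: pass events through up to a sustained rate, and fall back to random sampling once the budget is exhausted. This is purely illustrative (the actual plan mentions varnishkafka; this is not its API), and the class name and parameters are made up:

```python
import random
import time

class ErrorReportThrottle:
    """Hypothetical sketch of a sampling/throttling gateway: admit up to
    `rate` events per second, then keep only a random sample."""

    def __init__(self, rate, sample_when_over=0.01):
        self.rate = rate                          # sustained events/sec allowed
        self.sample_when_over = sample_when_over  # fraction kept when over budget
        self.allowance = float(rate)              # token bucket, starts full
        self.last_check = time.monotonic()

    def accept(self, _event=None):
        now = time.monotonic()
        # Refill the bucket proportionally to elapsed time, capped at `rate`.
        self.allowance = min(self.rate,
                             self.allowance + (now - self.last_check) * self.rate)
        self.last_check = now
        if self.allowance >= 1.0:
            self.allowance -= 1.0
            return True
        # Over budget (an "abnormal day"): keep only a small random sample.
        return random.random() < self.sample_when_over
```

On a normal day every report passes; during an error storm the backend sees the configured rate plus a trickle of sampled extras, which keeps the Sentry machine standing.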

At a point, getsentry.com was processing approximately 4 million events a day. A majority of this data is stored for 90 days, which accounted for around 1.5TB of SSDs. Web and worker nodes were commodity (8GB-12GB RAM, cheap SATA drives, 8 cores), the only two additional nodes were a dedicated RabbitMQ and Postgres instance (both on SSDs, 12GB-24GB of memory). In theory, given a single high-memory machine, with 16+ cores, and SSDs, you could handle the entirety of the given data set.

I'll count errors in production in the next few weeks to get a better idea of scale, but this seems like a good initial ballpark estimate.
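As a sanity check on that ballpark, the quoted getsentry.com figures (4M events/day, 90 days retention, ~1.5TB of SSD) imply roughly 4 kB per stored event, which is consistent with the ~3 KiB/event implied by the small-scale estimate in the initial setup:

```python
# Cross-check of the quoted getsentry.com figures.
events_per_day = 4_000_000
retention_days = 90
total_bytes = 1.5e12          # ~1.5 TB of SSD, per the quote

bytes_per_event = total_bytes / (events_per_day * retention_days)
print(f"~{bytes_per_event / 1000:.1f} kB per event")  # roughly 4 kB
```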

Event Timeline

Tgr raised the priority of this task from to Needs Triage.
Tgr updated the task description.
Tgr added projects: Sentry, hardware-requests.
Tgr added a subscriber: Tgr.
Restricted Application added a subscriber: Aklapper.
RobH added a subscriber: RobH.

@Tgr: I prefer not to have placeholders in hardware-requests long term, as it just means I always glance at it, and ignore it, even when I should be acting on it. What is the proposed timeline to formalize this? (Alternatively, I can just pull off the hardware request project until you have a specification.)

I'm also setting this to the lowest priority and assigning it to you, since otherwise it is on the top of my hardware-requests.

RobH renamed this task from Procure hardware for Sentry to Procure hardware for Sentry - placeholder (not a live request). Mar 18 2015, 10:36 PM
RobH changed the task status from Open to Stalled.
RobH triaged this task as Lowest priority.
RobH set Security to None.

I'm pulling this off hardware-requests, as that project is for active hardware requests. Once you have a specification for what you need, please feel free to file a task, or update the main task description with what I need for hardware-requests, which is outlined at https://wikitech.wikimedia.org/wiki/Operations_requests#Hardware_Requests

Tgr updated the task description.
Tgr renamed this task from Procure hardware for Sentry - placeholder (not a live request) to Procure hardware for Sentry. Mar 30 2015, 6:43 PM

This is a real request now :) Sorry for the initial confusion!

@Tgr: has the setup in labs been puppetized at this time? We tend not to allocate bare metal until then, since we don't want unpuppetized services on servers in production.

RobH raised the priority of this task from Lowest to Medium. Apr 16 2015, 6:11 PM

@Tgr: Just following up on this request; I see the basic puppetization has been merged live.

The initial request states that this cannot be run in labs due to private data. We can now offer an alternative to bare metal misc. servers, as we run ganeti virtual machines for production-level VM instances. (We run many of our production micro-services out of these virtual machines.)

Since we are entirely uncertain of the hardware requirements (for good reason, as explained in the initial request), may I suggest we place it on a ganeti VM? There is no issue with private data, since access is limited in the same way as other production servers/services.

If the proposed ganeti VM isn't acceptable, let me know and I can list off a specification for allocation approvals.

If the proposed ganeti VM is acceptable, we should remove the hardware-requests project and add in the vm-requests tag. This then typically has @akosiaris approve it (Alex tracks/ensures we evenly balance out instances and resources), and then I'm happy to handle the ganeti instance allocation (per @akosiaris's specifications) and OS install.

Thanks!

Thanks for keeping this in mind, Rob :)
There are two blockers: integrating with some Wikimedia authentication method (T97133) and fixing the SMTP configuration (T116709). Both are small; I just can't find the time to get to them.

I don't know much about the practical difference between a bare metal server and a ganeti VM - if you think a VM is the appropriate choice, I'm fine with that.

This seems like it would indeed be better served by a Ganeti VM.

I'll be removing hardware-requests and adding vm-requests.

Please note that since this is still pending puppetization of the service, it may be declined for now.

This has been sitting since Jan 2016. Any updates on what the real blocker is here (nowadays)?

Do you still request a Ganeti VM?


@Dzahn the blocker is that no one is working on it. There is long-term interest from the Web team at least, but unsure when it will be picked up (and setting up Sentry would not be the first step of that work anyway). So I think we'll need a VM at some point but not right now.

The task is marked stalled for that reason, feel free to remove it from the relevant boards if it is getting in the way.

Ok, thanks @Tgr. I understand, just leaving it stalled. It's fine.

The reason I kept asking is the vm-requests tag, which we check during our clinic duty. I've removed it, since this currently isn't a request, and that way you also won't be pinged again by others. Just add the tag back once it's time to create a VM, thanks!

Declined per the grandparent task. In production we instead use T226986: Client side error logging production launch, which runs on existing hardware.