
Refresh or replace oxygen
Closed, Resolved · Public

Description

oxygen, a system in eqiad with role logging::kafkatee::webrequest::ops, is the box we use to store sampled 1:1000 webrequest logs, to be used in outage investigations and general purpose analysis by the SRE team.

The box has a purchase date of 2011-01-27, i.e. almost 7 years ago, and is thus way past its service life. We should replace it with a newer server with similar specs. Ideally, switching to SSDs, to be able to run greps on the rare occasions they are needed, would be nice.

(Mark/Chris were working on a more comprehensive list of old (>5y) servers, and probably more from this batch will get a procurement refresh task soon. I was just working on oxygen today and noticed how old it is: >3x slower than my 2-year-old laptop…)

@Ottomata/@elukey could provide additional information about the box, as well as storage requirements -- right now it stores <100GB, but perhaps they have needs or plans I haven't considered? We could also consider placing this into a VM, or even merging it with some other related role, e.g. the syslog servers -- although I doubt those have SSDs or are very fast either. Let's discuss and decide here :)

Event Timeline

As far as I know Analytics has no plans for oxygen, I thought that it was completely managed by ops :D

+1 for fast SSDs for occasional greps, even if Filippo recently added https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X, which (in my opinion) covers most of the daily checks that ops need to do.

Maybe Andrew has more context and/or ideas to add!

For me, oxygen is not that useful, as we have 90 days of queryable webrequest data in Hadoop. I suppose it is nice to be able to do some quick sed/awk/jq magic on sampled files on oxygen, but I don't ever do it.

Does ops do this often? If not, maybe we don't need to replace this box at all?

If we do need to replace it, I don't have much input. It does need to be able to consume 150K+ messages per second, sample/filter them, write them to disk, and store them for a short period of time. But since oxygen has been doing this successfully for a while, I imagine a box with similar specs would be fine.
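(For context, kafkatee is what does the consuming/sampling/filtering here. A rough sketch of its config, based on the upstream documented syntax rather than the deployed one -- treat the topic, partition range, and paths as assumptions:)

    # kafkatee.conf sketch -- directive syntax per upstream docs; topic,
    # partition range, and output paths are assumptions, not the live config.
    input [encoding=json] kafka topic webrequest_text partition 0-11 from stored
    # 1:1000 sample of everything, written to a flat file:
    output file 1000 /srv/log/webrequest/sampled-1000.json
    # unsampled 5xx responses, selected through a pipe:
    output pipe 1 grep --line-buffered '"http_status":"5' >> /srv/log/webrequest/5xx.json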

The syslog servers have 2x spinning disks and 16GB of memory, so they're comparable to oxygen performance-wise. I personally use the dashboard @elukey mentioned to investigate HTTP errors; also, oxygen is the box that forwards said logs to logstash for processing. A VM sounds fine to me, and it'd get us SSDs as a bonus.

I use the sampled-1000 logs from time to time (and the 5xx ones, but less frequently), especially in incident-worthy situations, where speed is of the essence.

Additionally, I've written some scripts to parse the logs and analyze them in ways not available in Hadoop (cf. T167907), which is how I found out about oxygen :) These aren't used often, but they're useful for insights into ISP markets, network planning, etc. I guess I could extract those stats from stat1006 instead? Are the webrequest logs available there?

2x spinning disks + 16GB memory could work, but it's really suboptimal (analyzing a single day takes ~12 minutes on oxygen), and the logs are not much data anyway. We keep 60 days on oxygen (we could bump that to 90), and gzip every day but the last (which we could stop doing!), currently accounting for a total of ~70GB of space or so.

Yeah I still use oxygen pretty routinely.

I often prefer being able to construct a CLI pipeline out of jq/grep/sed/sort/uniq/etc... to using a web UI, and usually sampled-1000 and 5xx are enough for my needs (which are invariably either investigating a near-realtime issue, or taking some informal stats over the past N weeks for decision support).

Also, the Hadoop data available in the UI doesn't have the same responsiveness in an immediate situation. IIRC it runs up to an hour behind realtime, whereas tail on the oxygen 5xx log is usually less than a minute behind.
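(To make that concrete, a sketch of the kind of pipeline I mean -- the file paths and JSON field names are guesses at the layout on oxygen, and I'm assuming http_status is a string field:)

    # Top URI paths among sampled 5xx responses for one day
    # (paths/field names are assumptions):
    zcat /srv/log/webrequest/sampled-1000.json-20180404.gz \
      | jq -r 'select(.http_status | startswith("5")) | .uri_path' \
      | sort | uniq -c | sort -rn | head -20

    # Near-realtime tail of the unsampled 5xx stream:
    tail -f /srv/log/webrequest/5xx.json \
      | jq -r '[.dt, .http_status, .uri_host + .uri_path] | @tsv'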

I guess I could extract those stats from stat1006 instead? Are the webrequest logs available there?

Naw, and we only have unsampled in Hadoop.

Also, the Hadoop data available in the UI doesn't have the same responsiveness in an immediate situation. IIRC it runs up to an hour behind realtime,

It actually could be longer.

You can do all that with Kafka and be up to date, but I def understand that it is way easier to just do awk/sed/etc. on files.
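(A sketch of that approach, with placeholder broker/topic names:)

    # Consume webrequest from the latest offset and filter 5xx as they
    # arrive; broker and topic names are placeholders, -o end skips the
    # backlog, -u gives unbuffered output:
    kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_text -o end -u \
      | jq -r 'select(.http_status | startswith("5")) | [.dt, .http_status, .uri_host] | @tsv'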

We don't have any spare hardware with SSDs, but do have spares with 1TB SATA.

wmf4750 - Dell PowerEdge R430 - Dual Intel Xeon E5-2640 v3 2.6GHz - 64GB RAM

Oxygen has only 7974MiB of memory, so this is a HUGE upgrade. The alternative is to order a one-off machine with much lower specs than most of our systems. Otherwise, oxygen isn't using enough disk space to matter (26GB in the largest partition) and will easily fit on the storage of a dual 1TB software-RAID machine.

I'll escalate this to @mark and @faidon for review of allocation of spare machine.

OK, let's do this, approved. It's spinning rust which is unfortunate, but with 64GB of RAM we could probably fit most of the dataset in the page cache, so... :)

Implementation note: oxygen holds 60 days of data. Please transfer these to the new server, so that we keep having 60 days at all times (and not start from scratch) :)
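(e.g. something along these lines, run from the new server -- hostname and paths are assumptions about the layout:)

    # Pull the 60 days of retained logs from oxygen onto the new host,
    # preserving timestamps/permissions (hostname and paths are assumptions):
    rsync -av oxygen.eqiad.wmnet:/srv/log/webrequest/ /srv/log/webrequest/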

faidon mentioned this in Unknown Object (Task). Apr 5 2018, 11:03 AM

Created sub-task T207760 for setup.