Sort out analytics service dependency issues for cp* cache hosts
Closed, Declined · Public

Assigned To

None

Authored By

	BBlack
	Feb 29 2016, 2:09 PM

Description

There are a number of analytics services that run on the cache hosts, e.g. multiple instantiations of varnishkafka, varnishprocessor daemons like varnishrls, etc.

Right now, the only runtime service dependency (as in, configured metadata for systemd) is that they all depend on varnish services. That is, they all have lines like After=varnish-frontend.service (and sometimes also BindsTo=varnish-frontend.service) in their systemd unit files. This makes sense from a certain perspective: they probably require the varnish instance they pull logs from to already be running, and they could perhaps crash or error out if varnish is stopped before them.
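For reference, the current scheme corresponds to a unit fragment roughly like the one below. This is a sketch, not a copy of the deployed units: only the After= and BindsTo= lines come from the description above, and the Description text is made up for illustration.

```ini
# Sketch of a logger unit (e.g. varnishkafka) under the current scheme.
[Unit]
Description=Hypothetical varnish log shipper
# Ordering only: start after varnish-frontend, and stop before it.
After=varnish-frontend.service
# Present in only some of the units: stop this unit whenever
# varnish-frontend is stopped.
BindsTo=varnish-frontend.service
```

After= constrains the order of the two jobs but says nothing about how close together in time they run.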

However, I'm not entirely sure they each require the varnish service they're reading shm logs from to actually be started first, or that they'd error out badly if varnish stopped first. It may be the case that they're capable of starting and stopping asynchronously from the related varnish instance. It would be a nice property to have, and if they don't already have that property, it may be worth investing in some code updates for it.

Because systemd has ultimate control over the parallelism and execution order of all start/stop actions on boot/shutdown, there's no guarantee that the start/stop actions of varnish and the analytics daemons execute in a tight batch in wallclock terms. Therefore, with the current dependency scheme, varnishkafka might for example start several minutes after varnish-frontend does, and might be stopped several minutes before varnish-frontend is. This leaves racy timing gaps where legitimate traffic may flow unlogged by these analytics services. In most cases, especially in the past, this is a trivial time offset on a rare event (reboot), so it's not generally a critical issue, but things are changing...

These days we're looking at auto-(de|re)pool scripts hooked into the init system, which pool or depool services in confd and may wait (for now, via over-long sleeps) to ensure those pooling changes take effect before allowing the main varnish (or nginx) service to stop or start. The net result in practice has been a sequence on shutdown like: "stop varnishkafka; depool self from confd; sleep 45 seconds; stop varnish-frontend". This is a legal interpretation of the current dependencies, and it leaves a more significant window of unlogged traffic. A further related complication is that we have the same issue with the confd dependency itself: confd may already be stopped before the depool action runs, and thus the node doesn't depool itself from inter-service dependencies during the depool window.
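One way to express the ordering the depool step needs is a oneshot hook unit, sketched below. Everything named here apart from varnish-frontend.service is an assumption made for illustration: the hook unit itself, confd.service as the confd unit name, and the pool-self/depool-self script paths are hypothetical placeholders, not the real scripts.

```ini
# Hypothetical pool/depool hook unit (not the deployed configuration).
[Unit]
Description=Hypothetical pool/depool hook for varnish-frontend
# Start after both units. systemd stops units in the reverse of their
# start order, so on shutdown the ExecStop lines below run while confd
# and varnish-frontend are both still up.
After=confd.service varnish-frontend.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Placeholder script paths.
ExecStart=/usr/local/sbin/pool-self
ExecStop=/usr/local/sbin/depool-self
# The over-long wait mentioned above, run after the depool command.
ExecStop=/bin/sleep 45
```

Ordering dependencies only apply between jobs in the same transaction (e.g. a full shutdown), so a manual stop of varnish-frontend alone would not trigger this hook without additional dependencies.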

TL;DR - find out if the various analytics daemons are capable of being asynchronous (in service dependency terms) from varnish itself. If they are, or once they are, we need to flip the systemd dependencies around to avoid unlogged traffic windows: e.g. varnish should depend on varnishkafka, so that logging is always running whenever varnish is running.
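A minimal sketch of what the flipped ordering could look like on the logger side, assuming the daemon can tolerate varnish being absent. The Description text is made up; only the unit name varnish-frontend.service comes from the description above.

```ini
# Sketch of a logger unit (e.g. varnishkafka) after flipping the ordering.
[Unit]
Description=Hypothetical varnish log shipper
# The old After=/BindsTo= lines on varnish-frontend.service are removed.
# Start before varnish-frontend; on shutdown the order is reversed,
# so this unit is stopped after varnish-frontend.
Before=varnish-frontend.service
```

Before= is ordering only and does not pull either unit in. The old After=varnish-frontend.service line would have to go at the same time, since keeping both would create an ordering cycle.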

Related Objects

Mentioned Here: T138747: Varnishkafka should auto-reconnect to abandoned VSM

Event Timeline

BBlack created this task. Feb 29 2016, 2:09 PM

Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 29 2016, 2:09 PM

BBlack updated the task description. Feb 29 2016, 2:10 PM

elukey subscribed. Feb 29 2016, 2:11 PM

Milimetric triaged this task as Medium priority. Feb 29 2016, 5:12 PM

Milimetric moved this task from Incoming to Radar on the Analytics board.

Milimetric moved this task from Radar to Event Platform on the Analytics board.

• ema added a project: Varnish. Apr 1 2016, 1:37 PM

T138747 upgraded Varnishkafka to a new version that can start at any time and periodically polls the Varnish shm logs to see whether they are open. This should allow us to start Varnishkafka before Varnish without any issue.

• ema moved this task from Backlog to Caching on the Traffic board. Sep 30 2016, 3:15 PM

• Nuria moved this task from Event Platform to Wikistats on the Analytics board. Oct 31 2016, 3:50 PM

• Nuria moved this task from Wikistats to Dashiki on the Analytics board. Jan 6 2017, 4:47 PM

Question for the Traffic team: is this task still valid after T138747 or shall we call it done?

I think there's still some work to do here, if nothing else to audit the situation as it stands. There are basically two things to sort out for all of the varnish-logging bits and pieces:

1. Have we killed the hard dependency on Varnish being online? (Can we start the logger first and have it connect/reconnect as Varnish goes up and down?)
2. Have we re-ordered the systemd level dependencies to ensure we're not losing log events? (Can we make Varnish services dependent on the loggers being ready to receive events?)

Number 1 was done in T138747, but I was a bit reluctant to create any dependency between Varnish and Varnishkafka that could for some reason end up in a situation like "Varnishkafka is not available, so Varnish stays down".
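A soft dependency is one possible way to get the ordering without that risk. The drop-in below is only a sketch: the drop-in file name and varnishkafka.service as the logger's unit name are assumptions (the real hosts run multiple varnishkafka instances, whose unit names are not given here).

```ini
# Hypothetical drop-in, e.g. varnish-frontend.service.d/logger.conf
[Unit]
# Pull the logger in when varnish-frontend starts. With Wants=, a logger
# that is missing or fails to start does not block varnish-frontend.
Wants=varnishkafka.service
# Wait for the logger's start job to finish before starting
# varnish-frontend, and stop varnish-frontend first on shutdown.
After=varnishkafka.service
```

Using Requires= in place of Wants= here would produce exactly the "Varnish stays down" behaviour, because with Requires= plus After= a failed logger start prevents the dependent unit from starting.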

The only use case I have in mind where we could risk losing events is after boot, when Varnish might start serving traffic while varnishkafka is still waiting to start (if possible). Any other ones?

Will investigate the possible solutions offered by systemd and report back :)

elukey added a project: User-Elukey. Aug 15 2017, 7:26 AM

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. Aug 16 2017, 8:04 AM

JAllemandou moved this task from Dashiki to Backlog (Later) on the Analytics board. Aug 28 2017, 3:48 PM

• fdans moved this task from Backlog (Later) to Wikistats on the Analytics board. Oct 2 2017, 4:25 PM

elukey moved this task from Analytics Backlog to Backlog on the User-Elukey board. Feb 16 2018, 12:01 PM

Milimetric raised the priority of this task from Medium to Needs Triage. Apr 2 2018, 3:50 PM

Milimetric triaged this task as Low priority.

Milimetric moved this task from Wikistats to Operational Excellence on the Analytics board.

The fix for this would be high risk and low gain, so we're keeping the task around just to have context in case the problem does manifest.

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:32 PM

elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Jan 3 2020, 10:57 AM

This is too stale now; a lot of these bits have been replaced over time and are known to have their deps correct.
