
Implement SLI measurement for Varnish Frontend
Closed, Resolved · Public

Description

As part of the ongoing work to define a Service-Level Objective (SLO) for the frontend component of the CDN, we need to implement a Service-Level Indicator (SLI) specific to Varnish. To ensure that the SLI really focuses on varnish and is not influenced by other components of our architecture (e.g. origin servers), we have decided to try and extract the amount of processing time attributable to varnish alone, which adds up to the total time required to serve a request. For example, we want to measure the time it took to look up a certain object in cache, or the LRU time spent finding which objects to evict in order to make room for a new one, but not the time it took MediaWiki to generate a response. Further, we want to focus on the fraction of responses that are not varnish internal errors.

Importantly, the status code does not help much here in distinguishing between responses we want to consider "good" vs "bad". There can be 503 responses that we want to consider OK for this SLI, such as a 503 error sent by Swift, and 200 responses that are not OK, such as a "FetchError: Could not get storage" on a streaming response, where varnish first sends 200 OK only to give up later due to a storage error.

When it comes to implementation, varnish offers several different timestamps that we can analyze. They are divided into two categories reflecting the internal architectural separation between Client (client request/response processing) and Backend (origin request/response processing). Confusingly, the term "backend" is overloaded here: we call the on-disk component of our CDN "backend" (currently ats-be), while in varnish terminology it refers to the part of varnish responsible for sending requests to the origins. Client is also sometimes called "frontend" in the Varnish docs, but to avoid further confusion let's call it Client here. With this in mind, the available timestamps are as follows:

| Side | Timestamp | Description |
|---|---|---|
| Client | Start | The start of request processing (first byte received or restart) |
| Client | Req | Complete client request received |
| Client | ReqBody | Client request body processed (discarded, cached or passed to the backend) |
| Client | Waitinglist | Came off waitinglist |
| Backend | Start | Start of the backend fetch processing |
| Backend | Bereq | Backend request sent |
| Backend | Beresp | Backend response headers received |
| Client | Fetch | Fetch processing finished (completely fetched or ready for streaming) |
| Client | Process | Processing finished, ready to deliver the client response |
| Backend | BerespBody | Backend response body received |
| Client | Resp | Delivery of response to the client finished |

The events should always happen chronologically in the order above up to Backend.Beresp (origin server response headers received). What happens next depends on the timing of individual responses and, importantly, on whether or not streaming is enabled. If it is not, then the whole Backend processing part must finish before the Client part continues, i.e. Backend.BerespBody happens after Backend.Beresp and before Client.Fetch.
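As an illustration of what those timestamps look like on the wire, each VSL Timestamp record carries three fields: absolute time, time since the start of the work unit, and time since the previous timestamp. Using the values from the example transaction quoted further below, the chronological invariant can be sanity-checked like this (a sketch, not part of the actual tooling):

```python
# Each tuple is (event, absolute, since_start, since_last), taken from the
# Client side of the example transaction shown later in this task.
records = [
    ("Start",   1623231891.667378, 0.000000, 0.000000),
    ("Req",     1623231891.667378, 0.000000, 0.000000),
    ("Fetch",   1623231891.829699, 0.162321, 0.162321),
    ("Process", 1623231891.829789, 0.162411, 0.000089),
    ("Resp",    1623231891.829865, 0.162487, 0.000076),
]

# Absolute times must be non-decreasing, and each since_last field is the
# delta from the previous timestamp's absolute time (up to log rounding).
for (_, t0, _, _), (_, t1, _, last) in zip(records, records[1:]):
    assert t1 >= t0
    assert abs((t1 - t0) - last) < 1e-5
```

The since_last fields also sum up (modulo rounding) to the final since_start value, which is what makes per-phase attribution of the total service time possible.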

For pipe sessions, the whole Backend part is replaced by just two timestamps, Pipe (Opened a pipe to the backend and forwarded the request) and PipeSess (The pipe session has finished).

This task is about defining the "varnish processing time" on the basis of those timestamps, as well as defining how to single out varnish internal errors, which may amount to filtering out the Error VSL tag as well as FetchError except for "Pass delivery abandoned".
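A minimal sketch of the error-filtering rule just described, assuming transactions arrive as (tag, payload) pairs; the helper name is made up for illustration:

```python
def is_good(tags):
    """Return True if a transaction has no varnish-internal error.

    tags: iterable of (tag, payload) pairs from one VSL transaction.
    A transaction is "bad" if it carries an Error tag, or a FetchError
    other than "Pass delivery abandoned".
    """
    for tag, payload in tags:
        if tag == "Error":
            return False
        if tag == "FetchError" and payload != "Pass delivery abandoned":
            return False
    return True

print(is_good([("FetchError", "Could not get storage")]))  # False
```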

Event Timeline

lmata moved this task from Inbox to Backlog on the observability board.
lmata added subscribers: herron, lmata.

I wrote a varnishlog consumer to see the actual maximum values for each timestamp we're currently getting:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
  VarnishMaxTimestamps
  ~~~~~~~~~~~~~~~~~~~~~~~
"""

import argparse
import sys

from wikimedia_varnishlogconsumer import BaseVarnishLogConsumer

max_values = {}

class VarnishMaxTimestamps(BaseVarnishLogConsumer):
    description = 'Varnish max timestamps'

    def varnishlog_args(self):
        return ['-g', 'request', '-q', 'ReqMethod ne "PURGE"', '-i', 'Timestamp']

    def handle_end(self):
        for name, value in self.tx.items():
            if name not in max_values or max_values[name] < value:
                max_values[name] = value
                print(max_values)

    def handle_line(self, line):
        """We want to get both Request and BeReq data, so override handle_line
        to consider lines starting with '* ' as the beginning of a request, and empty
        lines as the end. For example:
 
        *   << Request  >> 270178439 
        -   Timestamp      Start: 1623231891.667378 0.000000 0.000000
        -   Timestamp      Req: 1623231891.667378 0.000000 0.000000
        -   Timestamp      Fetch: 1623231891.829699 0.162321 0.162321
        -   Timestamp      Process: 1623231891.829789 0.162411 0.000089
        -   Timestamp      Resp: 1623231891.829865 0.162487 0.000076
        **  << BeReq    >> 270178440 
        --  Timestamp      Start: 1623231891.667589 0.000000 0.000000
        --  Timestamp      Bereq: 1623231891.667654 0.000065 0.000065
        --  Timestamp      Beresp: 1623231891.829577 0.161988 0.161923
        --  Timestamp      BerespBody: 1623231891.829847 0.162258 0.000270
        (empty line here marks the end of the request)
        """
        if line.startswith('* '):
            self.tx = {}

        elif line == '':
            self.handle_end()

        else:
            split = line.split()
            tag, name = split[1], split[2].rstrip(':')
            if tag == 'Timestamp':
                # The last field is the time elapsed since the previous timestamp
                self.tx[name] = float(split[5])

if __name__ == "__main__":
    VarnishMaxTimestamps(sys.argv[1:]).main()

After letting the script run for several minutes on cp3050 (cache_text @ esams), here's the result, all values in seconds:

| Timestamp | Max (s) | Description |
|---|---|---|
| Start | 0.0 | The start of request processing (first byte received or restart) |
| Retry | 0.000032 | Backend request is being retried |
| Pipe | 0.000306 | Opened a pipe to the backend and forwarded the request |
| Process | 0.051615 | Processing finished, ready to deliver the client response |
| Req | 0.519986 | Complete client request received |
| PipeSess | 1.581463 | The pipe session has finished |
| Resp | 2.580735 | Delivery of response to the client finished |
| BerespBody | 2.580966 | Backend response body received |
| Bereq | 2.728727 | Backend request sent |
| ReqBody | 2.728979 | Client request body processed (discarded, cached or passed to the backend) |
| Beresp | 3.529783 | Backend response headers received |
| Fetch | 3.530131 | Fetch processing finished (completely fetched or ready for streaming) |
| Waitinglist | 5.27592 | Came off waitinglist |

Some of those really look like worst-case exceptions. Here is p99 for 1 million requests instead:

| Timestamp | p99 (s) |
|---|---|
| Req | 0.0 |
| Start | 0.0 |
| Pipe | 0.00034 |
| Process | 0.000341 |
| Bereq | 0.041792 |
| Resp | 0.109512 |
| ReqBody | 0.127108 |
| BerespBody | 0.323746 |
| PipeSess | 0.546274 |
| Beresp | 1.13497 |
| Fetch | 1.135334 |
| Waitinglist | 1.801998 |
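For reference, a nearest-rank p99 over collected samples can be sketched as follows; this is an illustrative stand-in, not the script actually used to produce the numbers above (which only tracks maxima):

```python
def p99(samples):
    """Nearest-rank 99th percentile of a list of durations."""
    s = sorted(samples)
    idx = max(0, int(round(0.99 * len(s))) - 1)
    return s[idx]

print(p99(list(range(1, 101))))  # 99
```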

Start, Retry, Process, and Pipe seem to be valid candidates for the SLI, as they do not depend on the performance of either clients (ats-tls) or origins (ats-be). In any case, all this data seems very useful in general, so I think it's worth adding all timestamps to Prometheus anyway.
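Under that assumption, the per-request "varnish processing time" could be sketched as the sum of the since-last durations of just those candidate events (function name hypothetical, input shaped like the tx dict built by the consumer above):

```python
# Events whose since-last durations are attributable to varnish alone.
VARNISH_ONLY = {"Start", "Retry", "Process", "Pipe"}

def processing_time(timestamps):
    """timestamps: mapping of event name -> since-last duration (seconds),
    as collected by a varnishlog consumer like the one above."""
    return sum(d for name, d in timestamps.items() if name in VARNISH_ONLY)

tx = {"Start": 0.0, "Req": 0.5, "Process": 0.000341, "Resp": 0.1}
print(processing_time(tx))  # 0.000341
```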

Change 699223 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add timing data to varnishmtail

https://gerrit.wikimedia.org/r/699223

Change 699941 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add prometheus histogram varnish_processing_seconds

https://gerrit.wikimedia.org/r/699941

Change 699223 merged by Ema:

[operations/puppet@production] varnish: add timing data to varnishmtail

https://gerrit.wikimedia.org/r/699223

Change 699941 merged by Ema:

[operations/puppet@production] varnish: add prometheus histogram varnish_processing_seconds

https://gerrit.wikimedia.org/r/699941

Change 700198 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: install mtail program varnishprocessing.mtail

https://gerrit.wikimedia.org/r/700198

Change 700198 merged by Ema:

[operations/puppet@production] varnish: install mtail program varnishprocessing.mtail

https://gerrit.wikimedia.org/r/700198

The following timestamps depend on external factors and should not be used for this SLI:

  • beresp
  • berespbody
  • fetch
  • pipesess
  • req
  • reqbody
  • waitinglist

All others look good. Now we need to find the best way to filter out varnish internal errors (e.g. exclude -q Error or FetchError, or something along those lines). Perhaps adding the Error / FetchError tags to varnishmtail would also be useful in other circumstances?

Change 700876 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnishmtail: add Error and FetchError VSL tags

https://gerrit.wikimedia.org/r/700876

Change 700876 merged by Ema:

[operations/puppet@production] varnishmtail: add Error and FetchError VSL tags

https://gerrit.wikimedia.org/r/700876

Change 701070 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add error counters to Prometheus

https://gerrit.wikimedia.org/r/701070

Change 701070 merged by Ema:

[operations/puppet@production] varnish: add error counters to Prometheus

https://gerrit.wikimedia.org/r/701070

Change 701358 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add varnish_sli_good, varnish_sli_all counters

https://gerrit.wikimedia.org/r/701358

Change 701358 merged by Ema:

[operations/puppet@production] varnish: add counters for Varnish SLI

https://gerrit.wikimedia.org/r/701358

ema claimed this task.

This is now done, including a basic dashboard based on the metrics introduced as part of this task: https://grafana.wikimedia.org/d/uIGz8Ak7k/varnish-slo. I'd leave further refinements to the SLO values as well as the dashboard itself for future work, possibly in the context of T274665.
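For completeness, the SLI itself is just the ratio of the two counters; a trivial sketch (names only mirror the varnish_sli_good / varnish_sli_all metrics, the real computation happens in Prometheus/Grafana):

```python
def sli(good, total):
    """Fraction of responses free of varnish-internal errors."""
    return good / total if total else 1.0

print(sli(999_850, 1_000_000))  # 0.99985
```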

Change 709964 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove Resp time from internal SLI

https://gerrit.wikimedia.org/r/709964

Change 709964 merged by Ema:

[operations/puppet@production] varnish: remove Resp time from internal SLI

https://gerrit.wikimedia.org/r/709964