As part of the ongoing work to define a Service-Level Objective (SLO) for the frontend component of the CDN, we need to implement a Service-Level Indicator (SLI) specific to Varnish. To ensure that the SLI really focuses on Varnish and is not influenced by other components of our architecture (e.g. origin servers), we have decided to try and extract the processing time attributable to Varnish alone out of the total time required to serve a request. For example, we want to measure the time it took to look up a certain object in cache, or the time spent by the LRU code finding which objects to evict in order to make room for a new one, but not the time it took MediaWiki to generate a response. Further, we want to focus on the fraction of responses that are not Varnish internal errors.
Importantly, the status code alone does not help much in distinguishing between responses we want to consider "good" vs "bad". There can be 503 responses that we want to consider OK for this SLI, for example a 503 error sent by Swift, and 200 responses that are not OK, such as a "FetchError: Could not get storage" on a streaming response, where Varnish first sends 200 OK, only to give up later due to a storage error.
When it comes to implementation, Varnish offers several timestamps that we can analyze. They are divided into two categories reflecting the internal architectural separation between Client (client request/response processing) and Backend (origin request/response processing). Confusingly, the term "backend" here is overloaded: we call the on-disk component of our CDN "backend" (currently ats-be), while in Varnish terminology it refers to the part of Varnish responsible for sending requests to the origins. Also, Client is in places known as "frontend" in the Varnish docs, but to avoid further confusion let's call it Client here. With this in mind, the available timestamps are as follows:
| Side | Timestamp | Description |
|------|-----------|-------------|
| Client | Start | The start of request processing (first byte received or restart) |
| Client | Req | Complete client request received |
| Client | ReqBody | Client request body processed (discarded, cached or passed to the backend) |
| Client | Waitinglist | Came off waitinglist |
| Backend | Start | Start of the backend fetch processing |
| Backend | Bereq | Backend request sent |
| Backend | Beresp | Backend response headers received |
| Client | Fetch | Fetch processing finished (completely fetched or ready for streaming) |
| Client | Process | Processing finished, ready to deliver the client response |
| Backend | BerespBody | Backend response body received |
| Client | Resp | Delivery of response to the client finished |
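Each of the timestamps above is emitted as a VSL `Timestamp` record carrying three numeric fields: the absolute (epoch) time, the time since the start of the transaction, and the time since the previous timestamp. As a minimal sketch of working with them, the following parser turns one raw `varnishlog` line into a named tuple (the sample line format is taken from typical `varnishlog` output; field positions are an assumption about that textual format, not an API guarantee):

```python
from typing import NamedTuple

class Timestamp(NamedTuple):
    label: str          # event name, e.g. "Start", "Req", "Beresp", "Resp"
    absolute: float     # wall-clock time (epoch seconds)
    since_start: float  # seconds since the start of this transaction
    since_last: float   # seconds since the previous Timestamp record

def parse_timestamp(vsl_line: str) -> Timestamp:
    """Parse a raw VSL Timestamp line, e.g.:
    '-   Timestamp      Resp: 1585924820.129535 0.000126 0.000025'
    """
    fields = vsl_line.split()
    # fields: ['-', 'Timestamp', 'Resp:', '1585924820.129535', '0.000126', '0.000025']
    label = fields[2].rstrip(":")
    return Timestamp(label, float(fields[3]), float(fields[4]), float(fields[5]))
```

The `since_start` field is the most useful one for this task, since it lets us place every event on the transaction's own timeline without clock arithmetic.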
The events should always happen chronologically in the order above up to Backend.Beresp (origin server response headers received). What happens next depends on the timing of individual responses and, importantly, on whether or not streaming is enabled. If it is not, then the whole Backend processing part should finish before the Client part continues, i.e. Backend.BerespBody happens after Backend.Beresp and before Client.Fetch.
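To make the idea concrete, here is one plausible (not yet agreed-upon) definition of "varnish processing time": the client transaction's total time minus the time spent waiting for the origin, taken as the interval between Backend.Bereq and Backend.Beresp on the backend transaction's own timeline. The dictionaries map timestamp labels to their "seconds since transaction start" values; both the data shape and the formula are assumptions for illustration:

```python
def varnish_processing_time(client_ts: dict, backend_ts: dict) -> float:
    """client_ts / backend_ts map Timestamp labels to 'since transaction
    start' seconds. A hit has no backend transaction, hence the default
    of 0.0 origin wait when Bereq/Beresp are absent."""
    total = client_ts["Resp"]  # full client-side duration
    # Time between sending the origin request and receiving its headers:
    origin_wait = backend_ts.get("Beresp", 0.0) - backend_ts.get("Bereq", 0.0)
    return total - origin_wait
```

For streaming responses the body fetch (Backend.BerespBody) overlaps client delivery, so whether and how to subtract body-fetch time is exactly the kind of question this task needs to settle.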
For pipe sessions, the whole Backend part is replaced by just two timestamps: Pipe (opened a pipe to the backend and forwarded the request) and PipeSess (the pipe session has finished).
This task is about defining "varnish processing time" on the basis of those timestamps, as well as defining how to single out Varnish internal errors, which may amount to filtering on the Error VSL tag as well as on FetchError, except for "Pass delivery abandoned".
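The error-filtering rule just described can be sketched as a small classifier over a transaction's VSL tags. The `(tag, value)` pair representation is a hypothetical intermediate form, and the exemption string is taken verbatim from the task text:

```python
def is_good(vsl_tags: list[tuple[str, str]]) -> bool:
    """vsl_tags: (tag, value) pairs from one transaction. A transaction
    counts as a Varnish internal error (bad) if it carries an Error tag,
    or a FetchError other than 'Pass delivery abandoned'."""
    for tag, value in vsl_tags:
        if tag == "Error":
            return False
        if tag == "FetchError" and value != "Pass delivery abandoned":
            return False
    return True
```

Note that this deliberately ignores the HTTP status code, per the reasoning above: a Swift 503 passes through as "good", while a 200 with "FetchError: Could not get storage" is classified "bad".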