
REST API entry point web request statistics at the Varnish level
Closed, Declined · Public

Description

Most high-traffic REST API entry points are cached in Varnish. While we do have Graphite metrics for each entry point in RESTBase, those only capture cache misses, and thus won't give an accurate picture of overall API use.

To get accurate overall API stats, it would be helpful to count web requests according to the entry points defined in the Swagger spec (doc view). Ideally, we'd drive this entirely from the spec, and avoid the need to manually update matchers. I have written code for turning URI specs into regexps in the past, and am willing to help.
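For illustration, a minimal sketch of the spec-to-regexp idea (hypothetical code, not the code mentioned above; real Swagger templates have more syntax than this handles):

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object SpecMatcher {
  // Compile a Swagger-style URI template such as "/page/html/{title}"
  // into an anchored regex; each {param} matches a single path segment.
  def compile(template: String): Regex = {
    val pattern = template.split("/", -1).map {
      case s if s.startsWith("{") && s.endsWith("}") => "[^/]+"
      case s if s.nonEmpty => Pattern.quote(s)
      case _ => ""
    }.mkString("/")
    ("^" + pattern + "$").r
  }
}

// SpecMatcher.compile("/page/html/{title}")
//   matches "/page/html/Foo" but not "/page/html/a/b".
```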

See also: T142139: Top API user agents stats


Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description.
GWicke added projects: Analytics, RESTBase.
GWicke subscribed.
Milimetric subscribed.

@GWicke: can you give us some examples of what you'd like to see in these reports?

@Milimetric: The main bit of information we are looking for is the number of Varnish requests per API entry point, ideally as a metric hierarchy mirroring restbase.external.* in Graphite (restbase.external-varnish.*?).

We already have per-entrypoint information from the backend, but this does not include cache hits. Per-entrypoint Varnish stats would let us gauge hit rates per entry point, as well as get an accurate picture of overall entry point popularity.
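For illustration, a metric name in such a hierarchy might be built like this (a hypothetical sketch; the exact sanitization and naming convention would need to match what RESTBase already emits):

```scala
object VarnishMetrics {
  // Build a Graphite metric name under the suggested
  // restbase.external-varnish.* hierarchy. Graphite uses "." to separate
  // hierarchy levels, so strip "{}" and map "/" to "_" in endpoint labels.
  def metricName(endpoint: String, status: Int): String = {
    val label = endpoint.stripPrefix("/").replaceAll("[{}]", "").replace('/', '_')
    s"restbase.external-varnish.$label.$status"
  }
}

// VarnishMetrics.metricName("/page/html/{title}", 200)
//   => "restbase.external-varnish.page_html_title.200"
```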

@GWicke varnish should be reporting these stats, not the cluster, correct? Otherwise, stats reporting might take hours after the requests have happened (cluster processing of requests is not real time).

In that case, I think this is work that the Services team can also tackle, especially since it affects every single user of the RESTBase platform. We can collaborate on this project, but I do not see it as our sole responsibility.

@Nuria, these metrics would be derived from log data in an asynchronous fashion. We already have some hive-based stats for accesses to the REST API in general, and this task is proposing to refine this to provide per-entrypoint stats.

@GWicke: I think these stats, if they are to be used for monitoring, should be real-time(ish) and, if possible, sent by Varnish itself via statsd. What is the rationale for using the cluster to compute them if we have a statsd client for Varnish?

Streaming seems like a good fit for this type of use case, but (team, correct me if I am wrong) I do not think we are close to being able to use streaming.

What is the rationale to use the cluster to compute them if we have a statsd client for varnish?

I am not aware of any general real-time traffic analytics ability in Varnish, and I think there are good architectural reasons for minimizing such processing in Varnish itself.

While close to real-time data would be nice to have, our main use cases are more about long-term usage trends, and can tolerate some delay. We do not currently plan to use this data for alerts.

I am not aware of any general real-time traffic analytics ability in Varnish, and I think there are good architectural reasons for minimizing such processing in Varnish itself.

We used to have a statsd client in Varnish; let me see whether we are still using it. I do not think there were perf concerns.

We can't prioritize this during the current quarter due to other work, but we can point you to other teams (like Search) that have deployed similar jobs, and help you along.

IRC conversation on the topic:

gwicke
maybe we could set up those stats with a varnish consumer
err, kafka consumer
nuria (IRC)
gwicke: ya, that use case seems better tailored for real-time processing + Grafana posting of data
gwicke: than by swapping massive amounts of data on the cluster, as really it is ops data
gwicke: but to be clear you do not need that ticket done to do cache hit ratio analysis on your end
gwicke
I was just thinking about getting live hit rates per entry point
this would do the splitting, and recording hit vs. miss would be quite easy
nuria (IRC)
gwicke: right, but anything in the cluster now is async; *almost seems* that it is better done with something that filters the incoming stream for restbase and sends to graphite the metrics, which are (endpoint, http-code) and (endpoint, cache_status). not sure, i think that is why we have not done it yet
gwicke: scala can do it just like we do for the pageview api but seems inefficient, again, not super sure
gwicke
changeprop should be able to do this as well
although performance could be an issue, as it would need to look at all logs
is the web log topic well partitioned?
nuria (IRC)
ebernhardson: back to your queries
gwicke: partitioned, wait..? what, hive / kafka?
gwicke
kafka
nuria (IRC)
gwicke: consuming raw from kafka is not easy as you would need to comb through tons of data; it should probably be an already refined stream
gwicke: not sure how changeprop is related, these are pageviews we are talking about, 200,000 per sec
gwicke
right, and IIRC the last changeprop consumer measurement was about 20k JSON messages per second per core
a refined stream would certainly be nicer
nuria (IRC)
gwicke: but these are kind of raw thoughts; refining data on hive and posting to graphite would work too. i will put that back on our radar, but our current efforts are editing and streams
gwicke: ya, i wouldn't consider otherwise
gwicke
 /api/rest_v1/ is about 5k/s
nuria (IRC)
gwicke: ya, that item was slated for this quarter but editing data was a higher priority, sorry, that is why it wasn't done
gwicke
no worries, we can live without it
but it would be great to have, as it would give us a lot more insight into how the API is actually used
nuria (IRC)
gwicke: if you want to get it done sooner rather than later, why don't we work with mobrovac or Pchelolo so they can set up the scala jobs?
gwicke: that would be the fastest
Pchelolo
ah? what? where? scala jobs?

nuria (IRC)
Pchelolo gwicke : see: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/RESTBaseMetrics.scala
Pchelolo: i think that enhancing that job should be sufficient to do the reporting, let me see
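(For context on the hit/miss split discussed above: such a job would need to classify each request. A rough classifier over the x_cache field might look like the following sketch; treating any "hit" as a cache hit is an assumption, not the production definition.)

```scala
object CacheStatus {
  // Rough hit/miss classification based on the x_cache field of a
  // webrequest record, which lists the cache layers a request passed
  // through (e.g. "cp1066 miss, cp3040 hit/2").
  def classify(xCache: String): String =
    if (xCache != null && xCache.contains("hit")) "hit" else "miss"
}
```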

Can @GWicke or @Ottomata update this ticket with their conversation from the collab jam?

The favored solution seems to be for Services to consume from webrequest text (40,000 - 80,000 msgs per sec); consumption can be parallelized, and the Swagger specs can be used to filter the data.

Analytics can assist services team as needed.

Or at least, working with Services so they can get these metrics themselves. Gabriel and I talked about them making workers to consume from the webrequest_text topic (which hovers between 40K-70K msgs/sec), and then playing the request paths through their spec routing logic to emit metrics to graphite. That's one of a few solutions we talked about, but it seemed to be the least work for Analytics, and the one Gabriel liked the most :)

Yeah, we touched on a few options, including using kafkacat to efficiently narrow down events to those that match the string /api/rest_v1/, and then using the Swagger-spec-based routing code to increment metrics the same way we do for cache misses. There are other options we want to consider a bit more closely before we can make a decision.
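A minimal sketch of that kind of consumer, assuming the standard Kafka client; the broker address is a placeholder, and the topic name is taken from the discussion above:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object RestV1Counter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092") // placeholder host
    props.put("group.id", "restv1-entrypoint-stats")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("webrequest_text"))

    while (true) {
      val records = consumer.poll(1000L)
      for (record <- records.asScala) {
        // Cheap substring filter first, like the kafkacat approach above,
        // before any JSON parsing or spec-based routing.
        if (record.value.contains("/api/rest_v1/")) {
          // route(record.value) ... increment the matching endpoint metric
        }
      }
    }
  }
}
```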

If you can budget some time to help us get access to the data as a stream or in Hadoop, then I think we should be able to work something out that leverages the specs & avoids duplicate effort and inconsistencies.

If you can budget some time to help us get access to the data as a stream or in Hadoop, then I think we should be able to work something out that leverages the specs & avoids duplicate effort and inconsistencies.

Our idea on this is to dedicate some time next quarter to building a prototype of real-time refining of data; once that is done, it should be a matter of consuming that refined stream (you do not want raw data) and matching it against the different patterns we want to count. This is probably (pending the status of goals this quarter) a good goal for Q4.

I have added a parent task here to group use cases similar to yours, as we have had several as of late.

@Nuria Could you explain a bit what the difference is between refined data and raw data in this context? All we need here is URIs that we can filter and map against the spec.

Yeah, in this case raw vs. refined doesn't make a difference, but as part of stream refinement, we had talked about splitting the firehose webrequest topics into smaller, more service-level topics that would make building smaller jobs easier and require consuming less data.

We had a discussion today that was focused on solving this request in a more generic way, instead of just for REST API request metrics -> graphite. But it is a larger discussion.

For your immediate use case, I think you can just consume from the webrequest_text topic in the analytics-eqiad Kafka cluster, and experiment with handling the uri_path through your routing stuff.

+1 to what Andrew said. We don't want to block you on doing that. We will start building a simple infrastructure to do things like what you're doing here in general. Some people will need raw URIs and some will need refined data. We'll make it so that we group all filtering of a specific type of data together, so we touch the data the least number of times. We're adding this to the parent task just to remember that it's one example use case, but don't let it stop you or change your plans. If the solution we're imagining here is as good as we think it will be, it will be easy to migrate.

@Pchelolo: For at least two reasons I can think of: URLs and hosts are "normalized" as part of the refine process, and it is likely that you want more things besides the plain URL for stats (like country of origin or user agent, which are not on the raw stream).
We do not want to replicate the code that normalizes URLs/hosts/media URLs, thus consuming from the refined stream makes sense.

I've put a very WIP solution that uses change-prop here: https://github.com/wikimedia/change-propagation/pull/165
The solution reuses the router code we wrote for RESTBase and should be pretty efficient, but I haven't tested it much yet. My main worry is that CP wouldn't keep up with the rate of events from Kafka. Anyway, it should be considered a temporary hack until Analytics has a better way to do stream processing.

A couple of assumptions I should verify first with @Ottomata or @Nuria before I continue:

  1. I can connect to the analytics Kafka cluster from sub, right? What are the hosts?
  2. The webrequest_text events conform to this schema: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Current_Schema
  3. What's the number of partitions on that topic and what's the average rate of events?

The webrequest_text events conform to this schema: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Current_Schema

No, they do not. That is a table in Hive with refined data; the stream of raw events is a subset of that set. The raw table is webrequest in the wmf_raw database in Hive. We document (on purpose) what we support consuming from, hence my suggestion that you consume refined rather than raw events.

What's the number of partitions on that topic and what's the average rate of events?

about 40,000-80,000 per sec for text, I think

Sorry, forgot to include raw data table schema:

col_name           data_type  comment
hostname           string     from deserializer
sequence           bigint     from deserializer
dt                 string     from deserializer
time_firstbyte     double     from deserializer
ip                 string     from deserializer
cache_status       string     from deserializer
http_status        string     from deserializer
response_size      bigint     from deserializer
http_method        string     from deserializer
uri_host           string     from deserializer
uri_path           string     from deserializer
uri_query          string     from deserializer
content_type       string     from deserializer
referer            string     from deserializer
x_forwarded_for    string     from deserializer
user_agent         string     from deserializer
accept_language    string     from deserializer
x_analytics        string     from deserializer
range              string     from deserializer
x_cache            string     from deserializer
webrequest_source  string     Source cluster
year               int        Unpadded year of request
month              int        Unpadded month of request
day                int        Unpadded day of request
hour               int        Unpadded hour of request

Partition information: the last five columns (webrequest_source, year, month, day, hour) are the table's partition columns.
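Given that schema, a one-off per-entry-point count could be sketched in Spark like this (assuming Spark with Hive support; the partition values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object RestV1Counts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("restv1-entrypoint-counts")
      .enableHiveSupport()
      .getOrCreate()

    // Count REST API requests per uri_path and cache_status for one
    // hour of raw text-cluster data.
    spark.sql(
      """SELECT uri_path, cache_status, COUNT(*) AS requests
        |FROM wmf_raw.webrequest
        |WHERE webrequest_source = 'text'
        |  AND year = 2017 AND month = 5 AND day = 1 AND hour = 0
        |  AND uri_path LIKE '/api/rest_v1/%'
        |GROUP BY uri_path, cache_status
        |ORDER BY requests DESC
        |""".stripMargin
    ).show(50, truncate = false)
  }
}
```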

In a discussion at the hackathon with @JAllemandou, we decided to reuse the Spark infrastructure for this. @Pchelolo will develop a Scala-based router that takes a spec and matches request URIs to RESTBase endpoints, and @JAllemandou will integrate it on the analytics side.
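A minimal sketch of the matching side of such a router (the endpoint labels and the pattern-compilation step are assumed; cf. the regex sketch in the description):

```scala
import scala.util.matching.Regex

// routes: (endpoint label, pattern compiled from the spec), in spec order.
class EndpointRouter(routes: Seq[(String, Regex)]) {
  // Map a request path to the first matching endpoint label, if any.
  def route(uriPath: String): Option[String] =
    routes.collectFirst {
      case (endpoint, pattern) if pattern.pattern.matcher(uriPath).matches() => endpoint
    }
}
```

A linear scan over the routes is the simplest thing that could work; at webrequest rates a real implementation would likely want a prefix tree or combined regex instead.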

Before I start, which version of Scala should I use? Are there any code-style guidelines available for the WMF use of Scala? Any particular libraries we use for logging and common tasks?

which version of Scala should I use?

@JAllemandou can answer better than I, but https://github.com/wikimedia/analytics-refinery-source/blob/master/pom.xml#L310 has us at 2.10.4. Not sure if it matters though.

We don't have any official scala guidelines, but you can poke around in https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job to see what it looks like there.

@JAllemandou, I've been using 4-space indentation, and I see you have 2 spaces in some of those files! Ah! We should standardize; will poke you elsewhere about this :)

Hi @Pchelolo,
We use scala 2.10.4 for the moment (I'd like to move to 2.11 soon though).

You can find most of our scala code in here:

  • The core package is usually used for generic functions that can be reused (I think your code for parsing swagger definitions could live in there, for instance).
  • The job package is about the Spark jobs.

We don't have official WMF style guidelines, but informally we follow some (you'll pick them up from the code :)
The libraries we use for common things are:

  • scalatest for unit-testing
  • nscala-time for time related stuff (based on Joda)
  • scopt for argument parsing
  • scala-uri for simple uri building
  • and finally log4j for logging (but we're not that good at doing proper logging)

Don't hesitate if you have other questions :)
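For example, argument parsing with scopt might look like this (a hypothetical job and options, using the scopt 3.x API):

```scala
object ArgsExample {
  case class Params(topic: String = "webrequest_text", graphiteHost: String = "localhost")

  def main(args: Array[String]): Unit = {
    val parser = new scopt.OptionParser[Params]("restbase-entrypoint-metrics") {
      opt[String]("topic")
        .action((x, c) => c.copy(topic = x))
        .text("Kafka topic to consume (default: webrequest_text)")
      opt[String]("graphite-host")
        .action((x, c) => c.copy(graphiteHost = x))
        .text("Graphite host to send metrics to")
    }
    parser.parse(args, Params()) match {
      case Some(params) => // ... run the job with params
      case None         => sys.exit(1)
    }
  }
}
```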

I know this might come a bit late, but wouldn't this be a good candidate for using the tagging process we are defining?
see:
https://docs.google.com/document/d/1yc3VDa6JIp_nvpszAvM_LYvFiJ01WJrLBrzHWl89SM0/edit
and:
https://gerrit.wikimedia.org/r/#/c/353287/

cc @JAllemandou

In rare cases, when we need it, we can run a query on Hadoop.