
Pageview API: Limit (and document) size of data you can request
Closed, DeclinedPublic

Description

Pageview API: Limit (and document) size of data you can request

Based on the discussion below, and improved performance, we have decided to limit only the daily granularity endpoints to returning no more than 6 months of data. If more than 6 months is requested, we will return an error to be clear about what's going on (as opposed to the more friendly but confusing truncation of the response).

The main reason for this limit is not performance, or caching, but simply that any API should have some sort of limit. If it's too restrictive, please let us know and we'll consider a higher limit.
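
For illustration only, the check described above could look roughly like the sketch below; the 183-day threshold, function name, and error shape are assumptions, not the deployed implementation.

```python
from datetime import datetime

# Assumed value for "6 months" of daily data; the actual configured limit may differ.
MAX_DAILY_SPAN_DAYS = 183

def check_daily_span(start: str, end: str) -> None:
    """Reject daily-granularity requests that span more than ~6 months.

    `start` and `end` are YYYYMMDD strings, as used by the pageview API.
    """
    span = (datetime.strptime(end, "%Y%m%d") - datetime.strptime(start, "%Y%m%d")).days + 1
    if span > MAX_DAILY_SPAN_DAYS:
        # Return an explicit error rather than silently truncating the response.
        raise ValueError(
            f"Requested {span} days of daily data; the limit is {MAX_DAILY_SPAN_DAYS} days."
        )
```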

Event Timeline

Limiting window sizes also provides an opportunity to introduce aligned time windows, which would greatly improve cache effectiveness. Examples:

  • Specific day: /20160101/
  • One month: /201601/

This could replace the current start / end date parameters.

@GWicke: We plan to limit the volume of data requested but not the way to request it, meaning we'll probably keep start and end dates.
If you're interested, I have made a quick analysis of response status codes and cache statuses over a month of requests: as expected, cache misses are almost only for per-article requests, and are not due to time ranges but to article variability.

@JAllemandou, are there significant cache hit rates for per-article requests at all right now? Every time I looked into timeouts, it was for > 1 month's worth of per-article data, with random start / end dates.

Edit: @JAllemandou shared some data with me. Current hit rates for per-article requests (> 70% of overall request volume) are around 6.9%.

@GWicke : Some more data and more or less expected results.

  • In current request patterns, ~80% of requests are for fresh data (end date either today or yesterday). Changing data access patterns wouldn't really help caching for this type of request.
  • If request patterns evolve toward asking for more historical data, then changing the data access could help somewhat: the number of different articles requested is about 1/3 of total requests in the current pattern, so the theoretical best cache hit ratio (with no TTL issues, etc.) could move closer to 65%, which would be a 10x increase and huge (see the back-of-the-envelope calculation below).
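
To spell out the arithmetic behind that estimate (a back-of-the-envelope sketch only, not a new measurement):

```python
# Back-of-the-envelope estimate, assuming perfect caching (no TTL expiry or eviction).
measured_hit_ratio = 0.069   # current per-article hit rate quoted above
distinct_fraction = 1 / 3    # distinct articles requested ~ 1/3 of total requests

# Only the first request for each distinct key has to be a miss.
theoretical_hit_ratio = 1 - distinct_fraction
print(f"theoretical ceiling: {theoretical_hit_ratio:.0%}")                  # ~67%, i.e. "closer to 65%"
print(f"vs. measured: {theoretical_hit_ratio / measured_hit_ratio:.1f}x")   # ~9.7x, the "10x increase"
```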

In current request patterns, ~80% of requests are for fresh data (end date either today or yesterday)

From what I have seen, there is still a significant spread of values in the *start* date, which does fragment caches. Introducing fixed windows (single day, single month) could eliminate this fragmentation.

Introducing fixed windows (single day, single month) could eliminate this fragmentation.

But wait, that comes at the cost of reducing functionality, right? It is not the same to ask for data for January as for data for the last 30 days.

But wait, that comes at the cost of reducing functionality, right? It is not the same to ask for data for January as for data for the last 30 days.

The data is available either way (which satisfies one interpretation of functionality), but I agree that good window selection is needed to make sure that common use cases are covered well. If requesting 'last 30 days' is a common need (I suspect it is), then making that one of the pre-set windows would be useful.

Amended syntax strawman (see the sketch after the list):

  • Specific day: /20160101/
  • One month: /201601/
  • Last 30 days: /-30d/
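
A minimal sketch of how a client could build these hypothetical path segments; the /-30d/ form is part of the strawman and does not exist in the current API:

```python
from datetime import date
from typing import Optional

def window_path(day: Optional[date] = None, month: Optional[date] = None,
                last_30_days: bool = False) -> str:
    """Build one of the strawman fixed-window path segments."""
    if last_30_days:
        return "/-30d/"                 # last 30 days
    if month is not None:
        return f"/{month:%Y%m}/"        # one month, e.g. /201601/
    if day is not None:
        return f"/{day:%Y%m%d}/"        # specific day, e.g. /20160101/
    raise ValueError("one of day, month or last_30_days must be given")

print(window_path(day=date(2016, 1, 1)))      # /20160101/
print(window_path(month=date(2016, 1, 1)))    # /201601/
print(window_path(last_30_days=True))         # /-30d/
```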

I really hope we will still be able to query for arbitrary ranges. One use case is the Did You Know project on enwiki, where an editor's article makes the main page, and we are interested in seeing how this affected the pageviews. Here something like a 10-day period before and after the target date (so 21 days total) offers the best illustration, giving us an idea of typical view counts before the article became DYK, and the lingering effects after it was on the main page.

If we must put a cap on the number of days of data per request, my hope is it will be no less than 90. I really like the original proposal of 6 months. A wider range is super helpful, for instance when looking at the total pageviews to a project, to see if wiki usage is going down or up, or if a particular campaign has had success.

@MusikAnimal, the trade-off is between performance & client convenience. In either scheme, you can load the full data for any time frame. In the per-month scheme you'll potentially have to make more requests than in the arbitrary-range scheme. However, with caching and parallelism, you are likely to get your answers more quickly.

Whether the extra request complexity matters depends on how you use the API. A thin API wrapper library could expose the same arbitrary-range function on the outside & make the right fixed-range requests, in parallel, under the hood.
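
A minimal sketch of such a wrapper, assuming the existing per-article REST endpoint and calendar-month-aligned splitting under the hood; the function names, fixed project/access/agent values, and the month-window choice are assumptions, not an existing library:

```python
import calendar
import concurrent.futures
from datetime import date, timedelta

import requests

# Current per-article endpoint (daily granularity); project/access/agent fixed for brevity.
API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/{article}/daily/{start}/{end}")

def month_windows(start: date, end: date):
    """Split [start, end] into calendar-month-aligned sub-ranges."""
    cur = start
    while cur <= end:
        last_day = date(cur.year, cur.month, calendar.monthrange(cur.year, cur.month)[1])
        yield cur, min(last_day, end)
        cur = last_day + timedelta(days=1)

def fetch(article: str, start: date, end: date) -> list:
    url = API.format(article=article, start=f"{start:%Y%m%d}", end=f"{end:%Y%m%d}")
    resp = requests.get(url, headers={"User-Agent": "example-wrapper/0.1"})
    resp.raise_for_status()
    return resp.json()["items"]

def get_views(article: str, start: date, end: date) -> list:
    """Arbitrary-range interface on the outside, month-aligned requests underneath."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        chunks = list(pool.map(lambda w: fetch(article, *w), month_windows(start, end)))
    return [item for chunk in chunks for item in chunk]
```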

@GWicke Got it! Makes perfect sense, you may disregard my concerns :) An API wrapper is a bit of work, but should be fun to implement.

I will note that I've never actually had a single request time out on me, regardless of the range. The only time I have issues is with high-volume access, e.g. with Massviews where we make up to 500 per-article requests. I've forced that app to only make a request every 250ms, as opposed to all at once, but we still see "Error in Cassandra backend" with continued usage.
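
The throttling described above amounts to something like the following sketch (Python for illustration, since Massviews itself is client-side JavaScript; the article list and date range are placeholders):

```python
import time

import requests

# Hypothetical article list; Massviews can issue up to ~500 of these per run.
ARTICLES = ["Albert_Einstein", "Marie_Curie", "Ada_Lovelace"]

API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/{article}/daily/20160101/20160131")

def fetch_throttled(articles, delay_s=0.25):
    """Issue one per-article request every `delay_s` seconds instead of all at once."""
    results = {}
    for article in articles:
        resp = requests.get(API.format(article=article),
                            headers={"User-Agent": "example-throttle/0.1"})
        resp.raise_for_status()
        results[article] = resp.json()["items"]
        time.sleep(delay_s)  # ~250 ms spacing between requests, as described above
    return results
```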

Would these new API limitations help alleviate that issue, in addition to T124314?

@MusikAnimal : to be clear we are not going to implement these suggestions:

  • Specific day: /20160101/
  • One month: /201601/
  • Last 30 days: /-30d/

As they actually do not help that much with our cache hit ratio problems.

We will be working on a better compaction strategy (to reduce timeouts due to queries that take too long to return) and on throttling (to avoid scenarios where the API is unavailable due to high request rates from a small set of clients).

As they actually do not help that much with our cache hit ratio problems.

I'm curious, what is this assertion based on?

@GWicke Please see @JAllemandou's comment above. Misses are due to article variability; some are due to time ranges too, but article variability seems to be the bigger issue.

@Nuria, we have a lot of endpoints that all vary on the title, and they have significantly higher hit rates at similar overall request rates. I wouldn't discount the usefulness of caching so easily.

@GWicke: I am pointing out that we are not doing these changes in the near future, given that we have quite a lot of work to do to make sure the storage layer actually works optimally, considering the problems we have had with it thus far.
Changing our parameter scheme might come after the changes we must do in the near term.

@GWicke : Since I was quite convinced by your idea, I did some more detailed analysis with the scheme you suggested.

Setup: For the month of March, I took every request on the per-article endpoint with GET method and 200 response code. I extracted the query day (the day part of the query timestamp), the start and end dates as days, and the meaningful query parameters (project, access, agent, granularity, article).

First Computation: expected hit ratio if we were to have daily caching with the current scheme (not perfect because the analysis is based on calendar days, but still), computed by counting the number of requests grouped on the fields described above and taking one request per group as a miss.
Results:

Reqs: 19,046,015
Hits: 2,323,168
Hit ratio: 12%
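
A rough reconstruction of that computation as a pandas sketch; the input file and column names are assumptions, and the actual analysis was presumably done against the request logs with different tooling:

```python
import pandas as pd

# Assumed columns, one row per GET/200 per-article request extracted from the logs:
# query_day, start, end, project, access, agent, granularity, article
reqs = pd.read_csv("per_article_requests_march.csv")

key_cols = ["query_day", "start", "end", "project", "access", "agent", "granularity", "article"]

# With daily caching, every group of identical requests made on the same day
# costs exactly one miss; the rest of the group would be hits.
groups = reqs.groupby(key_cols).size()
total = int(groups.sum())
misses = len(groups)          # one miss per distinct (day, parameters) combination
hits = total - misses
print(f"hit ratio: {hits / total:.0%}")
```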

First question: How come this theoretical hit ratio is double the actual one?

Second Computation: I generated, for each request, the list of requests that would have been needed to get the same data using the new scheme (possibly multiple months and/or last_30_days, limited to valid dates) and did the same requests-and-hits computation as in the first round.
Results:

Reqs: 25,038,022
Hits: 3,755,559
Hit ratio: 14%
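
For reference, the per-request expansion into the new scheme might look roughly like this; how "last 30 days" windows interact with month windows here is my guess at the logic, not necessarily what was simulated:

```python
from datetime import date, timedelta

def equivalent_fixed_windows(start: date, end: date, today: date):
    """Map one arbitrary [start, end] request onto the strawman fixed windows:
    a /-30d/ request if the range ends at (or near) today, plus whole-month requests."""
    windows = []
    if end >= today - timedelta(days=1):
        windows.append("/-30d/")
        end = today - timedelta(days=30)   # remaining historical part, if any
    cur = date(start.year, start.month, 1)
    while cur <= end:
        windows.append(f"/{cur:%Y%m}/")
        # advance to the first day of the next month
        cur = date(cur.year + cur.month // 12, cur.month % 12 + 1, 1)
    return windows

# ['/-30d/', '/201601/', '/201602/']
print(equivalent_fixed_windows(date(2016, 1, 10), date(2016, 3, 28), today=date(2016, 3, 29)))
```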

The increased number of requests and the very small gain in hit ratio make me think it's actually not worth it (at least not now; maybe later if usage evolves).

The increased number of requests and the very small gain in hit ratio make me think it's actually not worth it (at least not now; maybe later if usage evolves).

Fair enough.

Did you assume ~2 week TTLs in your simulation of per-day / per-month endpoint caching?

@GWicke : I did it the simple way: no daily endpoint (only monthly), and one day of TTL (which would be safe for recomputation, etc.).
I added some logic for a monthly TTL: it brings the hit ratio to 20%. Better, but not yet amazing.
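
The TTL-aware variant can be sketched as a simple replay over the request log; the column names and the exact expiry rule below are assumptions:

```python
import pandas as pd

def simulate_hits(reqs: pd.DataFrame, key_cols, ttl_days: int) -> float:
    """Replay requests in query-day order against a cache with a fixed TTL (in days).

    Assumes `query_day` is an integer day index (e.g. days since epoch)."""
    reqs = reqs.sort_values("query_day")
    expiry = {}          # cache key -> day the cached entry expires
    hits = 0
    for row in reqs.itertuples(index=False):
        key = tuple(getattr(row, c) for c in key_cols)
        day = row.query_day
        if key in expiry and day < expiry[key]:
            hits += 1
        else:
            expiry[key] = day + ttl_days   # miss: (re)populate the cache entry
    return hits / len(reqs)

# ttl_days=1 approximates the one-day-TTL run; ttl_days=30 the monthly-TTL run.
```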

It's also worth keeping in mind that traffic levels are not static, and current levels are fairly low. At double the traffic, these numbers will quickly look very different.

In any case, if you do take the plunge of limiting query periods (and thus complicating clients), then I think it's worth thinking a little ahead and using the opportunity to make the API better cacheable at the same time.

@GWicke : Thanks for your comments, they definitely helped in shaping the future :)

Milimetric triaged this task as Medium priority. Oct 17 2016, 4:04 PM
Milimetric updated the task description.
Milimetric moved this task from Event Platform to Backlog (Later) on the Analytics board.

With the new awesome AQS cluster, performance is not as much of an issue now, correct? I will simply say that with Tool-Pageviews, an "all-time" query, July 2015 to present day, is not at all uncommon. As we discussed above, it's easy for me to make an API wrapper breaking it out into separate queries so users can still get all the data they want, but I thought I'd still let you know. So if they ask for 10 articles since July 2015, that could turn 10 API queries into 30 individual queries. I'm unsure whether 30 queries with a smaller range is better than 10 with a wider range, in terms of performance.

We will test limits on response sizes and times and update this ticket when we have the data. If we are aiming for a 99th percentile of < 500 ms response time (for example), we want to allow requests for data that will not break our SLAs; whether that is 6 months or a year, we do not know quite yet. Makes sense?

We will test limits on response sizes and times and update this ticket when we have the data. If we are aiming for a 99th percentile of < 500 ms response time (for example), we want to allow requests for data that will not break our SLAs; whether that is 6 months or a year, we do not know quite yet. Makes sense?

Sounds good, thank you!