
Dataset of top-requested JPEG thumbnails
Open, Needs Triage | Public | BUG REPORT

Description

Could you please publish a dataset of the most-requested JPEG thumbnails?

Google recently published a new JPEG coding library, Jpegli, which it claims produces more compact and better-looking JPEGs than other libraries in common use. I would like to evaluate its performance on a representative sample of Wikimedia JPEG traffic, to see what the potential benefit would be for Wikimedia users and infrastructure.

The AQS mediarequests API endpoint does not help, because it does not allow filtering by image file type, and because it aggregates requests by the original filename, so there is no information about the requested resolutions. But its existence is relevant to this request inasmuch as it is a precedent for the public release of data on frequently requested images.

What I'm hoping for is something like:

  • Top 1,000 JPEGs by request count: for frequency relevance.
  • Top 1,000 by bandwidth usage: for economic impact.
  • Random sample of 1,000 images: for statistical coverage of various types and properties of images.
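
To make the three lists concrete, here is a minimal Python sketch of the selection. It assumes the request log can be reduced to (url, response_bytes) pairs; that record shape, the function name select_samples, and the seed parameter are all hypothetical, not an existing Wikimedia interface. It also assumes the random sample is drawn uniformly over distinct URLs rather than weighted by traffic.

```python
import random
from collections import Counter


def select_samples(requests, n=1000, seed=0):
    """Pick the three samples from (url, response_bytes) pairs.

    The input format is an assumption made for illustration only.
    """
    hits = Counter()       # number of requests per URL
    bandwidth = Counter()  # total bytes served per URL
    for url, size in requests:
        hits[url] += 1
        bandwidth[url] += size

    top_by_requests = [url for url, _ in hits.most_common(n)]
    top_by_bandwidth = [url for url, _ in bandwidth.most_common(n)]

    # Uniform sample over distinct URLs (assumption); sorting first
    # makes the result reproducible for a given seed.
    urls = sorted(hits)
    rng = random.Random(seed)
    random_sample = rng.sample(urls, min(n, len(urls)))

    return top_by_requests, top_by_bandwidth, random_sample
```

The two top lists can differ a lot: a large image requested a few times may outrank a small, very popular thumbnail on bandwidth, which is why both are requested.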

It makes sense to sample and aggregate by distinct URL, excluding query parameters. For example, these two should be considered distinct:

To ensure all JPEGs are considered, it might make sense to filter by response Content-Type image/jpeg, which will include both .JPG and .JPEG, as well as any other variations (e.g. differences in letter case).
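
As a rough illustration of the two rules above (aggregate by URL without query parameters, filter on the response type rather than the file extension), something like the following Python sketch would do. The helper names normalize_url and is_jpeg_response are hypothetical, and the example.org URLs in the usage note are placeholders, not real thumbnail URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_jpeg_response(content_type):
    """True if a Content-Type header value denotes image/jpeg.

    Any parameters after ';' are ignored and the comparison is
    case-insensitive, so the file extension never matters.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "image/jpeg"
```

With these helpers, https://example.org/thumb/a.JPG/120px-a.JPG?x=1 and https://example.org/thumb/a.JPG/120px-a.JPG collapse to the same key, while a 240px variant of the same file stays a separate key because its path differs.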

If it's feasible to provide this over a year's worth of requests, that would be great. Otherwise, a month would probably also be OK.