Introduce a new AQS endpoint to expose video plays
Closed, ResolvedPublic

Description

Part of getting T198628: Count the number of video plays done. The existing AQS endpoint currently reports video plays and poster image loads as a single number. Since most people don't click on videos, poster loads heavily dilute that number, rendering it useless for measuring video plays.

There is a patch that would start ignoring poster image loads, but that would change the definition of the data halfway through. A better solution would be to add a column splitting poster loads from actual plays, keep summing the two in the old endpoint, and introduce a new AQS endpoint that provides both numbers separately (so people can also measure the click-rate ratio).
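
A minimal sketch of the split proposed above, assuming one row per file and time bucket. `play_requests` matches the column name used in the later patches on this task; `poster_requests`, the class name, and the choice of plays divided by poster loads as the click-rate are illustrative assumptions, not the actual AQS schema or response format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VideoRequestCounts:
    """Counts for one video in one time bucket (hypothetical shape)."""

    poster_requests: int  # assumed column name for poster image loads
    play_requests: int    # column name taken from the patches on this task

    @property
    def total_requests(self) -> int:
        # What the old endpoint would keep reporting: the sum of both.
        return self.poster_requests + self.play_requests

    @property
    def click_rate(self) -> Optional[float]:
        # Plays per poster load; undefined when no poster was loaded.
        if self.poster_requests == 0:
            return None
        return self.play_requests / self.poster_requests


counts = VideoRequestCounts(poster_requests=1000, play_requests=50)
print(counts.total_requests, counts.click_rate)
```

Keeping the sum derivable from the two columns means the old endpoint's definition stays stable while the new endpoint exposes the parts.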

Event Timeline

Ahoelzl added subscribers: GGoncalves-WMF, Ahoelzl.

@Ladsgroup we need to understand priority and scope. @GGoncalves-WMF will reach out to you.

The question is do our readers watch videos on our platform? This is a question we do not know the answer to as we have never had this data. Getting this data is key to increasing video efforts within our movement. I highly support this work as being of top priority. @Yaron_Koren has done some work within this area.

Having spoken to @Ladsgroup this morning (thanks!), here's my notes on this task.

There are three potential quality issues in how we surface video plays:

  1. As previously pointed out, AQS counts both poster views and videos being served as "mediarequests", which is not usable for counting video plays. The proposed new endpoint, semantically capturing video plays and capturing poster vs plays as dimensions, sounds like a good idea to me. This should be the scope of this task, and I think is the highest priority.
  2. The incoming rollout of MPEG-DASH, driven mainly by reliability, will also mean we need to adapt our video play metric. This belongs in a separate ticket at medium priority, and @Ladsgroup is the point of contact to coordinate with. It will take a few weeks for MPEG-DASH to actually happen.
  3. There is a suspicion that Apple devices request video one chunk at a time for playback (probably due to their HLS implementation), and each chunk is currently counted as a separate mediarequest. We haven't quantified this yet, and pending that, I'd put this at lowest priority of the three. Further investigation also belongs in a separate ticket.

Why is all of this important? T198628#10567446 is a good summary from WMDE (who depend on this metric for strategic partnerships).

What about timelines? We'll need input from @Ahoelzl, but this doesn't look like a very complex task and we can try to get started in a couple of weeks.

@AndrewTavis_WMDE @Ladsgroup , I’ve put together a design document outlining the proposed endpoints. When you have a chance, please review it—particularly the API design section—and let me know if the proposed endpoints cover your requirements or if there are any additional endpoints we should consider.
cc @Eevans for the cassandra tables in the serving layer

Thanks for the ping on this, @Snwachukwu :) I'll give this a look and get back to you!

Change #1250005 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[analytics/refinery@master] Add videoplays Cassandra loads for video-requests AQS API (T415202)

https://gerrit.wikimedia.org/r/1250005

I understand the focus on video (as there is a very specific problem there right now), but I'd like to caution that audio and video are not too dissimilar, so not sure if we should be filtering out audio media and naming endpoints videorequests...

For instance
1: it is not impossible to have a poster image for an audio file. We don't do it right now, but it has actually been a pretty longstanding request to have that.
2: If we start mpeg-dash streaming and tracking how much of a media file is consumed, we'd want that for both audio and video as well.
3: while most of the video formats ending in video extensions are video... they can also be audio-only. So if we are splitting this on file extension (which I think we do based on the docs), that might not be 100% accurate. For instance, it shows that ogg is classified as audio, but ogg can be either audio or video. (P.S. not sure where this mapping is made, but might want to check if mpeg is in it, as that is missing from the documentation)
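
To illustrate point 3, here is a rough sketch of an extension lookup that returns every kind a file could be instead of forcing a single label. The extension sets are illustrative assumptions and are not the mapping the pipeline actually uses (which, as noted, still needs to be located).

```python
from pathlib import PurePosixPath
from typing import FrozenSet

# Hypothetical mapping, for illustration only.
AUDIO_ONLY_EXTENSIONS = frozenset({"mp3", "wav", "flac", "oga"})
# Container formats that may hold video, or audio only.
CONTAINER_EXTENSIONS = frozenset({"ogg", "ogv", "webm"})


def possible_kinds(file_name: str) -> FrozenSet[str]:
    """Return the media kinds a file could be, judged by extension alone."""
    extension = PurePosixPath(file_name).suffix.lower().lstrip(".")
    if extension in AUDIO_ONLY_EXTENSIONS:
        return frozenset({"audio"})
    if extension in CONTAINER_EXTENSIONS:
        # The extension cannot tell us whether a video track is present.
        return frozenset({"audio", "video"})
    return frozenset()


print(sorted(possible_kinds("Example.ogg")))
```

Anything that comes back with more than one kind would need metadata beyond the file name to classify reliably.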

Not throwing up a roadblock, but I think for the stability of the api, this should probably be taken into account.
We could just duplicate this suggested approach and call it audiorequests, or create something like /avrequests/ or /timedmediarequests/ or something

Having spoken to @Ladsgroup this morning (thanks!), here's my notes on this task.

I think this is slightly conflating a few points of cause and effect.

In the way we currently serve a/v material, browsers do 'progressive downloads' using the Range header, where the server replies with 206 Partial Content. Years ago when I looked into media stats collection, we accounted for this by counting as plays only those mediarequests whose Range header value started with 0- (start of file). I assume that's still the case, but I don't remember where this was defined.

But this Range: 0- ONLY works as long as:

  • there are NO poster downloads in that same bucket. (which we seem to solve here)
  • there are NO HLS and/or MPEG-DASH streams being served (both currently disabled, but the historic numbers from the HLS timeframe cannot be trusted, and there seem to be plans to introduce MPEG-DASH, where this problem might recur)
  • we do NOT preload files into the html document before playback. (We currently do not do this, for efficiency reasons, but it's a pretty big assumption to make. I'd prefer to eventually change that on the File and the embed view, for instance.)
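
For reference, the Range: 0- heuristic described above could look roughly like this. This is a sketch under assumptions: the function name and the requirement of a 206 status are mine, and the rule actually applied in the pipeline may differ.

```python
import re
from typing import Optional

# Matches a single byte range starting at offset 0, e.g. "bytes=0-" or "bytes=0-1023".
_RANGE_FROM_START = re.compile(r"^bytes=0-\d*$")


def looks_like_play_start(range_header: Optional[str], status: int) -> bool:
    """Heuristic: a partial-content response to a range request that begins at byte 0."""
    if range_header is None:
        return False
    return status == 206 and bool(_RANGE_FROM_START.match(range_header.strip()))


print(looks_like_play_start("bytes=0-", 206))     # start of file
print(looks_like_play_start("bytes=1024-", 206))  # a later chunk or a seek
```

Note that the check only inspects a single request, which is why it cannot tell apart the cases listed above (poster loads in the same bucket, segmented streams, preloading).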

There are additionally three more situations that I know are a potential quality problem:

  • Just because a browser can START the download of a stream, doesn't mean it can PLAY it. It might stop downloading and choose a DIFFERENT listed format for the same original title and begin downloading that. This should count as ONE playback, but currently probably counts as 2 (or in the worst case as 4?). The browser will attempt to avoid this, but for some formats it can only figure this out by playing the stream.
  • Similarly, users may switch between multiple resolutions of the same file during playback. As the player needs to know some basic info about a file, it will also start that at the 0- and then quickly seek to the offset that belongs to the same timecode. This however, counts as 2 playbacks. It's pretty rare, so for now can be ignored.
  • A download of the full file (browser download via right click save as or with the ?download param) probably counts as a 'view' as well, as that too uses progressive downloads (if this should count as a view is debatable)

And then I think @Ladsgroup found something weird with Safari, which is the point three being referred to, but I'm not exactly clear what that is and if it's not just the result of some of the above things.

So the concern is that, considering that there can be multiple HTTP requests for 1 playback, we are making fragile assumptions about what counts as a 'play'. These assumptions are known to be faulty, and we can easily break the metrics when we make changes to how media playback works. The one mitigation in place is the Range: 0- check, but it doesn't cover all the faults and breaks even more for HLS and MPEG-DASH.

Priorities are first of all separating out the poster downloads (quick win, large impact on current quality) and later the issue with multiple requests per play (esp. once we start using MPEG-DASH, so that we can retain quality of the data and possibly even improve it for the old style as well).

Providing some feedback from @Ben.buchenau and me about this process:

  • We're wondering whether it would make sense to have a table that has unaggregated data that would also include the ip of the request (along with the user_agent)
    • On our end it would be helpful to include this so that the processes that use this table can mirror those that we have that work with wmf.webrequest
  • This of course is assuming that you all would be fine with doing a raw table rather than an aggregated one
  • If the issue of multiple requests that @TheDJ is talking about persists with this new process, then it might make sense on our end to keep our currently running process that looks into unique views per user_agent and ip combination on a daily basis

Overall if the proposed dataset and API work for WMF and the community's purposes, we're fine to keep our current process :)
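
As a rough illustration of the daily unique-view process mentioned above, the sketch below counts each (ip, user_agent) combination at most once per file per day. The record shape and field names are assumptions, and counting per file is my reading of the description rather than a confirmed detail of the WMDE process.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Request(NamedTuple):
    """One raw media request (hypothetical fields, loosely modelled on a request log)."""

    day: str  # e.g. "2026-04-01"
    ip: str
    user_agent: str
    file_path: str


def daily_unique_views(requests: Iterable[Request]) -> Dict[Tuple[str, str], int]:
    """Count distinct (ip, user_agent) pairs per (day, file_path)."""
    seen: Set[Request] = set()
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for request in requests:
        if request in seen:
            continue  # repeat request from the same client on the same day
        seen.add(request)
        counts[(request.day, request.file_path)] += 1
    return dict(counts)


sample = [
    Request("2026-04-01", "198.51.100.7", "UA-1", "/a.webm"),
    Request("2026-04-01", "198.51.100.7", "UA-1", "/a.webm"),  # second chunk, same client
    Request("2026-04-01", "198.51.100.8", "UA-1", "/a.webm"),
]
print(daily_unique_views(sample))
```

Deduplicating this way sidesteps the multiple-requests-per-play problem, at the cost of needing unaggregated rows with ip and user_agent.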

Change #1250659 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[generated-data-platform/aqs/media-analytics@main] Add video-requests API

https://gerrit.wikimedia.org/r/1250659

Thanks @Snwachukwu.

It looks like the tables in your design doc are fashioned after the old legacy RESTBase tables, so we'll need to refactor them some. I'm happy to help with that, and if you're comfortable doing so you can give me edit access and I'll work directly there. Let me know.

Also worth mentioning is that —unlike the mediarequest tables & endpoints— these endpoints can use the Data Gateway, instead of querying Cassandra directly. The Commons Impact endpoints are probably a good source of examples for what that looks like.

@Eevans thank you for taking a look at the design doc. We decided to reuse the existing mediarequest Cassandra tables to avoid reloading the keys, and to just add new columns with the needed values to them. I will update the design doc with the proposed columns.

Also worth mentioning is that —unlike the mediarequest tables & endpoints— these endpoints can use the Data Gateway, instead of querying Cassandra directly. The Commons Impact endpoints are probably a good source of examples for what that looks like.

You're right indeed. I just took a look at Data Gateway. I have given you edit access to the doc. I will also make the necessary changes to the design doc.

Thanks for the context, @TheDJ ! I've filed T419879 to address measurement issues with the introduction of MPEG-DASH (which, as I understand it, is happening to video only at the moment), and T419882 for us to revise the classification of media files based on extension. Agree that it makes sense to focus this ticket on separating out poster downloads for video, and we should look at MPEG-DASH as it is enabled.

Having spoken to Sandra, I think it also makes sense to just mirror this solution for audio when the need arises, as a separate API.

As for @AndrewTavis_WMDE 's request...

We're wondering whether it would make sense to have a table that has unaggregated data that would also include the ip of the request (along with the user_agent)

I'd like to hear more about how you're currently using this data, and what processes you currently have in place. I'll reach out soon :)

@Snwachukwu I've opened r1251445 against the media-analytics repo and assigned you as a reviewer. Ideally we like to keep the schema definition in the repo of the code that uses it where they can move in versioned lock-step, but I think these AQS 2 repos slipped through the cracks. As a result, I added one file to hold the original schema, and a second that adds your new columns. It's this latter one that we'll use to update the production environments, so if you can have a look at this, it'd be appreciated.

The process is to first roll out to staging, and then once you're satisfied with the result, to production. I'll be waiting for feedback from you (and/or your team) before proceeding with either so let me know when you're ready!

Thank you @Eevans . I have left you a comment on your patch

Change #1255289 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[generated-data-platform/aqs/media-analytics@main] Change plays_requests column name to play_requests

https://gerrit.wikimedia.org/r/1255289

Change #1255289 merged by jenkins-bot:

[generated-data-platform/aqs/media-analytics@main] Change plays_requests column name to play_requests

https://gerrit.wikimedia.org/r/1255289

Change #1260282 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[generated-data-platform/aqs/media-analytics@main] Add video_filesJSON column to cassandra mediarequest_top_files table

https://gerrit.wikimedia.org/r/1260282

Change #1260282 merged by jenkins-bot:

[generated-data-platform/aqs/media-analytics@main] Add video_filesJSON column to cassandra mediarequest_top_files table

https://gerrit.wikimedia.org/r/1260282

Change #1250659 merged by jenkins-bot:

[generated-data-platform/aqs/media-analytics@main] Add video-plays API

https://gerrit.wikimedia.org/r/1250659

Change #1264680 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/deployment-charts@master] Deploy Videoplay Endpoint to staging

https://gerrit.wikimedia.org/r/1264680

Change #1264680 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy Videoplay Endpoint to staging

https://gerrit.wikimedia.org/r/1264680

Change #1265481 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[generated-data-platform/aqs/media-analytics@main] Upgrade Runtime base from bullseye to bookworm

https://gerrit.wikimedia.org/r/1265481

Change #1265481 merged by jenkins-bot:

[generated-data-platform/aqs/media-analytics@main] Upgrade Runtime base from bullseye to bookworm

https://gerrit.wikimedia.org/r/1265481

Change #1265523 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/deployment-charts@master] Media-analytics Image version change

https://gerrit.wikimedia.org/r/1265523

Change #1265523 merged by jenkins-bot:

[operations/deployment-charts@master] Media-analytics Image version change

https://gerrit.wikimedia.org/r/1265523

Change #1265555 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/deployment-charts@master] Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts.

https://gerrit.wikimedia.org/r/1265555

Change #1265555 merged by jenkins-bot:

[operations/deployment-charts@master] Update Media-analytics helmfile.d global-staging to use cassandra Staging Hosts.

https://gerrit.wikimedia.org/r/1265555

Change #1250005 merged by Snwachukwu:

[analytics/refinery@master] Extend mediarequest Cassandra loads with poster/plays for video-requests API

https://gerrit.wikimedia.org/r/1250005

Change #1266323 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/deployment-charts@master] Media Aanlytics Production Image Version Change

https://gerrit.wikimedia.org/r/1266323

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:30:34Z] <ebysans@deploy1003> Started deploy [analytics/refinery@fa28ad8] (hadoop-test): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 TEST [analytics/refinery@fa28ad83]

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:32:26Z] <ebysans@deploy1003> Finished deploy [analytics/refinery@fa28ad8] (hadoop-test): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 TEST [analytics/refinery@fa28ad83] (duration: 01m 52s)

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:33:23Z] <ebysans@deploy1003> Started deploy [analytics/refinery@fa28ad8]: Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83]

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:37:38Z] <ebysans@deploy1003> Finished deploy [analytics/refinery@fa28ad8]: Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] (duration: 04m 15s)

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:38:01Z] <ebysans@deploy1003> Started deploy [analytics/refinery@fa28ad8] (thin): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83]

Mentioned in SAL (#wikimedia-operations) [2026-04-01T17:39:55Z] <ebysans@deploy1003> Finished deploy [analytics/refinery@fa28ad8] (thin): Extend mediarequest Cassandra loads with poster/plays for video-requests API T415202 [analytics/refinery@fa28ad83] (duration: 01m 53s)

Change #1266323 merged by jenkins-bot:

[operations/deployment-charts@master] Media Aanlytics Production Image Version Change

https://gerrit.wikimedia.org/r/1266323

Change #1267136 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/deployment-charts@master] Add rest gateway routes for video_plays path.

https://gerrit.wikimedia.org/r/1267136

Change #1267136 merged by Snwachukwu:

[operations/deployment-charts@master] Add rest gateway routes for video_plays path.

https://gerrit.wikimedia.org/r/1267136

Change #1267147 had a related patch set uploaded (by Snwachukwu; author: Snwachukwu):

[operations/deployment-charts@master] Add rest gateway routes for video_plays path production.

https://gerrit.wikimedia.org/r/1267147

Change #1267147 merged by jenkins-bot:

[operations/deployment-charts@master] Add rest gateway routes for video_plays path production.

https://gerrit.wikimedia.org/r/1267147

Thank you! let me play with it and come back to you

Also, it's weird that it says 60,000 plays also for the last link starting at April 1st. So this can only be right when the data is not older than April 1st

@Ladsgroup, your assumption is indeed correct. We started running it on April 1st, so I don't expect any data older than April 1st to be there.

Regarding the extra 8k from 6 years ago, I will look into it.

Also, it's weird that it says 60,000 plays also for the last link starting at April 1st. So this can only be right when the data is not older than April 1st

Would you need backfilled data? We can only backfill 90 days of data. @Ladsgroup @New_York-air

Hi @Snwachukwu, thank you for looking into it.
If it's not too much effort, this would be great. When presenting the correct numbers to our stakeholders, we should have accurate numbers over a long time period (longer than one month).

I updated my tool to get videoplays.

Hi @Ladsgroup, the following URL seemed to work during the hackathon, but now it returns 0 for every video: https://mvc.toolforge.org/index.php?category=Videos+of+scientists+at+the+University+of+Innsbruck&timespan=now-7&videoplays=1 (also w/o the videoplays=1). Also the cover images no longer load. Am I misusing this tool?

Hi @Ladsgroup, I debugged your tool a bit: You seem to be using the imageinfo...url to compute the filepath. But the newly added ?utm_source=commons.wikimedia.org&utm_campaign=imageinfo&utm_content=original seems to be causing problems and needs to be removed.

        "imageinfo": [
          {
            "url": "https://upload.wikimedia.org/wikipedia/commons/e/e1/Forschung_zur_Hohen_Birga_an_der_Universit%C3%A4t_Innsbruck.webm?utm_source=commons.wikimedia.org&utm_campaign=imageinfo&utm_content=original",
            "descriptionurl": "https://commons.wikimedia.org/wiki/File:Forschung_zur_Hohen_Birga_an_der_Universit%C3%A4t_Innsbruck.webm",
            "descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=156552377"
          }
        ]

I'll prepare a patch. Here it is: https://gitlab.wikimedia.org/toolforge-repos/mvc/-/merge_requests/1
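
For illustration, one way to make the path computation robust to the added query string is to parse the URL and keep only its path component. This is a sketch in Python and not the content of the merge request above; the function names are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def file_path_from_imageinfo_url(url: str) -> str:
    """Return the URL path, ignoring any query string such as the utm_* parameters."""
    return urlsplit(url).path


def file_name_from_imageinfo_url(url: str) -> str:
    """Return the percent-decoded last path segment (the file name)."""
    return unquote(file_path_from_imageinfo_url(url).rsplit("/", 1)[-1])


url = (
    "https://upload.wikimedia.org/wikipedia/commons/e/e1/"
    "Forschung_zur_Hohen_Birga_an_der_Universit%C3%A4t_Innsbruck.webm"
    "?utm_source=commons.wikimedia.org&utm_campaign=imageinfo&utm_content=original"
)
print(file_path_from_imageinfo_url(url))
print(file_name_from_imageinfo_url(url))
```

Parsing the URL rather than stripping a known suffix keeps the lookup working if the tracking parameters change again.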

Oh, I'm so sorry you had to see that code. I'm planning to fully rewrite it. I'll apply your change ASAP.

Speaking of media metrics: I noticed there is a beacon used by MultimediaViewer to log how long an image was viewed in the media viewer.

  1. where does that end up ?
  2. maybe we can use it in TMH for watchtime ?

Applied the fix. Once I find a bit of time, I will apply all the fixes and do some rewrites.