
Cloud Services shared IP (static NAT for external communications) often rate limited by YouTube for video downloads
Open, Stalled, Medium, Public

Assigned To
None
Authored By
Victor_Grigas
Oct 25 2019, 12:35 AM

Description

I've been having a consistent problem with video2commons today:

"Error: An exception occurred: DownloadError: ERROR: bFbKgtZM9As: YouTube said: Unable to extract video data"

Doesn't seem to matter which video it is, whether it's a CC-licensed video or a public domain one.

See also

Event Timeline


By coincidence I (using Faebot) have been trying to run my CDC video uploads from labs. The standard use of youtube-dl works directly from a terminal session, but when run on the grid engine I start getting

WARNING: unable to download video info webpage: HTTP Error 429: Too Many Requests

or the fatal error (the YouTube ID is just a real example)

youtube_dl.utils.DownloadError: ERROR: fWET2kNwdn8: YouTube said: Unable to extract video data

The same 'DownloadError' can mean that the video is blocked in that region, or removed as a copyvio, but that is not the case for the CDC.

The 'Too Many Requests' might be a combination of the specific WMF IP address plus the rapid querying of several playlists. However that's a bit odd considering that the code does work when not on the grid, unless the problem is that IP addresses used by the grid host are getting blocked by YouTube/Google while the IP addresses used via live sessions are not.

Note that I'm continuing to try from a command line, but as the recoding (mp4/mkv to webm) may take >12 hours for some videos, that means I'm locked out of running a terminal on labs while the project runs, plus it's against the guidelines of how labs is supposed to be used by us volunteers...

Anyone interested in checking the specific Python code can find it on /mnt/nfs/labstore-secondary-tools-project/faebot/pywikibot-core/scripts/Youtube_CDC2.py
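For anyone reproducing this from their own script, here is a minimal sketch of what such retry handling could look like with the youtube-dl Python API. The function name and backoff schedule are illustrative assumptions, not what Youtube_CDC2.py actually does, and backing off only helps with transient 429s, not a multi-day IP block:

```
# Hypothetical retry wrapper (not the actual Youtube_CDC2.py logic): catch the
# DownloadError that youtube-dl raises for both HTTP 429 and "Unable to extract
# video data", and back off before retrying.
import time
import youtube_dl

def download_with_backoff(video_url, max_attempts=4):
    opts = {'outtmpl': '%(id)s.%(ext)s'}
    for attempt in range(max_attempts):
        try:
            with youtube_dl.YoutubeDL(opts) as ydl:
                ydl.download([video_url])
            return True
        except youtube_dl.utils.DownloadError as err:
            # 429 and extraction failures both surface as DownloadError,
            # so the message text is the only thing to inspect.
            if '429' in str(err) or 'Unable to extract' in str(err):
                time.sleep(60 * 2 ** attempt)  # wait 1, 2, 4, 8 minutes
            else:
                raise
    return False
```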


Bastions have floating public IPs, so they could open port 22 to the public and you could ssh in directly without a jump host. Grid exec nodes are behind a cloud-wide NAT and share a single public IP.

@Fae .... Try running it in one of the Kubernetes Python shells:

webservice --backend=kubernetes python shell
~/.virtualenvs/cdc/bin/python ~/pywikibot-core/pwb.py Youtube_CDC_remote

@Phamhi good suggestion. Have not managed to get it to work so far. The Python script drops out without warning, even though I guess in theory the shell should behave in an identical way.


v2c runs from k8s and receives the same error message.

bd808 renamed this task from Consistent errors from Video2Commons from YouTube to Cloud Services shared IP (static NAT for external communications) often rate limited by YouTube for video downloads.Oct 29 2019, 8:38 PM
bd808 triaged this task as Medium priority.
bd808 moved this task from Inbox to Watching on the cloud-services-team (Kanban) board.
This comment was removed by Fae.

The ideal solution is obviously getting the Cloud VPS NAT IP a higher quota upstream with YouTube, but maybe we can find a way to get some things working in advance of that.

@zhuyifei1999 Does v2c typically do the downloads on Toolforge, or are the instances in the video Cloud VPS project actually doing that work? If it is the latter, we could try a temporary solution of adding public IPv4 addresses to the video instances to spread across more IPs which would hopefully give a larger quota from YouTube.


Toolforge instances (k8s pods) fetch metadata, the encoding cluster does both metadata fetching and actual downloading. The fetch metadata part already hit Error 429.
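For context, metadata fetching and downloading are separate calls in the youtube-dl Python API, so a rate-limited IP can fail at either step. A rough sketch follows; the video ID is just the one from the task description and the options are assumptions:

```
import youtube_dl

URL = 'https://www.youtube.com/watch?v=bFbKgtZM9As'

# Metadata-only fetch, roughly what the Toolforge k8s pods do.
with youtube_dl.YoutubeDL({'skip_download': True}) as ydl:
    info = ydl.extract_info(URL, download=False)
    print(info.get('title'), info.get('uploader'))

# Full download, roughly what the encoding cluster does; it re-fetches the
# metadata first, so the 429 can hit before any bytes are transferred.
with youtube_dl.YoutubeDL({'outtmpl': '%(id)s.%(ext)s'}) as ydl:
    ydl.download([URL])
```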

Looks like the rate limit is currently lifted :)

It seems impossible for me to use WMF cloud services to do the CDC video recoding. I have reverted to running an old mac mini as a headless server, which itself has experienced the YouTube "too many requests" problem, but my understanding is that this gets lifted after a day or two anyway.

If someone can explain how I can legitimately run an FFmpeg recoding job on webgrid and save the files on a WMF server, that would be useful. But this experience, including getting warnings about my work, has seriously discouraged me from relying on WMF cloud services in the future, primarily because of the massive waste of precious volunteer time it takes to keep testing and rewriting code to fit the ever-changing, non-specific and hard-to-understand "requirements" of this environment, compared to simply hosting a script on my own ancient kit.


Can you use video2commons? I don't know what CDC does, but I'd assume that is the download + transcode + upload, which is the same as what v2c does in the backend, which is on special-purpose instances, unlike toolforge's generic grid.


Looks like the rate limit is in effect again.

I'll bring this up with the WMF partner folks.

My experience running locally is that the YouTube IP block lasts around 2½ days. I can queue my processing, and let my programme keep testing the connection every few hours, but it's not reasonable for the average Commons user to see nothing happening for that long.

Update: It turns out that YouTube has escalating IP blocks. The second time around for the CDC uploads was a 5 day block, so it's fair to presume that any IP will rapidly become unusable. Either someone can work out what the maximum bandwidth/access thresholds are for pulling video information so any tool can stay just under that, or if YouTube is incommunicado, we have to consider the site hostile for any tool that can do this task.

Is anyone working on this?

Task is assigned to @zhuyifei1999 so he should respond :)

I was informed by @Matanya last Wednesday that Google will respond to us in 1-2 weeks.

zhuyifei1999 changed the task status from Open to Stalled.Dec 11 2019, 6:31 AM

They think it's a ToS violation... so... this gotta be difficult.

Damn, maybe get WMF legal to talk to them? I don't know how you can have cc-licensed videos on a site and think it's a ToS violation for them to be downloaded. Sounds like a knee-jerk reaction (to legit copyright infringement) to me.

In the meantime, it would be really useful to find a definition of the throttle limits for the service. If we are given a service level guide, like "20 video information queries in an hour", then at least it may be possible to manage our own queue and avoid IP blocks if we stay within it, or reliably farm out the queue if that is an acceptable practice.
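If YouTube ever publishes such a figure, a small client-side limiter in front of every video-information query would be straightforward. A sketch, assuming a purely hypothetical budget of 20 queries per hour:

```
import time
from collections import deque

class HourlyQuota:
    """Blocks until another call fits inside the per-hour budget."""

    def __init__(self, max_calls=20, window=3600):
        self.max_calls = max_calls   # assumed budget; YouTube has not published one
        self.window = window         # seconds
        self.calls = deque()

    def wait(self):
        now = time.time()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] > self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            time.sleep(self.window - (now - self.calls[0]))
        self.calls.append(time.time())

quota = HourlyQuota()
# quota.wait() would be called before each video-information query.
```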

Google isn't responding (they probably don't have the incentive to), gonna wait for a few more days. If it stays like this, I'm gonna do a massive overhaul of how v2c downloads from YouTube. Sneak peek: slimerjs + x11vnc


Need help with this, @zhuyifei1999? @Sturm called my attention to this task yesterday and indeed it seems like a very significant disruption that Google is imposing on us here.

@Xinbenlv is there anything you can do to help us here? I want to avoid using technical measures to make it seem as though our traffic comes from other sources, but I am starting to consider that solution as a viable option. Will that be a necessary step here? Is there anything you need from us to possibly expedite communications on Google's side?

Hello all,

Thank you for your comments and insights with respect to the recent suspension of API access that the Video2Commons tool was using. We share your desire to address this issue. The Wikimedia Foundation's Partnerships team, which is responsible for maintaining our long-term relationships with entities like Google/Alphabet/YouTube, has met with our contacts at YouTube to discuss this issue. We have not worked out any specifics, but they are indeed interested in working with us on a resolution.

As soon as we have more information and details to share, we will make them available here. We will continue to talk with our engineering teams on a long-term solution which will hopefully allow for a more streamlined way to upload Creative Commons license videos to Commons.

Thank you all for your interest and your patience while we work to inform our contacts at YouTube on the importance of this issue and explore a way forward. The Foundation respects the position many of you have taken, and agrees that any resolution with YouTube should capture both the spirit and stated intent of Creative Commons licenses.

@Varnent,

The last time I was on YouTube, I saw that they have removed the possibility of publishing videos under any free license other than CC0.
YouTube and Google are certainly anticipating possible paid access to the Wikidata and Commons APIs. They read the Wikimedia 2020-2030 strategy like everyone else...

I don't know if anyone shares my point of view, but I have the feeling that we're on the way to losing the vision of the free software movement, and maybe the soul of the original WWW, of which the Wikimedia movement is one of the most significant survivors... The CC0 license (on Wikidata and Commons) is already a godsend for all the companies that already have a monopoly on the users of online services. With CC0, they can use the work of Wikimedia projects' volunteers, without legal problems, to build new copyrighted services.

The free software movement and the Wikimedia movement are missing a tool to accomplish their mission: a CC-SA licence. Only copyleft can ensure that a Wikimedia contributor will never have to use a Google account, and thus become the company's commercial product, in order to effectively use the work that they and other digital workers have provided.

Here is the missing licence, retired in 2004 for lack of demand: https://creativecommons.org/licenses/sa/1.0/

Just curious if there is an update here? No rush of course!

Don-vip subscribed.

I'm also facing this issue in my tool. It runs on Toolforge/k8s to detect new CC videos published by Arianespace using the YouTube API, and it faces HTTP 429 errors too.

This is now affecting Citoid too; we're unable to get metadata from the page to cite YouTube videos in references :/ T254700

bd808 added a subscriber: Yael-weissburg.


This message from @Yael-weissburg on T254700: Citoid requests for YouTube metadata is giving 429: too many requests HTTP error was intended for this bug report:

Hello All,

Yael here, Director of Strategic Partnerships at the Foundation.

First, I want to apologize for the painfully long time it has taken me or anyone on my team to follow up on @Varnent's previous message. At first, the reason for the delay was ongoing (initially hopeful, productive-seeming) conversations with YouTube. More recently, the reason was... 2020 life getting in the way and me dropping the ball on communicating back to you all.

Unfortunately, my update is not what I would have wanted to share. Despite ongoing conversations (involving folks from Partnerships, Product and Legal from both organizations), we were not able to reach any resolution in our discussions with YouTube about this issue, and, unfortunately, I do not expect any changes from them coming in the future.

I'm personally disappointed about this, as I feel we offered them a potential way to work closely with the movement and support our mission with little risk or downside to their business. Unfortunately, they have chosen not to prioritize this at the moment.

I'm sorry I don't have happier news, and thank you to all of you who continued to try to find a solution. Again, my apologies that I haven't been more communicative about this (and thanks to @bd808 for continuing to nudge me).

Please don't hesitate to reach out to me directly at yweissburg@wikimedia.org if you have any questions.

Cheers,

Yael

Hey, just passing by to say that this problem seems to be occurring yet again. I don't know where we stand on the relationship with YouTube and whether there is some way to resolve this issue, but it had been working for the last few weeks, and since the start of the week this error started occurring again.

I've just tried to use video2commons one more time, and it still doesn't work... (see this screenshot)

Just a question to @Yael-weissburg, Wikimedia Foundation staff, and other participants in this topic. Why is the Wikimedia Foundation implementing Wikimedia Enterprise, a commercial project to help big tech companies use Wikimedia projects' content more easily for profit, when these same companies won't collaborate with us to grow our own content in a fair way, as with the video2commons tool?

@Varnent, @Don-vip, @Mvolz, do you have an idea?
Sometimes I wonder if the foundation is aware of the dangerous game it plays with big tech companies...

@Lionel_Scheepmans I am not sure what relevance the Meta RfC that you opened relating to the existence of the Wikimedia Enterprise project - which was recently closed by community consensus - has to this discussion.

Sometimes I wonder if the foundation is aware of the dangerous game it plays with big tech companies...

This argument appears to be a fallacy - connecting this specific technical topic to a separate topic which has independent [valid] considerations. It is a truism that the companies you're referring to are already using Wikimedia projects' content for commercial profit, and that they have that right as per our free licenses. Furthermore, they will continue to do so independently of whether an API built for the speed/volume needs of commercial organisations is created. The argument that Wikimedia should not work with 'big tech companies' is moot - they are already using our content, and we (as is evidenced by this Phab. ticket) already want to use theirs. The outcome of their paying for the [optional] Enterprise API would be that Wikimedia would no longer be financially subsidising "big tech's" business model - instead, they would be financially supporting our movement.
But these are quasi-philosophical issues independent from the specific technical concern of this ticket. As you have previously raised your concerns about Enterprise on the project's Meta talkpage (and in the aforementioned RfC), I invite you to add any new comments on those threads.

Hi All,

Unfortunately, as I noted in an email to Jos Damen on October 19th when he flagged this for me again, there's nothing much we can do here from a leverage perspective. I'm copying my email to Jos below. I'll let @LWyatt's comments on Enterprise stand.

Let me know if you have any questions, and I'm sorry this isn't something that we can change - I really wish we could.

Best,

Yael


Hi Jos,

Thanks for reaching out. Unfortunately, as you have noted, YouTube has indicated that they won't work with us to make an exception for Video2Commons. As I noted on the Phab ticket, this was deeply disappointing to me, and I think they made the wrong call (both ethically and from a business / PR perspective). In consultation with our legal counsel, we have made clear to YouTube that if the community chooses to fight this battle, we will not support YouTube's position.

I'm not sure what effect your public advocacy will have, to be honest, but as the person responsible for managing the overall relationship with YouTube from WMF, I support your efforts. I would have used this as another prompt to go back to YouTube and make the case, but having done so three times I realize that the org-to-org negotiation is not going to bear fruit.

Happy to connect further if you'd find it useful.

Best,

Yael

@LWyatt you're right. The unfair attitude of Google is not a technical issue, and the debate has to take place outside Phabricator. Also, I've spent 30 minutes talking with Lane. It was very interesting and reassuring about the spirit of the Wikimedia Enterprise project. Keep in mind that I'm not against this project, just worried about the movement entering the world of commercial affairs, which can be unfair at times, as we see with the YouTube API, and which is certainly far away from the Wikiphilosophy.

Thanks, @Yael-weissburg, for your feedback. Let's see later how I can continue, maybe with an article somewhere that could affect Google's public image. Thanks to both of you!

FYI google has started to block downloads from video2commons again since around 2023-12-29 22:30.


FYI trying to download https://www.youtube.com/watch?v=ecQWZWpwZVw locally, Youtube Downloader HD ( https://www.youtubedownloaderhd.com/download.html ) doesn't work, but yt-dlp ( https://github.com/yt-dlp/yt-dlp ) does work. Does YT block V2C by IP or by software ID, or both, or something else?


It is not entirely clear. Passing good YouTube auth cookies in Cloud VPS does seem to work, so it is not a full block of the IP right now, but it is likely that an account would be restricted as well if we start using one.
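For anyone testing locally, passing an exported cookie jar to yt-dlp looks roughly like this. The cookie file path is an assumption, and, as noted above, the account behind those cookies could end up restricted:

```
import yt_dlp

# Reuse authenticated YouTube cookies exported from a browser (Netscape format).
opts = {
    'cookiefile': 'youtube-cookies.txt',  # assumed path to an exported cookie jar
    'outtmpl': '%(id)s.%(ext)s',
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(['https://www.youtube.com/watch?v=ecQWZWpwZVw'])
```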