
Cloud Services shared IP (static NAT for external communications) often rate limited by YouTube for video downloads
Open, Stalled, Medium · Public

Description

I've been having a consistent problem with video2commons today:

"Error: An exception occurred: DownloadError: ERROR: bFbKgtZM9As: YouTube said: Unable to extract video data"

It doesn't seem to matter which video it is, whether it's a cc-licensed video or a public domain one.


Event Timeline

Restricted Application added subscribers: zhuyifei1999, Aklapper. · Oct 25 2019, 12:35 AM
zhuyifei1999 added a subscriber: brion.
Fae added a subscriber: Fae. · Oct 28 2019, 1:38 PM

By coincidence I (using Faebot) have been trying to run my CDC video uploads from labs. The standard use of youtube-dl works directly from a terminal session, but when run on the grid engine I start getting:

WARNING: unable to download video info webpage: HTTP Error 429: Too Many Requests

or the fatal error (the YouTube id is just a real example):

youtube_dl.utils.DownloadError: ERROR: fWET2kNwdn8: YouTube said: Unable to extract video data

The same 'DownloadError' can mean that the video is blocked in that region, or removed as a copyvio, but that is not the case for the CDC.

The 'Too Many Requests' might be caused by a combination of the specific WMF IP address and the rapid querying of several playlists. However, that's a bit odd considering the code does work when not on the grid, unless the IP addresses used by the grid hosts are being blocked by YouTube/Google while those used in live sessions are not.

Note that I'm continuing to try from a command line, but as the recoding (mp4/mkv to webm) may take >12 hours for some videos, that means I'm locked out of running a terminal on labs while the project runs, plus it's against the guidelines of how labs is supposed to be used by us volunteers...

Anyone interested in checking the specific Python code can find it on /mnt/nfs/labstore-secondary-tools-project/faebot/pywikibot-core/scripts/Youtube_CDC2.py
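
Not Faebot's actual script, but a minimal generic sketch of coping with these failures, assuming the block is temporary; the retry count and wait interval are guesses, since YouTube does not publish its limits:

import time
import youtube_dl

def fetch_info(video_id, max_tries=3, wait_seconds=900):
    # Both 'HTTP Error 429' and 'Unable to extract video data' surface
    # as DownloadError, so retry a few times with a long pause between.
    url = 'https://www.youtube.com/watch?v=' + video_id
    with youtube_dl.YoutubeDL({'quiet': True}) as ydl:
        for attempt in range(1, max_tries + 1):
            try:
                return ydl.extract_info(url, download=False)
            except youtube_dl.utils.DownloadError:
                if attempt == max_tries:
                    raise
                time.sleep(wait_seconds)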

Fae awarded a token. · Oct 28 2019, 1:38 PM

> However, that's a bit odd considering the code does work when not on the grid, unless the IP addresses used by the grid hosts are being blocked by YouTube/Google while those used in live sessions are not.

Bastions have floating public IPs, so they can open port 22 to the public and you can ssh in directly without a jump host. Grid exec nodes are behind a cloud-wide NAT and share a single public IP.
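
A quick way to test that hypothesis is to compare the egress IP reported from a bastion session with the one reported from a grid job; a minimal sketch (the echo service used here is an arbitrary choice):

import urllib.request

def egress_ip():
    # Ask an external echo service which public IP our traffic appears
    # to come from. Run once from a bastion session and once from a grid
    # job; if every grid job reports the same address, they share the NAT IP.
    with urllib.request.urlopen('https://api.ipify.org') as resp:
        return resp.read().decode().strip()

print(egress_ip())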

Phamhi added a subscriber: Phamhi. · Oct 29 2019, 11:47 AM

@Fae .... Try running it in one of the Kubernetes Python shells:

# start an interactive Python shell in a Kubernetes pod
webservice --backend=kubernetes python shell
# then, inside the pod, run the script from its virtualenv
~/.virtualenvs/cdc/bin/python ~/pywikibot-core/pwb.py Youtube_CDC_remote
Fae added a comment. · Oct 29 2019, 3:43 PM

@Phamhi good suggestion. Have not managed to get it to work so far. The Python script drops out without warning, even though I guess in theory the shell should behave in an identical way.

> @Fae .... Try running it in one of the Kubernetes Python shells:

v2c runs from k8s and receives the same message.

bd808 renamed this task from "Consistent errors from Video2Commons from YouTube" to "Cloud Services shared IP (static NAT for external communications) often rate limited by YouTube for video downloads". · Oct 29 2019, 8:38 PM
bd808 triaged this task as Medium priority.
bd808 moved this task from Inbox to Watching on the cloud-services-team (Kanban) board.
Fae added a comment. · Nov 1 2019, 5:09 AM
This comment was removed by Fae.
bd808 added a subscriber: bd808. · Nov 1 2019, 5:24 PM

The ideal solution is obviously getting the Cloud VPS NAT IP a higher quota upstream with YouTube, but maybe we can find a way to get some things working in advance of that.

@zhuyifei1999 Does v2c typically do the downloads on Toolforge, or are the instances in the video Cloud VPS project actually doing that work? If it is the latter, we could try a temporary solution of adding public IPv4 addresses to the video instances to spread across more IPs which would hopefully give a larger quota from YouTube.

> @zhuyifei1999 Does v2c typically do the downloads on Toolforge, or are the instances in the video Cloud VPS project actually doing that work? If it is the latter, we could try a temporary solution of adding public IPv4 addresses to the video instances to spread across more IPs which would hopefully give a larger quota from YouTube.

Toolforge instances (k8s pods) fetch metadata; the encoding cluster does both metadata fetching and the actual downloading. The metadata-fetching part has already hit Error 429.

Looks like the rate limit is currently lifted :)

Fae added a comment. · Nov 5 2019, 10:58 AM

It seems impossible for me to use WMF cloud services to do the CDC video recoding. I have reverted to running an old mac mini as a headless server, which itself has experienced the YouTube "too many requests" problem, but my understanding is that this gets lifted after a day or two anyway.

If someone can explain how I can legitimately run an FFmpeg recoding job on webgrid and save the files on a WMF server, that would be useful. But this experience, including getting warnings about my work, has seriously discouraged me from relying on WMF cloud services in the future, primarily because of the massive waste of precious volunteer time it takes to keep testing and rewriting code to fit the ever-changing, non-specific and hard-to-understand "requirements" of this environment, compared to simply hosting a script on my own ancient kit.

> If someone can explain how I can legitimately run an FFmpeg recoding job on webgrid and save the files on a WMF server, that would be useful. But this experience, including getting warnings about my work, has seriously discouraged me from relying on WMF cloud services in the future, primarily because of the massive waste of precious volunteer time it takes to keep testing and rewriting code to fit the ever-changing, non-specific and hard-to-understand "requirements" of this environment, compared to simply hosting a script on my own ancient kit.

Can you use video2commons? I don't know what CDC does, but I'd assume it is download + transcode + upload, which is the same as what v2c does in the backend, on special-purpose instances, unlike Toolforge's generic grid.

> Can you use video2commons? I don't know what CDC does, but I'd assume it is download + transcode + upload, which is the same as what v2c does in the backend, on special-purpose instances, unlike Toolforge's generic grid.

Looks like the rate limit is in effect again.

kaldari added a subscriber: kaldari. · Nov 7 2019, 5:42 PM

I'll bring this up with the WMF partner folks.

Fae added a comment (edited). · Nov 7 2019, 10:44 PM

My experience running locally is that the YouTube IP block lasts around 2½ days. I can queue my processing, and let my programme keep testing the connection every few hours, but it's not reasonable for the average Commons user to see nothing happening for that long.

Update: It turns out that YouTube has escalating IP blocks. The second time around, the CDC uploads got a 5-day block, so it's fair to presume that any IP will rapidly become unusable. Either someone works out the maximum bandwidth/access thresholds for pulling video information, so any tool can stay just under them, or, if YouTube is incommunicado, we have to consider the site hostile for any tool doing this task.
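
A sketch of the "queue and keep probing" approach, assuming blocks do lift on their own after a few days; the probe interval and probe video are placeholders:

import time
import youtube_dl

PROBE_URL = 'https://www.youtube.com/watch?v=fWET2kNwdn8'  # placeholder probe target

def wait_until_unblocked(probe_hours=6):
    # Retry metadata extraction for a known-good video every few hours;
    # return once it succeeds, i.e. once the IP block has been lifted.
    while True:
        try:
            with youtube_dl.YoutubeDL({'quiet': True}) as ydl:
                ydl.extract_info(PROBE_URL, download=False)
            return
        except youtube_dl.utils.DownloadError:
            time.sleep(probe_hours * 3600)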

Ainali added a subscriber: Ainali. · Nov 23 2019, 11:13 PM
JTs added a subscriber: JTs. · Nov 26 2019, 7:06 PM

Is anyone working on this?

Kizule added a subscriber: Kizule. · Dec 9 2019, 3:25 AM

> Is anyone working on this?

Task is assigned to @zhuyifei1999 so he should respond :)

I was informed by @Matanya last Wednesday that Google will respond to us in 1-2 weeks.

zhuyifei1999 changed the task status from Open to Stalled. · Dec 11 2019, 6:31 AM

They think it's a ToS violation... so... this is gonna be difficult.

Damn, maybe get WMF Legal to talk to them? I don't know how you can have cc-licensed videos on a site and think it's a ToS violation for them to be downloaded. Sounds like a knee-jerk reaction (to genuine copyright infringement) to me.

Fae added a comment. · Dec 11 2019, 7:37 PM

In the meantime, it would be really useful to have a definition of the throttle limits for the service. If we were given a service-level guide, like "20 video information queries in an hour", then at least we could manage our own queue and avoid IP blocks by staying within it, or reliably farm out the queue if that is an acceptable practice.
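
If such a figure were ever published, staying under it client-side would be straightforward; a minimal sketch assuming the hypothetical "20 video information queries in an hour" above:

import time

class HourlyQuota:
    # Space requests evenly so we never exceed per_hour calls in any hour.
    def __init__(self, per_hour=20):
        self.interval = 3600.0 / per_hour
        self.next_allowed = 0.0

    def wait(self):
        # Block until the next request is permitted.
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
        self.next_allowed = time.monotonic() + self.interval

quota = HourlyQuota(per_hour=20)
# call quota.wait() before each video information query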

czar awarded a token. · Dec 16 2019, 11:35 PM
czar added a subscriber: czar.

Google isn't responding (they probably don't have the incentive to), gonna wait a few more days. If it stays like this, I'm gonna do a massive overhaul of how v2c downloads from YouTube. Sneak peek: slimerjs + x11vnc

bd808 updated the task description.

> Google isn't responding (they probably don't have the incentive to), gonna wait a few more days. If it stays like this, I'm gonna do a massive overhaul of how v2c downloads from YouTube. Sneak peek: slimerjs + x11vnc

Need help with this, @zhuyifei1999? @Sturm called my attention to this task yesterday and indeed it seems like a very significant disruption that Google is imposing on us here.

@Xinbenlv is there anything you can do to help us here? I want to avoid using technical measures to make it seem as though our traffic comes from other sources, but I am starting to consider that solution as a viable option. Will that be a necessary step here? Is there anything you need from us to possibly expedite communications on Google's side?

Coffee added a subscriber: Coffee.
Alfa80 added a subscriber: Alfa80. · Feb 5 2020, 6:58 PM
Keegan added a subscriber: Keegan. · Feb 7 2020, 1:47 AM
bd808 updated the task description. · Feb 7 2020, 3:24 AM

Hello all,

Thank you for your comments and insights with respect to the recent suspension of API access that the Video2Commons tool was using. We share your desire to address this issue. The Wikimedia Foundation's Partnerships team, which is responsible for maintaining our long-term relationships with entities like Google/Alphabet/YouTube, has met with our contacts at YouTube to discuss this issue. We have not worked out any specifics, but they are indeed interested in working with us on a resolution.

As soon as we have more information and details to share, we will make them available here. We will continue to talk with our engineering teams on a long-term solution which will hopefully allow for a more streamlined way to upload Creative Commons license videos to Commons.

Thank you all for your interest and your patience while we work to inform our contacts at YouTube of the importance of this issue and explore a way forward. The Foundation respects the position many of you have taken, and agrees that any resolution with YouTube should capture both the spirit and stated intent of Creative Commons licenses.

Lionel_Scheepmans added a comment (edited). · Feb 29 2020, 6:48 PM

@Varnet,

The last time I was on YouTube I saw that they have removed the possibility of publishing videos under any free license other than CC0.
YouTube and Google are certainly anticipating possible paid access to the Wikidata and Commons APIs. They read the Wikimedia 2020-2030 strategy like everyone else...

I don't know if anyone shares my point of view, but I have the feeling that we're on the way to losing the vision of the free software movement, and maybe the soul of the original WWW, of which the Wikimedia movement is one of the most significant survivors... The CC0 license (on Wikidata and Commons) is already a godsend for all the companies that already have a monopoly on the users of online services. With CC0, they can use the work of Wikimedia projects' volunteers, without legal problems, to make new copyrighted services.

The free software movement and the Wikimedia movement are missing a tool to accomplish their mission: a CC-SA licence. Only copyleft can ensure that a Wikimedia contributor will never have to use a Google account, and thus become the company's commercial product, in order to effectively use the body of work that they and other digital workers have provided.

Here is the missing licence, retired in 2004 for lack of demand: https://creativecommons.org/licenses/sa/1.0/

Just curious if there is an update here? No rush of course!

Don-vip added a subscriber: Don-vip.

I'm also facing this issue in my tool. It runs on Toolforge/k8s to detect new CC videos published by Arianespace using the YouTube API, and it gets HTTP 429 errors too.
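
For API-based tools like this, honoring the Retry-After header on a 429 may help; a sketch against the YouTube Data API v3 search endpoint (the channel id and API key are placeholders you would supply):

import time
import urllib.error
import urllib.parse
import urllib.request

def latest_channel_videos(channel_id, api_key):
    # Fetch one page of a channel's newest videos, retrying once on
    # HTTP 429 and sleeping for the server-supplied Retry-After if any.
    params = urllib.parse.urlencode({
        'part': 'snippet', 'channelId': channel_id,
        'order': 'date', 'type': 'video', 'key': api_key,
    })
    url = 'https://www.googleapis.com/youtube/v3/search?' + params
    for attempt in range(2):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429 or attempt == 1:
                raise
            time.sleep(int(e.headers.get('Retry-After', 3600)))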

Mvolz added a subscriber: Mvolz. · Jun 9 2020, 9:34 AM

This is now affecting citoid too: we're unable to get metadata from the page to cite YouTube videos in references :/ See T254700.

Base added a subscriber: Base. · Jun 14 2020, 9:15 PM