
Cloud Services shared IP (static NAT for external communications) often rate limited by YouTube for video downloads
Open, Normal, Public

Description

I've been having a consistent problem with video2commons today:

"Error: An exception occurred: DownloadError: ERROR: bFbKgtZM9As: YouTube said: Unable to extract video data"

It doesn't seem to matter which video it is, whether it's a CC-licensed video or a public domain one.

Event Timeline

Restricted Application added subscribers: zhuyifei1999, Aklapper. · Fri, Oct 25, 12:35 AM
zhuyifei1999 added a subscriber: brion.
Fae added a subscriber: Fae. · Mon, Oct 28, 1:38 PM

By coincidence I (using Faebot) have been trying to run my CDC video uploads from labs. The standard use of youtube-dl works directly from a terminal session, but when run on the grid engine I start getting

WARNING: unable to download video info webpage: HTTP Error 429: Too Many Requests

or the fatal error (the YouTube ID is just a real example):

youtube_dl.utils.DownloadError: ERROR: fWET2kNwdn8: YouTube said: Unable to extract video data

The same 'DownloadError' can mean that the video is blocked in that region, or removed as a copyvio, but that is not the case for the CDC.
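For anyone trying to reproduce this outside the grid, here is a minimal sketch (not from the task) of how both failures surface through youtube-dl's Python API; the video ID is just the example quoted above:

```
import youtube_dl

try:
    with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
        # download=False fetches metadata only, the same step that is
        # being rate limited here.
        info = ydl.extract_info(
            "https://www.youtube.com/watch?v=fWET2kNwdn8", download=False)
        print(info.get("title"))
except youtube_dl.utils.DownloadError as err:
    # The fatal "Unable to extract video data" is raised as DownloadError;
    # the HTTP 429 message is logged as a warning before extraction gives up.
    print("Download failed:", err)
```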

The 'Too Many Requests' might be a combination of the specific WMF IP address plus the rapid querying of several playlists. However, that's a bit odd considering that the code does work when not on the grid, unless the problem is that IP addresses used by the grid hosts are getting blocked by YouTube/Google while the IP addresses used via live sessions are not.
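If the trigger is query rate rather than the shared IP alone, spacing out requests might help; a hedged sketch using youtube-dl's built-in throttling options (the playlist URL is a placeholder, and none of this helps if YouTube has blocked the NAT IP outright):

```
import time
import youtube_dl

opts = {
    "quiet": True,
    "sleep_interval": 5,       # wait at least 5 s before each download
    "max_sleep_interval": 30,  # randomised up to 30 s
    "ratelimit": 1000000,      # cap transfer speed at ~1 MB/s
}

playlists = ["https://www.youtube.com/playlist?list=EXAMPLE"]  # placeholder

with youtube_dl.YoutubeDL(opts) as ydl:
    for url in playlists:
        ydl.download([url])
        time.sleep(60)  # extra manual pause between playlists
```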

Note that I'm continuing to try from a command line, but as the recoding (mp4/mkv to webm) may take >12 hours for some videos, that means I'm locked out of running a terminal on labs while the project runs, and it's against the guidelines for how labs is supposed to be used by us volunteers...

Anyone interested in checking the specific Python code can find it at /mnt/nfs/labstore-secondary-tools-project/faebot/pywikibot-core/scripts/Youtube_CDC2.py

Fae awarded a token. · Mon, Oct 28, 1:38 PM

> However, that's a bit odd considering that the code does work when not on the grid, unless the problem is that IP addresses used by the grid hosts are getting blocked by YouTube/Google while the IP addresses used via live sessions are not.

Bastions have floating public IPs, so they can open port 22 to the public and you can ssh in directly without a jump host. Grid exec nodes are behind a cloud-wide NAT and share a single public IP.
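One quick way to test that theory: run something like this from a bastion, from a grid job, and from a live session, and compare the addresses (any public IP-echo service works; ipify is just one example):

```
import json
import urllib.request

# Prints the public IP this host egresses through.
with urllib.request.urlopen("https://api.ipify.org?format=json") as resp:
    print("egress IP:", json.load(resp)["ip"])
```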

Phamhi added a subscriber: Phamhi. · Tue, Oct 29, 11:47 AM

@Fae .... Try running it in one of the Kubernetes Python shells:

webservice --backend=kubernetes python shell
~/.virtualenvs/cdc/bin/python ~/pywikibot-core/pwb.py Youtube_CDC_remote
Fae added a comment. · Tue, Oct 29, 3:43 PM

@Phamhi good suggestion. Have not managed to get it to work so far. The Python script drops out without warning, even though I guess in theory the shell should behave in an identical way.
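To see why the script drops out, it may help to force a traceback into a file; a minimal debugging sketch, assuming a main() entry point standing in for the actual CDC logic (an OOM kill of the pod would still leave no trace, since SIGKILL cannot be caught):

```
import faulthandler
import logging

# The file must stay open for faulthandler to write into on a hard crash.
fault_log = open("cdc_fault.log", "w")
faulthandler.enable(file=fault_log)  # dumps tracebacks on segfault etc.

logging.basicConfig(filename="cdc_run.log", level=logging.INFO)

def main():
    ...  # hypothetical stand-in for the download/recode logic

try:
    main()
except BaseException:
    logging.exception("script exited with an exception")
    raise
```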

> @Fae .... Try running it in one of the Kubernetes Python shells

v2c runs from k8s and receives the same message.

bd808 renamed this task from Consistent errors from Video2Commons from YouTube to Cloud Services shared IP (static NAT for external communications) often rate limited by YouTube for video downloads. · Tue, Oct 29, 8:38 PM
bd808 triaged this task as Normal priority.
bd808 moved this task from Inbox to Watching on the cloud-services-team (Kanban) board.
Fae added a comment. · Fri, Nov 1, 5:09 AM
This comment was removed by Fae.

The ideal solution is obviously getting the Cloud VPS NAT IP a higher quota upstream with YouTube, but maybe we can find a way to get some things working in advance of that.

@zhuyifei1999 Does v2c typically do the downloads on Toolforge, or are the instances in the video Cloud VPS project actually doing that work? If it is the latter, we could try a temporary solution of adding public IPv4 addresses to the video instances to spread across more IPs which would hopefully give a larger quota from YouTube.

> @zhuyifei1999 Does v2c typically do the downloads on Toolforge, or are the instances in the video Cloud VPS project actually doing that work? If it is the latter, we could try a temporary solution of adding public IPv4 addresses to the video instances to spread across more IPs which would hopefully give a larger quota from YouTube.

Toolforge instances (k8s pods) fetch metadata; the encoding cluster does both metadata fetching and the actual downloading. The metadata-fetching part has already hit Error 429.

Looks like the rate limit is currently lifted :)

Fae added a comment. · Tue, Nov 5, 10:58 AM

It seems impossible for me to use WMF cloud services to do the CDC video recoding. I have reverted to running an old Mac mini as a headless server, which itself has experienced the YouTube "too many requests" problem, but my understanding is that this gets lifted after a day or two anyway.

If someone can explain how I can legitimately run an FFmpeg recoding job on webgrid and save the files on a WMF server, that would be useful. But this experience, including getting warnings about my work, has seriously discouraged me from relying on WMF cloud services in the future, primarily because of the massive waste of precious volunteer time it takes to keep testing and rewriting code to fit the ever-changing, non-specific and hard-to-understand "requirements" of this environment, compared to simply hosting a script on my own ancient kit.

> If someone can explain how I can legitimately run an FFmpeg recoding job on webgrid and save the files on a WMF server, that would be useful. But this experience, including getting warnings about my work, has seriously discouraged me from relying on WMF cloud services in the future, primarily because of the massive waste of precious volunteer time it takes to keep testing and rewriting code to fit the ever-changing, non-specific and hard-to-understand "requirements" of this environment, compared to simply hosting a script on my own ancient kit.

Can you use video2commons? I don't know what CDC does, but I'd assume it's download + transcode + upload, which is the same as what v2c does in the backend, on special-purpose instances, unlike Toolforge's generic grid.

> Can you use video2commons? I don't know what CDC does, but I'd assume it's download + transcode + upload, which is the same as what v2c does in the backend, on special-purpose instances, unlike Toolforge's generic grid.

Looks like the rate limit is in effect again.

kaldari added a subscriber: kaldari. · Thu, Nov 7, 5:42 PM

I'll bring this up with the WMF partner folks.

Fae added a comment. · Thu, Nov 7, 10:44 PM

My experience running locally is that the YouTube IP block lasts around 2½ days. I can queue my processing and let my programme keep testing the connection every few hours, but it's not reasonable for the average Commons user to see nothing happening for that long.
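For completeness, a hedged sketch of that queue-and-probe approach: retry a metadata fetch on a long interval until the block lifts (the URL and interval are illustrative):

```
import time
import youtube_dl

URL = "https://www.youtube.com/watch?v=fWET2kNwdn8"  # example ID from above
RETRY_SECONDS = 4 * 3600  # probe every four hours

while True:
    try:
        with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
            ydl.extract_info(URL, download=False)
        print("rate limit appears lifted; safe to start the queue")
        break
    except youtube_dl.utils.DownloadError as err:
        print("still blocked:", err)
        time.sleep(RETRY_SECONDS)
```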