Commonsist. Mostly harmless.
Wed, Nov 4
Sun, Oct 25
I think a better workaround, if it works, is to use source_url in site.upload().
It delegates the task of fetching the file to the API. If it works, the file is hopefully the original.
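A minimal sketch of what I mean, assuming a Pywikibot site object and file page are already set up elsewhere. The helper name upload_from_url is hypothetical, and whether the API accepts a given URL depends on the wiki's upload-by-URL configuration.

```python
def upload_from_url(site, filepage, url, comment):
    """Ask the API to fetch `url` itself rather than sending local bytes.

    `site` is assumed to behave like a Pywikibot APISite and `filepage`
    like a pywikibot.FilePage; both are passed in, so this helper carries
    no configuration of its own.
    """
    # Only source_url is given, so no local file is read or sent.
    return site.upload(filepage, source_url=url, comment=comment)
```

The return value is simply whatever site.upload returns, so any existing success or failure handling can stay as it is.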
Oct 24 2020
As my 'personal' work-around, just for UK legislation PDFs that the API flags with chunk-too-small and that fail on a second upload, the PDF is trimmed of its final byte and the upload is re-attempted. In my view this is a terrible hack rather than a fix.
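For anyone wanting to reproduce the hack, it amounts to something like the sketch below. The function name trimmed_copy is hypothetical, and whether the shortened PDF still opens correctly depends on the file, so this is an illustration of the work-around rather than a recommendation.

```python
from pathlib import Path


def trimmed_copy(src, dst):
    """Write a copy of `src` to `dst` with the final byte removed.

    Returns the size of the new file in bytes. The original file is
    left untouched so the hack can be abandoned without loss.
    """
    data = Path(src).read_bytes()
    if not data:
        raise ValueError("cannot trim an empty file")
    Path(dst).write_bytes(data[:-1])
    return len(data) - 1
```

The trimmed copy is then what gets passed to the second upload attempt.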
Oct 21 2020
Oct 17 2020
Oct 15 2020
Oct 4 2020
Jul 22 2020
Jul 16 2020
Thanks for the explanation. I'll raise separate tasks for Tor issues on other projects if they become an issue.
Jul 1 2020
With regard to debugging, these examples may be of interest:
Jun 28 2020
Jun 24 2020
Jun 23 2020
Jun 22 2020
Jun 19 2020
As a separate experiment, here are the results of locally saving a 264MB PDF from archive.org, then running site.upload from a command line instead of within Python. Local directory names are trimmed for privacy, which has no logical effect. This was a meaningful experiment, as a work-around could have been to cache the largest files locally and URL-upload the rest; however, it seems to have no effect on outcomes.
My understanding is that Pywikibot uses chunked upload by default, and that this is the "smart clients to chunk the upload, and we re-assemble it on the server side" approach. Pywikibot's site.py module links to https://www.mediawiki.org/wiki/API:Upload#Chunked_uploading to explain how it works.
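To illustrate the client side of that, here is a rough sketch of how a file can be cut into chunks with their byte offsets. It is not Pywikibot's code; iter_chunks is a hypothetical helper, and the actual API requests (each chunk sent with its offset, later ones carrying the file key from the previous response, as I read the API:Upload page) are left out.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield (offset, data) pairs that cover a binary stream in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield offset, data
        # The next chunk starts where this one ended.
        offset += len(data)


# Example: a 10-byte payload split into 4-byte chunks.
chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), 4))
```

The last chunk is normally shorter than the rest, which is the case the server has to handle when re-assembling the file.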
Jun 13 2020
Second example today, again a large PDF at 149.78 MB: Motography (Jan-Jun 1915) (IA motography13elec).pdf is in the upload log but has not actually made it from the WMF servers to being published on Commons.
Jun 12 2020
This is an upload via site.upload in Pywikibot, so a whitelisted direct url upload via the API.
Jun 10 2020
Slight addition. File:Catalog of Copyright Entries 1937 Musical Compositions New Series Vol 32 Pt 3 For the Year 1937 (IA catalogofcopyrig323libr).pdf successfully uploaded on first run and is 238MB.
Sticking the upload script inside an infinite loop, and letting the upload break on the first API error, seems a practical but bad brute-force work-around. However, this is incredibly slow, wasteful of processing time and bandwidth, and not a solution for the vast majority of Commons contributors.
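A slightly less brutal variant would bound the loop and back off between attempts. The sketch below assumes the upload is wrapped in a zero-argument callable; retry_upload, max_tries and base_delay are names and defaults I have made up for illustration.

```python
import time


def retry_upload(attempt, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call `attempt` until it succeeds or `max_tries` is reached.

    The wait doubles after each failure (base_delay, 2x, 4x, ...).
    The last error is re-raised if every attempt fails.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for n in range(max_tries):
        try:
            return attempt()
        except Exception as err:  # deliberately broad in this sketch
            last_error = err
            if n < max_tries - 1:
                sleep(base_delay * 2 ** n)
    raise last_error
```

This does nothing about the wasted bandwidth of each failed attempt, so it remains a work-around rather than a fix.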
Jun 8 2020
Playing with the Tor browser this morning, a work-around could be for users to keep trying new Tor circuits until they stop getting the Error 500 message. This appears to work for me.
Jun 7 2020
Some of these files do eventually upload, but they are not flagged as such, and the API appears to keep on with the 'retrying' loop. The example below is IA_catalogoftitleen33018libr which seems to have taken 48 minutes to actually upload, though the process keeps running for another hour before generating a (technically correct) error that the file now exists on the WMF servers, because the same process actually managed to complete some time before.
Jun 6 2020
These are simple uploads to Commons, so these don't touch Wikidata, in fact these don't even use wd links.
Jun 5 2020
Backing this up, an hour later, without changing browser, Phabricator did appear rather than the error 500 and this session for this added comment is via Tor & OAuth.
Jun 4 2020
May 9 2020
To clarify the domain, as well as www.britishmuseum.org, media may be coming from media.britishmuseum.org, so it's probably more useful to whitelist:
May 5 2020
Mar 12 2020
Come on. This has been discussed for literally seven years. "Perfection is the enemy of the good" indeed.
Mar 6 2020
Another test case is an upload from 2007. Though overwritten in 2008, this does not stop the timestamp problem from cocking up programs:
Mar 4 2020
Feb 11 2020
https://www.mediawiki.org/wiki/Help:Tabular_Data did cover this, it was just a bit opaque to me.
Jan 1 2020
Dec 19 2019
Dec 16 2019
Seven examples of this tiffinfo parsing failure are available at Uploads by Fæ which fail to display. These were recently put up for speedy deletion, but actually this still appears to be an old WMF server failure that still needs to be fixed, as re-uploading the TIFFs still leads to the indefinite hanging of the upload at the WMF side.
Dec 13 2019
As a small experiment I used ImageMagick to remove the alpha channel
convert norrie.tiff -alpha off output.tiff
on the example file. Uploading using the wizard gave
Dec 12 2019
The user preference, if any, should be to opt-out of going to matches. This is the default functionality, not a default of never going to the exact page the user searched for.
Dec 11 2019
In the meantime, it would be really useful to find a definition of the throttle limits for the service. If we are given a service level guide, like "20 video information queries in an hour", then at least it may be possible to manage our own queue and avoid IP blocks if we stay within it, or reliably farm out the queue if that is an acceptable practice.
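If such a figure were published, staying within it on the client side could look something like the sketch below. The class name SlidingWindowLimiter is my own, and "20 per hour" is only the illustrative number from above, not a known limit of the service.

```python
from collections import deque
import time


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self):
        """Return True and record the call if it fits within the limit."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


# Hypothetical limit: 20 video information queries per hour.
limiter = SlidingWindowLimiter(max_calls=20, period=3600)
```

A queue runner would call try_acquire() before each query and leave the item queued whenever it returns False.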
Please do not rename this task to something less "project-breaking" and urgent. This is not the "default search"; this is a change to Commons search that has broken basic project searching. Example failures resulting from this project-breaking change include:
- Search for "Commons:Freedom of panorama" fails to go to the page
- Search for "Template:Cc-zero" fails to go to the page
- Search for "Category:East London" fails to go to the category and "East London" does not even list it
Dec 10 2019
To avoid clogging this task up with examples, I have posted 408 examples from June 2013 of this same file overwrite bug to Faebot/SandboxU.
Nov 26 2019
As examples of this bug seem rare, it seems worth noting another example that popped up during categorization this afternoon:
2013-03-26 File:HK 銅鑼灣 Causeway Bay 糖街 Sugar Street evening The Point Causeway Square shop 領域電訊 CityLink Mar-2013 Miss Chrissie Chau.JPG
Nov 24 2019
Nov 11 2019
Nov 10 2019
When Wikimedia Commons generates alternate transcodes (e.g. converting a WebM audio/video file, VP9/Opus, length 20 s, 1,080 × 1,080 pixels to a smaller VP9 360P version) different tags are created for the file, which drops several of the standard Matroska entries.
Nov 7 2019
My experience running locally is that the YouTube IP block lasts around 2½ days. I can queue my processing, and let my programme keep testing the connection every few hours, but it's not reasonable for the average Commons user to see nothing happening for that long.
Nov 5 2019
It seems impossible for me to use WMF cloud services to do the CDC video recoding. I have reverted to running an old Mac mini as a headless server, which itself has experienced the YouTube "too many requests" problem, but my understanding is that this gets lifted after a day or two anyway.
Nov 3 2019
Nov 2 2019
Nov 1 2019
Oct 29 2019
@Phamhi good suggestion. Have not managed to get it to work so far. The Python script drops out without warning, even though I guess in theory the shell should behave in an identical way.
Oct 28 2019
By coincidence I (using Faebot) have been trying to run my CDC video uploads from labs. The standard use of youtube-dl works directly from a terminal session, but when run on the grid engine I start getting
WARNING: unable to download video info webpage: HTTP Error 429: Too Many Requests
or the fatal (the youtube id is just a real example)
youtube_dl.utils.DownloadError: ERROR: fWET2kNwdn8: YouTube said: Unable to extract video data
Jul 23 2019
Mar 31 2019
Mar 30 2019
Mar 28 2019
Mar 19 2019
Jan 23 2019
As a reminder, this task has been open for 2 years with a more detailed Wikimedia Commons community consensus to go ahead 15 months ago. https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals/Archive/2017/10#Proposal_to_include_non-CC0_licenses_for_the_Data_namespace
Dec 14 2018
After 24 hours of "outage", the exact same sources are now uploading. Nothing has changed with my account or upload scripts, so something must have changed operationally.
Dec 13 2018
Connectivity from where to where? The direct URL upload does not even touch my client end, so I have no idea what can be checked.
Sep 12 2018
No, my two-year-old question has not been answered. There has been no analysis published.
Okay, let's consider the factual "envelope":