Commonsist. Mostly harmless.
Jun 6 2021
Just a reminder from someone who has several times run upload projects of over 100,000 files, with names created automatically under sensible, published naming rules: Commons is routinely losing out on content because of bizarre blacklist matches to things like a repeated name in the title of a book in Latin. This happens so often that I just never return to these skipped uploads. Even if the right were limited to users with over 100,000 edits, it would be a great improvement and would avoid wasting significant volunteer time on writing scripts to work around a badly designed or untested blacklist.
Feb 2 2021
There does seem to be a misunderstanding. This task is labelled as being for the "Internship tasks" related to NSFW classifiers "for Wikimedia Commons". If it's not this, perhaps someone could describe it more clearly. Thanks
Agree with others above that an NSFW classifier described in different words is still an NSFW classifier.
Yes, this task needs to be withdrawn.
Feb 1 2021
Under discussion at Commons VP. Overwhelmingly negative.
Jan 6 2021
I did not say that I had used GWT "massively recently".
"probably still the largest user of the tool" is accurate, that's what I wrote, not "used massively recently".
Sorry you spent time on this analysis.
Similarly "I was on the steering group for the development", and I'm probably still the largest user of the tool.
Dec 15 2020
Dec 11 2020
Dec 6 2020
Nov 4 2020
Oct 25 2020
I think a better workaround, if it works, is to use source_url in site.upload().
It delegates the task of fetching the file to the API. If it works, the file is hopefully the original.
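A minimal sketch of the source_url approach, assuming Pywikibot's site.upload() accepts source_url, comment and ignore_warnings as keyword arguments. The wrapper name upload_from_url and its parameters are hypothetical, and the source domain would need to be on the wiki's URL upload whitelist for this to work at all.

```python
def upload_from_url(site, filepage, url, summary):
    """Ask the wiki to fetch `url` itself rather than sending local bytes.

    `site` is expected to behave like a Pywikibot APISite and `filepage`
    like a pywikibot.FilePage; both are passed in so the helper stays
    independent of any particular configuration.
    """
    return site.upload(
        filepage,
        source_url=url,         # the server fetches the file from here
        comment=summary,        # upload summary shown in the file history
        ignore_warnings=False,  # stop on warnings such as duplicates
    )


# Typical use (requires a configured Pywikibot install, so left as comments):
#   import pywikibot
#   site = pywikibot.Site("commons", "commons")
#   page = pywikibot.FilePage(site, "File:Example.pdf")
#   upload_from_url(site, page, "https://example.org/example.pdf", "test upload")
```

Because no local copy is re-encoded or re-saved on the way, whatever reaches the wiki should be what the source server sent.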
Oct 24 2020
As my 'personal' work-around, just for UK legislation PDFs that the API flags with chunk-too-small and that fail on a second upload attempt, the PDF is trimmed of its final byte and the upload is re-attempted. In my view this is a terrible hack rather than a fix.
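For anyone wanting to reproduce the hack, the trimming step might look something like the sketch below. The function name write_trimmed_copy is hypothetical; it writes a copy so the original download is left untouched.

```python
from pathlib import Path


def write_trimmed_copy(src, dst):
    """Write a copy of `src` to `dst` with the final byte removed.

    Returns the size in bytes of the trimmed copy.
    """
    data = Path(src).read_bytes()
    if not data:
        raise ValueError("cannot trim an empty file")
    Path(dst).write_bytes(data[:-1])  # drop only the last byte
    return len(data) - 1
```

Note that the trimmed copy is no longer byte-identical to the source, so its hash will differ from the original's, which is one more reason to treat this as a hack rather than a fix.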
Oct 21 2020
Oct 17 2020
Oct 15 2020
Oct 4 2020
Jul 22 2020
Jul 16 2020
Thanks for the explanation. I'll raise separate tasks for TOR issues on other projects if they become an issue.
Jul 1 2020
With regard to debugging, these examples may be of interest:
Jun 28 2020
Jun 24 2020
Jun 23 2020
Jun 22 2020
Jun 19 2020
As a separate experiment, here are the results of locally saving a 264MB PDF from archive.org, then running site.upload from a command line instead of within Python. Local directory names are trimmed for privacy, which has no logical effect. This was a meaningful experiment, as a work-around could have been to locally cache the largest files and URL upload the rest; however, it seems to have no effect on outcomes.
My understanding is that, because Pywikibot automatically uses chunked upload by default, "smart clients to chunk the upload, and we re-assemble it on the server side" is how this already works. Pywikibot's site.py module links to https://www.mediawiki.org/wiki/API:Upload#Chunked_uploading to explain how it works.
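To make the client side of that concrete, here is a rough sketch of how a file could be cut into chunks with running offsets. It only plans the requests and sends nothing. As I read the linked API page, each chunk goes in its own action=upload request with stash, filesize and offset set, and the filekey the server returns is passed along with later chunks; the helper names and the chunk_length field are purely illustrative.

```python
import io


def plan_chunks(stream, chunk_size):
    """Yield (offset, chunk_bytes) pairs covering the stream in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield offset, chunk
        offset += len(chunk)  # the next chunk starts where this one ended


def chunk_request_params(stream, filesize, chunk_size):
    """Build one illustrative parameter set per chunk (nothing is sent)."""
    return [
        {
            "action": "upload",
            "stash": 1,
            "filesize": filesize,
            "offset": offset,
            "chunk_length": len(chunk),  # illustrative only, not an API field
        }
        for offset, chunk in plan_chunks(stream, chunk_size)
    ]
```

For example, chunk_request_params(io.BytesIO(b"x" * 10), 10, 4) plans three requests at offsets 0, 4 and 8, the last carrying only 2 bytes.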
Jun 13 2020
Second example today, again a large PDF at 149.78 MB: Motography (Jan-Jun 1915) (IA motography13elec).pdf is in the upload log but has not actually made it from the WMF servers to being published on Commons.
Jun 12 2020
This is an upload via site.upload in Pywikibot, so a whitelisted direct url upload via the API.
Jun 10 2020
Slight addition. File:Catalog of Copyright Entries 1937 Musical Compositions New Series Vol 32 Pt 3 For the Year 1937 (IA catalogofcopyrig323libr).pdf successfully uploaded on first run and is 238MB.
Sticking the upload script inside an infinite loop, and allowing the upload to break on the first API error, seems a practical but bad brute-force workaround. However, this is incredibly slow, wasteful of processing time and bandwidth, and not a solution for the vast majority of Commons contributors.
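The brute-force loop can be sketched roughly as follows. This version is bounded rather than infinite, which is my own substitution; retry_upload, attempt_upload and the default values are all hypothetical names and numbers, and the upload itself is passed in as a callable so the sketch does not depend on any particular client.

```python
import time


def retry_upload(attempt_upload, max_attempts=5, delay_seconds=60.0,
                 sleep=time.sleep):
    """Call `attempt_upload` until it succeeds or attempts run out.

    Any exception counts as a failed attempt. The result of the first
    successful call is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_upload()
        except Exception as error:  # deliberately broad: brute force
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # pause before the next try
    raise RuntimeError(
        f"upload failed after {max_attempts} attempts"
    ) from last_error
```

Each retry re-sends the whole file, which is exactly where the wasted time and bandwidth come from.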
Jun 8 2020
Playing with the Tor browser this morning, a work-around could be for users to keep trying new Tor circuits until they stop getting the Error 500 message. This appears to work for me.
Jun 7 2020
Some of these files do eventually upload, but they are not flagged as such, and the API appears to keep on with the 'retrying' loop. The example below is IA_catalogoftitleen33018libr which seems to have taken 48 minutes to actually upload, though the process keeps running for another hour before generating a (technically correct) error that the file now exists on the WMF servers, because the same process actually managed to complete some time before.
Jun 6 2020
These are simple uploads to Commons, so they don't touch Wikidata. In fact, they don't even use wd links.
Jun 5 2020
Backing this up, an hour later, without changing browser, Phabricator did appear rather than the error 500 and this session for this added comment is via Tor & OAuth.
Jun 4 2020
May 9 2020
To clarify the domain, as well as www.britishmuseum.org, media may be coming from media.britishmuseum.org, so it's probably more useful to whitelist:
May 5 2020
Mar 12 2020
Come on. This has been discussed for literally seven years. "Perfection is the enemy of the good" indeed.
Mar 6 2020
Another test case is an upload from 2007. Though overwritten in 2008, this does not stop the timestamp problem from cocking up programs:
Mar 4 2020
Feb 11 2020
https://www.mediawiki.org/wiki/Help:Tabular_Data did cover this, it was just a bit opaque to me.
Jan 1 2020
Dec 19 2019
Dec 16 2019
Seven examples of this tiffinfo parsing failure are available at Uploads by Fæ which fail to display. These were recently put up for speedy deletion, but actually this still appears to be an old WMF server failure that still needs to be fixed, as re-uploading the TIFFs still leads to the indefinite hanging of the upload at the WMF side.
Dec 13 2019
As a small experiment I used ImageMagick to remove the alpha channel
convert norrie.tiff -alpha off output.tiff
on the example file. Uploading using the wizard gave
Dec 12 2019
The user preference, if any, should be to opt-out of going to matches. This is the default functionality, not a default of never going to the exact page the user searched for.
Dec 11 2019
In the meantime, it would be really useful to find a definition of the throttle limits for the service. If we are given a service level guide, like "20 video information queries in an hour", then at least it may be possible to manage our own queue and avoid IP blocks if we stay within it, or reliably farm out the queue if that is an acceptable practice.
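If such a limit were published, staying within it could be as simple as the sliding-window sketch below. The class name and parameters are hypothetical, and "20 queries in an hour" is only the made-up example figure from above, not a known limit of the service.

```python
import time
from collections import deque


class SlidingWindowThrottle:
    """Allow at most `max_calls` within any `window_seconds` period."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps = deque()  # times of recent calls, oldest first

    def seconds_until_allowed(self):
        """Return 0.0 if a call may be made now, else the wait in seconds."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self._stamps[0])

    def record(self):
        """Note that a call has just been made."""
        self._stamps.append(self._clock())


# Example figure only: 20 queries per hour.
# throttle = SlidingWindowThrottle(max_calls=20, window_seconds=3600)
```

A queue runner would check seconds_until_allowed() before each query, sleep for that long if it is non-zero, then make the query and call record().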
Please do not rename this task to something less "project breaking" and urgent. This is not the "default search"; this is a change to Commons search that has broken basic project searching. Example failures resulting from this project-breaking change include:
- Search for "Commons:Freedom of panorama" fails to go to the page
- Search for "Template:Cc-zero" fails to go to the page
- Search for "Category:East London" fails to go to the category and "East London" does not even list it
Dec 10 2019
To avoid clogging this task up with examples, I have posted 408 examples from June 2013 of this same file overwrite bug to Faebot/SandboxU.
Nov 26 2019
As examples of this bug seem rare, it seems worth noting another example that popped up during categorization this afternoon:
2013-03-26 File:HK 銅鑼灣 Causeway Bay 糖街 Sugar Street evening The Point Causeway Square shop 領域電訊 CityLink Mar-2013 Miss Chrissie Chau.JPG
Nov 24 2019
Nov 11 2019
Nov 10 2019
When Wikimedia Commons generates alternate transcodes (e.g. converting a WebM audio/video file, VP9/Opus, length 20 s, 1,080 × 1,080 pixels to a smaller VP9 360P version) different tags are created for the file, which drops several of the standard Matroska entries.
Nov 7 2019
My experience running locally is that the YouTube IP block lasts around 2½ days. I can queue my processing, and let my programme keep testing the connection every few hours, but it's not reasonable for the average Commons user to see nothing happening for that long.
Nov 5 2019
It seems impossible for me to use WMF cloud services to do the CDC video recoding. I have reverted to running an old Mac mini as a headless server, which has itself experienced the YouTube "too many requests" problem, but my understanding is that this gets lifted after a day or two anyway.