Commonsist. Mostly harmless.
User Details
- User Since
- Dec 7 2014, 3:49 PM (573 w, 6 d)
- Availability
- Available
- IRC Nick
- Fae
- LDAP User
- Unknown
- MediaWiki User
- Fæ [ Global Accounts ]
Jun 6 2021
Just a reminder, as someone who has several times run upload projects with over 100,000 files, with names automatically created with sensible and published naming rules, Commons is routinely losing out on content because of bizarre matches to things like a repeated name in the title of a book in Latin; this happens so often I just never return to these skipped uploads. Even if the right was limited to users with over 100,000 edits, it would be a great improvement and avoid wasting significant volunteer time writing scripts to work around a badly designed or untested blacklist.
Feb 2 2021
There does seem to be a misunderstanding. This task is labelled as being for the "Internship tasks" related to NSFW classifiers "for Wikimedia Commons". If it's not this, perhaps someone could describe it more clearly. Thanks
Agree with others above that using different words to describe a NSFW classifier, is still a NSFW classifier.
Yes, this task needs to be withdrawn.
Feb 1 2021
Under discussion at Commons VP. Overwhelmingly negative.
Jan 6 2021
I did not say that I had used GWT "massively recently".
"probably still the largest user of the tool" is accurate, that's what I wrote, not "used massively recently".
Sorry you spent time on this analysis.
Similarly "I was on the steering group for the development", and I'm probably still the largest user of the tool.
Dec 11 2020
Nov 4 2020
Oct 25 2020
I think a better workaround, if it works, is to use source_url in site.upload().
It delegates to API the task of fetching the file. If it works, the file is hopefully the original.
See https://en.wikisource.org/w/api.php?action=help&modules=upload
Oct 24 2020
As my 'personal' work-around, just for UK legislation PDFs that the API flags with chunk-too-small and fails on a second upload, the pdf is trimmed of the final byte and re-attempted. In my view this is a terrible hack rather than a fix.
Oct 21 2020
Oct 17 2020
Oct 15 2020
Oct 4 2020
Jul 16 2020
Thanks for the explanation. I'll raise separate tasks for TOR issues on other projects if they become an issue.
Jul 1 2020
With regard to debugging, these examples may be of interest:
- IA TheOakLeaf1949January1June25 59MB, took 23 upload attempts before successful upload
- IA asagraycorrespo00grayg 1.44 GB, uploaded on first attempt
Jun 28 2020
Jun 24 2020
Jun 23 2020
Jun 22 2020
Jun 19 2020
As a separate experiment, here's the results of locally saving a 264MB PDF from archive.org, then running the site.upload from a command line instead of within Python. Local directory names trimmed for privacy but have no logical effect. This was a meaningful experiment as a work-around could have been to locally cache the largest files and url upload the rest, however it seems to have no effect on outcomes.
My understanding is that by Pywikibot automatically using chunked upload as a default means that "smart clients to chunk the upload, and we re-assemble it on the server side" was how this worked. Checking Pywikibot's site.py module, this links to https://www.mediawiki.org/wiki/API:Upload#Chunked_uploading to explain how it works.
Jun 13 2020
Second example today, again a large PDF at 149.78 MB, Motography (Jan-Jun 1915) (IA motography13elec).pdf in the upload log but not actually made it from the WMF servers to getting published on Commons.
Jun 12 2020
This is an upload via site.upload in Pywikibot, so a whitelisted direct url upload via the API.
Jun 10 2020
Slight addition. File:Catalog of Copyright Entries 1937 Musical Compositions New Series Vol 32 Pt 3 For the Year 1937 (IA catalogofcopyrig323libr).pdf successfully uploaded on first run and is 238MB.
Sticking the upload script inside an infinite loop, allowing the upload to break on any first API error seems a practical but bad brute-force work around. However this is incredibly slow, wasteful of processing time and bandwidth, and not a solution for the vast majority for Commons contributors.
Jun 8 2020
Playing with the Tor browser this morning, a work-around could be to for users to keep trying new Tor circuits until they stop getting the Error 500 message. This appears to work for me.
Jun 7 2020
Some of these files do eventually upload, but they are not flagged as such, and the API appears to keep on with the 'retrying' loop. The example below is IA_catalogoftitleen33018libr which seems to have taken 48 minutes to actually upload, though the process keeps running for another hour before generating a (technically correct) error that the file now exists on the WMF servers, because the same process actually managed to complete some time before.
Jun 6 2020
These are simple uploads to Commons, so these don't touch Wikidata, in fact these don't even use wd links.
Jun 5 2020
Backing this up, an hour later, without changing browser, Phabricator did appear rather than the error 500 and this session for this added comment is via Tor & OAuth.
Jun 4 2020
May 9 2020
To clarify the domain, as well as www.britishmuseum.org, media may be coming from media.britishmuseum.org, so it's probably more useful to whitelist:
May 5 2020
Mar 12 2020
Come on. This has been discussed for literally seven years. "Perfection is the enemy of the good" indeed.
Mar 6 2020
Another test case is an upload from 2007. Though overwritten in 2008, this does not stop the timestamp problem from cocking up programs:
File:Potenzmenge_von_A.png
Mar 4 2020
Feb 11 2020
https://www.mediawiki.org/wiki/Help:Tabular_Data did cover this, it was just a bit opaque to me.
Dec 16 2019
Seven examples of this tiffinfo parsing failure are available at Uploads by Fæ which fail to display. These were recently put up for speedy deletion, but actually this still appears to be an old WMF server failure that still needs to be fixed, as re-uploading the TIFFs still leads to the indefinite hanging of the upload at the WMF side.
Dec 13 2019
As a small experiment I used ImageMagick to remove the alpha channel
convert norrie.tiff -alpha off output.tiff
on the example file. Uploading using the wizard gave
Dec 12 2019
The user preference, if any, should be to opt-out of going to matches. This is the default functionality, not a default of never going to the exact page the user searched for.
Dec 11 2019
In the meantime, it would be really useful to find a definition of the throttle limits for the service. If we are given a service level guide, like "20 video information queries in an hour", then at least it may be possible to manage our own queue and avoid IP blocks if we stay within it, or reliably farm out the queue if that is an acceptable practice.
Please do not rename this task to something less "project breaking" and urgent. This is not the "default search", this is change to Commons search that has broken basic project searching. Example failures resulting from this project breaking change include:
- Search for "Commons:Freedom of panorama" fails to go to the page
- Search for "Template:Cc-zero" fails to go to the page
- Search for "Category:East London" fails to go to the category and "East London" does not even list it
Dec 10 2019
To avoid clogging this task up with examples, I have posted 408 examples from June 2013 of this same file overwrite bug to Faebot/SandboxU.
Further example is Negros_Oriental_State_University.jpg, per diff, the file cannot be deleted which seems to be caused by the first entry in the filehistory being corrupted.
Nov 26 2019
As examples of this bug seem rare, it seems worth noting another example that popped up during categorization this afternoon:
2013-03-26 File:HK 銅鑼灣 Causeway Bay 糖街 Sugar Street evening The Point Causeway Square shop 領域電訊 CityLink Mar-2013 Miss Chrissie Chau.JPG
Via pywikibot:
Nov 24 2019
Nov 11 2019
Nov 10 2019
When Wikimedia Commons generates alternate transcodes (e.g. converting a WebM audio/video file, VP9/Opus, length 20 s, 1,080 × 1,080 pixels to a smaller VP9 360P version) different tags are created for the file, which drops several of the standard Matroska entries.
Nov 7 2019
My experience running locally is that the YouTube IP block lasts around 2½ days. I can queue my processing, and let my programme keep testing the connection every few hours, but it's not reasonable for the average Commons user to see nothing happening for that long.
Nov 5 2019
It seems impossible for me to use WMF cloud services to do the CDC video recoding. I have reverted to running an old mac mini as a headless server, which itself has experienced the YouTube "too many requests" problem, but my understanding is that this gets lifted after a day or two anyway.
Nov 2 2019
Nov 1 2019
Oct 29 2019
@Phamhi good suggestion. Have not managed to get it to work so far. The Python script drops out without warning, even though I guess in theory the shell should behave in an identical way.
Oct 28 2019
By coincidence I (using Faebot) have been trying to run my CDC videos uploads from labs. The standard use of youtube-dl works directly from a terminal session, but when run on the grid engine I start getting
WARNING: unable to download video info webpage: HTTP Error 429: Too Many Requests
or the fatal (the youtube id is just a real example)
youtube_dl.utils.DownloadError: ERROR: fWET2kNwdn8: YouTube said: Unable to extract video data
Jan 23 2019
As a reminder, this task has been open for 2 years with a more detailed Wikimedia Commons community consensus to go ahead 15 months ago. https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals/Archive/2017/10#Proposal_to_include_non-CC0_licenses_for_the_Data_namespace
Dec 14 2018
After 24 hours of "outage", the exact same sources are now uploading. Nothing has changed with my account or upload scripts, so something must have changed operationally.
Dec 13 2018
Connectivity from where to where? The direct URL upload does not even touch my client end, so I have no idea what can be checked.
Sep 12 2018
No my two years old question has not been answered. There has been no analysis published.
Okay, let's consider the factual "envelope":
Aug 6 2018
May 20 2018
Meh, changing some categories is not exactly the crime of the decade, especially considering it only takes a few seconds to swap the entire lot back. If these are the only example of misuse, it's a really, really, weak case for breaking standard tools for good faith contributors.
Based on the last set of numbers, I would up my suggestion of a limit to 3000 edits / minute on Commons.
May 19 2018
Sure, I'll take care of the vandalism on Wikimedia Commons, don't over-egg the case. The realistic risk of damage or disruption has not been made clear to us "non security" volunteers. Rather than telling us off in an attempt to make us feel stupid, try explaining the case properly and leave bullying tactics at the school gate.
It's quite normal to change 2000 files in 60 seconds with catalot or use VFC to make a DR with several hundred files in half that.
May 18 2018
As was requested previously, where is the link/reference to make the change. If this affects all tools for Commons users, then there should have been a public proposal on Commons, not a super duper secret Phabricator task because "security".
May 13 2018
It is already allowed. Admin of every SUL project can create unlimited number of account. This is about delegation of sysop power.
This is a change that affects all projects, not just the English Wikipedia. A RFC is still needed on meta, and all affected projects appropriately notified.
May 6 2018
May 4 2018
Nudge - it has now been 1 year, 5 months, 21 days since this request for publication was raised. We are not asking MI6 about attempts on the Prime Minister's life, it should be possible to explain what happened and the basics of how it will be prevented in the future without creating a hacker's guide to breaking Wikimedia.
