Unfortunately, this patch is harder to test than to write, as you have to set up a Flickr API key and configure UploadWizard for Flickr importing, not to mention configuring the user groups and user rights. It's a very low risk patch though, as Flickr importing in UploadWizard is currently a somewhat obscure feature and the code changes involved are minimal (it's really a configuration change). I would be OK with this patch getting merged untested and then testing it on test.wikipedia.org (which is fully configured for Flickr importing).
Apr 6 2020
@eprodromou - Nice work! Has there been any interest on Commons for your bot to take over any of the existing SVG maps? I'd be happy to help with those discussions if you like.
Apr 2 2020
@thcipriani - Any idea how to move a repo from Phabricator to Gerrit (without losing the history)? Or who would know?
Apr 1 2020
Mar 31 2020
@Ramsey-WMF - It's definitely still showing me images where I'm not the original uploader, for example, https://commons.wikimedia.org/wiki/File:El_Gouna_Turtle_House_R01.jpg. Is the list of images pre-populated or does it pick new images on the fly? Perhaps those images were assigned to me before the change went into place.
Another idea would be to associate the Wikidata ID with the category from the category side.
Any thoughts on this?
An alternate idea would be a way to assign the sort buttons to a particular row in the header, but that seems hacky.
@Krinkle - I'm not sure who originally wrote jquery.tablesorter (as the history got broken by a file move), but my guess would be you :) Any thoughts on this?
Mar 30 2020
@AlexisJazz - Can this be closed now?
I do however see weird visual glitches when hovering over countries on an activated timeline map (regardless of browser), which seems to be related. For example, if I go to https://en.wikipedia.org/w/index.php?title=User:Doc_James/OurWorld&oldid=947323679, click the play button on the map, and hover over Russia, it looks like this:
I get similar weird glitches when hovering over any large country:
I'm not seeing this bug myself. I tried editing and previewing the page using Firefox 74.0 for MacOS. Can anyone else reproduce (perhaps on Ubuntu)?
@Milimetric - If you are working on fixing the problems with Graphoid, please create a Phabricator task to track this work.
Mar 26 2020
Thanks @aborrero! And like I told Bryan this is not super high priority, so if it turns out to be difficult, we'll just work with the currently-installed beta version and hope for the best.
Mar 25 2020
Thanks for the clarification @matmarex!
Mar 24 2020
@Reedy - I've confirmed that you are correct in local testing.
@brennen - While you're in there, I would suggest moving the "Use Vector skin" section into the Quickstart, as you have to set up at least one skin to use MediaWiki.
I re-did the installation, this time ignoring the instructions in the "re-install" section (ironically), and everything seems to have worked fine.
Mar 23 2020
@brennen - I guess I was confused because the "Re-install" section was inside the "Quickstart" instructions. Are those instructions not supposed to be part of the Quickstart? Also, it seems like the Quickstart instructions end prematurely. After you've installed it, how do you actually access your MediaWiki instance from a browser?
@kostajh - Pinging you only because you wrote the instructions :) Do you have any suggestions?
Mar 20 2020
But note that Community Tech—the team that runs the Community Wishlist Survey—has only committed to investigate a new OCR tool. It may yet turn out that it is too complex, or for other reasons is not a feasible project within the scope of the Wishlist.
Mar 19 2020
Sounds good to me!
Mar 18 2020
That's terrific. I have a vague memory that we switched *to* TEXT_DETECTION intentionally at some point. But maybe I'm dreaming.
@Samwilson, @aezell - I just made a very important discovery. When you are sending an OCR request to the Google Vision API, if you set the request type to "DOCUMENT_TEXT_DETECTION" rather than "TEXT_DETECTION", it correctly detects columns, headers, etc. and gives you the text in the right order! In fact it gives you the exact same output as Indic OCR (which I believe uses the Google Drive API). And despite the documentation on Google's site, it seems to accept more file types than just PDF and TIFF; specifically it seems to be fine with JPEGs, although I've only tried JPEGs hosted directly on Google Cloud rather than passed to the API. This potentially means that our OCR problems are solved!
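To make the difference concrete, here is a minimal sketch of what the two request types look like when calling the Vision API's `images:annotate` REST endpoint. The image URL is a placeholder, and this only builds the request body rather than sending it:

```python
import json

def build_ocr_request(image_uri, feature_type="DOCUMENT_TEXT_DETECTION"):
    """Build an images:annotate request body for the Google Vision API.

    DOCUMENT_TEXT_DETECTION is optimized for dense document text and
    preserves column/paragraph structure; TEXT_DETECTION treats the image
    as sparse text and can scramble multi-column layouts.
    """
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [{"type": feature_type}],
            }
        ]
    }

# Placeholder URL for illustration only.
body = build_ocr_request("https://example.org/page-scan.jpg")
print(json.dumps(body, indent=2))
```

So the fix discussed above amounts to swapping a single string in the `features` array of each request.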
Mar 17 2020
Producing HTML brings dependency on the parser, and on the current user's preferences. It also potentially brings a need for additional parameters to control parser options, and can result in bugs like T247661.
Mar 16 2020
Yes, some older endpoints do provide UI HTML. We're trying to move away from that.
Mar 12 2020
As 4.0.0-2~bpo9+1 (which seems to be based on a beta version) is the only v4 backport for Stretch, upgrading to 4.0.0 would presumably be just as difficult as upgrading to 4.1, in which case upgrading to 4.1 is still the best option.
Hmm, when I do tesseract --version on Toolforge it says tesseract 4.0.0-beta.1, but maybe they forgot to bump the version string?
Mar 11 2020
That's a good point. I could imagine an interface where you click the "OCR" button and it pops up a little list of options like:
@bd808 - Would it be significantly easier to upgrade to v4.0.0 (not beta) rather than v4.1? If so, we could try that and see how well it works.
@Samwilson, @aezell - Thought you might find these results interesting ^.
I ran this page image (300 dpi) through all the OCR services. Here are the results:
It looks like the only way to get internet archive to detect diacritics is to specifically set the language to French (or a language that commonly uses diacritics). Setting it to English or no language results in all the diacritics being stripped.
Mar 10 2020
I believe TheDJ is referring to https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CodeMirror/+/421349/. Thanks!
Mar 5 2020
I'm really curious to test out tesseract though and see how it compares with the other engines.
Too bad we can't do some kind of multi-pass operation and combine the results of each system's output.
Mar 4 2020
I did some very preliminary testing for fun. Google Vision and Google Drive handled diacritics and leading caps with flying colors, but had trouble with the columns. Google Vision pretty much didn't handle the columns at all and even broke up some lines incorrectly. Google Drive did a lot better but still got confused a couple times (especially due to one column being longer than the other). Internet Archive handled the columns with flying colors, but ignored all the diacritics and got confused by leading caps.
Anyone have a link for easily testing Tesseract?
I created a new task for the engine testing at T246944. Not sure if it should be a sub-task of this or subsequent.
@aezell - Good question. The wishlist request mentions quality a couple times, but doesn't give a specific threshold for what level of quality is required. It seems some of the biggest issues raised in the wishlist request were dealing with diacritics and multi-column text. I think if we managed to tackle those issues and the uptime problem, the community would be more or less satisfied. Raw OCR accuracy would hopefully be improved incrementally by whoever was maintaining the service regardless.
As I noted in the duplicate bug, in the console, the following warning is thrown, which might be relevant:
VM96:90 mediawiki.jqueryMsg: file-exists-duplicate: Cannot read property 'hasAlreadyBeenUsedAsAReplacement' of undefined
Mar 3 2020
The main issue at this point is just that no one is working on Gadgets 2.0 and it's not part of any team's goals at the moment. If anyone wanted to pick it up as a volunteer project, that would be welcome. A lot of the heavy lifting has already been done.
Feb 28 2020
I created an inventory of which OCR services support which Wikisource languages:
https://docs.google.com/spreadsheets/d/1qJIs1GMY4vY3qS6TimvuoMk6f_dTOB9xFjmq8eJrISw/edit?usp=sharing
Feb 27 2020
Personally, I don't think it would be a good idea to backfill that log. The log table is huge and always in danger of becoming too large to be efficiently queryable. We reduced that pressure somewhat with T189594, but the table is still large and unwieldy. Adding several million new records for historic page creation events would exacerbate the problem and not deliver a whole lot of user value. Hopefully one day we will shard the log table by year (or some such strategy) and be able to revisit this, but I don't think it would be a good idea right now.
Feb 26 2020
For myself, I don't do deployments any more (now that I'm managing). I suspect the same is true for Adam.
Assuming this is doable on multilingual wikis, it would be really nice if the Language variant option in the preferences only showed up if you set your language to one that has variants. That might require a bit of extra hackery though.
@Soda - That's a great question. I believe the reason the text was hard-coded is that revision comments on Commons should ideally be in English rather than in the uploader's language. I would suggest keeping the messages hard-coded for now. We can always migrate them to the messages API later if needed.
Feb 25 2020
The description of this bug originally included a hodge-podge of several different issues in different interfaces (all related to revision comments for uploads). In order to make the task more actionable, I've narrowed it specifically to address the issues with UploadWizard. Any remaining issues (such as with the upload dialog in the editor) should be filed as separate bugs. Thanks.
@jayantanth, @Bodhisattwa, @Samwilson - It looks like Malayalam, Oriya/Odia, and Gujarati have now been fixed as well!
It looks like Google has fixed the issue on their end and now the Google OCR tool works fine for Kannada, Malayalam, Odia, Gujarati, and Telugu!
Feb 24 2020
I imagine that part of the problem is that the "content language" of Commons is English, not Chinese. However, "content language" is something of a misnomer on multilingual wikis, as the language of the content there is mostly determined by the user's interface language rather than the wiki's nominal content language. Perhaps LanguageConverter could be modified to accommodate this. @liangent - Any thoughts on the feasibility of this?
@srishakatux - I fixed up the task description to make it specific and give some pointers. Hope that helps!
Feb 20 2020
The only way I know of that you can detect reverts is to look for the string "Reverted to version" in the image comment. See, for example, https://commons.wikimedia.org/wiki/File:Glandulicereus_thurberi_sonora.jpg.
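That heuristic could be sketched roughly as follows. Note that the exact wording and timestamp format of Commons revert comments is an assumption here, so the regex would need verifying against real upload logs:

```python
import re

# Assumed format of a Commons file-revert comment, e.g.
# "Reverted to version as of 12:34, 5 June 2019 (UTC)".
REVERT_RE = re.compile(r"Reverted to version as of (?P<when>.+)")

def detect_file_revert(comment):
    """If an upload comment looks like a revert to an earlier file
    version, return the timestamp portion; otherwise return None."""
    match = REVERT_RE.search(comment or "")
    return match.group("when") if match else None

print(detect_file_revert("Reverted to version as of 12:34, 5 June 2019 (UTC)"))
print(detect_file_revert("Better crop"))
```

Being comment-based, this would of course miss reverts where the uploader wrote a custom comment.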
Feb 15 2020
Feb 13 2020
@dr0ptp4kt - Sorry for the late feedback. The only topic predictions that don't make sense to me are the "Regional society" and "Regional geography" ones. No one is ever going to choose an interest in "Regional society" or "Regional geography". These either need to be refined to topics like "European society" and "African geography", for example, or just generalized to "Society" and "Geography", although FWIW, "Society" doesn't seem like an especially useful topic either.
Feb 6 2020
Feb 5 2020
@daniel - Any personal opinion on which schema we should pursue?
I support matching the pattern that was used for usergroup expiry, which I think was mostly implemented by @TTO. For the schema changes, this would mean adding a varbinary column to the watchlist table to store an expiration timestamp and also a corresponding new index for that column. Maybe @TTO could elaborate on the rest of it (queries, garbage collection, etc.).
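For illustration, the proposed schema change might look something like this. The column and index names here are placeholders, not the final names, and MediaWiki's actual migration would go through its schema-change process rather than raw DDL:

```sql
-- Hypothetical expiry column on the watchlist table, following the
-- usergroup-expiry pattern (MediaWiki stores timestamps as varbinary(14)).
ALTER TABLE /*_*/watchlist
  ADD COLUMN wl_expiry varbinary(14) DEFAULT NULL;

-- Index to let a garbage-collection job efficiently find expired rows.
CREATE INDEX /*i*/wl_expiry ON /*_*/watchlist (wl_expiry);
```

A NULL expiry would mean "watch indefinitely", preserving the current behavior for all existing rows.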