Please see the extension at Lingua Libre and Forvo Audio Downloader.
Jun 2 2022
May 30 2022
May 17 2022
Ok, I semi-figured it out.
I've begun working on this: I've written the Anki part in Python and tested it by downloading audio from Forvo. The code is available on GitHub. I'm trying to reuse the code from the Lingua Libre bot, but I'm having trouble writing the SPARQL function. I would like to send a query to Lingua Libre containing a term and a language, and get back all the available pronunciations performed by native speakers, the user name of each speaker, and their place of learning [or whatever works as the best proxy for accent]. Can anyone help with writing this query function?
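For reference, here is the rough shape of what I have so far, using requests against what I believe is Lingua Libre's Blazegraph endpoint. The prefix URIs and all the property/item IDs (P2, Q2, P3, P4, P5, P7, P14) are guesses from memory and need to be checked against the actual Lingua Libre schema:

```python
import requests

# Endpoint and prefix URIs are my best guess from the Lingua Libre docs;
# verify against Help:SPARQL on lingualibre.org.
ENDPOINT = "https://lingualibre.org/bigdata/namespace/wdq/sparql"

QUERY_TEMPLATE = """
PREFIX prop: <https://lingualibre.org/prop/direct/>
PREFIX entity: <https://lingualibre.org/entity/>
SELECT ?record ?file ?speaker ?residence WHERE {{
  ?record prop:P2 entity:Q2 ;      # P2/Q2: instance of "record" (to verify)
          prop:P3 ?file ;          # P3: the audio file (to verify)
          prop:P4 entity:{lang} ;  # P4: language of the recording (to verify)
          prop:P5 ?speaker ;       # P5: speaker (to verify)
          prop:P7 "{term}" .       # P7: transcription of the term (to verify)
  OPTIONAL {{ ?speaker prop:P14 ?residence . }}  # some proxy for accent
}}
"""

def lingualibre_pronunciations(term, language_qid):
    """Return the raw SPARQL bindings for all recordings of `term` in a language."""
    query = QUERY_TEMPLATE.format(lang=language_qid, term=term)
    r = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "anki-forvo-lingualibre/0.1 (testing)"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["results"]["bindings"]

# e.g. lingualibre_pronunciations("bonjour", "Q21")  # Q21: hypothetical QID for French
```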
Feb 25 2022
@Samwilson Kobo, Kindle, and KOReader support SVG now. This seems to have mostly happened between 2017 and 2019.
Feb 24 2022
Jan 29 2022
I think the page numbers are present in the TEI data at https://repository.library.brown.edu/studio/item/bdr:471142/TEI/ as <pb n=""/> elements. It would require a script to parse the TEI and convert them into a list of page numbers.
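A rough sketch of such a script in Python, assuming the file uses the standard TEI namespace:

```python
import requests
import xml.etree.ElementTree as ET

# Standard TEI namespace; the Brown repository files should use it.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def tei_page_numbers(url):
    """Fetch a TEI document and return the n= values of its <pb/> page breaks."""
    xml = requests.get(url, timeout=30).content
    root = ET.fromstring(xml)
    return [pb.get("n", "") for pb in root.iter(TEI_NS + "pb")]

# e.g. tei_page_numbers(
#     "https://repository.library.brown.edu/studio/item/bdr:471142/TEI/")
```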
Jan 19 2022
Jan 13 2022
Good to know. I'll make use of this formula in the future. Sheet updated.
Thank you. I'll keep your notes in mind. Congratulations on having less time soon. :)
Jan 12 2022
I think I fixed all the issues. I chose 2126 as the safe date to move the files to Commons because I assume that by then everyone who wrote in 1926 will have been dead for at least 70 years (even an author writing in 1926 and living to 2056 would be out of copyright by 2056 + 70 = 2126).
Jan 9 2022
Jan 7 2022
I tried to look into the page number and couldn't figure it out. Sorry.
Updated to include 1926 volumes. Also added the 1926 volumes of Atlantic Monthly, Blackwood's, and The Strand.
Jan 3 2022
Fixed Author - Title column mix-up
Dec 24 2021
Here is the file for the MJP. Turns out it was pretty easy to scrape the data. Hurray for well-designed sites.
Dec 23 2021
Keep going. The spreadsheet and IA are not wrong; it's the publisher. I think they were trying out a British edition for a bit. Some of the months have two different TOCs, like the albums of the 1960s.
The SIM set has TOCs. There's going to be some untangling to do with the numbering weirdness. For example, SIM Volume 58 No. 3 is the same as HT Volume 59 No. 3 (July 1919), but September 1919 is different. The SIM set skips Volume 64 without any missing months.
The discrepancy seems to come from the fact that some of the SIM set was published in the UK.
Looks perfect!
fixed
Unexpectedly got some time today. Here is the list of all the issues of the Smart Set currently available. It seems that Volume 64, Issues 2-4 are either missing or were never issued. Volume 64 and Volume 65 both start with January 1921, but have different TOCs.
Dec 22 2021
Fixed.
Dec 21 2021
Thank you for fixing this. I'll make sure to keep this in mind next time. Happy Holidays! :)
Oct 28 2021
Perfect. Thank you!
Thank you. It's perfect.
Oct 26 2021
Oct 23 2021
Oct 5 2021
Fixed a few broken links, added No Commons
Oct 4 2021
Thank you! Sorry about that. It looks as if the company changed its name a few times. New file uploaded.
Sep 27 2021
Sep 20 2021
@Inductiveload I would leave the failed pages until the server-side upload can be done. My experience is that users are more likely to proofread when the scans are already in place. Periodicals are particularly challenging to process, so I don't mind waiting a little bit for a server-side upload.
Sep 19 2021
@Inductiveload I completely understand. That bug essentially breaks batch uploading. I've encountered it many times and it's extremely painful. For all intents and purposes, uploading is broken. I think it might be worth leaving a note on your batch upload page so that other users understand the situation.
Sep 18 2021
Sep 14 2021
Pointing out that "download" is misspelled in the URL for the batch upload: https://archive.org/dowload should be https://archive.org/download.
Sep 13 2021
Of course, take your time. This seems far more complicated than I envisioned. Thank you for doing this.
Here are the remaining volumes of The Strand from the IA. I already uploaded them as PDFs before I noticed that the files are missing an OCR layer. Would it be possible to create an OCRed DjVu with uncompressed images so that they can easily be cropped?
Sep 12 2021
@Aklapper I posted it on @Inductiveload's user board per https://en.wikisource.org/wiki/User:Inductiveload/Requests/Batch_uploads. Could you please restore it?
Forgot the vollist in the previous version.
Jun 12 2021
Is it possible to merge this patch?
Apr 28 2021
Apr 27 2021
Could we store these as tags that are updated whenever a Page is saved?
Apr 26 2021
That’s not the same file. 002 uploaded while 010 failed. It seems that the error occurs during the publishing stage.
I got a 503 error with web.archive.org/web/20150905070709if_/http://www.quartos.org/quarto_images/ham-1625-22278x-fol-c01/ham-1625-22278x-fol-c01-010.tif
Apr 25 2021
The IA tool did not warn me when creating duplicates yesterday, leading to duplicate indexes. I caught them by chance, but I want to flag this as an issue. If we want to permit the creation of duplicate files, albeit in different formats, then the warning needs to be in place and require confirmation to override.
I don’t think it’s the IA because I tried uploading these files via Pattypan and kept getting a myriad of errors. Failure seems almost certain and success a random occurrence. That’s why I suggested trying to upload the file many times.
As a shot in the dark, would it be possible to try with something other than wget?
I'm actually not surprised that the c01 files are failing. For some reason, SRE does not seem to like the TIFFs from the British Library. However, I've been able to get them to upload by repeatedly trying. Would it be possible to make a script that attempts to upload each file around 100 times and see if it goes? That's how I got some of them onto Commons.
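Something like this is what I have in mind: a dumb retry loop, where `upload_to_commons` stands in for whatever the real upload call ends up being (pywikibot, the API, etc.):

```python
import time

def upload_with_retries(path, upload_to_commons, max_attempts=100):
    """Retry a flaky upload until it succeeds or we give up.

    `upload_to_commons` is a placeholder callable (a pywikibot- or
    API-based uploader) that is expected to raise on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            upload_to_commons(path)
            return True
        except Exception as err:  # the SRE/TIFF failures look transient
            print(f"attempt {attempt}/{max_attempts} failed: {err}")
            time.sleep(min(60, 2 * attempt))  # back off a little between tries
    return False
```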
Apr 24 2021
I think that a few might have escaped the list. Can you also add these please:
Yes, please.
Yes, please.
Yes, please.
Apr 22 2021
I've been getting a similar error with Pattypan all day.
Apr 13 2021
Apr 10 2021
There’s also Layout Parser, which employs deep learning to analyze, parse, and OCR very complicated layouts.
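From what I remember of its docs, basic usage looks roughly like this (the model config path and threshold are from memory and worth double-checking):

```python
import layoutparser as lp
import numpy as np
from PIL import Image

# Pre-trained PubLayNet detection model; the "lp://" config path follows
# the layout-parser docs as I recall them.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
)
ocr_agent = lp.TesseractAgent(languages="eng")

image = np.array(Image.open("scan.png"))  # hypothetical page scan
layout = model.detect(image)              # find text/title/figure/table blocks
for block in layout:
    if block.type == "Text":
        segment = block.crop_image(image)  # cut out just that region
        print(ocr_agent.detect(segment))   # OCR the region on its own
```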
Apr 7 2021
Apr 2 2021
I think it’s important to point out that PDF does not always work, especially for larger files. A single page can range anywhere from 500 KB to over 35 MB. When you factor in the number of pages in a work, you can easily get a PDF that is over 1 GB in size. Currently, only the Chunked Uploader can potentially handle that. However, I tried uploading a 1.2 GB PDF over 10 times and failed even with async unchecked; even going to the stash failed. Now, I’m not calling for removing support for PDF, and I know that this is not entirely in scope for this project. However, if we’re discussing how to bulk-store OCR, we also need to make sure that users can upload the files to be OCRed. Even Fæ cannot upload some PDFs from the IA. So if we are to support bulk OCR, we’d either need to compress PDFs to death like the IA does, improve Commons upload to robustly support files of several GB even when dealing with failed or unreliable connections, or develop a container format as Xover commented on. A container is a larger project, but one that can have widespread benefits. For example, it would make it possible to upload the front and back of a coin as a single entry.
Mar 31 2021
@Xover and @Inductiveload Thank you both for your feedback and comments. I'm glad that there is a way to reduce the size of the PDFs generated from HathiTrust images. However, I feel like there are two separate issues being discussed.
There are two major and related questions to answer here. First, when should the OCR tool be run? Second, how should the result be stored? While it’s tempting to run the OCR on the Index page, OCRing an entire book takes a considerable amount of time during which the user cannot edit, wasting valuable user time and potentially resulting in the user leaving. Furthermore, it’s not actually necessary to wait that long, because OCR can be performed earlier. When OCR can be performed depends on how the individual scans that make up a book are stored. These are the major options that I can think of:
@Inductiveload While I agree that individual files make managing the files a real pain, it's probably the only way to do it. I proposed allowing Commons to accept a book scan as one zip in T277921 and AntiCompositeNumber wrote "Nope nope nope, land of 10,000 nopes."
I don't think that things will magically begin to work. However, there are multiple cases where an image-based Index would make sense: the Balinese Leaf Project, or any index that is suffering from display issues. See the following Phabricator tickets: T224355, T256848, T257025, T184867. Also see https://en.wikisource.org/wiki/User:Inductiveload/jump_to_file
It occurs to me that storing the OCR text and the Proofread text on Commons with the original image could actually become a very valuable dataset for investigating where OCR fails and help Wikisource at the same time. It would probably look something like this.
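Something along these lines, where the field names are purely illustrative:

```python
# One hypothetical entry pairing a page scan with its raw OCR output and
# the human-corrected Wikisource text; field names are my own invention.
record = {
    "image": "Example_Magazine_p012.jpg",        # original scan on Commons
    "ocr_text": "Tne quick hrown fox ...",       # raw engine output
    "proofread_text": "The quick brown fox ...", # validated by Wikisource users
    "ocr_engine": "tesseract-4.1",
    "language": "en",
}
```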
We'd also need to figure out how to do this on Index pages comprised of single images, such as https://en.wikisource.org/wiki/Index:Lippincotts_Monthly_Magazine_51 . This should probably be done somewhere on an Index page.
Ok, I see your point. That was the most minor of quibbles. Your original proposal is great. Hope to see this happen.
I think that for most works, it would be better to make a flexible template to split the pages and allow the user to manually adjust. Generally, there are two major types of column usage in books:
- Header - Two Columns - Footer
- Header - Two Columns - Image - Two Columns - Image - Two Columns - Footer
Mar 30 2021
That sounds like a great approach! Probably, to make life a bit easier for users, we should add "[[File: " and "]]" in the code.
There are multiple issues with PDF.
The issue seems to be that if you create an Index from individual images, many of the scripts and gadgets break, because image-based Indexes are handled differently from PDF/DjVu ones. A simple task such as numbering pages becomes far more difficult.
The basic issue appears to be that the page variable is never set for images, and this cascades downward.
Mar 29 2021
Update: while reading the documentation for Proofread Page, I discovered that you can transclude an image sequence using the format <pages index="Index Name" from="Start Image" to="End Image"/> even if the images are in a different format. This implies that at some point the code does numerically sequence the image files. However, in the Index ns, <pagelist /> does not work. This implies that the Index ns is not aware of, or does not make use of, the numerical sequence that exists.
Mar 27 2021
Mar 25 2021
Commons is not the right place, because this would require two major components, and only Wikisource has both:
Mar 24 2021
@Inductiveload Templates are great as well and may be easier to do.
No, it's downloading a subset of that file. A journal volume can be hundreds of pages, while an article can be less than one.
@Inductiveload @Soda In my opinion, it's both characters and templates. Take, for example, the EB1911 project, which uses custom templates such as {{EB1911 Fine Print|}}. It would make sense for these templates to be in the edit bar of all EB1911 pages, but not in a book on English poetry. Also, I would want the characters to be readily accessible, not buried in menus.
This is also a way to address https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2020/Wikisource/UI_improvements_on_Wikisource
@Inductiveload Yes, your thinking is close to mine. Basically, this would have three components.
This could rely on WS Export to perform the task. However, WS Export seems more tailored to exporting proofread text. Instead, I'm asking for a function that allows downloading a PDF consisting of the scanned images.
@Priyanshugupta1909 Of course, you have the assignment.
@Sandyabhi OCR is the automatic recognition of text from an image. In other words, it is the use of a program such as Tesseract to generate a text layer.
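For example, a minimal run with the pytesseract wrapper (assuming Tesseract itself is installed locally, and with a hypothetical file name):

```python
from PIL import Image
import pytesseract

# Run Tesseract on a page scan and print the recognised text layer.
text = pytesseract.image_to_string(Image.open("page_scan.png"), lang="eng")
print(text)
```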