I observed that Page.touch() solves the issue; nevertheless, I now see that "untouched" pages are being indexed too, so I think that the problem is solved.
Apr 3 2019
Mar 30 2019
I'm contributing daily to Tiraboschi, Storia della letteratura italiana, and I see that when searching its pages for the very common word "poeta", CirrusSearch only returns pages edited after 28.03.2019.
Mar 28 2019
Very simple..... all.
Mar 26 2019
it.wikisource too suffers from this bug: https://it.wikisource.org/wiki/Wikisource:Bar/Archivio/2019.03#Problemi_con_Ricerca.
Jan 15 2019
Match and split is a tool run by Phe-bot, used by many wikisource projects. But perhaps Phabricator is not the right place to discuss its code. I opened a talk here: Extending match do pdf files.
Jan 4 2019
I too see a very small (4 px height) header and footer on it.wikisource.
Dec 23 2018
Perhaps "as dragging a spirited cat into a thick brush" is a better image....?
Nevertheless: fr:Module:Table and its clones (itwikisource: Modulo:VoceIndice; mulwikisource:Module:VoceIndice) can be fixed simply by wrapping the main container in a plain div element. A plain div element can also be added in nsPage, wrapping the full list of unchanged Module calls. See https://wikisource.org/wiki/Page:Teatro_-_Salvatore_di_Giacomo.djvu/457 and try to delete the wrapping div.
Mar 12 2018
I took a good look at the djvu linked by Billinghurst; unfortunately I have to confirm that there's no text layer :-(
Mar 11 2018
Happy to see that this annoying issue has been analyzed and hopefully solved - even if I can't understand the code. I can't wait to test the fixed IA Upload version.
Feb 22 2018
I classified priority as "high" since the bug is really confusing and dangerous for nsPage text integrity.
Feb 20 2018
Feb 14 2018
Another case from it.wikisource: https://it.wikisource.org/wiki/Ricordi_di_Parigi/Uno_sguardo_all%E2%80%99Esposizione
Feb 11 2018
As perhaps I told you, I'm exploring a different approach:
- to convert _djvu.xml into "dsed" format, i.e. the lisp-like OCR structure produced by djvused output-txt;
- to manipulate the dsed file if needed;
- to use the resulting dsed file to load the OCR into the djvu using djvused again; this is both simpler and faster than loading it from the xml file (I presume that the dsed structure is much closer to the internal djvu text structure; pages can be referred to by their order number in the bundled djvu file, ignoring their individual names).
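The three steps above can be sketched in Python as follows. This is only a minimal sketch, not the actual script: the function names are hypothetical, and it assumes that djvused is installed and that the dsed file is simply the script printed by its output-txt command.

```python
import subprocess


def export_dsed_cmd(djvu_path):
    # "output-txt" prints a djvused script that rebuilds the text layer;
    # "-u" asks for UTF-8 output.
    return ["djvused", str(djvu_path), "-u", "-e", "output-txt"]


def import_dsed_cmd(djvu_path, dsed_path):
    # "-f" runs the commands found in the script file, "-s" saves the djvu.
    return ["djvused", str(djvu_path), "-f", str(dsed_path), "-s"]


def round_trip(djvu_path, dsed_path, edit=lambda text: text):
    """Dump the text layer, let `edit` change it, then load it back."""
    dumped = subprocess.run(
        export_dsed_cmd(djvu_path),
        capture_output=True, encoding="utf-8", check=True,
    ).stdout
    with open(dsed_path, "w", encoding="utf-8") as handle:
        handle.write(edit(dumped))
    subprocess.run(import_dsed_cmd(djvu_path, dsed_path), check=True)
```

Any manipulation of the OCR goes into the `edit` callback, so the djvu file itself is only touched in the last step.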
Feb 7 2018
Assigned to... none?
Jan 22 2018
Did you try to name the derived djvu pages with the name that _djvu.xml expects in its code, i.e. the name of the jp2 file, changing the extension only?
Jan 17 2018
Just to mention it here too, take a look at the it.source "book viewer", vaguely inspired by (but very different from) the IA Viewer. Simply follow this link:
Jan 10 2018
I see a possible relationship between the idea of implementing a "book
viewer" in Commons and the proposal of simplifying - as much as possible
- the uploading of books into wikisource. Both ideas underline a central role
of Commons in the work flow related to an old, but important, kind of
"media", the *book*.
Jan 3 2018
I moved from plain use of _djvu.xml to the more complex _djvu.xml -> dsed conversion since dsed manipulation is really much simpler - the only hard step being coordinate conversion. As soon as you get the dsed format of the OCR layer, you can use the djvused routine, which is much faster and more "elastic". I also found that some IA _djvu.xml files are somehow bugged from the origin, but that it is possible to fix these bugs. I think that IA uses _djvu.xml just to get the text coordinates needed by the word search and highlight routine in its viewer, so IA isn't much interested in whether the _djvu.xml file is usable to build a text layer in a djvu file.
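As an illustration of the coordinate conversion step, here is a minimal Python sketch. It rests on assumptions that should be checked against real files: that a WORD element in _djvu.xml carries coords as "left,bottom,right,top" measured from the top of the page (possibly followed by a baseline value), and that dsed wants "xmin ymin xmax ymax" measured from the bottom. The function name is hypothetical.

```python
def xml_word_to_dsed(coords, text, page_height):
    """Turn one XML WORD (assumed top-left origin) into a dsed word line."""
    # Assumed order: left, bottom, right, top; any extra value is ignored.
    left, bottom, right, top = (int(v) for v in coords.split(",")[:4])
    # Flip the vertical axis: dsed is assumed to count y from the page bottom.
    ymin = page_height - bottom
    ymax = page_height - top
    # Escape backslashes and double quotes inside the dsed string.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'(word {left} {ymin} {right} {ymax} "{escaped}")'


print(xml_word_to_dsed("100,250,180,200", "poeta", 1000))
```

The page height has to come from the djvu page itself (or from the enclosing OBJECT element of the xml), since the flip depends on it.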
Dec 28 2017
Very interesting, but consider too that the "hard work" is to browse the djvu file carefully and to find the relationship between djvu pages and book pages ("name" or number); it is also an opportunity to find missing/doubled/unordered scans. This is the hard work that needs - if possible - both standardization among source projects and simplification (it would be great to standardize the names of special pages).
Just a small list of "it can be done" ideas.
Dec 22 2017
I re-opened this ticket for good news.
Dec 20 2017
The best IMHO would be that the flag could be activated only by "ia-upload-sysops" (if they exist...) or by the uploader, as happens for archive.org items, after a successful OAuth access.
Dec 15 2017
Here are two recent examples of IA Upload failures, recovered by xml2dsed:
Dec 13 2017
Dec 5 2017
Yes, failing jobs could be purged after 7/15 days; an email to the uploader when IA Uploader fails would help. Perhaps immediate purging of successful uploads could be replaced with a brief persistence (1 hour) of the item, with a "successful upload" message and a link for djvu download (sometimes the djvu resulting from IA files needs some manipulation, even when the upload is perfect).
Dec 1 2017
Thanks Sam, I'll try again.
Nov 30 2017
Nov 8 2017
It isn't clear to me what you are trying to achieve, and how it would be different from downloading a PDF from the main namespace.
Nov 7 2017
In the meantime, I'll try a do-it-yourself approach, exploring and then using wkhtmltopdf from Python, just to get a first "it can be done" result.
Nov 6 2017
What I suggest is to export all the nsPage pages linked from the nsIndex page, preserving the original pagination and using the Index page as "an index" only, i.e. to build a PDF of the whole book.
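A rough "it can be done" sketch of this idea in Python follows. It only builds the list of nsPage URLs and the wkhtmltopdf command line (several input pages followed by one output file). The function names, the "Page:" prefix and the assumption that scan pages are numbered consecutively are all illustrative; a real tool would read the page list from the Index page.

```python
import subprocess
from urllib.parse import quote


def page_urls(base_url, index_file, first, last, namespace="Page"):
    """URLs of the nsPage pages first..last of one djvu/pdf file."""
    return [
        base_url + quote(f"{namespace}:{index_file}/{number}", safe="/:")
        for number in range(first, last + 1)
    ]


def wkhtmltopdf_cmd(urls, output_pdf):
    # wkhtmltopdf takes any number of input pages, then the output file.
    return ["wkhtmltopdf", *urls, output_pdf]


def export_book(base_url, index_file, first, last, output_pdf):
    urls = page_urls(base_url, index_file, first, last)
    subprocess.run(wkhtmltopdf_cmd(urls, output_pdf), check=True)
```

For example, `export_book("https://wikisource.org/wiki/", "Teatro_-_Salvatore_di_Giacomo.djvu", 1, 10, "teatro.pdf")` would ask wkhtmltopdf to render the first ten scan pages, one after another, into a single PDF.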
Nov 5 2017
Nov 4 2017
Nov 3 2017
The large majority of them have a _jp2.zip. I found only one item (an old IA upload) that fails because there's a _tiff.zip and the _jp2.zip is missing - I presume that in that case the problem could be solved by the IA uploader/IA sysop simply deleting the _tiff.zip file from the item and deriving the item again.
The error message also appears in cases where the _jp2.zip file exists, but its prefix is different from the IA ID.
Oct 31 2017
Is there any danger in blindly removing text from pages that return an error code of 10? i.e. just looping through the whole work, and running djvused -u file.djvu -e "select x; remove-txt; save" on the corrupt pages? You're doing it interactively - is there something that makes you abort the process sometimes?
[2017-10-31 15:58:18] LOG.INFO: Validating text layer of DjVu
[2017-10-31 15:58:25] LOG.INFO: Fixing page 294 (1-indexed)
[2017-10-31 15:58:25] LOG.INFO: Fixing page 297 (1-indexed)
[2017-10-31 15:58:25] LOG.INFO: Fixing page 301 (1-indexed)
[2017-10-31 15:58:25] LOG.INFO: Fixing page 302 (1-indexed)
[2017-10-31 15:58:26] LOG.INFO: Fixing page 308 (1-indexed)
[2017-10-31 15:58:26] LOG.INFO: Fixing page 315 (1-indexed)
[2017-10-31 15:58:28] LOG.INFO: Validation complete
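The non-interactive loop asked about above could be sketched in Python roughly as follows. This is a sketch under assumptions, not the code of IA Upload or of the interactive script: the function names are hypothetical, a page is treated as corrupt when djvused print-txt exits with any non-zero code (the thread mentions code 10), and the page count is passed in by the caller.

```python
import subprocess


def check_text_cmd(djvu_path, page):
    # Reading the text layer of one page is what fails on corrupt pages.
    return ["djvused", "-u", str(djvu_path), "-e", f"select {page}; print-txt"]


def remove_text_cmd(djvu_path, page):
    # Same command as quoted in the question, for one 1-indexed page.
    return ["djvused", "-u", str(djvu_path), "-e",
            f"select {page}; remove-txt; save"]


def fix_text_layer(djvu_path, page_count, run=subprocess.run):
    """Drop the text layer of every page whose text cannot be read."""
    fixed = []
    for page in range(1, page_count + 1):
        result = run(check_text_cmd(djvu_path, page), capture_output=True)
        if result.returncode != 0:
            run(remove_text_cmd(djvu_path, page), check=True)
            fixed.append(page)
    return fixed
```

The `run` parameter exists only so that the loop can be tried out with a stand-in function before letting it modify a real file; working on a copy of the djvu is advisable in any case, since the text of the removed pages is lost.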
Yes. I can't check the list one by one right now, but the total number of wrong pages and some of the page numbers are familiar to me.
Some more details about this bug; please download https://upload.wikimedia.org/wikipedia/commons/archive/1/1e/20170907174155%21Folengo_-_Maccheronee%2C_vol_2%2C_1911_-_BEIC_1820192.djvu as "folengo.djvu" to repeat tests.
Oct 21 2017
This is the interactive script I use to fix corrupted djvu files:
Oct 20 2017
I think that the issue isn't related to Google page removal - it occurs on random pages, one or more per djvu file, both "empty" and text-containing ones.
Oct 13 2017
Jul 22 2017
Just to be bold: there's an ongoing discussion on it.source about hovercards, where a "nested popup" based on a wikidata link has been imagined; the first popup level would list the wikidata links to projects about the linked wikidata entity, and the second level would show the interwiki hovercard of any linked page.
Jun 19 2017
Thanks Bodhisattwa for the mention.
Yes, the gadget is vaguely inspired by the IA viewer - with the deep difference that it shows the nsPage html coming from wikisource digitalization. It shows the djvu/pdf OCR for "red pages" (here an example). The gadget is under active development, using a bottom-up approach - i.e. adding new features to the basic ones. Presently it needs a "chronology of navigation" and a "search inside the whole book" tool.
The gadget has some dependencies on other it.wikisource scripts - I'll try to import them just to make its localization easier.
Dec 6 2016
I have only now seen your question again - I apologize for such a long delay.
Sep 26 2016
When dealing with more difficult texts (ancient, and with small fonts or faulty images), some of our best and most careful reviewers use the horizontal view by default. The previously working editing interface must be restored as soon as possible.
Sep 23 2016
Please consider (or suggest that developers consider) testing new mediawiki releases much more thoroughly on wikisource.... these "attempts" are very frustrating.
Sep 21 2016
Sep 20 2016
Happy to know that it.wikisource isn't the only project with the listed issues. We are notified about new mediawiki versions, but probably we should also be notified promptly about the issues of new versions.... just to avoid wasting time debugging our scripts when the issues come from known, general problems.
Sep 17 2016
Sep 12 2016
Obviously the vector skin must work too.... we can't migrate to the old monobook skin just to avoid a recently introduced bug.
Mar 2 2016
Just to let you know briefly the "state of the art" of my attempts:
- I have a rough, but running, "djvu editor" (based on a local server-client python application; editing is done in a simple html page, with js tools, somewhat similar to the wikisource nsPage edit environment);
- I'm trying some DIY tricks to align the djvu text layer with the text edited on wikisource, using the same "djvu editor" GUI and base scripts;
- I'm also testing something deeply different - i.e. uploading the wikisource code (raw or parsed into html) into a metadata text field of the djvu page.
Nov 30 2015
I agree about the need for strong Commons support for ePub files. Commons can be seen as a shared multimedia repository, and books too are "media". In my vision, wikisource projects should be considered "the typographies" and Commons "the library"; a central library could be managed with robust librarian techniques, joining the best skills of mediawiki people.
Oct 26 2015
Perhaps my comment about the importance of nsIndex and nsPage could be surprising, but IMHO they are the true content pages, while the transcluded ns0 texts are merely one of many possible derived texts. They are the true digitalization of the specific edition, while the transcluded ns0 text is something like a new, "original" edition of the work. NsPage is a NPOV kind of digitalization, while ns0 transclusion is not.
Oct 19 2015
Just to let you know that I'm presently working on a different - but related - problem: building a "wikisource-like" djvu text editor. The first results are very encouraging; here is a screenshot of my "djvu python ajax editor".
Sep 17 2015
When formatting complex tables, most troubles come from borders, text-align and vertical-align. I see from the col tag specifications that the col tag can't assign the last two properties to cells.... so the col tag would not avoid the need for cell-by-cell styling, and faux tables have the same limitation. I didn't know about this severe limitation when asking for col tag activation; by now, my interest has much decreased. My apologies for your wasted time.
Sep 7 2015
I found in en.wikisource.org MediaWiki:Coltest.css an attempt to simulate the COL tag by css. From the comment, the css trick is: "Format 1 cell and you've formatted the entire column". Nevertheless, I found that in many cases cell 1 has a different format from the other cells, so this css doesn't remove the need for the "old school COL tag". HTML5 keeps the COL tag - it simply deprecates the use of old-school attributes and encourages css styling.
Sep 2 2015
In the original task I opened (then merged into this one) I added only one project - wikisource - since wikisource is really, deeply different from any other wiki project when dealing with formatting issues.
The other projects build a formatting standard for their pages, and the best solution is to write and apply a good shared css; wikisource is different, since its goal is to digitalize both the text and the formatting style of original editions, so it's impossible to avoid heavy inline styling. So, I presume that wikisource has a particularly high interest in the colgroup and col tags, and IMHO the first, simple step could be to remove the filter for these two html tags - I imagine that this can be done easily; any change to the wiki table markup could be deferred. So, my request is to split the task into two parts: the first one, with high priority, to remove the filter; the second one, with a lower priority, to edit the wiki markup for tables.
Sep 1 2015
I apologize for a banal question - is there any drawback to using Lua as a table generator (i.e. generating plain html)? By simply removing the filter for the colgroup and col tags, any interested user could test this syntax - even if wiki markup can't output it.
Jun 16 2015
I'd like a wiki approach, i.e. something like a "stub page" in a central wiki project (mediawiki) where running code can be posted, even if far from "professional", coupled with a comment about its aim and its use. Just as experienced wiki users browse stubs, I'd like good programmers to browse such "stubs" and go ahead if they like them; what is to be avoided is to ignore them, or to post a suggestion like "please post your project on Github.... but first read this and this and this....[some tons of esoteric documentation about best programming style follow]". That is a good strategy only if the real aim is to discourage most users from producing code, as best they can, when their programming skill is poor.
Jun 14 2015
I'm almost sure that Ricordisamoa has been shocked by my js code running on it.wikisource, and that he has been inspired by it. Yes, some of my gadgets (written in a horrible js slang and far from being "at the highest possible coding standard") turned out, after some personal use, to be useful for basic users and gained the status of "default gadgets".
Jun 4 2015
I just run some banal "interactive" scripts but yes, the migration has not been so painful (a matter of minutes) and the few recent contributions of Alebot come from core.